1. Introduction

Apache Spark provides a rich set of methods for its DataFrame object. In this article, we’ll go through several ways to fetch the first n rows of a Spark DataFrame.

2. Setting Up

Let’s create a sample DataFrame of individuals and their associated ages that we’ll use in the coming examples:

import spark.implicits._
val data = Seq(
    ("Ann", 25),
    ("Brian", 16),
    ("Jack", 35),
    ("Conrad", 27),
    ("Grace", 33),
    ("Richard", 40)
  ).toDF("Name", "Age")

In the code above, we defined a sequence of tuples and converted it to a Spark DataFrame named data. The spark.implicits._ import provides the toDF() method that performs this conversion.

In our case, the toDF() method takes two arguments of type String which translate to the column names.
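The snippet above relies on a SparkSession value named spark, which spark-shell provides automatically. Outside the shell, we need to create it ourselves. The following is a minimal sketch, assuming a local in-process session is sufficient; the object name SampleData and the application name are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleData {
  // Assumption: a local, in-process session is enough for trying out the examples
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")      // run Spark inside this JVM on all available cores
    .appName("first-n-rows") // hypothetical application name
    .getOrCreate()

  import spark.implicits._ // brings toDF() into scope

  val data: DataFrame = Seq(
    ("Ann", 25),
    ("Brian", 16),
    ("Jack", 35),
    ("Conrad", 27),
    ("Grace", 33),
    ("Richard", 40)
  ).toDF("Name", "Age")

  def main(args: Array[String]): Unit = {
    // Print the inferred column names and types
    data.printSchema()
  }
}
```

Note that spark must be a val for the spark.implicits._ import to compile, since Scala only allows imports from stable identifiers.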

3. The show(n) Method

The show(n) method provides an easy way to display rows of a DataFrame in a tabular format. It has a return type of Unit, like the println function in Scala:

data.show(3)

/** | Name  | Age |
  * |:------|:----|
  * | Ann   | 25  |
  * | Brian | 16  |
  * | Jack  | 35  |
  * only showing top 3 rows
  */

The show(n) method takes an argument that specifies the number of rows to display; in the above example, we asked for three. If no argument is supplied, show() displays the first 20 rows.

4. The head(n) Method

The head(n) method also fetches the first n rows, but instead of printing them, it returns them as an Array[Row], as shown in the code below:

data.head(2).foreach(println)

/** [Ann,25]
  * [Brian,16]
  */

The argument specifies the number of rows to return. If no argument is provided, head() returns only the first row, as a single Row rather than an array. In the above example, we iterate through the results using the foreach() method, passing println to display each row in the console.
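Since head(n) hands back ordinary Row objects, we can also read individual fields instead of printing whole rows. Here's a small sketch of that idea, assuming a local SparkSession; the object name HeadExample and the value names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object HeadExample {
  // Assumption: a local session; in spark-shell, `spark` already exists
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("head-example") // hypothetical application name
    .getOrCreate()

  import spark.implicits._

  val data: DataFrame = Seq(
    ("Ann", 25),
    ("Brian", 16),
    ("Jack", 35),
    ("Conrad", 27),
    ("Grace", 33),
    ("Richard", 40)
  ).toDF("Name", "Age")

  // head(n) returns the first n rows as a local array
  val firstTwo: Array[Row] = data.head(2)

  // Fields are read by position: column 0 is Name, column 1 is Age
  val names: Array[String] = firstTwo.map(_.getString(0))
  val ages: Array[Int] = firstTwo.map(_.getInt(1))

  // Without an argument, head() returns a single Row
  val firstRow: Row = data.head()

  def main(args: Array[String]): Unit = {
    println(names.mkString(", ")) // Ann, Brian
    println(ages.sum)             // 41
    println(firstRow)             // [Ann,25]
  }
}
```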

5. The take(n) Method

The take(n) method is an alias for head(n) and also has a return type of Array[Row], as its definition shows (for a DataFrame, T is Row):

def take(n: Int): Array[T] = head(n)

The code below produces the same output as the head(n) example:

data.take(2).foreach(println)

/** [Ann,25]
  * [Brian,16]
  */

6. The takeAsList(n) Method

Unlike take(n), which returns an Array[Row], takeAsList(n) returns a java.util.List[Row], as seen in the code below:

println(data.takeAsList(2))

/** [[Ann,25], [Brian,16]]
  */

In the code above, takeAsList(2) returns a java.util.List containing the first two rows of our DataFrame. It’s the only method covered in this article that returns a Java collection.
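Because the result is a java.util.List, we work with it through the Java collection API, such as size() and get(), rather than Scala collection methods. The sketch below illustrates this, assuming a local SparkSession; the object name TakeAsListExample and the value names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object TakeAsListExample {
  // Assumption: a local session; in spark-shell, `spark` already exists
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("take-as-list-example") // hypothetical application name
    .getOrCreate()

  import spark.implicits._

  val data: DataFrame = Seq(
    ("Ann", 25),
    ("Brian", 16),
    ("Jack", 35),
    ("Conrad", 27),
    ("Grace", 33),
    ("Richard", 40)
  ).toDF("Name", "Age")

  // The first two rows as a Java list
  val rows: java.util.List[Row] = data.takeAsList(2)

  // Java-style access: size() and get(index)
  val rowCount: Int = rows.size()
  val secondName: String = rows.get(1).getString(0)

  def main(args: Array[String]): Unit = {
    println(rowCount)   // 2
    println(secondName) // Brian
  }
}
```

This return type is mostly convenient when the rows are passed on to Java code that expects a java.util.List.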

7. The limit(n) Method

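The limit(n) method returns a new Spark Dataset containing only the first n rows. Unlike the previous methods, it’s a transformation, so nothing is computed until we call an action such as foreach() on the result, as seen in the code below: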

data.limit(2).foreach(println(_))

/** [Ann,25]
  * [Brian,16]
  */

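Take note of println(_). Here, foreach() is the Dataset method rather than the Array one, and it’s overloaded to accept either a Scala function or a Java function object. The underscore lets the compiler pick the Scala variant. Alternatively, we could use the show() method to display the results, as shown in the code below: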

data.limit(2).show()

/** | Name  | Age |
  * |:------|:----|
  * | Ann   | 25  |
  * | Brian | 16  |
  */
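Because limit(n) returns a Dataset rather than local rows, it can be chained with other transformations before any rows are fetched. As an illustration, the sketch below sorts by age first and then keeps the top two rows. It assumes a local SparkSession; the object name LimitExample and the value names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.col

object LimitExample {
  // Assumption: a local session; in spark-shell, `spark` already exists
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("limit-example") // hypothetical application name
    .getOrCreate()

  import spark.implicits._

  val data: DataFrame = Seq(
    ("Ann", 25),
    ("Brian", 16),
    ("Jack", 35),
    ("Conrad", 27),
    ("Grace", 33),
    ("Richard", 40)
  ).toDF("Name", "Age")

  // Still a lazy Dataset: sort by age (descending), then keep two rows
  val oldestTwo: Dataset[Row] = data.orderBy(col("Age").desc).limit(2)

  // collect() is the action that actually runs the query
  val collected: Array[Row] = oldestTwo.collect() // [Richard,40], [Jack,35]

  def main(args: Array[String]): Unit = {
    collected.foreach(println)
  }
}
```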

8. The first() Method

As a bonus, let’s look at the first() method, which is an alias for head() called without arguments, as its definition shows:

def first(): T = head()

The first() method simply returns the first row of the DataFrame:

println(data.first())

/** [Ann,25]
  */

9. Conclusion

In this article, we’ve seen six ways to fetch the first n rows of a DataFrame: show(n), head(n), take(n), takeAsList(n), limit(n), and first(). These methods have different return types, so we should pick the one that suits our situation.
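As a quick recap, the return types can be spelled out with explicit type annotations, which the compiler then checks for us. This is only a sketch, assuming a local SparkSession; the object name ReturnTypes and the value names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object ReturnTypes {
  // Assumption: a local session; in spark-shell, `spark` already exists
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("return-types") // hypothetical application name
    .getOrCreate()

  import spark.implicits._

  val data: DataFrame = Seq(
    ("Ann", 25),
    ("Brian", 16),
    ("Jack", 35),
    ("Conrad", 27),
    ("Grace", 33),
    ("Richard", 40)
  ).toDF("Name", "Age")

  val viaHead: Array[Row] = data.head(2)                // local Scala array
  val viaTake: Array[Row] = data.take(2)                // same as head(2)
  val viaList: java.util.List[Row] = data.takeAsList(2) // Java list
  val viaLimit: Dataset[Row] = data.limit(2)            // lazy Dataset
  val viaFirst: Row = data.first()                      // a single Row

  def main(args: Array[String]): Unit = {
    // show(n) only prints; its result is Unit
    val shown: Unit = data.show(2)
  }
}
```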

As always, all the code in this article can be found over on GitHub.
