
Last updated: March 18, 2024
Apache Spark provides a rich set of methods for its DataFrame object. In this article, we’ll go through several ways to fetch the first n rows from a Spark DataFrame.
Let’s create a sample DataFrame of individuals and their associated ages that we’ll use in the coming examples:
import spark.implicits._
val data = Seq(
("Ann", 25),
("Brian", 16),
("Jack", 35),
("Conrad", 27),
("Grace", 33),
("Richard", 40)
).toDF("Name", "Age")
In the code above, we defined data, which starts out as a sequence of tuples. The spark.implicits._ import provides the toDF() method, which converts our sequence to a Spark DataFrame.
In our case, the toDF() method takes two arguments of type String, which become the column names.
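The snippet above relies on a SparkSession named spark already being in scope, as it is in spark-shell. As a minimal sketch for running it elsewhere, assuming a local session (the application name and the local[*] master are our own illustrative choices), the setup could look like this:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for experimenting; spark-shell already provides `spark`
val spark: SparkSession = SparkSession.builder()
  .appName("first-n-rows") // hypothetical application name
  .master("local[*]")      // run locally on all available cores
  .getOrCreate()

// Must come after `spark` is defined; brings toDF() into scope
import spark.implicits._

val data: DataFrame = Seq(
  ("Ann", 25), ("Brian", 16), ("Jack", 35),
  ("Conrad", 27), ("Grace", 33), ("Richard", 40)
).toDF("Name", "Age")

// Prints the column names and the types inferred from the tuples
data.printSchema()
```

Note that the import refers to the spark value itself, so it has to appear after the session is created.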
The show(n) method provides an easy way to display rows of a DataFrame in a tabular format. It has a return type of Unit similar to the println function in Scala:
data.show(3)
/**
* | Name | Age |
* |:------|:----|
* | Ann | 25 |
* | Brian | 16 |
* | Jack | 35 |
* only showing top 3 rows
*/
The show(n) method takes an argument that specifies the number of rows to display. In the above example, we specified three rows. If no argument is supplied, show() displays up to the first 20 rows of data.
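Besides the row count, show() has overloads that control how cell values are rendered. The sketch below, assuming a local SparkSession with a hypothetical application name, uses the truncate and vertical parameters to illustrate them:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; spark-shell already provides `spark`
val spark = SparkSession.builder()
  .appName("show-example") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val data = Seq(
  ("Ann", 25), ("Brian", 16), ("Jack", 35),
  ("Conrad", 27), ("Grace", 33), ("Richard", 40)
).toDF("Name", "Age")

// First two rows, without truncating long cell values
data.show(2, truncate = false)

// First two rows, printed one column per line (vertical layout)
data.show(2, 20, vertical = true)

// show() only prints; its result is Unit, so there is nothing to reuse
val result: Unit = data.show(1)
```

Because the result is Unit, show() is suited to inspection only. To work with the rows themselves, we need one of the methods that follow.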
The head(n) method has similar functionality to show(n) except that it has a return type of Array[Row] as shown in the code below:
data.head(2).foreach(println)
/**
* [Ann,25]
* [Brian,16]
*/
This method also takes an argument to specify the number of rows to return. If no argument is provided, head() returns only the first row, as a single Row rather than an array. In the above example, we iterate through the results using the foreach() method, providing println as an argument to display the results in the console.
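Since head(n) hands back plain Row objects, we often want to pull typed values out of them. Here is a minimal sketch, assuming a local SparkSession with a hypothetical application name, that reads fields by column name and by position:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session for experimenting; spark-shell already provides `spark`
val spark = SparkSession.builder()
  .appName("head-example") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val data = Seq(
  ("Ann", 25), ("Brian", 16), ("Jack", 35),
  ("Conrad", 27), ("Grace", 33), ("Richard", 40)
).toDF("Name", "Age")

// head(n) returns an Array[Row] on the driver
val firstTwo: Array[Row] = data.head(2)

// Read a field by column name, with an explicit type
val names: Array[String] = firstTwo.map(_.getAs[String]("Name"))

// Read a field by position (0-based)
val ages: Array[Int] = firstTwo.map(_.getInt(1))

// head() with no argument returns a single Row, not an array
val firstRow: Row = data.head()

println(names.mkString(", "))
println(ages.mkString(", "))
```

Keep in mind that the array is materialized on the driver, so n should stay small.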
The take(n) method is an alias for head(n) and also has a return type of Array[Row]:
def take(n: Int): Array[T] = head(n)
The code below produces the same output as the head(n) example:
data.take(2).foreach(println)
/**
* [Ann,25]
* [Brian,16]
*/
Unlike take(n) which returns an Array[Row], takeAsList(n) returns a java.util.List[Row] as seen in the code below:
println(data.takeAsList(2))
/** [[Ann,25], [Brian,16]]
*/
In the code above, takeAsList(2) returns a List of the first two rows of our DataFrame. The takeAsList(n) method is the only one in this article that returns a Java collection.
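If we receive a java.util.List but want to keep working with Scala collections, we can convert it. The sketch below assumes a local SparkSession with a hypothetical application name. It uses scala.collection.JavaConverters, which is available on Scala 2.12 and 2.13, although on 2.13 it is deprecated in favor of scala.jdk.CollectionConverters:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import scala.collection.JavaConverters._ // on Scala 2.13, prefer scala.jdk.CollectionConverters._

// Assumption: a local session for experimenting; spark-shell already provides `spark`
val spark = SparkSession.builder()
  .appName("take-as-list-example") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val data = Seq(
  ("Ann", 25), ("Brian", 16), ("Jack", 35),
  ("Conrad", 27), ("Grace", 33), ("Richard", 40)
).toDF("Name", "Age")

// takeAsList(n) returns a Java list of rows
val javaRows: java.util.List[Row] = data.takeAsList(2)

// Convert the Java list to a Scala Seq to use the usual collection methods
val scalaRows: Seq[Row] = javaRows.asScala.toSeq
val names: Seq[String] = scalaRows.map(_.getString(0))

println(names.mkString(", "))
```

When the calling code is written in Scala, take(n) avoids the conversion altogether. takeAsList(n) is mainly convenient when the result is passed to Java APIs.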
The limit(n) method returns a new Spark Dataset with only the first n rows, as seen in the code below:
data.limit(2).foreach(println(_))
/** [Ann,25] [Brian,16]
*/
Take note of println(_). The underscore is needed because foreach() on a Dataset is overloaded, so the compiler needs an explicit function literal to pick the right variant. Alternatively, we could use the show() method to display the results, as shown in the code below:
data.limit(2).show()
/**
* | Name | Age |
* |:------|:----|
* | Ann | 25 |
* | Brian | 16 |
*/
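Because limit(n) returns a Dataset rather than a local collection, it can be combined with other transformations before any rows are fetched. As an illustrative sketch, assuming a local SparkSession with a hypothetical application name, we can sort by age first and then keep the two oldest individuals:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.desc

// Assumption: a local session for experimenting; spark-shell already provides `spark`
val spark = SparkSession.builder()
  .appName("limit-example") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val data = Seq(
  ("Ann", 25), ("Brian", 16), ("Jack", 35),
  ("Conrad", 27), ("Grace", 33), ("Richard", 40)
).toDF("Name", "Age")

// Still a Dataset: nothing is computed until an action runs
val oldestTwo: Dataset[Row] = data.orderBy(desc("Age")).limit(2)

// collect() is the action that brings the two rows to the driver
val names: Array[String] = oldestTwo.collect().map(_.getString(0))

println(names.mkString(", "))
```

The sorting step is our own addition for illustration. The point is that the result of limit(n) stays inside Spark until an action such as show() or collect() is called.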
As a bonus, we’d like to mention the first() method, which is an alias for head() with no arguments, as shown in the code below:
def first(): T = head()
The first() method simply returns the first row of the DataFrame:
println(data.first())
/** [Ann,25]
*/
In this article, we’ve covered six ways to get the first rows of a Dataset, namely show(n), head(n), take(n), takeAsList(n), limit(n), and first(). When choosing one of these methods, remember that they have different return types, so pick the one appropriate for your situation.
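To make the differences concrete, the sketch below annotates the result of each method with its return type, so the compiler checks them for us. It assumes a local SparkSession with a hypothetical application name:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Assumption: a local session for experimenting; spark-shell already provides `spark`
val spark = SparkSession.builder()
  .appName("return-types-example") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val data = Seq(
  ("Ann", 25), ("Brian", 16), ("Jack", 35),
  ("Conrad", 27), ("Grace", 33), ("Richard", 40)
).toDF("Name", "Age")

val shown: Unit = data.show(2)                        // prints only
val viaHead: Array[Row] = data.head(2)                // local Scala array
val viaTake: Array[Row] = data.take(2)                // same as head(2)
val viaList: java.util.List[Row] = data.takeAsList(2) // Java list
val viaLimit: Dataset[Row] = data.limit(2)            // still a Dataset
val viaFirst: Row = data.first()                      // a single row
```

If any of these annotations were wrong, the snippet would fail to compile, which makes it a quick reference when deciding which method fits a given call site.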