1. Introduction
In data engineering, understanding the data at each transformation step is crucial. And when using frameworks based on parallel processing, such as Apache Spark, we must be able to inspect the content of their data structures.
That way, we can ensure data quality and develop ETLs with more confidence.
In this tutorial, we’ll take a practical look at how to print the content of an RDD, the core Apache Spark data structure that we discussed in a previous article.
2. RDD foreach Implementation
Given that an RDD represents a collection of records, it offers methods similar to those used to iterate over ordinary collections, for example, map, flatMap, and foreach.
Spark methods are divided into two categories: transformations and actions. Transformations are lazy: Spark only records them in an execution plan to run later. Actions are the methods that trigger that plan. Since foreach is an action, we can use it to run any pending transformations and print each record in the RDD, which shows us exactly what the ETL produces.
Firstly, let’s create an RDD:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val numbers = List(4, 6, 1, 7, 12, 2)
val rdd = spark.sparkContext.parallelize(numbers)
To create a resilient distributed dataset (RDD), we first create the SparkSession, which is the entry point of Spark applications, and then call the parallelize method, which receives a Seq[T] and returns an RDD[T].
Now, let’s use foreach to print the numbers inside the RDD:
rdd.foreach(println)
/*
4
6
1
7
12
2
*/
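As simple as it looks, passing println to foreach prints every record of the RDD on its own line. Note that the function passed to foreach runs on the executors, so the output shows up in the driver’s console only when Spark runs locally, as it does here. On a cluster, it goes to each executor’s stdout, and the order across partitions isn’t guaranteed.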
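To see the laziness of transformations in practice, the following minimal sketch counts how often a map function runs before and after foreach is called. The accumulator name mapCalls and the doubling function are assumptions made for illustration and aren't part of the original example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local, single-JVM session, as in the example above
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val rdd = spark.sparkContext.parallelize(List(4, 6, 1, 7, 12, 2))

// Hypothetical counter, used only to observe when the map function actually runs
val calls = spark.sparkContext.longAccumulator("mapCalls")

// Transformation: only recorded in the execution plan, nothing runs yet
val doubled = rdd.map { n => calls.add(1); n * 2 }
val callsBeforeAction: Long = calls.value.longValue // still 0

// Action: triggers the plan, so the map function runs once per record
doubled.foreach(println)
val callsAfterAction: Long = calls.value.longValue // 6 in this local run
```

Accumulators updated inside transformations can be incremented more than once if a task is retried, so this counter is only a debugging aid, not a reliable metric.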
3. Convert RDD Into Default Data Structure
Another approach to printing the data of an RDD is to convert the Spark data structure to a standard Scala data structure, such as an Array.
To convert an RDD to an Array, there are two methods: collect and take.
3.1. Using collect
collect is an action that returns all elements of an RDD[T] to the driver as an Array[T].
Since Array is a standard Scala data structure and isn’t distributed, it’s crucial to be aware that all data in the RDD will be loaded into the driver’s memory. So, it’s recommended to use collect only with small RDDs.
Let’s take a look in spark-shell at how collect works:
scala> rdd.collect()
res0: Array[Int] = Array(4, 6, 1, 7, 12, 2)
We can see that collect transforms our RDD[Int] into an Array[Int].
Consequently, we can use the default approaches to print the content of an Array:
val collectConversion = rdd.collect()
collectConversion.foreach(println)
/*
4
6
1
7
12
2
*/
Just as with foreach called directly on the RDD, each element is printed on its own line.
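One practical difference between the two approaches appears once an RDD has several partitions. The sketch below assumes three partitions and a "local[3]" master, both illustrative values that aren't part of the original example, to show that collect returns the elements in partition order while foreach on the RDD makes no such promise.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: "local[3]" and numSlices = 3 are illustrative values
val spark: SparkSession = SparkSession.builder().master("local[3]").getOrCreate()
val numbers = List(4, 6, 1, 7, 12, 2)
val partitionedRdd = spark.sparkContext.parallelize(numbers, numSlices = 3)

// Partitions may be processed concurrently, so this print order can vary between runs
partitionedRdd.foreach(println)

// collect() gathers the partitions in order, so the Array matches the input order
val collected: Array[Int] = partitionedRdd.collect()
println(collected.mkString(", ")) // prints: 4, 6, 1, 7, 12, 2
```

When a stable order matters for debugging, printing the collected Array on the driver is therefore the more predictable choice, provided the RDD is small enough to fit in the driver’s memory.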
3.2. Using take
take, similar to collect, is an action that returns an Array[T]. However, it returns only the first elements of the RDD rather than all of them.
The take method receives an integer specifying the number of elements to return in the array.
Let’s take three values from our RDD and print them out:
val takeConversion = rdd.take(3)
takeConversion.foreach(println)
/*
4
6
1
*/
This approach doesn’t print all the values in the RDD, but it helps avoid out-of-memory errors on the driver when working with massive datasets.
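As a small illustration of how take behaves at its limits, the sketch below wraps it in a helper that prints at most a given number of records. The helper name printSample and its limit parameter are hypothetical and introduced only for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val rdd = spark.sparkContext.parallelize(List(4, 6, 1, 7, 12, 2))

// Hypothetical helper: prints at most `limit` records and returns what it printed
def printSample[T](rdd: RDD[T], limit: Int): Array[T] = {
  val sample = rdd.take(limit)
  sample.foreach(println)
  sample
}

// Only the first three records are brought to the driver
val firstThree = printSample(rdd, 3)

// Asking for more records than exist simply returns all six of them
val everything = printSample(rdd, 100)
```

Keeping the limit as an explicit parameter makes it harder to pull an entire large RDD into the driver by accident while debugging.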
4. Dataset API
Spark introduced the DataFrame API in version 1.3 and the Dataset API in version 1.6. Both offer the benefits of RDDs as well as Spark SQL’s optimizations. Let’s take a look at how this API gives us a simpler way to print the content in a formatted way.
Let’s create a DataFrame and print the content:
import spark.implicits._

val df = spark.sparkContext.parallelize(
  Seq(
    ("Math", "Intermediate", 800),
    ("English", "Basic", 500),
    ("Science", "Advanced", 400)
  )
).toDF("Subject", "Level", "Duration")

df.show()
/*
+-------+------------+--------+
|Subject|       Level|Duration|
+-------+------------+--------+
|   Math|Intermediate|     800|
|English|       Basic|     500|
|Science|    Advanced|     400|
+-------+------------+--------+
*/
Here, we can see that show prints the content in tabular form. By default, it displays only the first 20 rows and truncates values longer than 20 characters.
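The show method also has overloads that control how much is printed. The following sketch, which assumes a local session and uses a hypothetical column name, number, for the converted RDD, limits the number of rows, disables truncation, and prints a simple RDD after converting it with toDF.

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Math", "Intermediate", 800),
  ("English", "Basic", 500),
  ("Science", "Advanced", 400)
).toDF("Subject", "Level", "Duration")

// Print only the first two rows
df.show(2)

// Print up to 20 rows; the second argument (false) disables truncation of long values
df.show(20, false)

// An RDD of simple values can be printed the same way after toDF
// ("number" is a hypothetical column name chosen for this example)
val numbersDf = spark.sparkContext.parallelize(List(4, 6, 1, 7, 12, 2)).toDF("number")
numbersDf.show()
```

Like take, show brings only the requested rows to the driver, which makes it a reasonable default for inspecting large datasets.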
5. Conclusion
In this tutorial, we’ve taken a practical look at printing the content of an RDD in Apache Spark: calling foreach directly on the RDD, converting it to an Array with collect or take, and using show from the DataFrame API.
As always, the code used in this article is available over on GitHub.