
1. Overview

Apache Spark is an open-source, distributed analytics and processing system that enables data engineering and data science at scale. It simplifies the development of analytics-oriented applications by offering a unified API for data transfer, large-scale transformations, and distribution.

The DataFrame is an important and essential component of Spark API. In this tutorial, we’ll look into some of the Spark DataFrame APIs using a simple customer data example.

2. DataFrame in Spark

Logically, a DataFrame is an immutable set of records organized into named columns. It shares similarities with a table in RDBMS or a ResultSet in Java.

As an API, the DataFrame provides unified access to multiple Spark libraries including Spark SQL, Spark Streaming, MLlib, and GraphX.

In Java, we use Dataset<Row> to represent a DataFrame.

Essentially, a Row uses the efficient Tungsten storage format, which heavily optimizes Spark operations in comparison with its predecessors.

3. Maven Dependencies

Let’s start by adding the spark-core and spark-sql dependencies to our pom.xml:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.8</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.8</version>
</dependency>

4. DataFrame and Schema

Essentially, a DataFrame is an RDD with a schema. The schema can either be inferred or defined as a StructType.

StructType is a built-in data type in Spark SQL that we use to represent a collection of StructField objects.

Let’s define a sample Customer schema StructType:

public static StructType minimumCustomerDataSchema() {
    return DataTypes.createStructType(new StructField[] {
      DataTypes.createStructField("id", DataTypes.StringType, true),
      DataTypes.createStructField("name", DataTypes.StringType, true),
      DataTypes.createStructField("gender", DataTypes.StringType, true),
      DataTypes.createStructField("transaction_amount", DataTypes.IntegerType, true) }
    );
}

Here, each StructField has a name that represents the DataFrame column name, a type, and a boolean value that indicates whether it’s nullable.
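Alternatively, we can express the same schema in a fluent style using StructType’s add method (a brief, equivalent sketch):

StructType schema = new StructType()
  .add("id", DataTypes.StringType, true)
  .add("name", DataTypes.StringType, true)
  .add("gender", DataTypes.StringType, true)
  .add("transaction_amount", DataTypes.IntegerType, true);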

5. Constructing DataFrames

The first operation in every Spark application is to get a SparkSession by connecting to a master.

It provides us with an entry point to access the DataFrames. Let’s start by creating the SparkSession:

public static SparkSession getSparkSession() {
    return SparkSession.builder()
      .appName("Customer Aggregation pipeline")
      .master("local")
      .getOrCreate();
}

Note here that we’re connecting to Spark using the local master. If we were to connect to the cluster, we would give the cluster address instead.
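For instance, a builder pointed at a standalone cluster might look like this (the host and port here are placeholders, not values from our example):

SparkSession clusterSession = SparkSession.builder()
  .appName("Customer Aggregation pipeline")
  .master("spark://cluster-host:7077")
  .getOrCreate();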

Once we have a SparkSession, we can create a DataFrame using various methods. Let’s briefly look at some of them.

5.1. DataFrame from List<POJO>

Let’s build a List<Customer> first: 

List<Customer> customers = Arrays.asList(
  aCustomerWith("01", "jo", "Female", 2000), 
  aCustomerWith("02", "jack", "Male", 1200)
);
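Here, aCustomerWith is a small factory over a Customer JavaBean. The POJO itself isn’t shown in this article, but based on the getters and setters used later, a minimal version could look like this (a sketch, with the standard accessors omitted for brevity):

public class Customer {
    private String id;
    private String name;
    private String gender;
    private int transaction_amount;

    // standard getters and setters omitted for brevity

    static Customer aCustomerWith(String id, String name, String gender, int amount) {
        Customer customer = new Customer();
        customer.id = id;
        customer.name = name;
        customer.gender = gender;
        customer.transaction_amount = amount;
        return customer;
    }
}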

Next, let’s construct the DataFrame from the List<Customer> using createDataFrame:

Dataset<Row> df = SPARK_SESSION
  .createDataFrame(customers, Customer.class);

5.2. DataFrame from Dataset

If we have a Dataset, we can easily convert it to a DataFrame by calling toDF on the Dataset.

Let’s create a Dataset<Customer> first using createDataset, which takes an Encoder from org.apache.spark.sql.Encoders:

Dataset<Customer> customerPOJODataSet = SPARK_SESSION
  .createDataset(CUSTOMERS, Encoders.bean(Customer.class));

Next, let’s convert it to DataFrame:

Dataset<Row> df = customerPOJODataSet.toDF();

5.3. Row from a POJO Using RowFactory

Since DataFrame is essentially a Dataset<Row>, let’s see how we can create a Row from a Customer POJO.

Basically, by implementing MapFunction<Customer, Row> and overriding the call method, we can map each Customer to a Row using RowFactory.create:

public class CustomerToRowMapper implements MapFunction<Customer, Row> {
    
    @Override
    public Row call(Customer customer) {
        Row row = RowFactory.create(
          customer.getId(),
          customer.getName().toUpperCase(),
          StringUtils.substring(customer.getGender(),0, 1),
          customer.getTransaction_amount()
        );
        return row;
    }
}

We should note that we can manipulate the Customer data here before converting it into a Row.

5.4. DataFrame from List<Row>

We can also create a DataFrame from a list of Row objects:

List<Row> rows = customers.stream()
  .map(c -> new CustomerToRowMapper().call(c))
  .collect(Collectors.toList());

Now, let’s give this List<Row> to SparkSession along with the StructType schema:

Dataset<Row> df = SparkDriver.getSparkSession()
  .createDataFrame(rows, SchemaFactory.minimumCustomerDataSchema());

Note here that the List<Row> will be converted to DataFrame based on the schema definition. Any field not present in the schema will not be part of the DataFrame.

5.5. DataFrame from Structured Files and Database

DataFrames can store columnar information, like a CSV file, and nested fields and arrays, like a JSON file.

The DataFrame API remains the same regardless of whether we’re working with CSV files, JSON files, or other formats, as well as databases.

Let’s create the DataFrame from multiline JSON data:

Dataset<Row> df = SparkDriver.getSparkSession()
  .read()
  .format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
  .option("multiline", true)
  .load("data/minCustomerData.json");
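Since JSON is a built-in data source, we can equivalently use the shorter json method on the reader:

Dataset<Row> df = SparkDriver.getSparkSession()
  .read()
  .option("multiline", true)
  .json("data/minCustomerData.json");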

Similarly, in the case of reading from the database, we’ll have:

Dataset<Row> df = SparkDriver.getSparkSession()
  .read()
  .option("url", "jdbc:postgresql://localhost:5432/customerdb")
  .option("dbtable", "customer")
  .option("user", "user")
  .option("password", "password")
  .option("serverTimezone", "EST")
  .format("jdbc")
  .load();
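Alternatively, the reader also exposes a jdbc convenience method that takes the URL, the table name, and a Properties object, mirroring the write side we’ll use later (a sketch with the same connection details):

Properties connectionProps = new Properties();
connectionProps.setProperty("user", "user");
connectionProps.setProperty("password", "password");

Dataset<Row> df = SparkDriver.getSparkSession()
  .read()
  .jdbc("jdbc:postgresql://localhost:5432/customerdb", "customer", connectionProps);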

6. Converting DataFrame to Dataset

Now, let’s see how we can convert our DataFrame into a Dataset. This conversion is useful if we want to manipulate our existing POJOs and use the extended, strongly typed API that the Dataset provides.

We’ll continue with the DataFrame created from JSON in the previous section.

Let’s call a mapper function that takes each row of the Dataset<Row> and converts it into a Customer object:

Dataset<Customer> ds = df.map(
  new CustomerMapper(),
  Encoders.bean(Customer.class)
);

Here, the CustomerMapper implements MapFunction<Row, Customer>:

public class CustomerMapper implements MapFunction<Row, Customer> {

    @Override
    public Customer call(Row row) {
        Customer customer = new Customer();
        customer.setId(row.getAs("id"));
        customer.setName(row.getAs("name"));
        customer.setGender(row.getAs("gender"));
        customer.setTransaction_amount(Math.toIntExact(row.getAs("transaction_amount")));
        return customer;
    }
}

We should note that the MapFunction<Row, Customer> is instantiated only once, regardless of the number of records we have to process.

7. DataFrame Operations and Transformations

Now, let’s build a simple pipeline using a customer data example. We want to ingest customer data as DataFrames from two disparate file sources, normalize them, and then perform some transformations on the data.

Finally, we’ll write the transformed data to a database.

The purpose of these transformations is to find out the yearly spending by source and gender.

7.1. Ingesting Data

Firstly, let’s ingest data from a couple of sources using SparkSession’s read method, starting with the JSON data:

Dataset<Row> jsonDataToDF = SPARK_SESSION.read()
  .format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
  .option("multiline", true)
  .load("data/customerData.json");

Now, let’s do the same with our CSV source:

Dataset<Row> csvDataToDF = SPARK_SESSION.read()
  .format("csv")
  .option("header", "true")
  .schema(SchemaFactory.customerSchema())
  .option("dateFormat", "M/d/yyyy")
  .load("data/customerData.csv");

csvDataToDF.show();
csvDataToDF.printSchema();

Importantly, to read this CSV data, we’re providing a StructType schema that determines the column data types.

Once we’ve ingested the data, we can inspect the contents of the DataFrame using the show method.

Additionally, we can limit the number of rows displayed by passing a size to the show method. And, we can use printSchema to inspect the schemas of the newly created DataFrames.
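For instance, to display only the first two rows of the CSV-based DataFrame:

csvDataToDF.show(2);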

We’ll notice that the two schemas have some differences. Therefore, we need to normalize the schema before we can do any transformations.

7.2. Normalizing DataFrames

Next, we’ll normalize the raw DataFrames representing the CSV and JSON data.

Here, let’s see some of the transformations performed:

private Dataset<Row> normalizeCustomerDataFromEbay(Dataset<Row> rawDataset) {
    Dataset<Row> transformedDF = rawDataset
      .withColumn("id", concat(rawDataset.col("zoneId"), lit("-"), rawDataset.col("customerId")))
      .drop(column("customerId"))
      .withColumn("source", lit("ebay"))
      .withColumn("city", rawDataset.col("contact.customer_city"))
      .drop(column("contact"))
      .drop(column("zoneId"))
      .withColumn("year", functions.year(col("transaction_date")))
      .drop("transaction_date")
      .withColumn("firstName", functions.split(column("name"), " ")
        .getItem(0))
      .withColumn("lastName", functions.split(column("name"), " ")
        .getItem(1))
      .drop(column("name"));

    return transformedDF; 
}

Some important operations on DataFrame in the above example are:

  • concat joins data from multiple columns and literals to make a new id column
  • the lit static function returns a column with a literal value
  • functions.year extracts the year from transaction_date
  • functions.split splits the name into firstName and lastName columns
  • the drop method removes a column from the DataFrame
  • the col method returns a column of the Dataset based on its name
  • withColumnRenamed renames a column and returns a new DataFrame (we’ll use it in the aggregation step later)

Importantly, we can see that the DataFrame is immutable. Hence, whenever anything needs to change, we must create a new DataFrame.
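For example, adding a column doesn’t modify the original DataFrame; it returns a new one (a small illustration, not part of the pipeline code):

Dataset<Row> withSource = rawDataset.withColumn("source", lit("ebay"));
// rawDataset is unchanged; withSource is a brand-new DataFrame with the extra column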

Eventually, both DataFrames are normalized to the same schema as below:

root
 |-- gender: string (nullable = true)
 |-- transaction_amount: long (nullable = true)
 |-- id: string (nullable = true)
 |-- source: string (nullable = false)
 |-- city: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)

7.3. Combining DataFrames

Let’s combine the normalized DataFrames next:

Dataset<Row> combineDataframes(Dataset<Row> df1, Dataset<Row> df2) {
    return df1.unionByName(df2); 
}

Importantly, we should note that:

  • If we care about column names when we union two DataFrames, we should use unionByName
  • If we don’t care about column names when we union two DataFrames, we should use union (both variants are sketched below)
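As a quick sketch of the difference:

// matches columns by name, so a different column order in df2 is fine
Dataset<Row> byName = df1.unionByName(df2);

// matches columns strictly by position
Dataset<Row> byPosition = df1.union(df2);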

7.4. Aggregating DataFrames

Next, let’s group the combined DataFrame to find out the yearly spending by year, source, and gender.

We’ll then sort the aggregated data by the year column in ascending order and the yearly spent column in descending order:

Dataset<Row> aggDF = dataset
  .groupBy(column("year"), column("source"), column("gender"))
  .sum("transaction_amount")
  .withColumnRenamed("sum(transaction_amount)", "yearly spent")
  .orderBy(col("year").asc(), col("yearly spent").desc());

Some important operations on DataFrame in the above example are:

  • groupBy arranges identical data into groups on the DataFrame so that we can then apply aggregate functions, similar to the SQL “GROUP BY” clause
  • sum applies the aggregation function on the column transaction_amount after grouping (an equivalent spelling using agg and alias is sketched right after this list)
  • orderBy sorts the DataFrame by one or more columns
  • the asc and desc functions from the Column class can be used to specify the sorting order
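As a side note, the same aggregation can be written with agg and alias, which avoids depending on the generated sum(transaction_amount) column name (a sketch, assuming a static import of org.apache.spark.sql.functions):

Dataset<Row> aggDF = dataset
  .groupBy(column("year"), column("source"), column("gender"))
  .agg(sum("transaction_amount").alias("yearly spent"))
  .orderBy(col("year").asc(), col("yearly spent").desc());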

Finally, let’s use the show method to see what the DataFrame looks like after the transformations:

+----+------+------+------------+
|year|source|gender|yearly spent|
+----+------+------+------------+
|2018|amazon|  Male|       10600|
|2018|amazon|Female|        6200|
|2018|  ebay|  Male|        5500|
|2021|  ebay|Female|       16000|
|2021|  ebay|  Male|       13500|
|2021|amazon|  Male|        4000|
|2021|amazon|Female|        2000|
+----+------+------+------------+

Consequently, the schema after the final transformation should be:

root
 |-- source: string (nullable = false)
 |-- gender: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- yearly spent: long (nullable = true)

7.5. Writing from DataFrame to a Relational Database

Finally, let’s finish off by writing the transformed DataFrame as a table in a relational database:

Properties dbProps = new Properties();

dbProps.setProperty("connectionURL", "jdbc:postgresql://localhost:5432/customerdb");
dbProps.setProperty("driver", "org.postgresql.Driver");
dbProps.setProperty("user", "postgres");
dbProps.setProperty("password", "postgres");

Next, we can use the Spark session to write to the database:

String connectionURL = dbProps.getProperty("connectionURL");

dataset.write()
  .mode(SaveMode.Overwrite)
  .jdbc(connectionURL, "customer", dbProps);

8. Testing

Now, we can test the pipeline end-to-end using the two ingestion sources, with PostgreSQL and pgAdmin Docker images:

@Test
void givenCSVAndJSON_whenRun_thenStoresAggregatedDataFrameInDB() throws Exception {
    Properties dbProps = new Properties();
    dbProps.setProperty("connectionURL", "jdbc:postgresql://localhost:5432/customerdb");
    dbProps.setProperty("driver", "org.postgresql.Driver");
    dbProps.setProperty("user", "postgres");
    dbProps.setProperty("password", "postgres");

    pipeline = new CustomerDataAggregationPipeline(dbProps);
    pipeline.run();

    String allCustomersSql = "Select count(*) from customer";

    Statement statement = conn.createStatement();
    ResultSet resultSet = statement.executeQuery(allCustomersSql);
    resultSet.next();
    int count = resultSet.getInt(1);
    assertEquals(7, count);
}

After running this, we can verify that a table exists with the columns and rows corresponding to the DataFrame. Finally, we can also observe this output via the pgAdmin4 client.

We should note a couple of important points here:

  • The customer table is created automatically as a result of the write operation.
  • The mode used is SaveMode.Overwrite. Consequently, this will overwrite anything already existing in the table. Other options available are Append, Ignore, and ErrorIfExists.

In addition, we can also use write to export the DataFrame data as CSV, JSON, or Parquet, among other formats.
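For instance, exporting the same aggregated data as CSV or JSON only changes the terminal call on the writer (the output paths here are arbitrary):

dataset.write()
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .csv("data/output/customer-aggregates-csv");

dataset.write()
  .mode(SaveMode.Overwrite)
  .json("data/output/customer-aggregates-json");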

9. Conclusion

In this tutorial, we looked at how to use DataFrames to perform data manipulation and aggregation in Apache Spark.

First, we created the DataFrames from various input sources. Then we used some of the API methods to normalize, combine, and then aggregate the data.

Finally, we exported the DataFrame as a table in a relational database.

As always, the full source code is available over on GitHub.
