1. Overview

When working with large datasets in real-world applications, data is often split across multiple files or sources, yet we may need to combine it into a single view for further processing. In such a scenario, we can use Apache Spark, whose DataFrames make it straightforward to combine the datasets.

In this tutorial, we’ll explore how to combine or concatenate two DataFrames with the same column name in Java using Apache Spark.

2. Problem Statement

Here, we want to combine two Spark DataFrames with the same schema (id, name) into a single DataFrame by appending rows. To demonstrate, let’s consider the following two DataFrames, df1 and df2.

The first DataFrame, df1, contains two rows:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

The second DataFrame, df2, contains two rows as well:

+---+-------+
| id|   name|
+---+-------+
|  3|Charlie|
|  4|  Diana|
+---+-------+

In the end, we expect an output of the two combined DataFrames:

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  Diana|
+---+-------+

We'll present examples that demonstrate how to implement this row-wise concatenation, along with unit tests that verify the result.

Before we proceed, let’s ensure the following are installed on our system:

  • Java 11 or later – required to run the Spark application
  • Apache Maven – for managing dependencies and building the project
  • Apache Spark (3.x) – to execute Spark jobs locally

To verify our setup, we can run commands to confirm the versions of Java, Maven, and Spark:

# Check Java version
$ java -version

# Check Maven version
$ mvn -version  

# Check Spark version
$ spark-submit --version

With these tools in place, let's look at the dependencies our example needs. We'll add them to the pom.xml file in the next section.

3. Setting up the Maven Project for Spark

First, let’s create a Maven project named sparkdataframeconcat:

$ mvn archetype:generate \
    -DgroupId=com.baeldung.spark.dataframeconcat \
    -DartifactId=sparkdataframeconcat \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DinteractiveMode=false

Next, in the directory sparkdataframeconcat, let’s update the pom.xml file with a few dependencies and plugins:

<dependencies>
    <!-- Spark Core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.5.2</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-reload4j</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.logging.log4j</groupId>
                <artifactId>log4j-slf4j2-impl</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <!-- Spark SQL -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.2</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-reload4j</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.logging.log4j</groupId>
                <artifactId>log4j-slf4j2-impl</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <!-- Hadoop Common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
    </dependency>

    <!-- Log4j2 API -->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-api</artifactId>
        <version>2.22.1</version>
    </dependency>

    <!-- Log4j2 Core -->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.22.1</version>
    </dependency>

    <!-- Log4j2 SLF4J binding -->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j2-impl</artifactId>
        <version>2.22.1</version>
    </dependency>

    <!-- JUnit 5 -->
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter</artifactId>
        <version>5.10.2</version>
        <scope>test</scope>
    </dependency>
</dependencies>

These additions provide Spark, Hadoop, Log4j2, and JUnit 5 support.

Further, let’s include a <build> section:

<build>
    <plugins>
        <!-- Compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.11.0</version>
            <configuration>
                <source>11</source>
                <target>11</target>
            </configuration>
        </plugin>

        <!-- Surefire plugin for JUnit 5 -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>3.2.5</version>
            <configuration>
                <useModulePath>false</useModulePath>
            </configuration>
        </plugin>
    </plugins>
</build>

The section configures the Maven Compiler Plugin to use Java 11 and the Surefire Plugin to execute JUnit 5 tests.

Finally, let’s add the Log4j2 configuration for logging. To ensure our log messages appear in the console, let’s create the configuration file src/main/resources/log4j2.properties:

status = error
name = SparkLoggingConfig

appender.console.type = Console
appender.console.name = ConsoleAppender
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = [%p] %c - %m%n

rootLogger.level = info
rootLogger.appenderRefs = console
rootLogger.appenderRef.console.ref = ConsoleAppender

Here’s what the file does:

  • Defines a console appender that prints logs to System.out
  • Uses a pattern layout ([%p] %c - %m%n) to format messages with level, logger name, and message
  • Sets the root logger level to info, so our logger.info() calls appear in the console

Now, our Maven project setup is ready. To clarify, Spark dependencies provide DataFrame support, JUnit 5 enables testing, and Log4j2 ensures we can see meaningful log messages.

4. Implementing Row-Wise Concatenation

In this section, let’s create the ConcatRowsExample class:

public class ConcatRowsExample {

    private static final Logger logger = LoggerFactory.getLogger(ConcatRowsExample.class);

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
          .appName("Row-wise Concatenation Example")
          .master("local[*]")
          .getOrCreate();

        try {
            // Create sample data
            List<Person> data1 = Arrays.asList(
                new Person(1, "Alice"),
                new Person(2, "Bob")
            );

            List<Person> data2 = Arrays.asList(
                new Person(3, "Charlie"),
                new Person(4, "Diana")
            );

            Dataset<Row> df1 = spark.createDataFrame(data1, Person.class);
            Dataset<Row> df2 = spark.createDataFrame(data2, Person.class);

            logger.info("First DataFrame:");
            df1.show();

            logger.info("Second DataFrame:");
            df2.show();

            // Row-wise concatenation using reusable method
            Dataset<Row> combined = concatenateDataFrames(df1, df2);

            logger.info("After row-wise concatenation:");
            combined.show();
        } finally {
            spark.stop();
        }
    }

    /**
     * Concatenates two DataFrames row-wise using unionByName.
     */
    public static Dataset<Row> concatenateDataFrames(Dataset<Row> df1, Dataset<Row> df2) {
        return df1.unionByName(df2);
    }

    public static class Person implements java.io.Serializable {
        private int id;
        private String name;

        public Person() {
        }

        public Person(int id, String name) {
            this.id = id;
            this.name = name;
        }

        public int getId() {
            return id;
        }

        public void setId(int id) {
            this.id = id;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }
    }
}

Let’s briefly analyze the class above:

  • SparkSession spark = SparkSession.builder()… – initializes a Spark session, the main entry point for working with DataFrames in Spark
  • Dataset<Row> df1 = spark.createDataFrame(…, Person.class) – creates the first DataFrame with the rows (1, Alice) and (2, Bob)
  • Dataset<Row> df2 = spark.createDataFrame(…, Person.class) – creates the second DataFrame with the rows (3, Charlie) and (4, Diana)
  • logger.info(“…”) – writes the message defined inside the double quotes to the logs
  • df1.show(), df2.show() – prints the contents of the first and second DataFrames to the console for inspection, respectively
  • Dataset<Row> combined = concatenateDataFrames(df1, df2) – calls the reusable concatenateDataFrames() method, which concatenates the two DataFrames row-wise using Spark’s unionByName() method
  • combined.show() – displays the final combined DataFrame
  • spark.stop() – stops the Spark session and frees up resources

The example above shows how to combine (concatenate) two DataFrames with the same column name in Java using Spark’s unionByName() method.

Spark also provides a union() method, which matches columns by their order rather than their names. This can cause subtle bugs if the column positions differ even though the names stay the same. In contrast, unionByName() matches columns by name, making it safer and more reliable for production workloads, particularly when working with evolving schemas.

Additionally, since both DataFrames share identical schemas, Spark can easily append the rows of the second DataFrame to the first, resulting in a single, unified dataset.
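
To make the difference concrete, here's a small, hypothetical snippet (reusing df1 and df2 from the example above) that reorders the columns of df2 before combining:

// df2 with its columns deliberately reordered to (name, id)
Dataset<Row> df2Reordered = df2.select("name", "id");

// unionByName() matches columns by name, so the rows still line up correctly
Dataset<Row> combinedByName = df1.unionByName(df2Reordered);
combinedByName.show();

// df1.union(df2Reordered) would match by position instead, pairing df1's id column
// with df2's name column; depending on the column types, this either fails analysis
// or silently mixes up the data

Since Spark 3.1, unionByName() also has an overload, unionByName(other, true), that allows missing columns and fills them with nulls, which can help when schemas evolve over time.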

5. Testing Row-Wise Concatenation

To verify our class works as expected, let’s create the ConcatRowsExampleUnitTest class:

class ConcatRowsExampleUnitTest {

    private static SparkSession spark;
    private Dataset<Row> df1;
    private Dataset<Row> df2;

    @BeforeAll
    static void setupClass() {
        spark = SparkSession.builder()
          .appName("Row-wise Concatenation Test")
          .master("local[*]")
          .getOrCreate();
    }

    @BeforeEach
    void setup() {
        df1 = spark.createDataFrame(
            Arrays.asList(
                new ConcatRowsExample.Person(1, "Alice"),
                new ConcatRowsExample.Person(2, "Bob")
            ),
            ConcatRowsExample.Person.class
        );

        df2 = spark.createDataFrame(
            Arrays.asList(
                new ConcatRowsExample.Person(3, "Charlie"),
                new ConcatRowsExample.Person(4, "Diana")
            ),
            ConcatRowsExample.Person.class
        );
    }

    @AfterAll
    static void tearDownClass() {
        spark.stop();
    }

    @Test
    void givenTwoDataFrames_whenConcatenated_thenRowCountMatches() {
        Dataset<Row> combined = ConcatRowsExample.concatenateDataFrames(df1, df2);

        assertEquals(
            4,
            combined.count(),
            "The combined DataFrame should have 4 rows"
        );
    }

    @Test
    void givenTwoDataFrames_whenConcatenated_thenSchemaRemainsSame() {
        Dataset<Row> combined = ConcatRowsExample.concatenateDataFrames(df1, df2);

        assertEquals(
            df1.schema(),
            combined.schema(),
            "Schema should remain consistent after concatenation"
        );
    }

    @Test
    void givenTwoDataFrames_whenConcatenated_thenDataContainsExpectedName() {
        Dataset<Row> combined = ConcatRowsExample.concatenateDataFrames(df1, df2);

        assertTrue(
          combined
            .filter("name = 'Charlie'")
            .count() > 0,
          "Combined DataFrame should contain Charlie"
        );
    }
}

Here’s a breakdown of the test file:

  • givenTwoDataFrames_whenConcatenated_thenRowCountMatches() {…} – verifies that after concatenation, the combined DataFrame contains 4 rows, in this case, 2 from df1 + 2 from df2
  • givenTwoDataFrames_whenConcatenated_thenSchemaRemainsSame() {…} – ensures that the schema (id, name) is preserved after concatenation
  • givenTwoDataFrames_whenConcatenated_thenDataContainsExpectedName() {…} – confirms the specific data from the second DataFrame (Charlie) is present in the combined result

Spark jobs often process large amounts of data. Thus, catching schema mismatches or missing rows early in tests avoids costly runtime failures.
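
As a sketch of that idea, we could add one more test to the class above. It builds a DataFrame with a deliberately different column set, (name, email), chosen purely for illustration, and asserts that unionByName() rejects it:

@Test
void givenMismatchedSchemas_whenConcatenated_thenSparkRaisesAnalysisException() {
    // needs org.apache.spark.sql.RowFactory, org.apache.spark.sql.AnalysisException,
    // org.apache.spark.sql.types.DataTypes, and StructType on top of the existing imports
    Dataset<Row> dfDifferentSchema = spark.createDataFrame(
        Arrays.asList(RowFactory.create("Eve", "eve@example.com")),
        new StructType()
          .add("name", DataTypes.StringType)
          .add("email", DataTypes.StringType)
    );

    // unionByName() cannot resolve df1's id column in the other DataFrame, so Spark
    // reports an AnalysisException; count() forces evaluation in case it's raised lazily
    assertThrows(AnalysisException.class,
      () -> ConcatRowsExample.concatenateDataFrames(df1, dfDifferentSchema).count());
}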

At this point, let’s run the project.

6. Compiling, Running, and Testing the Project

Let’s first compile the project:

$ mvn clean compile

After that, let’s run the main class ConcatRowsExample:

$ mvn exec:java -Dexec.mainClass="com.baeldung.spark.dataframeconcat.ConcatRowsExample"
...
[INFO] com.baeldung.spark.dataframeconcat.ConcatRowsExample - First DataFrame:
...
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

[INFO] com.baeldung.spark.dataframeconcat.ConcatRowsExample - Second DataFrame:
+---+-------+
| id|   name|
+---+-------+
|  3|Charlie|
|  4|  Diana|
+---+-------+

[INFO] com.baeldung.spark.dataframeconcat.ConcatRowsExample - After row-wise concatenation:
...
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  Diana|
+---+-------+
...

The logger writes descriptive messages, such as “First DataFrame:”, to the logs, whereas the show() method prints the actual tabular content of each DataFrame to the console. Printing both DataFrames separately before concatenation makes debugging easier, since we can confirm that both inputs are correct before combining them.

Finally, let’s run the tests:

$ mvn test
...
[INFO] Results:
[INFO] 
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  30.741 s
[INFO] Finished at: 2025-09-18T12:44:18+03:00
[INFO] ------------------------------------------------------------------------

Above, we see that all JUnit tests pass.

7. Conclusion

In this article, we demonstrated how to combine (concatenate) two DataFrames with the same column name in Java using Apache Spark.

We leveraged Spark’s unionByName() method to safely append rows from one DataFrame to another while ensuring schema consistency. Additionally, we created JUnit tests to verify that the concatenated DataFrame preserved the schema as well as the expected data.

With this approach, we can handle data that arrives in parts, such as data from multiple files, sources, or partitions that needs further analysis. With the setup in place, we can easily extend this example to handle larger datasets.
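
For instance, here's a minimal sketch that applies the same concatenation to data read from disk, assuming two CSV files that share the header id,name (the file paths are purely illustrative):

// hypothetical CSV files sharing the header id,name; the paths are placeholders
Dataset<Row> january = spark.read().option("header", "true").csv("data/january.csv");
Dataset<Row> february = spark.read().option("header", "true").csv("data/february.csv");

// the same row-wise concatenation works regardless of where the DataFrames come from
Dataset<Row> allMonths = january.unionByName(february);
allMonths.show();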

As always, the source code is available over on GitHub.
