Introduction to Apache Spark

Azure Spring Apps is a fully managed service from Microsoft (built in collaboration with VMware), focused on building and deploying Spring Boot applications on Azure Cloud without worrying about Kubernetes.

And, the Enterprise plan comes with some interesting features, such as commercial Spring runtime support, a 99.95% SLA and some deep discounts (up to 47%) when you are ready for production.

>> Learn more and deploy your first Spring Boot app to Azure.

You can also ask questions and leave feedback on the Azure Spring Apps GitHub page.

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

The Jet Profiler was built for MySQL only, so it can do things like real-time query performance, focus on most used tables or most frequent queries, quickly identify performance issues and basically help you optimize your queries.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

Accelerate Your Jakarta EE Development with Payara Server!

With best-in-class guides and documentation, Payara essentially simplifies deployment to diverse infrastructures.

Beyond that, it provides intelligent insights and actions to optimize Jakarta EE applications.

The goal is to apply an opinionated approach to get to what's essential for mission-critical applications - really solid scalability, availability, security, and long-term support:

>> Download and Explore the Guide (to learn more)

The AI Assistant to boost Boost your productivity writing unit tests - Machinet AI.

AI is all the rage these days, but for very good reason. The highly practical coding companion, you'll get the power of AI-assisted coding and automated unit test generation.
Machinet's Unit Test AI Agent utilizes your own project context to create meaningful unit tests that intelligently aligns with the behavior of the code.
And, the AI Chat crafts code and fixes errors with ease, like a helpful sidekick.

Simplify Your Coding Journey with Machinet AI:

>> Install Machinet AI in your IntelliJ

Looking for the ideal Linux distro for running modern Spring apps in the cloud?

Meet Alpaquita Linux: lightweight, secure, and powerful enough to handle heavy workloads.

This distro is specifically designed for running Java apps. It builds upon Alpine and features significant enhancements to excel in high-density container environments while meeting enterprise-grade security standards.

Specifically, the container image size is ~30% smaller than standard options, and it consumes up to 30% less RAM:

>> Try Alpaquita Containers now.

DbSchema is a super-flexible database designer, which can take you from designing the DB with your team all the way to safely deploying the schema.

The way it does all of that is by using a design model, a database-independent image of the schema, which can be shared in a team using GIT and compared or deployed on to any database.

And, of course, it can be heavily visual, allowing you to interact with the database using diagrams, visually compose queries, explore the data, generate random data, import data or build HTML5 database reports.

>> Take a look at DBSchema

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

1. Introduction

Apache Spark is an open-source cluster-computing framework. It provides elegant development APIs for Scala, Java, Python, and R that allow developers to execute a variety of data-intensive workloads across diverse data sources including HDFS, Cassandra, HBase, S3 etc.

Historically, Hadoop’s MapReduce prooved to be inefficient for some iterative and interactive computing jobs, which eventually led to the development of Spark. With Spark, we can run logic up to two orders of magnitude faster than with Hadoop in memory, or one order of magnitude faster on disk.

2. Spark Architecture

Spark applications run as independent sets of processes on a cluster as described in the below diagram:

These set of processes are coordinated by the SparkContext object in your main program (called the driver program). SparkContext connects to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.

Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

3. Core Components

The following diagram gives the clear picture of the different components of Spark:

3.1. Spark Core

Spark Core component is accountable for all the basic I/O functionalities, scheduling and monitoring the jobs on spark clusters, task dispatching, networking with different storage systems, fault recovery, and efficient memory management.

Unlike Hadoop, Spark avoids shared data to be stored in intermediate stores like Amazon S3 or HDFS by using a special data structure known as RDD (Resilient Distributed Datasets).

Resilient Distributed Datasets are immutable, a partitioned collection of records that can be operated on – in parallel and allows – fault-tolerant ‘in-memory’ computations.

RDDs support two kinds of operations:

Transformation – Spark RDD transformation is a function that produces new RDD from the existing RDDs. The transformer takes RDD as input and produces one or more RDD as output. Transformations are lazy in nature i.e., they get execute when we call an action

Action – transformations create RDDs from each other, but when we want to work with the actual data set, at that point action is performed. Thus, Actions are Spark RDD operations that give non-RDD values. The values of action are stored to drivers or to the external storage system

An action is one of the ways of sending data from Executor to the driver.

Executors are agents that are responsible for executing a task. While the driver is a JVM process that coordinates workers and execution of the task. Some of the actions of Spark are count and collect.

3.2. Spark SQL

Spark SQL is a Spark module for structured data processing. It’s primarily used to execute SQL queries. DataFrame constitutes the main abstraction for Spark SQL. Distributed collection of data ordered into named columns is known as a DataFrame in Spark.

Spark SQL supports fetching data from different sources like Hive, Avro, Parquet, ORC, JSON, and JDBC. It also scales to thousands of nodes and multi-hour queries using the Spark engine – which provides full mid-query fault tolerance.

3.3. Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from a number of sources, such as Kafka, Flume, Kinesis, or TCP sockets.

Finally, processed data can be pushed out to file systems, databases, and live dashboards.

3.4. Spark Mlib

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

ML Algorithms – common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization – feature extraction, transformation, dimensionality reduction, and selection
Pipelines – tools for constructing, evaluating, and tuning ML Pipelines
Persistence – saving and load algorithms, models, and Pipelines
Utilities – linear algebra, statistics, data handling, etc.

3.5. Spark GraphX

GraphX is a component for graphs and graph-parallel computations. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages).

In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

4. “Hello World” in Spark

Now that we understand the core components, we can move on to simple Maven-based Spark project – for calculating word counts.

We’ll be demonstrating Spark running in the local mode where all the components are running locally on the same machine where it’s the master node, executor nodes or Spark’s standalone cluster manager.

4.1. Maven Setup

Let’s set up a Java Maven project with Spark-related dependencies in pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.10</artifactId>
	<version>1.6.0</version>
    </dependency>
</dependencies>

4.2. Word Count – Spark Job

Let’s now write Spark job to process a file containing sentences and output distinct words and their counts in the file:

public static void main(String[] args) throws Exception {
    if (args.length < 1) {
        System.err.println("Usage: JavaWordCount <file>");
        System.exit(1);
    }
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words 
      = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
    JavaPairRDD<String, Integer> ones 
      = words.mapToPair(word -> new Tuple2<>(word, 1));
    JavaPairRDD<String, Integer> counts 
      = ones.reduceByKey((Integer i1, Integer i2) -> i1 + i2);

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
        System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
}

Notice that we pass the path of the local text file as an argument to a Spark job.

A SparkContext object is the main entry point for Spark and represents the connection to an already running Spark cluster. It uses SparkConf object for describing the application configuration. SparkContext is used to read a text file in memory as a JavaRDD object.

Next, we transform the lines JavaRDD object to words JavaRDD object using the flatmap method to first convert each line to space-separated words and then flatten the output of each line processing.

We again apply transform operation mapToPair which basically maps each occurrence of the word to the tuple of words and count of 1.

Then, we apply the reduceByKey operation to group multiple occurrences of any word with count 1 to a tuple of words and summed up the count.

Lastly, we execute collect RDD action to get the final results.

4.3. Executing – Spark Job

Let’s now build the project using Maven to generate apache-spark-1.0-SNAPSHOT.jar in the target folder.

Next, we need to submit this WordCount job to Spark:

${spark-install-dir}/bin/spark-submit --class com.baeldung.WordCount 
  --master local ${WordCount-MavenProject}/target/apache-spark-1.0-SNAPSHOT.jar
  ${WordCount-MavenProject}/src/main/resources/spark_example.txt

Spark installation directory and WordCount Maven project directory needs to be updated before running above command.

On submission couple of steps happens behind the scenes:

From the driver code, SparkContext connects to cluster manager(in our case spark standalone cluster manager running locally)
Cluster Manager allocates resources across the other applications
Spark acquires executors on nodes in the cluster. Here, our word count application will get its own executor processes
Application code (jar files) is sent to executors
Tasks are sent by the SparkContext to the executors.

Finally, the result of spark job is returned to the driver and we will see the count of words in the file as the output:

Hello 1
from 2
Baledung 2
Keep 1
Learning 1
Spark 1
Bye 1

5. Conclusion

In this article, we discussed the architecture and different components of Apache Spark. We also demonstrated a working example of a Spark job giving word counts from a file.

As always, the full source code is available over on GitHub.

Introduction to Apache Spark

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Introduction

2. Spark Architecture

3. Core Components

3.1. Spark Core

3.2. Spark SQL

3.3. Spark Streaming

3.4. Spark Mlib

3.5. Spark GraphX

4. “Hello World” in Spark

4.1. Maven Setup

4.2. Word Count – Spark Job

4.3. Executing – Spark Job

5. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course:

REST with Spring

Learn Spring Security ▼▲

Learn Spring Security Core

Learn Spring Security OAuth

Learn Spring

Learn Spring Data JPA

Persistence

REST

Security

Full Archive

Baeldung Ebooks

About Baeldung

Write for Baeldung

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Introduction

2. Spark Architecture

3. Core Components

3.1. Spark Core

3.2. Spark SQL

3.3. Spark Streaming

3.4. Spark Mlib

3.5. Spark GraphX

4. “Hello World” in Spark

4.1. Maven Setup

4.2. Word Count – Spark Job

4.3. Executing – Spark Job

5. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course: