Introduction to Hazelcast Jet

Last updated: January 23, 2024

Written by: baeldung

Reviewed by: Tom Hombergs

Performance

Refactor Java code safely — and automatically — with OpenRewrite.

Refactoring big codebases by hand is slow, risky, and easy to put off. That’s where OpenRewrite comes in. The open-source framework for large-scale, automated code transformations helps teams modernize safely and consistently.

Each month, the creators and maintainers of OpenRewrite at Moderne run live, hands-on training sessions — one for newcomers and one for experienced users. You’ll see how recipes work, how to apply them across projects, and how to modernize code with confidence.

Join the next session, bring your questions, and learn how to automate the kind of work that usually eats your sprint time.

1. Introduction

In this tutorial, we’ll learn about Hazelcast Jet. It’s a distributed data processing engine provided by Hazelcast, Inc. and is built on top of Hazelcast IMDG.

If you want to learn about Hazelcast IMDG, here is an article for getting started.

2. What Is Hazelcast Jet?

Hazelcast Jet is a distributed data processing engine that treats data as streams. It can process data that is stored in a database or files as well as the data that is streamed by a Kafka server.

Moreover, it can perform aggregate functions over infinite data streams by dividing the streams into subsets and applying aggregation over each subset. This concept is known as windowing in the Jet terminology.

We can deploy Jet in a cluster of machines and then submit our data processing jobs to it. Jet will make all the members of the cluster automatically process the data. Each member of the cluster consumes a part of the data, and that makes it easy to scale up to any level of throughput.

Here are the typical use cases for Hazelcast Jet:

Real-Time Stream Processing
Fast Batch Processing
Processing Java 8 Streams in a distributed way
Data processing in Microservices

3. Setup

To setup Hazelcast Jet in our environment, we just need to add a single Maven dependency to our pom.xml.

Here’s how we do it:

<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>4.2</version>
</dependency>

Including this dependency will download a 10 Mb jar file which provides us with all the infrastructure we need to build a distributed data processing pipeline.

The latest version for Hazelcast Jet can be found here.

4. Sample Application

In order to learn more about Hazelcast Jet, we’ll create a sample application that takes an input of sentences and a word to find in those sentences and returns the count of the specified word in those sentences.

4.1. The Pipeline

A Pipeline forms the basic construct for a Jet application. Processing within a pipeline follows these steps:

read data from a source
transform the data
write data into a sink

For our application, the pipeline will read from a distributed List, apply the transformation of grouping and aggregation and finally write to a distributed Map.

Here’s how we write our pipeline:

private Pipeline createPipeLine() {
    Pipeline p = Pipeline.create();
    p.readFrom(Sources.<String>list(LIST_NAME))
      .flatMap(word -> traverseArray(word.toLowerCase().split("\\W+")))
      .filter(word -> !word.isEmpty())
      .groupingKey(wholeItem())
      .aggregate(counting())
      .writeTo(Sinks.map(MAP_NAME));
    return p;
}

Once we’ve read from the source, we traverse the data and split it around the space using a regular expression. After that, we filter out the blanks.

Finally, we group the words, aggregate them and write the results to a Map.

4.2. The Job

Now that our pipeline is defined, we create a job for executing the pipeline.

Here’s how we write a countWord function which accepts parameters and returns the count:

public Long countWord(List<String> sentences, String word) {
    long count = 0;
    JetInstance jet = Jet.newJetInstance();
    try {
        List<String> textList = jet.getList(LIST_NAME);
        textList.addAll(sentences);
        Pipeline p = createPipeLine();
        jet.newJob(p).join();
        Map<String, Long> counts = jet.getMap(MAP_NAME);
        count = counts.get(word);
        } finally {
            Jet.shutdownAll();
      }
    return count;
}

We create a Jet instance first in order to create our job and use the pipeline. Next, we copy the input List to a distributed list so that it’s available over all the instances.

We then submit a job using the pipeline that we have built above. The method newJob() returns an executable job that is started by Jet asynchronously. The join method waits for the job to complete and throws an exception if the job is completed with an error.

When the job completes the results are retrieved in a distributed Map, as we defined in our pipeline. So, we get the Map from the Jet instance and get the counts of the word against it.

Lastly, we shut down the Jet instance. It is important to shut it down after our execution has ended, as Jet instance starts its own threads. Otherwise, our Java process will still be alive even after our method has exited.

Here is a unit test that tests the code we have written for Jet:

@Test
public void whenGivenSentencesAndWord_ThenReturnCountOfWord() {
    List<String> sentences = new ArrayList<>();
    sentences.add("The first second was alright, but the second second was tough.");
    WordCounter wordCounter = new WordCounter();
    long countSecond = wordCounter.countWord(sentences, "second");
    assertEquals(3, countSecond);
}