1. Introduction

Apache Kafka is a messaging platform. With it, we can exchange data between different applications at scale.

Spring Cloud Stream is a framework for building message-driven applications. It can simplify the integration of Kafka into our services.

Conventionally, Kafka is used with the Avro message format, supported by a schema registry. In this tutorial, we'll use the Confluent Schema Registry. We'll try both Spring's own integration with the Confluent Schema Registry and Confluent's native libraries.

2. Confluent Schema Registry

Kafka represents all data as bytes, so it's common to use an external schema and serialize and deserialize into bytes according to that schema. Rather than supply a copy of that schema with each message, which would be an expensive overhead, it's also common to keep the schema in a registry and supply just an id with each message.

Confluent Schema Registry provides an easy way to store, retrieve and manage schemas. It exposes several useful RESTful APIs.

Schemata are stored by subject, and by default, the registry does a compatibility check before allowing a new schema to be uploaded against a subject.
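To make the subject idea concrete, here is a minimal sketch of the request that registers a schema under a subject through the registry's REST API. It only builds the request and doesn't send it. The subject name employee-details-value and the one-field schema are assumptions for illustration, and the small escaping helper only handles single-line schema text:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) the request that registers a schema under a subject.
class RegisterSchemaRequest {

    // Escapes backslashes and double quotes so the schema can be embedded as a JSON string.
    // Assumes the schema text is on one line and has no control characters.
    static String jsonEscape(String raw) {
        return raw.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // POST /subjects/{subject}/versions registers a schema under the given subject.
    static HttpRequest build(String registryUrl, String subject, String avroSchema) {
        String body = "{\"schema\": \"" + jsonEscape(avroSchema) + "\"}";
        return HttpRequest.newBuilder()
            .uri(URI.create(registryUrl + "/subjects/" + subject + "/versions"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical subject name and a cut-down schema, for illustration only.
        String schema = "{\"type\":\"record\",\"name\":\"Employee\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}";
        HttpRequest request = build("http://localhost:8081", "employee-details-value", schema);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

If the new schema fails the subject's compatibility check, the registry rejects a request like this one instead of storing a new version.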

Each producer will know the schema it's producing with, and each consumer should either be able to consume data in any format or have a specific schema it prefers to read in. The producer consults the registry to establish the correct ID to use when sending a message. The consumer uses the registry to fetch the sender's schema.

When the consumer knows both the sender's schema and its own desired message format, the Avro library can convert the data into the consumer's desired format.

3. Apache Avro

Apache Avro is a data serialization system.

It uses a JSON structure to define the schema, providing for serialization between bytes and structured data.

One strength of Avro is its support for evolving messages written in one version of a schema into the format defined by a compatible alternative schema.
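The idea behind this kind of schema evolution can be sketched without the Avro library. In the toy model below, plain maps stand in for records: fields the reader expects but the writer didn't send are filled from the reader's defaults, and fields only the writer knows are dropped. The department field and its default are hypothetical, and this isn't the Avro API itself:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of Avro-style schema resolution; plain maps stand in for records and schemas.
class SchemaResolutionSketch {

    // written:        the record as the writer's schema produced it
    // readerFields:   the field names the reader's schema declares, in order
    // readerDefaults: default values the reader's schema declares for some of its fields
    static Map<String, Object> resolve(Map<String, Object> written,
                                       List<String> readerFields,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String name : readerFields) {
            if (written.containsKey(name)) {
                result.put(name, written.get(name));        // present in both schemas
            } else if (readerDefaults.containsKey(name)) {
                result.put(name, readerDefaults.get(name)); // missing in writer: use the default
            } else {
                throw new IllegalStateException("No value or default for field " + name);
            }
        }
        return result; // fields known only to the writer are dropped
    }

    public static void main(String[] args) {
        Map<String, Object> written = Map.of("id", 1001, "firstName", "Harry", "lastName", "Potter");
        // Hypothetical newer reader schema that adds a "department" field with a default.
        List<String> readerFields = List.of("id", "firstName", "lastName", "department");
        Map<String, Object> defaults = Map.of("department", "UNKNOWN");
        System.out.println(resolve(written, readerFields, defaults));
    }
}
```

A reader field with neither a written value nor a default makes the two schemas incompatible, which is the kind of problem the registry's compatibility check is meant to catch before a schema is uploaded.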

The Avro toolset is also able to generate classes to represent the data structures of these schemata, making it easy to serialize in and out of POJOs.

4. Setting up the Project

To use a schema registry with Spring Cloud Stream, we need the Spring Cloud Kafka Binder and schema registry Maven dependencies:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-binder-kafka</artifactId>
</dependency>

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-schema</artifactId>
</dependency>

For Confluent's serializer, we need:

<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-avro-serializer</artifactId>
    <version>4.0.0</version>
</dependency>

Confluent's serializer is hosted in Confluent's own Maven repository, so we need to add it:

<repositories>
    <repository>
        <id>confluent</id>
        <url>https://packages.confluent.io/maven/</url>
    </repository>
</repositories>

Also, let's use a Maven plugin to generate the Avro classes:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.8.2</version>
            <executions>
                <execution>
                    <id>schemas</id>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>schema</goal>
                        <goal>protocol</goal>
                        <goal>idl-protocol</goal>
                    </goals>
                    <configuration>                        
                        <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

For testing, we can use either an existing Kafka and Schema Registry setup or a dockerized Confluent and Kafka.

5. Spring Cloud Stream

Now that we've got our project set up, let's next write a producer using Spring Cloud Stream. It'll publish employee details on a topic.

Then, we'll create a consumer that will read events from the topic and write them out in a log statement.

5.1. Schema

First, let's define a schema for employee details. We can name it employee-schema.avsc.

We can keep the schema file in src/main/resources:

{
    "type": "record",
    "name": "Employee",
    "namespace": "com.baeldung.schema",
    "fields": [
    {
        "name": "id",
        "type": "int"
    },
    {
        "name": "firstName",
        "type": "string"
    },
    {
        "name": "lastName",
        "type": "string"
    }]
}

After creating the above schema, we need to build the project. Then, the Apache Avro code generator will create a POJO named Employee under the package com.baeldung.schema.

5.2. Producer

Spring Cloud Stream provides the Processor interface. This provides us with an output and input channel.

Let's use this to make a producer that sends Employee objects to the employee-details Kafka topic:

@Autowired
private Processor processor;

public void produceEmployeeDetails(int empId, String firstName, String lastName) {

    // creating employee details
    Employee employee = new Employee();
    employee.setId(empId);
    employee.setFirstName(firstName);
    employee.setLastName(lastName);

    Message<Employee> message = MessageBuilder.withPayload(employee)
                .build();

    processor.output()
        .send(message);
}

5.2. Consumer

Now, let's write our consumer:

@StreamListener(Processor.INPUT)
public void consumeEmployeeDetails(Employee employeeDetails) {
    logger.info("Let's process employee details: {}", employeeDetails);
}

This consumer will read events published on the employee-details topic. Let's direct its output to the log to see what it does.

5.3. Kafka Bindings

So far we've only been working against the input and output channels of our Processor object. These channels need configuring with the correct destinations.

Let's use application.yml to provide the Kafka bindings:

spring:
  cloud:
    stream: 
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
        output:
          destination: employee-details
          content-type: application/*+avro

We should note that, in this case, destination means the Kafka topic. It may be slightly confusing that it is called destination since it is the input source in this case, but it's a consistent term across consumers and producers.

5.4. Entry Point

Now that we have our producer and consumer, let's expose an API to take inputs from a user and pass them to the producer:

@Autowired
private AvroProducer avroProducer;

@PostMapping("/employees/{id}/{firstName}/{lastName}")
public String producerAvroMessage(@PathVariable int id, @PathVariable String firstName, 
  @PathVariable String lastName) {
    avroProducer.produceEmployeeDetails(id, firstName, lastName);
    return "Sent employee details to consumer";
}

5.5. Enable the Confluent Schema Registry and Bindings

Finally, to make our application apply both the Kafka and schema registry bindings, we'll need to add @EnableBinding and @EnableSchemaRegistryClient on one of our configuration classes:

@SpringBootApplication
@EnableBinding(Processor.class)
@EnableSchemaRegistryClient
public class AvroKafkaApplication {

    public static void main(String[] args) {
        SpringApplication.run(AvroKafkaApplication.class, args);
    }

}

And we should provide a ConfluentSchemaRegistryClient bean:

@Value("${spring.cloud.stream.kafka.binder.producer-properties.schema.registry.url}")
private String endPoint;

@Bean
public SchemaRegistryClient schemaRegistryClient() {
    ConfluentSchemaRegistryClient client = new ConfluentSchemaRegistryClient();
    client.setEndpoint(endPoint);
    return client;
}

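The endPoint is the URL for the Confluent Schema Registry. Here, it's read from the schema.registry.url entry under the binder's producer-properties in application.yml.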

5.6. Testing our Service

Let's test the service with a POST request:

curl -X POST localhost:8080/employees/1001/Harry/Potter

The logs tell us that this has worked:

2019-06-11 18:45:45.343  INFO 17036 --- [container-0-C-1] com.baeldung.consumer.AvroConsumer       : Let's process employee details: {"id": 1001, "firstName": "Harry", "lastName": "Potter"}

5.7. What happened during Processing?

Let's try to understand what exactly happened with our example application:

  1. The producer built the Kafka message using the Employee object
  2. The producer registered the employee schema with the schema registry to get a schema version ID. This either creates a new ID or reuses the existing one for that exact schema
  3. Avro serialized the Employee object using the schema
  4. Spring Cloud put the schema-id in the message headers
  5. The message was published on the topic
  6. When the message came to the consumer, it read the schema-id from the header
  7. The consumer used schema-id to get the Employee schema from the registry
  8. The consumer found a local class that could represent that object and deserialized the message into it
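The register-then-look-up part of these steps can be sketched with a toy in-memory registry. This is an illustration only and not how the Confluent Schema Registry or Spring's client is implemented. The class name and the way IDs are allocated are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in for a schema registry: identical schema text always maps to the same ID.
class InMemorySchemaRegistry {

    private final Map<String, Integer> idsBySchema = new HashMap<>();
    private final Map<Integer, String> schemasById = new HashMap<>();

    // Producer side (step 2): returns the existing ID for this exact schema, or allocates a new one.
    int register(String schema) {
        Integer existing = idsBySchema.get(schema);
        if (existing != null) {
            return existing;
        }
        int id = idsBySchema.size() + 1; // hypothetical ID allocation, for illustration
        idsBySchema.put(schema, id);
        schemasById.put(id, schema);
        return id;
    }

    // Consumer side (step 7): resolves the ID carried with the message back to the schema.
    String lookup(int id) {
        String schema = schemasById.get(id);
        if (schema == null) {
            throw new IllegalArgumentException("Unknown schema id " + id);
        }
        return schema;
    }

    public static void main(String[] args) {
        InMemorySchemaRegistry registry = new InMemorySchemaRegistry();
        String employeeSchema = "{\"type\":\"record\",\"name\":\"Employee\"}"; // abbreviated
        int first = registry.register(employeeSchema);
        int second = registry.register(employeeSchema); // same schema, same ID
        System.out.println(first == second);
        System.out.println(registry.lookup(first).equals(employeeSchema));
    }
}
```

Because only the ID travels with each message, every consumer can resolve it to the full schema on demand and cache the result.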

6. Serialization/Deserialization Using Native Kafka Libraries

Spring Cloud Stream provides a few out-of-the-box message converters. By default, it uses the Content-Type header to select an appropriate message converter.

In our example, the Content-Type is application/*+avro, so it used AvroSchemaMessageConverter to read and write Avro formats. However, Confluent recommends using KafkaAvroSerializer and KafkaAvroDeserializer for message conversion.

While Spring's own format works well, it has some drawbacks in terms of partitioning, and it is not interoperable with the Confluent standards. That matters if non-Spring services on our Kafka instance need to read or write the same messages.

Let's update our application.yml to use the Confluent converters:

spring:
  cloud:
    stream:
      default:
        producer:
          useNativeEncoding: true
        consumer:
          useNativeDecoding: true
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
        output:
          destination: employee-details
          content-type: application/*+avro
      kafka:
        binder:
          producer-properties:
            key.serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
            value.serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
            schema.registry.url: http://localhost:8081
          consumer-properties:
            key.deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
            value.deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
            schema.registry.url: http://localhost:8081
            specific.avro.reader: true

We've enabled useNativeEncoding, which forces Spring Cloud Stream to delegate serialization to the provided classes. The matching consumer-side property, useNativeDecoding, does the same for deserialization.

Note also that we can pass native Kafka client settings through Spring Cloud Stream using kafka.binder.producer-properties and kafka.binder.consumer-properties.
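One reason the two formats aren't interchangeable is where the schema ID travels. Confluent's serializers put it in the message body itself: a magic byte of 0, then the schema ID as a 4-byte big-endian integer, then the Avro binary data. The following is a minimal sketch of that framing using only the JDK. The class and method names are our own for illustration, and the real serializers do considerably more, such as talking to the registry and caching IDs:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of the Confluent wire format: [magic byte 0][4-byte schema ID][Avro binary payload].
class ConfluentFraming {

    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_LENGTH = 1 + Integer.BYTES;

    // Prefixes an already-serialized Avro payload with the magic byte and schema ID.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(HEADER_LENGTH + avroPayload.length)
            .put(MAGIC_BYTE)
            .putInt(schemaId)      // ByteBuffer writes big-endian by default
            .put(avroPayload)
            .array();
    }

    // Reads the schema ID back out of a framed message.
    static int schemaIdOf(byte[] framed) {
        ByteBuffer buffer = ByteBuffer.wrap(framed);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt();
    }

    // Returns the Avro payload that follows the 5-byte header.
    static byte[] payloadOf(byte[] framed) {
        return Arrays.copyOfRange(framed, HEADER_LENGTH, framed.length);
    }

    public static void main(String[] args) {
        byte[] framed = frame(42, new byte[] {1, 2, 3}); // placeholder payload bytes
        System.out.println(schemaIdOf(framed) + " " + Arrays.toString(payloadOf(framed)));
    }
}
```

A consumer that expects the schema reference in a message header, as Spring's converter does, won't find it in a message framed this way, and vice versa.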

7. Consumer Groups and Partitions

A consumer group is a set of consumers belonging to the same application. Consumers in the same group share the same group name.

Let's update application.yml to add a consumer group name:

spring:
  cloud:
    stream:
      # ...
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
          group: group-1
      # ...

The consumers in a group divide the topic's partitions among themselves as evenly as possible. Messages in different partitions will be processed in parallel.

In a consumer group, the maximum number of consumers reading messages at a time is equal to the number of partitions, because each partition is read by only one consumer in the group. So we can configure the number of partitions and consumers to get the desired parallelism. In general, we should have more partitions than the total number of consumers across all replicas of our service.
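To see why consumers beyond the partition count don't add parallelism, here is a simplified sketch that deals partitions out to the consumers of one group in turn. It isn't Kafka's actual assignment algorithm, only an illustration of the one-consumer-per-partition rule:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of sharing partitions within one consumer group.
class PartitionAssignmentSketch {

    // Deals partitions 0..partitionCount-1 out to the consumers in turn.
    // Element i of the result lists the partitions owned by consumer i.
    static List<List<Integer>> assign(int partitionCount, int consumerCount) {
        if (consumerCount <= 0) {
            throw new IllegalArgumentException("Need at least one consumer");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int consumer = 0; consumer < consumerCount; consumer++) {
            assignment.add(new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            assignment.get(partition % consumerCount).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(6, 3)); // every consumer owns two partitions
        System.out.println(assign(3, 4)); // the fourth consumer owns none and sits idle
    }
}
```

With six partitions and three consumers, the load is shared evenly. With three partitions and four consumers, one consumer is left with nothing to read.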

7.1. Partition Key

When processing our messages, the order in which they are processed may be important. When our messages are processed in parallel, the sequence of processing is hard to control.

Kafka guarantees that, within a given partition, messages are always delivered in the sequence in which they arrived. So, where it matters that certain messages are processed in the right order, we ensure that they land in the same partition as each other.

We can provide a partition key while sending a message to a topic. Messages with the same partition key will always go to the same partition. If no partition key is present, the producer spreads messages across the partitions itself (in round-robin fashion in older Kafka clients).
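The key-to-partition mapping boils down to hashing the serialized key and taking the result modulo the number of partitions. The sketch below uses Arrays.hashCode as a stand-in hash, whereas Kafka's default partitioner uses a murmur2 hash of the key bytes, so the partition numbers here won't match what a real broker would see. The key format "1001|IT" is also just an assumption for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of key-based partitioning: equal key bytes always map to the same partition.
class KeyPartitioningSketch {

    static int partitionFor(byte[] serializedKey, int partitionCount) {
        // Stand-in hash for illustration; Kafka's default partitioner uses murmur2 instead.
        int hash = Arrays.hashCode(serializedKey);
        // floorMod keeps the result in [0, partitionCount) even when the hash is negative.
        return Math.floorMod(hash, partitionCount);
    }

    public static void main(String[] args) {
        byte[] key = "1001|IT".getBytes(StandardCharsets.UTF_8); // hypothetical serialized key
        int first = partitionFor(key, 6);
        int second = partitionFor(key.clone(), 6);
        System.out.println(first == second); // same key bytes, same partition
    }
}
```

The mapping depends on the partition count, so adding partitions to a topic later can send the same key to a different partition.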

Let's try to understand this with an example. Imagine we are receiving multiple messages for an employee and we want to process all of that employee's messages in sequence. The department name and employee id can identify an employee uniquely.

So let's define the partition key with the employee's id and department name:

{
    "type": "record",
    "name": "EmployeeKey",
    "namespace": "com.baeldung.schema",
    "fields": [
    {
        "name": "id",
        "type": "int"
    },
    {
        "name": "departmentName",
        "type": "string"
    }]
}

After building the project, the EmployeeKey POJO will get generated under the package com.baeldung.schema.

Let's update our producer to use the EmployeeKey as a partition key:

public void produceEmployeeDetails(int empId, String firstName, String lastName) {

    // creating employee details
    Employee employee = new Employee();
    employee.setId(empId);
    // ...

    // creating partition key for kafka topic
    EmployeeKey employeeKey = new EmployeeKey();
    employeeKey.setId(empId);
    employeeKey.setDepartmentName("IT");

    Message<Employee> message = MessageBuilder.withPayload(employee)
        .setHeader(KafkaHeaders.MESSAGE_KEY, employeeKey)
        .build();

    processor.output()
        .send(message);
}

Here, we're putting the partition key in the message header.

Now, the same partition will receive the messages with the same employee id and department name.

7.2. Consumer Concurrency

Spring Cloud Stream allows us to set the concurrency for a consumer in application.yml:

spring:
  cloud:
    stream:
      # ...
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
          group: group-1
          concurrency: 3

Now our application will consume messages from the topic on up to three threads concurrently. In other words, Spring will spawn three consumer threads, each reading independently from its own share of the partitions.
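As a rough mental model of what this setting buys us, the sketch below hands each partition's messages to a pool of worker threads. Partitions are worked through in parallel, while the messages inside one partition are still handled in order. This is a simplification written against the plain JDK and not Spring's listener container code. The message strings and the "processed" prefix are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified model of consumer concurrency: parallel across partitions, ordered within each one.
class ConcurrencySketch {

    // Each inner list holds one partition's messages in arrival order.
    static List<List<String>> consume(List<List<String>> partitions, int concurrency) {
        ExecutorService workers = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (List<String> partition : partitions) {
                // One task per partition, so a partition is never handled by two threads at once.
                pending.add(workers.submit(() -> {
                    List<String> processed = new ArrayList<>();
                    for (String message : partition) {
                        processed.add("processed " + message);
                    }
                    return processed;
                }));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                results.add(future.get()); // wait for that partition's worker to finish
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Consumption failed", e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("a1", "a2"), List.of("b1"), List.of("c1", "c2"));
        System.out.println(consume(partitions, 3));
    }
}
```

Setting the concurrency higher than the number of partitions gains nothing, since the extra threads would have no partition to read from.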

8. Conclusion

In this article, we integrated a producer and consumer against Apache Kafka with Avro schemas and the Confluent Schema Registry.

We did this in a single application, but the producer and consumer could have been deployed in different applications and would have been able to have their own versions of the schemas, kept in sync via the registry.

We looked at how to use Spring's implementation of Avro and Schema Registry client, and then we saw how to switch over to the Confluent standard implementation of serialization and deserialization for the purposes of interoperability.

Finally, we looked at how to partition our topic and ensure we have the correct message keys to enable safe parallel processing of our messages.

The complete code used for this article can be found on GitHub.
