Guide to Apache Avro

1. Overview

Data serialization is a technique of converting data into binary or text format. There are multiple systems available for this purpose. Apache Avro is one of those data serialization systems.

Avro is a language-independent, schema-based data serialization library. It uses a schema to perform serialization and deserialization. Moreover, Avro defines its schemas in JSON, which keeps them human-readable and easy to process in any language.

In this tutorial, we’ll explore Avro setup and the Java API for serialization and deserialization.

We’ll focus primarily on schema creation which is the base of the whole system.

2. Apache Avro

Avro is a language-independent serialization library. To achieve this, Avro relies on a schema, which is one of its core components. It stores the schema in a file for further data processing.

Avro is well suited to Big Data processing. It’s quite popular in the Hadoop and Kafka ecosystems thanks to its compact binary format and fast processing.

Avro creates a data file where it keeps the data along with the schema in its metadata section. Above all, it provides rich data structures, such as records, enums, arrays and maps.
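To make the data file idea concrete, here is a minimal sketch that writes one record into an in-memory Avro container and then recovers the schema from the container alone. It assumes the Avro library is on the classpath; the class DataFileSketch and its method names are hypothetical, and the generic API (GenericRecord) stands in for classes we haven't generated yet.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper class, for illustration only
class DataFileSketch {

    // A small record schema with two string fields
    static Schema sampleSchema() {
        return SchemaBuilder.record("ClientIdentifier")
          .namespace("com.baeldung.avro.model")
          .fields()
          .requiredString("hostName")
          .requiredString("ipAddress")
          .endRecord();
    }

    // A record instance matching the sample schema
    static GenericRecord sampleRecord(Schema schema) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("hostName", "localhost");
        record.put("ipAddress", "127.0.0.1");
        return record;
    }

    // Writes one record into an Avro container; the schema goes into the header metadata
    static byte[] writeContainer(Schema schema, GenericRecord record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
               new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);
            writer.append(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Recovers the schema from the container itself, without any external schema file
    static Schema readEmbeddedSchema(byte[] container) {
        try (DataFileStream<GenericRecord> reader = new DataFileStream<>(
               new ByteArrayInputStream(container), new GenericDatumReader<GenericRecord>())) {
            return reader.getSchema();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the schema travels with the data, any reader can decode the container without being handed the schema separately.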

To use Avro for serialization, we need to follow the steps covered in the sections below.

3. Problem Statement

Let’s start by defining a class called AvroHttpRequest that we’ll use for our examples. The class contains primitive as well as complex type attributes:

class AvroHttpRequest {
    
    private long requestTime;
    private ClientIdentifier clientIdentifier;
    private List<String> employeeNames;
    private Active active;
}

Here, requestTime is a primitive value. ClientIdentifier is another class which represents a complex type. We also have employeeNames which is again a complex type. Active is an enum to describe whether the given list of employees is active or not.

Our objective is to serialize and deserialize the AvroHttpRequest class using Apache Avro.
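For reference, the two supporting types could look like the sketch below. The field names hostName and ipAddress and the symbols YES and NO mirror the schemas used throughout this tutorial; the constructor and getters are assumptions added for illustration.

```java
// Enum telling whether the given list of employees is active
enum Active {
    YES, NO
}

// Complex type identifying the client that sent the request
class ClientIdentifier {

    private final String hostName;
    private final String ipAddress;

    // Illustrative constructor; not part of the original class outline
    ClientIdentifier(String hostName, String ipAddress) {
        this.hostName = hostName;
        this.ipAddress = ipAddress;
    }

    String getHostName() {
        return hostName;
    }

    String getIpAddress() {
        return ipAddress;
    }
}
```

In the rest of the tutorial, Avro generates these classes from the schema, so we don't need to maintain hand-written versions.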

4. Avro Data Types

Before proceeding further, let’s discuss the data types supported by Avro.

Avro supports two types of data:

Primitive type: Avro supports eight primitive types: null, boolean, int, long, float, double, bytes and string. We use the primitive type name to define the type of a given field. For example, a value which holds a String should be declared as {"type": "string"} in the schema
Complex type: Avro supports six kinds of complex types: records, enums, arrays, maps, unions and fixed

For example, in our problem statement, ClientIdentifier is a record.

In that case, the schema for ClientIdentifier looks like this:

{
   "type":"record",
   "name":"ClientIdentifier",
   "namespace":"com.baeldung.avro.model",
   "fields":[
      {
         "name":"hostName",
         "type":"string"
      },
      {
         "name":"ipAddress",
         "type":"string"
      }
   ]
}
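A JSON definition like this can be loaded into a Schema object at runtime. The following sketch uses Schema.Parser for that and assumes the Avro library is on the classpath; the class name SchemaParsingSketch is hypothetical, and the JSON is embedded as a string only to keep the snippet self-contained.

```java
import org.apache.avro.Schema;

// Hypothetical helper class, for illustration only
class SchemaParsingSketch {

    // The ClientIdentifier definition from above, embedded as a string
    static final String CLIENT_IDENTIFIER_JSON = "{"
      + "\"type\":\"record\","
      + "\"name\":\"ClientIdentifier\","
      + "\"namespace\":\"com.baeldung.avro.model\","
      + "\"fields\":["
      + "{\"name\":\"hostName\",\"type\":\"string\"},"
      + "{\"name\":\"ipAddress\",\"type\":\"string\"}"
      + "]}";

    // Parses the JSON text into an Avro Schema object
    static Schema parseClientIdentifier() {
        return new Schema.Parser().parse(CLIENT_IDENTIFIER_JSON);
    }
}
```

The parsed schema exposes the record's full name through getFullName() and its fields through getFields(), which is handy for a quick sanity check.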

5. Using Avro

To start with, let’s add the Maven dependencies we’ll need to our pom.xml file.

We should include the following dependencies:

Apache Avro – core components
Compiler – Apache Avro Compilers for Avro IDL and Avro Specific Java API
Tools – which includes Apache Avro command line tools and utilities
Apache Avro Maven Plugin for Maven projects

We’re using version 1.8.2 for this tutorial.

However, it’s always advised to find the latest version on Maven Central:

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-compiler</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.8.2</version>
</dependency>

After adding the Maven dependencies, the next steps will be:

Schema creation
Reading the schema in our program
Serializing our data using Avro
Finally, deserializing the data

6. Schema Creation

Avro describes its schema using a JSON format. There are mainly four attributes for a given Avro schema:

type – describes the type of the schema, whether it’s a complex type or a primitive value
namespace – describes the namespace the given schema belongs to
name – the name of the schema
fields – describes the fields associated with the given schema. Fields can be of primitive as well as complex types.

One way of creating the schema is to write the JSON representation, as we saw in the previous sections.

We can also create a schema using SchemaBuilder, which offers a fluent, type-checked alternative to hand-written JSON.

6.1. SchemaBuilder Utility

The class org.apache.avro.SchemaBuilder is useful for creating the schema.

First of all, let’s create the schema for ClientIdentifier:

Schema clientIdentifier = SchemaBuilder.record("ClientIdentifier")
  .namespace("com.baeldung.avro.model")
  .fields()
  .requiredString("hostName")
  .requiredString("ipAddress")
  .endRecord();

Now, let’s use this for creating an avroHttpRequest schema:

Schema avroHttpRequest = SchemaBuilder.record("AvroHttpRequest")
  .namespace("com.baeldung.avro.model")
  .fields().requiredLong("requestTime")
  .name("clientIdentifier")
    .type(clientIdentifier)
    .noDefault()
  .name("employeeNames")
    .type()
    .array()
    .items()
    .stringType()
    .arrayDefault(null)
  .name("active")
    .type()
    .enumeration("Active")
    .symbols("YES","NO")
    .noDefault()
  .endRecord();

It’s important to note here that we’ve assigned clientIdentifier as the type for the clientIdentifier field. The clientIdentifier passed to type() is the Schema object we created earlier.

6.2. Using the Schema Object

As we have seen, we can utilize SchemaBuilder’s fluent API to generate an org.apache.avro.Schema object declaratively. After that, we can call the toString() method to get the JSON structure of the Schema.

Let’s verify that the Schema instance we created for the ClientIdentifier record generates the correct JSON. We can use a dedicated assertion library like JsonUnit for this:

@Test
void whenCallingSchemaToString_thenReturnJsonAvroSchema() {
    Schema clientIdSchema = clientIdentifierSchema();

    assertThatJson(clientIdSchema.toString())
      .isEqualTo("""
          {
             "type":"record",
             "name":"ClientIdentifier",
             "namespace":"com.baeldung.avro.model",
             "fields":[
                {
                   "name":"hostName",
                   "type":"string"
                },
                {
                   "name":"ipAddress",
                   "type":"string"
                }
             ]
          }
          """);
}

Needless to say, we can do the same to generate the Avro schema for the AvroHttpRequest record.

Then, we can save these generated schemas as .avsc files under src/main/resources. This allows us to use the files with the avro-maven-plugin later.
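Saving a schema can be as simple as writing its JSON to disk. Here is a minimal sketch, assuming Avro is on the classpath; the class SchemaFileSketch and its methods are hypothetical, and naming the file after the record is our own convention.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.avro.Schema;

// Hypothetical helper class, for illustration only
class SchemaFileSketch {

    // Writes the pretty-printed schema JSON to <directory>/<RecordName>.avsc
    static Path writeAvsc(Schema schema, Path directory) {
        try {
            Files.createDirectories(directory);
            Path target = directory.resolve(schema.getName() + ".avsc");
            Files.write(target, schema.toString(true).getBytes(StandardCharsets.UTF_8));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the file back into a Schema, for example to check that it is valid Avro
    static Schema readAvsc(Path file) {
        try {
            return new Schema.Parser().parse(file.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For our project, we would call it with the ClientIdentifier schema and the path src/main/resources, which produces ClientIdentifier.avsc.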

7. Reading the Schema

We can use the Schema instance to create org.apache.avro.generic.GenericRecord objects. This GenericRecord API allows us to store data in a schema-based format, without needing a predefined Java class.
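As a minimal sketch of the generic approach, the snippet below builds a ClientIdentifier record straight from the schema. It assumes Avro is on the classpath; the class GenericRecordSketch and its method names are hypothetical.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper class, for illustration only
class GenericRecordSketch {

    // Same ClientIdentifier schema we built with SchemaBuilder earlier
    static Schema clientIdentifierSchema() {
        return SchemaBuilder.record("ClientIdentifier")
          .namespace("com.baeldung.avro.model")
          .fields()
          .requiredString("hostName")
          .requiredString("ipAddress")
          .endRecord();
    }

    // Populates a schema-backed record by field name, with no generated class involved
    static GenericRecord newClientIdentifier(String hostName, String ipAddress) {
        GenericRecord record = new GenericData.Record(clientIdentifierSchema());
        record.put("hostName", hostName);
        record.put("ipAddress", ipAddress);
        return record;
    }
}
```

The trade-off is type safety: fields are addressed by name and returned as Object, so a mistyped field name only surfaces at runtime.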

However, the more popular approach is to use the .avsc schema files to generate Avro classes. Once the classes are generated, we can use them to serialize and deserialize objects. There are two ways to create Avro classes:

Programmatically generating Avro classes: Classes can be generated using SpecificCompiler from the avro-compiler module. There are a couple of APIs which we can use for generating Java classes. We can find the code for generating classes on GitHub.
Using a Maven plugin to generate classes

We can use the avro-maven-plugin to generate the Java classes based on the .avsc files. Let’s include the plugin in our pom.xml:

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    <executions>
        <execution>
            <id>schemas</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
                <goal>protocol</goal>
                <goal>idl-protocol</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>

Now, we can simply run “mvn clean install” and the plugin generates the Java classes based on our .avsc files, during the generate-sources phase.

8. Serialization and Deserialization With Avro

Now that we’re done with generating the schema, let’s continue exploring the serialization part.

There are two data serialization formats which Avro supports: JSON format and Binary format.

First, we’ll focus on the JSON format and then we’ll discuss the Binary format.

Before proceeding further, we should go through a few key interfaces. We can use the interfaces and classes below for serialization:

DatumWriter<T>: We use this to write data according to a given Schema. We’ll be using the SpecificDatumWriter implementation in our example; however, DatumWriter has other implementations as well: GenericDatumWriter, Json.Writer, ProtobufDatumWriter, ReflectDatumWriter and ThriftDatumWriter.

Encoder: Encoder is used for defining the format, as previously mentioned. EncoderFactory provides two types of encoders: a binary encoder and a JSON encoder.

DatumReader<D>: Single interface for deserialization. Again, it has multiple implementations, but we’ll be using SpecificDatumReader in our example. Other implementations are GenericDatumReader, Json.ObjectReader, Json.Reader, ProtobufDatumReader, ReflectDatumReader and ThriftDatumReader.

Decoder: Decoder is used while deserializing the data. DecoderFactory provides two types of decoders: a binary decoder and a JSON decoder.

Next, let’s see how serialization and de-serialization happen in Avro.

8.1. Serialization

We’ll take the example of AvroHttpRequest class and serialize it using Avro.

First of all, let’s serialize it in the JSON format:

public byte[] serializeAvroHttpRequestJSON(
  AvroHttpRequest request) {
 
    DatumWriter<AvroHttpRequest> writer = new SpecificDatumWriter<>(
      AvroHttpRequest.class);
    byte[] data = new byte[0];
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Encoder jsonEncoder = null;
    try {
        jsonEncoder = EncoderFactory.get().jsonEncoder(
          AvroHttpRequest.getClassSchema(), stream);
        writer.write(request, jsonEncoder);
        jsonEncoder.flush();
        data = stream.toByteArray();
    } catch (IOException e) {
        logger.error("Serialization error:" + e.getMessage());
    }
    return data;
}

Let’s have a look at a test case for this method:

@Test
public void givenJSONEncoder_whenSerialized_thenObjectGetsSerialized(){
    byte[] data = serializer.serializeAvroHttpRequestJSON(request);
    assertTrue(Objects.nonNull(data));
    assertTrue(data.length > 0);
}

Here we’ve used the jsonEncoder method and passed the schema to it.

If we want to use a binary encoder, we need to replace the jsonEncoder() method with binaryEncoder():

Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(stream, null);

8.2. Deserialization

To do this, we’ll be using the above-mentioned DatumReader and Decoder interfaces.

As we used EncoderFactory to get an Encoder, similarly we’ll use DecoderFactory to get a Decoder object.

Let’s de-serialize the data using JSON format:

public AvroHttpRequest deSerializeAvroHttpRequestJSON(byte[] data) {
    DatumReader<AvroHttpRequest> reader
     = new SpecificDatumReader<>(AvroHttpRequest.class);
    Decoder decoder = null;
    try {
        // the JSON encoder writes UTF-8, so we decode the bytes with the same charset
        decoder = DecoderFactory.get().jsonDecoder(
          AvroHttpRequest.getClassSchema(), new String(data, StandardCharsets.UTF_8));
        return reader.read(null, decoder);
    } catch (IOException e) {
        logger.error("Deserialization error:" + e.getMessage());
    }
    // reached only when deserialization failed
    return null;
}

And let’s see the test case:

@Test
public void givenJSONDecoder_whenDeserialize_thenActualAndExpectedObjectsAreEqual(){
    byte[] data = serializer.serializeAvroHttpRequestJSON(request);
    AvroHttpRequest actualRequest = deSerializer
      .deSerializeAvroHttpRequestJSON(data);
    assertEquals(request, actualRequest);
    assertEquals(request.getRequestTime(), actualRequest.getRequestTime());
}

Similarly, we can use a binary decoder:

Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
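Putting both binary calls together, a full round trip could look like the sketch below. Since the generated AvroHttpRequest class isn't available in a standalone snippet, it uses GenericDatumWriter and GenericDatumReader with the ClientIdentifier schema instead. The class BinaryRoundTripSketch and its method names are our own assumptions, and Avro must be on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Hypothetical helper class, for illustration only
class BinaryRoundTripSketch {

    static final Schema SCHEMA = SchemaBuilder.record("ClientIdentifier")
      .namespace("com.baeldung.avro.model")
      .fields()
      .requiredString("hostName")
      .requiredString("ipAddress")
      .endRecord();

    static GenericRecord newRecord(String hostName, String ipAddress) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("hostName", hostName);
        record.put("ipAddress", ipAddress);
        return record;
    }

    // Encodes the record in Avro's binary format
    static byte[] serialize(GenericRecord record) {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        try {
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(stream, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
            // the binary encoder buffers its output, so we flush before reading the stream
            encoder.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stream.toByteArray();
    }

    // Decodes the bytes using the same schema they were written with
    static GenericRecord deserialize(byte[] data) {
        try {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(data, null);
            return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One detail to keep in mind: by default, the generic reader returns string fields as Avro's Utf8 type rather than java.lang.String, so it's safest to compare them via toString().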

9. Conclusion

Apache Avro is especially useful when dealing with big data. It offers data serialization in binary as well as JSON format, and we can pick either one depending on the use case.

Avro serialization is fast and space-efficient. Avro doesn’t store field type information with each field; instead, it relies on the schema to describe the data.
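To see where the space savings come from, we can hand-encode a ClientIdentifier record following the binary encoding rules of the Avro specification: long values are written as zig-zag variable-length integers, a string is its UTF-8 length followed by the bytes, and a record is simply its fields in schema order. This is a sketch using only the JDK, not the library's own encoder; the class AvroWireFormatSketch and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of the Avro binary encoding rules (not Avro's implementation)
class AvroWireFormatSketch {

    // Writes a long as a zig-zag encoded variable-length integer
    static void writeLong(long value, ByteArrayOutputStream out) {
        // zig-zag maps small negative and positive numbers to small unsigned ones
        long zigZag = (value << 1) ^ (value >> 63);
        // emit 7 bits per byte, setting the high bit while more bytes follow
        while ((zigZag & ~0x7FL) != 0) {
            out.write((int) ((zigZag & 0x7F) | 0x80));
            zigZag >>>= 7;
        }
        out.write((int) zigZag);
    }

    // Writes a string as its UTF-8 byte length followed by the bytes themselves
    static void writeString(String value, ByteArrayOutputStream out) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeLong(utf8.length, out);
        out.write(utf8, 0, utf8.length);
    }

    // A record is just its fields in schema order: no field names, no type tags
    static byte[] encodeClientIdentifier(String hostName, String ipAddress) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(hostName, out);
        writeString(ipAddress, out);
        return out.toByteArray();
    }
}
```

For the host name "localhost" and the IP address "127.0.0.1", this yields 20 bytes: one length byte plus nine characters for each field.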

Last but not least, Avro has bindings for a wide range of programming languages, which gives it an edge.

The code backing this article is available on GitHub.