1. Introduction

Organizations produce massive amounts of sensitive data daily and need to manage everything from personal information and financial records to classified documents and cybersecurity logs. Traditional databases often struggle with both the volume of Big Data and the complex security requirements of modern enterprises.

As secure data management becomes more strategic and many organizations require fine-grained access control down to the cell level, we need a database system that can handle massive datasets while enforcing strict security controls, at the scale of petabytes of data and billions of individual access decisions.

In this introductory article, we’ll explore Apache Accumulo, a powerful distributed key-value store with unparalleled cell-level security, high performance, and scalability.

2. What Is Apache Accumulo?

Apache Accumulo, originally developed by the National Security Agency (NSA) based on Google’s Bigtable design, is a distributed key-value store.

Built on top of Apache Hadoop and Apache ZooKeeper, it’s designed to handle massive data volumes across clusters of commodity hardware.

Accumulo enables efficient data ingestion, retrieval, and storage. It also provides server-side programming to allow complex data processing directly within the database, making it a sophisticated solution with fine-grained access control to handle sensitive big data.

The key features of Apache Accumulo are the following:

  • Scalability: can manage petabytes of data across large clusters
  • High Performance: uses in-memory processing and optimizations for efficient data access
  • Cell-Level Security: allows fine-grained access control, where each cell can have a unique visibility label
  • Rich API for Customization: offers features like iterators for in-database processing

Similar to Google’s Bigtable, which is utilized in web indexing, Google Earth, and Google Finance, Apache Accumulo is useful in a variety of applications, including but not limited to:

  • Government and military data systems
  • Healthcare record management
  • Financial services data
  • Cybersecurity analytics
  • Large-scale graph processing

3. Installation and Setup

First, let’s make sure that prerequisites like Java 11, Apache Hadoop, YARN, and Apache ZooKeeper are installed, with the corresponding JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME environment variables set.

Then, we’ll download the latest version of Apache Accumulo and extract it:

$ tar -xzf accumulo-2.1.3-bin.tar.gz

Likewise, we can set ACCUMULO_HOME and add its bin directory to the PATH:

$ export ACCUMULO_HOME=/path/to/accumulo
$ export PATH=$ACCUMULO_HOME/bin:$PATH

Next, we start the ZooKeeper, Hadoop HDFS, and YARN services, in that order:

$ zkServer start
$ start-dfs.sh
$ start-yarn.sh

Also, we need to make sure that HDFS is listening on localhost:8020 and that the ZooKeeper host is set to localhost:2181, since these are the defaults in accumulo.properties.
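
For reference, the relevant entries in conf/accumulo.properties look roughly like this (the exact HDFS path below is an assumption based on a default local setup):

instance.volumes=hdfs://localhost:8020/accumulo
instance.zookeeper.host=localhost:2181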

Let’s confirm everything is running using the jps command, which should show output similar to:

82306 Main
81385 DataNode
81745 ResourceManager
82867 Jps
81846 NodeManager
81530 SecondaryNameNode
68474 ResourceManager
81276 NameNode

Now, we’re ready to initialize Accumulo, which sets up its storage in HDFS and its coordination state in ZooKeeper:

$ accumulo init

The init command is required only once and prompts for instance name and root password.

Then, we’ll create additional configuration files required to start the cluster:

$ accumulo-cluster create-config

Finally, we’re ready to start the cluster:

$ accumulo-cluster start

Once started, we can run the Accumulo shell – a command-line tool for interacting with Apache Accumulo:

$ accumulo shell -u root

Note: This command asks us to set the instance name and password in accumulo-client.properties.
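
As a rough sketch, the resulting accumulo-client.properties contains entries along these lines (the values are illustrative):

instance.name=myAccumuloInstance
instance.zookeepers=localhost:2181
auth.type=password
auth.principal=root
auth.token=<root password>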

Accumulo Shell provides basic commands to manage, query, and perform administrative tasks on tables and instances.

Let’s take a look at a few of the handiest commands:

  • tables: lists all tables in the instance
  • createtable <table>: creates a new table
  • deletetable <table>: deletes a table
  • scan: scans and displays data from the current table
  • insert <row> <colfam> <colqual> <value>: inserts a value into the table
  • delete <row> <colfam> <colqual>: deletes a specific entry from the table
  • setiter -t <table>: sets a table-specific iterator
  • listiter [-scan | -table]: lists the iterators for a scanner or a table
  • createuser <username>: creates a new user
  • info: displays system information about the Accumulo instance
  • config: views or changes configuration settings
  • flush <table>: forces a flush of memory to disk for a table
  • compact <table>: compacts the table’s data
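
For example, a quick session that creates a table, inserts a cell, and scans it back might look like this (the instance name, table name, and values are illustrative):

root@myAccumuloInstance> createtable demo
root@myAccumuloInstance demo> insert row1 colfam1 colqual1 value1
root@myAccumuloInstance demo> scan
row1 colfam1:colqual1 []    value1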

4. Data Model

The Accumulo data model is similar to Google’s Bigtable, providing a sparse, distributed, persistent multi-dimensional sorted map.

Specifically, each key in Accumulo consists of three components, which together uniquely identify each stored value:

  • Row ID: The primary identifier for a row of data, used for lexicographical sorting of data
  • Column:
    • Family: Columns are grouped into families, which act as categories or namespaces for the data. Column families provide a way to organize related data.
    • Qualifier: Within each column family, individual columns are identified by a column qualifier. This allows for fine-grained differentiation of data within a column family.
    • Visibility: Each key-value pair can be associated with a security label or visibility. This allows for cell-level access control, where users must have the appropriate authorizations to read the data.
  • Timestamp: A version number associated with each key-value pair, allowing Accumulo to store multiple versions of the same data
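
To make these components concrete, here’s a minimal sketch of how a single cell maps onto the Key and Value classes of the Java API we’ll use later (the row, family, qualifier, and visibility values are purely illustrative):

// one cell: Row ID, Column Family, Column Qualifier, Visibility, and Timestamp
Key key = new Key(
  new Text("patient-1001"),      // Row ID
  new Text("diagnosis"),         // Column Family
  new Text("2024-visit"),        // Column Qualifier
  new Text("doctor|admin"),      // Visibility expression
  System.currentTimeMillis());   // Timestamp
Value value = new Value("hypertension".getBytes(StandardCharsets.UTF_8));

Here, Text comes from Hadoop (org.apache.hadoop.io.Text), while Key and Value live in org.apache.accumulo.core.data; a user needs the doctor or admin authorization to read this cell.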

Overall, the Accumulo data model provides a flexible and secure framework for managing large-scale, structured datasets with intricate security needs.

Its use of row IDs, column families, and qualifiers enables robust data organization and querying, while cell-level visibility controls ensure the protection of sensitive information.

5. Operations and Features

5.1. Basic Table Operations

Accumulo offers robust capabilities for managing tables. We can create new tables as needed, clone existing tables for testing or development purposes, and split large tables into smaller tablets for performance optimization.

Additionally, tables can be merged to consolidate data and improve query efficiency. Accumulo also supports flexible data import and export operations, enabling seamless data migration and integration with other systems.
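
As a rough sketch of how these operations look against the Java API we’ll set up in section 6 (the table names and split points below are hypothetical):

TableOperations ops = client.tableOperations();

// clone a table, e.g., to experiment without touching the original data
ops.clone("orders", "orders_test", true, Collections.emptyMap(), Collections.emptySet());

// pre-split the table at chosen row boundaries to spread tablets across servers
SortedSet<Text> splits = new TreeSet<>(List.of(new Text("m"), new Text("t")));
ops.addSplits("orders", splits);

// merge the tablets in a row range back together
ops.merge("orders", new Text("a"), new Text("z"));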

5.2. Data Handling

Accumulo provides fundamental data manipulation to create, update, and delete data. For efficient handling of large datasets, Accumulo offers batch operations, allowing for the bulk processing of data.

Furthermore, range-based scans enable efficient retrieval of specific data subsets, optimizing query performance.

5.3. Security Features

Accumulo provides cell-level security by setting security labels for every piece of data. We can set up complex security rules using boolean expressions and manage user access to enforce fine-grained authorization policies.
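
As an illustration (the user name, authorizations, and visibility expression are hypothetical, and we assume an AccumuloClient like the one we build in section 6), we can grant a user a set of authorizations and write a cell guarded by a boolean visibility expression:

// grant the authorizations the user is allowed to scan with
client.securityOperations()
  .changeUserAuthorizations("analyst", new Authorizations("finance", "audit"));

// this cell is readable only by principals holding (finance AND audit), or admin
Mutation mutation = new Mutation("report-2024");
mutation.at()
  .family("summary")
  .qualifier("q4")
  .visibility("(finance&audit)|admin")
  .put("classified totals");

The mutation is then written through a BatchWriter, exactly as we’ll see in section 6.3.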

5.4. Iterator Framework

Accumulo provides powerful Iterators that act as on-the-spot data processors, working directly where the data resides. They handle filtering, aggregating, and transforming data on the server itself, so we don’t need to send large amounts of raw data over the network.

This results in faster query processing, greater efficiency, and reduced network traffic.
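
For instance, we can attach one of the built-in filtering iterators to a table so the filtering happens server-side; here’s a minimal sketch using the RegExFilter (the table name, priority, and pattern are illustrative):

// priority 25, iterator name "rowFilter", backed by the built-in RegExFilter
IteratorSetting setting = new IteratorSetting(25, "rowFilter", RegExFilter.class);

// keep only entries whose row ID starts with "user"
RegExFilter.setRegexs(setting, "user.*", null, null, null, false);

// attach the iterator to the table so scans and compactions apply it on the server
client.tableOperations().attachIterator("profiles", setting);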

5.5. Performance Optimizations

Accumulo incorporates various performance optimizations like write-ahead logging, memory-based writing, Bloom filters, and Locality groups to ensure efficient data storage and retrieval.

Write-ahead logging guarantees data durability, while memory-based writing accelerates data ingestion. Bloom filters enable fast lookups, reducing the need for full table scans. Locality groups optimize data placement, improving read and write performance.
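
Most of these optimizations are exposed as table-level settings; as a rough sketch (the table and column family names are illustrative), we can enable Bloom filters and define a locality group through the same tableOperations() API:

// enable Bloom filters on the table for faster key lookups
client.tableOperations().setProperty("profiles", "table.bloom.enabled", "true");

// store the "meta" and "audit" column families together on disk as one locality group
Map<String, Set<Text>> groups = Map.of("metadata", Set.of(new Text("meta"), new Text("audit")));
client.tableOperations().setLocalityGroups("profiles", groups);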

5.6. Scaling and Distribution

Accumulo automatically splits tablets and balances the load when more data is added, and integrating new machines into the cluster is as simple as pointing them to it. The system manages data distribution smoothly as the data expands.

5.7. Real-Time Insights

Accumulo provides real-time insights by allowing us to monitor performance metrics, track resource usage, and detect issues as they arise.

With its efficient data processing capabilities and integration with monitoring tools, we can quickly respond to changes and ensure optimal system performance.

5.8. Administration

Accumulo offers robust administrative capabilities, including reliable backup and recovery mechanisms, intelligent data compaction strategies, and flexible system configuration options.

It also provides benefits like comprehensive user management and resource control, ensuring secure access and optimal performance.

6. Accumulo Clients

Now that we’ve covered Accumulo’s installation process, data model, operations, and features, let’s explore the Accumulo client for interacting with Accumulo through the Java API.

The Accumulo Client API allows us to perform administrative tasks, query data, and manage tables programmatically.

6.1. Maven Dependency

First, let’s add the latest accumulo-core Maven dependency to our pom.xml:

<dependency>
    <groupId>org.apache.accumulo</groupId>
    <artifactId>accumulo-core</artifactId>
    <version>2.1.3</version>
</dependency>

This dependency adds the necessary classes and methods to work with Accumulo.

6.2. Create the Accumulo Client

Next, let’s create a client to interact with Accumulo:

AccumuloClient client = Accumulo.newClient()
  .to("accumuloInstanceName", "localhost:2181")
  .as("username", "password").build();

We’ve used the builder to initialize the connection by specifying the Accumulo instance name, the ZooKeeper host details, and the username and password of the Accumulo instance.

6.3. Basic Operations

Next, with the client set up, let’s perform the basic operation of creating a table:

client.tableOperations().create(tableName);

Then, to add data to the table, we can use the BatchWriter class that offers high-performance, batch-oriented writes:

try (BatchWriter writer = client.createBatchWriter(tableName, new BatchWriterConfig())) {
    Mutation mutation1 = new Mutation("row1");
    mutation1.at()
      .family("column family 1")
      .qualifier("column family 1 qualifier 1")
      .visibility("public").put("value 1");

    Mutation mutation2 = new Mutation("row2");
    mutation2.at()
      .family("column family 1")
      .qualifier("column family 1 qualifier 2")
      .visibility("private").put("value 2");

    writer.addMutation(mutation1);
    writer.addMutation(mutation2);
}

Here, each entry is represented by a Mutation object that accepts column information such as the family, qualifier, and visibility, as discussed earlier in the data model.

Similarly, let’s retrieve data from the table using the Scanner class:

try (var scanner = client.createScanner(tableName, new Authorizations("public"))) {
    scanner.setRange(new Range());
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}

Here, we iterate over the entries within the specified range (an empty Range scans the entire table), and we pass the public authorization so that only publicly visible data is fetched.
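
Finally, once we’re done, we can drop the table and release the client’s resources; AccumuloClient is AutoCloseable, so it can also be created in a try-with-resources block:

client.tableOperations().delete(tableName);
client.close();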

7. Conclusion

In this tutorial, we’ve discussed Apache Accumulo, a versatile, scalable database that excels in handling massive datasets with complex access requirements.

Its unique features, such as cell-level security, iterators, and flexible data models, make it an excellent choice for applications requiring secure and efficient data management for real-time analytics, secure data processing, or large-scale data storage.

First, we explored the installation and setup steps. Then, we examined its unique data model. Finally, we familiarized ourselves with the available operations and features.

The code backing this article is available on GitHub. Once you're logged in as a Baeldung Pro Member, start learning and coding on the project.