Persistence top

I just announced the new Learn Spring course, focused on the fundamentals of Spring 5 and Spring Boot 2:

>> CHECK OUT THE COURSE

1. Overview

In this tutorial, we'll take a dive into the MongoDB Aggregation framework using the MongoDB Java driver.

We'll first look at what aggregation means conceptually, and then set up a dataset. Finally, we'll see various aggregation techniques in action using Aggregates builder.

2. What are Aggregations?

Aggregations are used in MongoDB to analyze data and derive meaningful information out of it.

These are usually performed in various stages, and the stages form a pipeline – such that the output of one stage is passed on as input to the next stage.

The most commonly used stages can be summarized as:

Stage SQL Equivalent Description
 project SELECT selects only the required fields, can also be used to compute and add derived fields to the collection
 match WHERE filters the collection as per specified criteria
 group GROUP BY gathers input together as per the specified criteria (e.g. count, sum) to return a document for each distinct grouping
 sort ORDER BY sorts the results in ascending or descending order of a given field
 count COUNT counts the documents the collection contains
 limit LIMIT limits the result to a specified number of documents, instead of returning the entire collection
 out SELECT INTO NEW_TABLE writes the result to a named collection; this stage is only acceptable as the last in a pipeline


The SQL Equivalent for each aggregation stage is included above to give us an idea of what the said operation means in the SQL world.

We'll look at Java code samples for all of these stages shortly. But before that, we need a database.

3. Database Setup

3.1. Dataset

The first and foremost requirement for learning anything database-related is the dataset itself!

For the purpose of this tutorial, we'll use a publicly available restful API endpoint that provides comprehensive information about all the countries of the world. This API gives us a lot of data points for a country in a convenient JSON format. Some of the fields that we'll be using in our analysis are:

  • name – the name of the country; for example, United States of America
  • alpha3Code – a shortcode for the country name; for example, IND (for India)
  • region – the region the country belongs to; for example, Europe
  • area – the geographical area of the country
  • languages – official languages of the country in an array format; for example, English
  • borders – an array of neighboring countries' alpha3Codes

Now let's see how to convert this data into a collection in a MongoDB database.

3.2. Importing to MongoDB

First, we need to hit the API endpoint to get all countries and save the response locally in a JSON file. The next step is to import it into MongoDB using the mongoimport command:

mongoimport.exe --db <db_name> --collection <collection_name> --file <path_to_file> --jsonArray

Successful import should give us a collection with 250 documents.

4. Aggregation Samples in Java

Now that we have the bases covered, let's get into deriving some meaningful insights from the data we have for all the countries. We'll use several JUnit tests for this purpose.

But before we do that, we need to make a connection to the database:

@BeforeClass
public static void setUpDB() throws IOException {
    mongoClient = MongoClients.create();
    database = mongoClient.getDatabase(DATABASE);
    collection = database.getCollection(COLLECTION);
}

In all the examples that follow, we'll be using the Aggregates helper class provided by the MongoDB Java driver.

For better readability of our snippets, we can add a static import:

import static com.mongodb.client.model.Aggregates.*;

4.1. match and count

To begin with, let's start with something simple. Earlier we noted that the dataset contains information about languages.

Now, let's say we want to check the number of countries in the world where English is an official language:

@Test
public void givenCountryCollection_whenEnglishSpeakingCountriesCounted_thenNinetyOne() {
    Document englishSpeakingCountries = collection.aggregate(Arrays.asList(
      match(Filters.eq("languages.name", "English")),
      count())).first();
    
    assertEquals(91, englishSpeakingCountries.get("count"));
}

Here we are using two stages in our aggregation pipeline: match and count.

First, we filter out the collection to match only those documents that contain English in their languages field. These documents can be imagined as a temporary or intermediate collection that becomes the input for our next stage, count. This counts the number of documents in the previous stage.

Another point to note in this sample is the use of the method first. Since we know that the output of the last stage, count, is going to be a single record, this is a guaranteed way to extract out the lone resulting document.

4.2. group (with sum) and sort

In this example, our objective is to find out the geographical region containing the maximum number of countries:

@Test
public void givenCountryCollection_whenCountedRegionWise_thenMaxInAfrica() {
    Document maxCountriedRegion = collection.aggregate(Arrays.asList(
      group("$region", Accumulators.sum("tally", 1)),
      sort(Sorts.descending("tally")))).first();
    
    assertTrue(maxCountriedRegion.containsValue("Africa"));
}

As is evident, we are using group and sort to achieve our objective here.

First, we gather the number of countries in each region by accumulating a sum of their occurrences in a variable tally. This gives us an intermediate collection of documents, each containing two fields: the region and the tally of countries in it.  Then we sort it in the descending order and extract the first document to give us the region with maximum countries.

4.3. sort, limit, and out

Now let's use sort, limit and out to extract the seven largest countries area-wise and write them into a new collection:

@Test
public void givenCountryCollection_whenAreaSortedDescending_thenSuccess() {
    collection.aggregate(Arrays.asList(
      sort(Sorts.descending("area")), 
      limit(7),
      out("largest_seven"))).toCollection();

    MongoCollection<Document> largestSeven = database.getCollection("largest_seven");

    assertEquals(7, largestSeven.countDocuments());

    Document usa = largestSeven.find(Filters.eq("alpha3Code", "USA")).first();

    assertNotNull(usa);
}

Here, we first sorted the given collection in the descending order of area. Then, we used the Aggregates#limit method to restrict the result to seven documents only. Finally, we used the out stage to deserialize this data into a new collection called largest_seven. This collection can now be used in the same way as any other – for example, to find if it contains USA.

4.4. project, group (with max), match

In our last sample, let's try something trickier. Say we need to find out how many borders each country shares with others, and what is the maximum such number.

Now in our dataset, we have a borders field, which is an array listing alpha3Codes for all bordering countries of the nation, but there isn't any field directly giving us the count. So we'll need to derive the number of borderingCountries using project:

@Test
public void givenCountryCollection_whenNeighborsCalculated_thenMaxIsFifteenInChina() {
    Bson borderingCountriesCollection = project(Projections.fields(Projections.excludeId(), 
      Projections.include("name"), Projections.computed("borderingCountries", 
        Projections.computed("$size", "$borders"))));
    
    int maxValue = collection.aggregate(Arrays.asList(borderingCountriesCollection, 
      group(null, Accumulators.max("max", "$borderingCountries"))))
      .first().getInteger("max");

    assertEquals(15, maxValue);

    Document maxNeighboredCountry = collection.aggregate(Arrays.asList(borderingCountriesCollection,
      match(Filters.eq("borderingCountries", maxValue)))).first();
       
    assertTrue(maxNeighboredCountry.containsValue("China"));
}

After that, as we saw before, we'll group the projected collection to find the max value of borderingCountries. One thing to point out here is that the max accumulator gives us the maximum value as a number, not the entire Document containing the maximum value. We need to perform match to filter out the desired Document if any further operations are to be performed.

5. Conclusion

In this article, we saw what are MongoDB aggregations, and how to apply them in Java using an example dataset.

We used four samples to illustrate the various aggregation stages to form a basic understanding of the concept. There are umpteen possibilities for data analytics that this framework offers which can be explored further.

For further reading, Spring Data MongoDB provides an alternative way to handle projections and aggregations in Java.

As always, source code is available over on GitHub.

Persistence bottom

I just announced the new Learn Spring course, focused on the fundamentals of Spring 5 and Spring Boot 2:

>> CHECK OUT THE COURSE
Comments are closed on this article!