In this article, we'll learn to use the Tablesaw library to work with tabular data. Firstly, we'll import some data. Then, we'll get some insight by playing with the data.
We'll use the avocado prices dataset. In short, it contains historical data on avocado prices and sales volume in multiple US markets.
2. Importing Data in Tablesaw
First of all, we need to import the data. Tablesaw supports various formats, including CSV, our dataset's format. So, let's start by loading the dataset from its CSV file:
CsvReadOptions csvReadOptions = CsvReadOptions.builder(file) .separator(',') .header(true) .dateFormat(formatter) .build(); table = Table.read().usingOptions(csvReadOptions);
Above, we create the CsvReadOptions class by passing the File object to the builder. Then, we describe how to read the CSV file by configuring the options object correctly.
First, we set the column separator using the separator() method. Second, we read the file's first line as a header. Third, we provide a DateTimeFormatter to parse the date and time correctly. Finally, we use the freshly created CsvReadOptions to read the table data.
2.1. Validate the Imported Data
Let's use the structure() method to check the table's design. It returns another table with column names, indexes, and data types:
Structure of avocado.csv Index | Column Name | Column Type | ------------------------------------------ 0 | C0 | INTEGER | 1 | Date | LOCAL_DATE | 2 | AveragePrice | DOUBLE | 3 | Total Volume | DOUBLE | ... | ... | ... |
Next, let's inspect its shape by using the shape() method:
assertThat(table.shape()).isEqualTo("avocado.csv: 18249 rows X 14 cols");
This method returns a String with the file name followed by the number of rows and columns. Our dataset contains 18249 rows of data and 14 columns in total.
3. Data Representation Inside Tablesaw
Tablesaw works primarily with tables and columns, which form the base of what is known as a data frame. In short, a table is a set of columns where each column has a fixed type. A row in the table is a set of values, with each value assigned to its matching column.
Tablesaw supports a variety of column types. In addition to the ones extending the primitive types in Java, it offers textual and temporal columns.
3.1. Textual Types
In Tablesaw, there are two two text types: TextColumn and StringColumn. The first is a general-purpose type that holds any text as it is. On the other hand, StringColumn encodes values in a dictionary-like data structure before storing them. This saves data efficiently instead of repeating values inside the column.
For example, in the avocado dataset, the region and the type columns are of type StringColumn. Their repeating values inside the column vector are stored more efficiently and point to the same instance of the text:
StringColumn type = table.stringColumn("type"); List<String> conventional = type.where(type.isEqualTo("conventional")).asList().stream() .limit(2) .toList(); assertThat(conventional.get(0)).isSameAs(conventional.get(1));
3.2. Temporal Types
There are four temporal types available inside Tablesaw. They map to their equivalent Java objects: DateColumn, DateTimeColumn, TimeColumn, and InstantColumn. As seen above, we can configure, when importing, how to parse these values.
4. Working With Columns
Next, let's look at how to work with the imported data and extract insights from it. For example, in Tablesaw, we can transform individual columns or work with the whole table.
4.1. Creating a New Column
Let's create a new column by calling the static method .create() defined on each type of available column. For example, to make a TimeColumn named time, we write:
TimeColumn time = TimeColumn.create("Time");
This column can then be added to a table using the .addColumns() method:
Table table = Table.create("test"); table.addColumns(time); assertThat(table.columnNames()).contains("time");
4.2. Adding or Modifying Columns Data
Let's add data to the end of a column using the .append() method:
DoubleColumn averagePrice = table.doubleColumn("AveragePrice"); averagePrice.append(1.123); assertThat(averagePrice.get(averagePrice.size() - 1)).isEqualTo(1.123);
For tables, we must provide a value for each column to ensure all columns have at least a value. Otherwise, it will throw an IllegalArgumentException when creating tables with columns having different sizes:
DoubleColumn averagePrice2 = table.doubleColumn("AveragePrice").copy(); averagePrice2.setName("AveragePrice2"); averagePrice2.append(1.123); assertThatExceptionOfType(IllegalArgumentException.class).isThrownBy(() -> table.addColumns(averagePrice2));
We use the .set() method to change a specific value inside the column vector. To use it, we must know the index of the value we want to change:
Removing data from a column can be problematic, especially in the case of tables. So, Tablesaw doesn't allow removing values from the column vector. Instead, let's flag values we wish to remove as missing using the .setMissing() and pass the index of each value to this method:
DoubleColumn averagePrice = table.doubleColumn("AveragePrice").setMissing(0); assertThat(averagePrice.get(0)).isNull();
As a result, it doesn't remove the value holder from the vector but sets it to null. Consequently, the size of the vector remains the same.
5. Sorting the Data
Next, let's sort the data we imported previously. To begin with, we'll sort our table rows according to a set of columns. To do this, we use the .sortAscending() and .sortDescending() methods, which accept the names of columns. Let's sort to get the oldest and the most recent date present in our dataset:
Table ascendingDateSortedTable = table.sortAscendingOn("Date"); assertThat(ascendingDateSortedTable.dateColumn("Date").get(0)).isEqualTo(LocalDate.parse("2015-01-04")); Table descendingDateSortedTable = table.sortDescendingOn("Date"); assertThat(descendingDateSortedTable.dateColumn("Date").get(0)).isEqualTo(LocalDate.parse("2018-03-25"));
However, these methods are very limiting. For example, we cannot mix the ascending and descending sorting orders. To work around these limitations, we use the .sortOn() method. It accepts a set of column names and orders them by default. To sort a specific column in descending order, we preface the column name with a minus sign “-“. For example, let's sort the data by year and with the highest average price in descending order:
Table ascendingYearAndAveragePriceSortedTable = table.sortOn("year", "-AveragePrice"); assertThat(ascendingYearAndAveragePriceSortedTable.intColumn("year").get(0)).isEqualTo(2015); assertThat(ascendingYearAndAveragePriceSortedTable.numberColumn("AveragePrice").get(0)).isEqualTo(2.79);
These methods don't fit all use cases. For such cases, Tablesaw accepts a custom implementation of Comparator<VRow> for the .sortOn() method.
6. Filtering the Data
Filters allow us to get a subset of data from the original table. Filtering a table returns another table, and we use the .where() and .dropWhere() methods to apply filters. The first method will return values or rows matching the criteria we specified. And the second method, instead, will drop them.
To specify the filtering criteria, we first need to understand Selections.
A Selection is a logical bitmap. In other words, it's an array containing boolean values that mask values on a column vector. For example, applying the selection to a column will yield another column containing the filtered values — for instance, dropping values where the mask is 0 for a given index. Additionally, the selection vector will be the same size as its original column.
Let's put this into practice by getting a table of the data from 2017 with an average price above $2 only:
DateColumn dateTable = table.dateColumn("Date"); DoubleColumn averagePrice = table.doubleColumn("AveragePrice"); Selection selection = dateTable.isInYear(2017).and(averagePrice.isGreaterThan(2D)); Table table2017 = table.where(selection); assertThat(table2017.intColumn("year")).containsOnly(2017); assertThat(table2017.doubleColumn("AveragePrice")).allMatch(avrgPrice -> avrgPrice > 2D);
Above, we used the method .isInYear() defined on the DateColumn and .isGreaterThan() defined on the DoubleColumn. We combined them in a query-like language with the .and() method. Tablesaw offers many such built-in helper methods. Therefore, we rarely have to build a custom selection ourselves for simple tasks. For complex tasks, we combine them using .and(), .andNot(), or(), and other column filters.
Alternatively, we write custom filters by creating a Predicate and passing it to the .eval() method available on each column. This method returns a Selection object we use to filter our tables or columns.
7. Summarizing Data
After working with the data, we want to extract some insight from it. We use the .summarize() method to aggregate data to learn about it. For example, from the avocado dataset, let's extract the min, max, mean, and standard deviation of the average price:
Table summary = table.summarize("AveragePrice", max, min, mean, stdDev).by("year"); System.out.println(summary.print());
First, we pass the column name we want to aggregate and the list of AggregateFunction to the .summarize() method. Next, we group the result by each year using the .by() method. Finally, we print the result on the standard output:
avocado.csv summary year | Mean [AveragePrice] | Max [AveragePrice] | Min [AveragePrice] | Std. Deviation [AveragePrice] | ---------------------------------------------------------------------------------------------------------------- 2015 | 1.375590382902939 | 2.79 | 0.49 | 0.37559477067238917 | 2016 | 1.3386396011396013 | 3.25 | 0.51 | 0.39370799476072077 | 2017 | 1.5151275777700104 | 3.17 | 0.44 | 0.4329056466203253 | 2018 | 1.3475308641975308 | 2.3 | 0.56 | 0.3058577391135024 |
Tablesaw provides AggregateFunction for most common operations. Alternatively, we can implement a custom AggregateFunction object, but since that's beyond the scope of this article, we'll keep things simple.
8. Saving the Data
Until now, we've been printing the data to the standard output. Printing to the console is good when validating our results on the fly, but we need to save our data to files to make the results reusable by others. So, let's use the .write() method directly on the table:
Above, we used the .csv() method to save the data in CSV format. Currently, Tablesaw supports only the CSV format and the fixed-width format, which is similar to what the .print() method displays on the console. Additionally, we use CsvWriterOptions to customize the CSV output of our data.
In this article, we explored working with tabular data using the Tablesaw library.
First, we explained how to import data. Then, we described the internal representation of the data and how to work with it. Next, we explored modifying the structure of the imported table and creating filters to extract the necessary data before aggregating it. Finally, we saved it to a CSV file.
As always, the full source code is available over on GitHub.