1. Introduction

The Vector API, an incubator API in the JDK, lets us express vector computations in Java on supported CPU architectures. It aims to deliver better performance on these computations than the equivalent scalar code.

In Java 19, a fourth round of incubation was proposed for the Vector API as part of JEP 426.

In this tutorial, we’ll explore the Vector API, its associated terminology, and how we can leverage it.

2. Scalars, Vectors, and Parallelism

Understanding the idea of scalars and vectors in CPU operations is important before diving deep into the Vector API.

2.1. Processing Units and CPU

A CPU uses a number of processing units to perform operations. A processing unit can compute only one value at a time. This value is called a scalar value, as it is just that, a single value. An operation can either be a unary operation, which operates on a single operand, or a binary operation, which operates on two. Incrementing a number by 1 is an example of a unary operation, whereas adding two numbers is a binary operation.

A processing unit takes a certain amount of time to perform these operations. We measure this time in cycles. A simple operation, such as adding two numbers, might take a single cycle, while a more complex one might take many cycles.

2.2. Parallelism

A conventional modern CPU has multiple cores, and each core houses multiple processing units capable of performing operations. This lets us execute operations on these processing units in parallel. With several threads running their programs on their own cores, we get parallel execution of operations.

When we have a massive calculation, such as adding up a huge number of values from a massive data source, we can split the data into smaller chunks and distribute them among several threads, hopefully getting faster processing. This is one of the ways to do parallel computing.
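The chunking idea can be sketched with a plain thread pool. This is only an illustration under stated assumptions: ChunkedSum and its sum() method are hypothetical names, and the fixed-size pool and equal-sized chunks are arbitrary choices rather than a recommended design:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: splits an array into chunks and sums each chunk on its own thread.
final class ChunkedSum {

    static long sum(int[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // ceiling division, so that at most 'threads' chunks are created
            int chunk = (data.length + threads - 1) / threads;
            List<Future<Long>> parts = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                Callable<Long> task = () -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) {
                        partial += data[i];
                    }
                    return partial;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until this chunk's partial sum is ready
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partial sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 3)); // prints 55
    }
}
```

Each thread still performs ordinary scalar additions here; the speed-up comes only from running several scalar loops at once.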

2.3. SIMD Processors

We can do parallel computing differently by using what is called a SIMD processor. SIMD stands for Single Instruction, Multiple Data. This kind of parallelism doesn’t rely on multithreading. Instead, SIMD processors rely on multiple processing units that perform the same operation at the same time, i.e. in a single CPU cycle. They share the program (instruction) that is executed but not the underlying data, hence the name: the operation is the same, but the operands differ.

Unlike a conventional processor, which loads one scalar value from memory at a time, a SIMD machine loads an array of values from memory into its registers before operating. The way SIMD hardware is organized enables the load operation of the array of values to occur in a single cycle. SIMD machines allow us to perform computations on arrays in parallel without actually relying on concurrent programming.

Since a SIMD machine sees memory as an array, or a range of values, we call such a range a vector, and any operation that a SIMD machine performs on it is a vector operation. Leveraging the principles of the SIMD architecture is therefore a powerful and efficient way to perform parallel processing tasks.

3. The Vector API

Now that we know what vectors are, let’s try to understand the basics of the Vector API that Java provides. A vector, in Java, is represented by the abstract class Vector<E>. Here, E is the boxed type of one of the scalar primitive integer types (byte, short, int, long) or floating-point types (float, double).

3.1. Shapes, Species, and Lanes

We only have a predefined space to store and work with a vector, which currently ranges from 64 to 512 bits. If we have a Vector of Integer values and 256 bits to store it, we’ll have 8 components in total, because the size of a primitive int value is 32 bits. These components are called lanes in the context of the Vector API.

The shape of a vector is its bit-wise size, i.e. the number of bits it holds. A vector with a shape of 512 bits has 16 int lanes and can operate on 16 ints at a time, while a 64-bit one has only 2. We use the term lane because of its similarity to how data flows in lanes within a SIMD machine.
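The lane arithmetic above is just a division of the shape by the element size. A tiny sketch makes that explicit; LaneMath and laneCount() are hypothetical names for illustration and are not part of the Vector API:

```java
// Hypothetical helper modelling how shape and element size determine the lane count.
final class LaneMath {

    // shapeBits: size of the vector in bits; elementBits: size of one element in bits
    static int laneCount(int shapeBits, int elementBits) {
        if (elementBits <= 0 || shapeBits % elementBits != 0) {
            throw new IllegalArgumentException("shape must be a multiple of the element size");
        }
        return shapeBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(laneCount(256, Integer.SIZE)); // prints 8
        System.out.println(laneCount(512, Integer.SIZE)); // prints 16
        System.out.println(laneCount(64, Integer.SIZE));  // prints 2
        System.out.println(laneCount(128, Double.SIZE));  // prints 2
    }
}
```

The same shape therefore yields a different number of lanes for each element type, which is why the element type matters alongside the shape.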

The species of the vector is the combination of the vector’s shape and datatype, such as int, float, etc. It is represented by VectorSpecies<E>. 

3.2. Lane Operations on Vectors

Vector operations fall broadly into two categories: lane-wise operations and cross-lane operations.

A lane-wise operation, as the name suggests, applies a scalar operation to each lane of one or more vectors independently. Such an operation can combine a lane of one vector with the corresponding lane of a second vector, for instance, during an add operation.

On the other hand, a cross-lane operation can compute or modify data from different lanes of a vector. Sorting the components of a vector is an example of a cross-lane operation. Cross-lane operations can produce scalars or vectors of different shapes from the source vectors. Cross-lane operations can be further classified into permutation and reduction operations.
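To make the distinction concrete, the following plain-Java sketch models lanes as array slots. It is a scalar mental model rather than Vector API code, and LaneOpsModel and its method names are hypothetical:

```java
import java.util.Arrays;

// Hypothetical scalar model: each array slot stands in for one lane of a vector.
final class LaneOpsModel {

    // Lane-wise: lane i of the result depends only on lane i of each input.
    static int[] laneWiseAdd(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            result[lane] = a[lane] + b[lane];
        }
        return result;
    }

    // Cross-lane permutation: values move between lanes (here, the lane order is reversed).
    static int[] reverseLanes(int[] a) {
        int[] result = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            result[lane] = a[a.length - 1 - lane];
        }
        return result;
    }

    // Cross-lane reduction: all lanes are combined into a single scalar.
    static int sumLanes(int[] a) {
        int sum = 0;
        for (int value : a) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        System.out.println(Arrays.toString(laneWiseAdd(a, b))); // prints [11, 22, 33, 44]
        System.out.println(Arrays.toString(reverseLanes(a)));   // prints [4, 3, 2, 1]
        System.out.println(sumLanes(a));                        // prints 10
    }
}
```

On SIMD hardware, the loops in this model collapse into vector instructions; the sketch only shows which lanes each kind of operation reads and writes.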

3.3. Hierarchy of the Vector<E> API

The Vector<E> class has six abstract subclasses, one for each of the six supported types: ByteVector, ShortVector, IntVector, LongVector, FloatVector, and DoubleVector. Because concrete implementations depend on the SIMD hardware, shape-specific subclasses further extend these classes for each type, for example, Int128Vector and Int512Vector.

4. Computations Using Vector API

Let’s finally look at some Vector API code. We’ll look at lane-wise and cross-lane operations in the upcoming sections.

4.1. Adding Two Arrays

We want to add two integer arrays and store the result in a third array. The traditional scalar way to do this would be:

public int[] addTwoScalarArrays(int[] arr1, int[] arr2) {
    int[] result = new int[arr1.length];
    for(int i = 0; i< arr1.length; i++) {
        result[i] = arr1[i] + arr2[i];
    }
    return result;
}

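Let’s now write the same code, the vector way. The Vector API classes are available in the jdk.incubator.vector package, which we need to import into our class. Because the API lives in an incubator module, we also need to pass --add-modules jdk.incubator.vector when compiling and running the code.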

Since we would be dealing with vectors, the very first thing we need to do is to create vectors from the two arrays. We use the fromArray() method of the Vector API for this step. This method requires us to provide the species of the vector that we want to create and the start offset of the array from where to begin the loading.

The offset would be 0 in our case, as we want to load the entire array from the start. We can use the default SPECIES_PREFERRED for our species, which uses the maximal bit size suitable for its platform:

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
var v1 = IntVector.fromArray(SPECIES, arr1, 0);
var v2 = IntVector.fromArray(SPECIES, arr2, 0);

Once we have the two vectors from the arrays, we call the add() method on one of them, passing the second vector:

var result = v1.add(v2);

Finally, we convert the vector result into an array and return:

public int[] addTwoVectorArrays(int[] arr1, int[] arr2) {
    var v1 = IntVector.fromArray(SPECIES, arr1, 0);
    var v2 = IntVector.fromArray(SPECIES, arr2, 0);
    var result = v1.add(v2);
    return result.toArray();
}

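Assuming the above code runs on a SIMD machine, the add operation adds all the lanes of the two vectors in the same CPU cycle. Note that fromArray() loads only SPECIES.length() elements, so this version handles just the first stride of each array and throws an IndexOutOfBoundsException if an array is shorter than that.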

4.2. VectorMasks

The code demonstrated above has its limitations. It works as intended only if the length of the input arrays matches the number of lanes of the vector the SIMD machine can handle. This introduces us to the idea of vector masks, represented by VectorMask<E>, which behave like an array of boolean values, one per lane. We use a VectorMask when the input data doesn’t fill an entire vector.

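A mask selects the lanes to which an operation is applied. The operation is applied to a lane if the corresponding mask value is true; if it’s false, a fallback applies instead. For example, a masked add() keeps the original lane value of the first vector, and a masked fromArray() fills the lane with zero.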
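As a rough mental model of a masked lane-wise operation, the plain-Java sketch below mimics a masked add on ordinary arrays. MaskModel and maskedAdd() are hypothetical names used only for illustration; the sketch assumes the behavior described in the Vector API Javadoc, where a lane whose mask value is unset retains the value from the first vector:

```java
import java.util.Arrays;

// Hypothetical scalar model of a masked lane-wise add; not Vector API code.
final class MaskModel {

    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] result = new int[a.length];
        for (int lane = 0; lane < a.length; lane++) {
            // set lane: apply the operation; unset lane: keep the first vector's value
            result[lane] = mask[lane] ? a[lane] + b[lane] : a[lane];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] mask = {true, true, false, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, mask))); // prints [11, 22, 3, 4]
    }
}
```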

These masks help us write operations that are independent of the vector shape and size. We can use the species’ length() method, which returns the number of lanes at runtime.

Here’s a slightly modified code with masks to help us iterate over the input arrays in strides of the vector length and then do a tail cleanup:

public int[] addTwoVectorsWithMasks(int[] arr1, int[] arr2) {
    int[] finalResult = new int[arr1.length];
    int i = 0;
    // vector loop: processes full strides of SPECIES.length() elements
    for (; i < SPECIES.loopBound(arr1.length); i += SPECIES.length()) {
        // lanes whose index falls outside [0, arr1.length) are switched off
        var mask = SPECIES.indexInRange(i, arr1.length);
        var v1 = IntVector.fromArray(SPECIES, arr1, i, mask);
        var v2 = IntVector.fromArray(SPECIES, arr2, i, mask);
        var result = v1.add(v2, mask);
        result.intoArray(finalResult, i, mask);
    }

    // tail cleanup loop: handles the elements left over after the last full stride
    for (; i < arr1.length; i++) {
        finalResult[i] = arr1[i] + arr2[i];
    }
    return finalResult;
}

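This code is now much safer to execute and runs independently of the shape of the vector. Because the vector loop stops at loopBound(), every mask it computes is fully set; the masks only become essential if we let the vector loop run up to arr1.length and drop the scalar tail loop.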
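To see what the two species helpers used in the loop compute, here’s a plain-Java model of loopBound() and indexInRange(). SpeciesModel is a hypothetical class; its methods take the lane count as an explicit parameter and only assume the documented behavior, namely that loopBound() returns the largest multiple of the lane count not exceeding the length, and that indexInRange() sets a lane only when its array index lies within [0, limit):

```java
import java.util.Arrays;

// Hypothetical scalar model of VectorSpecies.loopBound() and indexInRange().
final class SpeciesModel {

    // Largest multiple of 'lanes' that is less than or equal to 'length'.
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Lane n is set only if offset + n is a valid index in [0, limit).
    static boolean[] indexInRange(int offset, int limit, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            int index = offset + lane;
            mask[lane] = index >= 0 && index < limit;
        }
        return mask;
    }

    public static void main(String[] args) {
        // 10 elements and 8 lanes: one full stride, then 2 leftover elements
        System.out.println(loopBound(10, 8)); // prints 8
        System.out.println(Arrays.toString(indexInRange(8, 10, 8)));
        // prints [true, true, false, false, false, false, false, false]
    }
}
```

With 10 elements and 8 lanes, the vector loop covers indices 0 to 7. The last two elements are handled either by the scalar tail loop or by one more masked stride in which only the first two lanes are set.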

4.3. Computing the Norm of a Vector

In this section, we look at another simple mathematical calculation, the norm of two values. The norm is the value we get when we add the squares of two values and then take the square root of the sum.

Let’s see what the scalar operation looks like first:

public float[] scalarNormOfTwoArrays(float[] arr1, float[] arr2) {
    float[] finalResult = new float[arr1.length];
    for (int i = 0; i < arr1.length; i++) {
        finalResult[i] = (float) Math.sqrt(arr1[i] * arr1[i] + arr2[i] * arr2[i]);
    }
    return finalResult;
}

We’ll now try to write the vector alternative to the above code.

First, we obtain our preferred species of type FloatVector which is optimal in this scenario:

static final VectorSpecies<Float> PREFERRED_SPECIES = FloatVector.SPECIES_PREFERRED;

We’ll reuse the strided loop from the previous section. Our loop runs up to the loopBound value of the first array in strides of the species length. Since it stops at a whole number of strides, we don’t need a mask here. In each step, we load the float values into vectors and perform the same mathematical operation as in our scalar version.

Finally, we perform a tail clean-up with an ordinary scalar loop on the leftover elements. The final code is quite similar to our previous example:

public float[] vectorNormalForm(float[] arr1, float[] arr2) {
    float[] finalResult = new float[arr1.length];
    int i = 0;
    int upperBound = PREFERRED_SPECIES.loopBound(arr1.length);
    for (; i < upperBound; i += PREFERRED_SPECIES.length()) {
        var va = FloatVector.fromArray(PREFERRED_SPECIES, arr1, i);
        var vb = FloatVector.fromArray(PREFERRED_SPECIES, arr2, i);
        var vc = va.mul(va)
          .add(vb.mul(vb))
          .sqrt();
        vc.intoArray(finalResult, i);
    }
    
    // tail cleanup
    for (; i < arr1.length; i++) {
        finalResult[i] = (float) Math.sqrt(arr1[i] * arr1[i] + arr2[i] * arr2[i]);
    }
    return finalResult;
}

4.4. Reduction Operation

Reduction operations in the Vector API combine the lanes of a vector into a single result. They let us compute values such as the sum, maximum, or minimum of a vector’s elements, and from those, derived values such as the average.

The Vector API exposes reductions through a small set of methods that can take advantage of SIMD hardware:

  • reduceLanes(op): takes an associative operation from VectorOperators, such as ADD, MAX, or MIN, and combines all lanes of the vector into a single value
  • reduceLanes(op, mask): works like the above, but only combines the lanes that the mask selects
  • reduceLanesToLong(op): performs the same reduction and returns the result as a long
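A masked reduction can be modelled in plain Java as a fold over the selected lanes. ReduceModel is a hypothetical class, not Vector API code; it assumes the documented convention that unselected lanes contribute the operation’s identity value, for instance 0 for ADD and Integer.MIN_VALUE for MAX:

```java
import java.util.function.IntBinaryOperator;

// Hypothetical scalar model of a masked cross-lane reduction.
final class ReduceModel {

    static int reduceLanes(int[] lanes, boolean[] mask, int identity, IntBinaryOperator op) {
        int accumulator = identity;
        for (int lane = 0; lane < lanes.length; lane++) {
            if (mask[lane]) {
                // only lanes selected by the mask take part in the reduction
                accumulator = op.applyAsInt(accumulator, lanes[lane]);
            }
        }
        return accumulator;
    }

    public static void main(String[] args) {
        int[] lanes = {3, 9, 4, 7};
        boolean[] mask = {true, true, true, false};
        System.out.println(reduceLanes(lanes, mask, 0, Integer::sum));              // prints 16
        System.out.println(reduceLanes(lanes, mask, Integer.MIN_VALUE, Math::max)); // prints 9
    }
}
```

If the mask selects no lanes at all, the model returns the identity value unchanged.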

We’ll see an example to compute the average of a vector.

We can use reduceLanes(ADD) on each stride to accumulate the sum of all the elements and then perform a scalar division by the length of the array:

public double averageOfaVector(int[] arr) {
    double sum = 0;
    for (int i = 0; i< arr.length; i += SPECIES.length()) {
        var mask = SPECIES.indexInRange(i, arr.length);
        var V = IntVector.fromArray(SPECIES, arr, i, mask);
        sum += V.reduceLanes(VectorOperators.ADD, mask);
    }
    return sum / arr.length;
}

5. Caveats Associated With Vector API

While we can appreciate the Vector API’s benefits, we should take them with a pinch of salt. Firstly, this API is still in the incubation phase, so it may change between releases. There is, however, a plan to have vector classes declared as primitive classes.

As mentioned above, the Vector API has a hardware dependency as it relies on SIMD instructions. Many of its benefits may not be available on platforms and architectures that lack such instructions. Moreover, vectorized code always carries a maintenance overhead compared to traditional scalar code.

It is also difficult to perform benchmark comparisons of vector operations on generic hardware without knowing the underlying architecture. However, the JEP provides some guidance on doing this.

6. Conclusion

The benefits of using the Vector API, albeit cautiously, are tremendous. The performance gains and the simplified vectorization of operations benefit the graphics industry, large-scale computation, and many other fields. We looked at the important terminology associated with the Vector API and worked through several code examples.

As usual, all code samples can be found over on GitHub.
