
1. Introduction

In this tutorial, we’ll show how to share memory between two or more JVMs running on the same machine. This capability enables very fast inter-process communication since we can move data blocks around without any I/O operation.

2. How Shared Memory Works

A process running in any modern operating system gets what’s called a virtual memory space. We call it virtual because, although it looks like a large, contiguous, and private addressable memory space, it’s actually made of pages spread all over the physical RAM. Here, page is just OS slang for a block of contiguous memory whose size depends on the particular CPU architecture in use. For x86-64, a page can be as small as 4 KB or as large as 1 GB.

At a given time, only a fraction of this virtual space is actually mapped to physical pages. As time passes and the process starts to consume more memory for its tasks, the OS starts to allocate more physical pages and map them to the virtual space. When the demand for memory exceeds what’s physically available, the OS will start to swap out pages that are not being used at that moment to secondary storage to make room for the request.

A shared memory block behaves just like regular memory, but, in contrast with regular memory, it is not private to a single process. When a process changes the contents of any byte within this block, any other process with access to this same shared memory “sees” this change instantly.

This is a list of common uses for shared memory:

  • Debuggers (ever wondered how a debugger can inspect variables in another process?)
  • Inter-process communication
  • Read-only content sharing between processes (ex: dynamic library code)
  • Hacks of all sorts ;^)

3. Shared Memory and Memory-Mapped Files

A memory-mapped file, as the name suggests, is a regular file whose contents are directly mapped to a contiguous area in the virtual memory of a process. This means that we can read and/or change its contents without explicit use of I/O operations. The OS will detect any writes to the mapped area and will schedule a background I/O operation to persist the modified data.

Since there are no guarantees on when this background operation will happen, the OS also offers a system call to flush any pending changes. This is important for use cases like database redo logs, but not needed for our inter-process communication (IPC, for short) scenario.
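In Java, this explicit flush is exposed as the MappedByteBuffer’s force() method. Here’s a minimal sketch (the file name flush_demo.dat is just for illustration):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("flush_demo.dat");
        try (FileChannel fc = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Mapping beyond the current file size grows the file to 1024 bytes
            MappedByteBuffer buf = fc.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            buf.put(0, (byte) 42); // a plain memory write, no explicit I/O
            buf.force();           // ask the OS to persist pending changes now
        }
    }
}
```

We won’t need force() in the IPC scenario below, since both processes read the mapped pages directly rather than the on-disk copy.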

Memory-mapped files are commonly used by database servers to achieve high throughput I/O operations, but we can also use them to bootstrap a shared-memory-based IPC mechanism. The basic idea is that all processes that need to share data map the same file and, voilà, they now have a shared memory area.

4. Creating Memory-Mapped Files in Java

In Java, we use the FileChannel’s map() method to map a region of a file into memory, which returns a MappedByteBuffer that allows us to access its contents:

MappedByteBuffer createSharedMemory(String path, long size) {

    try (FileChannel fc = (FileChannel)Files.newByteChannel(new File(path).toPath(),
      EnumSet.of(
        StandardOpenOption.CREATE,
        StandardOpenOption.SPARSE,
        StandardOpenOption.WRITE,
        StandardOpenOption.READ))) {

        return fc.map(FileChannel.MapMode.READ_WRITE, 0, size);
    } catch (IOException ioe) {
        throw new RuntimeException(ioe);
    }
}

The use of the SPARSE option here is quite relevant. As long as the underlying OS and file system support it, we can map a sizable memory area without actually consuming disk space.
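To see the effect, we can request a mapping much larger than any data we intend to write. The file’s logical size grows to match the request, while a sparse-capable file system allocates blocks only for the pages we actually touch (the file name and size here are just for illustration):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.EnumSet;

public class SparseDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("sparse_demo.dat");
        long size = 256L * 1024 * 1024; // 256 MB logical size
        try (FileChannel fc = FileChannel.open(path, EnumSet.of(
                StandardOpenOption.CREATE,
                StandardOpenOption.SPARSE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE))) {
            // Mapping grows the file to the requested size without writing data
            fc.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
        // Logical size reflects the mapping; actual disk usage stays near zero
        System.out.println(Files.size(path));
    }
}
```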

Now, let’s create a simple demo application. The Producer application will allocate a shared memory buffer large enough to hold 64 KB of data plus a SHA1 hash (20 bytes). Next, it will enter a loop where it fills the buffer with random data, followed by its SHA1 hash. We’ll repeat this operation continuously for 30 seconds and then exit:

// ... SHA1 digest initialization omitted

MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);
Random rnd = new Random();

long start = System.currentTimeMillis();
long iterations = 0;
int capacity = shm.capacity();
System.out.println("Starting producer iterations...");
while(System.currentTimeMillis() - start < 30000) {

    for (int i = 0; i < capacity - hashLen; i++) {
        byte value = (byte) (rnd.nextInt(256) & 0x00ff);
        digest.update(value);
        shm.put(i, value);
    }

    // Write hash at the end
    byte[] hash = digest.digest();
    shm.put(capacity - hashLen, hash);
    iterations++;
}

System.out.printf("%d iterations run\n", iterations);

To test that we can indeed share memory, we’ll also create a Consumer app that runs for 30 seconds. At each iteration, it will read the buffer’s contents, compute their hash, and compare it with the Producer-generated one stored at the buffer’s end:

// ... digest initialization omitted

MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);
long start = System.currentTimeMillis();
long iterations = 0;
int capacity = shm.capacity();
System.out.println("Starting consumer iterations...");

long matchCount = 0;
long mismatchCount = 0;
byte[] expectedHash = new byte[hashLen];

while (System.currentTimeMillis() - start < 30000) {

    for (int i = 0; i < capacity - hashLen; i++) {
        byte value = shm.get(i);
        digest.update(value);
    }

    byte[] hash = digest.digest();
    shm.get(capacity - hashLen, expectedHash);

    if (Arrays.equals(hash, expectedHash)) {
        matchCount++;
    } else {
        mismatchCount++;
    }
    iterations++;
}

System.out.printf("%d iterations run. matches=%d, mismatches=%d\n", iterations, matchCount, mismatchCount);

To test our memory-sharing scheme, let’s start both programs at the same time. This is their output when running on a 3 GHz quad-core Intel i7 machine:

# Producer output
Starting producer iterations...
11722 iterations run


# Consumer output
Starting consumer iterations...
18893 iterations run. matches=11714, mismatches=7179

We can see that, in many cases, the hash the consumer computes differs from the one the producer wrote. Welcome to the wonderful world of concurrency issues!

5. Synchronizing Shared Memory Access

The root cause for the issue we’ve seen is that we need to synchronize access to the shared memory buffer. The Consumer must wait for the Producer to finish writing the hash before it starts reading the data. On the other hand, the Producer also must wait for the Consumer to finish consuming the data before writing to it again.

For a regular multithreaded application, solving this issue is no big deal. The standard library offers several synchronization primitives that allow us to control who can write to the shared memory at a given time.
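For instance, within a single JVM, a ReentrantLock from java.util.concurrent.locks does the job. A quick sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

public class SingleJvmLockDemo {
    private static final ReentrantLock lock = new ReentrantLock();
    private static int counter = 0;

    static int run() throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                lock.lock();   // blocks until this thread owns the lock
                try {
                    counter++; // protected critical section
                } finally {
                    lock.unlock();
                }
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints 2000
    }
}
```

The lock object, however, lives in one process’s heap, so a second JVM has no way to reach it.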

However, ours is a multi-JVM scenario, so none of those standard methods apply. So, what should we do? Well, the short answer is that we’ll have to cheat. We could resort to OS-specific mechanisms like semaphores, but this would hinder our application’s portability. Also, this implies using JNI or JNA, which also complicates things.

Enter Unsafe. Despite its somewhat scary name, this standard library class offers exactly what we need to implement a simple lock mechanism: the compareAndSwapInt() method.

This method implements an atomic compare-and-swap primitive that takes four arguments: a target object, an offset, the expected value, and the new value. Although not clearly stated in the documentation, it can target not only Java objects but also a raw memory address. For the latter, we pass null in the first argument, which makes it treat the offset argument as a virtual memory address.

When we call this method, it will first check the value at the target address and compare it with the expected value. If they’re equal, then it will modify the location’s content to the new value and return true indicating success. If the value at the location is different from expected, nothing happens, and the method returns false.

More importantly, this atomic operation is guaranteed to work even on multicore architectures, which is a critical feature for synchronizing multiple threads of execution, and, in our case, multiple processes.
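Since Unsafe requires some setup, we can first illustrate the same compare-and-swap semantics with the standard AtomicInteger class, whose compareAndSet() method behaves the same way (minus the raw-address targeting):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger slot = new AtomicInteger(0);

        // Succeeds: the current value (0) equals the expected value, so it becomes 1
        boolean first = slot.compareAndSet(0, 1);

        // Fails: the current value is now 1, not the expected 0; nothing changes
        boolean second = slot.compareAndSet(0, 1);

        System.out.println(first + " " + second + " " + slot.get()); // true false 1
    }
}
```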

Let’s create a SpinLock class that takes advantage of this method to implement a (very!) simple lock mechanism:

//... package and imports omitted

public class SpinLock {
    private static final Unsafe unsafe;

    // ... unsafe initialization omitted
    private final long addr;

    public SpinLock(long addr) {
        this.addr = addr;
    }

    public boolean tryLock(long maxWait) {
        long deadline = System.currentTimeMillis() + maxWait;
        while (System.currentTimeMillis() < deadline ) {
            if (unsafe.compareAndSwapInt(null, addr, 0, 1)) {
                return true;
            }
        }
        return false;
    }

    public void unlock() {
        unsafe.putInt(addr, 0);
    }
}

This implementation lacks key features, like checking whether it owns the lock before releasing it, but it will suffice for our purpose.

Okay, so how do we get the memory address that we’ll use to store the lock status? This must be an address within the shared memory buffer so both processes can use it, but the MappedByteBuffer class does not expose the actual memory address.

Inspecting the object that map() returns, we can see that it is a DirectByteBuffer. This class has a public method called address() that returns exactly what we want. Unfortunately, this class is package-private, so we can’t use a simple cast to access this method.

To bypass this limitation, we’ll cheat a little again and use reflection to invoke this method:

private static long getBufferAddress(MappedByteBuffer shm) {
    try {
        Class<?> cls = shm.getClass();
        Method maddr = cls.getMethod("address");
        maddr.setAccessible(true);
        Long addr = (Long) maddr.invoke(shm);
        if (addr == null) {
            throw new RuntimeException("Unable to retrieve buffer's address");
        }
        return addr;
    } catch (NoSuchMethodException | InvocationTargetException | IllegalAccessException ex) {
        throw new RuntimeException(ex);
    }
}

Here, we’re using setAccessible() to make the address() method callable through the Method handle. However, be aware that, from Java 17 onwards, this technique won’t work unless we explicitly use the --add-opens runtime flag.
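For example, assuming Producer is our application’s main class (the class name is just illustrative), we’d launch it like this:

```shell
# Open java.nio to reflective access so address() can be invoked on Java 17+
java --add-opens java.base/java.nio=ALL-UNNAMED Producer
```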

6. Adding Synchronization to Producer and Consumer

Now that we have a lock mechanism, let’s apply it to the Producer first. For the purposes of this demo, we’ll assume that the Producer will always start before the Consumer. We need this so we can initialize the buffer, clearing its contents, including the area we’ll use for the SpinLock:

public static void main(String[] args) throws Exception {

    // ... digest initialization omitted
    MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);

    // Cleanup lock area 
    shm.putInt(0, 0);

    long addr = getBufferAddress(shm);
    System.out.println("Starting producer iterations...");

    long start = System.currentTimeMillis();
    long iterations = 0;
    Random rnd = new Random();
    int capacity = shm.capacity();
    SpinLock lock = new SpinLock(addr);
    while(System.currentTimeMillis() - start < 30000) {

        if (!lock.tryLock(5000)) {
            throw new RuntimeException("Unable to acquire lock");
        }

        try {
            // Skip the first 4 bytes, as they're used by the lock
            for (int i = 4; i < capacity - hashLen; i++) {
                byte value = (byte) (rnd.nextInt(256) & 0x00ff);
                digest.update(value);
                shm.put(i, value);
            }

            // Write hash at the end
            byte[] hash = digest.digest();
            shm.put(capacity - hashLen, hash);
            iterations++;
        }
        finally {
            lock.unlock();
        }
    }
    System.out.printf("%d iterations run\n", iterations);
}

Compared to the unsynchronized version, there are just minor changes:

  • Retrieve the memory address associated with the MappedByteBuffer
  • Create a SpinLock instance using this address. The lock uses an int, so it will take the four initial bytes of the buffer
  • Use the SpinLock instance to protect the code that fills the buffer with random data and its hash

Now, let’s apply similar changes to the Consumer side:

public static void main(String[] args) throws Exception {

    // ... digest initialization omitted
    MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);
    long addr = getBufferAddress(shm);

    System.out.println("Starting consumer iterations...");

    Random rnd = new Random();
    long start = System.currentTimeMillis();
    long iterations = 0;
    int capacity = shm.capacity();

    long matchCount = 0;
    long mismatchCount = 0;
    byte[] expectedHash = new byte[hashLen];
    SpinLock lock = new SpinLock(addr);
    while (System.currentTimeMillis() - start < 30000) {

        if (!lock.tryLock(5000)) {
            throw new RuntimeException("Unable to acquire lock");
        }

        try {
            for (int i = 4; i < capacity - hashLen; i++) {
                byte value = shm.get(i);
                digest.update(value);
            }

            byte[] hash = digest.digest();
            shm.get(capacity - hashLen, expectedHash);

            if (Arrays.equals(hash, expectedHash)) {
                matchCount++;
            } else {
                mismatchCount++;
            }

            iterations++;
        } finally {
            lock.unlock();
        }
    }

    System.out.printf("%d iterations run. matches=%d, mismatches=%d\n", iterations, matchCount, mismatchCount);
}

With those changes, we can now run both sides and compare them with the previous result:

# Producer output
Starting producer iterations...
8543 iterations run

# Consumer output
Starting consumer iterations...
8607 iterations run. matches=8607, mismatches=0

As expected, the reported iteration counts are lower than in the non-synchronized version. The main reason is that we spend most of the time within the critical section of the code, holding the lock. Whichever program holds the lock prevents the other side from doing anything.

If we take the average iteration count from the first run, it’s approximately the same as the sum of the iterations both programs completed this time. This shows that the overhead added by the lock mechanism itself is minimal.

7. Conclusion

In this tutorial, we’ve explored how to share a memory area between two JVMs running on the same machine. We can use the technique presented here as the foundation for high-throughput, low-latency inter-process communication libraries.

As usual, all code is available over on GitHub.
