
1. Introduction

In this tutorial, we’ll show how to share memory between two or more JVMs running on the same machine. This capability enables very fast inter-process communication since we can move data blocks around without any I/O operation.

2. How Shared Memory Works

A process running in any modern operating system gets what’s called a virtual memory space. We call it virtual because, although it looks like a large, contiguous, and private addressable memory space, it’s actually made of pages spread all over the physical RAM. Here, page is just OS slang for a block of contiguous memory whose size depends on the particular CPU architecture in use. For x86-64, a page can be as small as 4 KB or as large as 1 GB.

At a given time, only a fraction of this virtual space is actually mapped to physical pages. As time passes and the process starts to consume more memory for its tasks, the OS starts to allocate more physical pages and map them to the virtual space. When the demand for memory exceeds what’s physically available, the OS will start to swap out pages that are not being used at that moment to secondary storage to make room for the request.

A shared memory block behaves just like regular memory, but, in contrast with regular memory, it is not private to a single process. When a process changes the contents of any byte within this block, any other process with access to this same shared memory “sees” this change instantly.

This is a list of common uses for shared memory:

  • Debuggers (ever wondered how a debugger can inspect variables in another process?)
  • Inter-process communication
  • Read-only content sharing between processes (ex: dynamic library code)
  • Hacks of all sorts ;^)

3. Shared Memory and Memory-Mapped Files

A memory-mapped file, as the name suggests, is a regular file whose contents are directly mapped to a contiguous area in the virtual memory of a process. This means that we can read and/or change its contents without explicit use of I/O operations. The OS will detect any writes to the mapped area and will schedule a background I/O operation to persist the modified data.

Since there are no guarantees on when this background operation will happen, the OS also offers a system call to flush any pending changes. This is important for use cases like database redo logs, but not needed for our inter-process communication (IPC, for short) scenario.
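In Java, this explicit flush is exposed as the MappedByteBuffer’s force() method. Here’s a minimal sketch (the file name flush_demo.dat is just for illustration):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("flush_demo.dat");
        try (FileChannel fc = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Mapping beyond the current file size grows the file to 1024 bytes
            MappedByteBuffer buf = fc.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            buf.put(0, (byte) 42); // a plain memory write, no explicit I/O
            buf.force();           // ask the OS to persist pending changes now
        }
    }
}
```

We won’t need force() in the IPC scenario below, since both processes read the mapped pages directly rather than the on-disk copy.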

Memory-mapped files are commonly used by database servers to achieve high throughput I/O operations, but we can also use them to bootstrap a shared-memory-based IPC mechanism. The basic idea is that all processes that need to share data map the same file and, voilà, they now have a shared memory area.

4. Creating Memory-Mapped Files in Java

In Java, we use the FileChannel’s map() method to map a region of a file into memory, which returns a MappedByteBuffer that allows us to access its contents:

MappedByteBuffer createSharedMemory(String path, long size) {

    try (FileChannel fc = (FileChannel)Files.newByteChannel(new File(path).toPath(),
      EnumSet.of(
        StandardOpenOption.CREATE,
        StandardOpenOption.SPARSE,
        StandardOpenOption.WRITE,
        StandardOpenOption.READ))) {

        return fc.map(FileChannel.MapMode.READ_WRITE, 0, size);
    } catch (IOException ioe) {
        throw new RuntimeException(ioe);
    }
}

The use of the SPARSE option here is quite relevant. As long as the underlying OS and file system support it, we can map a sizable memory area without actually consuming disk space.
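To see the effect, we can request a mapping much larger than any data we intend to write. The file’s logical size grows to match the request, while a sparse-capable file system allocates blocks only for the pages we actually touch (the file name and size here are just for illustration):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.EnumSet;

public class SparseDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("sparse_demo.dat");
        long size = 256L * 1024 * 1024; // 256 MB logical size
        try (FileChannel fc = FileChannel.open(path, EnumSet.of(
                StandardOpenOption.CREATE,
                StandardOpenOption.SPARSE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE))) {
            // Mapping grows the file to the requested size without writing data
            fc.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
        // Logical size reflects the mapping; actual disk usage stays near zero
        System.out.println(Files.size(path));
    }
}
```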

Now, let’s create a simple demo application. The Producer application will allocate a shared memory buffer large enough to hold 64 KB of data plus a SHA1 hash (20 bytes). Next, it will enter a loop where it fills the buffer with random data, followed by its SHA1 hash. We’ll repeat this operation continuously for 30 seconds and then exit:

// ... SHA1 digest initialization omitted

MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);
Random rnd = new Random();

long start = System.currentTimeMillis();
long iterations = 0;
int capacity = shm.capacity();
System.out.println("Starting producer iterations...");
while(System.currentTimeMillis() - start < 30000) {

    for (int i = 0; i < capacity - hashLen; i++) {
        byte value = (byte) (rnd.nextInt(256) & 0x00ff);
        digest.update(value);
        shm.put(i, value);
    }

    // Write hash at the end
    byte[] hash = digest.digest();
    shm.put(capacity - hashLen, hash);
    iterations++;
}

System.out.printf("%d iterations run\n", iterations);

To test that we can indeed share memory, we’ll also create a Consumer app that runs for 30 seconds. At each iteration, it will read the buffer’s contents, compute their hash, and compare it with the Producer-generated one stored at the buffer’s end:

// ... digest initialization omitted

MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);
long start = System.currentTimeMillis();
long iterations = 0;
int capacity = shm.capacity();
System.out.println("Starting consumer iterations...");

long matchCount = 0;
long mismatchCount = 0;
byte[] expectedHash = new byte[hashLen];

while (System.currentTimeMillis() - start < 30000) {

    for (int i = 0; i < capacity - hashLen; i++) {
        byte value = shm.get(i);
        digest.update(value);
    }

    byte[] hash = digest.digest();
    shm.get(capacity - hashLen, expectedHash);

    if (Arrays.equals(hash, expectedHash)) {
        matchCount++;
    } else {
        mismatchCount++;
    }
    iterations++;
}

System.out.printf("%d iterations run. matches=%d, mismatches=%d\n", iterations, matchCount, mismatchCount);

To test our memory-sharing scheme, let’s start both programs at the same time. This is their output when running on a 3 GHz quad-core Intel i7 machine:

# Producer output
Starting producer iterations...
11722 iterations run


# Consumer output
Starting consumer iterations...
18893 iterations run. matches=11714, mismatches=7179

We can see that, in many cases, the hash the consumer computes differs from the one the producer wrote. Welcome to the wonderful world of concurrency issues!

5. Synchronizing Shared Memory Access

The root cause for the issue we’ve seen is that we need to synchronize access to the shared memory buffer. The Consumer must wait for the Producer to finish writing the hash before it starts reading the data. On the other hand, the Producer also must wait for the Consumer to finish consuming the data before writing to it again.

For a regular multithreaded application, solving this issue is no big deal. The standard library offers several synchronization primitives that allow us to control who can write to the shared memory at a given time.
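For instance, within a single JVM, a ReentrantLock from java.util.concurrent.locks does the job. A quick sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

public class SingleJvmLockDemo {
    private static final ReentrantLock lock = new ReentrantLock();
    private static int counter = 0;

    static int run() throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                lock.lock();   // blocks until this thread owns the lock
                try {
                    counter++; // protected critical section
                } finally {
                    lock.unlock();
                }
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints 2000
    }
}
```

The lock object, however, lives in one process’s heap, so a second JVM has no way to reach it.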

However, ours is a multi-JVM scenario, so none of those standard methods apply. So, what should we do? Well, the short answer is that we’ll have to cheat. We could resort to OS-specific mechanisms like semaphores, but this would hinder our application’s portability. Also, this implies using JNI or JNA, which also complicates things.

Enter Unsafe. Despite its somewhat scary name, this standard library class offers exactly what we need to implement a simple lock mechanism: the compareAndSwapInt() method.

This method implements an atomic compare-and-swap primitive that takes four arguments: a target object, an offset, the expected value, and the new value. Although not clearly stated in the documentation, it can target not only Java objects but also a raw memory address. For the latter, we pass null in the first argument, which makes it treat the offset argument as a virtual memory address.

When we call this method, it will first check the value at the target address and compare it with the expected value. If they’re equal, then it will modify the location’s content to the new value and return true indicating success. If the value at the location is different from expected, nothing happens, and the method returns false.

More importantly, this atomic operation is guaranteed to work even on multicore architectures, which is a critical feature for synchronizing multiple threads of execution, and, in our case, multiple processes.
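Since Unsafe requires some setup, we can first illustrate the same compare-and-swap semantics with the standard AtomicInteger class, whose compareAndSet() method behaves the same way (minus the raw-address targeting):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger slot = new AtomicInteger(0);

        // Succeeds: the current value (0) equals the expected value, so it becomes 1
        boolean first = slot.compareAndSet(0, 1);

        // Fails: the current value is now 1, not the expected 0; nothing changes
        boolean second = slot.compareAndSet(0, 1);

        System.out.println(first + " " + second + " " + slot.get()); // true false 1
    }
}
```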

Let’s create a SpinLock class that takes advantage of this method to implement a (very!) simple lock mechanism:

//... package and imports omitted

public class SpinLock {
    private static final Unsafe unsafe;

    // ... unsafe initialization omitted
    private final long addr;

    public SpinLock(long addr) {
        this.addr = addr;
    }

    public boolean tryLock(long maxWait) {
        long deadline = System.currentTimeMillis() + maxWait;
        while (System.currentTimeMillis() < deadline ) {
            if (unsafe.compareAndSwapInt(null, addr, 0, 1)) {
                return true;
            }
        }
        return false;
    }

    public void unlock() {
        unsafe.putInt(addr, 0);
    }
}

This implementation lacks key features, like checking whether it owns the lock before releasing it, but it will suffice for our purpose.

Okay, so how do we get the memory address that we’ll use to store the lock status? This must be an address within the shared memory buffer so both processes can use it, but the MappedByteBuffer class does not expose the actual memory address.

Inspecting the object that map() returns, we can see that it is a DirectByteBuffer. This class has a public method called address() that returns exactly what we want. Unfortunately, this class is package-private, so we can’t use a simple cast to access this method.

To bypass this limitation, we’ll cheat a little again and use reflection to invoke this method:

private static long getBufferAddress(MappedByteBuffer shm) {
    try {
        Class<?> cls = shm.getClass();
        Method maddr = cls.getMethod("address");
        maddr.setAccessible(true);
        Long addr = (Long) maddr.invoke(shm);
        if (addr == null) {
            throw new RuntimeException("Unable to retrieve buffer's address");
        }
        return addr;
    } catch (NoSuchMethodException | InvocationTargetException | IllegalAccessException ex) {
        throw new RuntimeException(ex);
    }
}

Here, we’re using setAccessible() to make the address() method callable through the Method handle. However, be aware that, from Java 17 onwards, this technique won’t work unless we explicitly use the --add-opens runtime flag.
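For example, assuming Producer is our application’s main class (the class name is just illustrative), we’d launch it like this:

```shell
# Open java.nio to reflective access so address() can be invoked on Java 17+
java --add-opens java.base/java.nio=ALL-UNNAMED Producer
```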

6. Adding Synchronization to Producer and Consumer

Now that we have a lock mechanism, let’s apply it to the Producer first. For the purposes of this demo, we’ll assume that the Producer will always start before the Consumer. We need this so we can initialize the buffer, clearing its contents, including the area we’ll use for the SpinLock:

public static void main(String[] args) throws Exception {

    // ... digest initialization omitted
    MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);

    // Cleanup lock area 
    shm.putInt(0, 0);

    long addr = getBufferAddress(shm);
    System.out.println("Starting producer iterations...");

    long start = System.currentTimeMillis();
    long iterations = 0;
    Random rnd = new Random();
    int capacity = shm.capacity();
    SpinLock lock = new SpinLock(addr);
    while(System.currentTimeMillis() - start < 30000) {

        if (!lock.tryLock(5000)) {
            throw new RuntimeException("Unable to acquire lock");
        }

        try {
            // Skip the first 4 bytes, as they're used by the lock
            for (int i = 4; i < capacity - hashLen; i++) {
                byte value = (byte) (rnd.nextInt(256) & 0x00ff);
                digest.update(value);
                shm.put(i, value);
            }

            // Write hash at the end
            byte[] hash = digest.digest();
            shm.put(capacity - hashLen, hash);
            iterations++;
        }
        finally {
            lock.unlock();
        }
    }
    System.out.printf("%d iterations run\n", iterations);
}

Compared to the unsynchronized version, there are just minor changes:

  • Retrieve the memory address associated with the MappedByteBuffer
  • Create a SpinLock instance using this address. The lock uses an int, so it will take the four initial bytes of the buffer
  • Use the SpinLock instance to protect the code that fills the buffer with random data and its hash

Now, let’s apply similar changes to the Consumer side:

public static void main(String[] args) throws Exception {

    // ... digest initialization omitted
    MappedByteBuffer shm = createSharedMemory("some_path.dat", 64*1024 + 20);
    long addr = getBufferAddress(shm);

    System.out.println("Starting consumer iterations...");

    Random rnd = new Random();
    long start = System.currentTimeMillis();
    long iterations = 0;
    int capacity = shm.capacity();

    long matchCount = 0;
    long mismatchCount = 0;
    byte[] expectedHash = new byte[hashLen];
    SpinLock lock = new SpinLock(addr);
    while (System.currentTimeMillis() - start < 30000) {

        if (!lock.tryLock(5000)) {
            throw new RuntimeException("Unable to acquire lock");
        }

        try {
            for (int i = 4; i < capacity - hashLen; i++) {
                byte value = shm.get(i);
                digest.update(value);
            }

            byte[] hash = digest.digest();
            shm.get(capacity - hashLen, expectedHash);

            if (Arrays.equals(hash, expectedHash)) {
                matchCount++;
            } else {
                mismatchCount++;
            }

            iterations++;
        } finally {
            lock.unlock();
        }
    }

    System.out.printf("%d iterations run. matches=%d, mismatches=%d\n", iterations, matchCount, mismatchCount);
}

With those changes, we can now run both sides and compare them with the previous result:

# Producer output
Starting producer iterations...
8543 iterations run

# Consumer output
Starting consumer iterations...
8607 iterations run. matches=8607, mismatches=0

As expected, the reported iteration counts are lower than in the non-synchronized version. The main reason is that we spend most of the time within the critical section of the code, holding the lock. Whichever program holds the lock prevents the other side from doing anything.

If we take the average iteration count from the first run, it’s approximately the same as the sum of the iterations both programs completed this time. This shows that the overhead added by the lock mechanism itself is minimal.

7. Conclusion

In this tutorial, we’ve explored how to share a memory area between two JVMs running on the same machine. We can use the technique presented here as the foundation for high-throughput, low-latency inter-process communication libraries.

As usual, all code is available over on GitHub.
