Multipart Upload on S3 with jclouds

Azure Spring Apps is a fully managed service from Microsoft (built in collaboration with VMware), focused on building and deploying Spring Boot applications on Azure Cloud without worrying about Kubernetes.

And, the Enterprise plan comes with some interesting features, such as commercial Spring runtime support, a 99.95% SLA and some deep discounts (up to 47%) when you are ready for production.

>> Learn more and deploy your first Spring Boot app to Azure.

You can also ask questions and leave feedback on the Azure Spring Apps GitHub page.

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

The Jet Profiler was built for MySQL only, so it can do things like real-time query performance, focus on most used tables or most frequent queries, quickly identify performance issues and basically help you optimize your queries.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

Accelerate Your Jakarta EE Development with Payara Server!

With best-in-class guides and documentation, Payara essentially simplifies deployment to diverse infrastructures.

Beyond that, it provides intelligent insights and actions to optimize Jakarta EE applications.

The goal is to apply an opinionated approach to get to what's essential for mission-critical applications - really solid scalability, availability, security, and long-term support:

>> Download and Explore the Guide (to learn more)

The AI Assistant to boost Boost your productivity writing unit tests - Machinet AI.

AI is all the rage these days, but for very good reason. The highly practical coding companion, you'll get the power of AI-assisted coding and automated unit test generation.
Machinet's Unit Test AI Agent utilizes your own project context to create meaningful unit tests that intelligently aligns with the behavior of the code.
And, the AI Chat crafts code and fixes errors with ease, like a helpful sidekick.

Simplify Your Coding Journey with Machinet AI:

>> Install Machinet AI in your IntelliJ

Looking for the ideal Linux distro for running modern Spring apps in the cloud?

Meet Alpaquita Linux: lightweight, secure, and powerful enough to handle heavy workloads.

This distro is specifically designed for running Java apps. It builds upon Alpine and features significant enhancements to excel in high-density container environments while meeting enterprise-grade security standards.

Specifically, the container image size is ~30% smaller than standard options, and it consumes up to 30% less RAM:

>> Try Alpaquita Containers now.

DbSchema is a super-flexible database designer, which can take you from designing the DB with your team all the way to safely deploying the schema.

The way it does all of that is by using a design model, a database-independent image of the schema, which can be shared in a team using GIT and compared or deployed on to any database.

And, of course, it can be heavily visual, allowing you to interact with the database using diagrams, visually compose queries, explore the data, generate random data, import data or build HTML5 database reports.

>> Take a look at DBSchema

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

1. Goal

In the previous article on S3 uploading, we looked at how we can use the generic Blob APIs from jclouds to upload content to S3. In this article we will use the S3 specific asynchronous API from jclouds to upload content and leverage the multipart upload functionality provided by S3.

2. Preparation

2.1. Set up the Custom API

The first part of the upload process is creating the jclouds API – this is a custom API for Amazon S3:

public AWSS3AsyncClient s3AsyncClient() {
   String identity = ...
   String credentials = ...

   BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").
      credentials(identity, credentials).buildView(BlobStoreContext.class);

   RestContext<AWSS3Client, AWSS3AsyncClient> providerContext = context.unwrap();
   return providerContext.getAsyncApi();
}

2.2. Determining the Number of Parts for the Content

Amazon S3 has a 5 MB limit for each part to be uploaded. As such, the first thing we need to do is determine the right number of parts that we can split our content into so that we don’t have parts below this 5 MB limit:

public static int getMaximumNumberOfParts(byte[] byteArray) {
   int numberOfParts= byteArray.length / fiveMB; // 5*1024*1024
   if (numberOfParts== 0) {
      return 1;
   }
   return numberOfParts;
}

2.3. Breaking the Content into Parts

Were going to break the byte array into a set number of parts:

public static List<byte[]> breakByteArrayIntoParts(byte[] byteArray, int maxNumberOfParts) {
   List<byte[]> parts = Lists.<byte[]> newArrayListWithCapacity(maxNumberOfParts);
   int fullSize = byteArray.length;
   long dimensionOfPart = fullSize / maxNumberOfParts;
   for (int i = 0; i < maxNumberOfParts; i++) {
      int previousSplitPoint = (int) (dimensionOfPart * i);
      int splitPoint = (int) (dimensionOfPart * (i + 1));
      if (i == (maxNumberOfParts - 1)) {
         splitPoint = fullSize;
      }
      byte[] partBytes = Arrays.copyOfRange(byteArray, previousSplitPoint, splitPoint);
      parts.add(partBytes);
   }

   return parts;
}

We’re going to test the logic of breaking the byte array into parts – we’re going to generate some bytes, split the byte array, recompose it back together using Guava and verify that we get back the original:

@Test
public void given16MByteArray_whenFileBytesAreSplitInto3_thenTheSplitIsCorrect() {
   byte[] byteArray = randomByteData(16);

   int maximumNumberOfParts = S3Util.getMaximumNumberOfParts(byteArray);
   List<byte[]> fileParts = S3Util.breakByteArrayIntoParts(byteArray, maximumNumberOfParts);

   assertThat(fileParts.get(0).length + fileParts.get(1).length + fileParts.get(2).length, 
      equalTo(byteArray.length));
   byte[] unmultiplexed = Bytes.concat(fileParts.get(0), fileParts.get(1), fileParts.get(2));
   assertThat(byteArray, equalTo(unmultiplexed));
}

To generate the data, we simply use the support from Random:

byte[] randomByteData(int mb) {
   byte[] randomBytes = new byte[mb * 1024 * 1024];
   new Random().nextBytes(randomBytes);
   return randomBytes;
}

2.4. Creating the Payloads

Now that we have determined the correct number of parts for our content and we managed to break the content into parts, we need to generate the Payload objects for the jclouds API:

public static List<Payload> createPayloadsOutOfParts(Iterable<byte[]> fileParts) {
   List<Payload> payloads = Lists.newArrayList();
   for (byte[] filePart : fileParts) {
      byte[] partMd5Bytes = Hashing.md5().hashBytes(filePart).asBytes();
      Payload partPayload = Payloads.newByteArrayPayload(filePart);
      partPayload.getContentMetadata().setContentLength((long) filePart.length);
      partPayload.getContentMetadata().setContentMD5(partMd5Bytes);
      payloads.add(partPayload);
   }
   return payloads;
}

3. Upload

The upload process is a flexible multi-step process – this means:

the upload can be started before having all the data – data can be uploaded as it’s coming in
data is uploaded in chunks – if one of these operations fails, it can simply be retrieved
chunks can be uploaded in parallel – this can greatly increase the upload speed, especially in the case of large files

3.1. Initiating the Upload Operation

The first step in the Upload operation is to initiate the process. This request to S3 must contain the standard HTTP headers – the Content–MD5 header in particular needs to be computed. Were going to use the Guava hash function support here:

Hashing.md5().hashBytes(byteArray).asBytes();

This is the md5 hash of the entire byte array, not of the parts yet.

To initiate the upload, and for all further interactions with S3, we’re going to use the AWSS3AsyncClient – the asynchronous API we created earlier:

ObjectMetadata metadata = ObjectMetadataBuilder.create().key(key).contentMD5(md5Bytes).build();
String uploadId = s3AsyncApi.initiateMultipartUpload(container, metadata).get();

The key is the handle assigned to the object – this needs to be a unique identifier specified by the client.

Also notice that, even though we’re using the async version of the API, we’re blocking for the result of this operation – this is because we will need the result of the initialize to be able to move forward.

The result of the operation is an upload id returned by S3 – this will identify the upload throughout it’s lifecycle and will be present in all subsequent upload operations.

3.2. Uploading the Parts

The next step is uploading the parts. Our goal here is to send these requests in parallel, as the upload parts operation represent the bulk of the upload process:

List<ListenableFuture<String>> ongoingOperations = Lists.newArrayList();
for (int partNumber = 0; partNumber < filePartsAsByteArrays.size(); partNumber++) {
   ListenableFuture<String> future = s3AsyncApi.uploadPart(
      container, key, partNumber + 1, uploadId, payloads.get(partNumber));
   ongoingOperations.add(future);
}

The part numbers need to be continuous but the order in which the requests are send is not relevant.

After all of the upload part requests have been submitted, we need to wait for their responses so that we can collect the individual ETag value of each part:

Function<ListenableFuture<String>, String> getEtagFromOp = 
  new Function<ListenableFuture<String>, String>() {
   public String apply(ListenableFuture<String> ongoingOperation) {
      try {
         return ongoingOperation.get();
      } catch (InterruptedException | ExecutionException e) {
         throw new IllegalStateException(e);
      }
   }
};
List<String> etagsOfParts = Lists.transform(ongoingOperations, getEtagFromOp);

If, for whatever reason, one of the upload part operations fails, the operation can be retried until it succeeds. The logic above does not contain the retry mechanism, but building it in should be straightforward enough.

3.3. Completing the Upload Operation

The final step of the upload process is completing the multipart operation. The S3 API requires the responses from the previous parts upload as a Map, which we can now easily create from the list of ETags that we obtained above:

Map<Integer, String> parts = Maps.newHashMap();
for (int i = 0; i < etagsOfParts.size(); i++) {
   parts.put(i + 1, etagsOfParts.get(i));
}

And finally, send the complete request:

s3AsyncApi.completeMultipartUpload(container, key, uploadId, parts).get();

This will return final ETag of the finished object and will complete the entire upload process.

4. Conclusion

In this article we built a multipart enabled, fully parallel upload operation to S3, using the custom S3 jclouds API. This operation is ready to be used as is, but it can be improved in a few ways.

First, retry logic should be added around the upload operations to better deal with failures.

Next, for really large files, even though the mechanism is sending all upload multipart requests in parallel, a throttling mechanism should still limit the number of parallel requests being sent. This is both to avoid bandwidth becoming a bottleneck as well as to make sure Amazon itself doesn’t flag the upload process as exceeding an allowed limit of requests per second – the Guava RateLimiter can potentially be very well suited for this.

Multipart Upload on S3 with jclouds

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Goal

2. Preparation

2.1. Set up the Custom API

2.2. Determining the Number of Parts for the Content

2.3. Breaking the Content into Parts

2.4. Creating the Payloads

3. Upload

3.1. Initiating the Upload Operation

3.2. Uploading the Parts

3.3. Completing the Upload Operation

4. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course:

REST with Spring

Learn Spring Security ▼▲

Learn Spring Security Core

Learn Spring Security OAuth

Learn Spring

Learn Spring Data JPA

Persistence

REST

Security

Full Archive

Baeldung Ebooks

About Baeldung

Write for Baeldung

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Goal

2. Preparation

2.1. Set up the Custom API

2.2. Determining the Number of Parts for the Content

2.3. Breaking the Content into Parts

2.4. Creating the Payloads

3. Upload

3.1. Initiating the Upload Operation

3.2. Uploading the Parts

3.3. Completing the Upload Operation

4. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course: