1. Overview

In this tutorial, we’ll explore using Liquibase, Spring Boot, and Kubernetes together. These technologies let us build applications that configure their own database schema on start-up. That’s powerful, but it creates issues when we run them at scale.

2. What Happens When the Service Starts

When we deploy on Kubernetes, we can specify the number of replicas we want. If we handle user requests, we should have more than one application instance.

When we use the Spring Boot Liquibase starter, it attempts to run the database migration when the application starts, and the service will not be ready to handle requests until the migration completes. Before running a migration, Liquibase first writes a row to a lock table in the database. Running two migrations simultaneously would cause errors, so Liquibase protects the schema by allowing only one migration to run at a time.
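
The starter only needs the Liquibase dependency on the classpath, the usual datasource settings, and a changelog. A minimal application.yaml sketch might look like this (the changelog path and connection details are placeholders):

# application.yaml (sketch) - changelog path and datasource values are placeholders
spring:
  datasource:
    url: jdbc:postgresql://db:5432/mydb
    username: myuser
    password: mypassword
  liquibase:
    change-log: classpath:db/changelog/db.changelog-master.xml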

If we start two Spring Boot applications with Liquibase simultaneously, there is a race to see which one will run the migration. One of them will acquire the lock and run the migration. The other will wait until the lock is available. Once the migration completes, the second instance acquires the lock, sees that the migrations have already been applied, releases the lock, and continues.

Usually, this is not a big problem. If there is no migration to run, checking the changelog and releasing the lock is quick, so even when we start many replicas, they all become ready promptly.

The problem comes from the combination of Spring Boot, Liquibase, and Kubernetes. Kubernetes is designed to run services inside containers across a cluster of nodes. It uses probes to check that everything is working, moves pods between nodes to balance the load, and replaces failing services. This behavior is extremely powerful, but it adds risk when running a long process such as a database migration.

3. Kubernetes Probes

In Kubernetes, it is recommended to use readiness and liveness probes to check that the service is working correctly. The most common approach is to point them at the Spring Boot Actuator health endpoint. This ensures that services do not receive requests before they’re ready and that new instances are stopped if there is a problem.
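
On the application side, we need to expose the health endpoint. A minimal sketch of the Actuator configuration might look like this; here we assume the management base path is remapped from the default /actuator to / so the endpoint matches the /health path used by the probes below:

# application.yaml (sketch) - serves the health endpoint at /health
# instead of the default /actuator/health
management:
  endpoints:
    web:
      base-path: /
      exposure:
        include: health
  endpoint:
    health:
      probes:
        enabled: true  # also exposes the /health/liveness and /health/readiness groups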

We would define the probes like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api
    spec:
      containers:
         - image: myapp:latest
           name: api
           livenessProbe:
             httpGet:
               path: /health
               port: http
             failureThreshold: 1
             periodSeconds: 10
           startupProbe:
             httpGet:
               path: /health
               port: http
             failureThreshold: 12
             periodSeconds: 5
           ports: 
             - containerPort: 8080
               name: http

In this example, the startup probe gives the service 60 seconds (12 failures × 5 seconds) to become ready. If it is not ready in time, the container is terminated and the deployment fails. Usually, this is fine, but remember that Spring Boot will not report it is healthy until the migrations are complete.

Most migrations are fast, but some can take a long time. Adding large amounts of data or modifying indexes in a large database can be slow. If the migration is still running when the startup probe time expires, the service will be killed without releasing the lock.

Starting a new instance will then fail because the lock was never released. Every new instance waits to acquire the database lock until its startup probe time expires and it is killed too. We can manually delete the lock from the database, but we might still have problems when the migrations run again. The migration was interrupted partway through, and starting it again might cause errors or damage the schema, which is not good for a production database.
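
If we do need to clear the lock manually, one option is a one-off Kubernetes Job that runs Liquibase’s release-locks command. This is only a sketch: the image tag, connection details, and credentials handling are placeholders and would normally come from a Secret.

apiVersion: batch/v1
kind: Job
metadata:
  name: release-liquibase-lock
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: release-locks
          image: liquibase/liquibase:latest
          # Placeholder connection details - use a Secret in a real cluster
          args:
            - release-locks
            - --url=jdbc:postgresql://db:5432/mydb
            - --username=myuser
            - --password=mypassword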

4. How Can We Avoid This Problem in the Future?

In most cases, removing the lock will fix the problem, but we should be careful with production databases. We want to ensure the migration can be fully completed with no errors or interruptions.

First, it’s essential to test the migration on a production-like database. The only way to know how long a migration might take is to run it on a similar database. Ideally, we want a staging database that resembles production: roughly the same amount of data, the same processing power, and simulated traffic. Obviously, that is not always available and is a luxury for many organizations.

Production data is usually closely guarded, so we cannot simply copy it into our staging database. That means we need to create test data, so there will be differences. Staging data is often more uniform and consistent because it’s generated by scripts. We are also unlikely to have the same load on a staging database, while our production database is busy handling user requests and updates. As a result, a migration on a staging database is likely to be faster than one on production.

Even though we don’t have a perfect test environment, it’s still important to test our migration. The test will tell us whether the migration works and the minimum time it will take to complete. We can then look at the differences between staging and production to estimate how much extra time we should allow. As a rule, we should assume it will take at least twice as long on a production system.

5. Migration Options

We have several options to reduce the risk of the migration timing out. Some are more difficult to implement and will make the overall architecture more complex, but they add assurance that the migration will succeed.

5.1. Extending Timeouts

The most straightforward method is to extend the time allowed by the start-up probe so that the migration can complete. The start-up probe is intended to protect services that take time to become ready.

It would not be a good idea to remove the probes completely, because a different problem could then go undetected.

In this example, we are going to give the service 10 minutes to start.

startupProbe: 
  httpGet: 
    path: /health 
    port: http 
  failureThreshold: 30 
  periodSeconds: 20

With a failureThreshold of 30 and a periodSeconds of 20, Kubernetes checks every 20 seconds and allows up to 10 minutes (30 × 20 seconds) before giving up. As soon as the migration is complete, the probe will succeed. This solution works if the migration can be completed in time: if the migration test shows it should finish within 3 minutes, giving it 10 minutes should be safe.

The advantage of this approach is that it is easy to implement. However, the migration could still take too long and be interrupted, and a different problem could prevent the service from becoming ready. With a long timeout, we’ll have to wait longer for the deployment to fail.

5.2. Lifecycle Hooks

There is also the option of running some code when a container stops. Lifecycle hooks can run whenever a container starts or is about to be terminated, giving us fine-grained control of the process. Using a preStop hook, we can ensure the lock is released so the next container can acquire it and proceed.

There is still a danger that the migration will be terminated during an important step and cause an error when it restarts, but it removes the risk that all future containers will be stuck waiting for a lock that is never released.

A lifecycle hook can execute a command or make an HTTP request. We can implement a simple hook to remove the lock before the container exits.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  # Replicas, selectors and metadata omitted
  template:
    spec:
      containers:
         - image: myapp:latest
           name: api
           # Probes and ports omitted
           lifecycle:
             preStop:
               exec:
                 command: ["/bin/sh","-c","/stopservice.sh"]

This approach will ensure the lock is released and the next container can start. The next container should be able to start the migration from the last completed step and continue as normal.

It is important to ensure the hook is well-tested. The hook will be executed at least once, but it could be called more than once. It will also be called whenever a container exits, so it might not be an error situation. If the hook fails or hangs, the container can be left stuck in a terminating state and may require manual intervention.
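
The example above leaves the contents of /stopservice.sh open. One possible sketch, assuming the Liquibase CLI is available inside the application image and the connection details are provided as environment variables, is to ship the script in a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: stopservice-script
data:
  stopservice.sh: |
    #!/bin/sh
    # Release the Liquibase changelog lock so the next container can acquire it.
    # Assumes the Liquibase CLI is on the PATH and the connection details
    # come from environment variables set on the container.
    liquibase release-locks \
      --url="$DB_URL" \
      --username="$DB_USER" \
      --password="$DB_PASSWORD"

The ConfigMap would then be mounted into the container with a volume and volumeMount (omitted here) so the script is available at /stopservice.sh.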

5.3. Separating Containers

If the migration takes a long time, it might be worth separating the database migration from the application code. The advantage of this approach is that the migration can take any amount of time to complete without impacting the services.

To separate the containers, we will need to modify our pipeline to produce two images: one that contains the built application JAR and one that contains the migration. Liquibase offers a range of ways to run migrations, including a CLI and a Maven plugin. By creating a container image that contains the migration script, we can run it as a standalone task.

apiVersion: v1
kind: Pod
metadata:
  name: db-migration
spec:
  containers:
  - name: migration
    image: mymigration:latest
    resources:
      limits:
        memory: "200Mi"
        cpu: "700m"
      requests:
        memory: "200Mi"
        cpu: "700m"

This solution is more complex because we now build and run two containers, but it allows us to avoid the risk of the probes killing the service during the migration. The migration can take as long as it needs, and the pod will exit once it finishes. We have also set the resource requests equal to the limits, which gives the pod the Guaranteed Quality of Service class and makes it less likely that Kubernetes will evict the pod before the migration is complete (see the Quality of Service documentation).

The migration might need to complete before the new services start, so we need to plan the order of the changes. It might be possible to release the service first, but that will depend on our use case.

5.4. Using InitContainers

Another option for running the migration from a separate container is to use an InitContainer. An InitContainer runs before the main container and is not affected by the probes. We still need to manage QoS, but we do not need to deploy the container separately.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  # Replicas, selectors and metadata omitted
  template:
    spec:
      containers:
         - image: myapp:latest
           name: api
           # Probes and ports omitted
      initContainers:
         - name: migration
           image: mymigration:latest

The advantage of this approach is that the InitContainer runs as soon as the pod starts. The disadvantage is that we have coupled the migration to the service, which means the deployment has to wait for the migration to complete.

5.5. Running the Migration Separately

The final option is completely separating the database migration from the Kubernetes environment. There is always a risk that Kubernetes will need to move pods from one node to another, either because of QoS, a node failure, or anything else that requires intervention. There is no way to be sure the migration will always run to completion.

To avoid this problem, we could run the migration from a separate server. It could be a CI server that would allow the migration to run for a long period. Alternatively, we could start a dedicated server, particularly in the cloud, and then tear it down afterward.
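
For example, a CI pipeline could run the migration using the official Liquibase Docker image before the application is deployed. The sketch below uses GitHub Actions syntax as one possibility; the changelog path, image tag, and connection secrets are placeholders:

name: db-migration
on: workflow_dispatch

jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Liquibase migration
        # Mount the changelog directory into the Liquibase image and run 'update'
        run: |
          docker run --rm \
            -v "$PWD/src/main/resources/db/changelog:/liquibase/changelog" \
            liquibase/liquibase:latest \
            update \
            --changelog-file=changelog/db.changelog-master.xml \
            --url="${{ secrets.DB_URL }}" \
            --username="${{ secrets.DB_USER }}" \
            --password="${{ secrets.DB_PASSWORD }}"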

This approach involves more planning and infrastructure, but it ensures the migration has time to complete fully. Long tasks like inserting lots of data, changing indexes, or restructuring tables can be slow, and interrupting them can be difficult to recover from, so having a separate server for large tasks might make sense.

6. Conclusion

In this article, we’ve seen several ways to handle database migrations in a Kubernetes environment. Running a database migration in Kubernetes is normally fine, but there are times when we need to plan it in more detail. Knowing approximately how long a migration will take is key to knowing how to approach the problem.

There is a range of solutions, from simple to more complex, and the critical factor will always be time.
