How would you migrate a database, without your users noticing?

At Atlassian we migrate databases at scale as part of our regular server rebalancing work, to help maintain an even spread of load across the Jira database fleet. On average we migrate one thousand databases every day, all without tangible interruption to our customers. But most recently, we had to deal with a new challenge:

How would you migrate four million databases, with minimal impact to your users?

Background

Jira uses Postgres as its backing store. All your issues, projects, workflows, custom fields: everything is stored in your database – one database per Jira tenant. We have four million Jira tenants, ergo we have four million Jira databases!

One database per tenant is an uncommon architecture, and we’ve opted for it in order to maximise isolation, scalability, and operational control at Atlassian’s massive scale. It makes it much easier to ensure that data from one tenant cannot accidentally or maliciously be accessed by another, and it allows us to scale our fleet horizontally, balancing load and optimising performance for tenants of significantly different sizes.

We spread these four million databases across roughly 3,000 PostgreSQL servers in 13 AWS regions worldwide. Each of those servers is an instance of either AWS RDS for PostgreSQL or AWS Aurora PostgreSQL, so we generally use the word ‘instance’ rather than ‘server’.

We migrate databases around the Jira database instance fleet as part of regular server rebalancing to maintain an even spread of load. We perform these migrations via two methods: for small databases we take a backup and restore it on the destination instance very quickly, whereas for larger databases we set up logical replication of the target database from the source instance to the destination instance, which lets us take our time copying the data while the database and its tenant continue to operate normally. On average we migrate one thousand databases every day, with minimal interruption to our customers.
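
For the larger databases, the logical replication path looks roughly like the sketch below – a minimal illustration using Python and psycopg2, with hypothetical connection strings and publication/subscription names. A real migration also needs schema creation on the destination, sequence synchronisation, and cutover checks, since logical replication copies data but not DDL.

```python
# Minimal sketch of per-database logical replication between a source and a
# destination Postgres instance, assuming psycopg2. Connection strings and
# publication/subscription names are hypothetical.
import psycopg2

SOURCE_DSN = "host=source-rds.internal dbname=tenant_db user=migrator"   # hypothetical
DEST_DSN = "host=dest-instance.internal dbname=tenant_db user=migrator"  # hypothetical

def start_logical_replication() -> None:
    # On the source: publish every table in the tenant database.
    src = psycopg2.connect(SOURCE_DSN)
    src.autocommit = True
    with src.cursor() as cur:
        cur.execute("CREATE PUBLICATION tenant_migration FOR ALL TABLES")
    src.close()

    # On the destination: subscribing copies the existing data and then
    # streams ongoing changes, so the tenant keeps operating normally while
    # the copy takes as long as it needs.
    dst = psycopg2.connect(DEST_DSN)
    dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
    with dst.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION tenant_migration_sub "
            "CONNECTION %s PUBLICATION tenant_migration",
            (SOURCE_DSN,),
        )
    dst.close()

if __name__ == "__main__":
    start_logical_replication()
```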

We split our databases and their host instances along the lines of what the AWS Well-Architected Framework calls the “bridge model”: the vast majority of our tenants are located on shared infrastructure, with a small number of very large tenants residing on dedicated infrastructure. These very large tenants require very large amounts of resources, more than we could supply with a single RDS instance, and so they have been platformed on Aurora PostgreSQL for many years.

In late 2023 we started exploring a potential project to replatform the remainder of the Jira database fleet onto Aurora PostgreSQL as well, in order to achieve some bold cost, reliability, and performance objectives. Our RDS configuration let us use only a single instance at a time, whereas on Aurora we could use both the writer and the reader instance (or multiple readers, if necessary) simultaneously. More on this later, but in short, using twice as many instances meant we could halve our instance sizes. Much more effective! We would also be able to take advantage of Aurora’s better SLA (99.99% as opposed to 99.95%), plus autoscaling during peak loads up to 15 readers – which we do occasionally reach. The discovery work achieved positive results, so we decided to move ahead with the project.
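
To make that concrete: an Aurora cluster exposes a writer (cluster) endpoint alongside a load-balanced reader endpoint, so read-only work can be routed away from the writer. A toy sketch, with placeholder endpoint values and none of Jira’s real routing logic:

```python
# Toy sketch: on Aurora, read-only work can target the cluster's reader
# endpoint while writes go to the writer endpoint, so every instance in the
# cluster serves traffic. Endpoint values below are placeholders.
WRITER_ENDPOINT = "jira-shard-example.cluster-abc123.us-east-1.rds.amazonaws.com"
READER_ENDPOINT = "jira-shard-example.cluster-ro-abc123.us-east-1.rds.amazonaws.com"

def endpoint_for(read_only: bool) -> str:
    # Route read-only sessions to the readers; everything else to the writer.
    return READER_ENDPOINT if read_only else WRITER_ENDPOINT
```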

Migration Design

So, we had our desired outcomes and a project we believed could deliver them. But we also had additional goals along the path to get there:

  1. Minimise tenant downtime
  2. Minimise cost by controlling the amount of migration-related infrastructure
  3. Complete the migrations in a timely manner – in practice, this was a few months in total.

With these objectives in mind, we determined that the best method for migrating each RDS PostgreSQL Multi-AZ DB instance to Aurora would be to:

  1. Add an Aurora read replica to the RDS instance, and allow the replica to synchronise the underlying data to a new Aurora cluster volume; this functionality is a standard offering in RDS.
  2. Perform a cutover during a suitable window by “promoting” the new read replica to a standalone Aurora cluster (see the sketch after this list).
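
At the AWS API level, those two steps correspond roughly to the sketch below. This is a hedged illustration using boto3; the identifiers, region, instance class, and the (elided) wait-for-replica-lag logic are placeholders rather than our production tooling.

```python
# Rough sketch of the two conversion steps via the RDS API (boto3).
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is illustrative

def start_conversion(source_instance_arn: str, cluster_id: str) -> None:
    # Step 1: create an Aurora read replica cluster of the RDS instance and
    # let it synchronise the underlying data onto a new Aurora cluster volume.
    rds.create_db_cluster(
        DBClusterIdentifier=cluster_id,
        Engine="aurora-postgresql",
        ReplicationSourceIdentifier=source_instance_arn,
    )
    rds.create_db_instance(
        DBInstanceIdentifier=f"{cluster_id}-instance-1",
        DBClusterIdentifier=cluster_id,
        DBInstanceClass="db.r6g.2xlarge",  # illustrative instance class
        Engine="aurora-postgresql",
    )

def cut_over(cluster_id: str) -> None:
    # Step 2: once replica lag is (near) zero and we are inside the cutover
    # window, promote the replica to a standalone Aurora cluster.
    rds.promote_read_replica_db_cluster(DBClusterIdentifier=cluster_id)
```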

We referred to this process as “conversion” of an RDS instance to an Aurora cluster. On the surface it sounds like a textbook approach, but at Atlassian scale there are complexities!

Conversion

The first complexity is that our instances host up to about 4,000 databases each, with each database representing a unique Jira tenant. Promoting to a new, standalone Aurora cluster would mean performing that cutover in unison for all 4,000 tenants, each with its own unique connection endpoint and credentials. We would need to update the Jira application, running on EC2 instances, that connects to those database backends so that it used the new standalone cluster’s connection endpoint for each tenant, in a way that safely guaranteed we would not accidentally write to the wrong (old) database. To do so, our cutover process locked out every SQL user on the source database instance, completed the “promotion” to a separate cluster, and then unlocked all the SQL users only on the destination.
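
As a rough sketch of that lock/unlock step, assuming one Postgres role per tenant and using psycopg2 (role names and DSNs are hypothetical; our real cutover tooling is more involved):

```python
import psycopg2
from psycopg2 import sql

def set_login(dsn: str, roles: list[str], allow_login: bool) -> None:
    """Enable or disable login for a set of tenant roles on one instance."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    with conn.cursor() as cur:
        for role in roles:
            cur.execute(
                sql.SQL("ALTER ROLE {} {}").format(
                    sql.Identifier(role),
                    sql.SQL("LOGIN" if allow_login else "NOLOGIN"),
                )
            )
            if not allow_login:
                # Also drop any sessions the role still holds on the source.
                cur.execute(
                    "SELECT pg_terminate_backend(pid) "
                    "FROM pg_stat_activity WHERE usename = %s",
                    (role,),
                )
    conn.close()

# Cutover ordering: lock on the source, promote the replica cluster, then
# unlock on the destination (DSNs and role list are hypothetical):
#   set_login(SOURCE_DSN, tenant_roles, allow_login=False)
#   ... promote the Aurora read replica cluster ...
#   set_login(DEST_DSN, tenant_roles, allow_login=True)
```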

All of this was orchestrated through an AWS Step Function that performed a host of safety checks before, during, and after conversion, so that we could safely return to standard operation as soon as anything went wrong. We monitored customer traffic for a number of hours post-conversion to make sure everything was working correctly.
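
The shape of that orchestration might look something like this purely illustrative Amazon States Language outline, expressed here as a Python dict; the state names and Lambda ARNs are placeholders, not our actual definition.

```python
# Illustrative conversion state machine: every step catches failures and
# routes to a rollback state that returns the fleet to standard operation.
conversion_state_machine = {
    "StartAt": "PreflightChecks",
    "States": {
        "PreflightChecks": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:preflight-checks",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Rollback"}],
            "Next": "LockUsersAndPromote",
        },
        "LockUsersAndPromote": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:promote-cluster",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Rollback"}],
            "Next": "PostCutoverChecks",
        },
        "PostCutoverChecks": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:post-cutover-checks",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Rollback"}],
            "End": True,
        },
        "Rollback": {
            # Unlock users on the source and abandon the cutover safely.
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:rollback",
            "End": True,
        },
    },
}
```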

To make things really snappy during cutover, we used feature flags as a forcing function: when things were ready, a feature flag overrode the tenants’ database endpoints on our application servers immediately, rather than waiting for the slow periodic refresh they normally perform. Because of this, and because the initial synchronisation could be performed ahead of time, we were able to limit the actual cutover to less than three minutes apiece – well within our SLAs – even for the most massive instances we needed to convert. Excellent!
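
A rough sketch of the application-side lookup, with hypothetical flag-client and endpoint-cache interfaces rather than Jira’s actual implementation:

```python
from typing import Optional

def resolve_db_endpoint(tenant_id: str, flag_client, endpoint_cache) -> Optional[str]:
    # When an instance has just been cut over, a feature flag carries the new
    # Aurora endpoint so the change takes effect on the next connection
    # attempt, rather than waiting for the periodic registry refresh.
    override = flag_client.get(f"db-endpoint-override:{tenant_id}")
    if override:
        return override
    # Otherwise, fall back to the periodically refreshed tenant registry.
    return endpoint_cache.get(tenant_id)
```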

Our build and test phase proceeded without any major hassles – until one day our support team at AWS contacted us. We had a large test RDS instance synchronising in preparation for conversion, and AWS had been alerted that although synchronisation had completed, the new cluster had failed to start. From our side, looking at the AWS console, this failed replica still appeared to be a healthy, still-replicating instance, but AWS’s control plane had detected that the instance’s startup process had timed out.

The reason the startup had timed out was that, unknown to us at the time, there were so many files on our source RDS database instance (and therefore also on our new destination Aurora cluster volume) that the new read replica instance was timing out while performing a status check that involved enumerating all of those files. The more files, the longer the check took, and the higher the likelihood of hitting the startup timeout threshold. And we had millions and millions of files!

File counts

In Postgres, each high-level database object such as a table, index, or sequence is stored in at least one file on disk: the more tables, indexes, and sequences in your database schema, the more files you will have on disk.

Jira has a large number of these high-level database objects, which means that a single Jira database needs about 5,000 files on disk in total. With the large number of databases we co-host on each Jira database instance, we wound up creating substantially more files on our new Aurora cluster volumes than an AWS customer normally would. So even though we weren’t using up all the enormous space available on an Aurora cluster volume, we were still effectively pushing another boundary – one that impacted our ability to convert our clusters safely.
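
For a sense of where those numbers come from, the query below gives a lower bound on the files a single database contributes, by counting the relations in pg_class that have their own on-disk storage (a sketch assuming psycopg2). The true count is higher still: most tables also carry free space map and visibility map files, and any relation over 1 GB is split into additional segment files.

```python
import psycopg2

def count_storage_relations(dsn: str) -> int:
    """Lower bound on the number of on-disk files for one database."""
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # relkinds with their own storage: ordinary tables, indexes,
        # sequences, TOAST tables, and materialised views.
        cur.execute(
            "SELECT count(*) FROM pg_class "
            "WHERE relkind IN ('r', 'i', 'S', 't', 'm')"
        )
        (count,) = cur.fetchone()
    conn.close()
    return count
```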

The advice from AWS was: drastically reduce your file counts on your RDS instances if you want to perform safe RDS-to-Aurora conversions. The only ways to reduce the file counts on the cluster volume were either to reduce the number of files per database, or to reduce the number of databases on a given instance. Because it wasn’t really going to be possible to drastically reduce the number of files per database (we do need to actually store our tenants’ tables, after all!), the only path available to us was to reduce the number of tenants on instances to be converted, which we referred to as “draining”.

AWS also advised that once an RDS instance had been successfully converted to an Aurora cluster, we were free to re-populate it with tenant databases, up to sensible limits. This was important to us, since one of our objectives was to minimise migration-related infrastructure: it meant we only needed to provision enough spare capacity in the fleet to accommodate the tenants being drained from in-flight conversions. Once those conversions were complete, we could begin refilling the new clusters on the other side.

Our plan was simple:

  1. Drain an instance to a suitable minimum number of tenants
  2. Convert that instance to an Aurora cluster
  3. “Refill” the new cluster by using it as a target for tenants being migrated from the next instances being drained.

This approach effectively established a daisy chain of instances and clusters performing a drain → promote → refill loop, and allowed us to keep the overhead of excess infrastructure low during the migrations.
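
In outline, the daisy chain amounts to the loop below; the tooling facade and its methods are hypothetical stand-ins for our draining and conversion machinery.

```python
def migrate_region(tooling, rds_instances, aurora_destinations, file_count_target):
    """Daisy-chain outline; `tooling` is a hypothetical facade, not real APIs."""
    for source in rds_instances:
        # Drain: move the smallest tenants to existing Aurora clusters until
        # the source instance is under its file count target.
        while tooling.file_count(source) > file_count_target:
            tenant = tooling.pick_smallest_tenant(source)
            destination = tooling.pick_least_loaded(aurora_destinations)
            tooling.migrate_tenant(tenant, source, destination)

        # Convert: promote the drained RDS instance to a new Aurora cluster.
        new_cluster = tooling.convert_to_aurora(source)

        # Refill: the fresh cluster becomes a destination for subsequent drains.
        aurora_destinations.append(new_cluster)
```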

Begun, the draining had!

Draining

Following the advice from AWS, we built a tool to orchestrate our business-as-usual database migration tooling, but on steroids: the number of databases we needed to move to reach our file count targets was in the millions, and we didn’t want to waste any time. We analysed our tenant distributions and prioritised the smallest tenants with the least usage, to reduce the amount of data we’d need to migrate (and therefore increase our throughput). At its peak, our tool averaged 38,000 migrations a day, topping out at nearly 90,000 – a hefty scale-up on its usual 1,000! All of this while still meeting our high reliability targets.

As RDS DB instances were drained below their file count thresholds, the tool would move on to other targets, leaving the “drained” instances ready for our conversion process to kick off the data sync and await a suitable cutover window. Following cutover, the source RDS DB instances were cleaned up automatically after a short safety period.

Race to the finish

Once the conversion and draining processes were established, the flow was just a large numbers game in a well-oiled machine. There were a lot of numbers! Source concurrency was the number of outbound database migrations we allowed in parallel per RDS instance, and destination concurrency was the number of inbound migrations we allowed in parallel per Aurora cluster. Controlling both types of concurrency was key to making sure we didn’t impact normal operations during the migration process.
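
A minimal sketch of enforcing those caps, using asyncio semaphores keyed by source and destination; the limits and the migrate_fn callable are illustrative, not our actual values or tooling.

```python
import asyncio
from typing import Awaitable, Callable

SOURCE_CONCURRENCY = 4       # outbound migrations allowed per RDS instance (illustrative)
DESTINATION_CONCURRENCY = 8  # inbound migrations allowed per Aurora cluster (illustrative)

source_slots: dict[str, asyncio.Semaphore] = {}
destination_slots: dict[str, asyncio.Semaphore] = {}

async def migrate_with_limits(
    tenant: str,
    source: str,
    destination: str,
    migrate_fn: Callable[[str, str, str], Awaitable[None]],  # the actual migration step
) -> None:
    src_sem = source_slots.setdefault(source, asyncio.Semaphore(SOURCE_CONCURRENCY))
    dst_sem = destination_slots.setdefault(destination, asyncio.Semaphore(DESTINATION_CONCURRENCY))
    # A migration proceeds only when both its source and its destination have
    # a free slot, so neither side is overwhelmed during normal operations.
    async with src_sem, dst_sem:
        await migrate_fn(tenant, source, destination)
```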

Draining throughput was largely determined by how many Aurora clusters – i.e., destinations – were available in a given region: we found that pushing destination concurrency too high began to impact the destination clusters’ performance, well before we reached any limits on source concurrency. Since each Aurora destination cluster had a ceiling on its throughput rate, the way to boost overall throughput was to have more Aurora clusters. Ultimately, we had to strike a balance between how much additional infrastructure we wanted in each region to perform the migrations (and how much that would cost), and how long we were comfortable with each region taking to complete. We wanted the project to run reasonably lean, so to maintain throughput rates we kept good discipline in converting drained RDS instances to Aurora clusters – thus making them available as destinations for draining other RDS instances – as soon as possible!

We projected daily targets based on a sliding window of actuals, and tinkered here and there as needed to hit our milestones. Ultimately, we were able to complete the project ahead of schedule!
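
The projection itself was simple arithmetic, along these lines (numbers are illustrative):

```python
# Toy projection: average the last `window` days of completed migrations and
# estimate how many days the remaining backlog will take at that rate.
def project_days_remaining(daily_actuals: list[int], remaining: int, window: int = 7) -> float:
    recent = daily_actuals[-window:]
    rate = sum(recent) / len(recent)  # average migrations per day over the window
    return remaining / rate

# e.g. project_days_remaining([35_000, 38_000, 42_000], remaining=900_000) ≈ 23.5 days
```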


End Game

We achieved our migration targets on schedule, but what did it give us? It’s worth sharing one more data point:

Keen eyes may notice that we now have nearly triple the number of Aurora cluster instances that we had RDS database instances, and ask why: the answer is that through Aurora we were able to halve our instance class size and still get better-performing database instances. Our standard instance size went down from m5.4xlarge on RDS to r6.2xlarge on Aurora: we kept the same memory size but reduced the number (and type) of CPUs (an m5.4xlarge offers 16 vCPUs and 64 GiB of memory, while a 2xlarge in the r6 family offers 8 vCPUs and the same 64 GiB). Was this loss of CPUs significant? Well, no, not really. Recall the earlier point about being able to use all instances in an Aurora cluster versus just the primary in our old RDS configuration: we effectively have the same number of CPUs, or more during scale-out events! So, during peak hours our clusters scale out; off-peak, they scale back in and greatly reduce our instance footprint.

Count of instances over time in a typical Jira Aurora cluster, where the peaks correspond with local business hours and troughs with off-peak.

Although the path was long and technically challenging, we achieved our ambitious cost-saving target and improved our reliability and performance in doing so. It was a great win for databases at Atlassian!

A final note: we’d like to extend a special thanks to the good folks at AWS who have been tireless in their continued collaboration with us, working with our abnormally high file counts and striving to ensure the best possible experience with Aurora.
