Mastering Database Scalability: Figma's Approach to Vertical Partitioning, Explained
A Deep Dive into How Figma Strategically Vertically Partitioned Their Database to Improve Performance
Figma used this strategy to vertically partition their database.
Learn the tradeoffs for your next System Design Interview 👇
Definition
Vertical partitioning splits a database by moving groups of tables and/or columns into a separate database.
Background
Initially, Figma ran on a single large Amazon RDS database. They observed upwards of 65% CPU utilization during peak traffic, and database latencies became increasingly unpredictable.
First Tactical Fixes
Upgrade the database to the largest instance available (from r5.12xlarge to r5.24xlarge) to maximize CPU utilization runway
Create multiple read replicas to scale read traffic
Establish new databases for new use cases to limit growth of the original database
Add PgBouncer as a connection pooler to limit the impact of a growing number of connections
On further analysis, they learned:
Writes contributed a significant portion of database utilization.
Not all reads could be moved to replicas due to application sensitivity to replication lag.
Hence, they still needed to offload more work from the original database.
Possible Solutions
1. Horizontal Scaling the Database
They observed that many popular managed solutions are not natively compatible with Postgres.
So, they would either have to find a Postgres-compatible managed solution, or self-host.
2. Migrating to NoSQL
It would require:
A complex double read-and-write migration
Significant application-side changes
3. Postgres-compatible NewSQL
They would’ve had one of the largest single-cluster footprints for cloud-managed distributed Postgres.
So they didn’t want to bear the burden of being the first customer to hit certain scaling issues while having little control over managed solutions.
4. Self Hosting
It would mean
Upfront work to acquire the training, knowledge, and skills to support self-hosting.
Large operational cost
5. Vertically Partition the DB
They would move groups of tables onto their own databases.
This proved to have both short- and long-term benefits:
It relieves load on the original database
It provides a path forward for horizontally sharding subsets of tables in the future
How they implemented Vertical Partitioning
They first needed to identify tables to partition into their own databases. They chose tables based on the following factors:
Impact: Moving the tables should move a significant portion of workload
Isolation: The tables should not be strongly connected to other tables
1. Measuring impact
They looked at Average Active Sessions (AAS) for queries (average number of active threads dedicated to a given query at a certain point in time).
They calculated this by querying pg_stat_activity at 10-millisecond intervals to identify the CPU waits associated with each query, and then aggregated the results by table name.
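For illustration, a minimal sketch of the kind of sampling query this approach could be built on (the exact filters and tooling here are assumptions, not Figma's code):

```sql
-- Count sessions that are actively executing and not blocked on a wait event
-- (a rough proxy for "on CPU"), grouped by query text. An external script
-- would run this repeatedly (e.g. every 10 ms) and aggregate by table.
SELECT
    query,
    count(*) AS active_sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NULL          -- no wait event ≈ running on CPU
  AND pid <> pg_backend_pid()     -- exclude the sampling query itself
GROUP BY query;
```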
2. Measuring isolation
Their backend is Ruby-based and uses ActiveRecord to write queries.
They created validators that hooked into ActiveRecord and sent production query and transaction information to Snowflake, looking for queries and transactions that consistently referenced the same group of tables.
If these workloads turned out to be costly, those tables were identified as prime candidates for vertical partitioning.
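As a purely hypothetical sketch of that analysis, assuming the logged metadata lands in a Snowflake table query_log with transaction_id and table_name columns (names invented for illustration):

```sql
-- Collapse each transaction to the set of tables it touched, then count how
-- often each exact group of tables appears. Groups that recur frequently and
-- rarely mix with other tables are candidates for vertical partitioning.
WITH txn_tables AS (
    SELECT
        transaction_id,
        listagg(DISTINCT table_name, ',')
            WITHIN GROUP (ORDER BY table_name) AS table_group
    FROM query_log
    GROUP BY transaction_id
)
SELECT
    table_group,
    count(*) AS txn_count
FROM txn_tables
GROUP BY table_group
ORDER BY txn_count DESC;
```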
Table Migration
The data movement had to be coordinated across thousands of application backend instances so that queries could be routed to the new database at the correct moment.
They wanted a solution that met the following goals:
Limit potential availability impact to <1 minute
Automate the procedure so it is easily repeatable
Have the ability to undo a recent partition
Solution
At a high level, they implemented the following
Prepare client applications to query from multiple database partitions
Replicate tables from the original database to a new database until replication lag is near zero
Pause activity on original database
Wait for databases to synchronize
Reroute query traffic to the new database
Resume activity
1. Preparing client applications
They leveraged PgBouncer and created separate PgBouncer services to virtually split traffic.
Partitioning the PgBouncer layer first would give clients leeway to route queries incorrectly.
They would be able to detect the routing mismatch, but since both PgBouncers have the same target database, the client would still successfully query data.
The start state looks as follows:
State of the database after partitioning PgBouncer
State of the database after partitioning data
Once they verified that applications were prepared with separate connections for each PgBouncer (and were sending traffic appropriately), they proceeded.
2. Data Replication
There are two ways to replicate data: Streaming Replication or Logical Replication.
They chose logical replication because it allowed them to:
Port over a subset of tables (see the sketch after this list), so they could start with a much smaller storage footprint in the destination database
Replicate to a database running a different Postgres major version, meaning they could perform minimal-downtime major version upgrades with this tooling
Set up reverse replication, which allowed them to roll back the operation
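As a rough sketch of the setup (table names, hosts, and credentials are illustrative, not Figma's), logical replication of a subset of tables uses a publication on the source and a subscription on the destination:

```sql
-- On the original (source) database: publish only the tables being moved.
CREATE PUBLICATION moved_tables_pub FOR TABLE files, file_comments;

-- On the new (destination) database: subscribe to that publication.
-- The same mechanism, pointed in the opposite direction, is what provides
-- the reverse replication stream used for rollback.
CREATE SUBSCRIPTION moved_tables_sub
    CONNECTION 'host=source-db dbname=figma user=replication_user'
    PUBLICATION moved_tables_pub;
```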
Working with terabytes of production data, logical replication was initially quite slow: it copies rows in bulk but inefficiently updates indexes one row at a time.
So they removed the indexes in the destination database and rebuilt them after the initial copy, reducing the copy time to a matter of hours.
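A hedged sketch of that optimization, using a hypothetical index:

```sql
-- Illustrative only: drop secondary indexes on the destination tables before
-- the initial copy, then rebuild them once the bulk copy has caught up.
DROP INDEX IF EXISTS files_parent_id_idx;
-- ... initial logical replication copy runs here ...
CREATE INDEX CONCURRENTLY files_parent_id_idx ON files (parent_id);
```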
Modifications to the new database were also replicated back to the old database (via a reverse replication stream), in case they needed to roll back.
3. Coordinating Query Rerouting
They performed the sharding operation in two phases (partitioning PgBouncers first, then the data).
Overview of the operation:
They briefly stopped all relevant database traffic so that logical replication could synchronize the new database
PgBouncer paused new connections and revoked clients’ query privileges on the partitioned tables in the original database.
After a brief grace period, they canceled any remaining in-flight queries.
With traffic paused, they used LSNs (Log Sequence Numbers) to determine whether the two databases were synchronized: they sampled an LSN from the original database and waited for the replica to replay past it. At that point, the data was identical in both the original and the replica (a sketch of this check follows below).
A visualization of the synchronization mechanism
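A minimal sketch of such an LSN check, reusing the illustrative subscription name from earlier (the real orchestration was automated tooling, not manual SQL):

```sql
-- 1. On the original database (with traffic paused), sample the current WAL position.
SELECT pg_current_wal_lsn();          -- suppose this returns '0/3BCD0010'

-- 2. Poll until the subscriber has confirmed replay past the sampled LSN.
--    (By default the subscription's replication slot shares its name.)
SELECT confirmed_flush_lsn >= '0/3BCD0010'::pg_lsn AS caught_up
FROM pg_replication_slots
WHERE slot_name = 'moved_tables_sub';
```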
After confirming that the replica was synchronized, they stopped replication and promoted the replica to a new database.
Reverse replication was set up as previously mentioned. Then they resumed traffic in PgBouncer, with queries now routed to the new database.
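A hedged sketch of that final step, again with the illustrative names used above:

```sql
-- On the new database: stop consuming the forward replication stream.
ALTER SUBSCRIPTION moved_tables_sub DISABLE;

-- On the new database: publish the moved tables so changes can flow back.
CREATE PUBLICATION moved_tables_reverse_pub FOR TABLE files, file_comments;

-- On the original database: subscribe, skipping the initial copy since the
-- data already exists there; only new changes need to flow back.
CREATE SUBSCRIPTION moved_tables_reverse_sub
    CONNECTION 'host=new-db dbname=figma user=replication_user'
    PUBLICATION moved_tables_reverse_pub
    WITH (copy_data = false);
```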
A summary of the procedure
Conclusion
They observed a ~30 second period of partial availability impact (~2% of requests dropped).
Each database partition is operating with greatly increased headroom; their largest partition's CPU utilization hovers around 10%.
Understanding the Foundational Principles is the key to Effective Learning!
Follow along to Improve System Design Skills.