Keeping up with Wayfair’s blisteringly fast year over year growth is a blessing, and alas, also incredibly challenging. E-commerce companies naturally experience their busiest time during the holiday season, and our engineering teams are often thinking about the next holiday season as soon as the previous one ends.
To scale with our growing number of customers, we shard our customers’ data across nine customer shards. This data includes, but is not limited to: Registry data, Idea Boards (wishlist) data, basket data, and customers’ personal data. All data pertaining to a single customer is stored on one shard, which offers us both performance and resiliency advantages.
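Because every customer's data lives on exactly one shard, routing a query means resolving the customer to their shard first. Here's a minimal sketch of that idea, assuming a directory-style lookup (the shard names, hosts, and in-memory dictionary are illustrative inventions, not Wayfair's actual implementation):

```python
# Hypothetical shard directory: names and hosts are invented for
# illustration; a real system would back this with a lookup table
# or service rather than in-memory dicts.
SHARD_HOSTS = {
    "bo_shard_1": "boston-db-1.example.internal",
    "se_shard_1": "seattle-db-1.example.internal",
    "ie_shard_1": "ireland-db-1.example.internal",
}

# Maps each customer to the single shard that owns all their records.
CUSTOMER_DIRECTORY = {
    101: "bo_shard_1",
    202: "se_shard_1",
}

def shard_for_customer(customer_id: int) -> str:
    """Return the database host that owns every record for this customer."""
    shard_name = CUSTOMER_DIRECTORY[customer_id]
    return SHARD_HOSTS[shard_name]
```

A directory lookup (rather than, say, hashing the customer ID) is what makes moving individual customers between shards possible: relocating a customer is just a data copy plus a directory update.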
Strategic Refactoring, Shard-Style
In preparation for Holiday 2017, we found ourselves needing to re-evaluate our sharding strategy. At the time, we had five shards located in Boston (serving East Coast American and Canadian customers), Seattle (serving West Coast American customers), and Ireland (serving European customers). Our rethink outlined the need for more space to accommodate our growth, which might seem relatively straightforward. We added four new shards, distributing them between our worldwide locations.
However, the next task was not as straightforward. With four brand-new shards housing little to no customer records, our older shards were working disproportionately hard to serve our customers. To achieve balance in time for the holiday shopping season, we needed to rebalance the load on each customer shard. In other words, we needed to move customer records between shards.
Our previous customer distribution of the Boston shards can be seen below:
And what were we aiming to achieve? Our goal customer distribution for Boston would look something like this:
This goal, this exceptional-looking pie chart dream, was mired in complex problems, both in planning and execution. The concept of moving a customer’s records between shards had been explored, and even implemented, before, but with little success: the process was fragile and incredibly difficult to maintain. At this point, with the holidays only a few months away, we needed to be able to move customer records quickly and in parallel. Moving customers one at a time, even at a second per customer, would have been entirely too slow -- it would have taken us until the year 2100 to complete the task!
Mo’ Shards, Mo’ Problems?
To get there, we first had to decide which customers to move between shards. We started by moving all customers who were squatting in the wrong shard; for example, any customer living on the American West Coast whose records lived in a Boston shard got their records migrated to a shard in Seattle. Beyond logical moves such as these, we also decided to randomly select customers. The randomness was crucial to ensure the database load was balanced: customers with older records tend to be less active, so moving only recent records would have disproportionately tipped the scale.
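The two selection passes described above can be sketched as follows. This is a simplified model with invented names, not our production code: it first takes every geographically misplaced customer, then tops up with a uniform random sample so that old and new records are equally likely to move.

```python
import random

def select_customers_to_move(customers, shard_region, target_count, seed=42):
    """Pick customers to migrate off a shard.

    customers: list of (customer_id, region) tuples living on this shard.
    shard_region: the region this shard is meant to serve.
    target_count: total number of customers we want to move.
    """
    # Pass 1: anyone whose region doesn't match the shard's region moves.
    misplaced = [cid for cid, region in customers if region != shard_region]
    # Pass 2: fill the remainder with a uniform random sample, so older
    # (typically less active) records are as likely to move as newer ones.
    remaining = [cid for cid, region in customers if region == shard_region]
    rng = random.Random(seed)  # seeded here only for reproducibility
    extra = max(0, target_count - len(misplaced))
    return misplaced + rng.sample(remaining, min(extra, len(remaining)))
```

Sampling uniformly over all remaining customer IDs, rather than by signup date or activity, is what keeps the post-move load balanced across shards.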
The bulk of the complexity lay in the move process itself, and in anything that happened immediately before or after a move. Given the varying table schemas of customer information, there was no unifying, easy way to retrieve information for a specific customer. Some tables had a column indicating which customer a certain row of data corresponded to, but more than half the tables had to be joined in some fashion to ensure we retrieved the correct rows. We didn’t want to write these join queries manually, nor did we want engineers who were adding new sharded tables to the database to have to write queries to add to this script.
We opted to create a schema that stored, for every sharded table, the other table(s) it had to be joined to in order to eventually connect back to a customer ID. This enabled us to write a script that would loop through all the tables, dynamically write both simple and joined queries (even at several levels), and retrieve, insert, and delete data without an engineer having to write the query themselves. If an engineer were adding a new sharded table to the database, they would only have to add their table to the schema, which was a quick, two-minute change.
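To make the idea concrete, here is a minimal sketch of such schema-driven query generation. All table and column names are hypothetical, the metadata is shown as an in-code dictionary rather than database-resident schema, and a production version would use bound parameters instead of string interpolation:

```python
# Hypothetical metadata: each sharded table either carries customer_id
# directly, or declares the join needed to reach a table that does.
TABLE_SCHEMA = {
    "customer_address": {"customer_column": "customer_id"},
    "order_item": {
        "join": {
            "parent": "orders",
            "on": ("order_id", "order_id"),   # (child column, parent column)
            "customer_column": "customer_id",
        }
    },
}

def build_select(table: str, customer_id: int) -> str:
    """Generate the SELECT that fetches one customer's rows from `table`.

    Illustration only: real code should use parameterized queries, not
    values interpolated into SQL text.
    """
    meta = TABLE_SCHEMA[table]
    if "customer_column" in meta:
        # Simple case: the table references the customer directly.
        return (f"SELECT * FROM {table} "
                f"WHERE {meta['customer_column']} = {customer_id}")
    # Joined case: walk through the parent table to reach customer_id.
    join = meta["join"]
    child_col, parent_col = join["on"]
    return (f"SELECT {table}.* FROM {table} "
            f"JOIN {join['parent']} ON {table}.{child_col} = "
            f"{join['parent']}.{parent_col} "
            f"WHERE {join['parent']}.{join['customer_column']} = {customer_id}")
```

With metadata in one place, adding a new sharded table really is just a schema entry: the generator picks it up on the next run with no hand-written SQL.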
Once we were able to move customer records, we quickly realized that some customer moves were causing primary key ID collisions between shards, and that in some cases we needed to reseed a table prior to inserting a new record. If a customer we moved had a very high ID number, one even higher than the largest on the destination shard, the destination shard would lose its correct seed until our automated jobs could detect and fix the error.
If seeded incorrectly, a table would start generating identifiers that collided with those from the same table on a different shard, which can cause problems elsewhere in our systems, where identical identifiers belonging to two separate customers is not a supported use case. How did we remedy this? Prior to moving customers, we checked whether any of the rows we wanted to move had primary keys larger than the largest primary key in the destination table. If so, we reseeded the destination table so that its seed was higher than the largest incoming primary key.
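The pre-move check reduces to a single comparison per table. Here's a sketch, assuming SQL Server-style `DBCC CHECKIDENT ... RESEED` syntax for the generated statement (other databases manage identity/sequence values differently, and the function name is our own invention):

```python
from typing import Optional

def reseed_statement_if_needed(table: str,
                               incoming_max_id: int,
                               destination_max_id: int) -> Optional[str]:
    """Return a reseed statement for the destination table, or None.

    If the largest primary key among the incoming rows exceeds the
    destination table's current maximum, the identity seed must be
    bumped past the incoming maximum so future inserts cannot collide.
    """
    if incoming_max_id <= destination_max_id:
        return None  # destination's seed is already ahead; nothing to do
    # After RESEED to N, the next generated identity value is N + 1.
    return f"DBCC CHECKIDENT ('{table}', RESEED, {incoming_max_id})"
```

Running this check before each move, rather than relying on after-the-fact detection, closes the window in which a mis-seeded table could hand out an ID already in use on another shard.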
Happy Shards, Happy Bards
Handling error cases efficiently, such as the case where a move fails midway through, was crucial to ensure we kept a customer’s business. Though we called it “moving customer records”, in truth it was more akin to a “copy and delete” operation, which allowed for much more graceful error handling: if a move failed midway through, we would simply clean up the orphaned data on the destination shard, leaving all records on the “origin” shard intact so that it remained the source of truth.
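The copy-then-delete flow above can be sketched as a small control loop. This is a simplified model (the function signatures and shard labels are invented for illustration): origin rows are only deleted after every table copies cleanly, and any failure triggers cleanup on the destination side alone.

```python
def move_customer(customer_id, tables, copy_fn, delete_fn):
    """Move one customer's rows via copy-then-delete.

    copy_fn(table, customer_id, src_shard, dst_shard) copies rows;
    delete_fn(table, customer_id, shard) removes rows on one shard.
    Returns True on success, False if the move was rolled back.
    """
    copied = []
    try:
        for table in tables:
            copy_fn(table, customer_id, "origin", "destination")
            copied.append(table)
    except Exception:
        # Failure mid-move: delete only the orphaned destination rows.
        # The origin shard is untouched and remains the source of truth.
        for table in copied:
            delete_fn(table, customer_id, "destination")
        return False
    # All tables copied cleanly; now it's safe to remove the origin copy.
    for table in tables:
        delete_fn(table, customer_id, "origin")
    return True
```

The key property is that at every instant there is one complete, authoritative copy of the customer's data, so a crash at any step leaves nothing to reconstruct, only orphans to delete.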
With some great teamwork, we were able to achieve our pie-in-the-sky distribution:
This only begins to scratch the surface of the types of issues and impact that moving customer records can have. With our customer experience at the center of our business, this project represented a big risk, but for a big reward. We were thrilled to meet (and even exceed) our goals for scaling our customer shards, and rather merry at being able to get through our peak shopping season unscathed by database sharding issues. But, at Wayfair, we are never done, and as the 2017 peak holiday season was wrapping up, 2018 planning was already underway. There will be more bard-like tales of our sharding adventures to come!