About Wayfair | The Road to the Cloud — Part 3: Streamlining the Workflows

As our journey continues, we have already covered testing at a smaller scale, learning throughout the process, and setting our ambitious goal to move Wayfair Storefront to the cloud!

Now, it’s time to talk about setting up efficient workflows by acknowledging the challenges, introducing standardized procedures, and finding ways of making progress without disturbing business operations.

Setting up the workstreams

I worked closely with our project manager and teams across Engineering to compile a balanced list of implementation tasks, identify platform restrictions, define the order of operations, validation plans and tentative timelines, and provide regular updates.

Compiling the list of tasks at a useful level of granularity proved to be a challenge. We needed to work with hundreds of small components in different environments. In many situations, we discovered new items and completed them within a few hours, so spending a lot of time managing the agile boards was burdensome and added to context switching.

Realizing the impact of that, we decided to physically colocate our core Infra team where possible and to focus on higher-level tasks. Higher-level tasks would take at least a day or two to complete, and we could work through smaller items via ad-hoc conversations or regular standups. Updates, QA planning and other types of outward communication were channeled through the project management organization and the leadership team.

While we lost some visibility into work in progress as we started reporting higher-level items on completion or once a week, we gained a lot of productivity by increasing our focus factor.

Are we doing the right things?

Google Cloud Platform had great and scalable components. To achieve what we wanted to accomplish and make the most of its functionality, we were constantly sharing ideas and feedback with the Google team, and contributing to open source projects.

The team was working closely with Google in many areas – talking to its Cloud team added more confidence that we would have what we needed as we got closer to the finish line.

When you embark on a journey involving Infrastructure as a Service (IaaS), one of the better things you can do is involve the IaaS provider as a partner in validating your designs against their best practices.

We worked closely with the Google Cloud team to analyze projected traffic, components used, usage models, required capacity and machine types. The teams were in constant communication, and representatives from Google product teams joined when tougher questions needed to be answered.

This type of collaboration helped our team to avoid some of the edge cases and be more confident in the design we came up with.

Building a plane while you’re flying it

The biggest challenges large companies face often aren’t related to designing or implementing new services – they’re more often about finding balanced pathways to upgrade thousands of existing systems, integrate them with newer services and do all that with minimal production disruptions.

We needed to come up with a standardized way of approaching service owners and scheduling implementation, validation and integration work individually per component, or as part of a bundle of services that typically work together. This helped us to streamline the work and assess the service readiness in a unified way.

A typical service migration questionnaire resembled the following:

What other backend services does your service rely on?
How does your service tolerate added latency if some of its dependencies are remote?
What’s the preferred path for validating the service individually?
How does running systems in parallel affect operations, capacity and integrity of your data?
How does switching from an older service to a newer one affect customer experience?
How do you synchronize caches or rule out stale data?
How do the new performance characteristics of your service affect downstream dependencies?
What’s the fallback plan?
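
One way to keep such answers comparable across teams is to record them in a structured form. The sketch below is purely illustrative and is not the tooling we used: the `ServiceAnswers` fields, the service names and the "no fallback plan means not ready" rule are all assumptions made for the example. It captures a subset of the answers and derives an order in which dependencies are validated before the services that rely on them.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ServiceAnswers:
    """Hypothetical record of one service owner's questionnaire answers."""
    name: str
    depends_on: list[str] = field(default_factory=list)
    tolerates_remote_latency: bool = False
    fallback_plan: str = ""


def validation_order(services: list[ServiceAnswers]) -> list[str]:
    """Return service names with every dependency ahead of its dependents.

    Raises graphlib.CycleError if the declared dependencies form a cycle.
    """
    graph = {s.name: set(s.depends_on) for s in services}
    return list(TopologicalSorter(graph).static_order())


def missing_fallback(services: list[ServiceAnswers]) -> list[str]:
    """Assumed readiness rule: a service without a fallback plan is not ready."""
    return [s.name for s in services if not s.fallback_plan]


# Made-up services, for illustration only.
answers = [
    ServiceAnswers("checkout", ["cart", "pricing"], fallback_plan="route back"),
    ServiceAnswers("cart", ["pricing"]),
    ServiceAnswers("pricing", tolerates_remote_latency=True,
                   fallback_plan="serve from old datacenter"),
]
order = validation_order(answers)
blocked = missing_fallback(answers)
```

Here `pricing` lands ahead of `cart`, which lands ahead of `checkout`, and `cart` is flagged because nobody has written down its fallback plan yet.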

Focusing on individual components helps introduce the services in new datacenter locations in a controlled manner, validating them individually. It also helps in reducing the variables before final “all-in” integration testing, where live traffic switches completely to your new datacenter.

I found that this approach is often the one that allows you to make measurable progress and ensure predictable delivery.

Working in a dynamic environment

At Wayfair, we change things rapidly and constantly make improvements by deploying code thousands of times a day. A fast-paced environment adds an element of entropy to the system, and you often feel like you are chasing a moving target when doing QA work.

Two typical ways of handling such situations are to either “freeze” the environment for the duration of the migration or to do it asynchronously, in a “non-blocking” way.

Freezing such a large environment for many months just wasn’t an option, as it would be detrimental to developer productivity and customer experience. A “non-blocking” approach lets you move fast by deploying a parallel infrastructure for integration testing, working on a copy of the dataset and application codebase. The trade-off is that, during a large infrastructure migration, you risk operating on stale data or older versions of the code.

The key to making such work a success is to invest in synchronization tools that validate that data is accurate, credentials aren’t stale and rapid code updates are possible. Investments at the beginning of the process will likely save businesses a lot of time fighting regressions down the road.

The following items helped us move fast while minimizing regressions:

Deploying code separately to your pre-production and production environments.
Creating automation to validate environment settings between prod and pre-prod.
Automating synchronized deploys to both environments once major issues are resolved.
Configuring database replication and enabling cache warming as a part of the same production pipeline.
Plugging pre-production systems into your monitoring and logging environments with alerting disabled.
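
As an illustration of the second item, a settings check can be as simple as diffing two key-value snapshots. This is a minimal sketch rather than the automation we built: the function name `diff_settings`, the setting names and the idea of an `ignore` list for keys that are expected to differ are assumptions made for the example.

```python
MISSING = "<missing>"  # placeholder for a key that exists in only one environment


def diff_settings(prod: dict, pre_prod: dict, ignore=()) -> dict:
    """Return {key: (prod_value, pre_prod_value)} for every setting that differs.

    Keys listed in `ignore` are skipped, since some settings (a datacenter
    name, for instance) are expected to differ between environments.
    """
    drift = {}
    for key in sorted(set(prod) | set(pre_prod)):
        if key in ignore:
            continue
        prod_value = prod.get(key, MISSING)
        pre_value = pre_prod.get(key, MISSING)
        if prod_value != pre_value:
            drift[key] = (prod_value, pre_value)
    return drift


# Made-up snapshots, for illustration only.
prod = {"db_pool_size": 50, "cache_ttl_s": 300, "datacenter": "dc-a"}
pre_prod = {"db_pool_size": 50, "cache_ttl_s": 60, "datacenter": "dc-b", "debug": True}

drift = diff_settings(prod, pre_prod, ignore={"datacenter"})
for key, (prod_value, pre_value) in drift.items():
    print(f"{key}: prod={prod_value!r} pre-prod={pre_value!r}")
```

Running a check like this on a schedule, or as a gate before the synchronized deploy, surfaces drift such as a shorter cache TTL or a debug flag left on in pre-production before it skews integration test results.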

Once the workflows had been established, the teams refocused on adjusting and optimizing the support infrastructure and the code itself.
In the next installment in this series on our road to the cloud, we’ll dive deeper into technical implementation, validation and integration work, and Way Day 2019! See you soon!
