Wayfair Tech Blog

The Wayfair Communications Platform: The One-Stop Shop for all Customer, Supplier, and Employee Communications

If you're a Wayfair customer or supplier, you have received our promotional and transactional messages. These come in a variety of flavors. For promotions, it could be a Thanksgiving/Black Friday sale or a Summer/July 4th deal. With transactional messages, the focus is on areas such as order updates, shipment confirmations, and cart abandonment messages.

Internally, we refer to these as Batch Comms (promotional) and Streaming Comms (transactional). Both are issued through the Wayfair Communications Platform and keep our audiences up to date. In this post, we go behind the scenes to show how the platform processes about a billion promotional and transactional messages a week and is responsible for nearly 20% of the GRS (Gross Revenue Stable).

Our internal stakeholders, such as the Marketing department, send B2C customers messages through three channels: email, SMS, and mobile app push notifications. We also send messages to B2B customers and suppliers, and we even use this platform to power many of our internal corporate communications, such as our employee newsletters. The highly available, scalable, and reliable platform that powers all of these messages across a diverse set of stakeholders is the Communications Platform (called the Comms Platform for the rest of the post).

This unified platform powers two main communication workflows. The first is Batch (scheduled) comms, which sends the promotional messages mentioned above. The second is Streaming (triggered) comms, which sends transactional messages. Both the Batch and the Streaming architectures rely on a number of services to get the job done, including Customer Intelligence, Customer Identity, Recommendations, Data Science models, Notifications infrastructure, and more. Together, these services make it possible to process approximately a billion messages a week.

Batch (Scheduled) Comms

Our Comms Platform supports Batch (Scheduled) features that ensure we deliver the right marketing message to the right person at the right time and with the right frequency. We do this by providing marketers with a set of generalized tools and flexible configuration mechanisms. These make it easy to prepare and send scheduled messages (as opposed to triggered messages sent by Streaming) through various channels (e.g., email, push).

The Comms Platform also makes it easy to process data for all brands, including Wayfair, Birch Lane, AllModern, Joss & Main, and Perigold. Because the brands differ in target audience and positioning, the platform lets marketers provide brand-specific parameters.

Message Volume

As mentioned earlier, the Comms Platform processes an enormous number of messages. In January 2022, we sent more than 240 million emails and 80 million push notifications a day at peak (major events and promotions). On low-volume days, the average was approximately 80 million emails and 11-12 million push notifications.

As you can imagine, monitoring and tracking data preparation and send progress at this volume is not an easy task, but in typical Wayfair style, we found a winning formula. The Comms Platform currently provides various metrics through Datadog, Grafana, BigQuery, SQL, and other mechanisms. The Marketing, BI, and Data Science teams can then query the data programmatically or use visual tools (dashboards, query consoles, etc.).

In addition to tracking data preparation and send progress, engineering teams use the metrics to track platform performance and compliance with defined SLAs. If the platform detects a deviation from the expected value ranges, it triggers automatic scaling events and recovery procedures. If manual intervention is required, the platform sends an alert to the operations team.
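The deviation handling described above can be sketched as a simple range check. This is only an illustration, not the platform's actual logic: the `ExpectedRange` fields, the `Action` values, and the rule that decides when automation is no longer enough (`max_auto_deviation`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    AUTO_RECOVER = "auto_recover"  # e.g. trigger a scaling event or recovery procedure
    ALERT_OPS = "alert_ops"        # manual intervention: notify the operations team


@dataclass(frozen=True)
class ExpectedRange:
    """Hypothetical expected value range for a single metric."""
    low: float
    high: float
    # Assumption: deviations larger than this are escalated to a human.
    max_auto_deviation: float


def decide_action(value: float, expected: ExpectedRange) -> Action:
    """Map an observed metric value to the reaction it should trigger."""
    if expected.low <= value <= expected.high:
        return Action.NONE
    if value < expected.low:
        deviation = expected.low - value
    else:
        deviation = value - expected.high
    if deviation <= expected.max_auto_deviation:
        return Action.AUTO_RECOVER
    return Action.ALERT_OPS
```

For instance, with an expected range of 100 to 200 and `max_auto_deviation=50`, a value of 230 would lead to automatic recovery, while a value of 20 would page the operations team.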

Key components

[Figure: Comms batch components]

Generalized Platform API

We use the "platform approach" to provide access to our data. One of the main components is our GraphQL API, which serves different clients such as web and data processing applications.

Previously, we used a legacy implementation that provided direct access to the underlying data sources and services. This undermined control over the data layer and exposed internal details. Our new API orchestrates data access: it reads from and writes to different types of sources and converts the data where needed to follow the API contract.
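To illustrate the idea, here is a minimal sketch of a facade that hides two differently shaped data sources behind one contract. It is written in Python for brevity (the production API is Java and GraphQL), and every class, field, and method name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class NotificationTemplate:
    """The API contract: the only shape that clients ever see."""
    template_id: int
    name: str
    channel: str


class TemplateSource(Protocol):
    def fetch(self, template_id: int) -> Optional[NotificationTemplate]: ...


class SqlRowSource:
    """Stand-in for a relational source that returns (name, channel) tuples."""

    def __init__(self, rows: Dict[int, Tuple[str, str]]) -> None:
        self._rows = rows

    def fetch(self, template_id: int) -> Optional[NotificationTemplate]:
        row = self._rows.get(template_id)
        if row is None:
            return None
        name, channel = row
        # Convert the source's upper-case channel codes to the contract's format.
        return NotificationTemplate(template_id, name, channel.lower())


class DocumentSource:
    """Stand-in for a document store keyed by string IDs."""

    def __init__(self, docs: Dict[str, Dict[str, str]]) -> None:
        self._docs = docs

    def fetch(self, template_id: int) -> Optional[NotificationTemplate]:
        doc = self._docs.get(str(template_id))
        if doc is None:
            return None
        return NotificationTemplate(template_id, doc["title"], doc["channel"])


class TemplateApi:
    """Facade: clients call this and never touch the sources directly."""

    def __init__(self, sources: List[TemplateSource]) -> None:
        self._sources = sources

    def get_template(self, template_id: int) -> Optional[NotificationTemplate]:
        for source in self._sources:
            found = source.fetch(template_id)
            if found is not None:
                return found
        return None
```

Because clients depend only on `NotificationTemplate`, a source can be replaced or migrated without changing any caller.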

  • Tech Stack: Java, Spring Boot, GraphQL, Kubernetes

Notifications Configuration/Tooling

The Batch Tooling (UI) is a set of web applications that its users (primarily marketers) rely on to configure and manage notifications for various channels. It includes:

  • Notification templates.
  • Content strategies for different variations and content types (events, SKUs, etc.).
  • Customer segmentation.
  • Multivariate test groups that make use of the platform's experimentation capabilities, as well as many other parameters.

Marketers can plan notification sends ahead of time and implement complex marketing strategies. The tool also supports project management by letting marketers track jobs and monitor send progress.
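As a rough picture of what such a configuration might hold, the sketch below models a notification with segments and multivariate test groups. The field names are hypothetical, and the hash-based assignment of customers to test groups is an assumption chosen for illustration, not a description of Wayfair's experimentation system.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VariantGroup:
    """One arm of a multivariate test (hypothetical shape)."""
    name: str
    weight_percent: int
    template_id: int


@dataclass(frozen=True)
class NotificationConfig:
    notification_id: int
    channel: str
    segment_ids: Tuple[int, ...]
    test_groups: Tuple[VariantGroup, ...]

    def __post_init__(self) -> None:
        # Reject configurations whose test group weights do not cover 100%.
        total = sum(group.weight_percent for group in self.test_groups)
        if total != 100:
            raise ValueError(f"test group weights sum to {total}, expected 100")


def assign_group(config: NotificationConfig, customer_id: int) -> VariantGroup:
    """Deterministically map a customer to a test group by hashing their ID."""
    key = f"{config.notification_id}:{customer_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    upper = 0
    for group in config.test_groups:
        upper += group.weight_percent
        if bucket < upper:
            return group
    raise AssertionError("unreachable: weights are validated to sum to 100")
```

Hashing keeps the assignment stable across runs, so a given customer sees the same variant every time that notification is built.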

[Figure: Comms batch calendar]

  • Tech Stack:
    • Frontend: React, JavaScript/TypeScript
    • Backend: Java, GraphQL (see "Generalized Platform API")

Data Processing Applications

Using the build schedule, the Comms Platform starts a set of data processing pipelines that are implemented as Apache Spark applications. Pulling data from various sources and services, these applications prepare personalized content for the later scheduling stage. The processing steps don't work with actual media content (e.g., text blocks and images), only with different types of identifiers (e.g., integers and strings).
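The point about identifiers can be made concrete with a small sketch. It uses plain Python rather than Spark so that it runs anywhere, and the record shape, function name, and the rule of skipping customers without recommendations are assumptions for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class PreparedContent(NamedTuple):
    """Output of a build step: identifiers only, no text or images."""
    customer_id: int
    template_id: int
    sku_ids: Tuple[int, ...]


def prepare_content(
    segment_customers: List[int],
    template_id: int,
    recommendations: Dict[int, List[int]],
    max_skus: int = 3,
) -> List[PreparedContent]:
    """Pair each customer in a segment with the SKU IDs recommended for them."""
    rows = []
    for customer_id in segment_customers:
        sku_ids = recommendations.get(customer_id, [])[:max_skus]
        if not sku_ids:
            # Assumption: customers with nothing to personalize are skipped.
            continue
        rows.append(PreparedContent(customer_id, template_id, tuple(sku_ids)))
    return rows
```

Keeping rows this small is what makes it practical to shuffle them between stages; the media content itself is resolved from the IDs later.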

[Figure: Comms batch data apps]

This is a processing-heavy part of the platform that operates on terabytes of historical data. To comply with its SLAs, the platform uses auto-scaled Spark Dataproc clusters with hundreds of worker nodes. The jobs are orchestrated by Airflow and custom overseer applications.

  • Tech Stack: Spark, Dataproc, Parquet, Hive, Google BigQuery, Aerospike, Airflow, Java, Python, GraphQL.

Notification Scheduling

Every notification has one or more "send batches" of customers, each assigned a send time. Customers are segmented into groups based on their viewing preferences (managed by the Customer Intelligence team), which the Batch Schedulers gather along with other details. The platform then schedules notifications for each subset of customers to be sent at set times during the day (e.g., at 9 am, 10 am, 8 pm, etc.). These notifications are sent for every brand (typically every 30-60 minutes per brand).
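A minimal sketch of the batching step might look like the following. It assumes each customer has at most one preferred send hour and that customers without a known preference fall back to a default hour; both assumptions, along with the function and parameter names, are illustrative only.

```python
from collections import defaultdict
from datetime import date, datetime, time
from typing import Dict, List, Optional, Tuple


def build_send_batches(
    preferred_hour_by_customer: Dict[int, Optional[int]],
    send_date: date,
    default_hour: int = 9,
) -> List[Tuple[datetime, List[int]]]:
    """Group customers into send batches keyed by their send time."""
    batches: Dict[datetime, List[int]] = defaultdict(list)
    for customer_id, hour in preferred_hour_by_customer.items():
        if hour is None:
            hour = default_hour  # no known preference: use the default slot
        send_time = datetime.combine(send_date, time(hour=hour))
        batches[send_time].append(customer_id)
    # Return batches in chronological order, with stable customer ordering.
    return [(send_time, sorted(ids)) for send_time, ids in sorted(batches.items())]
```

Each resulting `(send_time, customer_ids)` pair corresponds to one send batch that a scheduler could hand off to the sending infrastructure at the given time.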

[Figure: Comms batch scheduler]

  • Tech Stack: Spark, Dataproc, Kafka, Parquet, Hive, Aerospike, Airflow, Java, Python, GraphQL.

Streaming (Triggered) Comms

The Streaming Notifications team's mission is to provide a best-in-class automation application that enables any team at Wayfair to configure sequences of actions and notifications of real-time events quickly and safely through an intuitive user interface (UI). We designed our application to be scalable, performant, and highly reliable, making it the smartest option for a wide array of automation use cases.

The application also includes a self-service user interface, which gives a simple and intuitive way of configuring workflows and hooking them up to real-time triggers. It's so easy that operations do not require any support from engineering. Everything can also be configured in real-time, with changes going into production in seconds.

While the trigger can be anything, most triggers come from the Wayfair tracking system. The workflow is a directed graph of configurable, linked nodes, and a node can do almost anything. Some of the more common node types are the REST service request node, the condition node, and the delay node. We are also improving the application so that we can support new possibilities and use cases. We already have several non-notification workflows and are looking forward to helping our teams solve their problems more easily.

Example workflow:

[Figure: GM Flow]
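The directed-graph model described above can be sketched in a few lines. This is a simplified illustration and not the application's implementation: the node classes are hypothetical, the REST request is replaced by a plain callable, delays are recorded rather than slept, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Union

Event = Dict[str, Any]


@dataclass
class ConditionNode:
    predicate: Callable[[Event], bool]
    if_true: Optional[str]   # name of the next node, or None to stop
    if_false: Optional[str]


@dataclass
class DelayNode:
    seconds: int
    next_node: Optional[str]


@dataclass
class RequestNode:
    call: Callable[[Event], None]  # stand-in for a REST service request
    next_node: Optional[str]


Node = Union[ConditionNode, DelayNode, RequestNode]


def run_workflow(nodes: Dict[str, Node], start: str, event: Event) -> List[str]:
    """Walk the graph from `start` and return the names of the visited nodes."""
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle detected at node {current!r}")
        visited.append(current)
        node = nodes[current]
        if isinstance(node, ConditionNode):
            current = node.if_true if node.predicate(event) else node.if_false
        elif isinstance(node, DelayNode):
            # Record the delay instead of sleeping, to keep the sketch fast.
            event["delayed_seconds"] = event.get("delayed_seconds", 0) + node.seconds
            current = node.next_node
        else:
            node.call(event)
            current = node.next_node
    return visited
```

A cart abandonment flow, for example, could be expressed as a condition node ("does the cart still contain items?"), followed by a delay node, followed by a request node that asks a notification service to send the message.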

Key features:

  • Self Service: Campaigns can be created and updated by users through our User Interface. As a result, what used to take weeks to set up can now be done without having to rely on code deploys.
  • Fast Testing and Analytics: Tight integration with A/B testing allows for powerful testing with intuitive reporting. 
  • Flexible: It's designed to allow for easy integration with new services, channels, and workflow functionality.
  • Unified UI experience: The Streaming Notifications application UI is part of the Unified Notifications UI Suite, which provides all the necessary tooling for Comms Platform configuration and analysis.

Some of the workflows we support include:

  • Marketing notifications (picks for you, cart abandonment, low in stock, product view reminder, back in stock reminder)
  • Post-order notifications (order confirmation, delivery confirmation, returns, cancellations)
  • Supplier notifications (purchase order, replacements)
  • Account notifications (welcome, password change, magic link, Google One Tap)
  • And many more (direct mail, warehouse, B2B, etc.)

Our stakeholders include:

  • Marketing
  • Service Engineering
  • B2B Marketing
  • B2B Sales
  • Sales Engineering
  • CastleGate
  • Direct Mail

Conclusion

As you can see, the Comms Platform supports Wayfair's scale with rich features for our stakeholders that cover the Batch and Streaming use cases. Today we are the one-stop shop for all communications to our customers, suppliers, and employees, and we are far from done. Looking ahead, the team will make the platform more seamless and more intelligent by incorporating additional data science models and supporting additional business use cases.

For example, with Batch Comms, we will switch from monolithic processing pipelines to many independent steps with different activation triggers. As a result, some steps may start early (once all of their prerequisites have been met) and others later. This will allow us to reduce the time between the build and the message send and, in doing so, make use of the most recent events.
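One way to picture prerequisite-driven activation is to group steps into "waves", where a step becomes eligible as soon as everything it depends on has finished. The sketch below is a generic illustration of that idea under our own assumptions; the step names are invented and it does not describe the planned design.

```python
from typing import Dict, List, Set


def activation_waves(prerequisites: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; a step runs once all of its prerequisites are done."""
    done: Set[str] = set()
    remaining = dict(prerequisites)
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(step for step, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable prerequisites")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


# Hypothetical pipeline: two loads can start immediately and independently.
steps = {
    "load_segments": set(),
    "load_recommendations": set(),
    "build_content": {"load_segments", "load_recommendations"},
    "schedule": {"build_content"},
}
waves = activation_waves(steps)
```

In a monolithic pipeline these four steps would run as one unit. Split up this way, the two load steps start as soon as their own inputs are ready, which is where the time savings would come from.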

As for Streaming, the team will replace some of the tech stack components with more scalable and stable GCP alternatives, while rethinking the way we use cloud compute to improve performance and scalability.

These are just a handful of the many changes you can expect. To ensure we bring the best experiences to our stakeholders and customers, we will collaborate with multiple teams across Wayfair to deliver them.

Come Join Us

If you find our work interesting, please connect with us! We are always looking for talented engineers at all levels, as well as managers, to help build the next generation of scalable, available, reliable, observable, and well-architected platforms at Wayfair! Head over to our Careers page to see our open roles. Be sure to follow us on Twitter, Instagram, Facebook, and LinkedIn to catch a glimpse of life at Wayfair and see what it's like to be part of our growing team.
