Wayfair Tech Blog

Building a Vulnerability Management Program for the Cloud


In 2018, Wayfair decided to move to a hybrid cloud infrastructure to host our storefront. We landed on Google Cloud Platform (GCP) as our infrastructure as a service provider, enabling our developer and infrastructure teams to move fast by providing a robust set of APIs and a highly flexible and scalable environment.

As a security team, one key thing that we had to consider was that our perimeter had significantly changed. We had moved to a software defined perimeter from a traditional network perimeter. In this blog post we'll discuss our approach to this transition and things we learned along the way.

Vulnerability Assessments

As we transitioned into a hybrid infrastructure model, there was a need to take a holistic view of our scanning processes.

We divided our scanning processes into two approaches:

  1. Cloud Asset Misconfiguration 
  2. Operating System

Cloud Asset Misconfiguration

Our move to a software defined perimeter meant that we could no longer rely on just operating system scanning as it would not cover other parts of our cloud environments such as bucket storage, VPCs, etc. We needed a central application to poll those respective APIs and gather information about our assets and the configurations behind them.



Figure (1a) The combination of Forseti Security and Security Health Analytics to get a full picture of Asset Misconfigurations

Asset misconfiguration scanning becomes particularly important in the case of public cloud. 

The inherent nature of the public cloud APIs that makes them highly accessible, reliable, and scalable also makes them vulnerable when assets behind these APIs are misconfigured.

Public cloud storage buckets that are created and managed via cloud storage APIs are good examples. Organizations use storage buckets widely for data lakes and static data hosting, so the buckets support a very robust Identity and Access Management (IAM) structure. This allows organizations to link various tools to these buckets and, consequently, to the data stored in them. But a robust service that offers almost unlimited ways of integration also opens up the possibility of misunderstanding a particular construct. For example, an IAM binding for `allAuthenticatedUsers` might seem harmless on its face, but in reality it means that anyone authenticated with a Google account will have access to the underlying data.
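As a rough illustration of this kind of check, the sketch below walks an IAM policy and flags bindings that grant a role to a public principal. It assumes the policy is a dictionary with a `bindings` list of `role` and `members` entries, which is the general shape of a GCP IAM policy; the function name `find_public_bindings` and the example policy are hypothetical and are not taken from Forseti or from our own tooling.

```python
# Principals that make a resource reachable beyond the organization.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def find_public_bindings(policy):
    """Return (role, member) pairs that grant a role to a public principal."""
    findings = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                findings.append((binding["role"], member))
    return findings


# Hypothetical policy for illustration only; not taken from a real bucket.
example_policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer", "members": ["allAuthenticatedUsers"]},
        {"role": "roles/storage.admin", "members": ["group:data-team@example.com"]},
    ]
}

for role, member in find_public_bindings(example_policy):
    print(f"Public binding found: {member} has {role}")
```

A real scanner would fetch each policy from the cloud APIs and route findings to an alerting system; the sketch only shows the matching logic.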

To centralize our efforts towards scanning all of these APIs, as shown in Figure (1a), we went with Forseti Security, an open source project built and maintained by Google, along with Security Health Analytics.

Security Health Analytics provided default policy checks but did not allow us to write custom rules and policies. Forseti filled that gap by allowing us to scan and monitor all of our projects with custom rules and policies.

Using Security Health Analytics and Forseti Security gave us a central location for managing alerts. It also made the alerts verbose enough for operations and engineering teams to understand why an alert was triggered, along with the steps to fix the resource that triggered it.

Operating System

In addition to running asset misconfiguration scans with modern cloud native and open source tools, we continue to scan assets using traditional scanning tools. One example is scanning virtual machines using agents from our vulnerability management provider. These agents check package versions, OS versions, and OS misconfigurations.

Our vulnerability scanners integrate with our cloud provider’s APIs using a GCP service account that has compute instance viewer and log viewer permissions in our cloud environments. This is how we keep track of our assets in the cloud.

Agent-based scanning can still be an important part of an overall vulnerability management platform when augmented with cloud API integrations. Because agents run within the compute instance, they provide more in-depth analysis.

Agent-based and agentless scanning both have trade-offs to consider. Agents are less network intrusive but more CPU intrusive, do not require credentials, and provide additional capabilities, such as compliance configuration checks, that some agentless approaches cannot achieve. Agentless scanning is easier to deploy and maintain, because agents need specific instance configuration to ensure that an agent is correctly deployed to every asset.

We still perform external network scans with our third-party vulnerability scanners against Wayfair public IPs that exist in GCP. A Python script collects our external IP addresses and uploads them to our cloud-hosted vulnerability scanner. Whether you’re using agent or agentless scans within your environment, scanning the network perimeter is important for catching external risks. The previous methods cannot catch these risks because they only operate inside the network perimeter. We use a mixture of agent, agentless, and network scanning for the best overall coverage.
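The collection step can be sketched roughly as follows. This is not our production script: it assumes instance records shaped like the Compute Engine API instance resource, where external addresses appear under `networkInterfaces[].accessConfigs[].natIP`, and the function name `collect_external_ips` and the sample data are hypothetical. The upload step is left out because it depends on the scanner vendor's API.

```python
def collect_external_ips(instances):
    """Return the sorted, de-duplicated external IPs found on the given instances."""
    ips = set()
    for instance in instances:
        for nic in instance.get("networkInterfaces", []):
            for access_config in nic.get("accessConfigs", []):
                nat_ip = access_config.get("natIP")
                if nat_ip:  # internal-only interfaces have no natIP
                    ips.add(nat_ip)
    return sorted(ips)


# Hypothetical instance records using documentation-reserved IP ranges.
example_instances = [
    {
        "name": "web-1",
        "networkInterfaces": [{"accessConfigs": [{"natIP": "203.0.113.10"}]}],
    },
    {
        "name": "batch-1",
        "networkInterfaces": [{"networkIP": "10.0.0.5"}],  # no external address
    },
    {
        "name": "web-2",
        "networkInterfaces": [{"accessConfigs": [{"natIP": "198.51.100.7"}]}],
    },
]

for ip in collect_external_ips(example_instances):
    print(ip)
```

Instances without an external address are skipped, so the scan target list only contains addresses reachable from outside the perimeter.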

We also leverage cloud native tools like Google Cloud Security Command Center to handle asset inventory and metadata among other security native functions it provides.

Base compute images used in the cloud are maintained through declarative code, where versions are incremented to produce new base images with an updated OS. In GCE this is handled via Terraform, and in Kubernetes or GKE it is handled in Helm chart deployments.

We handle immutable and mutable VM scanning differently. Mutable VMs receive a vulnerability scanning agent, and we perform a more traditional continuous scan on them. Immutable VMs themselves do not receive a vulnerability scanning agent. Instead, we use a Jenkins pipeline with a collection of Python scripts that looks for available images generated by HashiCorp Packer, and a scheduled Jenkins job that powers up each image as a VM in GCE with a deployed agent and scans it for vulnerabilities. When Packer generates new machine image templates, the templates are first uploaded to a holding queue. The scripts in our Git repository are responsible for scanning all the templates in the holding queue for vulnerabilities with our scanners. Once this job is completed, the VMs are destroyed and the process starts again on a defined schedule.
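The build, scan, and destroy cycle described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our Jenkins job: `boot_vm`, `run_scan`, and `destroy_vm` are hypothetical callables standing in for the GCE and scanner integrations, and the stubs below exist only so the sketch runs on its own.

```python
def scan_held_images(image_names, boot_vm, run_scan, destroy_vm):
    """Boot each held image as a temporary VM, scan it, and always destroy the VM."""
    results = {}
    for image in image_names:
        vm = boot_vm(image)
        try:
            results[image] = run_scan(vm)
        finally:
            # Clean up even if the scan fails, so temporary VMs never linger.
            destroy_vm(vm)
    return results


# Hypothetical stand-ins for the real cloud and scanner calls.
destroyed = []


def fake_boot(image):
    return f"vm-for-{image}"


def fake_scan(vm):
    return {"vm": vm, "findings": []}


def fake_destroy(vm):
    destroyed.append(vm)


report = scan_held_images(["base-image-v1", "base-image-v2"], fake_boot, fake_scan, fake_destroy)
print(report)
print(destroyed)
```

Wrapping the scan in `try`/`finally` is the main design point: a failed scan should not leave a temporary VM running in the environment.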

Immutable Image Scanning Pipeline


Future improvements/roadmap

Currently, reporting can require a lot of manual involvement. We plan to automate vulnerability reporting by creating a workflow that skips the Security team as a necessary first step and goes straight to the asset owners instead.

We are investigating passive approaches to scanning cloud assets that augment the use of agents with additional instance scanning capabilities through our cloud API integrations. This approach would provide greater coverage across cloud environments. It would also allow us to retire the GCE build, scan, and destroy approach for our immutable assets, since we would use API hooks into the cloud management interfaces to collect instance state information.

We currently use Rego to write security policies for our infrastructure as code development pipeline. To keep our tooling consistent, we will be transitioning to Forseti Security’s config-validator, a Golang library that evaluates GCP resources against Rego-based policies.

We also intend to improve on software registry scans by using native security tools that are included with our software registry vendor. We plan to integrate these tools within our CI/CD pipelines, git operations, and IDEs for our developers deploying containers and virtual machines.


We would like to thank and acknowledge the Private Cloud team at Wayfair for their help in setting up immutable instance scanning, as well as the maintainers of the Forseti Security project.

Interested in joining the Wayfair team? Explore our open roles here.