Building a Control Plane (Part 1)
Tom Schutte
Aug 22, 2024
Since joining Amazon Web Services (AWS) as a full-time engineer in 2018, I have worked on complex distributed control plane systems. It was an easy decision when my friend Luis Galeas reached out to chat about something new he was building: Ambar. He wanted me to join the team as one of the founding engineers and build a scalable cloud-based control plane for a new data-streaming service using AWS. An explanation for the word soup will come; don't worry. His offer was a no-brainer and an exciting challenge for me to build something from scratch. At the time, I didn't realize how stimulating and formidable a problem this would be, creating not just the usual interfaces with the usual integrations but also leveraging new (to me) tools and technologies and building complex deployment management systems.
I'm excited to dive into the nitty-gritty details about what I built and how I built it. But first, we should probably clarify what a control plane is and what it does. So, yes, Mom, you can finally tell people what I do at work all day.
So, then, what exactly is a control plane? A modern web application can be broken down into three core components: the 'frontend', the pretty webpage you use to interact with the application; the 'control plane', a management and bookkeeping system that tracks and monitors everything; and lastly, the 'data plane', the part of the application that does the actual work an end user wants.
Let's use an airport as a real-world example. Here, the website where you buy your tickets and the gate agent are our frontends: they are who and what we interact with to get a ticket, make requests like upgrades, and get information about a flight in a helpful format. Since we want to travel from one point to another, the airplane is our data plane. It's the thing that does the actual work we are paying for and the thing we care most about performing as expected! Everything in between is the control plane: the airports themselves, along with their systems and infrastructure.
So then, if the control plane is the airport and its infrastructure in this analogy, what does that mean? Just as an airport is in charge of the comings and goings of airplanes and is used by customers, gate agents, and airlines alike, so too is the control plane in charge of managing our application. Control planes handle everything from creating and managing the data plane resources that customers request through frontends (websites and gate agents), like allowing airplanes into and out of the airport in our example, to things like security (TSA) and maintenance (ground crew). They are also often responsible for tracking how much and when customers should be billed.
But there's more to it. Just as an airport coordinates flight schedules, ensures the safe landing and departure of planes, manages air traffic control, and even takes care of ground services like refueling and baggage handling, the control plane does something similar in a cloud environment. It orchestrates the lifecycle of resources, ensuring they are provisioned, configured, and available when needed. It also monitors the health of these resources, scaling them up or down based on demand (like how an airport might open or close gates depending on the volume of flights).
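To make the lifecycle idea a little more concrete, here is a minimal Python sketch of that bookkeeping loop. It is a toy model for illustration only, not how Ambar's control plane is actually built: the names `ControlPlane`, `Resource`, `reconcile`, and `scale_to`, and the assumed capacity of 100 requests per second per resource, are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    REQUESTED = "requested"    # asked for, not yet provisioned
    AVAILABLE = "available"    # provisioned and healthy
    UNHEALTHY = "unhealthy"    # provisioned but failing health checks


@dataclass
class Resource:
    """A hypothetical data plane resource (an 'airplane' in the analogy)."""
    name: str
    state: State = State.REQUESTED


@dataclass
class ControlPlane:
    """Toy bookkeeping system: tracks resources and drives them to a good state."""
    capacity_per_resource: int = 100  # assumed requests/sec one resource can serve
    resources: list[Resource] = field(default_factory=list)
    next_id: int = 0

    def reconcile(self) -> None:
        """Provision anything requested and repair anything unhealthy."""
        for resource in self.resources:
            if resource.state in (State.REQUESTED, State.UNHEALTHY):
                # Stand-in for real provisioning or repair work.
                resource.state = State.AVAILABLE

    def scale_to(self, demand: int) -> None:
        """Add or remove resources so that capacity covers the given demand."""
        # Ceiling division, with a floor of one resource kept running.
        needed = max(1, -(-demand // self.capacity_per_resource))
        while len(self.resources) < needed:
            self.resources.append(Resource(f"worker-{self.next_id}"))
            self.next_id += 1
        while len(self.resources) > needed:
            self.resources.pop()


if __name__ == "__main__":
    control_plane = ControlPlane()
    control_plane.scale_to(250)   # demand rises: open more "gates"
    control_plane.reconcile()     # provision what was requested
    for resource in control_plane.resources:
        print(resource.name, resource.state.value)
```

A real control plane would run `reconcile` continuously and talk to actual infrastructure, but the shape is the same: record what should exist, compare it with what does exist, and close the gap.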
Imagine what would happen if the control plane (or the airport) failed to perform its duties. Planes would be stuck on the tarmac, flights would be delayed or canceled, and chaos would ensue. Similarly, in a cloud application, if the control plane isn't functioning correctly, new data plane resources can't be created, changed, or repaired, leading to potential downtime, degraded performance, or even security vulnerabilities. This is why the control plane is often considered the nerve center of any distributed system, ensuring everything runs smoothly. Critically, though, an airport failing to perform its duties does not generally impact flights already in the air. One of the key reasons we separate these systems is so that we can apply standards and controls to each independently and keep them as isolated as possible from each other's points of failure. Your landing may be delayed, or you may have to sit on the tarmac when you arrive and wait for a gate. But the control plane (airport) having problems does not fundamentally alter the functioning of the data plane (aircraft).
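One common way to get this kind of isolation is for the data plane to keep a local copy of whatever it last learned from the control plane, so it can keep working through an outage. The Python sketch below illustrates that idea under stated assumptions: `ControlPlaneClient`, `DataPlaneWorker`, and the `destination` setting are hypothetical names invented for this example, not part of Ambar or any AWS API.

```python
class ControlPlaneUnavailable(Exception):
    """Raised when the (hypothetical) control plane cannot be reached."""


class ControlPlaneClient:
    """Stand-in for a control plane that hands out configuration."""

    def __init__(self) -> None:
        self.healthy = True
        self.config = {"destination": "gate-12"}  # hypothetical setting

    def fetch_config(self) -> dict:
        if not self.healthy:
            raise ControlPlaneUnavailable("control plane is down")
        return dict(self.config)


class DataPlaneWorker:
    """Does the actual work using the last configuration it received."""

    def __init__(self, control_plane: ControlPlaneClient) -> None:
        self.control_plane = control_plane
        self.cached_config = control_plane.fetch_config()

    def refresh(self) -> bool:
        """Try to pick up new configuration; keep the old one on failure."""
        try:
            self.cached_config = self.control_plane.fetch_config()
            return True
        except ControlPlaneUnavailable:
            return False

    def handle(self, record: str) -> str:
        # Never calls the control plane, so an outage cannot block this path.
        return f"{record} -> {self.cached_config['destination']}"


if __name__ == "__main__":
    control_plane = ControlPlaneClient()
    worker = DataPlaneWorker(control_plane)

    control_plane.healthy = False      # the "airport" has a bad day
    print(worker.refresh())            # the update fails...
    print(worker.handle("flight-42"))  # ...but the "plane" keeps flying
```

The trade-off is that changes made during the outage are not visible until the next successful `refresh`, which mirrors the delayed landing in the analogy: existing work continues, but new instructions have to wait.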
In the case of Ambar, our challenge was to build a control plane that could efficiently manage a high volume of data streaming environments, ensuring that customer requests for new environments and resources were quickly handled and that resources were maintained with high levels of security and isolation from one another. This involved leveraging AWS services and designing robust systems to handle everything from fault tolerance to scaling and security. Just as airports can handle many planes flying in their airspace, departing and arriving quickly, so must our control plane handle multiple customer environments and concurrent requests.
In summary, the control plane is the backbone of any cloud-based system, much like how an airport is essential to the operation of air travel. Just as the airport ensures that planes take off, land, and are serviced efficiently, the control plane manages and orchestrates the resources and operations that keep an application running smoothly. It's not always the most visible part of the system, but without it, the data plane (the actual workhorse) would be unwieldy to use.
In the next part of this series, I'll explain the specifics of what we built for Ambar's control plane, from the architectural decisions we made to the technologies we leveraged and how we addressed some of the toughest challenges we encountered along the way. If you've ever wondered what goes on behind the scenes to keep complex systems running, stay tuned—it's about to get interesting.