Our flexible cluster solution - How we run micro-services efficiently
At Globality, we’ve been running Docker in production for the past ~3 years. Over the years, we made many changes to our cluster management, making it better, more flexible, cost-effective, and scalable.
In this post, I describe our solution and dive into the technical details, the challenges, etc.
What is cluster management
Kubernetes/ECS/Mesosphere are cluster managers. They allow you to run any number of tasks on top of machines that act as a cluster.
You can tweak and apply changes to any of these flavors.
In simple terms, you tell the cluster manager what to run; it figures out where to run it, restarts it if it fails, etc.
Our flavor of choice is ECS; however, the solutions and practices described in this document apply to any cluster management solution out there.
The internals of our cluster management
To understand better, let’s start with some numbers.
- We have ~140 services running in production.
- ~60 BE services -> Backed by a data store
- ~10 GW services -> Calling BE rest API
- ~50 Daemon services -> pubsub driven
- ~10 FE apps
Our services require varying levels of resources. Service X requires
16G of memory, while service Y requires
Our scaling groups don’t place just a single type of machine; they place machines based on the resources our services require.
It looks kind of like this:
Cluster management internals - going deeper
Our cluster management accepts a service, a desired count, and the resources each task needs. Based on that input, it decides where to place each task. You can also specify placement constraints such as "don’t place all tasks on the same machine" or "place tasks on a specific instance type (GPU/CPU)".
Let’s take a couple of our services as an example:
- service A. Requires 20 CPU units, 150MB of memory. Must be on an m4.large machine type. Must be distributed by AZ.
- service X. Requires 512 CPU units, 16000MB of memory. It can be on any machine. It must be distributed by AZ.
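To make the placement decision concrete, here is a minimal sketch of a scheduler that takes a service, a desired count, and resources, then spreads tasks across AZs. It is an illustration only, not the ECS scheduler or our production code: `Machine`, `Service`, and `place` are hypothetical names, and the capacity numbers in any usage are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Machine:
    name: str
    instance_type: str
    az: str
    cpu_free: int   # CPU units not yet reserved
    mem_free: int   # MB of memory not yet reserved
    tasks: list = field(default_factory=list)


@dataclass
class Service:
    name: str
    cpu: int
    memory: int
    instance_type: Optional[str] = None  # None means "any machine"


def place(service, desired_count, machines):
    """Place desired_count tasks, spreading them across AZs and machines."""
    placed = []
    per_az = {}  # tasks of this service placed per AZ during this call
    for _ in range(desired_count):
        candidates = [
            m for m in machines
            if m.cpu_free >= service.cpu
            and m.mem_free >= service.memory
            and service.instance_type in (None, m.instance_type)
        ]
        if not candidates:
            raise RuntimeError(f"no capacity for {service.name}; scale out first")
        # Prefer the AZ with the fewest tasks of this service, then the
        # machine already running the fewest copies of it.
        best = min(
            candidates,
            key=lambda m: (per_az.get(m.az, 0), m.tasks.count(service.name)),
        )
        best.cpu_free -= service.cpu
        best.mem_free -= service.memory
        best.tasks.append(service.name)
        per_az[best.az] = per_az.get(best.az, 0) + 1
        placed.append(best.name)
    return placed
```

With two m4.large machines in different AZs, two tasks of service A land on one machine each; service X skips any machine that lacks 16000MB of free memory.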
Deciding on the machine
Earlier on, we only supported a single type of machine. We had a single auto-scaling-group that scaled up/down based on resources.
As we grew, our needs changed. Our AI services required plenty of resources to run and serve traffic, while our daemons had been optimized and required far fewer resources.
Mistakes were made
One of the mistakes we made earlier on was to place BIG machines to supply the needs of our biggest, most massive services. Smaller tasks would then get placed in large quantities on those big machines.
Say we have ~600 tasks to place; those would end up on 5 machines.
Cluster management with those needs is not trivial. I would dare say it is hard.
One of the strategies we adopt to stress test it is using chaos-monkey scripts.
What is chaos-monkey?
We use a variation of Netflix’s Chaos Monkey: https://netflix.github.io/chaosmonkey/
We look at our machines, and we terminate them in no particular order. We can choose to terminate 10% or 50% of our cluster.
We then measure how long it takes to recover, how many services we lost, how much time it takes for alerts to relax, etc.
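A minimal sketch of the two halves of such a run: picking which machines to terminate, and summarising how the cluster recovered. The actual termination call and the metrics source are left out; `pick_victims` and `recovery_report` are hypothetical names, and the samples are assumed to be `(seconds_since_kill, healthy_task_count)` pairs collected by whatever monitoring you already have.

```python
import random


def pick_victims(instance_ids, fraction, rng=None):
    """Return a random sample of machines to terminate.

    fraction is the share of the cluster to kill, e.g. 0.1 or 0.5.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = rng or random.Random()
    ids = list(instance_ids)
    # Always kill at least one machine, otherwise the run tests nothing.
    count = max(1, round(len(ids) * fraction))
    return rng.sample(ids, count)


def recovery_report(samples):
    """Summarise a run from (seconds_since_kill, healthy_task_count) samples.

    The first sample is taken as the pre-kill baseline.
    """
    baseline = samples[0][1]
    low_index = min(range(len(samples)), key=lambda i: samples[i][1])
    # Recovery time is the first sample at or after the low point
    # where the healthy count is back at the baseline.
    recovered_at = next(
        (t for t, count in samples[low_index:] if count >= baseline), None
    )
    return {
        "tasks_lost": baseline - samples[low_index][1],
        "seconds_to_recover": recovered_at,
    }
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing recovery times before and after a change to the cluster.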
Ideally, users should not feel any effect if we lose a machine or two. Users should not feel anything even if we lose up to 25% of our machines.
The mistakes of big machines by default
After we moved to big machines for the AI services, running Chaos Monkey showed us that it was a mistake.
Because the machines could run many services, losing a machine meant losing a significant chunk of our traffic-serving capacity.
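The arithmetic behind this is simple. The sketch below uses the ~600 tasks on 5 machines from the example above and the roughly 20 tasks per m4.large mentioned later in this post; it assumes tasks are spread evenly, which real clusters only approximate. `blast_radius` is a hypothetical helper name.

```python
import math


def blast_radius(total_tasks, tasks_per_machine):
    """Return (machines needed, share of all tasks lost when one machine dies).

    Assumes tasks are spread evenly across machines.
    """
    machines = math.ceil(total_tasks / tasks_per_machine)
    lost = min(tasks_per_machine, total_tasks)
    return machines, lost / total_tasks


big_machines, big_share = blast_radius(600, 120)    # 5 machines, 120 tasks each
small_machines, small_share = blast_radius(600, 20)  # ~20 tasks per machine

print(f"big:   {big_machines} machines, {big_share:.1%} of tasks per lost machine")
print(f"small: {small_machines} machines, {small_share:.1%} of tasks per lost machine")
```

Under these assumptions a single big machine takes a fifth of all tasks with it, while a single small machine takes only a few percent.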
Our solution to this was choosing the smallest (performant) machine possible for every service.
The smallest machine we allow is m4.large, and the biggest is m4.4xlarge. Choosing smaller machines ensures that tasks are distributed enough; losing 20% of our cluster under normal operations is something the cluster swallows with no problems.
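Picking the smallest machine that fits a service can be sketched as a walk up the instance ladder. The table uses the nominal m4 sizes (1024 CPU units per vCPU, memory in MiB). Machines register somewhat less memory than nominal, so the `overhead_mb` parameter is an assumption you would tune for your own AMI. `smallest_fit` is a hypothetical helper, not part of any AWS API.

```python
# Nominal m4 capacity: (name, CPU units, memory in MiB), smallest first.
M4_TYPES = [
    ("m4.large", 2 * 1024, 8 * 1024),
    ("m4.xlarge", 4 * 1024, 16 * 1024),
    ("m4.2xlarge", 8 * 1024, 32 * 1024),
    ("m4.4xlarge", 16 * 1024, 64 * 1024),
]


def smallest_fit(cpu_units, memory_mb, overhead_mb=0):
    """Return the smallest allowed instance type that can hold one task.

    overhead_mb is memory assumed to be unavailable to tasks (OS, agent).
    Returns None when even the biggest allowed machine is too small.
    """
    for name, cpu, memory in M4_TYPES:
        if cpu_units <= cpu and memory_mb <= memory - overhead_mb:
            return name
    return None
```

For example, a task shaped like service A (20 CPU units, 150MB) fits on an m4.large, while one shaped like service X (512 CPU units, 16000MB) needs at least an m4.xlarge on nominal numbers and the next size up once a few hundred MB of overhead are assumed.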
Because we have huge services and multiple deployments, there are times when the cluster does not have enough resources to meet demand.
Our scaling groups are always “observing” the cluster and scheduling machines into it as needed.
The next steps
After deploying that solution successfully, we thought about the next steps.
Enter spot instances
What does our cluster management look like with spots?
Once these services are turned on or deployed, the cluster management kicks in and does the following:
Avoiding machine placement
Machine placement is expensive. Beyond the cost, it is also a slow process: it takes minutes to place machines, not seconds.
To avoid getting machines for every update, we leave headroom for every cluster. We configure every cluster to have X amount of CPU units and Y amount of memory units for new tasks. For general deployments, we usually have more than enough room to supply the need.
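The headroom idea can be sketched as two checks: does a new deployment fit in the free capacity right now, and how many machines are needed to bring the free capacity back up to the configured headroom afterwards. This is an illustration under assumed numbers; `Capacity`, `fits_now`, and `machines_to_add` are hypothetical names, and X and Y are whatever you configure per cluster.

```python
import math
from dataclasses import dataclass


@dataclass
class Capacity:
    cpu: int      # CPU units
    memory: int   # MB


def fits_now(free, incoming):
    """True when the new tasks fit without waiting for a new machine."""
    return incoming.cpu <= free.cpu and incoming.memory <= free.memory


def machines_to_add(free, headroom, machine):
    """Machines needed to restore the configured headroom.

    free is what the cluster has left, headroom is the X/Y target,
    machine is the capacity one new machine contributes.
    """
    cpu_short = max(0, headroom.cpu - free.cpu)
    mem_short = max(0, headroom.memory - free.memory)
    return max(
        math.ceil(cpu_short / machine.cpu),
        math.ceil(mem_short / machine.memory),
    )
```

The point of keeping the two checks separate is that deployments are served from the headroom immediately, while the slow machine placement happens in the background to refill it.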
Blue / Green deployment
Even as machines are placed, the old tasks are still up and running, accepting traffic. A service will not get updated until it places 200% of the desired count.
Spots are volatile; you can lose the machine at any time (AWS gives you a 2-minute warning).
How do we deal with that?
- Chaos Monkey -> We ran chaos monkey scripts: querying the cluster, abruptly killing 20-30% of the machines, and checking the results.
- Tweaking constraints -> Making sure no service can be lost (ever) if 20% of the cluster dies. Making the services AZ distributed, making them distributed across multiple machines, and more
- Tweaking the machine size -> We prefer the smallest possible machine. We prefer m4.large for most services. This gives us about 20 tasks per machine. We found that having huge machines on the cluster causes many issues.
- Cattle, not pets -> We don’t configure anything on the machine. We have a pre-baked AMI and no other configuration. The machine registers with the right cluster automatically using user-data.
Service discovery isn’t necessarily more of a challenge with spot instances. Service discovery is a challenge with every elastic cluster. We use DNS for service discovery. Each task registers itself in DNS with an SRV record. We found DNS to be the most efficient solution for our needs.
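On the client side, an SRV lookup returns a set of records with a priority, weight, port, and target. A minimal sketch of choosing one endpoint from them, following the selection rule in RFC 2782 (lowest priority first, then weighted random), is below. The DNS query itself is left out, and `SrvRecord` and `choose_endpoint` are hypothetical names rather than a description of our client library.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def choose_endpoint(records, rng=None):
    """Pick (host, port) from SRV records.

    Only records with the lowest priority value are considered; among
    those, the choice is random in proportion to weight.
    """
    if not records:
        raise LookupError("no SRV records returned")
    rng = rng or random.Random()
    best = min(r.priority for r in records)
    pool = [r for r in records if r.priority == best]
    weights = [r.weight for r in pool]
    if sum(weights) == 0:
        chosen = rng.choice(pool)  # all weights zero: pick uniformly
    else:
        chosen = rng.choices(pool, weights=weights, k=1)[0]
    return chosen.target, chosen.port
```

Because every task registers its own host and port, the client does not need to know which machine a task landed on or which port it was given.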
The implication is that any request can die if you don’t handle it properly.
For us, there are two request types: HTTPS and pubsub.
For pubsub, we wait for an ACK from the service. If the service does not ACK, we retry the message on another service.
For HTTPS, we retry the request if the connection is refused. A connection can be refused when the machine has died and DNS has not yet updated.
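The HTTPS retry can be sketched as trying the next endpoint whenever a connection is refused. A refused connection means the request never reached a service, so retrying it is safe. In this sketch `send` is a hypothetical callable that performs the request and raises `ConnectionRefusedError` on refusal. HTTP libraries often wrap that error in their own exception type, so a real client would need to unwrap it.

```python
def call_with_retry(endpoints, send, attempts=3):
    """Try endpoints in order, moving on when a connection is refused.

    endpoints: candidate addresses, e.g. resolved from DNS.
    send: hypothetical callable taking one endpoint and returning a response.
    """
    last_error = None
    for endpoint in list(endpoints)[:attempts]:
        try:
            return send(endpoint)
        except ConnectionRefusedError as err:
            # Likely a stale DNS entry pointing at a dead machine.
            last_error = err
    raise ConnectionError("all endpoints refused the connection") from last_error
```

Other failures, such as a timeout in the middle of a request, are deliberately not retried here because the service may already have started working on the request.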
When a machine is about to die, we drain all of the services from it. Requests are rarely killed mid-flight.
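One common way to learn that a spot machine is about to die is the interruption notice that EC2 exposes through the instance metadata path `/latest/meta-data/spot/instance-action`, which returns a small JSON document with an `action` and a `time`. This post does not say which mechanism we use, so treat the following as an assumption-laden sketch: it only parses a notice body and works out how long is left to drain, and `seconds_until_interruption` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(body, now):
    """Parse a spot interruption notice and return (action, seconds left).

    body: JSON text such as {"action": "terminate", "time": "...Z"}.
    now: timezone-aware current time, passed in to keep this testable.
    """
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

A drain routine would call this as soon as a notice appears and stop routing new work to the machine for the remaining seconds.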
We save 67% on our EC2 costs. I’ll use a percentage here, not raw numbers. Go to your cost-explorer and calculate how much that would be for you.
Our engineers only need to say how many resources they need. They don’t worry about the cluster being able to handle it. This allows our AI team to deploy services to production without worrying about resources or constraints. It lets them innovate and move faster.
Clusters that can lose a significant amount of resources and still function are very reliable, and they spread peace of mind in an organization like ours.