Avoiding Uncontrolled Container Scaling Costs
Key Points
- The main issue discussed is “scaling gone wild,” where improperly configured auto‑scaling policies cause excess worker nodes to remain active, leading to unexpectedly high costs.
- Critical microservices (e.g., load balancers, monitoring, logging) are often deployed onto these nodes, preventing the cluster from scaling down because the services are marked as essential.
- Proper configuration of auto‑scaling policies is the first step, ensuring the cluster can expand for peak events (like Black Friday) and contract when demand drops.
- Comprehensive observability—capturing telemetry from the application layer through the underlying infrastructure—is needed to monitor resource usage and confirm that scaling actions are appropriate.
- Automated alerting (via email, Slack, SMS, etc.) must be set up so that the right teams are notified instantly, enabling rapid response to scaling anomalies before they inflate costs.
Sections
- Scaling Gone Wild: Stuck Nodes - The expert explains that misconfigured auto‑scaling policies and the deployment of critical microservices onto worker nodes prevent those nodes from being terminated, resulting in unexpectedly high costs.
- Right Tool for Role - The speaker uses a hammer analogy to argue that assigning developers tools like a Kubernetes cluster—unsuitable for their expertise—reduces productivity, highlighting the importance of aligning tooling and responsibilities with each role.
- Kubernetes Resource Limits & Scaling - The speakers explain that perpetual crashes stem from missing CPU/memory limits in pods, emphasizing the need for proper resource planning, a holistic view of all microservices, and coordinated horizontal/vertical autoscaling that respects the capacity of the underlying physical infrastructure.
Full Transcript
Source: [https://www.youtube.com/watch?v=HDTqhqaF8L8](https://www.youtube.com/watch?v=HDTqhqaF8L8) · Duration: 00:08:52
Section timestamps: [00:00:00](https://www.youtube.com/watch?v=HDTqhqaF8L8&t=0s) Scaling Gone Wild: Stuck Nodes · [00:03:04](https://www.youtube.com/watch?v=HDTqhqaF8L8&t=184s) Right Tool for Role · [00:06:15](https://www.youtube.com/watch?v=HDTqhqaF8L8&t=375s) Kubernetes Resource Limits & Scaling
Welcome to Lessons Learned.
We have a special edition today which is on container problems.
We're going to explain what had happened and how you could potentially avoid them.
With us today is an expert on it, Chris Rosen.
OK, Container Expert, present our first problem,
"Scaling gone wild." What exactly is the story behind this?
So envision that your application is successful.
That's a good thing.
All of us want our applications, our tools to be successful and grow.
So when we scale up, we're accommodating the resources required to run these workloads.
That's a good thing.
So far so good.
We are adding resources, worker nodes to that cluster.
However, at some point we start to incur a large bill
because those resources are no longer required
and we're not automatically scaling them back down.
I see, so we want these to go away at some point, but they're not.
And presumably there's a cause.
What's behind that cause?
So the cause generally is that
we've not configured the auto-scaling policy properly
and we're deploying cluster-critical microservices onto these worker nodes.
Maybe that's your application load balancer,
maybe it's monitoring and logging.
But when we do that, when we deploy these microservices to those worker nodes,
we can't automatically delete them because they are critical microservices to run that cluster.
I see, so you can't scale back down
because these are critical services and are marked as such.
Exactly.
Got it.
OK, so if using the correct configuration is step one, what else could they have done?
So clearly, like you said, step one is setting the right auto-scaling policy
so that way we can meet the demand for a Black Friday event,
for a weather event, something else that's going to drive unexpected capacity.
But we also want to set it up so that those critical components aren't deployed onto the autoscaled nodes and the cluster can scale back down.
So the configuration is very important.
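One concrete way to keep critical services off autoscaled nodes is to pin them to a fixed node pool with a node selector. The sketch below is illustrative only: the `pool: stable` label and the image name are assumed examples, not Kubernetes defaults, though the `safe-to-evict` annotation is a real Cluster Autoscaler hint.

```yaml
# Hypothetical sketch: pin a critical monitoring Deployment to a fixed
# (non-autoscaled) node pool so it never blocks scale-down of burst nodes.
# The "pool: stable" label and image name are assumed examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
      annotations:
        # Real Cluster Autoscaler annotation: allow eviction if this pod
        # ever lands on an autoscaled node anyway.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      nodeSelector:
        pool: stable   # schedule only onto the fixed node pool
      containers:
        - name: agent
          image: example.com/monitoring-agent:1.0   # placeholder image
```

With critical pods isolated like this, the autoscaler is free to reclaim the burst nodes once demand drops.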
Now, how do we monitor, how do we get the insights, the telemetry to how those applications are performing?
And that's where observability comes into play.
Because in this new container world,
we want insights throughout the entire stack, infrastructure all the way up to our containers.
So we want to make sure they have the resources that are required for them, but not too much.
And that's how we gain insight into the cluster and scale back down.
Well, what observability really buys you is a single thread
of information all the way from the application transaction to the infrastructure it's running on.
But you're also going to need something else.
Exactly.
Because no one is sitting around watching the monitoring or the logging dashboards for these events to take place.
So that's where alerting comes in: custom alerts, whether it's email, a Slack integration, or a text message.
We want to alert the right teams so that they can come in, take the right action immediately,
and head off the problem that is building.
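As a sketch of what such an alert might look like, here is a hypothetical Prometheus alerting rule that fires when the worker-node count stays above an expected ceiling, which can indicate the cluster is failing to scale down. The metric `kube_node_info` comes from kube-state-metrics; the threshold, duration, and names are illustrative assumptions.

```yaml
# Hypothetical Prometheus rule: fire when node count stays above an
# expected ceiling, suggesting stuck nodes. Threshold and names are
# assumed examples; route notifications via Alertmanager receivers
# (email, Slack, SMS, etc.).
groups:
  - name: scaling-anomalies
    rules:
      - alert: ClusterNotScalingDown
        expr: count(kube_node_info) > 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Worker node count above expected ceiling for 30 minutes"
          description: "Check for critical pods pinning autoscaled nodes."
```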
Excellent, so that's our first one. Let's go on to our second one.
OK, that was lesson one, now on to lesson two.
The problem is "I've got a hammer and ..."
I love this example because it captures the idea of
one size fits all: a hammer as the one tool to solve whatever you're trying to accomplish.
So as it relates to our container management,
it's that the developer persona is given a tool that is not purpose-fit for their job.
So if we give them the wrong solution, it's going to really drop their productivity.
Because instead of them looking for the right tool,
they're trying to force the wrong tool for that particular situation.
So they were given a Kubernetes cluster, for example.
Why would that be the wrong tool?
Because the developer-- say, for example, I'm a front-end developer --I don't
want to learn how to deploy and manage the lifecycle of my Kubernetes cluster,
I want some abstraction from it so that way I can focus on what's important to me, which is writing code.
That's going to be my value-add to the business.
And that Kubernetes cluster can then be managed by an administrator, for example.
Exactly.
So the administrator that has the right skills in Kubernetes
can create and manage that cluster, thinking about the line of responsibility.
They'll run the cluster and I can focus on application development.
And that kind of brings us to this first point, doesn't it?
Exactly.
It comes down to roles and responsibilities.
Being very prescriptive about the access and controls over what you can do within that cluster.
I handle managing and running the cluster;
you deal with it at the application code level.
So creating those boundaries will really accelerate our utilization of the tool, which is a Kubernetes cluster in this case.
In fact, one of my pet peeves is that, as a developer,
I spend too much time having to learn new tools or new processes.
I spend 80% of my time there,
when really I want to spend 80% of my time on code and as little as possible on tooling.
Right, so we want to flip that.
We want our developers to spend 80% or more of their time writing code.
That's what they want to do.
They don't want to learn these new tools.
So when we think about the hammer analogy,
the Kubernetes cluster was not the right solution for that persona, the developer.
Let's abstract them, give them access to the tools that they're familiar with,
the CI/CD tools to integrate, push code,
and it all comes back to the right container management strategy,
creating the boundaries, giving the right users the right tools to be efficient at their jobs.
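Those role boundaries are typically expressed in Kubernetes through RBAC. The fragment below is an illustrative sketch: developers get deploy-level access inside their own namespace while cluster administration stays with the admin role. The namespace and group names are assumed examples.

```yaml
# Illustrative RBAC sketch: developers can manage Deployments and read
# pod logs in their namespace only. Namespace and group names are
# assumed examples.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-frontend
  name: app-developer
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-frontend
  name: app-developer-binding
subjects:
  - kind: Group
    name: frontend-devs
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-developer
  apiGroup: rbac.authorization.k8s.io
```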
Excellent.
Hey, by the way, if you haven't seen Chris's video on container strategy, be sure and check it out.
It'll be right here.
OK, for our last lessons learned for containers we have "I've fallen" and something's gone wrong.
Exactly.
So the problem is that our pods, our containers, have fallen or crashed.
So then Kubernetes is smart enough to redeploy those, but then it happens again and again and again.
So we need to really understand what is causing that continuous process to take place.
So it's not just a one-off crash of a specific resource, but a continual failure, essentially.
Okay, great.
So we understand the problem.
What could cause that?
Generally, in Kubernetes,
it's because we deploy those applications, those pods,
without setting the right resource limits.
So think about: you've deployed your application, but you've not allocated enough CPU or memory.
So eventually you're going to consume all that you've been allocated and all you can do is crash.
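The fix the speakers describe is to declare resource requests and limits on each container. A minimal sketch, with placeholder numbers that should be sized from observed usage rather than copied as-is:

```yaml
# Minimal sketch: requests tell the scheduler what to reserve; limits
# cap usage so a leaking or bursty container is throttled or OOM-killed
# predictably instead of starving the node. Numbers and image name are
# placeholder assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
spec:
  containers:
    - name: app
      image: example.com/web-frontend:1.0   # placeholder image
      resources:
        requests:
          cpu: "250m"      # scheduler reserves this much
          memory: "256Mi"
        limits:
          cpu: "500m"      # CPU is throttled above this
          memory: "512Mi"  # container is OOM-killed above this
```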
And I can also see that happening where if you do it in your development environment
and then you go to production, the demands may well be different.
So you really need to plan for that, right?
Exactly. Because real life will drive additional resource requirements
that maybe you've not thought about in your development cycle.
So it does come down to planning.
And you can see here it's the entire stack.
It's asking what resources each of my microservices or components in that containerized application will require.
So that's when we think about holistically as an application,
we think about individual containers,
and extremely important is to think about the underlying infrastructure
because we could set horizontal and vertical scaling policies within Kubernetes,
but eventually we'll run out of capacity within the physical infrastructure.
So then we need an auto-scaling policy to scale out and accommodate that growth in the workload.
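At the pod level, horizontal scaling is usually expressed with a HorizontalPodAutoscaler. The sketch below is illustrative: the names and bounds are assumed examples, and the HPA only adds pods, so a node-level autoscaler still has to add capacity underneath, as noted above.

```yaml
# Illustrative HPA: scale the Deployment between 2 and 20 replicas to
# hold average CPU near 70% of the pods' requests. Names and bounds are
# assumed examples.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```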
And when you're trying to account for that, it comes down to knowing it's going to happen.
Exactly.
It all comes down to having the insights, that in-depth telemetry: again, observability, metrics, logs.
How do we understand how each of these layers is performing,
where the bottlenecks are,
and where to allocate additional resources?
It's really a classic performance optimization pattern of being able to plan, observe and finally adjust.
Adjusting is critical because we can do all of the planning and the forecasting,
but it really comes down to once we deploy that workload, how do we observe it
and then come back to adjust.
With our next deployment,
using Kubernetes blue/green or red/black deployment strategies,
we can roll out and ensure that we've got the right capacity required.
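One common way to implement blue/green in Kubernetes is a Service whose selector is flipped between two parallel Deployments. This is a hedged sketch; the names and port numbers are assumed examples.

```yaml
# Blue/green sketch: the Service routes to whichever color is live.
# Flipping "version: blue" to "version: green" cuts traffic over once
# the green Deployment is verified. Names and ports are assumed examples.
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
spec:
  selector:
    app: web-frontend
    version: blue   # change to "green" to promote the new deployment
  ports:
    - port: 80
      targetPort: 8080
```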
Well, thanks, Chris.
That was excellent.
Before you leave, don't forget to leave us some comments.
If there are problems we haven't discussed, maybe we'll cover them in the next Lessons Learned.
Thanks for watching!
Before you leave, hey, don't forget to hit like and subscribe.