Automating Server Deployment with Orchestrators
Key Points
- Deploying the same application manually on multiple servers requires individual logins, installations, and troubleshooting, making the process error‑prone and inefficient.
- A workload orchestrator automates the entire lifecycle—describing required resources, handling deployment, scaling, and resiliency—eliminating the need for human intervention.
- When a server or job fails, the orchestrator automatically detects the issue, restores the workload to its prior state, and treats the event as routine rather than a crisis.
- While Kubernetes is a popular orchestration platform, it often involves numerous interdependent components (e.g., ConfigMaps, Secrets, storage) and complex YAML configurations, which can add overhead compared to simpler workload orchestration solutions.
Sections
- Automating Multi-Server Application Deployment - The speaker contrasts the tedious manual process of logging into each VM to install and troubleshoot an app with using a workload orchestrator that automates deployment, scaling, and resiliency across multiple servers.
- Kubernetes Complexity vs Simple Orchestrator - The speaker contrasts Kubernetes’s multi‑YAML, component‑heavy deployments with a lightweight workload orchestrator that uses a single HCL job, claiming it has a gentler learning curve and enables faster application rollouts.
- Flexible Orchestration Beyond Kubernetes - The speaker explains why Kubernetes isn’t ideal for ephemeral batch workloads and proposes a dedicated workload orchestrator that can flexibly manage web apps, training, batch jobs, and inference with resource‑specific job specs.
Source: https://www.youtube.com/watch?v=YsEnqWnZcME (Duration: 00:11:33)
- 00:00:00 Automating Multi-Server Application Deployment
- 00:03:07 Kubernetes Complexity vs Simple Orchestrator
- 00:08:00 Flexible Orchestration Beyond Kubernetes
Full Transcript
Imagine you need to run an application on a fleet of servers. Let me show you something. We've got a VM here, VM1, and another one, VM2, and then VM3. Looking at this, I need to deploy an application on each server. What are you going to do? You're going to log in to each and every one of them, and then you deploy your application. Even though it's the same application, you still have to go through the process on every server. So after you've logged in and installed the application, let's say you hit a problem. You're going to have to log in again and troubleshoot, and good luck with that. Normally, this is a very manual process.

With a workload orchestrator, if something goes wrong, you don't have to do any of this, because it automates the whole thing and eliminates the human intervention. You describe what you want the application to do: you say, these are the required resources, these are the runtime requirements. And the orchestrator is going to place it on its own. It's going to handle deployment, it's going to handle scaling, it's going to handle resiliency, all automatically.

Ray, tell me, what is workload orchestration? Great question. A workload orchestrator is a system that allows organizations to run multiple apps, like web apps, AI and ML workflows, and batch jobs, across multiple servers and environments. It's a simplified way of automating what's common in these workflows: it automates the scheduling, the placement, and the health monitoring as well. That being said, a workload orchestrator just takes the whole manual process I was talking about earlier and automates it for you. You don't have to worry about it if anything fails.
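As a sketch of what "describing the resources and runtime requirements" can look like, here is a minimal HCL job in the style of HashiCorp Nomad. The transcript never names the orchestrator, so treat the job name, image, and numbers as illustrative:

```hcl
# Minimal Nomad-style job: declares the runtime (Docker) and the
# resources the task needs; the orchestrator decides placement.
job "web" {
  datacenters = ["dc1"]
  type        = "service"   # long-running service, restarted on failure

  group "app" {
    count = 1               # one instance for now

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"   # placeholder application image
      }

      resources {
        cpu    = 500   # MHz
        memory = 256   # MB
      }
    }
  }
}
```

Submitting a file like this (in Nomad, `nomad job run web.hcl`) hands deployment, restarts, and rescheduling over to the scheduler rather than to a person logging in to each VM.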
Can you give me an example? Yeah. Let's say we have a server here, and this server failed. The workload orchestrator, I'll write it as WO, will automatically detect the failure and bring the job back to the same state. Did David intervene here? No. Did Ray touch it? No. We didn't have to log in, we didn't have to do anything. So the whole manual process was eliminated, and the workload orchestrator, by itself, was able to treat the failure not as a crisis, the way we used to, but as business as usual.

We use Kubernetes, the two of us, right? Yeah. Many companies use Kubernetes. We love Kubernetes. Why would we use this approach instead? Great question. Kubernetes is an amazing tool, there's no doubt about it. But let's think about it. If you have a Deployment, let's say this is the Deployment, does the Deployment come by itself? No. It comes with a lot of components: a ConfigMap (CM), a Secret, storage. How many YAMLs are we talking about here? We're talking about four YAML files, right?

On the other side, with a workload orchestrator, the DevOps engineer does something simple: he just pushes one single HCL job.
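The "one single HCL job" replacing those four YAMLs might look roughly like this. Again a Nomad-style sketch: the replica count, image, and rolling-update settings are illustrative, not taken from the video:

```hcl
# One file carries what the Deployment + ConfigMap + Secret + storage
# YAMLs express separately: replicas, configuration, rollout strategy.
job "web" {
  datacenters = ["dc1"]
  type        = "service"

  # Deployment strategy: roll one instance at a time, revert on failure.
  update {
    max_parallel = 1
    auto_revert  = true
  }

  group "app" {
    count = 3   # three replicas -> three running instances

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"   # placeholder image
      }

      # Inline configuration instead of a separate ConfigMap object.
      template {
        data        = "LISTEN_PORT=8080"
        destination = "local/app.env"
        env         = true
      }

      resources {
        cpu    = 500   # MHz
        memory = 256   # MB
      }
    }
  }
}
```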
Where does it go? It goes to your workload orchestrator, which reads that job, finds out how many replicas you want, and then deploys your application. Assuming you asked for three replicas, you get three instances of your application, and it also respects your deployment strategy; you've got a deployment strategy defined right there. And all of this runs on a single server: you've got an operating system down there, the orchestrator living on top, and all of it inside your VM.

So I see YAMLs; we know YAMLs, we've written YAMLs for Kubernetes. What is a job, and is there a hard learning curve to learn it? I don't think there's a hard learning curve. You have a single file, and it's very lightweight. In the best-case scenario, a DevOps engineer will deploy a traditional application and it will take days, not months. Then, as he goes, there's another feature, so he starts to learn about it, and another feature, and he learns about that too. So it's not about the application itself, it's about your own pace. You can go at your own speed: if you want to learn the whole thing quickly, sure, you can do that; if you want to take it easy, go as you go. That's what I recommend.

Yeah. A workload orchestrator sounds useful for traditional application deployments. What about AI and ML? Do they blend into the whole thing here? Well, what is the traditional approach for AI? Some companies are going to spin up a Kubernetes cluster for their web apps, another one for their training, and a third for their batch jobs. Other companies are just going to put it all on one cluster, with all that entails: complex namespace configurations, resource quotas. Whichever way you go with this, you're going to end up with a headache. If you have multiple clusters, you have multiple ops teams and multiple monitoring teams. If you have one cluster that's supposed to serve everybody, it's going to be difficult to deal with. And on top of all of this, AI workloads keep evolving. Eight years ago, we didn't have transformer models. Three years ago, we didn't have GPT-level inference yet.

Now, if you've worked with AI or ML workloads before, you may have noticed a pattern. Your web teams are deploying microservices on Kubernetes. Your data scientists might be using Slurm, because that's what they used in grad school, and that's how they work with GPU jobs. Your data engineers work with Airflow for their pipeline management. And the ML team might be deploying services in containers, or maybe they're just SSHing directly into a box and running custom scripts. That's four teams with four tool sets and four totally different sets of expertise. If something breaks, well, good luck figuring out which system is the problem.

Okay, so this is awesome. Walk me through a real-world example. Okay. Say you have a data scientist who wants to run a training job. First they're going to file a ticket with their DevOps team. They're going to wait for the cluster to be ready, wait for the approval, and a couple of days later they're able to train their model. Exactly. At the same exact time, the web team is deploying continuously on Kubernetes. This is the same company, the same organization, just totally different universes of efficiency.

AI workloads are fundamentally different. A training job could run for three hours and never run again. Your inferencing service has to be on 24/7.
It's connected to your GPU. And your pipeline runs on a schedule. This needs something like flexible orchestration, and especially considering we don't know what's coming next, but we do know something is coming, it has to stay flexible.

Awesome. So, hold on. Flexible? Sounds interesting. Tell me more about it. First, a caveat: why can't we do this on Kubernetes? We can do this on Kubernetes. We could run batch jobs, we have CronJobs, all this stuff works. But if you've ever tried to run ephemeral workloads, like batch jobs, on Kubernetes, it's a bit awkward. That's because Kubernetes wasn't designed for this. It was designed for long-running containerized services that it keeps alive. Workload orchestrators are a bit different: whether you have a batch job, a system job, or a service that you need running, they're all first-class citizens in the scheduler.

So tell me more about this, maybe in the drawing. Sure. When you think about flexible orchestration, let's get back to that. You're going to have one cluster, and this is going to be your workload orchestrator. It's going to be able to spin up your web app; it's going to be able to spin up your training and your batch jobs, which run once and then never run again; and it's going to have your inferencing. Each of these services is going to be tied to a resource, and each of these resources can be defined within the job spec.
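A resource- and placement-aware job spec along these lines could be sketched as follows (Nomad-style; the datacenter names and the `rack` attribute are hypothetical, since racks are typically exposed as user-defined node metadata rather than built-in attributes):

```hcl
# Batch job: runs to completion once and is not restarted afterwards,
# unlike a long-running service.
job "train-model" {
  datacenters = ["dc1", "dc2"]   # allow a combination of data centers
  type        = "batch"

  # Hypothetical placement rule: pin to a specific rack, assuming the
  # operator tagged nodes with a "rack" value in client metadata.
  constraint {
    attribute = "${meta.rack}"
    value     = "r42"
  }

  group "trainer" {
    task "train" {
      driver = "docker"

      config {
        image = "python:3.12"                      # placeholder image
        args  = ["python", "-c", "print('train')"] # placeholder workload
      }

      resources {
        cpu    = 4000   # MHz
        memory = 8192   # MB
      }
    }
  }
}
```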
That makes it particularly easy. You could have a job spec that says: I need this spread out across one data center, or I need this on a combination of data centers, or I need this to be on a specific data center, on a specific rack.

Awesome. So, looking at this, and correct me if I'm wrong, I can imagine a before-and-after. If we're looking at the before, I'll give you a square here saying we've got, as you said, fragmented ops; we've also got many tools; and there's also specialized knowledge, let me say. Yeah. When you go to flexible orchestration, instead of all of that, we're going to have unified ops, we're going to have one tool, and we're going to have shared knowledge, which makes this so much easier and so much more scalable.

Okay. So what does that really mean for teams? Look back at the data scientist. This time, they could schedule their own training, and they could do it themselves. It takes a couple of minutes, not a couple of days, and they're in the same exact workflow as the web app team. Your DevOps team has one platform that they need to know really well, so if something breaks at 2 AM, they know where to look: they have one set of logs, and it just works. This is the power of flexible orchestration. And when the next AI breakthrough comes, because we know it's coming, you don't rebuild your infrastructure. You write a new job spec. That's operational simplicity without sacrificing capability.
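To make "you write a new job spec" concrete: the scheduled pipeline mentioned earlier would, in the same Nomad-style HCL, just be another small file (the cron expression and image are placeholders, not from the video):

```hcl
# Periodic batch job: the scheduler launches it on a cron schedule,
# covering the "pipeline runs on a schedule" case without new infrastructure.
job "nightly-pipeline" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 2 * * *"  # every night at 02:00
    prohibit_overlap = true         # skip a run if the last one is still going
  }

  group "etl" {
    task "run" {
      driver = "docker"

      config {
        image = "python:3.12"                    # placeholder pipeline image
        args  = ["python", "-c", "print('etl')"] # placeholder step
      }

      resources {
        cpu    = 1000   # MHz
        memory = 1024   # MB
      }
    }
  }
}
```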