Remote Engines for Hybrid Data Integration
Key Points
- In hybrid‑cloud environments data resides across on‑premises systems, cloud platforms, and edge devices, making it often more effective to integrate data where it lives rather than moving it centrally.
- Remote engines are user‑controlled, containerized execution environments (often Kubernetes pods) deployed in the data plane that run integration and quality tasks close to the source, separating design time (control plane) from runtime (remote engine).
- This architecture lets developers design jobs in a centralized, fully‑managed control plane while the compiled job code is executed on the remote engine in the appropriate cloud, on‑prem, or edge location.
- Processing data locally with remote engines cuts egress‑related cloud fees, delivering significant cost savings especially for high‑volume daily data movements.
- By eliminating network transfer overhead, remote engines boost performance and enable low‑latency, high‑throughput data integration and quality operations wherever the data resides.
Sections
- Remote Engines for Hybrid Data Integration - The passage explains how remote engines—containerized, user‑controlled compute environments deployed in on‑premise or cloud data planes—allow organizations to execute data integration and quality tasks close to where their data resides in a hybrid cloud architecture.
- Remote Engines for Secure Data Processing - Remote processing engines run integration jobs where the data resides, delivering cost savings, auto‑scaling performance, and keeping sensitive information within the security perimeter.
- Remote Engines Enable Hybrid Data Processing - The passage explains how remote engines replace the traditional hub‑and‑spoke model with a hybrid deployment that processes data at its native location, delivering timely results while lowering costs and boosting security.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=yGTV5qzv0-0](https://www.youtube.com/watch?v=yGTV5qzv0-0) · **Duration:** 00:06:28
- [00:00:00](https://www.youtube.com/watch?v=yGTV5qzv0-0&t=0s) Remote Engines for Hybrid Data Integration
- [00:03:03](https://www.youtube.com/watch?v=yGTV5qzv0-0&t=183s) Remote Engines for Secure Data Processing
- [00:06:11](https://www.youtube.com/watch?v=yGTV5qzv0-0&t=371s) Remote Engines Enable Hybrid Data Processing
Data doesn't sit in one convenient location.
In a hybrid cloud world, there are databases on premises,
applications and analytics platforms running in the cloud,
and edge devices collecting sensor data.
For some organizations, the most effective way to integrate
their data is to process it where it lives.
This is where the value of remote engines comes into play.
Remote engines are execution environments
that you deploy and manage in your own systems,
either on premises or in your cloud environments,
to run data integration and data quality tasks.
They're essentially computation resources that you control,
allowing you to keep data integration workloads close to your data sources.
We can think about data integration as how water is processed in a city.
Typically, water is processed at a central processing plant.
In this analogy, think about remote engines
like the water filter I have in my apartment.
Sure, the city treats the water at the main plant,
but having that filter right in my apartment means I can do the filtering
behind my own walls.
So what makes remote engines special?
Picture this.
You deploy a containerized application
running in your data plane.
You can deploy this in your virtual private cloud,
on premises data center or in any cloud environment.
Inside that container environment, let's say we're using Kubernetes
for now, you can have what we call the conductor pod,
which orchestrates and manages your jobs,
as well as several compute pods,
which actually handle the workload.
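The conductor/compute split described above can be sketched in plain Python. This is only an illustrative model, not a real API: `Conductor` and `ComputePod` are hypothetical names, and real pods would be Kubernetes workloads, not objects.

```python
from queue import Queue

class ComputePod:
    """Illustrative stand-in for a compute pod: it runs the actual work."""
    def __init__(self, name):
        self.name = name

    def run(self, task):
        return f"{self.name} finished {task}"

class Conductor:
    """Illustrative stand-in for the conductor pod: it never runs jobs
    itself, it only queues them and hands them to compute pods."""
    def __init__(self, pods):
        self.pods = pods
        self.jobs = Queue()

    def submit(self, task):
        self.jobs.put(task)

    def dispatch_all(self):
        results = []
        while not self.jobs.empty():
            task = self.jobs.get()
            # Simple round-robin over the available compute pods.
            pod = self.pods[len(results) % len(self.pods)]
            results.append(pod.run(task))
        return results

conductor = Conductor([ComputePod("compute-0"), ComputePod("compute-1")])
conductor.submit("etl-job-1")
conductor.submit("etl-job-2")
print(conductor.dispatch_all())
```

The point of the split is that orchestration state lives in one place (the conductor) while the heavy lifting is spread across interchangeable workers.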
The breakthrough is the separation of design time and runtime.
You design your jobs
in a control plane or one fully managed centralized platform,
but the execution happens on the remote engine.
A quick example could be the data plane
over here on the right, where I have my remote engine
and what can be referred to as the control plane
over here on the left, where I design and manage my jobs.
So let's say I have a simple ETL job, where I have two sources
being combined into one,
with some transformations in the middle, and I'm writing to a target.
What we have is the control plane compiling these jobs into code
and then sending that code down to the data plane,
where it actually gets executed.
In short, organizations use the control plane
to design jobs and the data plane
to actually run them.
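The design-time/runtime separation can be sketched as two functions: one that would run in the control plane and only produces a plan, and one that would run on the remote engine and is the only place data is touched. The function names and the job-spec shape here are assumptions for illustration, not the product's actual format.

```python
# Design time (control plane): the job exists only as a declarative description.
job_design = {
    "sources": ["orders", "customers"],
    "transform": "join",
    "target": "warehouse.sales",
}

def compile_job(design):
    """Control-plane step: turn the design into executable instructions.
    No data is read here; only the plan is produced and shipped down."""
    return {
        "steps": [f"read {s}" for s in design["sources"]]
                 + [f"apply {design['transform']}"]
                 + [f"write {design['target']}"]
    }

def execute_plan(plan):
    """Runtime step (remote engine): run the compiled plan where the data
    lives; only this function would ever touch the actual rows."""
    return [f"done: {step}" for step in plan["steps"]]

plan = compile_job(job_design)    # happens in the control plane
results = execute_plan(plan)      # happens on the remote engine
print(results)
```

This mirrors the ETL example above: two sources combined with a transformation in the middle, written to a target, with only the compiled plan crossing from control plane to data plane.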
So let's break down why this architecture
is becoming essential for modern data operations.
The first is cost efficiency.
Cloud providers
charge egress fees when data leaves their environment.
This adds up when you're moving millions of rows daily.
Remote engines eliminate this by processing the data
in the same cloud as where it lives,
with use cases including running data
quality rules where the data resides, leading
to substantial cost savings.
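To see how quickly egress adds up, here is a back-of-the-envelope calculation. The per-gigabyte rate below is a hypothetical illustrative figure, not any provider's quoted price.

```python
def monthly_egress_cost(gb_per_day, price_per_gb=0.09, days=30):
    """Rough egress bill for moving data out of a cloud every day.
    price_per_gb is an assumed illustrative rate; check your provider."""
    return gb_per_day * price_per_gb * days

# Moving 500 GB/day out for central processing vs. processing in place:
central = monthly_egress_cost(500)   # data leaves the cloud daily
in_place = monthly_egress_cost(0)    # remote engine runs where data lives
print(f"central: ${central:,.2f}/month, in place: ${in_place:,.2f}/month")
```

At this assumed rate, daily 500 GB transfers cost over a thousand dollars a month in egress alone, which the in-place pattern avoids entirely.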
The second is performance.
Instead of moving data sets across networks,
you can execute data integration jobs
as close to the data as you'd like.
Whether you're running a single job or hundreds of thousands,
the compute pods can autoscale to handle the workload.
They can scale up,
and then once the workload is completed, scale back down
just to have 1 or 2 compute pods running.
The workload is distributed intelligently among the compute pods,
allowing for dynamic processing
to handle bursts in workloads over time.
You can also tune parameters that you control for dynamic,
efficient processing right where your data lives.
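The scale-up/scale-down behavior described above can be sketched as a simple sizing rule. All of the parameters here (jobs per pod, the idle floor of two pods, the cap) are illustrative tuning knobs, not real product settings.

```python
import math

def desired_pods(queued_jobs, jobs_per_pod=10, min_pods=2, max_pods=50):
    """Toy autoscaling rule in the spirit described above: enough pods to
    cover the queue, never below a small idle floor or above a cap."""
    needed = math.ceil(queued_jobs / jobs_per_pod) if queued_jobs else 0
    return max(min_pods, min(needed, max_pods))

print(desired_pods(0))      # idle: scaled down to the floor of 2 pods
print(desired_pods(5000))   # burst: scaled up, but capped at 50 pods
```

A real deployment would let the Kubernetes autoscaler apply a rule like this continuously, so bursts get extra compute pods and quiet periods shrink back to one or two.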
The third benefit, and perhaps the most important, is security.
Sensitive data
like financial records, healthcare information or proprietary research
often cannot leave its current environment or locality.
So remote engines allow you to process this data without ever moving
beyond your security perimeter,
as they can be deployed behind the firewall.
This allows you to create secure connections to your sources
and targets without data ever leaving your walls.
Just like how my apartment filter handles water processing
within my controlled space,
remote engines deliver these three key benefits as well:
security, because I can filter my data behind my own wall;
cost, because I do not need to pay egress and ingress fees
and can avoid those charges; and high performance,
by avoiding bottlenecks with the water service.
The beauty is that regardless of where they're deployed,
you manage everything from a single control plane.
With remote engines, you
design jobs once and run them anywhere.
The last question you might be asking:
the benefit of software as a service
is that I don't have to administer anything,
so how does that work with remote engines?
So because remote engines are containerized runtimes,
they're easier to update than the traditional infrastructure.
This is because traditional infrastructure often
requires downtime, complex migration procedures
or complete system rebuilds.
But with containers, you simply update the container image
to a new version,
and the conductor will manage rolling out
new compute pods with minimal disruption.
Once everything is running smoothly
on the new version, it shuts down the old pods,
so your data processing never stops.
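The rolling-update behavior just described can be sketched as follows. This is a toy model of the pattern, not the actual rollout mechanism: real rollouts are handled by the orchestrator, and the health check here is simulated.

```python
def rolling_update(pods, new_version):
    """Toy rolling update in the spirit described above: start a new pod,
    confirm it is healthy, and only then retire one old pod, so capacity
    never drops below the original pod count."""
    old = list(pods)
    updated = []
    while old:
        candidate = {"version": new_version, "healthy": True}  # start new pod
        if candidate["healthy"]:    # wait until the new pod reports healthy
            old.pop(0)              # only then shut one old pod down
            updated.append(candidate)
    return updated

fleet = [{"version": "1.0", "healthy": True}] * 3
fleet = rolling_update(fleet, "1.1")
print([p["version"] for p in fleet])
```

Because old pods are only removed after their replacements are healthy, jobs in flight keep running on the old version until the new one is ready to take over.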
Just like how I replace my water filter when it's convenient for me,
you control when to refresh your remote engines
without disrupting operations.
Remote engines represent a fundamental shift
from the old hub and spoke model of data
processing to a hybrid deployment
pattern that respects where your data naturally lives.
You get processed data exactly where
and when you need it, while keeping
costs down and security up.