Remote Engines for Hybrid Data Integration
Key Points
- In hybrid‑cloud environments data resides across on‑premises systems, cloud platforms, and edge devices, making it often more effective to integrate data where it lives rather than moving it centrally.
- Remote engines are user‑controlled, containerized execution environments (often Kubernetes pods) deployed in the data plane that run integration and quality tasks close to the source, separating design time (control plane) from runtime (remote engine).
- This architecture lets developers design jobs in a centralized, fully‑managed control plane while the compiled job code is executed on the remote engine in the appropriate cloud, on‑prem, or edge location.
- Processing data locally with remote engines cuts egress‑related cloud fees, delivering significant cost savings especially for high‑volume daily data movements.
- By eliminating network transfer overhead, remote engines boost performance and enable low‑latency, high‑throughput data integration and quality operations wherever the data resides.
Sections
- Remote Engines for Hybrid Data Integration - The passage explains how remote engines—containerized, user‑controlled compute environments deployed in on‑premise or cloud data planes—allow organizations to execute data integration and quality tasks close to where their data resides in a hybrid cloud architecture.
- Remote Engines for Secure Data Processing - Remote processing engines run integration jobs where the data resides, delivering cost savings, auto‑scaling performance, and keeping sensitive information within the security perimeter.
- Remote Engines Enable Hybrid Data Processing - The passage explains how remote engines replace the traditional hub‑and‑spoke model with a hybrid deployment that processes data at its native location, delivering timely results while lowering costs and boosting security.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=yGTV5qzv0-0](https://www.youtube.com/watch?v=yGTV5qzv0-0) · **Duration:** 00:06:28
- [00:00:00](https://www.youtube.com/watch?v=yGTV5qzv0-0&t=0s) Remote Engines for Hybrid Data Integration
- [00:03:03](https://www.youtube.com/watch?v=yGTV5qzv0-0&t=183s) Remote Engines for Secure Data Processing
- [00:06:11](https://www.youtube.com/watch?v=yGTV5qzv0-0&t=371s) Remote Engines Enable Hybrid Data Processing
Data doesn't sit in one convenient location.
In a hybrid cloud world, there are databases on premises,
applications and analytics platforms running in the cloud,
and edge devices collecting sensor data.
For some organizations, the most effective way to integrate
their data is to process it where it lives.
This is where the value of remote engines comes into play.
Remote engines are execution environments
that you deploy and manage in your own systems,
either on premises or in your cloud environments,
to run data integration and data quality tasks.
They're essentially computation resources that you control,
allowing you to keep data integration workloads close to your data sources.
We can think about data integration as how water is processed in a city.
Typically, water is processed at a central processing plant.
In this analogy, think about remote engines
like the water filter I have in my apartment.
Sure, the city treats the water at the main plant,
but having that filter right in my apartment means I can do the filtering
behind my own walls.
So what makes remote engines special?
Picture this.
You deploy a containerized application
running in your data plane.
You can deploy this in your virtual private cloud,
on premises data center or in any cloud environment.
Inside that container environment, let's say we're using Kubernetes
for now, you can have what we call the conductor pod,
which orchestrates and manages your jobs,
as well as several compute pods,
which actually handle the workload.
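The conductor/compute split described above can be sketched in plain Python. This is only an illustrative model, not a real API: `Conductor` and `ComputePod` are hypothetical names, and real pods would be Kubernetes workloads, not objects.

```python
from queue import Queue

class ComputePod:
    """Illustrative stand-in for a compute pod: it runs the actual work."""
    def __init__(self, name):
        self.name = name

    def run(self, task):
        return f"{self.name} finished {task}"

class Conductor:
    """Illustrative stand-in for the conductor pod: it never runs jobs
    itself, it only queues them and hands them to compute pods."""
    def __init__(self, pods):
        self.pods = pods
        self.jobs = Queue()

    def submit(self, task):
        self.jobs.put(task)

    def dispatch_all(self):
        results = []
        while not self.jobs.empty():
            task = self.jobs.get()
            # Simple round-robin over the available compute pods.
            pod = self.pods[len(results) % len(self.pods)]
            results.append(pod.run(task))
        return results

conductor = Conductor([ComputePod("compute-0"), ComputePod("compute-1")])
conductor.submit("etl-job-1")
conductor.submit("etl-job-2")
print(conductor.dispatch_all())
```

The point of the split is that orchestration state lives in one place (the conductor) while the heavy lifting is spread across interchangeable workers.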
The breakthrough is the separation of design time and runtime.
You design your jobs
in a control plane or one fully managed centralized platform,
but the execution happens on the remote engine.
A quick example could be the data plane
over here on the right, where I have my remote engine
and what can be referred to as the control plane
over here on the left, where I design and manage my jobs.
So let's say I have a simple ETL job, where I have two sources
being combined into one,
with some transformations in the middle, and I'm writing to a target.
What we have is the control plane compiling these jobs into code
and then sending that code down to the data plane,
where it actually gets executed.
In short, organizations use the control plane
to design jobs and the data plane
to actually run them.
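The design-time/runtime separation can be sketched as two functions: one that would run in the control plane and only produces a plan, and one that would run on the remote engine and is the only place data is touched. The function names and the job-spec shape here are assumptions for illustration, not the product's actual format.

```python
# Design time (control plane): the job exists only as a declarative description.
job_design = {
    "sources": ["orders", "customers"],
    "transform": "join",
    "target": "warehouse.sales",
}

def compile_job(design):
    """Control-plane step: turn the design into executable instructions.
    No data is read here; only the plan is produced and shipped down."""
    return {
        "steps": [f"read {s}" for s in design["sources"]]
                 + [f"apply {design['transform']}"]
                 + [f"write {design['target']}"]
    }

def execute_plan(plan):
    """Runtime step (remote engine): run the compiled plan where the data
    lives; only this function would ever touch the actual rows."""
    return [f"done: {step}" for step in plan["steps"]]

plan = compile_job(job_design)    # happens in the control plane
results = execute_plan(plan)      # happens on the remote engine
print(results)
```

This mirrors the ETL example above: two sources combined with a transformation in the middle, written to a target, with only the compiled plan crossing from control plane to data plane.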
So let's break down why this architecture
is becoming essential for modern data operations.
The first is cost efficiency.
Cloud providers
charge egress fees when data leaves their environment.
This adds up when you're moving millions of rows daily.
Remote engines eliminate this by processing the data
in the same cloud as where it lives,
with use cases including running data
quality rules where the data resides, leading
to substantial cost savings.
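To see how quickly egress adds up, here is a back-of-the-envelope calculation. The per-gigabyte rate below is a hypothetical illustrative figure, not any provider's quoted price.

```python
def monthly_egress_cost(gb_per_day, price_per_gb=0.09, days=30):
    """Rough egress bill for moving data out of a cloud every day.
    price_per_gb is an assumed illustrative rate; check your provider."""
    return gb_per_day * price_per_gb * days

# Moving 500 GB/day out for central processing vs. processing in place:
central = monthly_egress_cost(500)   # data leaves the cloud daily
in_place = monthly_egress_cost(0)    # remote engine runs where data lives
print(f"central: ${central:,.2f}/month, in place: ${in_place:,.2f}/month")
```

At this assumed rate, daily 500 GB transfers cost over a thousand dollars a month in egress alone, which the in-place pattern avoids entirely.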
The second is performance.
Instead of moving data sets across networks,
you can execute data integration jobs
as close to the data as you'd like.
Whether you're running a single job or hundreds of thousands,
the compute pods can autoscale to handle the workload.
They can scale up,
and then once the workload is completed, scale back down
just to have 1 or 2 compute pods running.
The workload is distributed intelligently among the compute pods,
allowing for dynamic processing
to handle bursts in workloads over time.
You can also tune parameters that you control for dynamic,
efficient processing right where your data lives.
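The scale-up/scale-down behavior described above can be sketched as a simple sizing rule. All of the parameters here (jobs per pod, the idle floor of two pods, the cap) are illustrative tuning knobs, not real product settings.

```python
import math

def desired_pods(queued_jobs, jobs_per_pod=10, min_pods=2, max_pods=50):
    """Toy autoscaling rule in the spirit described above: enough pods to
    cover the queue, never below a small idle floor or above a cap."""
    needed = math.ceil(queued_jobs / jobs_per_pod) if queued_jobs else 0
    return max(min_pods, min(needed, max_pods))

print(desired_pods(0))      # idle: scaled down to the floor of 2 pods
print(desired_pods(5000))   # burst: scaled up, but capped at 50 pods
```

A real deployment would let the Kubernetes autoscaler apply a rule like this continuously, so bursts get extra compute pods and quiet periods shrink back to one or two.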
The third benefit, and perhaps the most important, is security.
Sensitive data
like financial records, healthcare information or proprietary research
often cannot leave its current environment or locality.
So remote engines allow you to process this data without ever moving
beyond your security perimeter,
as they can be deployed behind the firewall.
This allows you to create secure connections to your sources
and targets without data ever leaving your walls.
Just like how my apartment filter handles water processing
within my controlled space,
remote engines deliver these three key benefits as well:
security, because I can filter my data behind my own wall;
cost, because I do not need to pay egress and ingress fees
and can avoid those charges; and high performance,
by avoiding bottlenecks with the water service.
The beauty is that regardless of where they're deployed,
you manage everything from a single control plane.
With remote engines, you
design jobs once and run them anywhere.
The last question you might be asking:
the benefit of software as a service
is that I don't have to administer anything,
so how does that work with remote engines?
So because remote engines are containerized runtimes,
they're easier to update than the traditional infrastructure.
This is because traditional infrastructure often
requires downtime, complex migration procedures
or complete system rebuilds.
But with containers, you simply update the container image
to a new version,
and the conductor will manage rolling out
new compute pods with minimal disruption.
Once everything is running smoothly
on the new version, it shuts down the old pods,
so your data processing never stops.
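The rolling-update behavior just described can be sketched as follows. This is a toy model of the pattern, not the actual rollout mechanism: real rollouts are handled by the orchestrator, and the health check here is simulated.

```python
def rolling_update(pods, new_version):
    """Toy rolling update in the spirit described above: start a new pod,
    confirm it is healthy, and only then retire one old pod, so capacity
    never drops below the original pod count."""
    old = list(pods)
    updated = []
    while old:
        candidate = {"version": new_version, "healthy": True}  # start new pod
        if candidate["healthy"]:    # wait until the new pod reports healthy
            old.pop(0)              # only then shut one old pod down
            updated.append(candidate)
    return updated

fleet = [{"version": "1.0", "healthy": True}] * 3
fleet = rolling_update(fleet, "1.1")
print([p["version"] for p in fleet])
```

Because old pods are only removed after their replacements are healthy, jobs in flight keep running on the old version until the new one is ready to take over.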
Just like how I replace my water filter when it's convenient for me,
you control when to refresh your remote engines
without disrupting operations.
Remote engines represent a fundamental shift
from the old hub and spoke model of data
processing to a hybrid deployment
pattern that respects where your data naturally lives.
You get processed data exactly where
and when you need it, while keeping
costs down and security up.