AutoSQL Enables Unified Data Lakehouse Queries

Key Points

  • The exploding volume of data across on‑prem, cloud, and vendor environments demands a simpler way to access and manage it.
  • Traditional architectures with tightly‑coupled storage‑compute and heavy ETL pipelines cause scaling problems and data duplication, prompting a shift to “lakehouse” designs that layer independent compute over inexpensive object stores.
  • IBM Cloud Pak for Data’s new AutoSQL engine provides a unified, SQL‑based compute layer that can query structured and unstructured data directly on data lakes, integrate Spark, and leverage data virtualization for external sources.
  • AutoSQL also embeds end‑to‑end governance, allowing custom policies to be applied to any ingested or virtualized data set across on‑prem warehouses, S3 buckets, Azure, Snowflake, Oracle, Teradata, etc.
  • A live demo shows how quickly users can create connections to sources like AWS S3, virtualize multiple data assets, and combine them into a single virtual table for use in both BI and data‑science workloads.
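
The virtual-table idea in the last key point can be pictured with a small, self-contained sketch. It does not use AutoSQL or Cloud Pak for Data. It assumes Python's built-in `sqlite3` module as a stand-in: two attached in-memory databases (hypothetically named `warehouse` and `lake`) play the role of separate sources, and a temporary view plays the role of the virtual table. All table and column names are invented for illustration.

```python
import sqlite3

# One connection acts as the single query layer over several "sources".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")  # stand-in for an on-prem warehouse
conn.execute("ATTACH DATABASE ':memory:' AS lake")       # stand-in for an object store

# Each source holds its own table; no data is copied between them.
conn.execute("CREATE TABLE warehouse.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO warehouse.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO lake.orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.5), (2, 7.0)])

# The "virtual table": a view that joins across both sources at query time.
# SQLite only lets TEMP views reference tables in other attached databases.
conn.execute("""
    CREATE TEMP VIEW customer_spend AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM warehouse.customers AS c
    JOIN lake.orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
""")

# A BI tool or a notebook would query the view like any ordinary table.
rows = conn.execute(
    "SELECT name, total FROM customer_spend ORDER BY name"
).fetchall()
print(rows)  # [('Ada', 15.5), ('Grace', 7.0)]
```

The view stores no rows of its own, which mirrors the point about avoiding ETL copies: the join runs against the underlying sources each time the view is queried.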

**Source:** [https://www.youtube.com/watch?v=N2CuentyYGw](https://www.youtube.com/watch?v=N2CuentyYGw)

**Duration:** 00:05:42

Sections

  • [00:00:00](https://www.youtube.com/watch?v=N2CuentyYGw&t=0s) **Simplifying Data Lakes with AutoSQL**: IBM’s AutoSQL engine in Cloud Pak for Data unifies compute across structured, unstructured, and external sources, allowing direct queries over cloud object stores and lakehouses while providing built‑in, end‑to‑end governance.

Full Transcript

0:00 It's no surprise that the volume of data across multiple stores, locations, clouds, and even vendors is accelerating. But how do you manage this complexity and make it simple to leverage your data?

0:13 Hi, my name is Love Agarwal, and I'm a solution engineer for IBM Data and AI. Today I'm here to talk about one of our newest capabilities: AutoSQL.

0:21 So I want to first start with how we got here. Traditionally, we have seen many architectures that have big data warehouses with storage and compute tightly coupled, as well as data lakes in multiple clouds, with a lot of ETL pipelines to move and replicate data around for different BI and data science use cases. This has led to increasingly complex data pipelines, difficulty in scaling workloads, and unnecessary data duplication.

0:50 What we have seen become more common is a new, modern architecture which utilizes separate compute engines layered over inexpensive cloud object stores and data lakes, resulting in the concept of a data lakehouse.

1:02 So now let's get back to AutoSQL. AutoSQL is our new unified compute engine in IBM Cloud Pak for Data that can query both structured and unstructured data directly over your data lakes and cloud object stores, leverage data virtualization to access other external data sources, as well as support Spark.

1:22 In addition, AutoSQL brings integrated governance as part of the Cloud Pak for Data platform, which allows any ingested or connected data source to be fully governed end-to-end with custom policies. Now we have a single interface and engine to support both data science and BI across any data source environment, whether that be your on-prem data warehouse, S3 buckets in AWS, a data lake in Azure, Snowflake, Oracle, or Teradata. It doesn't matter.

1:54 All right, now let me show you with a quick demo how easy it is to access data from various sources with AutoSQL and our end-to-end hybrid data platform, IBM Cloud Pak for Data.

2:06 So I'll start by logging on to Cloud Pak for Data, and once I do that, I'm presented with my home page. Now I want to connect to some different data sources, so I'll go over to Platform connections under Data and click New connection. We can see there is an extensive list of both IBM sources as well as third-party sources. I'm going to connect to an S3 bucket that I have up in AWS. I'll put in all my credentials and click on Create connection.

2:41 So this connection will allow us to directly query our source. However, I also want to virtualize some data sources, so I'll click over into the Data virtualization tab.

2:54 Now if I look at my sources, we can see the many different instances that I have virtualized in my constellation view. And now I'm ready to actually do something that's very powerful, which is using data virtualization to combine tables from multiple sources into one virtual table for us to use.

3:14 So I'll go ahead and search for the tables that I have virtualized and join them into a new virtual table in a way that allows me to pick and choose exactly how I want it to be structured, based on the available attributes.

3:27 Okay, great. So now this new table is available for us to start using to build insights. I'll hop over to the Projects tab and open one of the data science projects that I've been working on. We can see there are several data science assets in here, but I'll go in and open one of the notebooks that I've already been working on.

3:56 Okay, in here we can see that I have the ability to query that same S3 bucket that I connected to earlier, as well as that new virtual table that I created. I can now use this data to build out whatever model I want and deploy it directly in the platform to make it available for consumption by my business analysts or other data consumers in my organization.

4:19 All right, so to recap: we connected to various data sources in the Cloud Pak for Data platform, we virtualized certain sources and created new virtual tables to interact with our data in new ways, and then we were able to query those sources right from our notebook to build and deploy models. And by the way, all this was done in a governed manner, where any governance policies that were defined in Cloud Pak for Data apply to all of the data sources that we connected to.

4:46 With AutoSQL, we're reducing costs by reduced migration and significantly less data duplication. We're reducing complex ETL work, as we saw when simply creating virtual tables. We're automating security and governance for trust and data validity and quality. We're leveraging one performant and scalable query engine for both big data and warehousing that can execute distributed and virtualized queries 53% faster than the industry standard. And we're avoiding lock-in with our vendor-agnostic design that allows the same engine to work with any data source on any cloud.

5:22 If you'd like to see more videos like this in the future, please click like and subscribe. And if you want to learn more about IBM Cloud Pak for Data, make sure to check out the link in the description.
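
The recap's point that one set of governance policies applies to every connected source can be pictured with a minimal sketch. This is not the Cloud Pak for Data policy API, which the video does not show. It assumes plain Python, with hypothetical helpers `mask_column` and `governed_query` and made-up rows, and illustrates a single policy list being enforced at a shared query layer regardless of where the rows came from.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Policy = Callable[[Row], Row]


def mask_column(column: str) -> Policy:
    """Return a policy that redacts one column when it is present in a row."""
    def apply(row: Row) -> Row:
        if column in row:
            # Build a new dict so the source row is left untouched.
            return {**row, column: "***"}
        return row
    return apply


def governed_query(rows: Iterable[Row], policies: List[Policy]) -> List[Row]:
    """Apply every policy to every row, whatever source the rows came from."""
    result = []
    for row in rows:
        for policy in policies:
            row = policy(row)
        result.append(row)
    return result


# Two hypothetical sources with different shapes.
s3_rows = [{"id": 1, "email": "a@example.com"}]
warehouse_rows = [{"id": 2, "email": "b@example.com", "region": "EU"}]

# The policy is defined once and reused for both sources.
policies = [mask_column("email")]
governed_s3 = governed_query(s3_rows, policies)
governed_wh = governed_query(warehouse_rows, policies)
print(governed_s3)  # [{'id': 1, 'email': '***'}]
print(governed_wh)  # [{'id': 2, 'email': '***', 'region': 'EU'}]
```

Defining the policy once and applying it at the shared layer means a newly connected source inherits the same masking without per-source configuration, which is the behavior the recap describes.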