Learning Library

Netflix's Iceberg: Revolutionizing Data for AI

Key Points

  • In 2017 Netflix’s massive catalog overwhelmed traditional relational databases, which couldn’t scale, lacked versioning, and required downtime to modify schemas.
  • To solve this, Netflix built an in‑house table format called Iceberg that stores data as immutable files in cloud object storage (e.g., Amazon S3), decoupling compute from storage.
  • Iceberg’s design introduced features like schema evolution without downtime, lazy loading of only needed data, and searchable metadata, dramatically improving performance and scalability for big‑data workloads.
  • Recognizing the broader industry benefit, Netflix open‑sourced Iceberg, enabling other companies to adopt the same data‑lake architecture.
  • The widespread adoption of Iceberg now underpins many enterprises’ data pipelines, making it easier to prepare massive datasets for downstream AI and analytics applications.
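
To make the first three points more concrete, here is a minimal sketch of the general idea in plain Python. It is not Iceberg's actual implementation or API. The names `DataFile`, `Snapshot`, `ToyTable`, the `year` column, and the `s3://example-bucket/...` paths are all hypothetical, chosen only for illustration. The sketch assumes that each immutable file carries min/max statistics in metadata, and that each write produces a new snapshot listing the files that make up the table at that point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataFile:
    """An immutable data file plus column statistics kept in metadata.

    `rows` stands in for the contents of a file in object storage.
    """
    path: str
    rows: tuple
    min_year: int
    max_year: int


@dataclass(frozen=True)
class Snapshot:
    """One version of the table: just a list of the files it contains."""
    snapshot_id: int
    files: tuple


class ToyTable:
    """Hypothetical toy table; writes never modify existing files."""

    def __init__(self) -> None:
        # Snapshot 0 is the empty table; list index == snapshot_id.
        self.snapshots = [Snapshot(0, ())]

    def append(self, path: str, rows: list) -> int:
        """Add a non-empty batch of rows as a new file and a new snapshot."""
        years = [r["year"] for r in rows]
        new_file = DataFile(path, tuple(rows), min(years), max(years))
        current = self.snapshots[-1]
        snap = Snapshot(current.snapshot_id + 1, current.files + (new_file,))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def scan(self, year: int, snapshot_id: Optional[int] = None):
        """Return rows for `year` and the number of files actually opened.

        Files whose min/max range cannot contain `year` are skipped using
        metadata alone. Passing `snapshot_id` reads an older version.
        """
        if snapshot_id is None:
            snap = self.snapshots[-1]
        else:
            snap = self.snapshots[snapshot_id]
        matched, files_opened = [], 0
        for f in snap.files:
            if f.min_year <= year <= f.max_year:
                files_opened += 1
                matched.extend(r for r in f.rows if r["year"] == year)
        return matched, files_opened


table = ToyTable()
first = table.append(
    "s3://example-bucket/titles/part-0",
    [{"title": "A", "year": 2015}, {"title": "B", "year": 2016}],
)
table.append("s3://example-bucket/titles/part-1", [{"title": "C", "year": 2017}])

# Latest version: only the second file's range can contain 2017.
print(table.scan(2017))
# Older version: the 2017 row did not exist yet.
print(table.scan(2017, snapshot_id=first))
```

Because every version is only a list of immutable files, reading an older snapshot needs no copy of the data, and a query can skip files based on metadata before opening any of them.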

Full Transcript

**Source:** [https://www.youtube.com/watch?v=B-hhzEbKbiE](https://www.youtube.com/watch?v=B-hhzEbKbiE)
**Duration:** 00:06:33

## Sections

- [00:00:00](https://www.youtube.com/watch?v=B-hhzEbKbiE&t=0s) **Netflix’s 2017 Data Overhaul** - In 2017 Netflix confronted the scaling limits of traditional relational databases, spurring an innovative data architecture that now serves as a model for how large enterprises prepare and manage data for AI applications.

Step into the time machine with me. We're going to talk about something that happened way back in 2017 that ended up powering a lot of the way big corporations today are thinking about prepping their data for AI. So there's an AI tie-in at the end here.

In 2017, Netflix had a problem. They had so many movies and shows, and people were watching them so much, that their traditional table structures and their traditional database were breaking down. At the time, databases worked a lot like I think most people's mental models of databases operate. Just to explain that in detail: they have rows, they have tables. You look up the row, you look up the table, it sits on a file on a server somewhere, and there you go.

Now, those kinds of databases do match what we imagine, but they have problems at scale. You cannot update them without shutting down the database. Imagine adding a column and having to shut down the entire database. It's a problem. They don't have versioning, so you can't go back in time and see what the data was like before. They don't have the ability to overwrite or edit, necessarily, in the same way. They have performance issues, because you have to look across the entire database; there's not really a way to do it only partially. I could go on. There are a lot of issues, and some of them include storage.

Netflix realized they needed to innovate. They needed to make something that actually served their needs in 2017, and so what they came up with was what we now know as Iceberg. They developed it in-house at Netflix in order to serve TV, movies, and shows effectively. So all of us streaming contributed to innovation. It's a nice feeling, right?

What they did was take the traditional model of the database and move it to the cloud. So it has a core file storage notion in the cloud. It would sit on Amazon S3, as an example; Netflix would use AWS, quite famously. That meant it was infinitely extensible. It didn't have to sit within traditional compute limitations. It also meant that you could design it differently than a traditional database. You could update it on the go. It had metadata that you could query. It did not have downtime if you dropped a column. You could use lazy loading on the database, which meant that you could pull only the part that you cared about at the time. You didn't have to pull the whole thing, which made it more performant. There were a lot of advantages to Iceberg that essentially added up to: big data works better here.

Now, Netflix could have kept that. They could have said, "No, no, no, this is ours, we don't want to share," and our model of competition in tech suggests that they would. But our model is bad, because big tech companies both compete and cooperate. In this case, tools like this that are the foundational elements of the internet, or that power our apps, tend to be open sourced more often than not. And so Netflix open-sourced it. They actually handed it over to the Apache Software Foundation, which is the software foundation for projects like this. They've been running since 1999. By the time 2021 rolled around, this little project that started at Netflix had been incubated by Apache and had become a top-level project at Apache, which means that it was considered stable, it was maintained by a rich community of developers, et cetera.

Now, you might wonder why. What possible gain would Netflix have to do this, other than being nice? And Netflix isn't necessarily known for that. I can think of one. If you are going to have a core part of your infrastructure that you have to maintain over time, it would be smarter if you could build it in such a way that you knew you could get talent in the door to maintain it, upgrade it, and improve it over time. Now, you could do that by laboriously training everyone who comes into your company on your special proprietary way of doing things. But because this is a foundational part of the internet, it makes more sense to just open source it (your competitive advantage is still your shows, it's not your database) and allow people who have learned it elsewhere to come to Netflix and practice their craft. It's a talent advantage.

Moving back: Apache makes this a top-level project. You're still wondering where the AI connection is. Well, it turns out they made it a top-level project just before ChatGPT exploded like a meteor on the scene, and this was a perfect open-source solution for major data lakes. Which means that when all of these companies around the world began to ask themselves, "How do we collect our data and get it into a state where we can actually build AI models against it, build AI models on top of it?", it was right there. And so, just like that, it began to be adopted all over the industry. Databricks has it, Snowflake has it, AWS has it, Azure has it. All of these cloud providers and data providers have figured out that they can use this open-source tool, developed by Netflix to help us with our scrolling and our movie watching, to enable large-scale data lakes that companies around the world can leverage for AI deployments.

I think that's a really cool story. And if you look at that and you say, "Wow, that's kind of neat," there are so many stories like that that have enabled the world we have today, and they're not always viciously competitive. This one actually exemplifies cooperation. Even after Netflix turned over the software to Apache, it took the work of hundreds of developers, thousands of developers, to mature the open-source software so it was actually stable enough for large-scale deployments at these companies. That's a big deal. The developer community is remarkably cooperative, and I don't think we talk about it enough. I wanted to do a story that actually shows how that kind of cooperation unlocks capabilities that we are building against to this day.

So there you go. That's the story of Iceberg and how it helped power the future of AI through cooperation. Cheers.
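
As a companion to the transcript's point that a column can be added or dropped without downtime, the following is a rough sketch of one way that can work: tracking columns by a stable numeric ID instead of by name, so that existing files never need rewriting. Iceberg does identify columns by field ID, but everything else here is a simplification. `Schema`, `read_row`, and the sample row are hypothetical names invented for this illustration, not part of any Iceberg library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema: maps stable field IDs to current column names."""
    fields: dict   # field_id -> column name
    next_id: int   # IDs are never reused, even after a drop

    def add_column(self, name: str) -> "Schema":
        updated = dict(self.fields)
        updated[self.next_id] = name
        return Schema(updated, self.next_id + 1)

    def drop_column(self, name: str) -> "Schema":
        kept = {fid: n for fid, n in self.fields.items() if n != name}
        return Schema(kept, self.next_id)


def read_row(stored: dict, schema: Schema) -> dict:
    """Project a stored row (keyed by field ID) through a schema.

    Columns added after the file was written read as None; dropped
    columns are simply ignored. The stored file is never modified.
    """
    return {name: stored.get(fid) for fid, name in schema.fields.items()}


# A row as it might sit in an old, immutable file, keyed by field ID.
old_row = {1: "Example Show", 2: 2017}

v1 = Schema({1: "title", 2: "year"}, next_id=3)
v2 = v1.add_column("rating")   # metadata-only change
v3 = v2.drop_column("year")    # metadata-only change
v4 = v3.add_column("year")     # new ID 4, unrelated to old ID 2

print(read_row(old_row, v1))
print(read_row(old_row, v3))
print(read_row(old_row, v4))
```

In this sketch each schema change produces only a new small metadata object, which is why no shutdown or file rewrite is needed. Re-adding a column named `year` gets a fresh ID, so values from the dropped column do not reappear.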