
Scaling Generative AI: Challenges and Solutions

Key Points

  • Model sizes have exploded from thousands of parameters to billions and even trillions, demanding ever‑more powerful hardware just to train and run them.
  • The amount of data consumed by these models is growing orders of magnitude faster than human reading capacity, with synthetic data projected to exceed real‑world data by around 2030.
  • User demand is soaring—ChatGPT jumped from 1 million users in five days to 100 million a year later—creating an “unfathomable” compute load when model size, data, and requests are multiplied together.
  • To handle this scale, an agentic architecture that orchestrates specialized models is required, focusing on efficient inference and resource management.
  • Practical scaling tactics include batch‑based generation paired with CDN caching and edge‑side personalization, as well as cache‑centric approaches that absorb hundreds of thousands of requests per second without overloading GPU hardware.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=RLdD831I8hk](https://www.youtube.com/watch?v=RLdD831I8hk)
**Duration:** 00:07:28

## Sections

- [00:00:00](https://www.youtube.com/watch?v=RLdD831I8hk&t=0s) **Exponential Growth Challenges in Generative AI** - The speaker explains how model size, data volume, and user demand are scaling exponentially, creating hardware, cost, and data‑availability hurdles, and predicts synthetic data will surpass real data by 2030.
Running these generative AI algorithms at scale can be very challenging, overwhelming, and costly. In fact, there are three areas I want to highlight where exponential growth is occurring over time; if I were to log-scale this, it would look much like this. The first is model size. At the beginning, these models had thousands of parameters; then they moved to millions, and now billions and even trillions. This requires serious hardware just to run and train these very large algorithms.

The second is data size. Data consumption is growing; think about Granite, or think about Llama. A human can read about a million words every single year, but a system like this can read about six orders of magnitude more in just a single month. We are actually beginning to run out of data: by 2030, I think we are going to see synthetic data overtake real-world data.

The third is demand. Over time, these models have become integral to our daily lives. Look at ChatGPT: just five days after it was released it had 1 million users, and about a year later there were about 100 million users. Every time we have to write a piece, we can solve the so-called blank-page problem by using these models to prompt us, to tell us what we should think about and how we should write. Now, if I take these three and multiply them together, this gives us an unfathomable compute scale that we need in order to run them (and again, this is a log scale). What this means is that we need an agentic architecture to run these specialized models and to
help solve the problem of making these more usable. So what can we really do to make these systems more manageable and usable for inference? Well, let's go ahead and find out.

Generative AI algorithms can be scaled across hundreds of GPUs. In fact, you can put them on V100s, A100s, or even the different H-series cards that NVIDIA provides, or hardware from other vendors. Even so, with hundreds of thousands of different types of requests per second, this can strain the system, and it can also strain the underlying hardware. To help out, let's look at a couple of strategies we can take.

The first is called a batch-based generative AI system. Here, we create very dynamic fill-in-the-blank sentences from the output of these large language models. We then store them on a content delivery network (CDN), cached all around the world. At the edge, we pull in those fill-in-the-blank sentences, insert the personalized information, and serve the result to the user, so it becomes a very personalized experience.

The second is cache-based generative AI. Here, the whole strategy is to cache as much content as we possibly can on CDN servers around the world. We find the most common cases we can generate content for and push those up; the least common content we generate on demand and then serve. This gives us the best of both worlds: maybe 90% of the requests per second are served from cache, and the other 10% is created on demand.

Now, the other approach is what's called an agentic architecture, and this type of
architecture is emerging. It is where you take these very large, complex models and break them down into smaller models so that they are specialized, almost like a mixture of experts, and these agents in turn can communicate with each other. One example would be having a large language model judge the output of another large language model. You could even have a large language model be self-introspective and then pass its output to another specialized model that transforms that information, which is then served up. These smaller models require smaller footprints, which can then run and be scaled across hundreds of different types of GPUs.

Now, these types of models may not run on commodity machines or on whatever GPUs you might have available for your team, such as a 32 GB GPU. Some can: there are some Granite models that are smaller, and even some Llama models that could fit, and other vendors have offerings as well, but the vast majority of the most powerful models need to run on other types of hardware.

To combat some of this, one option is a technique called model distillation. The whole idea is that we want to extract the information that really matters to the domain we are working in. We can take that information and do in-context learning, but traditionally you would teach another, smaller model through gradient updates, so it becomes much more powerful and fine-tuned and can be applied to your problem in a very accurate way.

The second method is called a student-teacher approach. Here, instead of looking at the data per se, we want to create what's called a new behavior
and we can create a new skill, or a composite skill, based on the task we want. It could be text extraction, it could be summarization, it could be just fluent writing. You have a teacher model that knows some of the task, and you could even have a bank of teacher models that are asked questions by your student models, so the data flows in one direction and the output from your teacher models goes back to the students, so they can learn over time and develop the types of skills they need.

Now, if I want another approach to shrink a model, quantization is a nice one. Here, I want to compress a model into a much smaller footprint, so I might take these 32-bit floating-point numbers and make them much smaller: an 8-bit representation of that floating-point number, or even a 4-bit representation. There are different pros and cons depending on when you do it. One option is to shrink the model before training. This requires more compute resources when you train, but it creates a smaller model that can still maintain much of its accuracy, however you measure accuracy, when you apply it at inference time. You could also quantize post-training. This places lighter requirements on compute when you train, but when you apply the model at inference, your accuracy might go down. Those trade-offs are something you need to keep in mind as you apply this compression technique.

If you like this video and want to see more like it, please like and subscribe. If you have any questions or want to share your
thoughts about this topic, please leave a comment below.
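To make the "multiply them together" point from the talk concrete, here is a back-of-envelope sketch. All of the magnitudes below are illustrative assumptions, not figures from the video; the only point is that on a log scale the product of the three growth axes is the sum of their exponents.

```python
import math

# Hypothetical magnitudes for the three growth axes the speaker multiplies
# together; none of these figures come from the talk.
model_params = 1e12        # a trillion-parameter model
tokens_consumed = 1e12     # tokens read during training
requests_per_day = 1e9     # user requests per day

# Multiplying the three axes gives the overall compute scale; on a log
# scale the product is just the sum of the individual exponents.
combined = model_params * tokens_consumed * requests_per_day
print(round(math.log10(combined)))  # 33 -> the exponents 12 + 12 + 9 add up
```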
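The batch-based tactic described above can be sketched as follows. The template names, fields, and `TEMPLATE_CACHE` structure are hypothetical stand-ins for fill-in-the-blank output generated in batch by an LLM and pushed to a CDN.

```python
# Hypothetical pre-generated templates, standing in for batch LLM output
# that would be cached on a CDN around the world.
TEMPLATE_CACHE = {
    "welcome_email": "Hi {name}, your {plan} plan renews on {date}.",
    "usage_alert": "Hi {name}, you have used {percent}% of your quota.",
}

def render_at_edge(template_id: str, profile: dict) -> str:
    """Pull a cached fill-in-the-blank template and insert per-user fields."""
    template = TEMPLATE_CACHE[template_id]   # CDN cache lookup
    return template.format(**profile)        # edge-side personalization

print(render_at_edge("usage_alert", {"name": "Ada", "percent": 87}))
# -> Hi Ada, you have used 87% of your quota.
```

No model is invoked per request here; the expensive generation happens once in batch, and the edge only does cheap string substitution.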
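The cache-centric tactic can be sketched the same way. Here `functools.lru_cache` stands in for a CDN or server-side cache, and `generate_with_model` is a hypothetical stand-in for an expensive GPU-backed LLM call; only cache misses ever reach the model.

```python
from functools import lru_cache

def generate_with_model(prompt: str) -> str:
    """Stand-in for an expensive GPU-backed LLM call."""
    return f"generated:{prompt}"

CALLS = {"model": 0}

@lru_cache(maxsize=1024)          # stands in for a CDN / server-side cache
def serve(prompt: str) -> str:
    CALLS["model"] += 1           # count real model invocations only
    return generate_with_model(prompt)

# A common prompt hits the model once; repeats are served from cache.
for _ in range(10):
    serve("what is generative ai?")
serve("rare, long-tail prompt")   # long-tail case generated on demand

print(CALLS["model"])  # 2 -> one miss per distinct prompt, 11 requests served
```

This is the "best of both worlds" split from the talk: the common cases (the bulk of the request rate) never touch a GPU, while rare requests are generated on demand.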
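A minimal sketch of the agentic pattern described above, where one model judges another's output before a downstream specialist transforms it. The `summarizer`, `judge`, and `transformer` functions are toy stand-ins, not real models; in practice each would be a smaller, specialized model behind its own endpoint.

```python
def summarizer(text: str) -> str:
    """Toy 'specialist': keep only the first sentence."""
    return text.split(".")[0] + "."

def judge(candidate: str) -> bool:
    """A second model judging the first one's output (here: a length check)."""
    return 0 < len(candidate) <= 80

def transformer(candidate: str) -> str:
    """A downstream specialist that reformats the approved output."""
    return candidate.upper()

def orchestrate(text: str) -> str:
    """Agentic flow: generate, judge, repair if rejected, then transform."""
    draft = summarizer(text)
    if not judge(draft):
        draft = draft[:77] + "..."   # fall back if the judge rejects it
    return transformer(draft)

print(orchestrate("Agents can specialize. Each one stays small."))
# -> AGENTS CAN SPECIALIZE.
```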
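One common way to implement the distillation idea (teaching a smaller model from a larger one via gradient updates) is to minimize the student's cross-entropy against the teacher's temperature-softened probabilities. The talk does not spell out a formulation, so this sketch assumes that standard one, with made-up logits.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; a higher temperature exposes more of
    the teacher's knowledge about relative class similarities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets,
    the quantity a gradient update would minimize."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# The loss shrinks as the student's logits approach the teacher's.
teacher = [4.0, 1.0, 0.5]
far = distillation_loss(teacher, [0.0, 3.0, 1.0])
near = distillation_loss(teacher, [3.9, 1.1, 0.4])
assert near < far
```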
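The quantization trade-off can be illustrated with symmetric 8-bit rounding. The weights below are made up, and real systems typically use per-channel scales and calibration data; this only shows the core compression-versus-rounding-error trade.

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 8-bit representation."""
    return [qi * scale for qi in q]

weights = [0.52, -1.73, 0.004, 0.91]      # hypothetical fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# 8-bit storage is 4x smaller than fp32; the cost is rounding error,
# bounded by half the quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert all(-127 <= qi <= 127 for qi in q)
assert max_err <= scale / 2 + 1e-12
```

A 4-bit scheme would shrink the range to [-7, 7], halving storage again while widening the error bound, which is the accuracy trade-off the talk warns about.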