
Scaling Language Models: Size vs Performance

Key Points

  • LLM size is measured by the number of parameters, ranging from lightweight 300 M‑parameter models that run on smartphones to massive systems with hundreds of billions—or even approaching a trillion—parameters that require data‑center‑scale GPU clusters.
  • Model examples illustrate this spectrum: Mistral 7B has roughly 7 billion parameters (a small model), whereas Meta’s LLaMA 3 reaches about 400 billion parameters, placing it in the “large” category, and frontier research is pushing well beyond half a trillion.
  • More parameters generally boost capabilities—enabling better factual recall, multilingual support, and longer reasoning chains—but they also incur exponentially higher compute, energy, and memory costs, so “bigger is not always better.”
  • Progress is tracked with benchmarks like the Massive Multitask Language Understanding (MMLU) test; GPT‑3 (175 B parameters) scored ~44% (above average human), while newer, larger models achieve substantially higher scores, demonstrating that smaller models are closing the gap yet the most capable systems still benefit from sheer scale.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=0Wwn5IEqFcg](https://www.youtube.com/watch?v=0Wwn5IEqFcg)
**Duration:** 00:09:18

## Sections

- [00:00:00](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=0s) **How Large Are LLMs?** - The segment explains that LLM size is measured by parameter count, ranging from 300 million to hundreds of billions (or over a trillion), with examples like Mistral 7B and LLaMA 3 400B, and discusses the trade‑off between increased capability and higher compute, energy, and memory costs.
- [00:03:07](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=187s) **AI MMLU Performance Over Time** - The speaker contrasts human baseline scores on the MMLU benchmark with GPT‑3’s 44% and today's frontier models reaching the high 80s, emphasizing how the 60% competence threshold fell from a 65‑billion‑parameter model in early 2023 to much smaller models within months.
- [00:06:11](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=371s) **Scale Advantages in AI Applications** - The speaker outlines tasks where large language models outperform smaller ones, such as multi‑language code generation, processing lengthy documents, and high‑fidelity multilingual translation, while also noting cases like on‑device AI where compact models are preferable.
- [00:09:14](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=554s) **Decision Driven by Use Case** - The final choice should be based on the specific requirements and context of your application.

## Full Transcript
0:00 The first L in LLM stands for large. But how large is large? Well, today's language models cover a huge range of sizes, from lightweight networks with maybe 300 million parameters that can run entirely on a smartphone, to titanic systems with hundreds of billions, or perhaps even approaching a trillion, parameters that require racks of GPUs in a hyperscale data center. And yes, size in this context is measured in parameters. That's how we measure the size of an LLM, and parameters are the individual floating-point weights that a neural network tweaks while it trains. Collectively, these parameters encode everything the model can recall or reason about.

0:56 Let's talk about some specific models. Mistral 7B is an example of a small model; the "7B" there says it contains roughly 7 billion of those weights, or parameters. By comparison, we could take a look at Llama 3 from Meta. This one is a much bigger model: 400B. So we would put this in the large LLM category. In fact, some frontier models are much bigger than that, with room to push well beyond half a trillion parameters.

1:42 In broad strokes, extra parameters buy extra capability. Larger models have more room to memorize more facts, support more languages, and carry out more intricate chains of reasoning. But the trade-off, of course, is cost. They demand exponentially more compute, energy, and memory, both to train in the first place and then to run in production. So the story isn't simply "bigger is better." Smaller models are catching up and punching far above their weight class.

2:21 Let me give you an example. We measure progress in language model capability with benchmarks, and one of the most enduring benchmarks is the MMLU.
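The memory side of that cost trade-off follows from simple arithmetic: each parameter is one stored weight, so the memory just to hold the weights is parameter count times bytes per weight. A minimal sketch, assuming fp16 (2-byte) weights and ignoring activations, KV cache, and optimizer state:

```python
# Rough memory-footprint arithmetic for LLM weights.
# Assumption (mine, not from the talk): weights stored in fp16, i.e. 2 bytes
# each; activations, KV cache, and optimizer state are ignored.

BYTES_PER_PARAM_FP16 = 2

def weight_memory_gb(num_params: float,
                     bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate gigabytes needed just to hold the weights."""
    return num_params * bytes_per_param / 1e9

# Model sizes mentioned in the talk.
for name, params in [("300M phone-scale model", 300e6),
                     ("Mistral 7B", 7e9),
                     ("Llama 3 400B", 400e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of fp16 weights")
```

The same arithmetic explains why a 300M model fits comfortably on a phone while a 400B model needs multiple data-center GPUs even before any inference overhead.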
2:35 That's Massive Multitask Language Understanding. The MMLU contains more than 15,000 multiple-choice questions across all sorts of subjects, like math, history, law, and medicine, and anybody taking the test needs to combine factual recall with problem solving across many fields. So the test is a convenient, if somewhat imperfect, snapshot of broad general-purpose ability.

3:07 Now, if you took the MMLU and you were just guessing at random, you would score around 25% on the test. But if you weren't guessing at random, if you're just a regular person, and you took the test, you might score somewhere around 35%. It's a pretty hard test. But what about a domain expert? Well, a domain expert would score far higher, something like 90%, on questions that are within their specialty.

3:48 So that's humans. What about AI models? Well, when GPT-3 came out in 2020, a 175-billion-parameter model, it posted a score on the MMLU of 44%. That's pretty respectable. It's better than the average person, but it's far from mastery. What about today's models? Well, if we take a look at today's frontier models, the best models we have, they can score in the high 80s, maybe 88%, on the test.

4:29 But let's use a different benchmark: 60%. We can say that is a practical cutoff, because above that line a model begins to look like a competent generalist that can answer everyday questions. And what is striking is how quickly that 60% barrier has fallen to ever smaller models. In February of 2023, the smallest model that could score above 60% was Llama 1 65B, meaning 65 billion parameters. But just a few months later, by July of the same year, there was Llama 2 34B.
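The human and model baselines quoted above can be lined up against that 60% cutoff. The only derived number here is the random-guess expectation on four-option questions; the rest are the figures from the talk:

```python
# MMLU score ladder from the talk. The one computed value is the random-guess
# baseline: 4 answer choices -> expected score of 1/4.

NUM_CHOICES = 4
random_baseline = 1 / NUM_CHOICES  # 0.25, i.e. ~25%

scores = {  # percent correct, as quoted in the talk
    "random guessing": random_baseline * 100,
    "average person": 35,
    "GPT-3 (175B, 2020)": 44,
    "frontier models (today)": 88,
    "domain expert (in specialty)": 90,
}

COMPETENT_GENERALIST = 60  # the practical cutoff the talk proposes

for who, score in scores.items():
    verdict = "clears" if score >= COMPETENT_GENERALIST else "misses"
    print(f"{who}: {score:.0f}% ({verdict} the 60% bar)")
```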
5:17 It could do it with barely half the parameters. Then if we fast forward to September of the same year, that saw Mistral 7B join the crowd, which we know is a 7-billion-parameter model. And then in March of 2024, Qwen 1.5 MoE became the first model with fewer than 3 billion active parameters to clear 60%. In other words, month by month, we are learning to squeeze competent generalist behavior into smaller and smaller footprints.

5:53 So smaller models are getting smarter, and I think the next natural question becomes: which model should I put into production, large or small? The answer, of course, depends on your workload, your latency, your privacy constraints, and, let's be honest, the size of your GPU budget.

6:11 Now, I'm generalizing here, and your case may be different, but certain tasks do still reward sheer scale. So let's talk about some large-model use cases. One of the first really comes down to broad-spectrum code generation. A small model can master a handful of programming languages, but a frontier model has room for dozens of ecosystems and can reason across multi-file projects, unfamiliar APIs, and weird edge cases.

6:46 Another good example is when you have document-heavy work to process. We might need to ingest a very large contract, a medical guideline, and a technical standard, and a large model's longer context window means it can keep more of the source text in mind, reducing hallucinations and improving citation quality.

7:11 The same scale advantage appears in high-fidelity multilingual translation as well, where we're going from one language to another, and the extra parameters let the network carve out richer subspaces for each language, capturing idioms and nuance that smaller models might gloss over.
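The shrinking-footprint timeline described earlier can be tabulated directly. Dates are the months named in the talk (the day-of-month is a placeholder), and the Qwen entry uses the talk's "fewer than 3 billion active parameters" claim rather than an exact published figure:

```python
# Smallest model known, at each date, to clear 60% on MMLU, per the talk.
from datetime import date

milestones = [
    (date(2023, 2, 1), "Llama 1 65B",  65e9),
    (date(2023, 7, 1), "Llama 2 34B",  34e9),
    (date(2023, 9, 1), "Mistral 7B",    7e9),
    (date(2024, 3, 1), "Qwen 1.5 MoE",  3e9),  # <3B *active* params (MoE)
]

baseline = milestones[0][2]  # Llama 1 65B, the Feb 2023 starting point
for when, name, params in milestones:
    print(f"{when}: {name:13s} {params/1e9:>4.0f}B params "
          f"(~{baseline/params:.0f}x smaller than Llama 1 65B)")
```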
7:32 But look, there are some cases where small models are not only good enough but outright preferable. So let's talk about some of those use cases. One of those comes down to on-device AI. Keyboard prediction, voice commands, offline search: that stuff lives or dies by sub-100-millisecond latency and strict data privacy, and small models that run on device are great for that.

8:04 Everyday summarization is another sweet spot. In a news summarization study, Mistral 7B Instruct achieved ROUGE and BERTScore metrics that were statistically indistinguishable from a much larger model, GPT-3.5 Turbo, and that's despite running 30 times cheaper and faster.

8:25 Another good use case comes down to enterprise chatbots. With these, a business can fine-tune a 7- or 13-billion-parameter model on its own manuals, and it can reach near-expert accuracy. IBM found that the Granite 13B family matched the performance of models five times larger on typical enterprise Q&A tasks.

8:51 So the rule of thumb is: for expansive, open-ended reasoning, bigger does still buy more headroom. For focused skills like summarizing and classifying, a carefully trained small model delivers perhaps 90% of the quality at a fraction of the cost.

9:10 So, go big or stay small? In the end, it's your use case that will drive the decision.
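The large-versus-small rule of thumb the talk closes on can be captured as a toy routing table. The task labels and tier strings below are my own illustration of the talk's guidance, not any standard API:

```python
# Toy routing heuristic echoing the talk: focused skills go to a small model;
# open-ended, multi-document, or broad code-generation work goes to a large one.
# All names here are illustrative, not a real library interface.

SMALL_MODEL_TASKS = {"summarization", "classification",
                     "on_device_assistant", "enterprise_qa"}
LARGE_MODEL_TASKS = {"broad_code_generation", "long_document_analysis",
                     "multilingual_translation"}

def pick_model_tier(task: str) -> str:
    """Map a task label to a model-size tier, per the talk's rule of thumb."""
    if task in SMALL_MODEL_TASKS:
        return "small (7B-13B class)"
    if task in LARGE_MODEL_TASKS:
        return "large (frontier class)"
    return "unknown: benchmark both tiers on your own workload"

print(pick_model_tier("summarization"))
print(pick_model_tier("long_document_analysis"))
```

In practice the "unknown" branch is the honest default: as the talk says, workload, latency, privacy, and budget decide, so measure on your own data before committing to a tier.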