Scaling Multilingual Data to Trillion Tokens

Key Points

  • The “data‑as‑oil” metaphor highlights a looming scarcity of high‑quality training data for large language models, prompting a search for scalable pathways beyond the current trillion‑token datasets.
  • Scaling to ~10 trillion tokens requires a truly multilingual corpus — roughly 30‑40 % English and the rest diverse languages like Chinese, Hindi, French, and Spanish — supported by automated cleaning, deduplication, and adaptable tokenizers that respect morphological differences.
  • Achieving ~100 trillion tokens demands incorporating multimodal sources such as video streams, high‑quality transcriptions of podcasts and calls (with permission), and vision data, turning the dataset into a continuous, real‑time web‑scale ingest.
  • This massive expansion will need unprecedented compute, storage, and network capacity, as well as next‑generation transformer architectures capable of unifying text, image, and video modalities within a single model (a back‑of‑the‑envelope sizing sketch follows this list).
  • Current models (e.g., ChatGPT) feel “unnatural” in non‑English languages because their training sets are heavily English‑centric; balancing language representation is the simplest and most effective route to truly global, high‑performance LLMs.
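
To make these magnitudes concrete, here is a rough sizing sketch. The ~4 bytes-per-token figure is an assumption (typical for English text with common tokenizers; it varies by language), and it covers only tokenized text, not raw sources:

```python
# Back-of-the-envelope sizing for the token ladder discussed below.
# Assumption: ~4 bytes of UTF-8 text per token (varies by language and tokenizer).
BYTES_PER_TOKEN = 4

TIERS = {
    "today (1 trillion)":         1e12,
    "multilingual (10 trillion)": 1e13,
    "multimodal (100 trillion)":  1e14,
    "sensors (1 quadrillion)":    1e15,
    "zettabyte era (10**20)":     1e20,
}

for name, tokens in TIERS.items():
    tb = tokens * BYTES_PER_TOKEN / 1e12  # terabytes of tokenized text
    print(f"{name:>28}: {tokens:.0e} tokens = {tb:,.0f} TB tokenized")
```

Note that the tokenized text itself stays surprisingly small; the "tens to hundreds of zettabytes" in the transcript refers to the raw video, audio, and sensor streams that must be ingested and stored before being compressed down to tokens.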

Source: https://www.youtube.com/watch?v=wIWtp0KZa3c
Duration: 00:12:29

Sections

  • 00:00:00 (https://www.youtube.com/watch?v=wIWtp0KZa3c&t=0s) Scaling Data for Multilingual LLMs: The speaker argues that advancing beyond trillion-token corpora requires curated, multilingual datasets and automated cleaning tools to provide useful, diverse training data for large language models.

Full Transcript
So this past fall at the NeurIPS conference, Ilya Sutskever said that we are in a world that has data as oil, a world where data is running out, and that phrase has been haunting me ever since. I want to suggest a pathway to more data. Right now, large language models can get trained on something like a trillion tokens of very curated text. It's not the whole internet; I mean, you can start with the whole internet, but it's not clean, curated text.

There are problems, though. That's a static snapshot, and the internet continues to grow. There are also, of course, larger training sets of private data that are generated all the time. So I asked myself: what kinds of breakthroughs are needed to lead to useful data, not total data, but useful data for large language models? Basically, what is the scaling framework to move from a trillion tokens up past 10 trillion tokens and beyond? This is what I want to talk about today.

If we scale to 10 trillion tokens in our training data set, we're going to need to do multilingual. We're going to need to do very focused, curated expansion of historically underrepresented texts. We have to go beyond our English-focused text corpus for these LLMs to a much more multinational one. It has to be not 80% or 90% English but more like 30-40% English, with a lot of Chinese, a lot of Hindi, a lot of French, a lot of Spanish. It has to better represent the world's actual language set.

We also need much better curation tools. We'll need to automate cleaning, we have to have better deduplication pipelines that handle diverse languages and formats, and we have to have more flexible tokenizers that adapt to morphological differences across languages. If we can get there, models are going to become much more truly multilingual, and they'll capture a broader range of our expression. I will say, as someone who knows Indonesian, that the model doesn't feel as natural in Indonesian as it does in English, and that would make sense, because ChatGPT and other models are mostly trained on English data. So to me, the simplest path to 10 trillion tokens is actually making the training set truly reflective of the languages that we speak as a global community.
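As an illustration of the kind of curation the speaker describes, here is a minimal Python sketch of two of those steps: exact deduplication via normalized-text hashing, and temperature-based language rebalancing of the kind used in multilingual pretraining (mT5-style sampling, where p(lang) is proportional to n(lang)**alpha). The toy corpus, the alpha value, and the helper names are illustrative assumptions, not anything from the talk:

```python
import hashlib
from collections import Counter

def normalize(text: str) -> str:
    """Cheap normalization before hashing: lowercase, collapse whitespace."""
    return " ".join(text.lower().split())

def dedupe(docs):
    """Exact deduplication: keep the first document for each content hash.
    (Production pipelines add near-duplicate detection, e.g. MinHash/LSH.)"""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def language_weights(docs, alpha=0.5):
    """Temperature-based sampling weights: p(lang) proportional to n(lang)**alpha.
    alpha < 1 upweights low-resource languages relative to their raw share."""
    counts = Counter(d["lang"] for d in docs)
    scores = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(scores.values())
    return {lang: s / total for lang, s in scores.items()}

# Toy corpus: heavily English-skewed, with one exact duplicate.
corpus = [
    {"lang": "en", "text": "The cat sat on the mat."},
    {"lang": "en", "text": "The cat sat on   the mat."},  # dup after normalization
    {"lang": "en", "text": "Data is the new oil."},
    {"lang": "hi", "text": "Namaste duniya"},
    {"lang": "id", "text": "Selamat pagi"},
]

clean = dedupe(corpus)
print(len(corpus), "->", len(clean), "docs after dedup")
print(language_weights(clean, alpha=0.5))  # en downweighted vs. its raw 50% share
```

In real pipelines, the alpha temperature is the main knob for pushing a mix from roughly 90% English toward the 30-40% the talk targets, without discarding any data outright.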
Let's go a step further: what's the path to 100 trillion tokens in our training sets that are high quality? I think video, obviously: real-time web streams, continuous ingestion of high-quality web content, social media streams. Getting large-scale transcription going of good-quality podcasts and of phone calls, obviously with permission, if they're high quality. And multimodal as well, like getting vision involved, where we have high-quality vision tokens. Those are all pieces we have to cobble together to make the jump from 10 trillion to 100 trillion tokens. It's going to require an immense amount of compute, a lot of storage, and huge network capacity; it's going to dwarf anything we've got today. And it's going to take transformers that can absolutely unify text, images, and even video in the same architecture. We are just now getting to truly native multimodal architectures for text and images. I know text and images and video are on the horizon at OpenAI, but it needs to be scalable.

And if we're at 100 trillion tokens and we have continuous feeds, it opens the possibility of continuously updated models that stay current on world events. Don't ask me quite how we'd architect that; I think that's two or three jumps beyond what we have now. But it's okay to dream a little bit. I'm mostly trying to color in, roughly, what it would take from a tokenization perspective to get there.

Let's go beyond: what's past 100 trillion tokens? If we get into the quadrillion space for tokens, or 10 to the 15th power, we're talking about pulling in sensor logs. This is where robotics starts to come into play. I know a lot of folks have been very excited about robotics getting into the physical space, because embodied experience should dramatically expand our healthy token input selection for LLMs. So: sensor logs from cameras, lidar, tactile sensors, motor commands, the Internet of Things, and wearables, and being able to pull all of that in and actually train on it as good-quality tokens, so that the model can understand, embody that understanding, and output it. It would take even bigger GPU and TPU clusters. It would take systems that can learn from interactions and unlabeled sensor streams. It would take tools that can compress raw visual and audio frames into tokens that are semantically meaningful, and they would have to do all of this at an unheard-of scale. So again, that's yet another set of technical challenges we'd have to overcome.
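One standard way to turn raw frames into discrete, semantically meaningful tokens, which is one plausible reading of what the speaker is gesturing at here, is vector quantization: embed a patch of pixels or audio, snap it to the nearest entry in a learned codebook, and emit that entry's index as the token. A minimal NumPy sketch of just the quantization step, with a random stand-in codebook instead of a learned one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned codebook (e.g., from a VQ-VAE): 512 code vectors of dim 64.
CODEBOOK = rng.normal(size=(512, 64))

def tokenize_frames(frames: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest codebook vector.
    frames: (n_frames, 64) array of patch/frame embeddings.
    Returns an (n_frames,) array of integer token ids in [0, 512)."""
    # Squared L2 distance from every frame to every code vector.
    dists = ((frames[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy "video": 30 frame embeddings (in practice these come from an encoder network).
frames = rng.normal(size=(30, 64))
tokens = tokenize_frames(frames)
print(tokens[:10])  # discrete tokens a transformer can ingest alongside text
```

The heavy lifting in a real system is learning the encoder and codebook so that nearby indices correspond to similar content; the compression ratio, frames in versus tokens out, is what would make video and sensor streams tractable as training data.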
And then if we want to go even farther, to get into zettabytes, like 10 to the 20th tokens and beyond, now we have to look at three-dimensional scans, full audio streams, simulation logs, Internet of Things sensor data at city scale, and data across multiple industries in real time, in standardized formats: healthcare, manufacturing, agriculture, autonomous vehicles. Our infrastructure has to handle tens to hundreds of zettabytes. We need modular architectures that can handle multi-trillion-parameter scales. It's just an absolutely mind-boggling task.

So there, in just a few minutes, I've taken you from where we're at now with trillion-token models, up to 10 trillion tokens, up to 100 trillion tokens, up to a quadrillion, and then to the zettabyte era. I don't know when we'll get there, and one of the interesting things is, I don't know if we have to get there for artificial general intelligence that's meaningful. If you think about it, it is notable to me that Claire Vo was saying that product requirements documents are actually higher quality, in general, without reasoning models. In other words, the stuff we have right now, the models we have today, the GPT-4-class models we use, are good enough for that piece of work. That doesn't mean I'm under the illusion that we have good-enough models for everything else and all general-purpose work, obviously not, we have other things to do. But I don't think it's eminently clear that we need to get to zettabyte scale in order to have general intelligence, or even to unlock superintelligence and start to kick off a chain reaction of intelligence that's self-improving. We may be able to get there sooner.

One of the reasons we may be able to get there sooner is reasoning-scale models, which don't even touch any of this: you can scale reasoning by scaling compute at test time, obviously, regardless of the training data set size. So imagine having a much, much smarter trained model at the 10 trillion or 100 trillion token scale, and also layering reasoning on top of that as a second scaling law.
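A minimal illustration of what "scaling compute at test time" means in practice is best-of-n sampling: draw several candidate answers and keep the one a verifier accepts. The toy task, the failure rate, and the function names below are assumptions for demonstration; real systems use learned reward models or chain-of-thought verification rather than an exact checker:

```python
import random

random.seed(0)

def noisy_solver(a: int, b: int) -> int:
    """Stand-in for a model: returns a*b, but is wrong ~60% of the time."""
    answer = a * b
    return answer if random.random() < 0.4 else answer + random.randint(1, 9)

def verifier(a: int, b: int, candidate: int) -> bool:
    """Stand-in for a verifier/reward model; here we can check exactly."""
    return candidate == a * b

def best_of_n(a: int, b: int, n: int) -> int:
    """Spend n samples of test-time compute; return a verified answer if any."""
    candidates = [noisy_solver(a, b) for _ in range(n)]
    verified = [c for c in candidates if verifier(a, b, c)]
    return verified[0] if verified else candidates[0]

# Accuracy climbs with test-time samples, with no change to the "model" itself.
for n in (1, 4, 16):
    trials = [(random.randint(2, 99), random.randint(2, 99)) for _ in range(1000)]
    acc = sum(best_of_n(a, b, n) == a * b for a, b in trials) / len(trials)
    print(f"n={n:>2}: accuracy={acc:.2f}")
```

The second scaling law is exactly this curve: with a per-sample success rate p, best-of-n succeeds at roughly 1 - (1 - p)**n, improving with more samples independent of the training-token budget.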
And so when I think about the path over the next few years, I think it makes a lot of sense to be really honest about some of these big Lego blocks that are steps along the pathway to artificial-intelligence scale. We need to be honest that we have barely scratched the surface of high-quality tokens, but at the same time, getting 10x, 100x, 1,000x and more high-quality tokens is going to be very, very difficult. It's going to be difficult from a research perspective, just understanding how to do it at scale. It's going to be difficult from an industrial data-management perspective: building the data centers, ingesting the data, cleaning the data at a scale we've never touched before, et cetera. It's going to be difficult from the perspective of serving these models, because they're absolutely huge models to serve. And we have that crosscutting factor of reasoning, where model makers will have to decide how big these models need to be in order to be useful for economic work, especially if you can use reasoning as a sort of cheat code to scale up.

This is what was on Satya Nadella's mind when he was talking about the value of being a renter in the gold rush. He was talking about Azure and how he wants to rent data centers to people who are interested in model making, and implicitly he was saying he was kind of out of the model-making business as just a Microsoft project. I think he was seeing the immense data costs here, and he was like, you know what, I'd rather just rent these guys all the data centers they need; that's a money-making proposition. Because if you think about it, this is an immense amount of effort, scaling up data like this, and the value you get is something that you have to continue to accelerate in order to harvest. OpenAI is right now riding the tiger's tail: they have gotten a little lead through o1 and o3, and they need to continue to hold on to that lead in order to continue to make the case to monetize for enterprise. So they are locked into the poker betting cycle for model makers; they can't get out. And I think Satya looked at that and was like, that's a very expensive game to get into. You can put all of this investment in, and then you have to be locked in to keep investing more and more, farther and farther, because right behind you are people like DeepSeek, who are relentlessly distilling models down, and people like Qwen, who are distilling DeepSeek models, and I think there was another Chinese model today that distilled even further. Now you have tiny, tiny models, like billion-parameter models or less, that are really, really good, and those are only possible because of the original training work that was done on the GPT-4-class models. But the fact that you can train once and then distill across the world means that you can only harvest the value of these models if you keep maintaining your edge, if you keep growing and growing and growing and maintaining your edge no matter what.

So when I look at this token ladder, up from a trillion tokens where we are now to the zettabyte era, what I think is: OpenAI is kind of locked into this arms race, Anthropic is locked into this arms race, Gemini is pretty locked into this arms race, arguably Meta is locked into this arms race, and, whether they like it or not, I think some of the Chinese LLM model makers are locked in as well. They all need to get to the winner's circle, and they don't know how to define that; they only know they have to keep climbing. That's very good for consumers and very good for builders, but there are immense technical challenges coming up here, and I'll be really curious to see how they balance the challenge of getting into these new high-quality tokens versus reasoning.

One more thought I'll leave you with: this entire ladder is very friendly to Nvidia. It's friendly in terms of GPUs; it's friendly in terms of the wearables and the lidar and the robotics that have to happen for some of the later stages with sensors. You're not going to get robotics at scale, I think, without some kind of Nvidia stack underneath. And I might be wrong; maybe there'll be another company that will come along that will be able to scale up GPUs and scale up sensor collection, but my gut says Nvidia is already here, with their big robotics push at the beginning of this year, and they intend to relentlessly build into this space the way they've relentlessly built into a bunch of other spaces in the past.

So we'll see. Anyway, some late-night thinking on where we're at with data. I just don't want to sit there and say that data is oil and we can't go get more. I want to think about the world in terms of tokenization, and I want to get more creative. So, you tell me.