
Enhancing Trustworthy, Efficient Foundation Models

Key Points

  • Kate Soule (Senior Manager, Business Strategy at IBM Research) outlines how enterprises can boost foundation‑model trustworthiness and efficiency by targeting three core components: data, architecture, and training.
  • For data, the trade‑off between quantity and cost is key: roughly 10 words per model parameter minimizes training compute, while 100+ words per parameter makes the model more “data‑dense” and reduces inference costs.
  • Data quality directly impacts model trustworthiness; biased or low‑quality inputs produce biased outputs, so careful source selection and extensive filtering (e.g., hate‑speech and profanity removal) are essential.
  • Verifying the massive, often web‑scraped datasets is difficult, making it crucial to avoid “dark‑corner” Internet content and to implement rigorous preprocessing pipelines to ensure both efficiency and ethical reliability.

Full Transcript

# Enhancing Trustworthy, Efficient Foundation Models

**Source:** [https://www.youtube.com/watch?v=eHPqfNLeous](https://www.youtube.com/watch?v=eHPqfNLeous)
**Duration:** 00:10:45

## Sections

- [00:00:00](https://www.youtube.com/watch?v=eHPqfNLeous&t=0s) **Optimizing Foundation Model Trust and Efficiency** - Kate Soule outlines enterprise‑focused strategies for enhancing foundation model trustworthiness and efficiency by examining data quantity, quality, specialization, and their impact on compute costs.
- [00:03:05](https://www.youtube.com/watch?v=eHPqfNLeous&t=185s) **Specialized Data & Model Efficiency** - The speaker outlines how filtering harmful internet data and training on a mix of domain‑specific and general data enables smaller, expert foundation models to perform as well as or better than larger generic models, before transitioning to a discussion of model architecture.
- [00:06:20](https://www.youtube.com/watch?v=eHPqfNLeous&t=380s) **Alignment Phase and Model Fine‑Tuning** - The speaker explains the post‑pretraining alignment stage, covering safety goals, RLHF, supervised fine‑tuning, and other techniques to adapt a model to desired behavior.
- [00:09:29](https://www.youtube.com/watch?v=eHPqfNLeous&t=569s) **Sustainable AI Training and Alignment** - The speaker highlights IBM Research's new LiGO modular training method, eco‑friendly Vela supercomputer, and advanced alignment techniques such as RLHF to create highly sustainable and safely tuned foundation models.

## Full Transcript
0:00 My name's Kate Soule. 0:01 I'm a senior manager of business strategy at IBM Research. 0:04 And today I'm going to give a short overview of the different strategies we can employ in order to improve foundation models' trustworthiness and efficiency for enterprise deployment.

0:14 At its core, you can break down a foundation model into a couple of key components: the data, the architecture, and the training. 0:41 And for each one of these areas, there are different strategies and techniques that we can employ in order to improve efficiency and model trustworthiness.

0:50 Let's start with the data. 0:52 When we talk about data, we're normally going to talk about it in regard to its quantity, quality, and degree of specialization.

1:17 On the quantity side, these models are trained on enormous amounts of unlabeled data in an unsupervised fashion. 1:26 It's this volume of data that they're trained on that gives them the superpower to be able to perform an immense number of different downstream tasks easily and efficiently. 1:35 But the more data you train your model on, the more you drive up your compute costs when it comes time for training and inference.

1:41 There's been a lot of research going into this: 1:43 how much data do you need in order to train a foundation model as efficiently as possible? 1:50 If you're trying to make it efficient for training, what's the least amount of data you need per parameter in your model? 1:57 That's a way to measure model size: the number of parameters in it. 2:00 So it's the least amount of data you need per model parameter in order to reach a given degree of accuracy, if you're focused purely on the training costs. 2:09 You need about ten words per parameter in your training dataset to make these models efficient to train.
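The ten-words-per-parameter rule of thumb just described can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the ~10 words figure is the speaker's approximation, and the 3-billion-parameter model size used below is an arbitrary example.

```python
def training_words_needed(num_params: int, words_per_param: float = 10.0) -> int:
    """Rough minimum training-set size (in words) for a compute-efficient
    training run, using the ~10 words-per-parameter rule of thumb from the talk."""
    return int(num_params * words_per_param)

# Example: a 3-billion-parameter model (illustrative size only)
params = 3_000_000_000
print(f"{training_words_needed(params):,} words")  # 30,000,000,000 words
```

Pushing `words_per_param` toward 100+ corresponds to the "data-dense" regime the talk turns to next: more training data per parameter, in exchange for a smaller (and therefore cheaper-to-run) model.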
2:14 If you're talking about the inference cost, though, think of this as the fixed cost to train the model once 2:20 versus the marginal cost to use the model over and over again. 2:24 You can improve the efficiency of your model by making it far more data-dense, 2:29 using as many as 100-plus words of data in your training dataset per parameter of your model.

2:37 The quality of your data is also going to be incredibly important when we talk about model trustworthiness. 2:42 Just like in traditional machine learning, if your input data is poor quality, you're going to have poor-quality model output, and if you have biased training data, you're going to have a biased model. But unlike traditional machine learning, there's so much data to sort through 2:58 that it is really hard to verify its quality. 3:01 And that's why it's important to be very careful about where the data is coming from. 3:05 Most of this data is scraped from the Internet, 3:06 and there are dark corners of the Internet that should not be used when creating training data for these foundation models. It's also important to employ hate and profanity filtering, 3:17 HAP filtering, on this training dataset in order to remove any hateful or harmful material.

3:26 Finally, another strategy you can take on the data side to improve the efficiency of your model is the degree of specialization. 3:34 The insight here: if you have a question, let's say, about a medical issue, 3:40 would you rather ask the smartest person you know, or would you rather ask your doctor? 3:44 You would probably rather ask your doctor. 3:46 You want an expert's opinion. 3:48 And the same goes for foundation models. 3:50 We're seeing that smaller models that are specialized in a domain, that are experts, 3:55 that are trained on maybe a mixture of 50/50 specialized medical data or finance data
4:01 and 50/50 general data, can perform as well as, if not better than, a much larger general-purpose foundation model with no degree of specialization, 4:10 allowing us, if we have tasks that are specialized to a domain, to get away with much lighter-weight, smaller, more efficient expert models.

4:19 So that's the data part of the picture. 4:21 Now let's talk about architecture.

4:24 On the architecture side, we think of architecture as a blueprint for how this data gets encoded into the model. 4:31 And different styles have emerged that have different advantages. 4:36 One of the styles that has emerged is the decoder-only model. 4:41 This is, for example, GPT-3. 4:43 These models are really performant, they're really powerful, but they're also very dense, 4:48 whereas other models have emerged, such as encoder-decoder models, that are much more lightweight and efficient.

4:55 In addition to the style of the architecture, there's the size. 4:59 And here I'm talking again about the parameters. 5:02 You want to make sure that the size of your model is matched to the size of the training data that you have. 5:08 If you have too big a model, too many parameters for the amount of training data, you're going to overfit, leading to trustworthiness issues. 5:16 And you're also going to have efficiency issues: 5:18 the bigger your model, the more compute it will take, both to train it and to run inference.

5:25 Finally, there's the training, which is how I stitch all of this together, the data and the architecture, with compute. 5:32 And you can break training down into a couple of different steps. 5:35 There's the pre-training. 5:43 I say pre-training to be very specific here. 5:46 These models, if you remember, are meant to be a starting point. 5:49 They're meant to be taken and then retrained, in a process that's called tuning, and used for different downstream tasks.
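The 50/50 specialized-versus-general data mixture mentioned earlier can be sketched as a simple sampling routine. This is a toy illustration, not IBM's pipeline: `specialized` and `general` here are placeholder lists standing in for real document collections.

```python
import random

def mix_training_data(specialized, general, n_samples, specialized_ratio=0.5, seed=0):
    """Draw a training sample where roughly `specialized_ratio` of the
    documents come from the domain-specific corpus and the rest from
    general-purpose data (50/50 by default, as in the talk)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = []
    for _ in range(n_samples):
        source = specialized if rng.random() < specialized_ratio else general
        sample.append(rng.choice(source))
    return sample

batch = mix_training_data(["med_doc_a", "med_doc_b"], ["web_doc_a", "web_doc_b"], 100)
print(sum(doc.startswith("med") for doc in batch))  # roughly half of 100
```

Real training pipelines mix at the token or shard level and weight sources more carefully, but the idea is the same: the sampling ratio, not the corpus sizes, controls the degree of specialization.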
5:55 So pre-training refers to creating the first foundation model, the starting point, 6:00 and the pre-training is going to drive a huge percentage of the compute costs and the carbon footprint of foundation models. 6:08 These costs are going to be dictated by your architecture choices, your hardware choices, and your inference-stack choices, 6:14 all of which result in different carbon and compute costs.

6:22 After the initial pre-training, there's a really interesting phase called the alignment phase, 6:26 which is when I take my pre-trained model, but it's not ready for prime time yet. 6:32 I need to polish it. 6:33 I need to align it closer to my values of how I want my model to behave, values such as safety and trustworthiness. 6:42 There is an active area of research around alignment and how to do this as effectively and efficiently as possible.

6:49 Some techniques around alignment include things like reinforcement learning with human feedback, RLHF, which is when I have a human actually sit in the loop. 6:58 They're evaluating model performance, scoring it, and then creating a reward function. 7:02 It's basically a game: you try to get the model to give responses that will earn the highest scores from the human annotators.

7:09 There are other methods that are more data-driven. Everything in this picture so far has been unsupervised learning; 7:16 there's been no labeled data. 7:17 We scraped a bunch of data from the Internet and we've trained on it, 7:20 but now, if we start to bring some supervised learning back in, we bring in some labeled data that's specific to our task 7:26 or to our domain. 7:30 You can do a process called tuning, which is where you actually go through and update some of the parameters 7:36 of this pre-trained model and ground it in that labeled data to be more effective.
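The RLHF loop described above can be caricatured in a few lines. This is a deliberately tiny sketch: in real RLHF the human scores train a separate reward model, which then drives a reinforcement-learning update of the language model itself; here `human_scores` is just a stand-in for annotator feedback.

```python
def best_response(candidates, human_scores):
    """Pick the candidate response the human annotators scored highest.
    In real RLHF these preference scores are used to fit a reward model,
    and the language model is then optimized to maximize that reward."""
    return max(zip(candidates, human_scores), key=lambda pair: pair[1])[0]

candidates = ["rude answer", "helpful answer", "off-topic answer"]
scores = [1.0, 9.0, 3.0]  # hypothetical annotator ratings
print(best_response(candidates, scores))  # helpful answer
```

The "game" the speaker describes is exactly this maximization: the model is rewarded for producing the responses humans rank highest.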
7:41 There are a number of other techniques that happen in the alignment phase; this is an active area of research, 7:46 with things like editing and post-processing that you can also do to improve your model's fairness once it's trained.

7:54 So now that we've talked about the different components and strategies, 7:57 let's talk about what IBM is doing in order to build efficient and trustworthy foundation models, 8:02 now being made available through watsonx and infused into our product pipelines.

8:08 On the data side, IBM is building what we believe to be one of the largest enterprise datasets for foundation model training. 8:17 In addition to building a huge amount of data for training, we're placing a heavy emphasis on the quality of that data, 8:24 making sure that it's taken from reputable sources, and that every single data point we collect and curate goes through legal review, 8:33 ensuring there are no ownership or copyright issues, 8:36 and is then red-teamed in order to identify what potentially toxic information needs to be taken out to make the data as safe as possible. 8:46 And then finally, we're actively targeting different degrees of specialization, in areas like finance and cybersecurity, 8:51 so that we can train these expert models that are going to be more efficient for our enterprise needs.

8:59 On the architecture side, we're building a variety of different styles of models: encoder-decoder models, decoder-only models, and more. 9:07 We're also coming up with net-new architectures that have never been seen before, 9:12 models that promise ultra-efficient operations and have modular components that allow expert knowledge 9:19 to be infused into them, which we think are going to drive a lot of value in the space. 9:23 And we're building models of a variety of sizes, from smaller 3-billion-parameter models to 20 billion and larger.
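The model sizes mentioned (roughly 3 billion to 20-plus billion parameters) translate directly into hardware footprints. A rough estimate, assuming 16-bit (2-byte) weights and ignoring optimizer state and activations, which training would add on top:

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the model weights,
    assuming 16-bit (2-byte) parameters. Training requires several
    times more for gradients, optimizer state, and activations."""
    return num_params * bytes_per_param / 1e9

for n in (3_000_000_000, 20_000_000_000):
    print(f"{n / 1e9:.0f}B params -> ~{weight_memory_gb(n):.0f} GB of fp16 weights")
```

This is why the talk's smaller expert models matter for efficiency: a 3B model's weights fit comfortably on a single accelerator, while larger models quickly demand multi-device serving.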
9:31 And then finally, on the training side, what's really exciting here are some of the advancements coming from IBM Research, including a new technique called LiGO, 9:42 Learning to Grow, which is a modular training approach that allows you to recycle and reuse models, thereby saving significant carbon and compute costs 9:53 when training these models, leading to some of our models being among the most sustainably trained foundation models available.

10:02 We've also done all of our training on Vela, 10:05 the supercomputer built by IBM Research that we've optimized across the stack to ensure that our training and inference are as efficient as possible.

10:16 And then IBM Research is working on a number of advanced alignment techniques, 10:20 including things like reinforcement learning with human feedback, but also advanced tuning approaches 10:27 that allow models to follow strict rules in terms of what types of behavior they can exhibit.

10:35 If you're interested in learning more, please check out the links below, 10:39 both about the different innovations coming out of IBM Research on model safety and about our watsonx product portfolio.