
Domain‑Specific LLM Training with InstructLab

Key Points

  • The traditional LLM pipeline relies on data engineers and data scientists to curate structured database inputs, which makes it hard to incorporate domain‑specific knowledge stored in unstructured formats.
  • Tools like InstructLab let project managers and business analysts feed domain knowledge from documents (Word, PDFs, text files) into a git‑based taxonomy, eliminating the need for a dedicated data‑scientist step.
  • InstructLab automatically generates synthetic variations of questions from the curated content, enriching training data and improving the model’s ability to understand diverse prompts.
  • After training, the model can be deployed on Kubernetes‑based platforms such as Red Hat OpenShift, leveraging GPU/accelerator hardware from NVIDIA, AMD, or Intel.
  • The OpenShift AI extension provides the MLOps stack for model serving, inference configuration, and metrics collection, completing the end‑to‑end lifecycle.
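
Inside an InstructLab taxonomy repository, a knowledge contribution is typically a `qna.yaml` file containing seed question/answer pairs grounded in a source document. The sketch below follows the general shape of the InstructLab knowledge schema, but the domain, answers, repo URL, and file patterns are invented placeholders, not a real contribution:

```yaml
# Hypothetical taxonomy entry: knowledge/company/onboarding/qna.yaml
version: 3
domain: internal_processes
created_by: example-analyst            # git username of the contributor
seed_examples:
  - context: |
      New hires complete the onboarding checklist within their first
      two weeks, starting with account provisioning.
    questions_and_answers:
      - question: How long do new hires have to finish onboarding?
        answer: The onboarding checklist must be completed within two weeks.
document:
  repo: https://example.com/org/knowledge-docs.git   # placeholder repo
  commit: <commit-sha>                               # pin the source version
  patterns:
    - onboarding-*.md
document_outline: Internal onboarding process documentation
```

Because the taxonomy is just a git repository, project managers and analysts can contribute through ordinary pull requests rather than a data-science toolchain.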

Full Transcript

**Source:** [https://www.youtube.com/watch?v=0OOXGwLENyY](https://www.youtube.com/watch?v=0OOXGwLENyY)
**Duration:** 00:07:50

## Sections

- [00:00:00](https://www.youtube.com/watch?v=0OOXGwLENyY&t=0s) **Integrating Domain Knowledge with InstructLab** - The speaker explains how to augment the traditional LLM data pipeline by involving project managers and analysts and using InstructLab's git‑based taxonomy to incorporate non‑database documents into model training.
- [00:03:12](https://www.youtube.com/watch?v=0OOXGwLENyY&t=192s) **Synthetic Data‑Driven LLM Deployment** - The speaker explains how InstructLab automates synthetic data generation and model training, then deploys the LLM on Kubernetes/OpenShift using AI accelerators and Red Hat OpenShift AI to manage inference, metrics, and the full MLOps lifecycle with governance.
- [00:06:25](https://www.youtube.com/watch?v=0OOXGwLENyY&t=385s) **Cost‑Effective RAG Model Pipeline** - The segment outlines a budget‑aware workflow that runs data processing intermittently, leveraging RAG and a taxonomy‑driven pipeline (project manager and analyst input, synthetic data generation via InstructLab, and training with watsonx.ai) before deploying the model on an OpenShift/Kubernetes platform.
0:00 Hey, everybody. 0:01 Today, I want to talk to you about how to apply domain-specific knowledge to your LLM lifecycle. 0:09 A traditional approach starts with a data engineer 0:17 who's curating data that is then used 0:22 by data scientists, 0:26 who ultimately take that data, 0:37 train it to the model, 0:46 and then make that model available for inference. 0:56 One of the challenges with this, though, 0:59 is that the data over here being used, 1:06 it's typically a traditional database of some sort, either SQL or NoSQL, 1:12 and it's usually containing metrics 1:14 or sales data or anything that's typically organized or curated by an organization. 1:21 The challenge here is getting domain-specific knowledge within 1:25 an organization and applying it to this same process. 1:30 Now let's look at the same approach, 1:31 but use a variety of tools that can empower people like project managers 1:37 and business analysts to be contributing to the process. 1:44 So here we have a project manager 1:49 and a business analyst. 1:52 They both have domain-specific knowledge about processes 2:01 within their organization, 2:03 but these could be stored in Word documents or text files of some sort, 2:09 not the traditional data stores that we typically use within a model lifecycle. 2:17 But we can change this. 2:19 We can use a tool like InstructLab to manage this process. 2:33 InstructLab is an open source tool that allows the management of what we call a taxonomy. 2:42 This taxonomy is typically just a git repository where we can manage things like Markdown files or text files 2:52 and then apply that to our model. 3:00 We could even have more traditional document formats like PDFs, 3:06 and have those be transformed into the necessary file structure that InstructLab takes.
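
The speaker mentions transforming documents such as Word files or PDFs into the file structure the taxonomy expects. InstructLab has its own ingestion tooling; the sketch below is only an illustration of the idea, assuming the document text has already been extracted to plain text, and splitting it into short context chunks of the kind a seed example would carry:

```python
def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    """Split extracted document text into short context chunks.

    Paragraphs (separated by blank lines) are kept intact; consecutive
    paragraphs are merged until a chunk would exceed max_chars.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # chunk is full, start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Usage: chunk a small extracted document
chunks = chunk_document("First policy paragraph.\n\nSecond policy paragraph.")
print(len(chunks), "chunk(s)")
```

Each resulting chunk could then become the `context` of one seed example, keeping every question grounded in a short, specific passage.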
3:13 Once they've applied that data to the taxonomy that InstructLab manages, 3:20 we can then start the more traditional process that we saw earlier, 3:26 but we don't actually need a data scientist in this case. 3:30 InstructLab is handling all that, 3:32 and it will then create synthetic data 3:40 through this process. 3:44 Now, I know synthetic data 3:47 sounds kind of scary, but I want to approach it in a different way. 3:53 Think of synthetic data in this case as just another way of reframing the question: 3:59 instead of one way of asking a question, 4:03 we have many different ways of asking the same question. 4:08 This empowers the model, especially when we go through the training cycle, 4:13 giving the LLM more opportunities to learn to accurately reply to your prompts. 4:24 Once we've trained the model, we can then go ahead and deploy it into an AI platform. 4:34 This could be Kubernetes based, like OpenShift, for example, 4:43 and this can take advantage of different AI accelerators 4:50 like NVIDIA, 4:56 AMD, or Intel, for example. 5:03 Now, this is the infrastructure layer, 5:06 but we want to be able to interact with our model and be able to configure 5:11 the inference and be able to apply metrics 5:14 and all those things that need to be part of the MLOps lifecycle. 5:19 Now we can do this 5:21 with an extension for OpenShift called Red Hat OpenShift AI, 5:31 which will provide you all those tools for managing the lifecycle of this model within production. 5:39 Now, you may want to then interact with that model. 5:43 You want to validate it or apply governance or other things, 5:48 or even just sandbox with it. 5:50 This could be done with something like watsonx.ai, 5:57 which can sit on top of OpenShift and interact with all the models that are being inferred within this AI stack. 6:06 Now, once this lifecycle has finished,
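
The speaker frames synthetic data as many rephrasings of one question. InstructLab actually does this with a teacher model; the toy sketch below only mimics the idea with fixed templates, so the function name and template list are illustrative, not part of InstructLab:

```python
def question_variants(subject: str) -> list[str]:
    """Toy stand-in for synthetic question generation: rephrase one
    question about `subject` several ways so the model sees varied
    prompts for the same underlying knowledge during training."""
    templates = [
        "What is {s}?",
        "Can you explain {s}?",
        "Describe {s} in your own words.",
        "How would you summarize {s}?",
    ]
    return [t.format(s=subject) for t in templates]

# Usage: one seed question becomes several training prompts
for q in question_variants("the onboarding checklist"):
    print(q)
```

Pairing each variant with the same grounded answer is what gives the model "more opportunities" to respond correctly however the user happens to phrase the prompt.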
6:10 We can then restart the whole process again 6:15 and use the new data that has since been built up by our project managers and business analysts 6:22 and go through this lifecycle once more. 6:26 But one thing to note is that this can be really costly. 6:31 We may not want to run this process over and over again every week. 6:35 We may only have the budget to run it maybe once a month or once every other month. 6:39 Well, we can use technologies like RAG 6:47 and have this data come over here in the interim before we go through this process again. 6:58 Once we do that, we can flush out our RAG database and then start anew 7:05 as data is collected by our project managers and our business analysts. 7:11 All right. 7:11 We've shown the complete model lifecycle 7:14 and how to apply domain-specific knowledge 7:17 from people within our organization, like project managers and business analysts, 7:22 then manage that data through a taxonomy, through InstructLab, 7:27 generate synthetic data that's then used for training that model, 7:31 and then ultimately deploy into a Kubernetes-based platform like OpenShift, 7:37 utilizing AI services from tools like watsonx.ai, 7:42 and ultimately using technologies like RAG to enhance that experience. 7:49 Thank you so much for watching.
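
Between training runs, the speaker suggests serving newly collected knowledge through RAG, then flushing the store once the next training cycle absorbs it. A real deployment would use embeddings and a vector database; the minimal sketch below (naive keyword overlap, everything illustrative) only captures that add / retrieve / flush cycle:

```python
class InterimRagStore:
    """Holds documents gathered between training runs.

    Documents are retrieved by naive keyword overlap; after the next
    training run has ingested them, flush() empties the store so the
    cycle can start anew.
    """

    def __init__(self) -> None:
        self.docs: list[str] = []

    def add(self, doc: str) -> None:
        self.docs.append(doc)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Rank documents by how many query words they share.
        terms = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(terms & set(d.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def flush(self) -> None:
        self.docs.clear()

# Usage: serve fresh knowledge until the next training run
store = InterimRagStore()
store.add("Expense reports are due by the fifth business day.")
store.add("The onboarding checklist takes two weeks.")
print(store.retrieve("expense reports", k=1))
store.flush()  # after retraining, start collecting anew
```

The design point is that RAG carries the freshest knowledge cheaply, while the monthly (or less frequent) InstructLab training run bakes the accumulated knowledge into the model itself.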