Learning Library

Building a YouTube Transcription Agent with LangGraph

Key Points

  • The tutorial walks through creating a YouTube transcription AI agent with LangGraph, leveraging locally‑run Ollama models, a WXFlows transcription tool, and a Next.js front‑end.
  • A new Next.js project is bootstrapped using the Create Next App CLI, opting for TypeScript and Tailwind CSS for styling, then the generated `page.tsx` is cleared for custom code.
  • The main component is set up as a client‑side React component so state can be managed, and a simple UI is built with a header, an input field for a YouTube link, and a submit button.
  • An iframe placeholder is added beneath the input to display the selected YouTube video, with the proper React attribute names (referrerPolicy, frameBorder, allowFullScreen) applied.
  • After saving the changes, the app can be launched locally with `npm run dev` to view the functional transcription interface.

Sections

Full Transcript

# Building a YouTube Transcription Agent with LangGraph

**Source:** [https://www.youtube.com/watch?v=u6qDSFxY4iw](https://www.youtube.com/watch?v=u6qDSFxY4iw)
**Duration:** 00:25:18

## Sections

- [00:00:00](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=0s) **Building a LangGraph YouTube Agent** - A walkthrough of creating a JavaScript‑based AI agent with LangGraph, Next.js, local Ollama models, and WXFlows to fetch YouTube video transcriptions and display summarized results.
- [00:03:04](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=184s) **Adding a LangGraph Agent with Ollama** - The speaker walks through stopping a Next.js server, verifying the Ollama installation, installing LangChain/LangGraph dependencies, and creating an actions.ts file to embed a dynamically powered YouTube video player using a locally run Llama 3.2 model.
- [00:06:13](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=373s) **Parsing the YouTube ID in React** - The speaker explains how to use a system prompt to extract a YouTube video ID into JSON and manage the URL input and retrieved video data with React state hooks in a .tsx component.
- [00:09:24](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=564s) **Implementing a Dynamic YouTube Embed** - The speaker modifies the iframe source to use a state‑driven video ID via a template literal, wraps it in conditional brackets, runs the app to verify the embed works, and outlines adding a tool for fetching full video transcriptions.
- [00:12:32](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=752s) **Fetching YouTube Details with Playwright** - The speaker explains how to build an async tool that launches Chromium via Playwright to scrape a video's title and description from a YouTube page, then integrates this callback function into an Ollama‑LangGraph agent.
- [00:15:50](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=950s) **Defining a Video Type for State** - The speaker creates a TypeScript `video` type (id, title, description), applies it to local state and LLM JSON parsing to resolve type errors, verifies the video title and description render correctly, and then prepares to import a transcription tool.
- [00:18:59](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=1139s) **Configuring the wxflows Endpoint & SDK** - The speaker explains how to create a .env file with the deployment endpoint and API key, install the beta wxflows SDK via npm, and integrate it into a LangGraph agent for YouTube transcription.
- [00:22:12](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=1332s) **Generating Video Descriptions from Captions** - The speaker demonstrates using a YouTube transcript tool to fetch captions, feeding them to an LLM to auto‑generate video descriptions, and incorporating those captions into the app's JSON output and UI.

## Full Transcript
0:00 Let's build an AI agent in LangGraph using JavaScript. 0:03 Agents can be really helpful to automate parts of your life. 0:06 Or, for example, take data from one source and turn it into something else. 0:10 In this video, we'll be building a YouTube transcription agent 0:13 that's able to pull transcriptions from a YouTube video and summarize them on your screen. 0:18 For this we'll be using models running locally using Ollama, 0:21 we'll be using Next.js to build a frontend app, and then we'll be using a YouTube transcription tool from WXFlows. 0:27 So let's dive into VS Code and see how it's built. 0:31 In VS Code, I've already set up a new project. 0:33 In here, I'm going to run a command to bootstrap a new Next.js application. 0:38 For this, I'll be using the Create Next App CLI, and I'm going to make sure I use the latest version. 0:43 I also need to provide it with a name for my project, which will be the LangGraph YouTube agent. 0:50 Setting this up is going to require you to answer a few questions. 0:53 For example, would you like to use TypeScript? 0:56 And then it's going to ask a few other defaults. 0:58 It also asks us to install Tailwind, which is a nice library to help you write CSS. 1:04 Depending on your needs, you might make different decisions when building your own project. 1:09 Once this is done, it's going to generate a new directory with all my files. 1:13 I'm going to move into this new LangGraph YouTube agent directory, 1:16 where I can find all the boilerplate code for my Next.js application. 1:20 In here you can find a file called page.tsx, and this will be the main file that's rendered when someone visits my application. 1:27 I'm going to delete all the code that's in there. 1:29 I'm going to replace it with something else. 1:31 In here I'm going to add my own boilerplate for this application. 1:36 I also need to make sure that I set this component up 1:38 to be a client-side component, and this means I can use state management later on.
1:43 Within this div, which has some Tailwind class names attached to it, 1:46 I can set up all the code I need in order to render my application. 1:51 That starts with a header, and for this I'm going to be adding a header that contains the name of the agent, 1:57 which is the YouTube transcription agent, and then I'm going to be adding a bit where I have an input bar to submit a video link. 2:05 Put this in there as well. 2:07 Once I save this, I could already visit the application in the browser by starting npm run dev, 2:13 but first, I will also add a placeholder video. 2:16 So this will mock the application setup that we'll have later on. 2:23 So right below my input bar, I'm going to paste this final bit, 2:27 which is going to show an input bar with a button to submit a video link, 2:33 and then it's going to show an iframe for a YouTube video. 2:37 I need to make sure that all these definitions have the correct format though, 2:40 because React has different requirements than any other JavaScript application. 2:45 I need to make sure I update the referrer policy, the frame border, and the allow-full-screen option. 2:51 So let me format this file and then save it. 2:54 In my terminal, I can start the Next.js application by running npm run dev. 3:00 And this should open up a new page in my browser, which I can visit to see my application. 3:10 In the browser, you can see we have a header, we have a bar to submit a video link, 3:15 we have a button to actually submit the link, 3:17 and then we have a place where we render the video, including an embedded iframe from YouTube. 3:23 So we're going to add a LangGraph agent that's able to fill this space dynamically 3:27 by using both YouTube and also a model running using Ollama. 3:33 So let me go back to VS Code, where I'm first going to kill the Next.js process running in my terminal, 3:39 and we're going to check if I have Ollama installed properly. 3:42 So with Ollama, you can run models locally on your machine.
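As a quick reference, the attribute renames the speaker applies can be captured in a small lookup (a sketch of my own, not code from the video; the left-hand names are the lowercase HTML attributes, the right-hand names are what React's JSX expects):

```typescript
// HTML attribute -> JSX prop name for the three <iframe> attributes
// the video calls out; React warns on the lowercase HTML spellings
const jsxIframeAttributeNames: Record<string, string> = {
  referrerpolicy: "referrerPolicy",
  frameborder: "frameBorder",
  allowfullscreen: "allowFullScreen",
};
```

Formatting the pasted embed code is mostly a matter of applying these renames.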
3:46 So these are all open-source models that you need to download to your machine first. 3:50 So if you run this command for the very first time, you might see 3:53 an extra command to actually download the llama 3.2 model. 3:57 These are all open-source models, so you can run them wherever you want. 4:01 For example, they're also available in watsonx.ai. 4:05 I can see I have llama 3.2 installed, so I can just close this process with Ctrl+D. 4:12 Let me clear my terminal so we can proceed by installing LangGraph. 4:17 So LangGraph is a superset of LangChain, meaning that you need to install some LangChain libraries in order to use it. 4:25 So I'm going to be installing LangChain, LangGraph, and then some other core libraries. 4:34 Once these have been installed, I need to create a new file, which I'm going to call actions.ts. 4:40 In this actions.ts file, I'm going to create my LangGraph agent. 4:45 I also need to make sure that this file is set up to run server-side by setting use server at the very top of the file, 4:53 and in here I can start to create my transcribe function, 4:56 which will include the LangGraph agent to retrieve YouTube details and show them on the screen. 5:05 I'm going to be calling this function transcribe. 5:07 It takes one input, which is the video URL, and the video URL should be a string. 5:13 In here, I also need to import a lot of the different libraries that we just installed. 5:19 So let's break down which libraries we need. 5:21 We need ChatOllama, which is the chat interface for Ollama models running on your machine. 5:26 We need a function called createReactAgent, which is used to create the agent in LangGraph. 5:31 We then need to import two libraries related to the creation of tools, 5:35 and finally, as we are using TypeScript, we need to have some type definitions in here as well. 5:41 So this means we can proceed by setting up the agent inside the transcribe function. 5:48 You can see again I'm using the chat interface from Ollama.
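As a structural sketch of that actions.ts file (the LangGraph wiring with ChatOllama and createReactAgent is elided here, and the stubbed return value is only illustrative — the real function invokes the agent):

```typescript
"use server";

// Shape of the `transcribe` server action described above. The real
// body creates a ChatOllama model (llama3.2, temperature 0, JSON
// output), wraps it with createReactAgent, and invokes it; the agent
// call is stubbed out here so only the contract is visible.
export async function transcribe(videoUrl: string): Promise<string> {
  // The agent replies with its JSON answer serialized as a string,
  // which the caller parses on the client side before use.
  const videoId = new URL(videoUrl).searchParams.get("v");
  return JSON.stringify({ videoId });
}
```

On the client, the returned string is parsed with JSON.parse before being stored in state.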
5:51 I'm setting my model to be llama 3.2. 5:53 I have the temperature set to zero. 5:56 And I also will be forcing the large language model to return JSON format. 6:00 So this is going to be important later on, 6:02 when we look at the system prompt, where we're going to force the LLM to return something that is in a JSON format. 6:09 So if you look at our request, it needs a few messages. 6:13 One is a system prompt and the other is a human message. 6:16 This is your question, usually. 6:18 But as we're not building a chat app, we're having a predefined question, and the video URL is the dynamic part. 6:24 In the system prompt, you can see that the LLM should retrieve the video ID for a given YouTube URL. 6:30 So we're going to rely on the LLM to dissect the video ID from its URL 6:35 and then return the output in a JSON structure, which includes the video ID, 6:40 and finally, we need to return this back. 6:43 So whenever you call the transcribe function, you're going to get this JSON object in return. 6:50 So let's save the actions.ts file and then connect it all via the page.tsx component. 6:57 At the top of this component, we need to set a few state variables. 7:00 We're going to create two. 7:01 We're going to create one state variable to make the input bar a controlled component, 7:05 meaning that whenever you type in the input bar, it's going to update the state with the latest value of the video URL. 7:12 For this, I'm going to create a variable which I call videoUrl, 7:17 and then I'll also be creating a function to update the video URL. 7:21 So this will be used in the onChange function of our input bar. 7:26 For this, I need the useState hook from React, which can be imported at the very top. 7:31 And I'm going to set an empty string as the default state. 7:36 Then I'll also be creating a state for the video. 7:38 So whenever we retrieve a video using the agent, we want to store it in local state, 7:43 meaning that it can be used across this component.
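The video delegates this ID extraction to the LLM via the system prompt. If you want a deterministic check alongside it, the same ID can be parsed with the standard URL API (a hypothetical helper, not part of the tutorial):

```typescript
// Parse the video ID from the two common YouTube URL shapes;
// returns null when no ID can be found
function extractVideoId(videoUrl: string): string | null {
  try {
    const url = new URL(videoUrl);
    if (url.hostname === "youtu.be") {
      // short links carry the ID as the path: youtu.be/<id>
      return url.pathname.slice(1) || null;
    }
    // watch links carry it as a query parameter: watch?v=<id>
    return url.searchParams.get("v");
  } catch {
    return null; // not a valid URL at all
  }
}
```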
7:47 I have const video, then I have setVideo, 7:51 and useState. 7:52 This time it will be empty. 7:53 I will be creating a type-safe definition later on for this state variable. 7:59 After setting this, I can hook it up to the input box. 8:04 But first I'm going to create a function that will actually call the transcribe function to use the agent for transcriptions. 8:14 This transcribeVideo function needs the transcribe function, which I have in actions.ts. 8:20 It then needs to parse the results, 8:22 because even though we forced the LLM to return JSON, 8:25 whatever LangChain or LangGraph returns to me is always a string. 8:29 So we need to parse this string and make sure we get the actual JSON. 8:33 And this JSON will be put in state using the setVideo function. 8:38 If I scroll down, I can hook up my input box to look at the video URL as its value. 8:47 So we can have videoUrl in here, 8:49 and then we need to set an onChange function and hook up the event to store whatever you type in that input box in state. 8:58 So I have the function setVideoUrl, 9:00 and this will take e.target.value as its input. 9:05 Once I save this, I only need to make sure that my transcribeVideo function is connected to this button. 9:15 After doing this, I should be good to actually submit the request to the large language model 9:20 that's connected to the agent to retrieve my video transcription. 9:24 It's not retrieving the full transcription yet. 9:28 It's only going to make sure we get the video ID. 9:30 But later on, we'll add a tool to actually retrieve the transcription. 9:35 Let me scroll down a bit to this iframe section, and let's make sure we wrap it in curly brackets 9:41 and check for the presence of the video state first before we render this. 9:47 And this means that we can actually hook up this dynamic video ID in this source. 9:52 So instead of having the source which we have right here, we're going to create a new source, which is a template literal.
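That template literal, pulled out as a standalone sketch (the helper name is my own; the embed base URL is YouTube's standard one):

```typescript
// Build the iframe src from the state-driven video ID instead of a
// hard-coded URL — the same shape a JSX `src={...}` template produces
function buildEmbedUrl(videoId: string): string {
  return `https://www.youtube.com/embed/${videoId}`;
}
```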
10:00 It's going to take all of this, because we still need to have the embed URL. 10:05 But now the video ID won't be hard-coded; instead we're going to take 10:09 the dynamic variable which is available in the state. 10:12 This is the video ID, 10:16 and just let me delete this one. 10:18 Once I save this, we should be able to start our application and view the first part of our application in the browser. 10:25 I'm going to run npm run dev, and this should start the application. 10:32 In my browser, I need to make sure I refresh the page and then enter a video link. 10:39 I probably need to set up a loading indicator so I know something is happening whenever I press this button, 10:45 but we can see that the video is being pulled in correctly and the video ID is being passed in to the YouTube iframe. 10:54 If we go back to actions.ts, we need to do a bit more. 10:58 You saw we already imported some tool libraries, so we're going to use these to create a tool 11:02 to retrieve information from a YouTube video page. 11:05 And for this we'll be using Playwright. 11:08 Playwright is a library to open a website programmatically and retrieve details from that website. 11:15 So I'm going to close the command running in my terminal, and then here I'll be running npm install playwright. 11:21 And this will install the Playwright library from npm. 11:26 After installing the Playwright library, we need to import it, of course, in our actions.ts file. 11:35 So I can import the library at the very top, and then we can start to define our YouTube function. 11:40 Well, first I'm going to define the tool definition, 11:44 because a tool in LangChain or LangGraph (LangGraph being LangChain, but agentic) 11:51 is going to need both a tool execution function and also a tool definition. 11:55 So if I have a tool, for example, which I call get YouTube details, 12:00 I will be using the tool function from LangChain to create this tool.
12:06 I'm going to make this an async function 12:11 later on, because the execution function should be async, but the rest is fine to be synchronous. 12:18 And then I'm going to be adding my tool definition in here. 12:20 Once I clean this up, you can see that I have added the tool definition for a tool called get YouTube details, 12:27 which is described as a tool to get the title and description of a YouTube video. 12:33 And its input is a video ID. 12:35 So this is the video ID that we have the LLM dissect from a given YouTube URL. 12:40 The actual callback function that should be executed whenever 12:44 the LLM proposes to call the get YouTube details function is something we hook up here. 12:51 In here we need to look for the input variables. 12:53 We have an async function. 12:57 We should be calling Playwright. 12:58 So let me put this bit of code in here and make sure we take the input argument. 13:03 Let me clean it up a bit. 13:05 So what Playwright is doing in here is launching a Chromium browser. 13:08 So Chromium is related to Chrome. 13:10 The very first time you run this, 13:13 you might need to run the command npx playwright install, 13:17 which you can use to download the Chromium browser to your project. 13:21 It's going to open a given YouTube page in the browser, 13:25 and then it's going to take different locators and store them as objects. 13:29 So first it's going to look for the H1 element, which has the title of the video, 13:33 and then it's going to look for the description of the video, which is somewhere in a div, 13:39 and then of course it's going to close the browser, because it doesn't need to be open all the time. 13:44 So I created this get YouTube details tool, which has both the callback function to call and also the tool definition. 13:51 So let me save this and make sure that we pass the get YouTube details tool 13:56 to Ollama, which is then hooked up in LangGraph to form our agent.
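The pairing of definition and callback can be sketched without the LangChain and Playwright dependencies like this (names and shapes here are assumptions of mine; the real version uses LangChain's tool() helper with a schema, and drives Playwright inside the callback):

```typescript
// Dependency-free sketch of the get-YouTube-details tool contract:
// a definition (name, description, typed input) plus an async callback.
type YouTubeDetails = { title: string; description: string };

const getYoutubeDetails = {
  name: "get_youtube_details",
  description: "Tool to get the title and description of a YouTube video.",
  // The real callback launches Chromium via Playwright, opens the
  // watch page, reads the <h1> title and the description div from
  // their locators, and closes the browser before returning.
  async invoke({ videoId }: { videoId: string }): Promise<YouTubeDetails> {
    // Stubbed result standing in for the scraped page content
    return { title: `Title of ${videoId}`, description: "(scraped text)" };
  },
};
```

The LLM never runs this code itself; it only proposes a call with a videoId, and the framework executes the callback.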
14:01 I'm going to save this, but before we actually try it out we need to update our system message, 14:06 because now there are additional details that need to be retrieved. 14:10 It also needs to retrieve the title and the description of the video. 14:15 So we want both of these to be present in the object that's being sent to your frontend app. 14:23 I'll update this a little bit as well. 14:25 Use any tool at your disposal if needed is still valid, and we probably want to tell it 14:29 not to return any data unless 14:37 all fields are populated. 14:42 So by creating the tool and updating the system prompt, we should be able to try this out in a browser. 14:47 For this I'm going to run npm run dev, and this should make our application available back in our browser. 14:57 Let me copy this YouTube URL and refresh the page, because that way we're certain we have a fresh history. 15:04 I'm going to put in the URL right here, and then let's wait to see what the LLM and the agent are going to generate for us. 15:15 You can see now it's still retrieving the video. 15:17 We don't have the title and description yet, because we didn't connect this in our frontend app. 15:22 So let's go back to the code and open page.tsx. 15:26 In here we can replace retrieved video with the actual video title, 15:31 and then we can replace the lorem ipsum with the video description. 15:43 We are getting some TypeScript errors here, so let's make sure we 15:46 make this application type-safe by creating a type at the very top. 15:50 Let's create a type called video, which has a video ID, which is a string. 15:58 It also has a title, which is a string as well, and finally it has a description, which again is represented as a string. 16:09 This type should be used by our local state right here, 16:13 so we can make sure whatever is being set as video state is actually matching the video type definition, 16:19 and then we should do the same whenever we get the JSON back from the large language model.
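The type from this step, plus a small runtime guard (the guard and the exact field names are my additions; adjust them to match the keys in your agent's JSON):

```typescript
// The video type the speaker defines: an ID, a title, and a description
type Video = {
  videoId: string;
  title: string;
  description: string;
};

// Optional runtime check before trusting the LLM's parsed JSON,
// since a compile-time cast alone can't catch a malformed reply
function isVideo(value: unknown): value is Video {
  const v = value as Partial<Video> | null;
  return (
    typeof v?.videoId === "string" &&
    typeof v?.title === "string" &&
    typeof v?.description === "string"
  );
}
```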
16:24 We need to make sure that whatever is being parsed ends up being of type video, 16:28 and this should resolve some of the TypeScript errors we saw at the bottom of our screen. 16:34 We still get an error for description, and this is why I like TypeScript: we forgot to put an i there. 16:39 And now it should be all good. 16:41 If we visit the browser, we should see our application with the video title 16:44 and the video description being pulled dynamically from the YouTube video page. 16:52 So you can see we have the title here, building AI apps with large language models, 16:56 which matches the embedded video title, 16:58 and then in the description you can see the amount of views is in there, whenever it was posted, 17:03 together with the rest of the video description. 17:06 So this is a great start, and we want to do something more, because I told you in the beginning 17:10 we're building an agent that's able to transcribe YouTube videos, 17:14 and for this we need to import a community tool from wxflows. 17:19 So let me go back to VS Code, where I killed the process running the app, 17:23 and I will be creating a new directory which I'll be calling wxflows. 17:30 We need to move into this directory, from which we can use the wxflows CLI to start importing community tools. 17:36 So these tools are available for you to pull from GitHub. 17:40 First, I need to make sure I have the CLI installed correctly, 17:43 and you can find the installation instructions in the GitHub repository for wxflows. 17:49 We can run the command wxflows --version, and it should render a version in your terminal. 17:56 Once you've verified it's installed correctly, we can start by setting up our project by running wxflows init. 18:02 It's going to ask us for an endpoint name; all the tools you create in here will be represented as endpoints. 18:09 I always like to use the name of my project as the endpoint name; that way I don't get confused later on.
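To recap the CLI steps named in this part of the video (the tool-import command's exact arguments aren't dictated on screen, so it is omitted here; verify all of these against the wxflows documentation):

```shell
wxflows --version   # should print a version number if the CLI is installed
wxflows init        # initializes the project; prompts for an endpoint name
```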
18:16 You can see in the wxflows directory there's a new configuration file. 18:21 So we can proceed by importing the YouTube transcription tool, 18:25 and for this I'm going to run the next command, which will take a tool for YouTube 18:30 that's available on GitHub and put it in our project. 18:36 A couple of files are now being created, including tools.graphql. 18:40 In here you can see we have a new tool called youtube_transcript. 18:44 It has a description, which is retrieve transcript for a given video ID, 18:49 and then it has some formatting requirements for the video URL. 18:57 I don't need to save any of this, but I do need to deploy it. 18:59 So as mentioned, all the tools are represented as endpoints, so by running deploy you can deploy this to an endpoint. 19:06 And this endpoint is what we connect to from our LangGraph agent. 19:12 The YouTube... 19:15 The endpoint here is the endpoint that you need for the SDK, together with your API key. 19:20 So we're going to move back into the main project directory, and in here we're going to create a .env file. 19:27 In this .env file we need to set the endpoint and also the API key. 19:32 So these are two details that you really need. 19:34 Without these you won't be able to execute the YouTube transcription tool. 19:40 For the endpoint, I'm able to copy-paste my endpoint that was in my terminal after running wxflows deploy. 19:47 For the API key, I need to run the command wxflows whoami --apikey, 19:51 and this will return the API key right here in your terminal. 20:02 Make sure to save the .env file and then close it. 20:06 We also need to install the wxflows SDK, and this SDK is being used to connect to and retrieve the tools. 20:16 For this I need to run a command, 20:26 so I'm going to run npm install @wxflows/sdk@beta. 20:30 I need to make sure I install the beta version of the SDK, as this is under active development.
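The environment file described above can be sketched like this (the variable names are assumptions of mine — use whatever names your agent code reads via process.env, and never commit real values):

```
# .env — values come from `wxflows deploy` and the CLI's API-key command
WXFLOWS_ENDPOINT=<your-deployment-endpoint-url>
WXFLOWS_APIKEY=<your-api-key>
```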
20:40 Once it's installed, I can hook it up to my LangGraph agent, which is in actions.ts. 20:47 At the very top I need to import wxflows, and I need to import the LangChain-integrated version. 20:53 I can scroll down a bit. 20:55 I won't be needing my get YouTube details tool anymore, 20:58 but I might be using it later on, because you can still use different tools side by side. 21:03 In here I'm going to create a tool client, and the tool client is able to retrieve and execute tools that are available on wxflows. 21:12 I have my endpoint and API key coming from the .env file in here, 21:17 and then I need to retrieve the tools by looking at the lcTools variable that's available on the tool client. 21:27 The tools that I retrieved here, I need to connect them to my LangGraph agent like this. 21:35 You might be getting some errors here along the way, especially here, because 21:39 we're trying to put an array inside an array, and that's never a good idea. 21:43 So let me update this and save it. 21:45 Before we actually try it out in our browser, we might want to make one small change. 21:50 We might want to update the system message, because now it's retrieving tools. 21:55 So we can actually give it a very specific description for the new tool we gave it. 22:03 So next to retrieving the title and description for a given video, 22:07 we also want to retrieve the transcript for a video using the tool that we just provided. 22:13 We want the LLM to also use all the tools that are available, 22:17 and we're going to give it some examples on how to use the YouTube transcript tool, 22:22 and then, something that's quite interesting, we're going to be using the transcript to generate the description. 22:27 So you might remember earlier on we retrieved the description from the YouTube video page; 22:32 this time we'll be generating it using the LLM, based off the captions that were provided by our transcript tool.
22:40 So let's make sure to add the captions here as well, because we also want to see the captions in our JSON output. 22:48 So this is the video captions, and let me save this. 22:53 I think we're all done in the actions.ts file, so we can make some changes in our page.tsx. 23:02 In here we now also need to retrieve the captions. 23:05 So let's add captions to the type definition as a string. 23:11 We are still parsing the result, and the captions should be part of this result, 23:16 and once we scroll down a bit more, we probably want to show the captions on the screen as well. 23:22 So not only do we use the captions to generate a description, 23:28 we'll also be using these captions and display them right here on the page. 23:33 Let me clean this up a bit. 23:34 Why am I getting an error here? 23:37 And then format the page. 23:40 So now if I run my application again using npm run dev, I should be able to open the browser and 23:46 see the captions in there for my video, next to the title and description. 23:50 So I'm going back to my browser to copy this URL, and make sure to refresh the page. 23:59 I'm going to be pasting this link, 24:01 and then let's see what the agent is generating for us using the large language model and the available tools. 24:08 And as you can see here, we now have a title for the video. 24:11 We have our embedded video using the video ID, and then we have the description. 24:15 The description was generated by looking at the transcript. 24:18 I do see the transcript is a bit cut off. 24:21 So a couple of things could be happening here: 24:23 maybe the amount of tokens that we give to the LLM isn't sufficient 24:27 to retrieve the entire transcript, or maybe it's just a styling thing. 24:31 If you want to update parameters such as the max new tokens, you can set them all in our actions.ts file. 24:37 If you scroll up a bit, you're connecting here to ChatOllama. 24:42 You can give it parameters like this.
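As a sketch of such a parameter object (the option names follow the @langchain/ollama ChatOllama documentation, but treat them as assumptions and verify them against the version you have installed):

```typescript
// Options you might pass when constructing ChatOllama; numPredict is
// Ollama's cap on generated tokens, i.e. the "max new tokens" knob
const chatOllamaOptions = {
  model: "llama3.2",
  temperature: 0,
  format: "json", // force JSON-formatted output
  numPredict: 4096, // raise this if the transcript comes back truncated
  maxRetries: 2,
};
```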
24:45 You can also set things like max new tokens or max retries. 24:49 You can set a lot of different things here. 24:51 If you want to know more about setting this up, make sure to go to the LangChain documentation. 24:57 And that's how easy it is to create your LangGraph agent using JavaScript. 25:00 In this video we used Next.js to build a frontend application. 25:04 Then we used models running locally in Ollama and connected them to LangGraph. 25:07 And finally we used a YouTube transcription tool from wxflows to transcribe videos from YouTube. 25:13 If you want to know more about building this application, make sure to have a look at the link in the video description.