Learning Library

Building a YouTube Transcription Agent with LangGraph

Key Points

  • The tutorial walks through creating a YouTube transcription AI agent with LangGraph, leveraging locally‑run Ollama models, a WXFlows transcription tool, and a Next.js front‑end.
  • A new Next.js project is bootstrapped using the Create Next App CLI, opting for TypeScript and Tailwind CSS for styling, then the generated `page.tsx` is cleared for custom code.
  • The main component is set up as a client‑side React component so state can be managed, and a simple UI is built with a header, an input field for a YouTube link, and a submit button.
  • An iframe placeholder is added beneath the input to display the selected YouTube video, with the proper React attribute names (referrerPolicy, frameBorder, allowFullScreen) applied.
  • After saving the changes, the app can be launched locally with `npm run dev` to view the functional transcription interface.

Sections

Full Transcript

# Building a YouTube Transcription Agent with LangGraph

**Source:** [https://www.youtube.com/watch?v=u6qDSFxY4iw](https://www.youtube.com/watch?v=u6qDSFxY4iw)
**Duration:** 00:25:18

## Sections

- [00:00:00](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=0s) **Building a LangGraph YouTube Agent** - A walkthrough of creating a JavaScript‑based AI agent with LangGraph, Next.js, local Ollama models, and WXFlows to fetch YouTube video transcriptions and display summarized results.
- [00:03:04](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=184s) **Adding a LangGraph Agent with Ollama** - The speaker walks through stopping a Next.js server, verifying the Ollama installation, installing LangChain/LangGraph dependencies, and creating an actions.ts file to embed a dynamically powered YouTube video player using a locally run Llama 3.2 model.
- [00:06:13](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=373s) **Parsing the YouTube ID in React** - The speaker explains how to use a system prompt to extract a YouTube video ID into JSON and manage the URL input and retrieved video data with React state hooks in a .tsx component.
- [00:09:24](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=564s) **Implementing a Dynamic YouTube Embed** - The speaker modifies the iframe source to use a state‑driven video ID via a template literal, wraps it in conditional brackets, runs the app to verify the embed works, and outlines adding a tool for fetching full video transcriptions.
- [00:12:32](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=752s) **Fetching YouTube Details with Playwright** - The speaker explains how to build an async tool that launches Chromium via Playwright to scrape a video's title and description from a YouTube page, then integrates this callback function into an Ollama‑LangGraph agent.
- [00:15:50](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=950s) **Defining a Video Type for State** - The speaker creates a TypeScript `video` type (id, title, description), applies it to local state and LLM JSON parsing to resolve type errors, verifies the video title and description render correctly, and then prepares to import a transcription tool.
- [00:18:59](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=1139s) **Configuring the wxflows Endpoint & SDK** - The speaker explains how to create a .env file with the deployment endpoint and API key, install the beta wxflows SDK via npm, and integrate it into a LangGraph agent for YouTube transcription.
- [00:22:12](https://www.youtube.com/watch?v=u6qDSFxY4iw&t=1332s) **Generating Video Descriptions from Captions** - The speaker demonstrates using a YouTube transcript tool to fetch captions, feeding them to an LLM to auto‑generate video descriptions, and incorporating those captions into the app's JSON output and UI.

## Full Transcript
0:00 Let's build an AI agent in LangGraph using JavaScript. 0:03 Agents can be really helpful to automate parts of your life. 0:06 Or, for example, take data from one source and turn it into something else. 0:10 In this video, we'll be building a YouTube transcription agent 0:13 that's able to pull transcriptions from a YouTube video and summarize them on your screen. 0:18 For this we'll be using models running locally using Ollama, 0:21 we'll be using Next.js to build a frontend app, and then we'll be using a YouTube transcription tool from WXFlows. 0:27 So let's dive into VS Code and see how it's built. 0:31 In VS Code, I've already set up a new project. 0:33 In here, I'm going to run a command to bootstrap a new Next.js application. 0:38 For this, I'll be using the Create Next App CLI, and I'm going to make sure I use the latest version. 0:43 I also need to provide it with a name for my project, which will be the LangGraph YouTube agent. 0:50 Setting this up is going to require you to answer a few questions. 0:53 For example, would you like to use TypeScript? 0:56 And then it's going to ask a few other defaults. 0:58 It also asks us to install Tailwind, which is a nice library to help you write CSS. 1:04 Depending on your needs, you might make different decisions when building your own project. 1:09 Once this is done, it's going to generate a new directory with all my files. 1:13 I'm going to move into this new LangGraph YouTube agent directory, 1:16 where I can find all the boilerplate code for my Next.js application. 1:20 In here you can find a file called page.tsx, and this will be the main file that's rendered when someone visits my application. 1:27 I'm going to delete all the code that's in there. 1:29 I'm going to replace it with something else. 1:31 In here I'm going to add my own boilerplate for this application. 1:36 I also need to make sure that I set this component up 1:38 to be a client-side component, and this means I can use state management later on.
1:43 Within this div, which has some Tailwind class names attached to it, 1:46 I can set up all the code I need in order to render my application. 1:51 That starts with a header, and for this I'm going to be adding a header that contains the name of the agent, 1:57 which is the YouTube transcription agent, and then I'm going to be adding a bit where I have an input bar to submit a video link. 2:05 Put this in there as well. 2:07 Once I save this, I could already visit the application in the browser by starting npm run dev, 2:13 but first, I will also add a placeholder video. 2:16 So this will mock the application setup that we'll have later on. 2:23 So right below my input bar, I'm going to paste this final bit, 2:27 which is going to show an input bar with a button to submit a video link, 2:33 and then it's going to show an iframe for a YouTube video. 2:37 I need to make sure that all these definitions have the correct format though, 2:40 because React has different requirements than any other JavaScript application. 2:45 I need to make sure I update the referrer policy, the frame border, and the allow-full-screen option. 2:51 So let me format this file and then save it. 2:54 In my terminal, I can start the Next.js application by running npm run dev. 3:00 And this should open up a new page in my browser, which I can visit to see my application. 3:10 In the browser, you can see we have a header, we have a bar to submit a video link, 3:15 we have a button to actually submit the link, 3:17 and then we have a place where we render the video, including an embedded iframe from YouTube. 3:23 So we're going to add a LangGraph agent that's able to fill this space dynamically 3:27 by using both YouTube and also a model running using Ollama. 3:33 So let me go back to VS Code, where I'm first going to kill the Next.js process running in my terminal, 3:39 and we're going to check if I have Ollama installed properly. 3:42 So with Ollama, you can run models locally on your machine.
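As a quick reference, the attribute renames the speaker applies can be captured in a small lookup (a sketch of my own, not code from the video; the left-hand names are the lowercase HTML attributes, the right-hand names are what React's JSX expects):

```typescript
// HTML attribute -> JSX prop name for the three <iframe> attributes
// the video calls out; React warns on the lowercase HTML spellings
const jsxIframeAttributeNames: Record<string, string> = {
  referrerpolicy: "referrerPolicy",
  frameborder: "frameBorder",
  allowfullscreen: "allowFullScreen",
};
```

Formatting the pasted embed code is mostly a matter of applying these renames.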
3:46 So these are all open-source models that you need to download to your machine first. 3:50 So if you run this command for the very first time, you might see 3:53 an extra command to actually download the llama 3.2 model. 3:57 These are all open-source models, so you can run them wherever you want. 4:01 For example, they're also available in watsonx.ai. 4:05 I can see I have llama 3.2 installed, so I can just close this process with Ctrl+D. 4:12 Let me clear my terminal so we can proceed by installing LangGraph. 4:17 So LangGraph is a superset of LangChain, meaning that you need to install some LangChain libraries in order to use it. 4:25 So I'm going to be installing LangChain, LangGraph, and then some other core libraries. 4:34 Once these have been installed, I need to create a new file, which I'm going to call actions.ts. 4:40 In this actions.ts file, I'm going to create my LangGraph agent. 4:45 I also need to make sure that this file is set up to run server-side by setting use server at the very top of the file, 4:53 and in here I can start to create my transcribe function, 4:56 which will include the LangGraph agent to retrieve YouTube details and show them on the screen. 5:05 I'm going to be calling this function transcribe. 5:07 It takes one input, which is the video URL, and the video URL should be a string. 5:13 In here, I also need to import a lot of the different libraries that we just installed. 5:19 So let's break down which libraries we need. 5:21 We need ChatOllama, which is the chat interface for Ollama models running on your machine. 5:26 We need a function called createReactAgent, which is used to create the agent in LangGraph. 5:31 We then need to import two libraries related to the creation of tools, 5:35 and finally, as we are using TypeScript, we need to have some type definitions in here as well. 5:41 So this means we can proceed by setting up the agent inside the transcribe function. 5:48 You can see again I'm using the chat interface from Ollama.
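As a structural sketch of that actions.ts file (the LangGraph wiring with ChatOllama and createReactAgent is elided here, and the stubbed return value is only illustrative — the real function invokes the agent):

```typescript
"use server";

// Shape of the `transcribe` server action described above. The real
// body creates a ChatOllama model (llama3.2, temperature 0, JSON
// output), wraps it with createReactAgent, and invokes it; the agent
// call is stubbed out here so only the contract is visible.
export async function transcribe(videoUrl: string): Promise<string> {
  // The agent replies with its JSON answer serialized as a string,
  // which the caller parses on the client side before use.
  const videoId = new URL(videoUrl).searchParams.get("v");
  return JSON.stringify({ videoId });
}
```

On the client, the returned string is parsed with JSON.parse before being stored in state.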
5:51 I'm setting my model to be llama 3.2. 5:53 I have the temperature set to zero. 5:56 And I also will be forcing the large language model to return JSON format. 6:00 So this is going to be important later on, 6:02 when we look at the system prompt, where we're going to force the LLM to return something that is in a JSON format. 6:09 So if you look at our request, it needs a few messages. 6:13 One is a system prompt and the other is a human message. 6:16 This is your question, usually. 6:18 But as we're not building a chat app, we're having a predefined question, and the video URL is the dynamic part. 6:24 In the system prompt, you can see that the LLM should retrieve the video ID for a given YouTube URL. 6:30 So we're going to rely on the LLM to dissect the video ID from its URL 6:35 and then return the output in a JSON structure, which includes the video ID, 6:40 and finally, we need to return this back. 6:43 So whenever you call the transcribe function, you're going to get this JSON object in return. 6:50 So let's save the actions.ts file and then connect it all via the page.tsx component. 6:57 At the top of this component, we need to set a few state variables. 7:00 We're going to create two. 7:01 We're going to create one state variable to make the input bar a controlled component, 7:05 meaning that whenever you type in the input bar, it's going to update the state with the latest value of the video URL. 7:12 For this, I'm going to create a variable which I call videoUrl, 7:17 and then I'll also be creating a function to update the video URL. 7:21 So this will be used in the onChange function of our input bar. 7:26 For this, I need the useState hook from React, which can be imported at the very top. 7:31 And I'm going to set an empty string as the default state. 7:36 Then I'll also be creating a state for the video. 7:38 So whenever we retrieve a video using the agent, we want to store it in local state, 7:43 meaning that it can be used across this component.
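The video delegates this ID extraction to the LLM via the system prompt. If you want a deterministic check alongside it, the same ID can be parsed with the standard URL API (a hypothetical helper, not part of the tutorial):

```typescript
// Parse the video ID from the two common YouTube URL shapes;
// returns null when no ID can be found
function extractVideoId(videoUrl: string): string | null {
  try {
    const url = new URL(videoUrl);
    if (url.hostname === "youtu.be") {
      // short links carry the ID as the path: youtu.be/<id>
      return url.pathname.slice(1) || null;
    }
    // watch links carry it as a query parameter: watch?v=<id>
    return url.searchParams.get("v");
  } catch {
    return null; // not a valid URL at all
  }
}
```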
7:47 I have const video, then I have setVideo, 7:51 and useState. 7:52 This time it will be empty. 7:53 I will be creating a type-safe definition later on for this state variable. 7:59 After setting this, I can hook it up to the input box. 8:04 But first I'm going to create a function that will actually call the transcribe function to use the agent for transcriptions. 8:14 This transcribeVideo function needs the transcribe function, which I have in actions.ts. 8:20 It then needs to parse the results, 8:22 because even though we forced the LLM to return JSON, 8:25 whatever LangChain or LangGraph returns to me is always a string. 8:29 So we need to parse this string and make sure we get the actual JSON. 8:33 And this JSON will be put in state using the setVideo function. 8:38 If I scroll down, I can hook up my input box to look at the video URL as its value. 8:47 So we can have videoUrl in here, 8:49 and then we need to set an onChange function and hook up the event to store whatever you type in that input box in state. 8:58 So I have the function setVideoUrl, 9:00 and this will take e.target.value as its input. 9:05 Once I save this, I only need to make sure that my transcribeVideo function is connected to this button. 9:15 After doing this, I should be good to actually submit the request to the large language model 9:20 that's connected to the agent to retrieve my video transcription. 9:24 It's not retrieving the full transcription yet. 9:28 It's only going to make sure we get the video ID. 9:30 But later on, we'll add a tool to actually retrieve the transcription. 9:35 Let me scroll down a bit to this iframe section, and let's make sure we wrap it in curly brackets 9:41 and check for the presence of the video state first before we render this. 9:47 And this means that we can actually hook up this dynamic video ID in this source. 9:52 So instead of having the source which we have right here, we're going to create a new source, which is a template literal.
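That template literal, pulled out as a standalone sketch (the helper name is my own; the embed base URL is YouTube's standard one):

```typescript
// Build the iframe src from the state-driven video ID instead of a
// hard-coded URL — the same shape a JSX `src={...}` template produces
function buildEmbedUrl(videoId: string): string {
  return `https://www.youtube.com/embed/${videoId}`;
}
```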
10:00 It's going to take all of this, because we still need to have the embed URL. 10:05 But now the video ID won't be hard-coded; instead we're going to take 10:09 the dynamic variable which is available in the state. 10:12 This is the video ID, 10:16 and just let me delete this one. 10:18 Once I save this, we should be able to start our application and view the first part of our application in the browser. 10:25 I'm going to run npm run dev, and this should start the application. 10:32 In my browser, I need to make sure I refresh the page and then enter a video link. 10:39 I probably need to set up a loading indicator so I know something is happening whenever I press this button, 10:45 but we can see that the video is being pulled in correctly and the video ID is being passed in to the YouTube iframe. 10:54 If we go back to actions.ts, we need to do a bit more. 10:58 You saw we already imported some tool libraries, so we're going to use these to create a tool 11:02 to retrieve information from a YouTube video page. 11:05 And for this we'll be using Playwright. 11:08 Playwright is a library to open a website programmatically and retrieve details from that website. 11:15 So I'm going to close the command running in my terminal, and then here I'll be running npm install playwright. 11:21 And this will install the Playwright library from npm. 11:26 After installing the Playwright library, we need to import it, of course, in our actions.ts file. 11:35 So I can import the library at the very top, and then we can start to define our YouTube function. 11:40 Well, first I'm going to define the tool definition, 11:44 because a tool in LangChain or LangGraph (LangGraph being LangChain, but agentic) 11:51 is going to need both a tool execution function and also a tool definition. 11:55 So if I have a tool, for example, which I call get YouTube details, 12:00 I will be using the tool function from LangChain to create this tool.
12:06 I'm going to make this an async function 12:11 later on, because the execution function should be async, but the rest is fine to be synchronous. 12:18 And then I'm going to be adding my tool definition in here. 12:20 Once I clean this up, you can see that I have added the tool definition for a tool called get YouTube details, 12:27 which is described as a tool to get the title and description of a YouTube video. 12:33 And its input is a video ID. 12:35 So this is the video ID that we have the LLM dissect from a given YouTube URL. 12:40 The actual callback function that should be executed whenever 12:44 the LLM proposes to call the get YouTube details function is something we hook up here. 12:51 In here we need to look for the input variables. 12:53 We have an async function. 12:57 We should be calling Playwright. 12:58 So let me put this bit of code in here and make sure we take the input argument. 13:03 Let me clean it up a bit. 13:05 So what Playwright is doing in here is launching a Chromium browser. 13:08 So Chromium is related to Chrome. 13:10 The very first time you run this, 13:13 you might need to run the command npx playwright install, 13:17 which you can use to download the Chromium browser to your project. 13:21 It's going to open a given YouTube page in the browser, 13:25 and then it's going to take different locators and store them as objects. 13:29 So first it's going to look for the H1 element, which has the title of the video, 13:33 and then it's going to look for the description of the video, which is somewhere in a div, 13:39 and then of course it's going to close the browser, because it doesn't need to be open all the time. 13:44 So I created this get YouTube details tool, which has both the callback function to call and also the tool definition. 13:51 So let me save this and make sure that we pass the get YouTube details tool 13:56 to Ollama, which is then hooked up in LangGraph to form our agent.
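The pairing of definition and callback can be sketched without the LangChain and Playwright dependencies like this (names and shapes here are assumptions of mine; the real version uses LangChain's tool() helper with a schema, and drives Playwright inside the callback):

```typescript
// Dependency-free sketch of the get-YouTube-details tool contract:
// a definition (name, description, typed input) plus an async callback.
type YouTubeDetails = { title: string; description: string };

const getYoutubeDetails = {
  name: "get_youtube_details",
  description: "Tool to get the title and description of a YouTube video.",
  // The real callback launches Chromium via Playwright, opens the
  // watch page, reads the <h1> title and the description div from
  // their locators, and closes the browser before returning.
  async invoke({ videoId }: { videoId: string }): Promise<YouTubeDetails> {
    // Stubbed result standing in for the scraped page content
    return { title: `Title of ${videoId}`, description: "(scraped text)" };
  },
};
```

The LLM never runs this code itself; it only proposes a call with a videoId, and the framework executes the callback.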
14:01 I'm going to save this, but before we actually try it out we need to update our system message, 14:06 because now there are additional details that need to be retrieved. 14:10 It also needs to retrieve the title and the description of the video. 14:15 So we want both of these to be present in the object that's being sent to your frontend app. 14:23 I'll update this a little bit as well. 14:25 Use any tool at your disposal if needed is still valid, and we probably want to tell it 14:29 not to return any data unless 14:37 all fields are populated. 14:42 So by creating the tool and updating the system prompt, we should be able to try this out in a browser. 14:47 For this I'm going to run npm run dev, and this should make our application available back in our browser. 14:57 Let me copy this YouTube URL and refresh the page, because that way we're certain we have a fresh history. 15:04 I'm going to put in the URL right here, and then let's wait to see what the LLM and the agent are going to generate for us. 15:15 You can see now it's still retrieving the video. 15:17 We don't have the title and description yet, because we didn't connect this in our frontend app. 15:22 So let's go back to the code and open page.tsx. 15:26 In here we can replace retrieved video with the actual video title, 15:31 and then we can replace the lorem ipsum with the video description. 15:43 We are getting some TypeScript errors here, so let's make sure we 15:46 make this application type-safe by creating a type at the very top. 15:50 Let's create a type called video, which has a video ID, which is a string. 15:58 It also has a title, which is a string as well, and finally it has a description, which again is represented as a string. 16:09 This type should be used by our local state right here, 16:13 so we can make sure whatever is being set as video state is actually matching the video type definition, 16:19 and then we should do the same whenever we get the JSON back from the large language model.
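The type from this step, plus a small runtime guard (the guard and the exact field names are my additions; adjust them to match the keys in your agent's JSON):

```typescript
// The video type the speaker defines: an ID, a title, and a description
type Video = {
  videoId: string;
  title: string;
  description: string;
};

// Optional runtime check before trusting the LLM's parsed JSON,
// since a compile-time cast alone can't catch a malformed reply
function isVideo(value: unknown): value is Video {
  const v = value as Partial<Video> | null;
  return (
    typeof v?.videoId === "string" &&
    typeof v?.title === "string" &&
    typeof v?.description === "string"
  );
}
```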
16:24 We need to make sure that whatever is being parsed ends up being of type video, 16:28 and this should resolve some of the TypeScript errors we saw at the bottom of our screen. 16:34 We still get an error for description, and this is why I like TypeScript: we forgot to put an i there. 16:39 And now it should be all good. 16:41 If we visit the browser, we should see our application with the video title 16:44 and the video description being pulled dynamically from the YouTube video page. 16:52 So you can see we have the title here, building AI apps with large language models, 16:56 which matches the embedded video title, 16:58 and then in the description you can see the amount of views is in there, whenever it was posted, 17:03 together with the rest of the video description. 17:06 So this is a great start, and we want to do something more, because I told you in the beginning 17:10 we're building an agent that's able to transcribe YouTube videos, 17:14 and for this we need to import a community tool from wxflows. 17:19 So let me go back to VS Code, where I killed the process running the app, 17:23 and I will be creating a new directory which I'll be calling wxflows. 17:30 We need to move into this directory, from which we can use the wxflows CLI to start importing community tools. 17:36 So these tools are available for you to pull from GitHub. 17:40 First, I need to make sure I have the CLI installed correctly, 17:43 and you can find the installation instructions in the GitHub repository for wxflows. 17:49 We can run the command wxflows --version, and it should render a version in your terminal. 17:56 Once you've verified it's installed correctly, we can start by setting up our project by running wxflows init. 18:02 It's going to ask us for an endpoint name; all the tools you create in here will be represented as endpoints. 18:09 I always like to use the name of my project as the endpoint name; that way I don't get confused later on.
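To recap the CLI steps named in this part of the video (the tool-import command's exact arguments aren't dictated on screen, so it is omitted here; verify all of these against the wxflows documentation):

```shell
wxflows --version   # should print a version number if the CLI is installed
wxflows init        # initializes the project; prompts for an endpoint name
```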
18:16 You can see in the wxflows directory there's a new configuration file. 18:21 So we can proceed by importing the YouTube transcription tool, 18:25 and for this I'm going to run the next command, which will take a tool for YouTube 18:30 that's available on GitHub and put it in our project. 18:36 A couple of files are now being created, including tools.graphql. 18:40 In here you can see we have a new tool called youtube_transcript. 18:44 It has a description, which is retrieve transcript for a given video ID, 18:49 and then it has some formatting requirements for the video URL. 18:57 I don't need to save any of this, but I do need to deploy it. 18:59 So as mentioned, all the tools are represented as endpoints, so by running deploy you can deploy this to an endpoint. 19:06 And this endpoint is what we connect to from our LangGraph agent. 19:12 The YouTube... 19:15 The endpoint here is the endpoint that you need for the SDK, together with your API key. 19:20 So we're going to move back into the main project directory, and in here we're going to create a .env file. 19:27 In this .env file we need to set the endpoint and also the API key. 19:32 So these are two details that you really need. 19:34 Without these you won't be able to execute the YouTube transcription tool. 19:40 For the endpoint, I'm able to copy-paste my endpoint that was in my terminal after running wxflows deploy. 19:47 For the API key, I need to run the command wxflows whoami --apikey, 19:51 and this will return the API key right here in your terminal. 20:02 Make sure to save the .env file and then close it. 20:06 We also need to install the wxflows SDK, and this SDK is being used to connect to and retrieve the tools. 20:16 For this I need to run a command, 20:26 so I'm going to run npm install @wxflows/sdk@beta. 20:30 I need to make sure I install the beta version of the SDK, as this is under active development.
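The environment file described above can be sketched like this (the variable names are assumptions of mine — use whatever names your agent code reads via process.env, and never commit real values):

```
# .env — values come from `wxflows deploy` and the CLI's API-key command
WXFLOWS_ENDPOINT=<your-deployment-endpoint-url>
WXFLOWS_APIKEY=<your-api-key>
```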
20:40 Once it's installed, I can hook it up to my LangGraph agent, which is in actions.ts. 20:47 At the very top I need to import wxflows, and I need to import the LangChain-integrated version. 20:53 I can scroll down a bit. 20:55 I won't be needing my get YouTube details tool anymore, 20:58 but I might be using it later on, because you can still use different tools side by side. 21:03 In here I'm going to create a tool client, and the tool client is able to retrieve and execute tools that are available on wxflows. 21:12 I have my endpoint and API key coming from the .env file in here, 21:17 and then I need to retrieve the tools by looking at the lcTools variable that's available on the tool client. 21:27 The tools that I retrieved here, I need to connect them to my LangGraph agent like this. 21:35 You might be getting some errors here along the way, especially here, because 21:39 we're trying to put an array inside an array, and that's never a good idea. 21:43 So let me update this and save it. 21:45 Before we actually try it out in our browser, we might want to make one small change. 21:50 We might want to update the system message, because now it's retrieving tools. 21:55 So we can actually give it a very specific description for the new tool we gave it. 22:03 So next to retrieving the title and description for a given video, 22:07 we also want to retrieve the transcript for a video using the tool that we just provided. 22:13 We want the LLM to also use all the tools that are available, 22:17 and we're going to give it some examples on how to use the YouTube transcript tool, 22:22 and then, something that's quite interesting, we're going to be using the transcript to generate the description. 22:27 So you might remember earlier on we retrieved the description from the YouTube video page; 22:32 this time we'll be generating it using the LLM, based off the captions that were provided by our transcript tool.
22:40 So let's make sure to add the captions here as well, because we also want to see the captions in our JSON output. 22:48 So this is the video captions, and let me save this. 22:53 I think we're all done in the actions.ts file, so we can make some changes in our page.tsx. 23:02 In here we now also need to retrieve the captions. 23:05 So let's add captions to the type definition as a string. 23:11 We are still parsing the result, and the captions should be part of this result, 23:16 and once we scroll down a bit more, we probably want to show the captions on the screen as well. 23:22 So not only do we use the captions to generate a description, 23:28 we'll also be using these captions and display them right here on the page. 23:33 Let me clean this up a bit. 23:34 Why am I getting an error here? 23:37 And then format the page. 23:40 So now if I run my application again using npm run dev, I should be able to open the browser and 23:46 see the captions in there for my video, next to the title and description. 23:50 So I'm going back to my browser to copy this URL, and make sure to refresh the page. 23:59 I'm going to be pasting this link, 24:01 and then let's see what the agent is generating for us using the large language model and the available tools. 24:08 And as you can see here, we now have a title for the video. 24:11 We have our embedded video using the video ID, and then we have the description. 24:15 The description was generated by looking at the transcript. 24:18 I do see the transcript is a bit cut off. 24:21 So a couple of things could be happening here: 24:23 maybe the amount of tokens that we give to the LLM isn't sufficient 24:27 to retrieve the entire transcript, or maybe it's just a styling thing. 24:31 If you want to update parameters such as the max new tokens, you can set them all in our actions.ts file. 24:37 If you scroll up a bit, you're connecting here to ChatOllama. 24:42 You can give it parameters like this.
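As a sketch of such a parameter object (the option names follow the @langchain/ollama ChatOllama documentation, but treat them as assumptions and verify them against the version you have installed):

```typescript
// Options you might pass when constructing ChatOllama; numPredict is
// Ollama's cap on generated tokens, i.e. the "max new tokens" knob
const chatOllamaOptions = {
  model: "llama3.2",
  temperature: 0,
  format: "json", // force JSON-formatted output
  numPredict: 4096, // raise this if the transcript comes back truncated
  maxRetries: 2,
};
```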
24:45 You can also set things like max new tokens or max retries. 24:49 You can set a lot of different things here. 24:51 If you want to know more about setting this up, make sure to go to the LangChain documentation. 24:57 And that's how easy it is to create your LangGraph agent using JavaScript. 25:00 In this video we used Next.js to build a frontend application. 25:04 Then we used models running locally in Ollama and connected them to LangGraph. 25:07 And finally we used a YouTube transcription tool from wxflows to transcribe videos from YouTube. 25:13 If you want to know more about building this application, make sure to have a look at the link in the video description.