GraphRAG: Populate and Query Knowledge Graph
Key Points
- GraphRAG replaces vector search with knowledge graphs, using graph databases to capture both entities (vertices) and their relationships (edges) for richer contextual retrieval.
- An LLM first extracts entities and relationships from unstructured text, converts them into structured triples, and populates a Neo4j (or any) graph database.
- When a user asks a natural‑language question, the LLM generates a Cypher query, runs it against the graph, and then translates the query results back into a natural‑language answer.
- The demo sets up a local Neo4j instance via a container (e.g., Podman/Docker), installs required Python packages (including LangChain), and configures credentials and the APOC plugin for advanced graph operations.
- A fresh Python virtual environment is recommended, and the notebook provides step‑by‑step instructions for obtaining API keys, initializing the database, and running the end‑to‑end GraphRAG pipeline.
Sections
- Populating and Querying GraphRAG with LLM - The speaker demonstrates using an LLM to extract entities and relationships from raw text, populate a knowledge graph, generate Cypher queries, and retrieve natural‑language answers through Graph Retrieval Augmented Generation.
- Configuring LangChain Graph Pipeline - The speaker walks through setting up a fresh Python 3.11 virtual environment, installing LangChain, Neo4j, and IBM watsonx.ai libraries, configuring API credentials, connecting to a local Neo4j database, and using LLM‑driven transformers to build a knowledge graph from employee‑relationship text.
- LLM‑Generated Knowledge Graph Workflow - The excerpt outlines how allowed relationship types are defined, text is transformed into graph documents, inserted into a graph database via `add_graph_documents`, and then visualized and queried with Cypher to verify that the LLM‑derived entities and relationships are correctly represented.
- Guiding LLMs for Cypher Queries - The passage explains how to use prefixed few‑shot prompts and strict output constraints to force an LLM to generate correct Cypher queries and translate results into natural‑language answers within a graph QA chain.
- GraphRAG vs VectorRAG: Core Differences - The passage explains that GraphRAG converts text into a structured knowledge graph and uses LLM‑generated Cypher queries for retrieval, offering corpus‑wide summarization unlike VectorRAG’s limited top‑k semantic search, and suggests hybrid approaches can combine both strengths.
Source
- **Source:** [https://www.youtube.com/watch?v=Za7aG-ooGLQ](https://www.youtube.com/watch?v=Za7aG-ooGLQ)
- **Duration:** 00:14:19

Section Timestamps
- [00:00:00](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=0s) Populating and Querying GraphRAG with LLM
- [00:03:10](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=190s) Configuring LangChain Graph Pipeline
- [00:06:15](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=375s) LLM‑Generated Knowledge Graph Workflow
- [00:09:25](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=565s) Guiding LLMs for Cypher Queries
- [00:12:33](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=753s) GraphRAG vs VectorRAG: Core Differences

Full Transcript
Today I'm going to show you how to populate a knowledge graph and query it using an LLM.
Graph retrieval augmented generation,
or GraphRAG,
is emerging as a powerful alternative to vector search methods.
Instead of using a vector database,
GraphRAG systems store data in the format of a knowledge graph
using a graph database.
In a knowledge graph, the relationships between data points, called edges,
are as meaningful as the data points themselves,
called vertices, or sometimes nodes.
A GraphRAG approach leverages the structured nature of graph databases
to give greater depth and context of retrieved information
about networks or complex relationships.
The first step in setting up our system is creating and populating the knowledge graph.
We'll be using an LLM to assist in creating the knowledge graph.
Given unstructured text data,
the LLM will extract entities and relationships from the data,
transforming the data into structured data,
which can then be inserted into the knowledge graph.
After the knowledge graph is created,
we'll be using the LLM to query data from the knowledge graph
and return the response in natural language.
Cypher is the query language for a graph database.
When a user asks a question in natural language,
the LLM will generate the Cypher query
to extract that information from the knowledge graph.
The Cypher query then gets executed on the database and the results are returned to the LLM.
The last step is for the LLM to interpret the results of the Cypher query
in the context of the natural language question
and return a natural language response.
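The three-step loop just described can be sketched in plain Python. Everything here is a stand-in: the fake LLM calls return canned strings and the "database" is a list of triples, so only the control flow mirrors the real pipeline.

```python
# Hypothetical stand-ins: llm_generate_cypher / llm_compose_answer fake the
# LLM calls, and run_query fakes the graph database. Only the three-step
# control flow of the GraphRAG pipeline is real here.

TRIPLES = [  # toy knowledge graph as (subject, relationship, object)
    ("John", "TITLE", "Director"),
    ("John", "COLLABORATES", "Jane"),
]

def llm_generate_cypher(question: str) -> str:
    # A real system prompts an LLM here; we return a canned Cypher query.
    return "MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id"

def run_query(cypher: str) -> list[str]:
    # Stand-in for executing the Cypher query on the graph database.
    return [o for s, r, o in TRIPLES if s == "John" and r == "TITLE"]

def llm_compose_answer(question: str, results: list[str]) -> str:
    # Stand-in for the LLM phrasing the raw result in natural language.
    return f"John's title is {results[0]}."

def graph_rag_answer(question: str) -> str:
    cypher = llm_generate_cypher(question)        # 1. question -> Cypher
    results = run_query(cypher)                   # 2. Cypher -> raw results
    return llm_compose_answer(question, results)  # 3. results -> NL answer

print(graph_rag_answer("What is John's title?"))  # -> John's title is Director.
```

In the real system, steps 1 and 3 are separate LLM invocations with different prompts, which is why two prompts are configured later in the notebook.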
For this example, we'll need an API key and project ID.
The link to this notebook is in the description below.
In the notebook, you'll find instructions for retrieving these credentials.
We'll be using Neo4j, an open-source graph database.
But any graph database can be used to create the knowledge graph.
We'll create a local instance of the database using a containerization tool.
I'll be using Podman,
but you can use any containerization tool.
For example, Docker,
as long as it allows you to create a Neo4j instance.
If you don't have a containerization tool already,
take a moment to install one.
After installing, initialize and start a machine.
My machine's already initialized, so I'm just starting it here.
Once you have this running,
you can start a database instance
with this configuration.
We need credentials to access the database.
So, I'm setting a name and password here.
We also need to include the APOC library as a plugin
in order to enable additional functionality for working with data and graphs.
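The transcript doesn't show the exact container command. A typical invocation for the official Neo4j image might look like the config fragment below; the container name, password, and version tag are placeholders, and `NEO4J_PLUGINS='["apoc"]'` is how recent Neo4j images enable APOC.

```shell
# Placeholder name/password; adjust the image tag to your Neo4j version.
podman run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  -e NEO4J_PLUGINS='["apoc"]' \
  neo4j:latest
```

Port 7474 serves the browser UI used later for visualization; 7687 is the Bolt port that the Python driver connects to.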
It looks like our graph database is up and running now.
It's a good practice to create a fresh virtual environment for this project.
I'm using Python 3.11.3 here.
In the Python environment for your notebook, install the following Python libraries.
We'll be using the OS and getpass modules to set up credentials.
We'll use LangChain's document class to store the text for input into our graph database
and the LLM graph transformer to create a graph from our text input.
To interact with and query the graph database, we'll use the LangChain Neo4j module
and its accompanying
GraphCypherQAChain class.
To craft our prompts for the LLM,
we'll use LangChain's prompt template
and FewShotPromptTemplate.
We'll use the LangChain IBM and IBM watsonx.ai
modules
to interact with the LLM and to set the parameters for our models.
We'll need to set up our credentials using the API key and project ID that we retrieved earlier.
We'll also need to set the URL from which we'll access these services.
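A common pattern for this credentials step is sketched below. The environment-variable names (`WATSONX_APIKEY`, `WATSONX_PROJECT_ID`) are our own choice for this sketch, not names the notebook mandates, and you should substitute the endpoint URL for your own region.

```python
import os
from getpass import getpass

def get_credential(var: str, prompt: str) -> str:
    # Prefer a preset environment variable; fall back to an interactive
    # prompt so the secret never appears in the notebook itself.
    value = os.environ.get(var)
    return value if value is not None else getpass(prompt)

# Example endpoint for the us-south region; use your region's URL.
WATSONX_URL = "https://us-south.ml.cloud.ibm.com"

# api_key = get_credential("WATSONX_APIKEY", "watsonx.ai API key: ")
# project_id = get_credential("WATSONX_PROJECT_ID", "Project ID: ")
```

Keeping credentials in environment variables also makes the notebook safe to share or commit without leaking secrets.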
Now that our environment is set up, we can create the knowledge graph.
First, we need to create a connection to the local database instance that we started earlier.
Next, we define our data for input into the knowledge graph.
In this case,
the text describes employees at a company,
groups they work in
and their job titles.
We'll use this set of relationships to test the graph generating capabilities of the LLM.
But you don't have to limit your data to straightforward examples of relationship data.
GraphRAG systems have been shown to be successful
in retrieval and summarization tasks for far more complex narrative and connected data.
Now we'll configure our LLM,
which will generate text describing the graph.
The LLM temperature should be fairly low
and the number of tokens high
to encourage the model to generate as much detail as possible
without hallucinating entities or relationships that aren't present.
One of the most powerful LLM use cases is transforming unstructured text data into structured data.
The LLM will transform our text input string
into a structure of nodes and relationships that we can use to populate the knowledge graph.
The LLM graph transformer
allows you to set the kinds of nodes and relationships you'd like the LLM to generate.
Restricting the LLM
to just those entities
makes it more likely that you'll get a good representation of the knowledge in a graph.
Given our text input,
we set the allowed nodes to person, title and group.
We also set the allowed relationships to
title,
collaborates and group.
We use the document class to prepare our text to be added to the graph documents.
The call to `convert_to_graph_documents`
produces graph documents, a structured representation of the entities and relationships found in the text.
We can inspect this graph documents object
to see how the LLM generated nodes and relationships from the text,
representing the relevant context and relevant entities.
Now that we have the data in the correct format,
we can insert these nodes and edges into the graph database
using the `add_graph_documents` method.
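Conceptually, each extracted relationship maps onto Cypher `MERGE` statements at insertion time. The sketch below is a simplified stand-in for what LangChain does internally when inserting graph documents; the real implementation handles properties, parameterization, and batching differently.

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: str
    type: str  # e.g. "Person", "Title", "Group"

@dataclass
class Relationship:
    source: Node
    target: Node
    type: str  # e.g. "TITLE", "COLLABORATES", "GROUP"

def to_cypher(rel: Relationship) -> str:
    # MERGE creates each node and the edge only if it doesn't already
    # exist, so re-inserting the same triple is idempotent.
    return (
        f"MERGE (a:{rel.source.type} {{id: '{rel.source.id}'}}) "
        f"MERGE (b:{rel.target.type} {{id: '{rel.target.id}'}}) "
        f"MERGE (a)-[:{rel.type}]->(b)"
    )

john = Node("John", "Person")
director = Node("Director", "Title")
print(to_cypher(Relationship(john, director, "TITLE")))
```

Note the use of `MERGE` rather than `CREATE`: if "John" appears in several extracted relationships, he should still become a single node in the graph.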
Once the graph data is created,
we can visualize it using our browser.
In order to query our graph database,
we'll use Cypher queries.
Cypher is, for a graph database, what SQL is for a relational database.
Instead of operating on tables,
Cypher queries operate on the nodes, relationships and paths in the graph database.
To visualize the graph in the browser,
I ran this query which shows us all the nodes and relationships in the graph.
On a larger knowledge graph, this visualization might be too complex.
But for our example, it works to verify the structure of the graph.
It looks like the relationships in our input text have been correctly represented here in the knowledge graph.
We can also examine the schema and data types in the database
using the `get_schema` property of the graph.
Without the LLM,
creating the knowledge graph might be a manual process to diagram entities and relationships from unstructured text.
Now that we have our knowledge graph,
we can query it,
taking advantage of the graph structure and graph database retrieval capabilities
to derive valuable information
over the data
in a more holistic way than semantic search on a vector database can provide.
Now we'll use natural language to query the knowledge graph.
The natural language query will be passed to the LLM,
which is going to translate the query into Cypher syntax.
This Cypher query will be executed on the database
and the result will be returned to the LLM using natural language.
Prompting the LLM correctly requires some prompt engineering.
We'll think of the prompting step in two parts, so we'll need to set up two different prompts.
The first prompt gives the LLM instructions for generating a correct Cypher query
from the user's natural language query.
LangChain provides a FewShotPromptTemplate
that can be used to give examples to the LLM in the prompt,
encouraging the LLM to write correct and succinct Cypher syntax.
This code block gives several examples
of questions and corresponding Cypher queries that the LLM should use as a guide.
It also constrains the output of the model to only the query.
An overly chatty LLM might add in extra information
that would lead to invalid Cypher queries.
Using a prefix with a specified task and instructions
also helps to constrain the model behavior
and makes it more likely that the LLM will output correct Cypher syntax.
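A few-shot prompt of this kind is essentially a prefix with the task and instructions, the formatted examples, and a suffix holding the user's question. Below is a plain-Python stand-in for that assembly; the example questions and queries are illustrative, not the notebook's exact ones.

```python
# Prefix stating the task and the output constraint ("only the query").
PREFIX = (
    "Task: Generate a Cypher query to answer the question.\n"
    "Instructions: Output only the Cypher query, with no explanation.\n"
)

# Illustrative few-shot examples (question -> correct Cypher).
EXAMPLES = [
    {"question": "What is John's title?",
     "query": "MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id"},
    {"question": "Who does Jane collaborate with?",
     "query": "MATCH (p:Person {id: 'Jane'})-[:COLLABORATES]->(c:Person) RETURN c.id"},
]

def build_cypher_prompt(question: str) -> str:
    # Mirrors a few-shot template: prefix + formatted examples + suffix.
    shots = "\n".join(
        f"Question: {ex['question']}\nCypher: {ex['query']}\n" for ex in EXAMPLES
    )
    return f"{PREFIX}\n{shots}\nQuestion: {question}\nCypher:"

print(build_cypher_prompt("What group is Jane in?"))
```

Ending the prompt with `Cypher:` invites the model to complete with the query alone, reinforcing the "no extra chatter" constraint in the instructions.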
The second prompt provides the LLM instructions for translating the result of the Cypher query
into natural language
given the original natural language question from the user.
We employ a few-shot prompting strategy here, too,
providing examples to the LLM for how to do this.
We call this prompt the QA prompt.
Essentially,
it describes how the LLM should answer the question with the information returned from the graph database.
Now we'll bundle together our Cypher prompt, our QA prompt,
our knowledge graph
and an LLM to create the question answering chain,
using the graph Cypher QA chain class.
We're implementing a simple retrieval procedure here.
But there are ways to improve on this strategy by providing additional context to the LLM
about groupings and summaries of like nodes within the knowledge graph.
Using a temperature of zero and a length penalty
encourages the LLM to keep the generated Cypher query short and straightforward.
If you're wondering why we're configuring a different LLM here,
it's because we're setting different parameters for retrieval of information from the graph
than we used earlier for constructing the graph.
Now we can query the data by invoking the chain with a natural language question.
If you try this out, your responses may be slightly different than what we're seeing here
because LLMs are not strictly deterministic.
Here's our first question.
What is John's title?
We can see the Cypher query generated by this LLM
to retrieve the information,
the result of the Cypher query
and the natural language response from the LLM
as Director of the Digital Marketing Group.
Looks good.
Let's try a slightly more complex question.
Who does John collaborate with?
Again, the LLM generates a Cypher query to retrieve the correct information from the graph database
and returns the correct response.
John collaborates with Jane.
This looks good.
Let's ask the chain about a group relationship.
What group is Jane in?
Jane is in the executive group. Okay.
Let's try one more that requires the LLM to give us two outputs.
Who does Jane collaborate with?
Jane collaborates with Sharon and John.
Even for this more difficult query,
we can see the chain correctly identifies both of the collaborators.
Beyond retrieving the simple titles and relationships from our input string in this example,
GraphRAG can summarize and retrieve contextual information
over the whole structure of the knowledge graph.
So, how is this different from a VectorRAG system?
Firstly, instead of calculating embeddings and storing the resulting embedded information in a vector database,
a GraphRAG system transforms unstructured text data
into structured data
using an LLM.
And a knowledge graph is populated with this data.
The second difference is in the retrieval step.
Instead of performing semantic search and returning results with semantic similarity,
the LLM generates a Cypher query in response to the user's natural language query,
which gets executed on the graph database containing the knowledge graph.
The GraphRAG system avoids one of the limitations of VectorRAG.
If you think about the way VectorRAG returns top semantic search results to a query,
you can recognize that VectorRAG can't provide the LLM with knowledge over the whole text corpus in response to one query.
It's limited to the top semantic search results.
GraphRAG can leverage graph indexes,
which store summaries about groupings of like nodes
to provide summarization over the whole corpus of text within one query result.
In practice, you may want both the capabilities of retrieval from a semantic search on a vector database
and a graph search over a knowledge graph.
It's possible to build these sorts of HybridRAG systems using both vector databases and graph databases.
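The retrieval difference can be made concrete with a toy comparison. The similarity scores below are fabricated and both retrievers are simplified stand-ins; the point is the shape of each result: top-k search returns a fixed number of chunks, while a graph query aggregates over every matching edge.

```python
# Toy corpus of text chunks, as a vector database would store them.
CHUNKS = ["John is a Director.", "Jane leads the Executive group.",
          "Sharon collaborates with Jane.", "The company has three groups."]

def vector_rag_retrieve(scores: list[float], k: int = 2) -> list[str]:
    # Top-k semantic search: only the k highest-scoring chunks come back,
    # no matter how many chunks are actually relevant.
    ranked = sorted(zip(scores, CHUNKS), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# The same facts as graph edges.
EDGES = [("John", "COLLABORATES", "Jane"), ("Jane", "COLLABORATES", "Sharon")]

def graph_rag_retrieve(person: str) -> list[str]:
    # A graph query can traverse every matching edge in the whole corpus,
    # in both directions, in a single retrieval.
    return sorted({t for s, r, t in EDGES if s == person and r == "COLLABORATES"}
                  | {s for s, r, t in EDGES if t == person and r == "COLLABORATES"})

print(vector_rag_retrieve([0.9, 0.2, 0.8, 0.1]))  # 2 of 4 chunks
print(graph_rag_retrieve("Jane"))                 # ['John', 'Sharon']
```

A hybrid system would route a query to one retriever or the other (or merge both result sets) depending on whether it needs passage-level context or relationship-level aggregation.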
Check out the GitHub link in the description below to try out GraphRAG for yourself.