GraphRAG: Populate and Query Knowledge Graph

Key Points

  • GraphRAG replaces vector search with knowledge graphs, using graph databases to capture both entities (vertices) and their relationships (edges) for richer contextual retrieval.
  • An LLM first extracts entities and relationships from unstructured text, converts them into structured triples, and populates a Neo4j (or any) graph database.
  • When a user asks a natural‑language question, the LLM generates a Cypher query, runs it against the graph, and then translates the query results back into a natural‑language answer.
  • The demo sets up a local Neo4j instance via a container (e.g., Podman/Docker), installs required Python packages (including LangChain), and configures credentials and the APOC plugin for advanced graph operations.
  • A fresh Python virtual environment is recommended, and the notebook provides step‑by‑step instructions for obtaining API keys, initializing the database, and running the end‑to‑end GraphRAG pipeline.

**Source:** [https://www.youtube.com/watch?v=Za7aG-ooGLQ](https://www.youtube.com/watch?v=Za7aG-ooGLQ)
**Duration:** 00:14:19

## Sections

- [00:00:00](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=0s) **Populating and Querying GraphRAG with LLM** - The speaker demonstrates using an LLM to extract entities and relationships from raw text, populate a knowledge graph, generate Cypher queries, and retrieve natural-language answers through Graph Retrieval Augmented Generation.
- [00:03:10](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=190s) **Configuring LangChain Graph Pipeline** - The speaker walks through setting up a fresh Python 3.11 virtual environment, installing LangChain, Neo4j, and IBM watsonx.ai libraries, configuring API credentials, connecting to a local Neo4j database, and using LLM-driven transformers to build a knowledge graph from employee-relationship text.
- [00:06:15](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=375s) **LLM-Generated Knowledge Graph Workflow** - The excerpt outlines how allowed relationship types are defined, text is transformed into graph documents, inserted into a graph database via `add_graph_documents`, and then visualized and queried with Cypher to verify that the LLM-derived entities and relationships are correctly represented.
- [00:09:25](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=565s) **Guiding LLMs for Cypher Queries** - The passage explains how to use prefixed few-shot prompts and strict output constraints to force an LLM to generate correct Cypher queries and translate results into natural-language answers within a graph QA chain.
- [00:12:33](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=753s) **GraphRAG vs VectorRAG: Core Differences** - The passage explains that GraphRAG converts text into a structured knowledge graph and uses LLM-generated Cypher queries for retrieval, offering corpus-wide summarization unlike VectorRAG's limited top-k semantic search, and suggests hybrid approaches can combine both strengths.

## Full Transcript
[0:00] Today I'm going to show you how to populate a knowledge graph and query it using an LLM. Graph retrieval augmented generation, or GraphRAG, is emerging as a powerful alternative to vector search methods. Instead of using a vector database, GraphRAG systems store data in the format of a knowledge graph using a graph database. In a knowledge graph, the relationships between data points, called edges, are as meaningful as the data points themselves, called vertices, or sometimes nodes. A GraphRAG approach leverages the structured nature of graph databases to give greater depth and context to retrieved information about networks or complex relationships.

[0:46] The first step in setting up our system is creating and populating the knowledge graph. We'll be using an LLM to assist in creating it. Given unstructured text data, the LLM will extract entities and relationships, transforming the data into structured data, which can then be inserted into the knowledge graph.

[1:11] After the knowledge graph is created, we'll use the LLM to query data from the knowledge graph and return the response in natural language. Cypher is the query language for a graph database. When a user asks a question in natural language, the LLM will generate the Cypher query to extract that information from the knowledge graph. The Cypher query then gets executed on the database, and the results are returned to the LLM. The last step is for the LLM to interpret the results of the Cypher query in the context of the natural language question and return a natural language response.

[1:51] For this example, we'll need an API key and project ID. The link to this notebook is in the description below. In the notebook, you'll find instructions for retrieving these credentials.

[2:05] We'll be using Neo4j, an open-source graph database.
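The structured data the LLM extracts boils down to subject-relation-object triples. As a rough, framework-free sketch (not the notebook's actual code), the employee knowledge graph described in the video might be represented like this:

```python
# Toy sketch of a knowledge graph as (subject, relation, object) triples.
# The names and relations mirror the employee example in the video; in the
# real demo an LLM extracts these triples from unstructured text.
triples = [
    ("John", "TITLE", "Director"),
    ("John", "GROUP", "Digital Marketing Group"),
    ("John", "COLLABORATES", "Jane"),
    ("Jane", "GROUP", "Executive Group"),
    ("Jane", "COLLABORATES", "Sharon"),
]

# Vertices (nodes) are the data points; edges are the relationships,
# and in a knowledge graph they carry just as much meaning.
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = list(triples)
```

A graph database stores exactly this kind of structure natively, which is what lets Cypher queries traverse relationships directly instead of joining tables.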
[2:09] But any graph database can be used to create the knowledge graph. We'll create a local instance of the database using a containerization tool. I'll be using Podman, but you can use any containerization tool, for example Docker, as long as it allows you to create a Neo4j instance. If you don't have a containerization tool already, take a moment to install one. After installing, initialize and start a machine. My machine's already initialized, so I'm just starting it here.

[2:50] Once you have this running, you can start a database instance with this configuration. We need credentials to access the database, so I'm setting a name and password here. We also need to include the APOC library as a plugin in order to enable additional functionality for working with data and graphs. It looks like our graph database is up and running now.

[3:17] It's good practice to create a fresh virtual environment for this project. I'm using Python 3.11.3 here. In the Python environment for your notebook, install the following Python libraries. We'll be using the `os` and `getpass` modules to set up credentials. We'll use LangChain's `Document` class to store the text for input into our graph database, and the `LLMGraphTransformer` to create a graph from our text input. To interact with and query the graph database, we'll use the LangChain Neo4j module and its accompanying `GraphCypherQAChain` class. To craft our prompts for the LLM, we'll use LangChain's `PromptTemplate` and `FewShotPromptTemplate`. We'll use the LangChain IBM and IBM watsonx.ai modules to interact with the LLM and to set the parameters for our models.

[4:20] We'll need to set up our credentials using the API key and project ID that we retrieved earlier. We'll also need to set the URL from which we'll access these services. Now that our environment is set up, we can create the knowledge graph.
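The credential step above needs only the standard library. A minimal sketch, assuming environment-variable names of my own choosing (`WATSONX_APIKEY`, `WATSONX_PROJECT_ID` are illustrative placeholders, not necessarily what the notebook uses):

```python
import os
import getpass

def get_credential(name: str) -> str:
    """Return a credential from the environment, prompting only if it is absent."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the input, so the key never appears on screen.
        value = getpass.getpass(f"Enter {name}: ")
    return value

# Hypothetical variable names -- check the notebook for the exact ones.
# api_key = get_credential("WATSONX_APIKEY")
# project_id = get_credential("WATSONX_PROJECT_ID")
```

Reading from the environment first keeps keys out of the notebook itself, while the `getpass` fallback still works in an interactive session.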
[4:34] First, we need to create a connection to the local database instance that we started earlier. Next, we define our data for input into the knowledge graph. In this case, the text describes employees at a company, the groups they work in, and their job titles. We'll use this set of relationships to test the graph-generating capabilities of the LLM. But you don't have to limit your data to straightforward examples of relationship data; GraphRAG systems have been shown to be successful in retrieval and summarization tasks for far more complex narrative and connected data.

[5:15] Now we'll configure our LLM, which will generate text describing the graph. The LLM temperature should be fairly low and the number of tokens high, to encourage the model to generate as much detail as possible without hallucinating entities or relationships that aren't present.

[5:36] One of the most powerful LLM use cases is transforming unstructured text data into structured data. The LLM will transform our text input string into a structure of nodes and relationships that we can use to populate the knowledge graph. The `LLMGraphTransformer` allows you to set the kinds of nodes and relationships you'd like the LLM to generate. Restricting the LLM to just those entities makes it more likely that you'll get a good representation of the knowledge in a graph. Given our text input, we set the allowed nodes to person, title, and group. We also set the allowed relationships to title, collaborates, and group.

[6:22] We use the `Document` class to prepare our text to be added to the graph documents. The call to `convert_to_graph_documents` produces a structured representation of the entities and relationships for the graph. We can inspect this graph documents object to see how the LLM generated nodes and relationships from the text, representing the relevant context and entities.
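The effect of restricting allowed nodes and relationships can be sketched without LangChain at all. This is not the `LLMGraphTransformer`'s implementation, just a toy illustration of the filtering its `allowed_nodes` / `allowed_relationships` settings imply:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    type: str  # e.g. "Person", "Title", "Group"

@dataclass(frozen=True)
class Relationship:
    source: Node
    target: Node
    type: str  # e.g. "TITLE", "COLLABORATES", "GROUP"

ALLOWED_NODES = {"Person", "Title", "Group"}
ALLOWED_RELATIONSHIPS = {"TITLE", "COLLABORATES", "GROUP"}

def filter_graph(nodes, relationships):
    """Keep only the node and relationship types on the allow-lists,
    dropping any edge whose endpoints were filtered out."""
    kept_nodes = [n for n in nodes if n.type in ALLOWED_NODES]
    kept_rels = [
        r for r in relationships
        if r.type in ALLOWED_RELATIONSHIPS
        and r.source in kept_nodes and r.target in kept_nodes
    ]
    return kept_nodes, kept_rels

# A stray "Location" node (hypothetical) is dropped along with its edge.
john = Node("John", "Person")
jane = Node("Jane", "Person")
office = Node("HQ", "Location")
nodes, rels = filter_graph(
    [john, jane, office],
    [Relationship(john, jane, "COLLABORATES"), Relationship(john, office, "BASED_IN")],
)
```

Constraining the output space this way is what makes it more likely the LLM produces a clean, consistent representation of the knowledge.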
[6:53] Now that we have the data in the correct format, we can insert these nodes and edges into the graph database using the `add_graph_documents` method. Once the graph data is created, we can visualize it using our browser.

[7:14] In order to query our graph database, we'll use Cypher queries. Cypher is, for a graph database, what SQL is for a relational database. Instead of operating on tables, Cypher queries operate on the nodes, relationships, and paths in the graph database. To visualize the graph in the browser, I ran this query, which shows us all the nodes and relationships in the graph. On a larger knowledge graph, this visualization might be too complex, but for our example it works to verify the structure of the graph. It looks like the relationships in our input text have been correctly represented here in the knowledge graph.

[8:03] We can also examine the schema and data types in the database using the `get_schema` property of the graph. Without the LLM, creating the knowledge graph might be a manual process of diagramming entities and relationships from unstructured text.

[8:21] Now that we have our knowledge graph, we can query it, taking advantage of the graph structure and graph database retrieval capabilities to derive valuable information over the data in a more holistic way than semantic search can perform on a vector database.

[8:37] Now we'll use natural language to query the knowledge graph. The natural language query will be passed to the LLM, which will translate it into Cypher syntax. This Cypher query will be executed on the database, and the result will be returned to the LLM to be expressed in natural language. Prompting the LLM correctly requires some prompt engineering. We'll think of the prompting step in two parts, so we'll need to set up two different prompts.
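A "show everything" query in Cypher is typically something like `MATCH (n)-[r]->(m) RETURN n, r, m`, where the unbound variables act as wildcards. A crude stdlib-only analogue of that pattern matching, over a toy edge list (not the notebook's code):

```python
# A tiny edge list standing in for the graph database.
edges = [
    ("John", "TITLE", "Director"),
    ("John", "COLLABORATES", "Jane"),
    ("Jane", "COLLABORATES", "Sharon"),
]

def match(edges, source=None, rel=None, target=None):
    """Crude analogue of a Cypher MATCH: None behaves like an
    unbound variable, so match(edges) returns every edge."""
    return [
        (s, r, t) for (s, r, t) in edges
        if source in (None, s) and rel in (None, r) and target in (None, t)
    ]

# Analogue of "MATCH (n)-[r]->(m) RETURN n, r, m": every edge.
all_rows = match(edges)

# Analogue of binding a property and a relationship type,
# e.g. who John collaborates with:
johns_collaborators = [t for (_, _, t) in match(edges, source="John", rel="COLLABORATES")]
```

The point of the comparison to SQL is that the pattern ranges over nodes and relationships directly, rather than over rows joined across tables.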
[9:05] The first prompt gives the LLM instructions for generating a correct Cypher query from the user's natural language query. LangChain provides a `FewShotPromptTemplate` that can be used to give examples to the LLM in the prompt, encouraging the LLM to write correct and succinct Cypher syntax. This code block gives several examples of questions and corresponding Cypher queries that the LLM should use as a guide. It also constrains the output of the model to only the query; an overly chatty LLM might add extra information that would lead to invalid Cypher queries. Using a prefix with a specified task and instructions also helps to constrain the model's behavior and makes it more likely that the LLM will output correct Cypher syntax.

[10:00] The second prompt provides the LLM instructions for translating the result of the Cypher query into natural language, given the original natural language question from the user. We employ a few-shot prompting strategy here too, providing examples to the LLM for how to do this. We call this prompt the QA prompt. Essentially, it describes how the LLM should answer the question with the information returned from the graph database.

[10:31] Now we'll bundle together our Cypher prompt, our QA prompt, our knowledge graph, and an LLM to create the question-answering chain, using the `GraphCypherQAChain` class. We're implementing a simple retrieval procedure here, but there are ways to improve on this strategy by providing additional context to the LLM about groupings and summaries of like nodes within the knowledge graph. Using a temperature of zero and a length penalty encourages the LLM to keep the generated Cypher query short and straightforward.
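What a `FewShotPromptTemplate` produces is, at bottom, a string with a task prefix, the worked examples, and then the new question. A minimal stdlib sketch of that assembly; the example questions and Cypher queries below are illustrative, not copied from the notebook:

```python
# Illustrative few-shot examples pairing a question with its Cypher query.
EXAMPLES = [
    {"question": "What is John's title?",
     "query": "MATCH (p:Person {name:'John'})-[:TITLE]->(t:Title) RETURN t.name"},
    {"question": "Who does John collaborate with?",
     "query": "MATCH (p:Person {name:'John'})-[:COLLABORATES]->(c:Person) RETURN c.name"},
]

# The prefix states the task and constrains the model to emit only the query,
# since extra chatter would produce invalid Cypher.
PREFIX = (
    "Task: generate a Cypher query for the graph described below.\n"
    "Output only the query, with no explanation or extra text.\n"
)

def build_cypher_prompt(question: str) -> str:
    """Assemble a few-shot prompt: prefix, worked examples, then the new question."""
    shots = "\n".join(
        f"Question: {e['question']}\nCypher: {e['query']}" for e in EXAMPLES
    )
    return f"{PREFIX}\n{shots}\nQuestion: {question}\nCypher:"

prompt = build_cypher_prompt("What group is Jane in?")
```

Ending the prompt at `Cypher:` nudges the model to complete with a bare query, which is exactly the output constraint the video describes.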
[11:05] If you're wondering why we're configuring a different LLM here, it's because we're setting different parameters for retrieving information from the graph than we used earlier for constructing it. Now we can query the data by invoking the chain with a natural language question. If you try this out, your responses may differ slightly from what we're seeing here, because LLMs are not strictly deterministic.

[11:29] Here's our first question: What is John's title? We can see the Cypher query generated by the LLM to retrieve the information, the result of the Cypher query, and the natural language response from the LLM: Director of the Digital Marketing Group. Looks good.

[11:48] Let's try a slightly more complex question: Who does John collaborate with? Again, the LLM generates a Cypher query to retrieve the correct information from the graph database and returns the correct response: John collaborates with Jane. This looks good.

[12:07] Let's ask the chain about a group relationship: What group is Jane in? Jane is in the executive group. Okay. Let's try one more that requires the LLM to give us two outputs: Who does Jane collaborate with? Jane collaborates with Sharon and John. Even for this more difficult query, we can see the chain correctly identifies both of the collaborators.

[12:34] Beyond retrieving the simple titles and relationships from our input string in this example, GraphRAG can summarize and retrieve contextual information over the whole structure of the knowledge graph.

[12:46] So, how is this different from a VectorRAG system? Firstly, instead of calculating embeddings and storing the resulting embedded information in a vector database, a GraphRAG system transforms unstructured text data into structured data using an LLM, and a knowledge graph is populated with this data.
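The chain's flow (question, generated Cypher, database result, phrased answer) can be mocked end to end with the LLM steps replaced by canned lookups. The graph data here mirrors the video's examples; everything else is a toy stand-in, not the `GraphCypherQAChain`:

```python
# The employee knowledge graph from the video's examples, as an edge list.
edges = [
    ("John", "TITLE", "Director"),
    ("John", "GROUP", "Digital Marketing Group"),
    ("John", "COLLABORATES", "Jane"),
    ("Jane", "GROUP", "Executive Group"),
    ("Jane", "COLLABORATES", "Sharon"),
    ("Jane", "COLLABORATES", "John"),
]

def run_query(person: str, rel: str) -> list:
    """Stand-in for executing the LLM-generated Cypher query on the database."""
    return [t for (s, r, t) in edges if s == person and r == rel]

def answer(person: str, rel: str) -> str:
    """Stand-in for the QA-prompt step: phrase the query result in English."""
    results = run_query(person, rel)
    verb = {
        "COLLABORATES": "collaborates with",
        "GROUP": "is in",
        "TITLE": "has the title",
    }[rel]
    return f"{person} {verb} {' and '.join(results)}."

print(answer("Jane", "COLLABORATES"))  # -> Jane collaborates with Sharon and John.
```

In the real chain both steps are performed by the LLM, guided by the Cypher prompt and the QA prompt respectively; the mock just makes the division of labor concrete.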
[13:07] The second difference is in the retrieval step. Instead of performing semantic search and returning results with semantic similarity, the LLM generates a Cypher query in response to the user's natural language query, which gets executed on the graph database containing the knowledge graph.

[13:25] The GraphRAG system avoids one of the limitations of VectorRAG. If you think about the way VectorRAG returns top semantic search results to a query, you can recognize that VectorRAG can't provide the LLM with knowledge over the whole text corpus in response to one query; it's limited to the top semantic search results. GraphRAG can leverage graph indexes, which store summaries about groupings of like nodes, to provide summarization over the whole corpus of text within one query result.

[13:58] In practice, you may want both the capabilities of retrieval from a semantic search on a vector database and a graph search over a knowledge graph. It's possible to build these sorts of HybridRAG systems using both vector databases and graph databases.

[14:15] Check out the GitHub link in the description below to try out GraphRAG for yourself.