GraphRAG: Populate and Query Knowledge Graph
Key Points
- GraphRAG replaces vector search with knowledge graphs, using graph databases to capture both entities (vertices) and their relationships (edges) for richer contextual retrieval.
- An LLM first extracts entities and relationships from unstructured text, converts them into structured triples, and populates a Neo4j (or any) graph database.
- When a user asks a natural‑language question, the LLM generates a Cypher query, runs it against the graph, and then translates the query results back into a natural‑language answer.
- The demo sets up a local Neo4j instance via a container (e.g., Podman/Docker), installs required Python packages (including LangChain), and configures credentials and the APOC plugin for advanced graph operations.
- A fresh Python virtual environment is recommended, and the notebook provides step‑by‑step instructions for obtaining API keys, initializing the database, and running the end‑to‑end GraphRAG pipeline.
Sections
- Populating and Querying GraphRAG with LLM - The speaker demonstrates using an LLM to extract entities and relationships from raw text, populate a knowledge graph, generate Cypher queries, and retrieve natural‑language answers through Graph Retrieval Augmented Generation.
- Configuring LangChain Graph Pipeline - The speaker walks through setting up a fresh Python 3.11 virtual environment, installing LangChain, Neo4j, and IBM watsonx.ai libraries, configuring API credentials, connecting to a local Neo4j database, and using LLM‑driven transformers to build a knowledge graph from employee‑relationship text.
- LLM‑Generated Knowledge Graph Workflow - The excerpt outlines how allowed relationship types are defined, text is transformed into graph documents, inserted into a graph database via `add_graph_documents`, and then visualized and queried with Cypher to verify that the LLM‑derived entities and relationships are correctly represented.
- Guiding LLMs for Cypher Queries - The passage explains how to use prefixed few‑shot prompts and strict output constraints to force an LLM to generate correct Cypher queries and translate results into natural‑language answers within a graph QA chain.
- GraphRAG vs VectorRAG: Core Differences - The passage explains that GraphRAG converts text into a structured knowledge graph and uses LLM‑generated Cypher queries for retrieval, offering corpus‑wide summarization unlike VectorRAG’s limited top‑k semantic search, and suggests hybrid approaches can combine both strengths.
Source
- **Source:** [https://www.youtube.com/watch?v=Za7aG-ooGLQ](https://www.youtube.com/watch?v=Za7aG-ooGLQ)
- **Duration:** 00:14:19

Section Timestamps
- [00:00:00](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=0s) Populating and Querying GraphRAG with LLM
- [00:03:10](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=190s) Configuring LangChain Graph Pipeline
- [00:06:15](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=375s) LLM‑Generated Knowledge Graph Workflow
- [00:09:25](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=565s) Guiding LLMs for Cypher Queries
- [00:12:33](https://www.youtube.com/watch?v=Za7aG-ooGLQ&t=753s) GraphRAG vs VectorRAG: Core Differences

Full Transcript
Today I'm going to show you how to populate a knowledge graph and query it using an LLM.
Graph retrieval augmented generation,
or GraphRAG,
is emerging as a powerful alternative to vector search methods.
Instead of using a vector database,
GraphRAG systems store data in the format of a knowledge graph
using a graph database.
In a knowledge graph, the relationships between data points, called edges,
are as meaningful as the data points themselves,
called vertices, or sometimes nodes.
A GraphRAG approach leverages the structured nature of graph databases
to give greater depth and context of retrieved information
about networks or complex relationships.
The first step in setting up our system is creating and populating the knowledge graph.
We'll be using an LLM to assist in creating the knowledge graph.
Given unstructured text data,
the LLM will extract entities and relationships from the data,
transforming the data into structured data,
which can then be inserted into the knowledge graph.
After the knowledge graph is created,
we'll be using the LLM to query data from the knowledge graph
and return the response in natural language.
Cypher is the query language for a graph database.
When a user asks a question in natural language,
the LLM will generate the Cypher query
to extract that information from the knowledge graph.
The Cypher query then gets executed on the database and the results are returned to the LLM.
The last step is for the LLM to interpret the results of the Cypher query
in the context of the natural language question
and return a natural language response.
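The three-step loop just described can be sketched in plain Python. Everything here is a stand-in: the fake LLM calls return canned strings and the "database" is a list of triples, so only the control flow mirrors the real pipeline.

```python
# Hypothetical stand-ins: llm_generate_cypher / llm_compose_answer fake the
# LLM calls, and run_query fakes the graph database. Only the three-step
# control flow of the GraphRAG pipeline is real here.

TRIPLES = [  # toy knowledge graph as (subject, relationship, object)
    ("John", "TITLE", "Director"),
    ("John", "COLLABORATES", "Jane"),
]

def llm_generate_cypher(question: str) -> str:
    # A real system prompts an LLM here; we return a canned Cypher query.
    return "MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id"

def run_query(cypher: str) -> list[str]:
    # Stand-in for executing the Cypher query on the graph database.
    return [o for s, r, o in TRIPLES if s == "John" and r == "TITLE"]

def llm_compose_answer(question: str, results: list[str]) -> str:
    # Stand-in for the LLM phrasing the raw result in natural language.
    return f"John's title is {results[0]}."

def graph_rag_answer(question: str) -> str:
    cypher = llm_generate_cypher(question)        # 1. question -> Cypher
    results = run_query(cypher)                   # 2. Cypher -> raw results
    return llm_compose_answer(question, results)  # 3. results -> NL answer

print(graph_rag_answer("What is John's title?"))  # -> John's title is Director.
```

In the real system, steps 1 and 3 are separate LLM invocations with different prompts, which is why two prompts are configured later in the notebook.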
For this example, we'll need an API key and project ID.
The link to this notebook is in the description below.
In the notebook, you'll find instructions for retrieving these credentials.
We'll be using Neo4j, an open-source graph database.
But any graph database can be used to create the knowledge graph.
We'll create a local instance of the database using a containerization tool.
I'll be using Podman,
but you can use any containerization tool.
For example, Docker,
as long as it allows you to create a Neo4j instance.
If you don't have a containerization tool already,
take a moment to install one.
After installing, initialize and start a machine.
My machine's already initialized, so I'm just starting it here.
Once you have this running,
you can start a database instance
with this configuration.
We need credentials to access the database.
So, I'm setting a name and password here.
We also need to include the APOC library as a plugin
in order to enable additional functionality for working with data and graphs.
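The transcript doesn't show the exact container command. A typical invocation for the official Neo4j image might look like the config fragment below; the container name, password, and version tag are placeholders, and `NEO4J_PLUGINS='["apoc"]'` is how recent Neo4j images enable APOC.

```shell
# Placeholder name/password; adjust the image tag to your Neo4j version.
podman run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  -e NEO4J_PLUGINS='["apoc"]' \
  neo4j:latest
```

Port 7474 serves the browser UI used later for visualization; 7687 is the Bolt port that the Python driver connects to.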
It looks like our graph database is up and running now.
It's a good practice to create a fresh virtual environment for this project.
I'm using Python 3.11.3 here.
In the Python environment for your notebook, install the following Python libraries.
We'll be using the OS and getpass modules to set up credentials.
We'll use LangChain's document class to store the text for input into our graph database
and the LLM graph transformer to create a graph from our text input.
To interact with and query the graph database, we'll use the LangChain Neo4j module
and its accompanying
GraphCypherQAChain class.
To craft our prompts for the LLM,
we'll use LangChain's prompt template
and FewShotPromptTemplate.
We'll use the LangChain IBM and IBM watsonx.ai
modules
to interact with the LLM and to set the parameters for our models.
We'll need to set up our credentials using the API key and project ID that we retrieved earlier.
We'll also need to set the URL from which we'll access these services.
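A common pattern for this credentials step is sketched below. The environment-variable names (`WATSONX_APIKEY`, `WATSONX_PROJECT_ID`) are our own choice for this sketch, not names the notebook mandates, and you should substitute the endpoint URL for your own region.

```python
import os
from getpass import getpass

def get_credential(var: str, prompt: str) -> str:
    # Prefer a preset environment variable; fall back to an interactive
    # prompt so the secret never appears in the notebook itself.
    value = os.environ.get(var)
    return value if value is not None else getpass(prompt)

# Example endpoint for the us-south region; use your region's URL.
WATSONX_URL = "https://us-south.ml.cloud.ibm.com"

# api_key = get_credential("WATSONX_APIKEY", "watsonx.ai API key: ")
# project_id = get_credential("WATSONX_PROJECT_ID", "Project ID: ")
```

Keeping credentials in environment variables also makes the notebook safe to share or commit without leaking secrets.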
Now that our environment is set up, we can create the knowledge graph.
First, we need to create a connection to the local database instance that we started earlier.
Next, we define our data for input into the knowledge graph.
In this case,
the text describes employees at a company,
groups they work in
and their job titles.
We'll use this set of relationships to test the graph generating capabilities of the LLM.
But you don't have to limit your data to straightforward examples of relationship data.
GraphRAG systems have been shown to be successful
in retrieval and summarization tasks for far more complex narrative and connected data.
Now we'll configure our LLM,
which will generate text describing the graph.
The LLM temperature should be fairly low
and the number of tokens high
to encourage the model to generate as much detail as possible
without hallucinating entities or relationships that aren't present.
One of the most powerful LLM use cases is transforming unstructured text data into structured data.
The LLM will transform our text input string
into a structure of nodes and relationships that we can use to populate the knowledge graph.
The LLM graph transformer
allows you to set the kinds of nodes and relationships you'd like the LLM to generate.
Restricting the LLM
to just those entities
makes it more likely that you'll get a good representation of the knowledge in a graph.
Given our text input,
we set the allowed nodes to person, title and group.
We also set the allowed relationships to
title,
collaborates and group.
We use the document class to prepare our text to be added to the graph documents.
The call to `convert_to_graph_documents`
produces graph documents, a structured representation of the entities and relationships found in the text.
We can inspect this graph documents object
to see how the LLM generated nodes and relationships from the text,
representing the relevant context and relevant entities.
Now that we have the data in the correct format,
we can insert these nodes and edges into the graph database
using the `add_graph_documents` method.
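Conceptually, each extracted relationship maps onto Cypher `MERGE` statements at insertion time. The sketch below is a simplified stand-in for what LangChain does internally when inserting graph documents; the real implementation handles properties, parameterization, and batching differently.

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: str
    type: str  # e.g. "Person", "Title", "Group"

@dataclass
class Relationship:
    source: Node
    target: Node
    type: str  # e.g. "TITLE", "COLLABORATES", "GROUP"

def to_cypher(rel: Relationship) -> str:
    # MERGE creates each node and the edge only if it doesn't already
    # exist, so re-inserting the same triple is idempotent.
    return (
        f"MERGE (a:{rel.source.type} {{id: '{rel.source.id}'}}) "
        f"MERGE (b:{rel.target.type} {{id: '{rel.target.id}'}}) "
        f"MERGE (a)-[:{rel.type}]->(b)"
    )

john = Node("John", "Person")
director = Node("Director", "Title")
print(to_cypher(Relationship(john, director, "TITLE")))
```

Note the use of `MERGE` rather than `CREATE`: if "John" appears in several extracted relationships, he should still become a single node in the graph.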
Once the graph data is created,
we can visualize it using our browser.
In order to query our graph database,
we'll use Cypher queries.
Cypher is, for a graph database, what SQL is for a relational database.
Instead of operating on tables,
Cypher queries operate on the nodes, relationships and paths in the graph database.
To visualize the graph in the browser,
I ran this query which shows us all the nodes and relationships in the graph.
On a larger knowledge graph, this visualization might be too complex.
But for our example, it works to verify the structure of the graph.
It looks like the relationships in our input text have been correctly represented here in the knowledge graph.
We can also examine the schema and data types in the database
using the `get_schema` property of the graph.
Without the LLM,
creating the knowledge graph might be a manual process to diagram entities and relationships from unstructured text.
Now that we have our knowledge graph,
we can query it,
taking advantage of the graph structure and graph database retrieval capabilities
to derive valuable information
over the data
in a more holistic way than semantic search on a vector database can provide.
Now we'll use natural language to query the knowledge graph.
The natural language query will be passed to the LLM,
which is going to translate the query into Cypher syntax.
This Cypher query will be executed on the database
and the result will be returned to the LLM using natural language.
Prompting the LLM correctly requires some prompt engineering.
We'll think of the prompting step in two parts, so we'll need to set up two different prompts.
The first prompt gives the LLM instructions for generating a correct Cypher query
from the user's natural language query.
LangChain provides a FewShotPromptTemplate
that can be used to give examples to the LLM in the prompt,
encouraging the LLM to write correct and succinct Cypher syntax.
This code block gives several examples
of questions and corresponding Cypher queries that the LLM should use as a guide.
It also constrains the output of the model to only the query.
An overly chatty LLM might add in extra information
that would lead to invalid Cypher queries.
Using a prefix with a specified task and instructions
also helps to constrain the model behavior
and makes it more likely that the LLM will output correct Cypher syntax.
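A few-shot prompt of this kind is essentially a prefix with the task and instructions, the formatted examples, and a suffix holding the user's question. Below is a plain-Python stand-in for that assembly; the example questions and queries are illustrative, not the notebook's exact ones.

```python
# Prefix stating the task and the output constraint ("only the query").
PREFIX = (
    "Task: Generate a Cypher query to answer the question.\n"
    "Instructions: Output only the Cypher query, with no explanation.\n"
)

# Illustrative few-shot examples (question -> correct Cypher).
EXAMPLES = [
    {"question": "What is John's title?",
     "query": "MATCH (p:Person {id: 'John'})-[:TITLE]->(t:Title) RETURN t.id"},
    {"question": "Who does Jane collaborate with?",
     "query": "MATCH (p:Person {id: 'Jane'})-[:COLLABORATES]->(c:Person) RETURN c.id"},
]

def build_cypher_prompt(question: str) -> str:
    # Mirrors a few-shot template: prefix + formatted examples + suffix.
    shots = "\n".join(
        f"Question: {ex['question']}\nCypher: {ex['query']}\n" for ex in EXAMPLES
    )
    return f"{PREFIX}\n{shots}\nQuestion: {question}\nCypher:"

print(build_cypher_prompt("What group is Jane in?"))
```

Ending the prompt with `Cypher:` invites the model to complete with the query alone, reinforcing the "no extra chatter" constraint in the instructions.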
The second prompt provides the LLM instructions for translating the result of the Cypher query
into natural language
given the original natural language question from the user.
We employ a few-shot prompting strategy here, too,
providing examples to the LLM for how to do this.
We call this prompt the QA prompt.
Essentially,
it describes how the LLM should answer the question with the information returned from the graph database.
Now we'll bundle together our Cypher prompt, our QA prompt,
our knowledge graph
and an LLM to create the question answering chain,
using the graph Cypher QA chain class.
We're implementing a simple retrieval procedure here.
But there are ways to improve on this strategy by providing additional context to the LLM
about groupings and summaries of like nodes within the knowledge graph.
Using a temperature of zero and a length penalty
encourages the LLM to keep the generated Cypher query short and straightforward.
If you're wondering why we're configuring a different LLM here,
it's because we're setting different parameters for retrieval of information from the graph
than we used earlier for constructing the graph.
Now we can query the data by invoking the chain with a natural language question.
If you try this out, your responses may be slightly different than what we're seeing here
because LLMs are not strictly deterministic.
Here's our first question.
What is John's title?
We can see the Cypher query generated by this LLM
to retrieve the information,
the result of the Cypher query
and the natural language response from the LLM
as Director of the Digital Marketing Group.
Looks good.
Let's try a slightly more complex question.
Who does John collaborate with?
Again, the LLM generates a Cypher query to retrieve the correct information from the graph database
and returns the correct response.
John collaborates with Jane.
This looks good.
Let's ask the chain about a group relationship.
What group is Jane in?
Jane is in the executive group. Okay.
Let's try one more that requires the LLM to give us two outputs.
Who does Jane collaborate with?
Jane collaborates with Sharon and John.
Even for this more difficult query,
we can see the chain correctly identifies both of the collaborators.
Beyond retrieving the simple titles and relationships from our input string in this example,
GraphRAG can summarize and retrieve contextual information
over the whole structure of the knowledge graph.
So, how is this different from a VectorRAG system?
Firstly, instead of calculating embeddings and storing the resulting embedded information in a vector database,
a GraphRAG system transforms unstructured text data
into structured data
using an LLM.
And a knowledge graph is populated with this data.
The second difference is in the retrieval step.
Instead of performing semantic search and returning results with semantic similarity,
the LLM generates a Cypher query in response to the user's natural language query,
which gets executed on the graph database containing the knowledge graph.
The GraphRAG system avoids one of the limitations of VectorRAG.
If you think about the way VectorRAG returns top semantic search results to a query,
you can recognize that VectorRAG can't provide the LLM with knowledge over the whole text corpus in response to one query.
It's limited to the top semantic search results.
GraphRAG can leverage graph indexes,
which store summaries about groupings of like nodes
to provide summarization over the whole corpus of text within one query result.
In practice, you may want both the capabilities of retrieval from a semantic search on a vector database
and a graph search over a knowledge graph.
It's possible to build these sorts of HybridRAG systems using both vector databases and graph databases.
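The retrieval difference can be made concrete with a toy comparison. The similarity scores below are fabricated and both retrievers are simplified stand-ins; the point is the shape of each result: top-k search returns a fixed number of chunks, while a graph query aggregates over every matching edge.

```python
# Toy corpus of text chunks, as a vector database would store them.
CHUNKS = ["John is a Director.", "Jane leads the Executive group.",
          "Sharon collaborates with Jane.", "The company has three groups."]

def vector_rag_retrieve(scores: list[float], k: int = 2) -> list[str]:
    # Top-k semantic search: only the k highest-scoring chunks come back,
    # no matter how many chunks are actually relevant.
    ranked = sorted(zip(scores, CHUNKS), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# The same facts as graph edges.
EDGES = [("John", "COLLABORATES", "Jane"), ("Jane", "COLLABORATES", "Sharon")]

def graph_rag_retrieve(person: str) -> list[str]:
    # A graph query can traverse every matching edge in the whole corpus,
    # in both directions, in a single retrieval.
    return sorted({t for s, r, t in EDGES if s == person and r == "COLLABORATES"}
                  | {s for s, r, t in EDGES if t == person and r == "COLLABORATES"})

print(vector_rag_retrieve([0.9, 0.2, 0.8, 0.1]))  # 2 of 4 chunks
print(graph_rag_retrieve("Jane"))                 # ['John', 'Sharon']
```

A hybrid system would route a query to one retriever or the other (or merge both result sets) depending on whether it needs passage-level context or relationship-level aggregation.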
Check out the GitHub link in the description below to try out GraphRAG for yourself.