Building your own RAG
I like to read, and I read a good amount. When I'm reading a series and waiting for the next book to come out, I often forget what happened in the previous book. This seems like a great use case for building a RAG!
What is needed?
- A book in epub
- An epub parser
- Chunker
- Vector store
- LLM to query
A book in epub
There are tons of public domain ebooks you can get started with at Project Gutenberg. For this example I'm going to use Frankenstein by Mary Wollstonecraft Shelley.
An epub parser
For this example I'll be using the ebooklib python library. Each epub can name its internal documents differently, so I suggest taking a look at the documents to see what you actually need to pull out.
from ebooklib import epub
import ebooklib
book = epub.read_epub("frankenstein.epub")
for i in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    print(i.get_name())
Which will print out a list like:
2571340281335387128_84-h-0.htm.xhtml
2571340281335387128_84-h-1.htm.xhtml
2571340281335387128_84-h-2.htm.xhtml
2571340281335387128_84-h-3.htm.xhtml
2571340281335387128_84-h-4.htm.xhtml
2571340281335387128_84-h-5.htm.xhtml
We only care about documents 2571340281335387128_84-h-2.htm.xhtml through 2571340281335387128_84-h-29.htm.xhtml. The rest are the table of contents and other miscellaneous pages.
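If you want to double-check which files are actual chapters, one quick sanity check is to print a short preview of a document's text. This is a minimal sketch reusing the book object from the snippet above; BeautifulSoup is the same library used for HTML-to-text conversion later in this post:
from bs4 import BeautifulSoup

# collect the documents into a list so we can index into them
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))

# preview the first ~200 characters of the third document (index 2),
# which is the first actual chapter file in this epub
soup = BeautifulSoup(items[2].get_body_content(), "html.parser")
print(soup.get_text().strip()[:200])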
Chunker
One of the most important parts of building a RAG is how you chunk, i.e. split up, your large bodies of text. Sure, LLMs now have large context windows you could throw chapters or even entire books at, but that comes with cost and potential retrieval issues. To help with performance, cost, and accuracy, we split our text into smaller chunks. Most embedding models also only support around ~8k tokens, so chunking is a necessity anyway.
For this, I'm using Chonkie's semantic chunker along with OpenAI's text-embedding-3-large embedding model. Semantic chunking tries to group semantically related text into the same chunk, which makes querying faster and more accurate.
from bs4 import BeautifulSoup
from ebooklib import epub
import ebooklib
from chonkie import SemanticChunker, OpenAIEmbeddings
EMBEDDING_MODEL = "text-embedding-3-large"
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
# see the chonkie docs for what each setting here means
chunker = SemanticChunker(
    embedding_model=embeddings,
    threshold=0.6,
    chunk_size=8196,
    similarity_window=5,
    skip_window=3,
)
book = epub.read_epub("frankenstein.epub")
chapter_idx = 1
part = 0
for i in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    # get_items_of_type returns an iterator, so track the document index with a counter
    if part >= 2 and part <= 29:
        soup = BeautifulSoup(i.get_body_content(), "html.parser")
        chapter_text = soup.get_text().strip()
        chunks = chunker.chunk(chapter_text)
        chunk_idx = 0
        for chunk in chunks:
            print(
                f"chunk for chapter: {chapter_idx} chunk: {chunk_idx} tokens: {chunk.token_count}"
            )
            chunk_idx += 1
        chapter_idx += 1
    part += 1
Which will print out something like:
chunk for chapter: 1 chunk: 0 tokens: 77
chunk for chapter: 1 chunk: 1 tokens: 47
chunk for chapter: 1 chunk: 2 tokens: 63
chunk for chapter: 1 chunk: 3 tokens: 67
chunk for chapter: 1 chunk: 4 tokens: 68
chunk for chapter: 1 chunk: 5 tokens: 68
chunk for chapter: 1 chunk: 6 tokens: 48
chunk for chapter: 1 chunk: 7 tokens: 64
Vector store
Now that we have chunked each chapter, we need to generate embeddings and add them to our database. For this I'm using ChromaDB, which is super simple to get started with and installs just like a normal python package.
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
from bs4 import BeautifulSoup
from ebooklib import epub
import ebooklib
from chonkie import SemanticChunker, OpenAIEmbeddings
EMBEDDING_MODEL = "text-embedding-3-large"
db = chromadb.PersistentClient(path="frankenstein_chroma.db")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"), model_name=EMBEDDING_MODEL
)
collection = db.create_collection(name="frankenstein", embedding_function=openai_ef)
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
chunker = SemanticChunker(
    embedding_model=embeddings,
    threshold=0.6,
    chunk_size=8196,
    similarity_window=5,
    skip_window=3,
)
book = epub.read_epub("frankenstein.epub")
chapter_idx = 1
part = 0
for i in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    if part >= 2 and part <= 29:
        soup = BeautifulSoup(i.get_body_content(), "html.parser")
        chapter_text = soup.get_text().strip()
        chunks = chunker.chunk(chapter_text)
        chunk_idx = 0
        for chunk in chunks:
            print(
                f"chunk for chapter: {chapter_idx} chunk: {chunk_idx} tokens: {chunk.token_count}"
            )
            collection.add(
                ids=[f"chapter_{chapter_idx}_{chunk_idx}"],
                documents=[chunk.text],
                metadatas=[
                    {
                        "chapter": chapter_idx,
                        "chunk": chunk_idx,
                        "tokens": chunk.token_count,
                    }
                ],
            )
            chunk_idx += 1
        chapter_idx += 1
    part += 1
LLM to query
Now that we have the book chunked and indexed, we can query it. The basic idea of RAG is to run a vector search over the data and pass the returned documents to the LLM as additional context for answering the question.
The prompt looks something like:
Question: <your question>
Context:
<docs returned from vector store>
Answer using only the context.
We don't need a big model to get good results here. GPT-5-mini seems to do pretty well and is very cheap.
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
from openai import OpenAI
EMBEDDING_MODEL = "text-embedding-3-large"
oai = OpenAI()
db = chromadb.PersistentClient(path="frankenstein_chroma.db")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"), model_name=EMBEDDING_MODEL
)
collection = db.get_collection(name="frankenstein", embedding_function=openai_ef)
query = input("Enter your question: ")
results = collection.query(query_texts=[query], n_results=20)
print(f"Got {len(results['ids'][0])} documents")
context = "\n\n---\n\n".join(results["documents"][0])
SYSTEM_PROMPT = "You are an assistant answering questions about a novel."
user_prompt = f"""Question: {query}
Context:
{context}
Answer using only the context.
"""
response = oai.responses.create(
    model="gpt-5-mini", instructions=SYSTEM_PROMPT, input=user_prompt
)
print(response.output_text)
print(response.usage)
Now when I ask a question like "Who are the main characters?", I get a result like:
From the provided context, the principal characters shown are:
- Victor (the narrator, Victor Frankenstein)
- The Creature / “the unfortunate and deserted creature” (narrator in parts)
- Henry Clerval
- Elizabeth (Victor’s cousin)
- William
- Ernest
- Justine
- Felix
- Agatha
- The old man (referred to as father)
- Caroline Beaufort
- Beaufort
Now we have a basic RAG using all the fundamentals. If you want to keep going with this, you may want to play around with different chunking strategies, reranking, and other ways to optimize and provide additional context for your query.
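As one example of reranking, you can re-score the retrieved chunks before building the prompt. This is a rough sketch, not part of the code above: it assumes you have the sentence-transformers package installed and uses one of its public cross-encoder models; the query and results variables come from the query script earlier.
from sentence_transformers import CrossEncoder

# a small public cross-encoder trained for passage reranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = results["documents"][0]
# score each (query, chunk) pair; higher means more relevant
scores = reranker.predict([(query, doc) for doc in docs])

# keep only the top 5 chunks, best first, and rebuild the context
ranked = [doc for _, doc in sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)]
context = "\n\n---\n\n".join(ranked[:5])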