Retrieval-Augmented Generation (RAG) is one of the most practical ways to give an LLM knowledge it wasn't trained on. Instead of fine-tuning, you connect a vector database and let the model retrieve relevant context at query time. This tutorial walks through a production-ready RAG pipeline in about 200 lines of Python.
What We're Building
A RAG system that answers questions about a company's internal documentation. Given a question like "How do I reset my password?", it retrieves the most relevant document chunks and generates an answer using the retrieved context.
The architecture:
- Document Ingestion: Load and chunk documents → embed → store in vector DB
- Query: Embed the question → retrieve top-k chunks → pass to LLM with the question
- Generation: LLM generates answer using retrieved context as grounding
Setup
pip install openai chromadb tiktoken requests
import os
import requests
# Celuxe API - works with the OpenAI SDK
CELUXE_API_KEY = os.environ.get("CELUXE_API_KEY")
BASE_URL = "https://api.celuxe.shop/v1"
# Use Celuxe's embedding endpoint for vectorization
def embed_text(texts):
    """Embed a list of strings via the embeddings endpoint."""
    response = requests.post(
        f"{BASE_URL}/embeddings",
        headers={
            "Authorization": f"Bearer {CELUXE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"model": "text-embedding-3-small", "input": texts},
    )
    response.raise_for_status()  # fail fast on auth, quota, or payload errors
    return [d["embedding"] for d in response.json()["data"]]
Step 1: Chunk Your Documents
import tiktoken
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    tokenizer = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 and the text-embedding-3 models
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunk_tokens = tokens[start:start + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
        start += chunk_size - overlap
    return chunks
# Example usage
with open("docs/engineering-handbook.txt") as f:
text = f.read()
chunks = chunk_text(text)
print(f"Created {len(chunks)} chunks")
Step 2: Embed and Store in ChromaDB
import chromadb
# Named chroma_client so it doesn't collide with the OpenAI client created in Step 4
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")
# Embed all chunks
embeddings = embed_text(chunks)
# Add to the vector DB with metadata; ChromaDB accepts parallel lists in a single call
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=chunks,
    metadatas=[{"index": i} for i in range(len(chunks))],
)
print(f"Stored {len(chunks)} chunks in ChromaDB")
Step 3: Retrieve Relevant Context
def retrieve(question, top_k=4):
"""Find the most relevant document chunks for a question."""
question_embedding = embed_text([question])[0]
results = collection.query(
query_embeddings=[question_embedding],
n_results=top_k
)
return results["documents"][0]
question = "How do I reset my password?"
context_chunks = retrieve(question)
context = "\n\n".join(f"- {c}" for c in context_chunks)
print(f"Retrieved {len(context_chunks)} relevant chunks")
Step 4: Generate the Answer
import openai
client = openai.OpenAI(
api_key=CELUXE_API_KEY,
base_url=BASE_URL
)
def answer_question(question, context):
system_prompt = """You are a helpful assistant answering questions based ONLY on the provided context.
If the answer is not in the context, say "I don't have that information." Never make up an answer."""
response = client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
temperature=0.3,
max_tokens=512
)
return response.choices[0].message.content
answer = answer_question(question, context)
print(answer)
Putting It Together
class RAGSystem:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key, base_url=BASE_URL)
        # get_or_create avoids an error when the collection doesn't exist yet
        self.collection = chromadb.Client().get_or_create_collection("docs")

    def ingest(self, documents):
        chunks = []
        for doc in documents:
            chunks.extend(chunk_text(doc))  # reuse the chunker from Step 1
        embeddings = embed_text(chunks)
        offset = self.collection.count()  # avoid ID collisions across ingest calls
        self.collection.add(
            ids=[f"c_{offset + i}" for i in range(len(chunks))],
            embeddings=embeddings,
            documents=chunks,
        )
        return len(chunks)

    def query(self, question, top_k=4):
        emb = embed_text([question])[0]
        results = self.collection.query(query_embeddings=[emb], n_results=top_k)
        context = "\n\n".join(results["documents"][0])
        return self.answer_question(question, context)

    def answer_question(self, question, context):
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Answer based ONLY on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content
# Usage
rag = RAGSystem(api_key=CELUXE_API_KEY)
rag.ingest(["Your document text here..."])
answer = rag.query("What is the password reset policy?")
print(answer)
Production Considerations
- Chunk size matters: 500 tokens works well for most documents. Technical docs with code may need smaller chunks (200-300 tokens).
- Embedding model: text-embedding-3-small is about 5× cheaper than ada-002 ($0.02 vs. $0.10 per million tokens at OpenAI's list prices) with better retrieval performance.
- Hybrid search: Combine vector search with keyword search for better retrieval on technical terms (a sketch follows this list).
- Reranking: After retrieval, rerank results with a cross-encoder for better relevance (also sketched below).
- Streaming: Use streaming responses so users see answers as they're generated (final sketch below).
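Here is a minimal hybrid-retrieval sketch. It assumes the third-party rank_bm25 package (pip install rank-bm25) plus the chunks list and collection built in Steps 1-2; the constant 60 is the conventional reciprocal-rank-fusion default, not something tuned for this corpus:

from rank_bm25 import BM25Okapi

# Keyword index over the same chunks, tokenized naively on whitespace
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_retrieve(question, top_k=4):
    # Vector side: over-fetch candidates from ChromaDB
    emb = embed_text([question])[0]
    n = min(top_k * 3, collection.count())
    vec_docs = collection.query(query_embeddings=[emb], n_results=n)["documents"][0]
    vec_rank = {doc: r for r, doc in enumerate(vec_docs)}
    # Keyword side: BM25 scores over every chunk, keep the same-sized pool
    scores = bm25.get_scores(question.lower().split())
    kw_order = sorted(range(len(chunks)), key=lambda i: -scores[i])[:n]
    kw_rank = {chunks[i]: r for r, i in enumerate(kw_order)}
    # Reciprocal-rank fusion: 1/(60+rank); unseen docs get a large rank penalty
    def rrf(doc):
        return 1 / (60 + vec_rank.get(doc, 10_000)) + 1 / (60 + kw_rank.get(doc, 10_000))
    candidates = set(vec_rank) | set(kw_rank)
    return sorted(candidates, key=rrf, reverse=True)[:top_k]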
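For reranking, a sketch using sentence-transformers (pip install sentence-transformers); the checkpoint name is a common public cross-encoder, swap in whatever you prefer, and the usage line reuses retrieve() from Step 3:

from sentence_transformers import CrossEncoder

# Public cross-encoder trained on MS MARCO passage relevance
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question, candidate_chunks, top_k=4):
    # Score each (question, chunk) pair jointly; higher score = more relevant
    scores = reranker.predict([(question, c) for c in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Typical pattern: over-retrieve with cheap vector search, then rerank down
final_chunks = rerank(question, retrieve(question, top_k=20), top_k=4)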
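And streaming is the same chat call with stream=True; this version of Step 4's answer function prints tokens as they arrive, using the OpenAI client from Step 4:

def stream_answer(question, context):
    # stream=True makes the SDK yield incremental chunks instead of one response
    stream = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)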
Estimated Monthly Cost
For a typical knowledge base of 1,000 documents with 10,000 daily queries:
- Embedding (ingestion): ~$0.50 one-time
- Retrieval + Generation: ~$8/month
- Vector storage (ChromaDB local): Free
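For reference, a back-of-envelope for the ingestion figure. The corpus sizing here is an assumption for illustration, and the per-token price is OpenAI's list price for text-embedding-3-small (Celuxe's may differ):

# All sizing numbers below are assumptions, not measurements
num_docs = 1_000
tokens_per_doc = 25_000        # assumed average (~50 pages of text per document)
price_per_m_tokens = 0.02      # USD per million tokens, text-embedding-3-small

ingest_cost = num_docs * tokens_per_doc / 1_000_000 * price_per_m_tokens
print(f"One-time embedding cost: ${ingest_cost:.2f}")  # -> $0.50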
RAG is not just for enterprise. A single developer can build a production-quality knowledge assistant for less than $10/month.
Build Your RAG System Today
Access embedding models and 30+ LLMs through a single Celuxe API key. No complex SDKs required.
Get Your API Key →