How to Build a RAG System with OpenAI and Celuxe in 200 Lines

Retrieval-Augmented Generation (RAG) is the most practical way to give LLMs knowledge they weren't trained on. Instead of fine-tuning, you pair the model with a vector database and retrieve relevant context at query time. This tutorial walks through a production-ready RAG pipeline in about 200 lines of Python.

What We're Building

A RAG system that answers questions about a company's internal documentation. Given a question like "How do I reset my password?", it retrieves the most relevant document chunks and generates an answer using the retrieved context.

The architecture:

  • Document Ingestion: Load and chunk documents → embed → store in vector DB
  • Query: Embed the question → retrieve top-k chunks → pass to LLM with the question
  • Generation: LLM generates answer using retrieved context as grounding

Setup

pip install openai chromadb tiktoken requests

import os
import requests

# Celuxe API (works with the OpenAI SDK)
CELUXE_API_KEY = os.environ.get("CELUXE_API_KEY")
BASE_URL = "https://api.celuxe.shop/v1"

# Use Celuxe's embedding endpoint for vectorization
def embed_text(texts):
    """Embed a list of strings; returns one vector per input."""
    response = requests.post(
        f"{BASE_URL}/embeddings",
        headers={
            "Authorization": f"Bearer {CELUXE_API_KEY}",
            "Content-Type": "application/json"
        },
        json={"model": "text-embedding-3-small", "input": texts}
    )
    response.raise_for_status()  # surface auth/quota errors instead of a confusing KeyError
    return [d["embedding"] for d in response.json()["data"]]
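
Note: embeddings endpoints cap how many inputs a single request can carry (OpenAI documents a 2,048-item limit; we'd expect an OpenAI-compatible gateway to behave similarly). For large corpora, a thin batching wrapper keeps each call under the cap:

def embed_in_batches(texts, batch_size=100):
    """Embed a long list of texts in fixed-size batches."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_text(texts[i:i + batch_size]))
    return vectors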

Step 1: Chunk Your Documents

import tiktoken

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    tokenizer = tiktoken.get_encoding("cl100k_base")  # tokenizer for GPT-4 and text-embedding-3 models
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokenizer.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens of shared context
    return chunks

# Example usage
with open("docs/engineering-handbook.txt") as f:
    text = f.read()
chunks = chunk_text(text)
print(f"Created {len(chunks)} chunks")

Step 2: Embed and Store in ChromaDB

import chromadb

chroma_client = chromadb.Client()  # named to avoid clashing with the OpenAI client in Step 4
collection = chroma_client.get_or_create_collection("docs")

# Embed all chunks
embeddings = embed_text(chunks)

# Add everything to the vector DB in one batched call
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=chunks,
    metadatas=[{"index": i} for i in range(len(chunks))]
)
print(f"Stored {len(chunks)} chunks in ChromaDB")

Step 3: Retrieve Relevant Context

def retrieve(question, top_k=4):
    """Find the most relevant document chunks for a question."""
    question_embedding = embed_text([question])[0]
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k
    )
    return results["documents"][0]

question = "How do I reset my password?"
context_chunks = retrieve(question)
context = "\n\n".join(f"- {c}" for c in context_chunks)
print(f"Retrieved {len(context_chunks)} relevant chunks")

Step 4: Generate the Answer

import openai

client = openai.OpenAI(
    api_key=CELUXE_API_KEY,
    base_url=BASE_URL
)

def answer_question(question, context):
    system_prompt = """You are a helpful assistant answering questions based ONLY on the provided context.
If the answer is not in the context, say "I don't have that information." Never make up an answer."""

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.3,
        max_tokens=512
    )
    return response.choices[0].message.content

answer = answer_question(question, context)
print(answer)

Putting It Together

class RAGSystem:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key, base_url=BASE_URL)
        # get_or_create avoids an error when the collection doesn't exist yet
        self.collection = chromadb.Client().get_or_create_collection("docs")

    def ingest(self, documents):
        chunks = []
        for doc in documents:
            chunks.extend(chunk_text(doc))  # reuse the chunker from Step 1
        embeddings = embed_text(chunks)
        offset = self.collection.count()  # avoid id collisions across ingest calls
        self.collection.add(
            ids=[f"c_{offset + i}" for i in range(len(chunks))],
            embeddings=embeddings,
            documents=chunks
        )
        return len(chunks)

    def query(self, question, top_k=4):
        emb = embed_text([question])[0]
        results = self.collection.query(query_embeddings=[emb], n_results=top_k)
        context = "\n\n".join(results["documents"][0])
        return self.answer_question(question, context)

    def answer_question(self, question, context):
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Answer based ONLY on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        return response.choices[0].message.content

# Usage
rag = RAGSystem(api_key=CELUXE_API_KEY)
rag.ingest(["Your document text here..."])
answer = rag.query("What is the password reset policy?")
print(answer)

Production Considerations

  • Chunk size matters: 500 tokens works well for most documents. Technical docs with code may need smaller chunks (200-300 tokens).
  • Embedding model: text-embedding-3-small is about 5× cheaper than ada-002 with better retrieval quality.
  • Hybrid search: Combine vector search with keyword search for better retrieval on technical terms (see the sketch after this list).
  • Reranking: After retrieval, rerank results with a cross-encoder for better relevance (also sketched below).
  • Streaming: Use streaming responses so users see answers as they're generated (example below).
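
Here are the hybrid-search and reranking ideas in sketch form. Hybrid search fuses ChromaDB's vector ranking with BM25 keyword ranking via reciprocal rank fusion; reranking scores each (question, chunk) pair with a cross-encoder. Both lean on third-party libraries (rank-bm25, sentence-transformers) that the pipeline above doesn't use, so treat this as a starting point rather than a drop-in:

# pip install rank-bm25 sentence-transformers
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def hybrid_retrieve(question, chunks, top_k=4, k=60):
    """Fuse vector and BM25 rankings with reciprocal rank fusion (RRF)."""
    vector_docs = retrieve(question, top_k=top_k * 2)  # vector side, from Step 3
    bm25 = BM25Okapi([c.split() for c in chunks])      # keyword side
    scores = bm25.get_scores(question.split())
    keyword_docs = [chunks[i] for i in
                    sorted(range(len(chunks)), key=lambda i: -scores[i])[:top_k * 2]]
    fused = {}
    for docs in (vector_docs, keyword_docs):
        for rank, doc in enumerate(docs):
            fused[doc] = fused.get(doc, 0) + 1 / (k + rank)  # standard RRF weight
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

def rerank(question, docs, top_k=4):
    """Re-score candidates with a cross-encoder and keep the best."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(question, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: -pair[1])
    return [doc for doc, _ in ranked[:top_k]]

# Chain them: cast a wide net, then keep the best four
context_chunks = rerank(question, hybrid_retrieve(question, chunks, top_k=8))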

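Streaming is the same chat completions call with stream=True; the OpenAI SDK (and, assuming full compatibility, Celuxe) then yields deltas you can print as they arrive:

def answer_question_streaming(question, context):
    """Stream the answer token by token instead of waiting for the full reply."""
    stream = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        stream=True
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some providers send housekeeping chunks with no choices
            continue
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)
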
Estimated Monthly Cost

For a typical knowledge base of 1,000 documents with 10,000 daily queries:

  • Embedding (ingestion): ~$0.50 one-time
  • Retrieval + Generation: ~$8/month
  • Vector storage (ChromaDB local): Free

RAG is not just for enterprise. A single developer can build a production-quality knowledge assistant for less than $10/month.
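
These totals depend heavily on document sizes, query volume, and per-token prices, so it's worth redoing the math with your own numbers. A back-of-envelope helper (you supply the prices from your provider's price list; nothing here is a quoted Celuxe rate):

def estimate_costs(num_docs, tokens_per_doc, daily_queries,
                   tokens_per_query, embed_price, llm_price):
    """Rough cost model; prices are USD per 1M tokens, supplied by you."""
    ingest_once = num_docs * tokens_per_doc / 1e6 * embed_price
    serve_monthly = daily_queries * 30 * tokens_per_query / 1e6 * llm_price
    return ingest_once, serve_monthly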

Build Your RAG System Today

Access embedding models and 30+ LLMs through a single Celuxe API key. No complex SDKs required.

Get Your API Key →

Celuxe Team

Engineering and product team at Celuxe. We write about real production AI infrastructure.