Retrieval-Augmented Generation (RAG) is one of the most practical ways to give an LLM knowledge it wasn't trained on. Instead of fine-tuning, you connect a vector database and let the model retrieve relevant context at query time. This tutorial walks through a production-ready RAG pipeline in about 200 lines of Python.
What We're Building
A RAG system that answers questions about a company's internal documentation. Given a question like "How do I reset my password?", it retrieves the most relevant document chunks and generates an answer using the retrieved context.
The architecture:
- Document Ingestion: Load and chunk documents → embed → store in vector DB
- Query: Embed the question → retrieve top-k chunks → pass to LLM with the question
- Generation: LLM generates answer using retrieved context as grounding
Setup
pip install openai chromadb tiktoken requests
import os
import requests
# Celuxe API - works with the OpenAI SDK
CELUXE_API_KEY = os.environ.get("CELUXE_API_KEY")
BASE_URL = "https://api.celuxe.shop/v1"
# Use Celuxe's embedding endpoint for vectorization
def embed_text(texts):
    """Embed a list of strings via the embeddings endpoint."""
    response = requests.post(
        f"{BASE_URL}/embeddings",
        headers={
            "Authorization": f"Bearer {CELUXE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"model": "text-embedding-3-small", "input": texts},
    )
    response.raise_for_status()  # fail fast on auth, quota, or payload errors
    return [d["embedding"] for d in response.json()["data"]]
Step 1: Chunk Your Documents
import tiktoken
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    tokenizer = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 and the text-embedding-3 models
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunk_tokens = tokens[start:start + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
        start += chunk_size - overlap
    return chunks
# Example usage
with open("docs/engineering-handbook.txt") as f:
text = f.read()
chunks = chunk_text(text)
print(f"Created {len(chunks)} chunks")
Step 2: Embed and Store in ChromaDB
import chromadb
# Named chroma_client so it doesn't collide with the OpenAI client created in Step 4
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")
# Embed all chunks
embeddings = embed_text(chunks)
# Add to the vector DB with metadata; ChromaDB accepts parallel lists in a single call
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=chunks,
    metadatas=[{"index": i} for i in range(len(chunks))],
)
print(f"Stored {len(chunks)} chunks in ChromaDB")
Step 3: Retrieve Relevant Context
def retrieve(question, top_k=4):
"""Find the most relevant document chunks for a question."""
question_embedding = embed_text([question])[0]
results = collection.query(
query_embeddings=[question_embedding],
n_results=top_k
)
return results["documents"][0]
question = "How do I reset my password?"
context_chunks = retrieve(question)
context = "\n\n".join(f"- {c}" for c in context_chunks)
print(f"Retrieved {len(context_chunks)} relevant chunks")
Step 4: Generate the Answer
import openai
client = openai.OpenAI(
api_key=CELUXE_API_KEY,
base_url=BASE_URL
)
def answer_question(question, context):
system_prompt = """You are a helpful assistant answering questions based ONLY on the provided context.
If the answer is not in the context, say "I don't have that information." Never make up an answer."""
response = client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
temperature=0.3,
max_tokens=512
)
return response.choices[0].message.content
answer = answer_question(question, context)
print(answer)
Putting It Together
class RAGSystem:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key, base_url=BASE_URL)
        # get_or_create avoids an error when the collection doesn't exist yet
        self.collection = chromadb.Client().get_or_create_collection("docs")

    def ingest(self, documents):
        chunks = []
        for doc in documents:
            chunks.extend(chunk_text(doc))  # reuse the chunker from Step 1
        embeddings = embed_text(chunks)
        offset = self.collection.count()  # avoid ID collisions across ingest calls
        self.collection.add(
            ids=[f"c_{offset + i}" for i in range(len(chunks))],
            embeddings=embeddings,
            documents=chunks,
        )
        return len(chunks)

    def query(self, question, top_k=4):
        emb = embed_text([question])[0]
        results = self.collection.query(query_embeddings=[emb], n_results=top_k)
        context = "\n\n".join(results["documents"][0])
        return self.answer_question(question, context)

    def answer_question(self, question, context):
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Answer based ONLY on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content
# Usage
rag = RAGSystem(api_key=CELUXE_API_KEY)
rag.ingest(["Your document text here..."])
answer = rag.query("What is the password reset policy?")
print(answer)
Production Considerations
- Chunk size matters: 500 tokens works well for most documents. Technical docs with code may need smaller chunks (200-300 tokens).
- Embedding model: text-embedding-3-small is about 5× cheaper than ada-002 ($0.02 vs. $0.10 per million tokens at OpenAI's list prices) with better retrieval performance.
- Hybrid search: Combine vector search with keyword search for better retrieval on technical terms (a sketch follows this list).
- Reranking: After retrieval, rerank results with a cross-encoder for better relevance (also sketched below).
- Streaming: Use streaming responses so users see answers as they're generated (final sketch below).
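Here is a minimal hybrid-retrieval sketch. It assumes the third-party rank_bm25 package (pip install rank-bm25) plus the chunks list and collection built in Steps 1-2; the constant 60 is the conventional reciprocal-rank-fusion default, not something tuned for this corpus:

from rank_bm25 import BM25Okapi

# Keyword index over the same chunks, tokenized naively on whitespace
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_retrieve(question, top_k=4):
    # Vector side: over-fetch candidates from ChromaDB
    emb = embed_text([question])[0]
    n = min(top_k * 3, collection.count())
    vec_docs = collection.query(query_embeddings=[emb], n_results=n)["documents"][0]
    vec_rank = {doc: r for r, doc in enumerate(vec_docs)}
    # Keyword side: BM25 scores over every chunk, keep the same-sized pool
    scores = bm25.get_scores(question.lower().split())
    kw_order = sorted(range(len(chunks)), key=lambda i: -scores[i])[:n]
    kw_rank = {chunks[i]: r for r, i in enumerate(kw_order)}
    # Reciprocal-rank fusion: 1/(60+rank); unseen docs get a large rank penalty
    def rrf(doc):
        return 1 / (60 + vec_rank.get(doc, 10_000)) + 1 / (60 + kw_rank.get(doc, 10_000))
    candidates = set(vec_rank) | set(kw_rank)
    return sorted(candidates, key=rrf, reverse=True)[:top_k]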
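For reranking, a sketch using sentence-transformers (pip install sentence-transformers); the checkpoint name is a common public cross-encoder, swap in whatever you prefer, and the usage line reuses retrieve() from Step 3:

from sentence_transformers import CrossEncoder

# Public cross-encoder trained on MS MARCO passage relevance
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question, candidate_chunks, top_k=4):
    # Score each (question, chunk) pair jointly; higher score = more relevant
    scores = reranker.predict([(question, c) for c in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Typical pattern: over-retrieve with cheap vector search, then rerank down
final_chunks = rerank(question, retrieve(question, top_k=20), top_k=4)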
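And streaming is the same chat call with stream=True; this version of Step 4's answer function prints tokens as they arrive, using the OpenAI client from Step 4:

def stream_answer(question, context):
    # stream=True makes the SDK yield incremental chunks instead of one response
    stream = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)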
Estimated Monthly Cost
For a typical knowledge base of 1,000 documents with 10,000 daily queries:
- Embedding (ingestion): ~$0.50 one-time
- Retrieval + Generation: ~$8/month
- Vector storage (ChromaDB local): Free
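For reference, a back-of-envelope for the ingestion figure. The corpus sizing here is an assumption for illustration, and the per-token price is OpenAI's list price for text-embedding-3-small (Celuxe's may differ):

# All sizing numbers below are assumptions, not measurements
num_docs = 1_000
tokens_per_doc = 25_000        # assumed average (~50 pages of text per document)
price_per_m_tokens = 0.02      # USD per million tokens, text-embedding-3-small

ingest_cost = num_docs * tokens_per_doc / 1_000_000 * price_per_m_tokens
print(f"One-time embedding cost: ${ingest_cost:.2f}")  # -> $0.50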
RAG is not just for enterprise. A single developer can build a production-quality knowledge assistant for less than $10/month.
Build Your RAG System Today
Access embedding models and 30+ LLMs through a single Celuxe API key. No complex SDKs required.
Get Your API Key →