A RAG-powered application that lets organizations upload internal documents and gives employees an AI chat interface to get instant, cited answers. Built to solve the two most common knowledge-access problems: onboarding ramp-up time and cross-department information silos.
New employees spend their first weeks digging through scattered PDFs, Word docs, and wiki pages. Answers are buried. The result: slow ramp-up, repetitive questions to managers, and friction across departments trying to understand each other's processes.
Upload your company documents. Ask questions in plain English. Get instant, accurate answers with citations pointing back to the exact source. No more Slack-pinging another department to ask how their API works.
Retrieval-Augmented Generation keeps the AI grounded in your actual documents. No hallucinated policies. Every answer cites its source, so users can verify and dig deeper. Documents can be added or removed without retraining anything.
React frontend on Vercel, FastAPI backend on Render, ChromaDB for vector storage. Total infrastructure cost for ~500 queries/month is under a dollar. Scales to persistent storage for $7/month when needed.
OnboardAI was designed around two specific, high-impact scenarios that almost every mid-size company faces. Both share the same underlying problem: people need answers from documents they don't know how to navigate.
New hires upload (or are given access to) the employee handbook, SOPs, and process docs. Instead of reading 200 pages or pinging their manager for every question, they ask the chat interface directly.
Reduces ramp-up time, eliminates repetitive questions to managers, and ensures new employees get accurate, policy-compliant answers rather than hallway hearsay.
"What visas and work permits does the company not sponsor?"
"How can I access the shared drives and recorded videos?"
"What's the PTO policy for my first year?"
Engineering needs to understand a limitation in the billing system. Marketing wants to know what the API supports before writing copy. Instead of Slack-pinging another team and waiting hours, they query the docs directly.
Breaks down information silos without creating meeting overhead. Teams interact with each other's documentation on their own schedule.
"What rate limits does the payments API have?"
"What's the SLA for the data pipeline?"
"How does the returns process work for international orders?"
Why these two? Onboarding is where companies feel the pain most acutely — every new hire is a repeated cost. Cross-department search is where the compounding value lives — it scales with team size and document volume. Together they cover the "new employee" and "existing employee" halves of the knowledge-access problem.
Every layer was chosen to balance capability against cost and complexity. The goal: a system that runs in production for under a dollar a month while being genuinely useful, not a demo.
Component-based UI with fast HMR. Chat interface, upload panel, document sidebar, and collapsible source citations.
Python's rich AI/ML ecosystem, async support, auto-generated API docs. Three core routes: /ingest, /query, /documents.
text-embedding-3-small: $0.02 per 1M tokens, 1536-dimension vectors. High quality at the lowest cost tier in the OpenAI embedding lineup.
GPT-4o-mini: $0.15/$0.60 per 1M input/output tokens. The cheapest capable model — strong enough for grounded Q&A over retrieved context.
Zero infrastructure, free, persists to disk. Good enough for demo and small-to-mid scale. Swap to Pinecone or Weaviate when needed.
React on Vercel (free), FastAPI on Render (free tier). Total cost: ~$0.50/month for OpenAI API usage at moderate volume.
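The monthly API figure can be sanity-checked with back-of-envelope arithmetic. The per-query token counts below are assumptions for illustration, not measurements from the app: roughly 2,000 input tokens per query (five retrieved chunks plus history and instructions) and 300 output tokens per answer, priced at the published GPT-4o-mini and text-embedding-3-small rates quoted above.

```python
# Back-of-envelope monthly cost at ~500 queries/month.
# Per-query token counts are assumptions, not measured values.
QUERIES_PER_MONTH = 500
INPUT_TOKENS_PER_QUERY = 2_000   # retrieved chunks + history + system prompt
OUTPUT_TOKENS_PER_QUERY = 300

# Published per-token prices ($ per 1M tokens).
INPUT_PRICE = 0.15 / 1_000_000    # gpt-4o-mini input
OUTPUT_PRICE = 0.60 / 1_000_000   # gpt-4o-mini output
EMBED_PRICE = 0.02 / 1_000_000    # text-embedding-3-small

chat_cost = QUERIES_PER_MONTH * (
    INPUT_TOKENS_PER_QUERY * INPUT_PRICE
    + OUTPUT_TOKENS_PER_QUERY * OUTPUT_PRICE
)

# Embedding 500 short queries plus ~1M tokens of uploaded documents
# is nearly free at $0.02 per 1M tokens.
embed_cost = (QUERIES_PER_MONTH * 20 + 1_000_000) * EMBED_PRICE

total = chat_cost + embed_cost
print(f"~${total:.2f}/month")
```

At these assumed volumes the total lands well under a dollar, consistent with the ~$0.50/month estimate; the chat model, not embeddings, dominates the bill.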
When a user uploads a document (PDF, DOCX, MD, or TXT), it goes through a four-step pipeline before it's queryable.
Extract — The document processor pulls raw text based on file type. PDFs get page-by-page extraction, Word docs get paragraph parsing, markdown and text pass through directly.
Chunk — Text is split into overlapping 500-character chunks with 50-character overlap. Each chunk carries metadata: filename, file type, section header (if detected), and character positions. The overlap ensures no context is lost at boundaries.
Embed — Each chunk is sent to OpenAI's embedding API, returning a 1536-dimension vector that captures semantic meaning. "What's the PTO policy?" and "How many vacation days do I get?" produce similar vectors even though they share no keywords.
Store — Vectors and their associated text/metadata are written to ChromaDB. The document is now searchable. A GitLab Employee Handbook produces ~3,069 chunks; a smaller policy doc might be 25.
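The four steps above can be sketched in a few functions. The chunker is plain Python; the embed and store calls are illustrative uses of the OpenAI and ChromaDB client libraries, and the function names (`chunk_text`, `embed_chunks`, `store_chunks`) are ours, not taken from the codebase.

```python
def chunk_text(text: str, filename: str, chunk_size: int = 500, overlap: int = 50):
    """Split text into overlapping fixed-size chunks with positional metadata."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({
            "text": piece,
            "metadata": {"filename": filename, "start": start, "end": start + len(piece)},
        })
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the document
    return chunks


def embed_chunks(chunks):
    """Embed chunk texts with OpenAI (sketch; needs OPENAI_API_KEY set)."""
    from openai import OpenAI  # imported lazily so the chunker runs standalone
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # 1536-dimension vectors
        input=[c["text"] for c in chunks],
    )
    return [d.embedding for d in resp.data]


def store_chunks(chunks, embeddings, collection):
    """Write vectors, text, and metadata to a ChromaDB collection (sketch)."""
    collection.add(
        ids=[f"{c['metadata']['filename']}-{i}" for i, c in enumerate(chunks)],
        embeddings=embeddings,
        documents=[c["text"] for c in chunks],
        metadatas=[c["metadata"] for c in chunks],
    )
```

Note how the 50-character overlap falls out of the step size: each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.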
When a user asks a question, retrieval and generation happen in a single fast pass.
Embed the question — The user's query is converted to a vector using the same embedding model.
Retrieve — ChromaDB performs a cosine similarity search, returning the top 5 most relevant chunks. Results below a 0.3 relevance threshold are filtered out — if nothing is relevant, the system says so rather than guessing.
Generate — The retrieved chunks, conversation history (last 5 turns), and system instructions are assembled into a prompt. GPT-4o-mini generates the answer, grounded in the retrieved context.
Cite — The response includes clickable source references pointing back to the original document chunks, so users can verify every claim.
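The retrieve-and-generate path can be condensed to two pure functions. The names (`filter_relevant`, `build_prompt`) are illustrative, and we assume the collection is configured for cosine distance, where similarity = 1 − distance:

```python
RELEVANCE_THRESHOLD = 0.3
MAX_HISTORY_TURNS = 5

def filter_relevant(documents, distances, threshold=RELEVANCE_THRESHOLD):
    """Keep chunks whose cosine similarity (1 - distance) clears the floor.

    An empty result signals the caller to answer "not found" instead of guessing.
    """
    return [(doc, 1 - dist) for doc, dist in zip(documents, distances)
            if 1 - dist >= threshold]

def build_prompt(question, chunks, history):
    """Assemble system instructions, retrieved context, and recent history."""
    context = "\n\n".join(doc for doc, _score in chunks)
    messages = [{
        "role": "system",
        "content": "Answer only from the provided context. If the context is "
                   "insufficient, say you cannot find the answer.\n\n"
                   f"Context:\n{context}",
    }]
    messages.extend(history[-2 * MAX_HISTORY_TURNS:])  # 5 turns = 10 messages
    messages.append({"role": "user", "content": question})
    return messages
```

The resulting message list goes straight to the chat completion call; citations come from carrying each kept chunk's metadata through to the response.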
Three-layer architecture with clean separation between the client, API, and processing/storage concerns.
```
┌─────────────────────────────────────────────────────────┐
│                      CLIENT LAYER                       │
│                                                         │
│             React Frontend (Vite + Tailwind)            │
│   ┌──────────┐   ┌──────────┐   ┌──────────────┐        │
│   │  Upload  │   │   Chat   │   │    Source    │        │
│   │  Panel   │   │ Interface│   │    Viewer    │        │
│   └──────────┘   └──────────┘   └──────────────┘        │
└──────────────────────┬──────────────────────────────────┘
                       │ HTTP/REST (JSON)
┌──────────────────────▼──────────────────────────────────┐
│               API LAYER — FastAPI Server                │
│   ┌──────────┐   ┌──────────┐   ┌──────────────┐        │
│   │ /ingest  │   │  /query  │   │  /documents  │        │
│   └──────────┘   └──────────┘   └──────────────┘        │
└──────┬──────────────┬──────────────┬────────────────────┘
       │              │              │
┌──────▼──────┐  ┌────▼─────┐  ┌────▼─────────────────────┐
│ PROCESSING  │  │ RETRIEVAL│  │      EXTERNAL APIs       │
│             │  │          │  │                          │
│ Doc Process │  │ ChromaDB │  │ OpenAI Embeddings API    │
│ Text Chunker│  │ (local)  │  │ OpenAI Chat API          │
└─────────────┘  └──────────┘  └──────────────────────────┘
```
Services Layer: The backend is organized into discrete services — document_processor, chunker, embeddings, vector_store, retriever, and generator. Each handles one concern. This means swapping ChromaDB for Pinecone, or GPT-4o-mini for Claude, is a single-file change rather than a rewrite.
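One way to read "single-file change": each service depends only on a narrow interface, not on a concrete backend. A minimal sketch under that assumption (the `Protocol` and the toy store below are ours, not the project's actual code):

```python
from typing import Protocol, Sequence

class VectorStore(Protocol):
    """The only surface the retriever needs; any backend can implement it."""
    def add(self, ids: Sequence[str], embeddings, documents, metadatas) -> None: ...
    def query(self, embedding, top_k: int) -> list: ...

class InMemoryStore:
    """Toy stand-in showing that swapping backends touches one class."""
    def __init__(self):
        self._rows = []

    def add(self, ids, embeddings, documents, metadatas):
        self._rows.extend(
            {"id": i, "embedding": e, "document": d, "metadata": m}
            for i, e, d, m in zip(ids, embeddings, documents, metadatas)
        )

    def query(self, embedding, top_k):
        return self._rows[:top_k]  # a real backend ranks by similarity here
```

Replacing ChromaDB with Pinecone then means writing one new class against `VectorStore`; the retriever, generator, and API routes never change.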
Why 500-character chunks with 50-char overlap? Too small and you lose context — a sentence about PTO policy gets separated from the details. Too large and retrieval gets noisy — irrelevant content rides along with the good match. 500 characters (~80–100 words) is the sweet spot for Q&A retrieval. The overlap ensures sentences that span chunk boundaries aren't lost.
Why a relevance threshold of 0.3? Without a floor, ChromaDB always returns its top-k results — even if nothing is relevant. A cosine similarity score of 0.3 is generous enough to catch fuzzy matches but strong enough to prevent the LLM from fabricating answers from tangentially related content. When nothing clears the threshold, the system tells the user it can't find the answer rather than guessing.
Why conversation history (last 5 turns)? Follow-up questions like "What's the vesting schedule?" only make sense if the system remembers you were just asking about 401k matching. Sending the last 5 turns gives enough context for multi-step exploration without bloating the prompt or running up token costs.
Why RAG over fine-tuning? Documents change. People get promoted, policies update, new SOPs get written. RAG lets you add or remove documents instantly — no retraining, no waiting, no versioning headaches. The knowledge base is always current because it's always reading from the source docs.
Designed to run on free tiers with minimal infrastructure. The frontend deploys to Vercel, the backend to Render, and ChromaDB persists on Render's filesystem.
| Component | Service | Monthly Cost |
|---|---|---|
| Frontend (React) | Vercel | Free |
| Backend (FastAPI) | Render | Free |
| Vector Store | ChromaDB on Render | Free |
| OpenAI API (~500 queries/mo) | OpenAI | ~$0.50 |
| Persistent disk (optional) | Render | $7.00 |
| Total | | ~$0.50 – $7.50 |
For businesses evaluating this: The architecture is deliberately simple and cheap to start, but every layer has a clear upgrade path. ChromaDB → Pinecone for managed vector search. Render → AWS/GCP for autoscaling. GPT-4o-mini → GPT-4o or Claude for higher-quality answers. You start at $0.50/month and scale individual components as usage demands — no big-bang migration required.