Project Summary
DocuMind is a local-first RAG stack with two Chroma collections — public (encyclopedia-scale) and papers (PDFs, DOCX, text, arXiv) — FastAPI with per-request library routing, chunking and cosine retrieval, and citations in every response. Inference defaults to Ollama on your hardware. Bulk public text is indexed offline; the Next.js web UI is a status and query client over the same REST API.
Technical deep dive
Executive summary
DocuMind is a local-first, production-oriented retrieval-augmented generation platform for teams that need precise answers from private document collections without losing control of data residency, model behavior, or governance posture. The system combines FastAPI orchestration, ChromaDB vector indexing, Ollama-powered embeddings and generation, and strategy-aware retrieval pipelines to return grounded responses with explicit citations. Rather than chasing one-click novelty demos, it is designed for durable enterprise AI operations where repeatability, observability, and security controls matter as much as answer fluency.
The central engineering thesis behind DocuMind is that most real-world AI answer failures begin upstream in retrieval quality rather than downstream in text generation quality. Hallucinations, citation drift, weak factual coverage, and overconfident summaries often originate in ingestion defects, chunking fragmentation, inadequate recall, poor filtering logic, or strategy mismatch for the query type. DocuMind surfaces these retrieval levers as explicit controls, so teams can tune outcomes through measurable policies instead of relying on prompt superstition.
Architecturally, DocuMind follows a layered service model that cleanly separates client experience, API orchestration, and retrieval internals. Next.js and Streamlit frontends connect to a FastAPI interface that initializes dependencies during lifespan startup and exposes predictable routes for query, ingest, health, and diagnostics. Behind that contract, DocumentService, ChromaEmbeddingService, and RAGService each own narrowly scoped responsibilities. This service isolation limits incident blast radius, accelerates debugging, and enables safe iteration when model versions, indexing logic, or retrieval strategies evolve.
From an SEO and discoverability lens, DocuMind aligns with high-intent enterprise search themes including private RAG architecture, local LLM deployment, FastAPI AI backend design, ChromaDB retrieval tuning, grounding-first prompt architecture, and secure knowledge assistant implementation. This keyword surface is not decorative content marketing. It directly reflects platform decisions that engineering leaders, security stakeholders, and AI product owners evaluate when comparing managed SaaS copilots versus self-controlled retrieval systems.
DocuMind also adopts a governance-first operating model. Query modes are explicit, response contracts are mode-aware, middleware controls are ordered, and retrieval telemetry is observable at phase-level granularity. This approach supports internal audits, incident postmortems, and compliance alignment in organizations where AI systems are reviewed by engineering, security, legal, and risk teams. In those environments, trust is earned by traceability and deterministic behavior, not by polished prose alone.
Operationally, the platform is built for long-horizon adoption. Teams can launch with conservative baseline retrieval, then progressively add multi-query expansion, HyDE query synthesis, uncertainty-aware FLARE behavior, reranking layers, and policy-driven routing without disrupting client integrations. This staged architecture lowers migration risk, protects downstream consumers, and offers a practical path from proof of concept to production-grade enterprise retrieval.
Design principles
- Grounding before fluency: response quality is evaluated first by evidence alignment, citation traceability, and uncertainty signaling, then by writing style.
- Local-first control posture: ingestion, embedding, retrieval, and generation can run in local or private infrastructure to reduce data egress and lock-in risk.
- Composable service boundaries: parsing, embedding, indexing, retrieval, and synthesis are split into narrow services for independent testing and safe iteration.
- Deterministic runtime behavior: startup lifecycle, health semantics, mode contracts, and diagnostics are explicit to keep behavior stable across environments.
- Progressive retrieval sophistication: baseline retrieval remains the reliable default, while FLARE, HyDE, and multi_query are selectively routed for difficult prompts.
These principles are implemented concretely through deterministic document identifiers, metadata-rich chunk records, strategy selection controls, and mode-specific response schemas. Because controls are explicit, teams can perform controlled A/B tests, isolate regressions to specific phases, and compare model or policy upgrades with reproducible evidence. The result is practical engineering rigor for production RAG systems, not best-effort experimentation.
Design principles also influence operating economics. Faster fault isolation reduces mean time to resolution. Service boundaries lower integration risk during dependency changes. Local-first portability preserves procurement flexibility across infrastructure vendors. Together, these choices reduce total cost of ownership while improving reliability and release velocity.
Runtime architecture
DocuMind uses a client-to-API-to-service topology with explicit FastAPI lifespan bootstrapping. Both web-first and analyst-friendly clients remain decoupled from retrieval internals and communicate through a stable API boundary. During startup, lifespan wiring instantiates services, validates dependencies, and exposes readiness only after critical paths are available. This model makes behavior predictable in local development, containerized deployment, and long-running production services.
The architecture enforces strict ownership boundaries. `DocumentService` handles ingest orchestration and content normalization. `ChromaEmbeddingService` manages embedding generation and vector persistence. `RAGService` executes retrieval strategy logic and answer synthesis from evidence. External engines, including Ollama and ChromaDB, remain behind interfaces so their lifecycle, performance tuning, or scaling can evolve without breaking API-level contracts.
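The startup behavior described above can be sketched with a standard-library async context manager. FastAPI accepts a function of this general shape through its `lifespan=` parameter; here the function takes a hypothetical `AppState` container instead of the app object so the sketch runs without FastAPI installed. The three service classes are empty stubs named after the services in this section, not the project's actual implementations.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


class DocumentService:
    """Stub: ingest orchestration and content normalization."""


class ChromaEmbeddingService:
    """Stub: embedding generation and vector persistence."""


class RAGService:
    """Stub: retrieval strategy logic and answer synthesis."""

    def __init__(self, embeddings: ChromaEmbeddingService) -> None:
        self.embeddings = embeddings


@dataclass
class AppState:
    # Hypothetical container; a FastAPI app would typically use app.state.
    services: dict = field(default_factory=dict)
    ready: bool = False


@asynccontextmanager
async def lifespan(state: AppState):
    # Startup: build services in dependency order, then flip readiness.
    embeddings = ChromaEmbeddingService()
    state.services = {
        "documents": DocumentService(),
        "embeddings": embeddings,
        "rag": RAGService(embeddings),
    }
    state.ready = True
    try:
        yield state
    finally:
        # Shutdown: stop advertising readiness, then release services.
        state.ready = False
        state.services.clear()


async def demo() -> tuple[bool, bool]:
    state = AppState()
    async with lifespan(state) as running:
        ready_during = running.ready
    return ready_during, state.ready
```

Keeping readiness as a flag that is only set after every service is constructed is one way to get the "expose readiness only after critical paths are available" behavior.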
Service isolation reduces blast radius under failure. If parsing logic degrades, ingest can be remediated while query retrieval continues using existing indexed corpora. If a strategy experiment increases latency, routing can revert to baseline without reindexing documents or modifying client payloads. This decoupling shortens rollback paths and supports safer release cadence.
The same design creates bounded extensibility. Teams can add reranking modules, hybrid retrieval, cache layers, policy middleware, or distributed tracing with modest integration friction. Since clients depend on stable API semantics rather than internal service details, backend refactors remain low risk for product surfaces and automation workflows.
Dual collection strategy
DocuMind uses two primary collections, `public` and `papers`, to avoid one-size-fits-all retrieval behavior. Public content emphasizes breadth, concept orientation, and navigational discovery. Papers content emphasizes precision, reproducibility, and citation fidelity for methods and benchmark claims. Segregating collections supports policy specialization and improves relevance under mixed-domain traffic patterns.
| Dimension | Public | Papers |
|---|---|---|
| Primary source profile | General docs, explainers, curated notes, and broad technical references | Research papers, PDFs, methods appendices, and benchmark-heavy artifacts |
| Optimization goal | High recall for orientation and terminology mapping | High precision for claim validation and citation quality |
| Candidate fan-out | Broader top-k exploration to reduce first-pass misses for conceptual prompts | Tighter retrieval windows to prevent noisy context contamination in evidence-heavy prompts |
| Metadata requirements | Source and title metadata usually sufficient for trust checks | Page, section, chunk lineage, and document ids are critical |
| Typical user intents | Definitions, overviews, onboarding, and ecosystem discovery | Method verification, benchmark extraction, and replication support |
| Primary failure risk | Under-recall creates shallow but plausible summaries | Over-permissive thresholds create citation drift and evidence mixing |
Collection specialization enables per-domain tuning for chunk overlap, retrieval depth, threshold strictness, and strategy routing. Governance also becomes clearer because quality standards can be defined per collection rather than diluted into global defaults. Public can optimize discoverability and onboarding value, while papers can enforce stricter evidence rules for higher-risk technical decisions.
The pattern scales naturally to legal, support, product, policy, and compliance corpora. New collections can adopt domain-specific retrieval policies while sharing API contracts, telemetry conventions, and middleware controls. This allows horizontal knowledge expansion without sacrificing operational discipline or response quality.
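One way to express per-collection tuning is a small policy table keyed by collection name. The sketch below is illustrative only: the `CollectionPolicy` type, its field names, and every numeric value are assumptions chosen to mirror the comparison table (broader fan-out for `public`, tighter windows and richer metadata for `papers`), not settings taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionPolicy:
    top_k: int                          # chunks kept for context assembly
    max_distance: float                 # cosine distance cutoff; lower is stricter
    chunk_overlap: int                  # tokens shared between adjacent chunks
    required_metadata: tuple[str, ...]  # keys every chunk record must carry


# Illustrative values only; real thresholds should come from evaluation runs.
POLICIES = {
    "public": CollectionPolicy(
        top_k=12,
        max_distance=0.60,
        chunk_overlap=40,
        required_metadata=("source", "title"),
    ),
    "papers": CollectionPolicy(
        top_k=6,
        max_distance=0.40,
        chunk_overlap=80,
        required_metadata=("source", "title", "page", "section",
                           "chunk_index", "document_id"),
    ),
}


def policy_for(collection: str) -> CollectionPolicy:
    """Look up the policy for a collection, failing loudly on unknown names."""
    try:
        return POLICIES[collection]
    except KeyError:
        raise ValueError(f"unknown collection: {collection!r}") from None
```

Adding a new corpus then amounts to adding one entry, while the lookup and the API contract stay unchanged.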
Ingest lifecycle
Ingest quality is the strongest predictor of retrieval quality, so DocuMind treats ingestion as a deterministic lifecycle with explicit quality gates instead of a one-time indexing script. Every phase preserves identity and lineage metadata so reindexing, forensic analysis, and rollback remain tractable as parsers, chunk policy, or embedding models change.
1. Source registration and identity assignment: each document receives deterministic ids, ownership metadata, and target collection routing for reproducible upserts.
2. Format-aware extraction: PDF, DOCX, markdown, HTML, and plaintext parsers normalize raw content while preserving useful hierarchy cues.
3. Structural normalization: heading artifacts, duplicated boilerplate, malformed delimiters, and encoding noise are removed before chunk creation.
4. Segmentation and chunk policy: token windows and overlap parameters are applied to preserve conceptual continuity while controlling context bloat.
5. Metadata enrichment: each chunk stores source, title, section cues, sequence position, parser version, and ingestion timestamp.
6. Embedding generation: chunk batches are sent to local embedding models via Ollama with retries, backoff, and rate-aware pacing.
7. Vector persistence: vectors and metadata are written to ChromaDB using deterministic keys that support idempotent reprocessing.
8. Version reconciliation: stale chunks are retired or superseded when parser, chunk policy, or embedding models are upgraded.
9. Post-ingest validation: canary queries and checksum diagnostics verify discoverability, metadata completeness, and citation viability.
10. Operational telemetry capture: parse failures, chunk distributions, embedding latency, index throughput, and reconciliation events are logged.
This ingestion lifecycle makes change management safer. If relevance degrades after a parser update, teams can inspect phase artifacts and localize drift quickly. If an embedding model upgrade improves semantic matching, migration can proceed with explicit rollback semantics and intact provenance. The practical result is higher trust with lower operational noise.
Treating ingestion as a continuously validated data product creates durable advantages including repeatable benchmarking, audit-ready lineage, safer dependency upgrades, and better SLA predictability. In enterprise AI systems, these controls are foundational engineering requirements rather than optional optimization work.
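Two of the lifecycle phases, deterministic identity assignment and overlapping segmentation, are easy to illustrate in isolation. The following is a minimal sketch under stated assumptions: the function names are hypothetical, ids are derived from a SHA-256 hash (the source does not say how ids are computed), and whitespace splitting stands in for a real tokenizer.

```python
import hashlib


def document_id(collection: str, source_path: str, content: bytes) -> str:
    """Derive a stable id from collection, source path, and content bytes."""
    digest = hashlib.sha256()
    for part in (collection.encode(), source_path.encode(), content):
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()[:32]


def chunk_tokens(tokens: list[str], window: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows where neighbours share `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def chunk_records(doc_id: str, text: str, window: int = 200, overlap: int = 40) -> list[dict]:
    """Build chunk records whose keys are reproducible across re-ingests."""
    tokens = text.split()  # whitespace tokens as a stand-in for a real tokenizer
    return [
        {"id": f"{doc_id}:{index}", "text": " ".join(chunk), "chunk_index": index}
        for index, chunk in enumerate(chunk_tokens(tokens, window, overlap))
    ]
```

Because both the document id and the chunk keys depend only on inputs, re-running ingest on unchanged content produces the same keys, which is what makes upserts idempotent.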
Retrieval pipeline
DocuMind retrieval executes as a six-step pipeline optimized for grounded correctness, predictable latency, and debuggability. Each stage emits inspectable artifacts, allowing offline evaluation and rapid incident triage. This explicit decomposition helps teams assign failure to the right layer and apply targeted fixes instead of masking defects with broad prompt edits.
1. Query interpretation and mode selection: parse user intent, map to output mode contracts, and apply mode-specific constraints before retrieval.
2. Strategy-aware query formulation: pass through baseline query or generate strategy-specific variants for flare, hyde, or multi_query behavior.
3. Candidate retrieval execution: run dense retrieval in target collection with tunable top-k and candidate multipliers.
4. Candidate filtering and balancing: apply similarity thresholds, lexical sanity checks, and diversity controls to reduce context collapse.
5. Grounded context assembly: build synthesis context from selected chunks, preserving source lineage and citation-ready metadata.
6. Answer synthesis and citation packaging: generate response with Ollama and return structured citations for UI rendering and audit trails.
Because the pipeline is explicit, regressions are easier to isolate. Teams can distinguish failures in mode parsing, retrieval recall, filtering strictness, context assembly, or synthesis behavior. This transparency shortens incident cycles and improves postmortem quality by enabling concrete corrective actions tied to measurable failure points.
The same structure supports rigorous evaluation methodology. Gold query suites can measure overlap, citation density, factual correctness, and response utility under controlled strategy variants. Changes should be promoted only when quality metrics improve within latency budgets, creating a stable path for continuous optimization.
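The filtering and citation-packaging stages of the pipeline can be sketched without any vector store. In the example below, `Candidate`, `select_candidates`, and `to_citations` are hypothetical names, and the `SourceCitation` fields are an assumed spelling of the citation contents the project describes (document id, title, section, chunk index, distance, preview text). The per-document cap is one simple form of diversity control, not necessarily the one DocuMind uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    document_id: str
    title: str
    section: str
    chunk_index: int
    distance: float  # cosine distance; smaller means closer
    text: str


@dataclass(frozen=True)
class SourceCitation:
    document_id: str
    title: str
    section: str
    chunk_index: int
    distance: float
    preview: str


def select_candidates(candidates, max_distance, top_k, per_document=2):
    """Apply a distance cutoff, then cap chunks per document for diversity."""
    kept, seen = [], {}
    for cand in sorted(candidates, key=lambda c: c.distance):
        if cand.distance > max_distance:
            break  # sorted ascending, so the rest are too far as well
        if seen.get(cand.document_id, 0) >= per_document:
            continue  # this document already contributed enough chunks
        seen[cand.document_id] = seen.get(cand.document_id, 0) + 1
        kept.append(cand)
        if len(kept) == top_k:
            break
    return kept


def to_citations(selected, preview_chars=120):
    """Package selected chunks as citation records with a short preview."""
    return [
        SourceCitation(c.document_id, c.title, c.section, c.chunk_index,
                       c.distance, c.text[:preview_chars])
        for c in selected
    ]
```

Because each stage returns plain data, the intermediate lists can be logged and replayed offline when a regression needs to be attributed to filtering rather than retrieval or synthesis.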
Retrieval strategies
| Strategy | Mechanism | Extra LLM calls |
|---|---|---|
| baseline | Single-pass retrieval from original query, followed by filtering, context assembly, and one grounded synthesis pass. | 0 |
| flare | Uncertainty-aware iterative behavior that triggers additional retrieval when first-pass evidence appears weak or incomplete. | 1-2 |
| hyde | Generates a hypothetical answer draft, embeds that synthetic representation, retrieves against it, then grounds final output in real citations. | 1 |
| multi_query | Generates semantically distinct query variants, retrieves across each variant, and fuses candidates for broader recall. | 1 |
Baseline remains the safest default for predictable cost and latency. HyDE performs well when user prompts are short, ambiguous, or vocabulary-misaligned with indexed documents. Multi_query is useful for compound questions spanning multiple concepts. FLARE can rescue difficult edge cases where uncertainty is high, but it should be policy-routed carefully due to variable latency and token overhead.
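For multi_query, candidates retrieved for each query variant have to be merged into one ranking. Reciprocal rank fusion is a common way to do this, and the sketch below shows the standard formula: each chunk scores the sum of `1 / (k + rank)` over the rankings it appears in. The function name is hypothetical, and `k = 60` is a widely used default rather than a value confirmed for this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

A chunk that appears near the top of several variant rankings outranks one that appears first in only a single ranking, which is the recall-broadening effect multi_query relies on.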
In mature deployments, strategy routing should be policy-driven rather than manually selected by users. Query classifiers, latency budgets, and confidence signals can route traffic to the right strategy automatically. This approach keeps common traffic efficient while preserving advanced methods for complex prompts that would otherwise fail baseline retrieval.
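A policy-driven router can be as small as a function over a few query signals. The sketch below is an assumption-heavy illustration: `QuerySignals`, its fields, and the numeric thresholds are all hypothetical, and the per-strategy call counts reuse the upper bounds from the strategy table.

```python
from dataclasses import dataclass

# Extra LLM calls per strategy, using the upper bounds from the strategy table.
EXTRA_LLM_CALLS = {"baseline": 0, "hyde": 1, "multi_query": 1, "flare": 2}


@dataclass(frozen=True)
class QuerySignals:
    token_count: int              # length of the user query
    sub_questions: int            # distinct questions detected by a classifier
    first_pass_confidence: float  # 0..1 score from a cheap first retrieval
    extra_call_budget: int        # extra LLM calls the latency budget allows


def route_strategy(signals: QuerySignals) -> str:
    """Pick a retrieval strategy; thresholds here are illustrative only."""

    def affordable(name: str) -> bool:
        return EXTRA_LLM_CALLS[name] <= signals.extra_call_budget

    if signals.sub_questions > 1 and affordable("multi_query"):
        return "multi_query"  # compound question spanning several concepts
    if signals.token_count <= 4 and affordable("hyde"):
        return "hyde"         # short or vocabulary-poor prompt
    if signals.first_pass_confidence < 0.35 and affordable("flare"):
        return "flare"        # weak first-pass evidence
    return "baseline"
```

Falling through to baseline whenever the budget cannot cover the extra calls keeps common traffic on the cheapest path.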
Query modes
| Mode | Intent | Output profile | When to use |
|---|---|---|---|
| answer | Resolve direct factual or technical requests quickly | Concise grounded response with clear citations | Default operational Q&A and daily engineering workflows |
| explain | Teach a concept using retrieved context | Structured explanation with assumptions and caveats | Onboarding, architecture reviews, and internal education |
| summarize | Condense long artifacts into essential findings | Hierarchical summary anchored to evidence | Research synthesis and executive briefing |
| compare | Contrast approaches, tools, methods, or papers | Side-by-side tradeoff framing with references | Technical decision support and vendor evaluations |
| datasets | Extract benchmark and dataset references deterministically | Structured inventory compiled from cited passages | Reproducibility tracking and benchmark mapping |
Mode-aware routing standardizes output behavior at the API layer and reduces prompt burden on end users. Rather than requiring carefully handcrafted instructions for every request, DocuMind enforces mode contracts that keep response structure consistent. This improves reliability for downstream dashboards, workflow automations, and evaluation pipelines that depend on predictable schema behavior.
Mode segmentation also improves observability and optimization prioritization. Teams can track citation depth, correctness, and latency by mode, then invest where user impact and operating cost intersect most strongly. This makes roadmap planning more data-driven and reduces optimization effort on low-value paths.
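A mode contract can be enforced with an enum plus a per-mode set of required response keys. The mode names below come from the table above; the key names in `RESPONSE_KEYS` and the `validate_response` helper are hypothetical, since the actual response schemas are not shown here.

```python
from enum import Enum


class QueryMode(str, Enum):
    ANSWER = "answer"
    EXPLAIN = "explain"
    SUMMARIZE = "summarize"
    COMPARE = "compare"
    DATASETS = "datasets"


# Hypothetical required keys per mode; the real schemas may differ.
RESPONSE_KEYS = {
    QueryMode.ANSWER: {"answer", "citations"},
    QueryMode.EXPLAIN: {"explanation", "caveats", "citations"},
    QueryMode.SUMMARIZE: {"summary", "citations"},
    QueryMode.COMPARE: {"comparison", "citations"},
    QueryMode.DATASETS: {"datasets", "citations"},
}


def validate_response(mode: str, payload: dict) -> list[str]:
    """Return the sorted list of keys a response is missing for its mode."""
    parsed = QueryMode(mode)  # raises ValueError for an unknown mode
    return sorted(RESPONSE_KEYS[parsed] - set(payload))
```

Running a check like this in contract tests gives downstream dashboards and automations an early signal when a mode's output shape drifts.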
HTTP API
DocuMind exposes a versioned HTTP API with explicit routes for health checks, retrieval queries, ingestion jobs, library inspection, and diagnostics. This design supports multiple client types while preserving compatibility through stable contracts. Typed request validation and standardized error shapes reduce client drift and improve integration confidence across distributed systems.
| Endpoint | Method | Purpose | Notes |
|---|---|---|---|
| /health/live | GET | Returns process liveness for heartbeat checks | Used by orchestrators to detect hard failures |
| /health/ready | GET | Returns dependency readiness after lifecycle init | Prevents traffic before services are fully available |
| /api/v1/query | POST | Runs retrieval and grounded generation | Accepts mode and strategy controls, returns citations |
| /api/v1/ingest | POST | Executes ingest lifecycle for selected documents | Parses, chunks, embeds, and indexes to collection |
| /api/v1/libraries | GET | Returns collection inventory and metadata stats | Useful for capacity planning and sanity checks |
| /api/v1/diagnostics | GET | Returns runtime diagnostics and configuration state | Supports retrieval triage and operational debugging |
A versioned API surface makes incremental delivery safer. New capabilities can be released without forcing immediate client migration, while diagnostics routes provide enough transparency for operators to validate runtime health and configuration state. Combined with contract testing, this architecture supports disciplined CI/CD and safer release management.
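The split between `/health/live` and `/health/ready` comes down to whether dependencies are probed. A minimal, framework-free sketch of that logic follows; the function names, the probe callables, and the response body shape are assumptions, while the 503 status for an unavailable dependency matches the readiness behavior described for this project.

```python
from typing import Callable


def liveness() -> tuple[int, dict]:
    """Process-level check: if this code runs, the process is alive."""
    return 200, {"status": "alive"}


def readiness(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Probe each dependency; any failure or exception yields 503."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises (timeout, refused connection) counts as down.
            results[name] = False
    ok = all(results.values())
    body = {"status": "ready" if ok else "unavailable", "dependencies": results}
    return (200 if ok else 503), body
```

In a real deployment the probes would call Ollama and Chroma; passing them in as callables keeps the aggregation logic testable without either service running.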
Security middleware stack
Security controls are applied as an ordered middleware sequence so validation occurs at ingress and behavior remains consistent under both normal and adversarial traffic. Each layer maps to a specific threat class and failure mode. This structure supports incremental hardening without requiring disruptive architecture rewrites.
1. CORS policy enforcement restricts browser-origin access to approved domains and reduces accidental cross-origin exposure.
2. Trusted host validation rejects malformed or unapproved host headers and limits routing ambiguity behind proxies.
3. Security response headers apply defensive defaults such as frame denial and strict MIME type handling.
4. Request id injection at ingress establishes traceability across logs, diagnostics, and normalized error payloads.
5. Authentication and credential checks on protected routes enforce API access control in shared deployments.
6. Error normalization sanitizes exception output to avoid internal detail leakage while preserving operational context.
7. Optional rate limiting and abuse controls protect expensive retrieval routes against spikes and malicious misuse.
Because DocuMind is service-oriented, security maturity can increase incrementally as governance needs evolve. Teams can start with ingress hardening and credential controls, then add identity federation, policy engines, and network segmentation over time. This avoids the common false tradeoff between shipping quickly and building a defensible long-term security posture.
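Three of the middleware layers (request id injection, the API key check, and error normalization) reduce to small pure functions. The sketch below uses only the standard library. `X-API-Key` is the header the project names; the `X-Request-ID` header name, the function names, and the error payload shape are assumptions made for illustration.

```python
import hmac
import uuid
from typing import Optional


def ensure_request_id(headers: dict[str, str]) -> str:
    """Reuse an inbound request id if present, otherwise mint one at ingress."""
    for name, value in headers.items():
        if name.lower() == "x-request-id" and value.strip():
            return value.strip()
    return uuid.uuid4().hex


def api_key_allowed(headers: dict[str, str], expected_key: Optional[str]) -> bool:
    """Check X-API-Key; an expected_key of None means the gate is disabled."""
    if expected_key is None:
        return True
    supplied = next(
        (value for name, value in headers.items() if name.lower() == "x-api-key"),
        "",
    )
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


def error_payload(request_id: str, status: int, message: str) -> dict:
    """Normalized error body: no stack traces, but always a correlation id."""
    return {"error": {"status": status, "message": message, "request_id": request_id}}
```

Including the request id in every normalized error lets an operator find the matching log lines without exposing internal exception details to the client.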
Observability
DocuMind observability combines diagnostics endpoints, structured logs, and retrieval-specific telemetry for full-stack runtime visibility. Diagnostics surface active configuration and dependency readiness. Logs capture request context, route behavior, failures, and latency distribution. Retrieval telemetry captures candidate fan-out, threshold attrition, strategy timing, and citation characteristics. Together, these signals reveal quality drift early and enable evidence-based incident response.
A practical metrics baseline includes p50 and p95 latency by route, mode, and strategy; ingest throughput by source type; retrieval hit rate; grounded-answer ratio; and citation depth distribution. Teams with strict SLOs should additionally measure phase timings across retrieval, filtering, context assembly, and synthesis. This instrumentation improves reliability and creates a robust foundation for systematic quality optimization.
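As a concrete illustration of the latency baseline, the sketch below groups samples by route, mode, and strategy and reports p50 and p95 using the nearest-rank method. The function names and the sample tuple layout are hypothetical; a production setup would more likely rely on a metrics library with histogram support.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples) -> dict:
    """samples: iterable of (route, mode, strategy, latency_ms) tuples."""
    groups = defaultdict(list)
    for route, mode, strategy, latency_ms in samples:
        groups[(route, mode, strategy)].append(latency_ms)
    return {
        key: {
            "count": len(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
        }
        for key, values in groups.items()
    }
```

Keying the summary by strategy as well as route makes it visible when, for example, flare traffic dominates tail latency while baseline stays within budget.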
Observability also enables high-quality experimentation. Teams can compare strategy variants against curated gold queries, evaluate threshold changes, and quantify whether quality gains justify latency or cost tradeoffs. This evidence loop increases release confidence and prevents subtle regressions that subjective manual reviews often miss.
For leadership and cross-functional stakeholders, telemetry translates AI behavior into operational language such as reliability, evidence coverage, user-perceived latency, and regression risk. That shared language supports better prioritization, clearer ROI framing, and stronger architecture investment decisions.
Known limitations
- Fine-grained per-user authorization at chunk level may require additional ACL or policy-layer retrieval filtering.
- OCR-heavy scanned corpora can reduce precision when extraction artifacts inject malformed text and structure noise.
- Advanced retrieval strategies often improve recall but introduce higher latency variance and increased inference cost.
- Large or rapidly changing corpora may require sharding, asynchronous indexing orchestration, and stricter freshness controls.
- Grounding-first prompts reduce hallucination risk but cannot remove ambiguity when intent or evidence is incomplete.
- Cross-collection terminology overlap can still create ranking tradeoffs without careful threshold and reranking calibration.
These limitations are expected in serious retrieval systems and can be addressed through phased engineering. A practical trajectory is controlled rollout, baseline metric establishment, policy hardening, and targeted scale improvements guided by production behavior. Explicitly documenting constraints prevents unrealistic stakeholder expectations and improves roadmap quality.
Extension points
DocuMind is intentionally designed for extensibility across retrieval quality, governance controls, and performance operations. Service boundaries and typed API contracts allow major capability additions with low integration friction. This is critical in AI environments where model capabilities, compliance requirements, and workload characteristics evolve continuously.
- Cross-encoder reranking for higher precision on nuanced technical and benchmark-heavy retrieval tasks.
- Hybrid dense plus lexical retrieval for acronym-heavy, identifier-sensitive, or exact phrase dependent corpora.
- Policy-driven strategy routing based on query classification, confidence signals, and latency budget constraints.
- Automated CI evaluation harnesses for regression detection on retrieval overlap, citation quality, and answer correctness.
- Identity-aware retrieval filtering and tenancy isolation for secure multi-team deployments.
- Version-aware cache layers with ingest-triggered invalidation to improve throughput while preserving freshness guarantees.
- Adaptive threshold calibration pipelines trained on labeled query sets to continuously balance precision and recall.
DocuMind demonstrates what production-capable private RAG can look like in practice: disciplined ingest, strategy-aware retrieval, source-constrained synthesis, and operations-ready architecture. For organizations that need trustworthy AI knowledge access with strong governance and extensibility, it provides both a credible current implementation and a pragmatic growth path.
As enterprise AI programs mature, architecture quality increasingly determines business value. DocuMind aligns with that reality by optimizing for repeatability, transparency, and controlled evolution. The platform is not only technically credible now, but structurally prepared for future requirements around model updates, policy enforcement, and horizontal scale.
At portfolio level, this project also demonstrates practical senior engineering judgment: isolate volatile dependencies, expose deterministic contracts, make retrieval quality measurable, and treat governance as a first-class design objective. Those decisions are exactly what distinguish a robust enterprise AI system from a short-lived prototype. That is why DocuMind is meaningful not only as a working RAG implementation, but as an architecture blueprint for long-term, accountable AI product delivery.
Key Features & Capabilities
- Dual Chroma collections with cosine HNSW indexing and library-specific relevance thresholds
- Grounding-first answers conditioned only on retrieved chunk text with explicit SourceCitation objects
- Configurable retrieval: baseline, FLARE, HyDE, and multi-query strategies with ablation tooling
- Query modes: general, compare, methodology, datasets, and reproduce — tuned per library
- Liveness vs readiness health split, optional API key gate, and structured request correlation
- Next.js 15 query UI plus optional Streamlit frontend; Docker Compose deployment ready
Tech Stack & Components
Getting Started
1. Docker Compose (recommended)
Publishes API on port 8001 with Chroma volume mount.
docker compose up --build
2. Windows development script
Starts Ollama, API, and Next.js. First boot may wait during corpus ingest.
.\start_documind.ps1
3. Bare metal API
Ensure Ollama is running on port 11434.
uvicorn app.main:app --host 0.0.0.0 --port 8001
Frequently asked questions
- What is DocuMind?
- DocuMind is a local-first RAG stack with two Chroma collections—a public encyclopedia-scale library and a papers library for PDFs, DOCX, text, and arXiv. FastAPI orchestrates ingest, retrieval, and Ollama-backed generation; every grounded answer includes explicit SourceCitation objects.
- How does DocuMind prevent hallucinated citations?
- Answers for LLM-backed modes are conditioned only on retrieved chunk text. Prompts explicitly forbid inventing papers, metrics, or datasets absent from context. Responses include structured citations with document id, title, section, chunk index, distance, and preview text.
- What retrieval strategies does DocuMind support?
- Baseline dense-vector search with keyword rerank and source diversity; FLARE-shaped second pass when drafts contain hedges; HyDE (hypothetical document embedding); and multi-query fusion with reciprocal rank fusion. Ablation scripts report grounded rate and latency per strategy.
- Is DocuMind secure for internal document Q&A?
- The baseline is single-tenant with optional X-API-Key on all /api/v1 routes, CORS and trusted-host middleware, security response headers, and request_id correlation. Per-user ACLs and SSO are not in-tree—production hardening adds OIDC, tenant isolation, and encrypted persistence at the gateway layer.
- What are liveness vs readiness health checks in DocuMind?
- GET /health/live confirms the process is up. GET /health/ready returns 503 when Ollama or Chroma dependencies are unavailable—use readiness for load balancer routing so traffic only hits instances that can actually serve queries.
