Repository RAG Lab¶
This repository is a uv-first research lab for repository-grounded Retrieval-Augmented
Generation. Notebooks, the packaged Python CLI, make targets, tests, CI, and the Rust wrapper
all share the same implementation so experiments and automation stay aligned.
The Rust wrapper also exposes a compact SQLite lookup path for tracked files, and the default
make ask / uv run repo-rag ask path now uses that native index first before falling back to
the broader baseline retriever.
What The Repository Covers¶
The current scaffold focuses on three connected jobs:
- Explore in-repo RAG over repository files with a simple baseline retriever plus optional DSPy runtime and compiled-program flows.
- Discover MCP-related artifacts in the repository, submodules, or package manifests.
- Prepare Azure deployment manifests, validate Azure runtime contracts, and optionally answer repository questions through live Azure-backed synthesis.
Tooling Stance¶
This repository is intentionally fully uv-managed.
uvowns environment sync, locked execution, dependency resolution, builds, and publishing.uv_buildis the Python build backend.makeis a convenience layer overuv run ....- Pixi is not part of the current toolchain because it would duplicate responsibilities already
covered by
uv.
Quick Start¶
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --extra azure
make hooks-install
make quality
make ask QUESTION="What does this repository research?"
Publication Draft¶
The repository now includes a publication-style article that explains the project piece by piece, from corpus loading and retrieval through MCP discovery, notebook scaffolding, verification, and the Rust wrapper.
- Read the PDF: publication/repository-rag-lab-article.pdf
- Read the bilingual exploratorium: publication/exploratorium_translation/exploratorium_translation.pdf
- Browse the live Pages catalog: https://realagiorganization.github.io/dspy_rag_in_repo_docs_and_impl1/
- Review the tracked file inventory: FILES.md
- Review the synced TODO table: TODO.MD
- Follow the short local verification walkthrough: documentation/simple-end-to-end-verification-guide.md
- Browse the Pages-ready Markdown catalog locally:
make pages-buildthen opensite/index.html - Refresh the file inventory:
make files-sync - Refresh the backlog tables:
make todo-sync - Refresh the bilingual exploratorium inventory:
make exploratorium-sync - Rebuild it locally:
make paper-build
Preferred Workflow Surfaces¶
| Surface | Preferred command | Purpose |
|---|---|---|
| Utility overview | make utility-summary |
Show the supported user-facing entrypoints. |
| Direct CLI | uv run repo-rag utility-summary |
Use the packaged CLI without going through make. |
| File inventory sync | make files-sync |
Regenerate FILES.md and FILES.csv from the tracked repository tree. |
| Rust lookup index | make rust-lookup-index |
Build or refresh the ignored SQLite FTS index under artifacts/sqlite/. |
| Rust lookup | make rust-lookup QUERY="dspy training" |
Search tracked file paths and contents locally before make ask-dspy. |
| Backlog sync | make todo-sync |
Regenerate the linkified TODO table in both Markdown and the publication article. |
| Exploratorium sync | make exploratorium-sync |
Regenerate the bilingual file-link-fetch-state publication inventory. |
| Pages catalog build | make pages-build |
Generate and build the MkDocs Material GitHub Pages catalog of tracked Markdown files. |
| Ask a repo question | make ask QUESTION="..." |
Run the lookup-first repository-grounded workflow with explicit Question:, Answer:, and Evidence: output, narrowing to native SQLite file hits before falling back to the broader baseline retriever. |
| DSPy ask | make ask-dspy QUESTION="..." |
Run the explicit DSPy runtime path with LM config from DSPY_*, Azure, or OpenAI environment variables, automatically reusing the latest compiled program when one exists after the same lookup-first narrowing pass; pair it with make rust-lookup when you want to inspect those candidate files directly. |
| Live Azure ask | make ask-live QUESTION="..." |
Retrieve repository evidence locally, then synthesize a live answer through Azure OpenAI or Azure AI Inference. |
| DSPy compile | make dspy-train DSPY_RUN_NAME=... |
Compile and save a repository-grounded DSPy program under artifacts/dspy/. |
| DSPy artifact inspect | make dspy-artifacts |
List saved DSPy runs, the latest compiled program, and recorded benchmark metadata. |
| Retrieval evaluation | make retrieval-eval |
Measure retrieval quality with pass rate, recall, precision, reciprocal rank, per-tag breakdowns, a top-k sweep, and enforced minimum pass/recall thresholds. |
| MCP discovery | make discover-mcp |
Inspect MCP-related repository artifacts. |
| Smoke test | make smoke-test |
Check answer generation, MCP discovery, and Azure manifest output together. |
| Azure OpenAI probe | make azure-openai-probe |
Validate the Azure OpenAI env contract and run a minimal live chat-completions round trip. |
| Azure Inference probe | make azure-inference-probe |
Validate and normalize the Azure AI Inference endpoint, then run a minimal live round trip. |
| Surface verification | make verify-surfaces |
Enforce the Makefile and notebook contract. |
| Notebook batch report | make notebook-report |
Execute all tracked notebooks with progress output, raw logs, executed copies, and a final report. |
| GitHub run list | make gh-runs |
List recent GitHub Actions runs through gh. |
| GitHub run watch | make gh-watch |
Watch the latest or selected GitHub Actions run until completion. |
| GitHub failed logs | make gh-failed-logs |
Print failed job logs for the latest or selected run when CI breaks. |
| PR gate sync | make github-pr-gates |
Apply the required GitHub status checks for master pull request merges through gh. |
| Publication PDF | make paper-build |
Build the LaTeX article PDF and clipped banner image. |
| Exploratorium PDF | make exploratorium-build |
Build the bilingual exploratorium translation PDF. |
| Notebook research | make notebook |
Open the main notebook playbook in JupyterLab. |
| Rust wrapper | cargo run --manifest-path rust-cli/Cargo.toml -- ask --question "..." |
Delegate to the Python workflow, while also exposing native index and lookup subcommands. |
Repository Map¶
| Path | Role |
|---|---|
src/repo_rag_lab/ |
Shared Python package for corpus loading, retrieval, MCP discovery, CLI commands, notebook scaffolds, utilities, and verification helpers. |
README.AGENTS.md |
Overarching research narrative that ties together the repository thesis, workflow stages, evidence surfaces, and maintenance contract. |
README.DSPY.MD |
Central DSPy map covering corpus planning, training samples, benchmarks, compile-reload flows, notebook scaffolds, and remaining DSPy limitations. |
FILES.md / FILES.csv |
Generated tracked-file inventories for humans, scripts, and agent maintenance. |
notebooks/ |
Research playbooks that reuse package helpers for validation, assertions, and logging instead of embedding workflow logic inline. |
tests/ |
Pytest suites, BDD-style checks, doctests, and surface verification tests. |
samples/training/ |
Starter question-answer pairs for DSPy-oriented experiments. |
samples/population/ |
Starter corpus-planning data for staged repository ingestion. |
documentation/ |
Supporting notes for Azure deployment and inspired external implementations. |
publication/ |
LaTeX article source, bibliography, committed PDFs, clipped banner image, bilingual exploratorium subdocument, and local build helpers. |
mkdocs.yml |
MkDocs Material configuration for the public GitHub Pages Markdown catalog. |
todo-backlog.yaml |
Single source of truth for the linkified backlog table rendered into TODO.MD and the publication article. |
samples/logs/ |
Post-push GitHub Actions inspection logs captured with gh. |
artifacts/ |
Generated DSPy program artifacts, Azure manifests, tuning metadata, notebook run logs, and notebook batch-run reports. |
rust-cli/ |
Rust wrapper that delegates to uv run repo-rag and maintains the local SQLite lookup index under artifacts/sqlite/. |
Verification And Quality¶
The repository treats documentation, notebooks, utilities, and packaging as one workflow. The main verification entrypoints are:
make ask QUESTION="What does this repository research?"make compilemake smoke-testmake lintmake typecheckmake verify-surfacesmake testmake qualitymake build
Git hooks are managed through pre-commit:
make hooks-installmake hooks-runmake hooks-run-push
The pre-commit hook stays lightweight with Ruff checks. The pre-push hook runs the heavier
acceptance gates: mypy, basedpyright, retrieval evaluation, pytest with coverage, and
repository-surface verification. The same retrieval gate also runs inside make quality and CI.
Azure Deployment Path¶
This repository does not fine-tune or deploy a model on its own. It writes deployment metadata
that downstream Azure workflows can consume after a tuned artifact already exists. It now also
includes first-class runtime probes and a live-answer surface for validating the repository's
Azure configuration against the same uv-managed CLI.
make azure-manifest MODEL_ID=my-ft-model DEPLOYMENT_NAME=repo-rag-ft
make azure-openai-probe
make azure-inference-probe
make ask-live QUESTION="What does this repository research?"
The manifest lands in artifacts/azure/ and records the deployment name, endpoint, and required
runtime environment variables.
Agent Guidance¶
Repository-local agent instructions live in AGENTS.md. Agents and contributors should start with
named make targets or uv run repo-rag ... commands before inventing one-off workflows so
notebooks, tests, CI, and automation stay aligned. The overreaching repository narrative that
agents are expected to keep current lives in README.AGENTS.md. For repo
question answering, make ask already uses the Rust lookup path first. Run make rust-lookup
QUERY="..." when you want to inspect those candidate files directly before moving to
make ask-dspy.
Post-Push Workflow¶
After every push:
- Inspect recent runs with
gh run list --limit 10. - Capture the relevant
gh run viewdetails. - Store the summary in
samples/logs/.
That step is part of the repository contract, not optional cleanup.
