Retrieval Evaluation Suite Audit
- Audit date: `2026-03-18` (Asia/Tbilisi)
- Repository root: `/home/standard/dspy_rag_in_repo_docs_and_impl1`
- Git HEAD during verification: `798f5da77d6badb99f5f284afa834745aa572e53`
Scope
This audit captures the new retrieval-quality evaluation suite: richer per-case metrics in the
benchmark layer, a top-k sweep, a repo-facing retrieval-eval utility exposed through both the
CLI and Makefile, notebook-scaffolding propagation of the richer summaries, and the matching
tests plus docs.
Executed Commands
Executed successfully in this turn:
- `make hooks-install`
- `uv run python -m compileall src tests`
- `uv run pytest tests/test_utilities.py tests/test_repository_rag_bdd.py`
- `uv run repo-rag smoke-test`
- `uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8`
- `cargo build --manifest-path rust-cli/Cargo.toml`
- `uv run pytest tests/test_utilities.py tests/test_cli_and_dspy.py tests/test_benchmarks_and_notebook_scaffolding.py tests/test_verification.py`
- `make quality`
Notable Results
- `make hooks-install`: passed and refreshed the managed `pre-commit` plus `pre-push` hooks
- `uv run python -m compileall src tests`: passed
- `uv run pytest tests/test_utilities.py tests/test_repository_rag_bdd.py`: passed, `9` tests
- `uv run repo-rag smoke-test`: passed with `answer_contains_repository: true`, `mcp_candidate_count: 1`, and `manifest_path: artifacts/azure/repo-rag-smoke.json`
- `uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8`: passed and produced a retrieval-quality report with:
  - default `top_k: 4`
  - `pass_rate: 1.0`
  - `fully_covered_rate: 0.3333333333333333`
  - `average_source_recall: 0.7222222222222222`
  - `average_source_precision: 0.4166666666666667`
  - `average_reciprocal_rank: 0.6666666666666666`
  - `top_k: 1` pass rate dropping to `0.3333333333333333`
- `cargo build --manifest-path rust-cli/Cargo.toml`: passed
- focused retrieval-eval pytest slice: passed, `29` tests
- `make quality`: passed with `90` tests and `87.47%` total coverage
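For readers who want to interpret the report fields above, the following is a minimal sketch of how metrics of this shape are commonly computed. It is not the repository's implementation: `CaseMetrics`, `score_case`, and `summarize` are hypothetical names, the file paths are toy data, and treating a case as "passed" when at least one expected source appears in the top-k is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseMetrics:
    """Hypothetical per-case result; field names mirror the report, not the repo code."""
    missed_sources: tuple
    first_relevant_rank: Optional[int]
    reciprocal_rank: float
    source_recall: float
    source_precision: float
    fully_covered: bool


def score_case(expected: set, retrieved: list, top_k: int) -> CaseMetrics:
    """Score one benchmark case against the first top_k retrieved sources."""
    top = retrieved[:top_k]
    found = {src for src in top if src in expected}
    # 1-based rank of the first relevant source, or None if none was retrieved.
    first_rank = next(
        (rank for rank, src in enumerate(top, start=1) if src in expected), None
    )
    missed = tuple(sorted(expected - found))
    return CaseMetrics(
        missed_sources=missed,
        first_relevant_rank=first_rank,
        reciprocal_rank=1.0 / first_rank if first_rank else 0.0,
        source_recall=len(found) / len(expected) if expected else 0.0,
        source_precision=len(found) / len(top) if top else 0.0,
        fully_covered=bool(expected) and not missed,
    )


def summarize(cases: list) -> dict:
    """Average per-case metrics into a summary shaped like the report above."""
    if not cases:
        raise ValueError("at least one case is required")
    n = len(cases)
    return {
        # Assumption: a case passes when any expected source is retrieved.
        "pass_rate": sum(c.first_relevant_rank is not None for c in cases) / n,
        "fully_covered_rate": sum(c.fully_covered for c in cases) / n,
        "average_source_recall": sum(c.source_recall for c in cases) / n,
        "average_source_precision": sum(c.source_precision for c in cases) / n,
        "average_reciprocal_rank": sum(c.reciprocal_rank for c in cases) / n,
    }


if __name__ == "__main__":
    # Toy case: two expected sources, only one of them inside the top 2.
    case = score_case({"a.md", "b.md"}, ["x.md", "a.md", "y.md", "b.md"], top_k=2)
    print(case)
    print(summarize([case]))
```

Under these assumptions, a `pass_rate` of `1.0` alongside a lower `fully_covered_rate` simply means every case retrieved at least one expected source while only some cases retrieved all of them.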
Current Verification Status
Configured and verified in this turn:
- Compile checks: present and passed through `uv run python -m compileall src tests`
- Utility and baseline pytest slice: present and passed through `uv run pytest tests/test_utilities.py tests/test_repository_rag_bdd.py`
- Repository smoke test: present and passed through `uv run repo-rag smoke-test`
- Retrieval-quality suite utility: present and passed through `uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8`
- Rust build: present and passed through `cargo build --manifest-path rust-cli/Cargo.toml`
- Focused utility, CLI, benchmark, notebook-scaffolding, and verification tests: present and passed through the targeted `uv run pytest ...` slice above
- Lint, notebook lint, mypy, basedpyright, repository-surface verification, complexity, pytest, and coverage: present and passed through `make quality`
Still absent or not exercised in this turn:
- UI or browser tests: none found in the repository configuration
- Full notebook execution batch: notebook lint and notebook-surface checks passed, but `make notebook-report` was not rerun end-to-end in this turn
- Live Azure deployment or inference tests: not rerun in this turn
- Post-push GitHub Actions evidence: not yet available before the push for this change set
Notes
- The retrieval suite now reports missed sources, first relevant rank, reciprocal rank, source recall, source precision, and full-coverage rate per benchmark case in addition to the previous hit-rate summary.
- The new `retrieval-eval` surface gives the repository a stable, user-facing way to inspect retrieval quality without digging through notebook helpers or internal Python calls.
- Notebook scaffolding now exposes the top-k sweep summaries directly, so notebook consumers can compare retrieval quality at multiple cutoffs without reimplementing the evaluation loop inline.
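To illustrate what a top-k sweep compares, here is a small self-contained sketch. It is an assumption-laden illustration rather than the suite's code: `parse_cutoffs`, `pass_rate_at_k`, and `sweep` are hypothetical helpers, the cases are toy data, and "pass" is assumed to mean that at least one expected source appears within the cutoff.

```python
def parse_cutoffs(spec: str) -> list:
    """Parse a comma-separated cutoff list such as "1,2,4,8" (hypothetical helper)."""
    return [int(part) for part in spec.split(",") if part.strip()]


def pass_rate_at_k(cases: list, k: int) -> float:
    """Fraction of cases with at least one expected source in the first k results.

    Each case is an (expected_sources, ranked_retrieved_sources) pair.
    """
    if not cases:
        raise ValueError("at least one case is required")
    passed = sum(
        any(src in expected for src in retrieved[:k])
        for expected, retrieved in cases
    )
    return passed / len(cases)


def sweep(cases: list, cutoffs: list) -> dict:
    """Evaluate the same ranked results at several cutoffs without re-running retrieval."""
    return {k: pass_rate_at_k(cases, k) for k in cutoffs}


if __name__ == "__main__":
    # Toy data: the relevant source sits at rank 1, 2, and 4 respectively.
    toy_cases = [
        ({"a.md"}, ["a.md", "b.md"]),
        ({"c.md"}, ["x.md", "c.md"]),
        ({"d.md"}, ["x.md", "y.md", "z.md", "d.md"]),
    ]
    for k, rate in sweep(toy_cases, parse_cutoffs("1,2,4,8")).items():
        print(f"top_k={k}: pass_rate={rate:.4f}")
```

Because the sketch slices one ranked list per case, retrieval only needs to run once at the largest cutoff. Whether the actual suite works this way is not stated in this audit.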