Retrieval Evaluation Suite Audit

  • Audit date: 2026-03-18 (Asia/Tbilisi)
  • Repository root: /home/standard/dspy_rag_in_repo_docs_and_impl1
  • Git HEAD during verification: 798f5da77d6badb99f5f284afa834745aa572e53

Scope

This audit captures the new retrieval-quality evaluation suite: richer per-case metrics in the benchmark layer, a top-k sweep, a repo-facing retrieval-eval utility exposed through both the CLI and Makefile, notebook-scaffolding propagation of the richer summaries, and the matching tests plus docs.

Executed Commands

Executed successfully in this turn:

  • make hooks-install
  • uv run python -m compileall src tests
  • uv run pytest tests/test_utilities.py tests/test_repository_rag_bdd.py
  • uv run repo-rag smoke-test
  • uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8
  • cargo build --manifest-path rust-cli/Cargo.toml
  • uv run pytest tests/test_utilities.py tests/test_cli_and_dspy.py tests/test_benchmarks_and_notebook_scaffolding.py tests/test_verification.py
  • make quality

Notable Results

  • make hooks-install: passed and refreshed the managed pre-commit and pre-push hooks
  • uv run python -m compileall src tests: passed
  • uv run pytest tests/test_utilities.py tests/test_repository_rag_bdd.py: passed, 9 tests
  • uv run repo-rag smoke-test: passed with answer_contains_repository: true, mcp_candidate_count: 1, and manifest_path: artifacts/azure/repo-rag-smoke.json
  • uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8: passed and produced a retrieval-quality report with:
      • default top_k: 4
      • pass_rate: 1.0
      • fully_covered_rate: 0.3333333333333333
      • average_source_recall: 0.7222222222222222
      • average_source_precision: 0.4166666666666667
      • average_reciprocal_rank: 0.6666666666666666
      • sweep entry for top_k 1: pass_rate drops to 0.3333333333333333
  • cargo build --manifest-path rust-cli/Cargo.toml: passed
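  • uv run pytest tests/test_utilities.py tests/test_cli_and_dspy.py tests/test_benchmarks_and_notebook_scaffolding.py tests/test_verification.py (focused retrieval-eval slice): passed, 29 tests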
  • make quality: passed with 90 tests and 87.47% total coverage
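
To make the sweep result above easier to read, the following is a minimal sketch of how a pass rate and a full-coverage rate could change with the cutoff. It is not the repository's implementation: the case data, the sweep function, and the definitions used here (a case "passes" when at least one expected source appears in the top k, and is "fully covered" when all of them do) are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical benchmark cases (not taken from the repository):
# each has a ranked list of retrieved sources and a set of expected sources.
CASES = [
    {"ranked": ["README.md", "docs/a.md", "src/x.py"],
     "expected": {"README.md"}},
    {"ranked": ["src/y.py", "docs/b.md", "README.md"],
     "expected": {"docs/b.md", "src/z.py"}},
    {"ranked": ["docs/c.md", "src/c.py", "docs/a.md"],
     "expected": {"src/c.py", "docs/c.md"}},
]


def sweep(cases, cutoffs):
    """Return {k: {"pass_rate": ..., "fully_covered_rate": ...}} per cutoff.

    Assumed definitions: a case passes if any expected source is in the
    top k results, and is fully covered if every expected source is.
    """
    report = {}
    for k in cutoffs:
        passed, covered = [], []
        for case in cases:
            found = set(case["ranked"][:k]) & case["expected"]
            passed.append(1.0 if found else 0.0)
            covered.append(1.0 if found == case["expected"] else 0.0)
        report[k] = {
            "pass_rate": mean(passed),
            "fully_covered_rate": mean(covered),
        }
    return report


if __name__ == "__main__":
    for k, row in sweep(CASES, [1, 2, 3]).items():
        print(k, row)
```

With this sample data the pass rate is lower at a cutoff of 1 than at 2 or 3, because one case only surfaces its first relevant source at rank 2. The same mechanism would explain a lower pass rate at top_k 1 than at the default cutoff.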

Current Verification Status

Configured and verified in this turn:

  • Compile checks: present and passed through uv run python -m compileall src tests
  • Utility and baseline pytest slice: present and passed through uv run pytest tests/test_utilities.py tests/test_repository_rag_bdd.py
  • Repository smoke test: present and passed through uv run repo-rag smoke-test
  • Retrieval-quality suite utility: present and passed through uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8
  • Rust build: present and passed through cargo build --manifest-path rust-cli/Cargo.toml
  • Focused utility, CLI, benchmark, notebook-scaffolding, and verification tests: present and passed through the targeted uv run pytest ... slice above
  • Lint, notebook lint, mypy, basedpyright, repository-surface verification, complexity, pytest, and coverage: present and passed through make quality

Still absent or not exercised in this turn:

  • UI or browser tests: none found in the repository configuration
  • Full notebook execution batch: notebook lint and notebook-surface checks passed, but make notebook-report was not rerun end-to-end in this turn
  • Live Azure deployment or inference tests: not rerun in this turn
  • Post-push GitHub Actions evidence: not available, because this change set had not yet been pushed when the audit was written

Notes

  • The retrieval suite now reports missed sources, first relevant rank, reciprocal rank, source recall, source precision, and full-coverage rate per benchmark case in addition to the previous hit-rate summary.
  • The new retrieval-eval surface gives the repository a stable, user-facing way to inspect retrieval quality without digging through notebook helpers or internal Python calls.
  • Notebook scaffolding now exposes the top-k sweep summaries directly, so notebook consumers can compare retrieval quality at multiple cutoffs without reimplementing the evaluation loop inline.
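
The per-case metrics named in the notes above can be illustrated with a short sketch. It assumes the conventional definitions (recall as found expected sources over all expected sources, precision as found expected sources over retrieved sources, reciprocal rank as 1 divided by the rank of the first relevant result) and uses a hypothetical case_metrics helper and made-up paths. The suite's actual function names, field names, and edge-case handling may differ.

```python
def case_metrics(ranked, expected, top_k):
    """Compute illustrative retrieval metrics for one benchmark case.

    ranked:   retrieved source identifiers, best first (assumed unique)
    expected: set of source identifiers that should be retrieved
    top_k:    cutoff applied to the ranked list
    """
    top = ranked[:top_k]
    # 1-based ranks of every relevant result inside the cutoff.
    relevant_ranks = [i for i, src in enumerate(top, start=1) if src in expected]
    found = {src for src in top if src in expected}
    first_rank = relevant_ranks[0] if relevant_ranks else None
    return {
        "missed_sources": sorted(set(expected) - found),
        "first_relevant_rank": first_rank,
        "reciprocal_rank": 1.0 / first_rank if first_rank else 0.0,
        "source_recall": len(found) / len(expected) if expected else 0.0,
        "source_precision": len(found) / len(top) if top else 0.0,
        "fully_covered": found == set(expected),
    }


if __name__ == "__main__":
    # Hypothetical case: one of two expected sources is retrieved at rank 2.
    metrics = case_metrics(
        ranked=["src/a.py", "README.md", "docs/guide.md", "src/b.py"],
        expected={"README.md", "docs/missing.md"},
        top_k=4,
    )
    for name, value in metrics.items():
        print(f"{name}: {value}")
```

In this example the case would count as a hit, since a relevant source appears at rank 2, but it is not fully covered: one expected source is missed, so recall is 0.5 and precision is 0.25. This is the kind of distinction the richer summaries make visible beyond a plain hit rate.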