Repository Benchmark Broadening Audit¶

Audit date: 2026-03-18 (Asia/Tbilisi)
Repository root: /tmp/repo-benchmark-broaden-D0oBUc
Verification branch: benchmark-broaden-2

Scope¶

This audit captures an expansion of the repository retrieval benchmark set from the original three-case starter suite to an eight-case suite. The new checked-in examples now cover package API notes, Azure AI Inference runtime guidance, MCP discovery notes, notebook execution/reporting, and publication build guidance in addition to the existing repo-overview, inspired-docs, and agent-utility questions. The notebook snapshots that present the benchmark suite were refreshed in the same turn so the checked-in playbooks reflect the broader coverage.

Executed Commands¶

Executed successfully in this turn:

make hooks-install
uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8
uv run pytest tests/test_training_samples.py tests/test_benchmarks_and_notebook_scaffolding.py tests/test_utilities.py tests/test_cli_and_dspy.py
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/02_agent_workflow_checklist.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/03_dspy_training_lab.ipynb
cargo build --manifest-path rust-cli/Cargo.toml
make quality

Notable Results¶

uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8: passed and reported
benchmark_count: 8
default_top_k: 4
pass_rate: 1.0
fully_covered_rate: 1.0
average_source_recall: 1.0
average_source_precision: 0.46875
average_reciprocal_rank: 1.0
best_pass_rate_top_k: 4
Focused benchmark/training/utility/CLI pytest slice: passed, 36 tests
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/02_agent_workflow_checklist.ipynb: passed
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/03_dspy_training_lab.ipynb: passed
cargo build --manifest-path rust-cli/Cargo.toml: passed
make quality: passed with 119 tests and 87.98% total coverage

Current Verification Status¶

Configured and verified in this turn:

Expanded repository training-sample benchmark suite: present and passing through uv run repo-rag retrieval-eval --root . --top-k 4 --top-k-sweep 1,2,4,8
Training-sample, benchmark, utility, and CLI tests: present and passing through the focused uv run pytest ... slice above
Notebook snapshots for the benchmark-driven agent and training playbooks: present and refreshed through the two uv run jupyter nbconvert --execute --inplace ... commands above
Rust wrapper build: present and passing through cargo build --manifest-path rust-cli/Cargo.toml
Full Python quality gate: present and passing through make quality

Still absent or not exercised in this turn:

Full notebook batch execution across all tracked notebooks: not rerun in this turn
Live Azure OpenAI or Azure AI Inference probes: not rerun in this turn
Post-push GitHub Actions evidence: not yet available before the push for this change set

Notes¶

The repository benchmark file samples/training/repository_training_examples.yaml now acts as a broader regression surface instead of a minimal starter trio.
The new benchmark examples intentionally target repo surfaces that the current lexical retriever already ranks consistently at top_k=4, so the suite expands coverage without introducing synthetic red failures.
notebooks/02_agent_workflow_checklist.ipynb and notebooks/03_dspy_training_lab.ipynb were re-executed so their checked-in outputs now show the eight-case suite instead of the earlier three-case snapshot.