- app.js: hash-based router (#runs / #new / #profiles / #report/{runId})
- navigate() pushes history entries for back/forward support
- _restoreSession() reads hash on load and popstate
- sessionStorage fallback for same-tab refreshes
- run-card highlights selected run (.run-card.selected)
- runner.js: use App.navigate() for report redirect; persist lastRunId to sessionStorage
- index.html: report nav button starts disabled (enabled on run select/restore)
- app.css: .run-card.selected with petrol border + ring
Co-Authored-By: Claude <noreply@anthropic.com>
- pipeline.py: log each metric score/timeout/error with sample_id,
elapsed time, and score value; log NaN list per sample; progress
counter N/total after each sample completes
- evaluator.py: log eval start, dataset counts, adapter enrichment
progress (per-sample OK/FAIL with elapsed), metric scoring summary,
and per-metric NaN rate at end of run
- runner.py: _setup_logging() helper writes to stderr + optional file;
ragas/httpx/openai noisy loggers throttled to WARNING
- main.py: add --log-file and --log-level CLI flags
Usage:
python main.py --scenario scenarios/online/... --log-file logs/eval.log --log-level DEBUG
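A minimal sketch of how the logging setup and the two CLI flags described above might fit together. This is not the code from runner.py or main.py: the function names `setup_logging` and `parse_args`, the log format string, and the flag defaults are assumptions made for illustration. Only the flag names, the stderr + optional file behaviour, and the three throttled loggers come from the notes above.

```python
import argparse
import logging
import sys
from typing import List, Optional

# Third-party loggers named in the commit notes; capped at WARNING.
NOISY_LOGGERS = ("ragas", "httpx", "openai")


def setup_logging(level: str = "INFO", log_file: Optional[str] = None) -> None:
    """Log to stderr and, if log_file is given, to that file as well."""
    handlers: List[logging.Handler] = [logging.StreamHandler(sys.stderr)]
    if log_file:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",  # assumed format
        handlers=handlers,
        force=True,  # replace handlers installed by an earlier call
    )
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags shown in the usage line; defaults are assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario", required=True)
    parser.add_argument("--log-file", default=None)
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    )
    return parser.parse_args(argv)


# Typical wiring at program start:
#   args = parse_args()
#   setup_logging(args.log_level, args.log_file)
```

Setting the level on the named third-party loggers, rather than filtering on the handlers, keeps DEBUG output from the project's own modules while the libraries stay quiet.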
Co-Authored-By: Claude <noreply@anthropic.com>
- question_generator.py: add retry loop (max_retries=3, retry_delay=5s)
with exponential backoff on LLM timeouts or server errors; encode
filenames with ascii/replace before printing to avoid
UnicodeEncodeError on Windows cp1252 consoles
- runner.py: encode PDF filenames ASCII-safe for progress messages;
catch generation failures per-document and skip (or re-raise) based
on failure_mode, preventing one bad doc from aborting the whole build
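The three behaviours above (retry with backoff, ASCII-safe filenames, per-document skip) can be sketched roughly as follows. This is an illustration, not the code in question_generator.py or runner.py: the helper names `ascii_safe`, `call_with_retry`, and `build_all` are hypothetical, as are the choice of retryable exception types and the reading of max_retries as "retries after the first attempt".

```python
import time
from typing import Callable, Dict, Iterable, Tuple, Type, TypeVar

T = TypeVar("T")


def ascii_safe(text: str) -> str:
    """Replace non-ASCII characters with '?' so cp1252 consoles can print it."""
    return text.encode("ascii", "replace").decode("ascii")


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given errors with a doubling delay."""
    attempt = 0
    while True:
        try:
            return fn()
        except retry_on:
            if attempt >= max_retries:
                raise  # retries exhausted; surface the last error
            sleep(retry_delay * 2 ** attempt)  # 5s, 10s, 20s with the defaults
            attempt += 1


def build_all(
    paths: Iterable[str],
    generate: Callable[[str], T],
    failure_mode: str = "skip",
) -> Dict[str, T]:
    """Generate per document; skip failures or re-raise, per failure_mode."""
    results: Dict[str, T] = {}
    for path in paths:
        try:
            results[path] = generate(path)
        except Exception as exc:
            if failure_mode != "skip":
                raise
            print(f"skipped {ascii_safe(path)}: {ascii_safe(str(exc))}")
    return results
```

Passing `sleep` as a parameter is only a convenience for testing the backoff schedule without waiting; a real caller would leave it at the default.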
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- scenarios/siemens_build/siemens-pdf-build.yaml: dataset build for all 17
Siemens medical-imaging PDFs (aliyun_docmind parser, 10 questions/doc,
failure_mode=skip, ~170 questions total)
- scenarios/offline/siemens-pdf-offline-smoke.yaml: offline evaluation using
source chunks as contexts and ground_truth as answer (up to 30 samples)
- scenarios/online/siemens-pdf-question-bank-online.yaml: online evaluation
calling siemens_pdf_qa adapter, batch_size=4, up to 50 samples
- apps/siemens_pdf_qa/adapter.py: Siemens-specific adapter with bilingual
(zh/en) system prompt and strict evidence-grounding for CT domain
- scripts/build_siemens_offline_smoke.py: helper to derive offline smoke CSV
from completed dataset build artifacts (run after dataset build step)
- docs/superpowers/specs/2026-06-15-siemens-scenario-design.md: design spec
All three scenarios are automatically discovered by the web console.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- start.bat: remove all Chinese characters (they caused a silent failure
when the Windows batch parser ran before chcp 65001 took effect); add
:error label so the window stays open with pause on any failure
- start.ps1: PowerShell alternative launcher with coloured output;
avoids the cmd.exe encoding issues
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>