Commit Graph

13 Commits

Author SHA1 Message Date
wangwei
5b60ed12ea feat: add LLM role-assignment panel to 新建评估 view 2026-06-16 16:27:00 +08:00
wangwei
dc8baf8662 feat: add LLM配置 management page (profiles view) 2026-06-16 16:25:20 +08:00
wangwei
e329f59139 feat: add yaml_patcher service to apply LLM profiles to scenario YAML 2026-06-16 16:21:19 +08:00
wangwei
b19054bd66 feat: add /api/llm-profiles CRUD router 2026-06-16 16:18:40 +08:00
wangwei
5d09deb420 feat: add ProfileManager service with JSON persistence 2026-06-16 16:14:31 +08:00
wangwei
b98af29449 feat: add LLMProfile pydantic models 2026-06-16 16:10:37 +08:00
wangwei
4173a40d93 feat(scripts): add run_eval.bat / run_eval.ps1 evaluation launcher scripts
Both scripts support:
  - Shortcut args: online (default), offline, or any custom .yaml path
  - Second arg: log level (DEBUG/INFO/WARNING/ERROR), default INFO
  - Auto-timestamped log file saved to logs\eval_<date>_<time>.log
  - Sets PYTHONIOENCODING=utf-8 and PYTHONPATH=. automatically
  - Friendly error/success banners with log file path

Usage:
  run_eval.bat                    # online eval
  run_eval.bat offline DEBUG      # offline eval with DEBUG logs
  .\run_eval.ps1 online DEBUG     # PowerShell equivalent

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-16 11:16:53 +08:00
wangwei
629304aa6d feat(logging): add structured evaluation logs for metric-level debugging
- pipeline.py: log each metric score/timeout/error with sample_id,
  elapsed time, and score value; log NaN list per sample; progress
  counter N/total after each sample completes
- evaluator.py: log eval start, dataset counts, adapter enrichment
  progress (per-sample OK/FAIL with elapsed), metric scoring summary,
  and per-metric NaN rate at end of run
- runner.py: _setup_logging() helper writes to stderr + optional file;
  ragas/httpx/openai noisy loggers throttled to WARNING
- main.py: add --log-file and --log-level CLI flags

Usage:
  python main.py --scenario scenarios/online/... --log-file logs/eval.log --log-level DEBUG

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-16 10:48:41 +08:00
wangwei
1ff4a3943a feat(dataset-builder): add retry logic and ASCII-safe logging for Siemens PDF pipeline
- question_generator.py: add max_retries=3/retry_delay=5s loop with
  exponential backoff on LLM timeout or server errors; encode filenames
  with ascii/replace before printing to avoid UnicodeEncodeError on
  Windows cp1252 consoles
- runner.py: encode PDF filenames ASCII-safe for progress messages;
  catch generation failures per-document and skip (or re-raise) based
  on failure_mode, preventing one bad doc from aborting the whole build

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-15 23:06:33 +08:00
wangwei
75ae7927ad Add Siemens CT document evaluation scenario (three-step pipeline)
- scenarios/siemens_build/siemens-pdf-build.yaml: dataset build for all 17
  Siemens medical-imaging PDFs (aliyun_docmind parser, 10 questions/doc,
  failure_mode=skip, ~170 question total)
- scenarios/offline/siemens-pdf-offline-smoke.yaml: offline evaluation using
  source chunks as contexts and ground_truth as answer (up to 30 samples)
- scenarios/online/siemens-pdf-question-bank-online.yaml: online evaluation
  calling siemens_pdf_qa adapter, batch_size=4, up to 50 samples
- apps/siemens_pdf_qa/adapter.py: Siemens-specific adapter with bilingual
  (zh/en) system prompt and strict evidence-grounding for CT domain
- scripts/build_siemens_offline_smoke.py: helper to derive offline smoke CSV
  from completed dataset build artifacts (run after dataset build step)
- docs/superpowers/specs/2026-06-15-siemens-scenario-design.md: design spec

All three scenarios are automatically discovered by the web console.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
2026-06-15 17:00:52 +08:00
wangwei
1288a366d1 Fix start.bat (ASCII-only, guaranteed window stays open) + add start.ps1
- start.bat: remove all Chinese characters (caused silent failure when
  Windows batch parser ran before chcp 65001 took effect); add :error
  label so window stays open with pause on any failure
- start.ps1: PowerShell alternative launcher with coloured output,
  works without worrying about cmd.exe encoding issues

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
2026-06-15 16:14:53 +08:00
wangwei
e89695e490 Add RAGAS evaluation web console (FastAPI + vanilla JS)
- webapp/: FastAPI backend with runs/scenarios/evaluations API routers;
  services for run_reader, report_builder, scenario_scanner, task_manager
  (lazy ragas import — server boots even without ragas); Pydantic models
- webapp/static/: single-page console (layout A: left-nav + main area);
  report detail with metric cards, Chart.js distribution histogram,
  grouping table, lowest-score sample review; trigger evaluation + log polling
- webmain.py: uvicorn entry point (alongside existing main.py CLI)
- start.bat: Windows one-click launcher with env checks and auto-browser open
- rag_eval/datasets/: implement missing loader + normalizer modules
  (load_dataset_records, normalize_records) required by evaluator
- scripts/seed_sample_run.py: generate realistic demo run artifacts
- .gitignore: exclude datasets/ data files but keep rag_eval/datasets/ source

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
2026-06-15 15:53:57 +08:00
Guangfei.Zhao
9cbdc1d95d first commit 2026-06-12 14:02:15 +08:00