- Add advisory_threshold=0.85 field to MetricRule (higher-is-better metrics)
- diagnose() now emits severity='low' for scores in (warning_threshold, advisory_threshold), i.e. above the warning threshold but below 0.85
- noise_sensitivity (lower-is-better) keeps its existing two-tier thresholds
- writer.py: severity labels mapped to Chinese (critical → 严重, warning → 警告, low → 待优化)
- llm_analyzer.py: prompt explains low/warning/critical tiers in Chinese
- Tests: 5 new cases for 'low' severity, updated log summary assertions
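
A minimal sketch of the three-tier check described above, for higher-is-better metrics only. Only advisory_threshold=0.85 and the 'low' band come from these notes; the critical_threshold and warning_threshold defaults, the boundary handling at each threshold, and the exact shape of MetricRule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricRule:
    # Hypothetical defaults; only advisory_threshold=0.85 is taken from the notes.
    critical_threshold: float = 0.5
    warning_threshold: float = 0.7
    advisory_threshold: float = 0.85


# Label mapping as described for writer.py.
SEVERITY_LABELS = {"critical": "严重", "warning": "警告", "low": "待优化"}


def diagnose(score: float, rule: MetricRule) -> Optional[str]:
    """Return a severity for a higher-is-better score, or None if it is healthy."""
    if score < rule.critical_threshold:
        return "critical"
    if score <= rule.warning_threshold:
        return "warning"
    if score < rule.advisory_threshold:
        # New advisory tier: strictly between warning_threshold and advisory_threshold.
        return "low"
    return None
```

With these assumed defaults, a score of 0.8 falls in the new 'low' tier, while 0.85 and above produce no finding.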
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gpt-5.4, gpt-5.5, gpt-5.2, gpt-5.4-mini, and gpt-5.4-nano are incompatible
with RAGAS 0.4.3 because they require max_completion_tokens instead of max_tokens.
gpt-5 and gpt-4.1 accept max_tokens and support the json_object mode that RAGAS requires.
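
One way to guard against picking an incompatible model is a small lookup like the sketch below. The model names are copied from the note above; the helper names and the idea of checking at configuration time are hypothetical and not part of RAGAS or the OpenAI client.

```python
# Models that, per the note above, reject max_tokens and need max_completion_tokens.
NEEDS_MAX_COMPLETION_TOKENS = {
    "gpt-5.4",
    "gpt-5.5",
    "gpt-5.2",
    "gpt-5.4-mini",
    "gpt-5.4-nano",
}


def is_ragas_compatible(model: str) -> bool:
    """Hypothetical check: True if the model accepts the max_tokens parameter."""
    return model not in NEEDS_MAX_COMPLETION_TOKENS


def token_limit_kwargs(model: str, limit: int) -> dict:
    """Return the token-limit keyword argument under the name the model expects."""
    key = "max_completion_tokens" if model in NEEDS_MAX_COMPLETION_TOKENS else "max_tokens"
    return {key: limit}
```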
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pipeline.py: log each metric score/timeout/error with sample_id,
elapsed time, and score value; log the list of NaN metrics per sample;
emit an N/total progress counter after each sample completes
- evaluator.py: log eval start, dataset counts, adapter enrichment
progress (per-sample OK/FAIL with elapsed), metric scoring summary,
and per-metric NaN rate at end of run
- runner.py: _setup_logging() helper writes to stderr + optional file;
ragas/httpx/openai noisy loggers throttled to WARNING
- main.py: add --log-file and --log-level CLI flags
Usage:
python main.py --scenario scenarios/online/... --log-file logs/eval.log --log-level DEBUG
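
A rough sketch of how the logging helper and the two flags could fit together, using only the standard library. The function names, the log format, and the flag defaults are assumptions; the actual _setup_logging() in runner.py may differ.

```python
import argparse
import logging
import sys
from typing import Optional


def setup_logging(level: str = "INFO", log_file: Optional[str] = None) -> None:
    """Log to stderr and, if a path is given, also to a file."""
    handlers = [logging.StreamHandler(sys.stderr)]
    if log_file:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace handlers installed by an earlier call
    )
    # Keep chatty third-party loggers at WARNING regardless of the root level.
    for noisy in ("ragas", "httpx", "openai"):
        logging.getLogger(noisy).setLevel(logging.WARNING)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario")
    parser.add_argument("--log-file", default=None)
    parser.add_argument("--log-level", default="INFO")
    return parser
```

Setting the level on the named loggers after basicConfig means --log-level DEBUG applies to the project's own loggers without flooding the output with HTTP client traces.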
Co-Authored-By: Claude <noreply@anthropic.com>
- question_generator.py: add a retry loop (max_retries=3, retry_delay=5s)
with exponential backoff on LLM timeouts or server errors; encode filenames
with encode('ascii', 'replace') before printing to avoid UnicodeEncodeError
on Windows cp1252 consoles
- runner.py: encode PDF filenames ASCII-safe for progress messages;
catch generation failures per-document and skip (or re-raise) based
on failure_mode, preventing one bad doc from aborting the whole build
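
The retry and filename handling above can be sketched roughly as follows. This assumes max_retries counts retries after the first attempt and that the delay doubles each time (5s, 10s, 20s); the function names and the exception types treated as retryable are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller decide to skip or abort
            sleep(retry_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


def ascii_safe(name: str) -> str:
    """Replace non-ASCII characters with '?' so printing cannot raise UnicodeEncodeError."""
    return name.encode("ascii", "replace").decode("ascii")
```

The sleep parameter is injectable so the backoff schedule can be checked in tests without waiting.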
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>