feat(advisor): add optimization advisor module

- rag_eval/advisor/: new package with rules engine, LLM analyzer, writer
  - rules.py: 7-metric diagnostic rules (warning/critical thresholds, top-3 low samples)
  - llm_analyzer.py: Chinese optimization report via judge_model, graceful fallback
  - writer.py: writes optimization_advice.md + log summary
  - __init__.py: run_advisor() entry point (no-op when optimization_advisor=False)
- Scenario.optimization_advisor: new bool field (default False)
- ScenarioModel: same field added, loader.py透传
- RunArtifactPaths.advice_md: new path field
- factory.py: build_models() now public; build_metric_pipeline() accepts pre-built llm/embeddings
- runner.py: lifts llm, passes to pipeline and advisor; calls run_advisor() at end
- siemens online YAML: optimization_advisor: true enabled
- tests: 9 rules tests + 6 writer tests, all pass
- docs: advisor section added to engine-flow.md and architecture.md

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
2026-06-16 17:06:19 +08:00
parent d68399d39b
commit f5c2dce64a
17 changed files with 2381 additions and 9 deletions

View File

@@ -316,11 +316,21 @@ adapter 层的目标是:**把不同类型的目标应用,统一成同一套
当前支持的指标包括:
核心检索 / 生成指标(始终可用):
- `faithfulness`
- `answer_relevancy`
- `context_recall`
- `context_precision`
鲁棒性 / 端到端指标(架构设计 §10.2,需数据集含 `ground_truth`
- `noise_sensitivity` —— 鲁棒性:对检索噪声的敏感度
- `factual_correctness` —— 端到端:回答相对标准答案的事实正确性
- `semantic_similarity` —— 端到端:回答与标准答案的语义相似度(基于 embedding无 LLM 调用)
所有指标都通过同一套装配点接入:`registry.py`(校验白名单)、`factory.py`(实例化)、`pipeline.py``ascore` 入参分发),新增指标只需在这三处对齐即可。
所以 metric pipeline 的职责可以总结为:
**把标准样本转换成结构化评分结果。**
@@ -414,3 +424,39 @@ main.py
- 可以把每次实验的资产稳定留住
这也是它和一次性离线脚本的根本区别。
---
## 15. Optimization Advisor 链路
相关代码:
- `rag_eval/advisor/__init__.py` — 外部入口 `run_advisor()`
- `rag_eval/advisor/rules.py` — 规则引擎(纯函数,无 LLM7 条指标诊断规则
- `rag_eval/advisor/llm_analyzer.py` — LLM 分析器(复用 judge_model llm 实例,失败自动降级)
- `rag_eval/advisor/writer.py` — 写出 `optimization_advice.md` + 日志摘要
Advisor 在 `write_run_artifacts()` 之后触发,仅当场景配置 `optimization_advisor: true` 时生效,默认关闭。
执行链路:
```text
run_advisor(result, scenario, llm)
-> rules.diagnose(score_rows, metrics) # 识别异常指标,选取 top-3 低分样本
-> llm_analyzer.analyze(diagnoses, llm) # LLM 生成中文建议(失败自动降级为纯规则报告)
-> writer.write_advice(...) # 写 optimization_advice.md + 日志摘要
```
输出产物追加在现有 run 目录:
```text
outputs/online/siemens-pdf-question-bank/<run_id>/
scenario.snapshot.yaml
scores.csv
invalid.csv
summary.md
metadata.json
optimization_advice.md <- 新增optimization_advisor: true 时生成)
```
规则引擎对 7 个指标各自设 warning / critical 双档阈值,`noise_sensitivity` 为"越低越好"(方向相反)。所有诊断均附带 top-3 低分样本,喂给 LLM 生成针对具体内容的中文建议。