feat(advisor): add optimization advisor module

- rag_eval/advisor/: new package with rules engine, LLM analyzer, writer
- rules.py: 7-metric diagnostic rules (warning/critical thresholds, top-3 low samples)
- llm_analyzer.py: Chinese optimization report via judge_model, graceful fallback
- writer.py: writes optimization_advice.md + log summary
- __init__.py: run_advisor() entry point (no-op when optimization_advisor=False)
- Scenario.optimization_advisor: new bool field (default False)
- ScenarioModel: same field added; loader.py passes it through
- RunArtifactPaths.advice_md: new path field
- factory.py: build_models() now public; build_metric_pipeline() accepts pre-built llm/embeddings
- runner.py: lifts llm, passes to pipeline and advisor; calls run_advisor() at end
- siemens online YAML: optimization_advisor: true enabled
- tests: 9 rules tests + 6 writer tests, all pass
- docs: advisor section added to engine-flow.md and architecture.md

Co-Authored-By: Claude <noreply@anthropic.com>

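@@ -316,11 +316,21 @@ The goal of the adapter layer is: **to unify different types of target applications into a single

Currently supported metrics:

Core retrieval / generation metrics (always available):

- `faithfulness`
- `answer_relevancy`
- `context_recall`
- `context_precision`

Robustness / end-to-end metrics (architecture design §10.2; these require the dataset to include `ground_truth`):

- `noise_sensitivity`: robustness, i.e. sensitivity to retrieval noise
- `factual_correctness`: end-to-end, factual correctness of the answer relative to the reference answer
- `semantic_similarity`: end-to-end, semantic similarity between the answer and the reference answer (embedding-based, no LLM calls)

All metrics are wired in through the same three assembly points: `registry.py` (whitelist validation), `factory.py` (instantiation), and `pipeline.py` (dispatch of `ascore` arguments). Adding a new metric only requires keeping these three places aligned.
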
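
To make the "three assembly points" idea concrete, here is a minimal sketch of two of them: a whitelist check and per-metric argument dispatch. It is not the project's code. `SUPPORTED_METRICS`, `REQUIRES_GROUND_TRUTH`, `ASCORE_FIELDS`, `validate_metrics`, `select_ascore_kwargs`, and the sample field names are all hypothetical stand-ins; only the metric names and the `ground_truth` requirement come from the text above. Instantiation (`factory.py`) is left out.

```python
from typing import Any, Dict, List

# Hypothetical whitelist, standing in for what registry.py validates against.
SUPPORTED_METRICS = {
    "faithfulness", "answer_relevancy", "context_recall", "context_precision",
    "noise_sensitivity", "factual_correctness", "semantic_similarity",
}

# Metrics the docs describe as needing ground_truth in the dataset.
REQUIRES_GROUND_TRUTH = {
    "noise_sensitivity", "factual_correctness", "semantic_similarity",
}

# Hypothetical argument map, standing in for the dispatch done in pipeline.py.
# The field names are illustrative, not the project's real sample schema.
ASCORE_FIELDS = {
    "faithfulness": ("question", "answer", "contexts"),
    "semantic_similarity": ("answer", "ground_truth"),
}


def validate_metrics(names: List[str], has_ground_truth: bool) -> List[str]:
    """Reject unknown metrics and metrics whose required data is missing."""
    unknown = [n for n in names if n not in SUPPORTED_METRICS]
    if unknown:
        raise ValueError(f"unsupported metrics: {unknown}")
    blocked = [n for n in names if n in REQUIRES_GROUND_TRUTH and not has_ground_truth]
    if blocked:
        raise ValueError(f"metrics need ground_truth: {blocked}")
    return list(names)


def select_ascore_kwargs(metric_name: str, sample: Dict[str, Any]) -> Dict[str, Any]:
    """Pick only the sample fields that the given metric scores on."""
    return {field: sample[field] for field in ASCORE_FIELDS[metric_name]}
```

Under this layout, a new metric touches one set entry and one map entry (plus a constructor in the factory), which is the kind of alignment the paragraph above describes.
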

So the responsibility of the metric pipeline can be summarized as:

**Convert standard samples into structured scoring results.**

@@ -414,3 +424,39 @@ main.py

- Assets from every experiment are retained reliably

This is also what fundamentally distinguishes it from a one-off offline script.

---

## 15. Optimization Advisor Flow

Related code:

- `rag_eval/advisor/__init__.py`: external entry point `run_advisor()`
- `rag_eval/advisor/rules.py`: rules engine (pure functions, no LLM) with 7 metric diagnostic rules
- `rag_eval/advisor/llm_analyzer.py`: LLM analyzer (reuses the `judge_model` LLM instance; degrades automatically on failure)
- `rag_eval/advisor/writer.py`: writes `optimization_advice.md` plus a log summary

The advisor runs after `write_run_artifacts()` and only takes effect when the scenario config sets `optimization_advisor: true`. It is off by default.

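
The opt-in behavior can be pictured as an early return on the scenario flag. The sketch below is illustrative only: `ScenarioSketch` and `run_advisor_sketch` are hypothetical stand-ins with simplified signatures, not the real `Scenario` or `run_advisor()`. The one detail taken from the source is that the flag defaults to `False` and the advisor is a no-op in that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScenarioSketch:
    """Hypothetical stand-in for the scenario config."""
    name: str
    optimization_advisor: bool = False  # off by default, as in the source


def run_advisor_sketch(scenario: ScenarioSketch, score_rows: List[dict]) -> Optional[str]:
    """Return None (no-op) unless the scenario opted in."""
    if not scenario.optimization_advisor:
        return None
    # Placeholder for the real diagnose -> analyze -> write chain.
    return f"advice for {len(score_rows)} scored samples"
```

Keeping the gate inside the entry point means the caller can invoke it unconditionally at the end of a run.
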
Execution flow:

```text
run_advisor(result, scenario, llm)
  -> rules.diagnose(score_rows, metrics)      # flag abnormal metrics, pick the top-3 low-scoring samples
  -> llm_analyzer.analyze(diagnoses, llm)     # LLM writes Chinese advice (falls back to a rules-only report on failure)
  -> writer.write_advice(...)                 # write optimization_advice.md + log summary
```

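
The "falls back to a rules-only report on failure" step might look roughly like the sketch below. It assumes the LLM is any callable that takes a prompt string and returns text. `render_rules_report`, `analyze_with_fallback`, the prompt wording, and the diagnosis dict keys are all hypothetical, not the signatures used in `llm_analyzer.py`.

```python
from typing import Callable, List, Optional


def render_rules_report(diagnoses: List[dict]) -> str:
    """Plain report built from rule output only; needs no LLM."""
    lines = ["# Optimization advice (rules only)"]
    for d in diagnoses:
        lines.append(f"- {d['metric']}: {d['level']} (mean={d['mean']:.2f})")
    return "\n".join(lines)


def analyze_with_fallback(
    diagnoses: List[dict],
    llm: Optional[Callable[[str], str]],
) -> str:
    """Try the LLM first; on any failure, return the rules-only report."""
    rules_report = render_rules_report(diagnoses)
    if llm is None or not diagnoses:
        return rules_report
    prompt = "Suggest optimizations for these diagnoses:\n" + rules_report
    try:
        advice = llm(prompt)
    except Exception:
        # Network errors, rate limits, etc. must not break the run.
        return rules_report
    if not advice or not advice.strip():
        return rules_report
    return advice
```

Because the rules report is built before the LLM call, a failure costs nothing extra and the writer always has something to write.
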
The output artifact is added to the existing run directory:

```text
outputs/online/siemens-pdf-question-bank/<run_id>/
  scenario.snapshot.yaml
  scores.csv
  invalid.csv
  summary.md
  metadata.json
  optimization_advice.md   <- new (generated when optimization_advisor: true)
```

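The rules engine defines two threshold tiers, warning and critical, for each of the 7 metrics. `noise_sensitivity` is "lower is better", the opposite direction from the other metrics. Every diagnosis carries the top-3 low-scoring samples, which are fed to the LLM so that it can produce Chinese advice targeted at the specific content.
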
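
A two-tier threshold rule with a direction flag can be sketched as follows. The threshold values are placeholders, and classifying on the mean score is an assumption; the source does not state either. `Rule`, `RULES`, `classify`, and this version of `diagnose` are hypothetical and only mirror the behavior described above, including the reversed direction for `noise_sensitivity`.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    warning: float
    critical: float
    higher_is_better: bool = True


# Placeholder thresholds, for illustration only.
RULES: Dict[str, Rule] = {
    "faithfulness": Rule(warning=0.7, critical=0.5),
    "noise_sensitivity": Rule(warning=0.3, critical=0.5, higher_is_better=False),
}


def classify(value: float, rule: Rule) -> Optional[str]:
    """Map a score to 'critical', 'warning', or None (healthy)."""
    if rule.higher_is_better:
        if value < rule.critical:
            return "critical"
        if value < rule.warning:
            return "warning"
    else:
        if value > rule.critical:
            return "critical"
        if value > rule.warning:
            return "warning"
    return None


def diagnose(score_rows: List[dict], metrics: List[str], top_n: int = 3) -> List[dict]:
    """Flag metrics whose mean crosses a threshold and attach the worst samples."""
    diagnoses = []
    for metric in metrics:
        rule = RULES.get(metric)
        scored = [r for r in score_rows if r.get(metric) is not None]
        if rule is None or not scored:
            continue
        avg = mean(r[metric] for r in scored)
        level = classify(avg, rule)
        if level is None:
            continue
        # "Worst" means lowest scores normally, highest when lower is better.
        worst = sorted(scored, key=lambda r: r[metric],
                       reverse=not rule.higher_is_better)[:top_n]
        diagnoses.append({"metric": metric, "level": level, "mean": avg, "samples": worst})
    return diagnoses
```

Keeping the direction on the rule itself lets one code path handle both kinds of metric, so the sample selection flips together with the comparison.
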