feat(advisor): add optimization advisor module

- rag_eval/advisor/: new package with rules engine, LLM analyzer, writer
- rules.py: 7-metric diagnostic rules (warning/critical thresholds, top-3 low samples)
- llm_analyzer.py: Chinese optimization report via judge_model, graceful fallback
- writer.py: writes optimization_advice.md + log summary
- __init__.py: run_advisor() entry point (no-op when optimization_advisor=False)
- Scenario.optimization_advisor: new bool field (default False)
- ScenarioModel: same field added; loader.py passes it through
- RunArtifactPaths.advice_md: new path field
- factory.py: build_models() now public; build_metric_pipeline() accepts pre-built llm/embeddings
- runner.py: lifts llm, passes to pipeline and advisor; calls run_advisor() at end
- siemens online YAML: optimization_advisor: true enabled
- tests: 9 rules tests + 6 writer tests, all pass
- docs: advisor section added to engine-flow.md and architecture.md

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -318,6 +318,10 @@ metrics:
   - answer_relevancy
   - context_recall
   - context_precision
+  # Optional: robustness / end-to-end metrics (dataset must include ground_truth); full list in §9.4
+  # - noise_sensitivity
+  # - factual_correctness
+  # - semantic_similarity
 output_dir: runs/legal-assistant-offline-baseline
 runtime:
   batch_size: 4
@@ -338,7 +342,7 @@ runtime:
 - `embedding_model`
   - The model used for embedding-based metrics
 - `metrics`
-  - The list of metrics enabled for this run
+  - The list of metrics enabled for this run (see §9.4 for all options and their dependencies)
 - `output_dir`
   - Output directory for this run's results
 - `runtime.batch_size`
@@ -399,6 +403,32 @@ app_adapter:
 - embedding model
 - metric instances

+Currently supported metrics (`SUPPORTED_METRICS` in `rag_eval/metrics/registry.py`):
+
+| Metric | Layer | Dependencies |
+|---|---|---|
+| `faithfulness` | Generation | judge model |
+| `answer_relevancy` | Generation | judge model + embedding |
+| `context_recall` | Retrieval | judge model + ground_truth |
+| `context_precision` | Retrieval | judge model + ground_truth |
+| `noise_sensitivity` | Robustness | judge model + ground_truth |
+| `factual_correctness` | End-to-end | judge model + ground_truth |
+| `semantic_similarity` | End-to-end | embedding + ground_truth (no LLM call) |
+
+Every metric that lists `ground_truth` above is scored against the reference answer, so the dataset must provide that field. New metrics are wired up consistently in three places: `registry.py`, `factory.py`, and `pipeline.py`.
+
+**Optimization Advisor (implements the §11 optimization strategy):**
+
+After an evaluation finishes, if the scenario sets `optimization_advisor: true`, the `rag_eval/advisor/` module runs automatically:
+- Rules engine (`rules.py`): applies per-metric thresholds to the 7 metrics, flags the ones that trigger, and picks the top-3 worst-scoring samples
+- LLM analyzer (`llm_analyzer.py`): uses the low-scoring samples to write Chinese Markdown optimization advice (reuses judge_model; falls back to a rules-only report on failure)
+- Writer (`writer.py`): writes `optimization_advice.md` and logs a summary
+
+```yaml
+# Scenario config example
+optimization_advisor: true
+```
+
 ### 9.5 Concurrency control

 The execution layer owns the concurrency limit; concurrency policy is not scattered across individual metric implementations.

@@ -316,11 +316,21 @@ adapter 层的目标是:**把不同类型的目标应用,统一成同一套
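
 Currently supported metrics:

+Core retrieval / generation metrics (always available):
+
 - `faithfulness`
 - `answer_relevancy`
 - `context_recall`
 - `context_precision`

+Robustness / end-to-end metrics (architecture design §10.2; the dataset must include `ground_truth`):
+
+- `noise_sensitivity`: robustness, i.e. sensitivity to retrieval noise
+- `factual_correctness`: end-to-end, factual correctness of the answer against the reference answer
+- `semantic_similarity`: end-to-end, semantic similarity between the answer and the reference answer (embedding-based, no LLM call)
+
+All metrics plug in through the same three assembly points: `registry.py` (allow-list validation), `factory.py` (instantiation), and `pipeline.py` (dispatch of `ascore` arguments). A new metric only needs to be wired up in these three places.
+
 So the metric pipeline's job can be summarized as:

 **Turn standardized samples into structured scoring results.**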
@@ -414,3 +424,39 @@ main.py
 - The assets of every experiment are reliably preserved

 That is the fundamental difference between it and a one-off offline script.
+
+---
+
+## 15. Optimization Advisor pipeline
+
+Related code:
+
+- `rag_eval/advisor/__init__.py`: external entry point `run_advisor()`
+- `rag_eval/advisor/rules.py`: rules engine (pure functions, no LLM), 7 metric diagnostic rules
+- `rag_eval/advisor/llm_analyzer.py`: LLM analyzer (reuses the judge_model llm instance; degrades automatically on failure)
+- `rag_eval/advisor/writer.py`: writes `optimization_advice.md` plus a log summary
+
+The advisor runs after `write_run_artifacts()`, and only when the scenario sets `optimization_advisor: true`; it is off by default.
+
+Execution flow:
+
+```text
+run_advisor(result, scenario, llm)
+  -> rules.diagnose(score_rows, metrics)   # flag abnormal metrics, pick the top-3 worst-scoring samples
+  -> llm_analyzer.analyze(diagnoses, llm)  # LLM writes Chinese advice (falls back to a rules-only report on failure)
+  -> writer.write_advice(...)              # write optimization_advice.md + log summary
+```
+
+The output is added to the existing run directory:
+
+```text
+outputs/online/siemens-pdf-question-bank/<run_id>/
+  scenario.snapshot.yaml
+  scores.csv
+  invalid.csv
+  summary.md
+  metadata.json
+  optimization_advice.md   <- new (generated when optimization_advisor: true)
+```
+
+The rules engine defines two threshold tiers (warning / critical) for each of the 7 metrics; `noise_sensitivity` is lower-is-better, so its comparison direction is reversed. Every diagnosis carries the top-3 worst-scoring samples, which are passed to the LLM so that the Chinese advice addresses the actual sample content.

1378
docs/superpowers/plans/2026-06-16-optimization-advisor.md
Normal file
File diff suppressed because it is too large
225
docs/superpowers/specs/2026-06-16-optimization-advisor-design.md
Normal file
@@ -0,0 +1,225 @@
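# Optimization Advisor Module Design Spec

- Date: 2026-06-16
- Status: confirmed, in implementation.

## 1. Goal

After the existing RAG evaluation flow finishes, add an **Optimization Advisor** module. It uses the run's metric scores and low-scoring samples to diagnose why metrics are low and to give targeted optimization advice, delivered as a Chinese Markdown report plus a log summary.

This corresponds to architecture design §11 (optimization strategy): it turns the "metric-to-action mapping" (§11.2) from documentation into code that runs automatically.

---
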
## 2. Decision summary

| Decision | Choice |
|---|---|
| Output form | `optimization_advice.md` (file) + console/log summary (dual output) |
| Generation mechanism | Rules engine locates abnormal metrics → LLM interprets them against low-scoring samples (two layers) |
| Trigger | Declared explicitly in the scenario YAML as `optimization_advisor: true`; off by default |
| LLM instance | Reuse the `llm` instance already created by `build_models()`; do not build another client |
| Package location | `rag_eval/advisor/` (standalone package exposing a single entry point, `run_advisor()`) |

---

## 3. Architecture

### 3.1 Execution flow

```
run_scenario()
  → load_scenario()              # read YAML, parse the optimization_advisor field
  → build_models()               # existing: creates llm, embeddings
  → build_metric_pipeline()      # existing
  → Evaluator.evaluate()         # existing: scoring → EvaluationResult
  → write_run_artifacts()        # existing: scores.csv / summary.md / ...
  → run_advisor(                 # new (3 lines)
      result, scenario, llm, artifact_paths
    )
      → rules.diagnose(score_rows)            # rules engine: returns a list of Diagnosis
      → llm_analyzer.analyze(diags, samples)  # LLM: generates Chinese Markdown advice
      → writer.write(advice, paths)           # writes the file + logs
```

### 3.2 New files

```
rag_eval/advisor/
  __init__.py        ← exposes run_advisor(), the only external entry point
  rules.py           ← pure-function rules engine, no LLM, unit-testable on its own
  llm_analyzer.py    ← takes the llm instance + diagnoses → Chinese Markdown
  writer.py          ← writes optimization_advice.md and logs a summary
```

### 3.3 Modified files (minimal changes)

| File | Change |
|---|---|
| `rag_eval/shared/models.py` | Add field `optimization_advisor: bool = False` to `Scenario` |
| `rag_eval/config/schema.py` | Add the same field to `ScenarioModel` and pass it through to `Scenario` |
| `rag_eval/config/loader.py` | Pass `optimization_advisor` through to the `Scenario` constructor |
| `rag_eval/reporting/artifacts.py` | Add field `advice_md: Path` to `RunArtifactPaths` and assign it in `build_artifact_paths()` |
| `rag_eval/execution/runner.py` | At the end of `run_scenario()`: pass in the llm returned by `build_models` and call `run_advisor()` conditionally |

### 3.4 Output artifacts

```
outputs/online/siemens-pdf-question-bank/<run_id>/
  scenario.snapshot.yaml
  scores.csv
  invalid.csv
  summary.md
  metadata.json
  optimization_advice.md   ← new (generated when optimization_advisor: true)
```

---

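## 4. Rules engine (rules.py)

### 4.1 Data structure

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    metric: str                   # metric name
    mean_score: float             # mean score for this run
    threshold: float              # threshold that was triggered
    severity: str                 # "warning" | "critical"
    root_causes: list[str]        # likely causes (from architecture design §11.2)
    suggested_actions: list[str]  # the pipeline stages that can be tuned
    low_samples: list[dict]       # the N worst-scoring samples (question/answer/ground_truth)
```

### 4.2 Seven metric diagnostic rules

Thresholds follow common RAG evaluation practice and come in two tiers, warning and critical:

| Metric | warning | critical | Likely root cause | Optimization stage (§11.2) |
|---|---|---|---|---|
| `faithfulness` | < 0.7 | < 0.5 | Generation not strictly grounded in retrieved chunks / hallucination | Grounding in the generation prompt; enable validation |
| `answer_relevancy` | < 0.7 | < 0.5 | Answer drifts from the question / redundant formatting | Query rewriting; generation prompt format |
| `context_recall` | < 0.7 | < 0.5 | Retrieval misses key information | Multi-query, question decomposition, step-back, wider over-retrieval |
| `context_precision` | < 0.6 | < 0.4 | Retrieval brings in too much noise / poor ranking | Post-retrieval reranking, compression, relevance filtering |
| `noise_sensitivity` | > 0.3 | > 0.5 | Answer is disturbed by noisy chunks (lower is better) | Post-retrieval relevance filtering, reranking |
| `factual_correctness` | < 0.6 | < 0.4 | Answer facts deviate substantially from the reference answer | Combined retrieval and generation tuning |
| `semantic_similarity` | < 0.7 | < 0.5 | Answer meaning is far from the reference answer | Generation prompt, retrieval quality |

> Note: `noise_sensitivity` is lower-is-better (0 = completely unaffected by noise), so its threshold direction is the opposite of the others.

### 4.3 Low-score sample selection

For every metric that triggers a diagnosis, the **top-3** worst-scoring samples (lowest scores, or highest for `noise_sensitivity`; NaN excluded) are attached to `Diagnosis.low_samples`, with the fields `sample_id / question / answer / ground_truth / <metric_score>`.

---
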
## 5. LLM analyzer (llm_analyzer.py)

### 5.1 Input

- `diagnoses: list[Diagnosis]`: rules engine output (only metrics that crossed a threshold)
- `llm`: the existing RAGAS LLM instance (the scenario's judge_model)
- `scenario_name: str`: used for the report title

### 5.2 Prompt design

Use **a single LLM call** that sends all triggered diagnoses together with their low-scoring samples. The prompt itself is written in Chinese:

```
你是一个 RAG 系统优化专家,正在分析西门子医疗 CT 文档问答系统的评测结果。
请用中文撰写一份优化建议报告,格式为 Markdown。

## 评测诊断摘要
{for each diagnosis: 指标名、均值、阈值、可能原因、建议动作}

## 低分样本示例
{for each diagnosis: top-3 低分样本的 question / answer / ground_truth}

## 要求
1. 按指标分节(## 指标名),先解释"为什么低",再给出"具体怎么改"
2. "具体怎么改"要结合低分样本的具体内容,而不只是泛泛建议
3. 最后写一节 ## 优先优化次序,按性价比排序(参考:不增加调用次数的优先)
4. 语言简洁,面向工程师,不要废话
```

### 5.3 Output

The Markdown string returned by the LLM is written to `optimization_advice.md` as-is, with run metadata prepended at the top of the report.

### 5.4 Failure fallback

If the LLM call fails (timeout or exception), degrade to a **rules-only report** that contains just the rules engine's diagnoses, without LLM interpretation. The file is still written, the error message goes at the end of the report, and the evaluation flow as a whole is not interrupted.

---

## 6. Writer (writer.py)

### 6.1 File output

Structure of `optimization_advice.md`:

```markdown
# 优化建议报告 — <scenario_name>

- run_id: `<run_id>`
- 生成时间: `<timestamp>`
- judge_model: `<model>`

---

<LLM 生成的 Markdown 正文>
```

### 6.2 Log summary

When `run_advisor()` finishes, it emits a compact summary through `logger.info` (one line per message, easy to spot at the end of a `run_eval.bat` run):

```
[advisor] 触发诊断 3 项: faithfulness(0.42, critical) context_recall(0.58, warning) noise_sensitivity(0.41, warning)
[advisor] 优化建议已写出: outputs/online/.../optimization_advice.md
```

---

## 7. YAML configuration

The scenario file gains one top-level field:

```yaml
optimization_advisor: true   # default false; when true, optimization advice is generated after the evaluation
```

If finer-grained settings are needed later (threshold overrides, number of top-N low-scoring samples), the field can be extended to:

```yaml
optimization_advisor:
  enabled: true
  top_low_samples: 3      # how many low-scoring samples to take per metric (default 3)
  # thresholds:           # optional: override the default thresholds
  #   faithfulness: 0.65
```

This round supports only `optimization_advisor: true/false`; the extended form is reserved but not implemented.

---

## 8. Test strategy

| Test | File | Notes |
|---|---|---|
| Rules engine unit tests | `tests/test_advisor_rules.py` | Pure functions, no LLM; covers warning/critical triggering for every rule, NaN skipping, and low_samples selection |
| Writer unit tests | `tests/test_advisor_writer.py` | Uses a mock Diagnosis list; verifies the written md file format and the log output |
| Integration (optional) | existing `tests/test_online_eval.py` | Verifies that advice_md exists for a scenario with `optimization_advisor: true` |

The LLM analyzer has no unit tests (it depends on the network); the integration scenario covers it.

---

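## 9. Out of scope (boundaries for this round)

- No cross-version comparison (only the current run is analyzed)
- No aggregated advice across a batch of scenarios
- No Web UI
- No multilingual adaptation of the LLM analyzer prompt in this round (Chinese only)
- Advisor thresholds are hard-coded in `rules.py` in this round and are not read from YAML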
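
Section 7 of the spec reserves an extended mapping form for `optimization_advisor` without implementing it. The following is a minimal sketch of how both forms (bare bool and mapping) could be normalized into one structure. `AdvisorConfig` and `parse_advisor_config` are hypothetical names that do not exist in this commit, and the key names are taken from the YAML example in the spec.

```python
from dataclasses import dataclass, field
from typing import Any

_ALLOWED_KEYS = {"enabled", "top_low_samples", "thresholds"}


@dataclass
class AdvisorConfig:
    """Hypothetical normalized form of the `optimization_advisor` setting."""

    enabled: bool = False
    top_low_samples: int = 3
    thresholds: dict = field(default_factory=dict)


def parse_advisor_config(raw: Any) -> AdvisorConfig:
    """Accept a bare bool or a mapping and return an AdvisorConfig."""
    if raw is None:
        return AdvisorConfig()
    if isinstance(raw, bool):
        # Current form: `optimization_advisor: true/false`.
        return AdvisorConfig(enabled=raw)
    if isinstance(raw, dict):
        unknown = set(raw) - _ALLOWED_KEYS
        if unknown:
            raise ValueError(f"unknown optimization_advisor keys: {sorted(unknown)}")
        top_n = int(raw.get("top_low_samples", 3))
        if top_n < 1:
            raise ValueError("top_low_samples must be >= 1")
        # `thresholds:` may be present but empty (None) in YAML.
        thresholds = {str(k): float(v) for k, v in (raw.get("thresholds") or {}).items()}
        return AdvisorConfig(
            enabled=bool(raw.get("enabled", False)),
            top_low_samples=top_n,
            thresholds=thresholds,
        )
    raise TypeError(f"optimization_advisor must be a bool or a mapping, got {type(raw).__name__}")
```

Accepting the bare bool keeps existing scenario files valid, so the extension would not require a migration.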
67
rag_eval/advisor/__init__.py
Normal file
@@ -0,0 +1,67 @@
"""Optimization advisor: rule-based diagnosis + LLM-powered recommendations."""
from __future__ import annotations

import asyncio
import logging
from typing import Any

from rag_eval.reporting.artifacts import build_artifact_paths
from rag_eval.shared.models import EvaluationResult, Scenario

from .llm_analyzer import analyze
from .rules import Diagnosis, diagnose
from .writer import write_advice

logger = logging.getLogger("rag_eval.advisor")

__all__ = ["run_advisor", "Diagnosis", "diagnose"]


def run_advisor(
    result: EvaluationResult,
    scenario: Scenario,
    llm: Any,
) -> None:
    """Run the full optimization advisor pipeline after an evaluation completes.

    Skips silently if scenario.optimization_advisor is False.
    Never raises: failures are logged as warnings, not exceptions.

    Args:
        result: Completed EvaluationResult from Evaluator.evaluate().
        scenario: The resolved Scenario (provides metrics, judge_model, output_dir).
        llm: Pre-built RAGAS LLM instance (from build_models()) for LLM analysis.
    """
    if not scenario.optimization_advisor:
        return

    logger.info("[advisor] starting optimization analysis scenario=%s", scenario.scenario_name)

    try:
        artifact_paths = build_artifact_paths(scenario.output_dir, result.run_id)
        if artifact_paths.advice_md is None:
            logger.warning("[advisor] advice_md path not set in RunArtifactPaths — skipping")
            return

        diagnoses = diagnose(result.score_rows, scenario.metrics)
        logger.info("[advisor] rule diagnosis complete: %d metric(s) triggered", len(diagnoses))

        if diagnoses:
            llm_markdown = asyncio.run(analyze(diagnoses, llm, scenario.scenario_name))
        else:
            llm_markdown = ""

        write_advice(
            diagnoses=diagnoses,
            llm_markdown=llm_markdown,
            advice_path=artifact_paths.advice_md,
            scenario_name=scenario.scenario_name,
            run_id=result.run_id,
            judge_model=scenario.judge_model,
        )

    except Exception as exc:
        logger.warning(
            "[advisor] advisor failed (%s: %s) — evaluation result is unaffected",
            type(exc).__name__, exc,
        )
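
`run_advisor()` calls `asyncio.run()`, which raises `RuntimeError` when an event loop is already running in the calling thread (for example in a notebook). In that case the broad `except` above would log a warning and no advice file would be written. The sketch below shows one possible workaround, assuming it is acceptable to block the caller while the analysis runs; `run_coro_sync` is a hypothetical helper, not part of this commit.

```python
import asyncio
import threading
from typing import Any, Callable, Coroutine


def run_coro_sync(make_coro: Callable[[], Coroutine[Any, Any, Any]]) -> Any:
    """Run the coroutine returned by make_coro() to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(make_coro())

    # A loop is already running here; asyncio.run() would raise RuntimeError.
    # Run the coroutine on a fresh loop in a worker thread and wait for it.
    outcome = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(make_coro())
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

With such a helper, the call site could become `run_coro_sync(lambda: analyze(diagnoses, llm, scenario.scenario_name))`. Note that the join blocks the outer loop until the analysis finishes, and an LLM client bound to the outer loop may not work from the worker thread.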
99
rag_eval/advisor/llm_analyzer.py
Normal file
@@ -0,0 +1,99 @@
"""LLM-powered analysis of rule diagnostics and low-score samples."""
from __future__ import annotations

import logging
from typing import Any

from .rules import Diagnosis

logger = logging.getLogger("rag_eval.advisor")

_PROMPT_TEMPLATE = """\
你是一个 RAG 系统优化专家,正在分析西门子医疗 CT 文档问答系统的评测结果。
请用中文撰写一份优化建议报告,格式为 Markdown。

## 评测诊断摘要

{diagnosis_summary}

## 低分样本示例

{low_sample_text}

## 报告要求

1. 按指标分节(## 指标名 [severity]),先解释"为什么低"(结合低分样本具体分析),再给出"具体怎么改"
2. "具体怎么改"要结合低分样本的实际内容,而不只是泛泛建议
3. 最后写一节 **## 优先优化次序**,按性价比排序(不增加 LLM 调用次数的优化优先)
4. 语言简洁,面向工程师,不要废话,不要重复列表内容

只输出 Markdown 报告正文,不要任何前置说明。
"""


def _build_diagnosis_summary(diagnoses: list[Diagnosis]) -> str:
    lines = []
    for d in diagnoses:
        direction = "(越低越好)" if d.metric == "noise_sensitivity" else ""
        lines.append(
            f"- **{d.metric}** {direction} 均值={d.mean_score:.4f},"
            f"阈值={d.threshold},严重程度={d.severity}"
        )
        lines.append(f"  - 可能原因:{'; '.join(d.root_causes)}")
        lines.append(f"  - 建议动作:{'; '.join(d.suggested_actions)}")
    return "\n".join(lines)


def _build_low_sample_text(diagnoses: list[Diagnosis]) -> str:
    lines = []
    for d in diagnoses:
        if not d.low_samples:
            continue
        lines.append(f"### {d.metric} 低分样本(最多 3 条)")
        for i, s in enumerate(d.low_samples, 1):
            score = s.get(d.metric, "N/A")
            lines.append(f"\n**样本 {i}**(分数={score})")
            lines.append(f"- 问题:{s.get('question', '')}")
            # str(... or '') guards against None / non-string values before slicing.
            lines.append(f"- 回答:{str(s.get('answer') or '')[:300]}")
            lines.append(f"- 标准答案:{str(s.get('ground_truth') or '')[:200]}")
    return "\n".join(lines)


async def analyze(
    diagnoses: list[Diagnosis],
    llm: Any,
    scenario_name: str,
) -> str:
    """Call the judge LLM to generate a Chinese optimization report.

    Args:
        diagnoses: Non-empty list of Diagnosis from rules.diagnose().
        llm: RAGAS LLM wrapper (has .agenerate() method).
        scenario_name: Used only for logging.

    Returns:
        LLM-generated Markdown string, or "" on failure (triggers writer fallback).
    """
    if not diagnoses:
        return ""

    diagnosis_summary = _build_diagnosis_summary(diagnoses)
    low_sample_text = _build_low_sample_text(diagnoses)
    prompt = _PROMPT_TEMPLATE.format(
        diagnosis_summary=diagnosis_summary,
        low_sample_text=low_sample_text,
    )

    try:
        logger.info("[advisor] calling LLM for optimization analysis scenario=%s", scenario_name)
        from langchain_core.messages import HumanMessage
        result = await llm.agenerate(texts=[[HumanMessage(content=prompt)]])
        text = result.generations[0][0].text.strip()
        logger.info("[advisor] LLM analysis complete chars=%d", len(text))
        return text
    except Exception as exc:
        logger.warning(
            "[advisor] LLM analysis failed (%s: %s) — falling back to rule report",
            type(exc).__name__, exc,
        )
        return ""
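
The spec leaves the LLM analyzer without unit tests because it depends on the network. As a rough sketch of how the fallback path could still be exercised offline, a stub can mimic the response shape that `analyze()` reads (`result.generations[0][0].text`). `FakeLLM` and `call_with_fallback` are hypothetical stand-ins, and the stub passes the prompt as a plain string instead of a `HumanMessage`.

```python
import asyncio
from types import SimpleNamespace


class FakeLLM:
    """Hypothetical stub exposing the `agenerate` shape the analyzer relies on."""

    def __init__(self, text=None, error=None):
        self._text = text
        self._error = error

    async def agenerate(self, texts):
        if self._error is not None:
            raise self._error
        generation = SimpleNamespace(text=self._text)
        return SimpleNamespace(generations=[[generation]])


async def call_with_fallback(llm, prompt):
    """Mirror the analyzer's try/except: stripped text on success, "" on any failure."""
    try:
        result = await llm.agenerate(texts=[[prompt]])
        return result.generations[0][0].text.strip()
    except Exception:
        return ""


ok = asyncio.run(call_with_fallback(FakeLLM(text="  ## faithfulness\n"), "prompt"))
failed = asyncio.run(call_with_fallback(FakeLLM(error=TimeoutError("slow")), "prompt"))
```

Here `ok` holds the stripped Markdown and `failed` is the empty string, which is the value that makes `write_advice()` switch to the rules-only report.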
236
rag_eval/advisor/rules.py
Normal file
@@ -0,0 +1,236 @@
"""Rule-based diagnostic engine for RAG evaluation metric scores."""
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MetricRule:
    """Threshold configuration and diagnostic text for one metric."""
    warning_threshold: float
    critical_threshold: float
    higher_is_better: bool  # False for noise_sensitivity
    root_causes: list[str]
    suggested_actions: list[str]


METRIC_RULES: dict[str, MetricRule] = {
    "faithfulness": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "生成回答包含检索片段中不支持的陈述(幻觉)",
            "生成阶段未严格遵循 grounding 约束",
            "校验阶段未开启或未生效",
        ],
        suggested_actions=[
            "强化生成 prompt 的 grounding 约束('只依据参考资料作答')",
            "开启校验阶段(validation: by_scenario)",
            "检查低分样本中模型是否引用了片段外的知识",
        ],
    ),
    "answer_relevancy": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "回答偏离问题主旨或包含大量冗余内容",
            "查询改写后问题语义漂移",
            "生成 prompt 格式约束不足",
        ],
        suggested_actions=[
            "优化查询改写 prompt,确保改写后语义不偏移",
            "在生成 prompt 中加入'简洁准确、直接回答问题'的约束",
            "检查低分样本的回答是否存在格式冗余或话题偏移",
        ],
    ),
    "context_recall": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "检索未能召回标准答案所涉及的关键信息",
            "单一查询未能覆盖问题的多个角度",
            "过召回数量不足,关键片段被截断",
        ],
        suggested_actions=[
            "启用多查询扩展(use_multi_query)覆盖不同措辞",
            "对多跳问题启用问题分解(sub_questions)",
            "加大过召回宽度(recall_top_k)",
            "对颗粒度细的问题尝试 Step-back 双路检索",
        ],
    ),
    "context_precision": MetricRule(
        warning_threshold=0.6,
        critical_threshold=0.4,
        higher_is_better=True,
        root_causes=[
            "检索引入过多与问题无关的片段",
            "重排未能将相关片段排在前列",
            "缺少相关性过滤,噪声片段进入上下文",
        ],
        suggested_actions=[
            "启用或优化 listwise 重排,将相关片段排在前列",
            "启用上下文压缩(compression)过滤无关句子",
            "启用相关性过滤(relevance_filter)丢弃明确无关片段",
            "缩小 rerank_keep_k(如从 8 降到 5)",
        ],
    ),
    "noise_sensitivity": MetricRule(
        warning_threshold=0.3,  # higher is worse; trigger when mean > threshold
        critical_threshold=0.5,
        higher_is_better=False,
        root_causes=[
            "回答中包含检索到的噪声片段所引入的错误陈述",
            "相关性过滤未能拦截干扰性片段",
            "生成阶段对噪声片段未加区分地引用",
        ],
        suggested_actions=[
            "启用相关性过滤(relevance_filter)拦截噪声",
            "优化重排,将不相关片段排到截断点之后",
            "在生成 prompt 中强调'来源冲突时并列陈述,不擅自下定论'",
        ],
    ),
    "factual_correctness": MetricRule(
        warning_threshold=0.6,
        critical_threshold=0.4,
        higher_is_better=True,
        root_causes=[
            "回答的事实陈述与标准答案存在偏差",
            "检索未能命中标准答案所依据的关键片段",
            "生成阶段对多个来源综合时产生事实错误",
        ],
        suggested_actions=[
            "重点检查低分样本,确认是检索遗漏还是生成错误",
            "提升 context_recall 以确保关键信息被检索到",
            "对事实型问题将 temperature 降至 0",
        ],
    ),
    "semantic_similarity": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "回答语义与标准答案差距较大",
            "回答过于简短或过于冗长,语义偏移",
            "检索到的片段质量不足,导致生成内容偏离",
        ],
        suggested_actions=[
            "检查低分样本的回答与标准答案的表述差异",
            "优化生成 prompt 使回答更贴近标准表述风格",
            "提升检索质量(context_recall / context_precision)",
        ],
    ),
}


@dataclass
class Diagnosis:
    """Diagnostic result for one metric that triggered a threshold."""
    metric: str
    mean_score: float
    threshold: float  # the triggered threshold
    severity: str  # "warning" | "critical"
    root_causes: list[str] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)
    low_samples: list[dict[str, Any]] = field(default_factory=list)


def _mean_ignoring_nan(values: list[float]) -> float | None:
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return None
    return sum(valid) / len(valid)


def _select_low_samples(
    rows: list[dict[str, Any]],
    metric: str,
    top_n: int,
    higher_is_better: bool,
) -> list[dict[str, Any]]:
    """Return the top_n worst-scoring rows for a metric, excluding NaN."""
    valid = [r for r in rows if metric in r and not math.isnan(float(r[metric]))]
    sorted_rows = sorted(valid, key=lambda r: float(r[metric]), reverse=not higher_is_better)
    worst = sorted_rows[:top_n]
    keep_keys = {"sample_id", "question", "answer", "ground_truth", metric}
    return [{k: v for k, v in row.items() if k in keep_keys} for row in worst]


def diagnose(
    score_rows: list[dict[str, Any]],
    metrics: list[str],
    top_low_samples: int = 3,
) -> list[Diagnosis]:
    """Analyse score_rows and return a Diagnosis for each metric below threshold.

    Args:
        score_rows: List of per-sample score dicts (from EvaluationResult.score_rows).
        metrics: Metric names to evaluate (from Scenario.metrics).
        top_low_samples: How many worst-scoring samples to attach per diagnosis.

    Returns:
        List of Diagnosis objects, one per triggered metric. Empty if all OK.
    """
    diagnoses: list[Diagnosis] = []

    for metric in metrics:
        rule = METRIC_RULES.get(metric)
        if rule is None:
            continue  # unknown metric, skip

        values = []
        for row in score_rows:
            raw = row.get(metric)
            if raw is None:
                continue
            try:
                v = float(raw)
            except (TypeError, ValueError):
                continue
            values.append(v)

        if not values:
            continue

        mean = _mean_ignoring_nan(values)
        if mean is None:
            continue

        # Determine severity (direction-aware)
        if rule.higher_is_better:
            if mean < rule.critical_threshold:
                severity = "critical"
                threshold = rule.critical_threshold
            elif mean < rule.warning_threshold:
                severity = "warning"
                threshold = rule.warning_threshold
            else:
                continue  # above warning threshold → no diagnosis
        else:
            # lower is better (noise_sensitivity)
            if mean > rule.critical_threshold:
                severity = "critical"
                threshold = rule.critical_threshold
            elif mean > rule.warning_threshold:
                severity = "warning"
                threshold = rule.warning_threshold
            else:
                continue

        low_samples = _select_low_samples(score_rows, metric, top_low_samples, rule.higher_is_better)

        diagnoses.append(Diagnosis(
            metric=metric,
            mean_score=round(mean, 4),
            threshold=threshold,
            severity=severity,
            root_causes=list(rule.root_causes),
            suggested_actions=list(rule.suggested_actions),
            low_samples=low_samples,
        ))

    return diagnoses
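
To make the direction-aware thresholds in `diagnose()` concrete, here is a small standalone sketch of the same comparison logic with a few worked values. `classify` is a hypothetical helper written for illustration only; the thresholds passed in are the ones from `METRIC_RULES`.

```python
from typing import Optional


def classify(mean: float, warning: float, critical: float,
             higher_is_better: bool = True) -> Optional[str]:
    """Return "critical", "warning", or None using strict comparisons."""
    if higher_is_better:
        if mean < critical:
            return "critical"
        if mean < warning:
            return "warning"
        return None
    # Lower is better (noise_sensitivity): the comparisons flip.
    if mean > critical:
        return "critical"
    if mean > warning:
        return "warning"
    return None


examples = {
    "faithfulness 0.42": classify(0.42, warning=0.7, critical=0.5),
    "context_recall 0.58": classify(0.58, warning=0.7, critical=0.5),
    "context_recall 0.70": classify(0.70, warning=0.7, critical=0.5),
    "noise_sensitivity 0.41": classify(0.41, warning=0.3, critical=0.5, higher_is_better=False),
    "noise_sensitivity 0.55": classify(0.55, warning=0.3, critical=0.5, higher_is_better=False),
}
```

Because the comparisons are strict, a mean exactly equal to the warning threshold does not trigger a diagnosis, and a `noise_sensitivity` mean of 0.41 lands in the warning tier, not the critical one.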
82
rag_eval/advisor/writer.py
Normal file
@@ -0,0 +1,82 @@
"""Write optimization advice to markdown file and emit log summary."""
from __future__ import annotations

import logging
from pathlib import Path

from .rules import Diagnosis

logger = logging.getLogger("rag_eval.advisor")


def _format_log_summary(diagnoses: list[Diagnosis], advice_path: Path) -> str:
    """Return a single-line log summary of triggered diagnoses."""
    if not diagnoses:
        return "[advisor] 所有指标正常,无需优化建议。"
    parts = [f"{d.metric}({d.mean_score:.2f}, {d.severity})" for d in diagnoses]
    triggered = " ".join(parts)
    return f"[advisor] 触发诊断 {len(diagnoses)} 项: {triggered} → {advice_path}"


def _build_fallback_report(diagnoses: list[Diagnosis]) -> str:
    """Build a rules-only report when LLM analysis is unavailable."""
    if not diagnoses:
        return ""
    lines = ["## 规则诊断(LLM 分析不可用)\n"]
    for d in diagnoses:
        lines.append(f"### {d.metric} [{d.severity}] 均值={d.mean_score:.4f}")
        lines.append("\n**可能原因:**")
        for cause in d.root_causes:
            lines.append(f"- {cause}")
        lines.append("\n**建议动作:**")
        for action in d.suggested_actions:
            lines.append(f"- {action}")
        lines.append("")
    return "\n".join(lines)


def write_advice(
    diagnoses: list[Diagnosis],
    llm_markdown: str,
    advice_path: Path,
    scenario_name: str,
    run_id: str,
    judge_model: str,
) -> None:
    """Write optimization_advice.md and emit a log summary line.

    Args:
        diagnoses: List of Diagnosis from rules.diagnose().
        llm_markdown: LLM-generated Markdown body. Empty string triggers fallback.
        advice_path: Full path to write the .md file.
        scenario_name: Human-readable scenario identifier for the report header.
        run_id: Run identifier string.
        judge_model: Model used for LLM analysis (shown in header).
    """
    advice_path.parent.mkdir(parents=True, exist_ok=True)

    from rag_eval.shared.utils import utc_now_iso
    header_lines = [
        f"# 优化建议报告 — {scenario_name}",
        "",
        f"- run_id: `{run_id}`",
        f"- 生成时间: `{utc_now_iso()}`",
        f"- judge_model: `{judge_model}`",
        "",
        "---",
        "",
    ]

    if not diagnoses:
        body = "## ✅ 未发现明显指标异常\n\n所有指标均在正常范围内,当前 RAG 链路表现良好。\n"
    elif llm_markdown:
        body = llm_markdown
    else:
        body = _build_fallback_report(diagnoses)

    content = "\n".join(header_lines) + body
    advice_path.write_text(content, encoding="utf-8")

    summary = _format_log_summary(diagnoses, advice_path)
    logger.info(summary)
    logger.info("[advisor] 优化建议已写出: %s", advice_path)
@@ -61,6 +61,7 @@ def load_scenario(path: str | Path) -> Scenario:
             max_samples=model.runtime.max_samples,
         ),
         source_path=scenario_path,
+        optimization_advisor=model.optimization_advisor,
     )
     # Run cross-field checks after all relative paths have been resolved.
     validate_scenario(scenario)

@@ -54,6 +54,7 @@ class ScenarioModel(BaseModel):
     metrics: list[str]
     output_dir: str
     runtime: RuntimeConfigModel = Field(default_factory=RuntimeConfigModel)
+    optimization_advisor: bool = False

     @field_validator("metrics")
     @classmethod

@@ -8,8 +8,9 @@ from pathlib import Path

 from rag_eval.adapters.http import HttpAppAdapter
 from rag_eval.adapters.python import PythonFunctionAdapter
+from rag_eval.advisor import run_advisor
 from rag_eval.config.loader import load_scenario
-from rag_eval.metrics.factory import build_metric_pipeline
+from rag_eval.metrics.factory import build_models, build_metric_pipeline
 from rag_eval.reporting.writers import write_run_artifacts
 from rag_eval.settings import EvaluationSettings
 from rag_eval.shared.models import Scenario
@@ -67,10 +68,17 @@ def run_scenario(
     logger.info("[runner] scenario loaded: name=%s mode=%s max_samples=%s",
                 scenario.scenario_name, scenario.mode, scenario.runtime.max_samples)

+    # Build models once; reuse llm in both MetricPipeline and advisor.
+    llm, embeddings = build_models(scenario.judge_model, scenario.embedding_model, settings)
+
     adapter = build_adapter(scenario)
-    pipeline = build_metric_pipeline(scenario, settings)
+    pipeline = build_metric_pipeline(scenario, settings, llm=llm, embeddings=embeddings)
     evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline, app_adapter=adapter)
     result = evaluator.evaluate()
     write_run_artifacts(result)
     logger.info("[runner] artifacts written for run_id=%s", result.run_id)
+
+    # Optimization advisor: runs only if scenario.optimization_advisor is True.
+    run_advisor(result, scenario, llm)
+
     return result

@@ -18,7 +18,10 @@ from ragas.metrics.collections import (
     AnswerRelevancy,
     ContextPrecision,
     ContextRecall,
+    FactualCorrectness,
     Faithfulness,
+    NoiseSensitivity,
+    SemanticSimilarity,
 )

 from .pipeline import MetricPipeline
@@ -39,19 +42,34 @@ def build_models(
def build_metric_pipeline(
    scenario: Scenario,
    settings: EvaluationSettings,
    llm: Any | None = None,
    embeddings: Any | None = None,
) -> MetricPipeline:
    """Build a metric pipeline containing only the metrics requested by the scenario."""
    """Build a metric pipeline containing only the metrics requested by the scenario.

    If llm and embeddings are provided (pre-built by the caller), they are reused.
    Otherwise, new instances are created from scenario + settings.
    """
    if llm is None or embeddings is None:
        llm, embeddings = build_models(
            scenario.judge_model,
            scenario.embedding_model,
            settings,
        )

    # Build the full registry once, then slice it by configured metric names.
    registry: dict[str, Any] = {
        "faithfulness": Faithfulness(llm=llm),
        "answer_relevancy": AnswerRelevancy(llm=llm, embeddings=embeddings),
        "context_recall": ContextRecall(llm=llm),
        "context_precision": ContextPrecision(llm=llm),
        # Robustness / end-to-end metrics (architecture design §10.2).
        # NoiseSensitivity mode='relevant': sensitivity to noise from relevant contexts.
        "noise_sensitivity": NoiseSensitivity(llm=llm),
        # FactualCorrectness mode='f1': balances claim precision and recall vs. ground truth.
        "factual_correctness": FactualCorrectness(llm=llm),
        # SemanticSimilarity: embedding cosine between answer and ground truth (no LLM call).
        "semantic_similarity": SemanticSimilarity(embeddings=embeddings),
    }
    return MetricPipeline(
        metrics={name: registry[name] for name in scenario.metrics},

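The reuse-or-build rule and the registry slice in the hunk above can be exercised in isolation. The following is a minimal sketch, not the project's code: `resolve_models` and `select_metrics` are hypothetical helper names, and `build_models` is passed in as a zero-argument callable so the example needs no Ragas objects. Two behaviors are worth noting. Because the hunk tests `llm is None or embeddings is None`, supplying only one of the two models still rebuilds both. And the plain dict comprehension in the hunk would raise `KeyError` on an unknown metric name (unless names are validated earlier, which this diff does not show); the sketch swaps in an explicit `ValueError` as one possible variation.

```python
from __future__ import annotations

from typing import Any, Callable


def resolve_models(
    llm: Any | None,
    embeddings: Any | None,
    build_models: Callable[[], tuple[Any, Any]],
) -> tuple[Any, Any]:
    """Reuse caller-supplied models, mirroring the `or` condition in the hunk."""
    if llm is None or embeddings is None:
        # Either argument missing triggers a rebuild of BOTH models.
        return build_models()
    return llm, embeddings


def select_metrics(registry: dict[str, Any], requested: list[str]) -> dict[str, Any]:
    """Slice the registry by name (hypothetical variant with an explicit error)."""
    unknown = [name for name in requested if name not in registry]
    if unknown:
        raise ValueError(f"unsupported metrics: {unknown}")
    # Preserve the order in which the scenario lists its metrics.
    return {name: registry[name] for name in requested}


if __name__ == "__main__":
    models = resolve_models(None, None, lambda: ("judge-llm", "embedder"))
    print(models)
    print(select_metrics({"faithfulness": 1, "context_recall": 2}, ["context_recall"]))
```

If partial reuse were wanted (for example, keeping a caller's `llm` while building only the embeddings), the condition would have to be split per argument; the diff as shown does not do that.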
@@ -17,4 +17,5 @@ def build_artifact_paths(output_dir: Path, run_id: str) -> RunArtifactPaths:
        invalid_csv=run_dir / "invalid.csv",
        summary_md=run_dir / "summary.md",
        metadata_json=run_dir / "metadata.json",
        advice_md=run_dir / "optimization_advice.md",
    )

@@ -76,6 +76,7 @@ class Scenario:
    runtime: RuntimeConfig = field(default_factory=RuntimeConfig)
    app_adapter: AppAdapterConfig | None = None
    source_path: Path | None = None
    optimization_advisor: bool = False

    def snapshot(self) -> dict[str, Any]:
        """Serialize the scenario into a reporting-friendly dictionary snapshot."""
@@ -159,3 +160,4 @@ class RunArtifactPaths:
    invalid_csv: Path
    summary_md: Path
    metadata_json: Path
    advice_md: Path | None = None

@@ -1,13 +1,19 @@
scenario_name: siemens-pdf-question-bank-online
mode: online
dataset: ../../datasets/raw/generated/siemens-pdf-question-bank.csv
# judge_model: qwen3.5-flash
judge_model: deepseek-v4-flash
embedding_model: text-embedding-v3
optimization_advisor: true  # automatically generate the optimization advice report after evaluation
metrics:
  - faithfulness
  - answer_relevancy
  - context_recall
  - context_precision
  # Enabled: robustness / end-to-end metrics (the dataset already contains ground_truth)
  - noise_sensitivity    # robustness: sensitivity to retrieval noise
  - factual_correctness  # end-to-end: factual correctness (against the reference answer)
  - semantic_similarity  # end-to-end: semantic similarity (embedding, no LLM call)
output_dir: ../../outputs/online/siemens-pdf-question-bank
runtime:
  batch_size: 4

59
scripts/smoke_advisor.py
Normal file
@@ -0,0 +1,59 @@
"""Offline smoke-check for the advisor module wiring (no network required)."""
import math
import sys
import tempfile
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from rag_eval.advisor.rules import diagnose
from rag_eval.advisor.writer import write_advice, _format_log_summary

# Simulate score_rows with low faithfulness and high noise_sensitivity
rows = [
    {
        "sample_id": f"s{i}",
        "question": f"问题{i}:西门子CT扫描的Flash技术原理是什么?",
        "answer": f"答案{i}:Flash技术采用双源CT扫描",
        "ground_truth": f"标准答案{i}:Flash扫描利用双源CT和大螺距实现超低辐射剂量扫描",
        "faithfulness": 0.3 + i * 0.05,
        "noise_sensitivity": 0.4 + i * 0.02,
        "context_recall": 0.75,
        "semantic_similarity": 0.65,
    }
    for i in range(5)
]

diags = diagnose(rows, metrics=["faithfulness", "noise_sensitivity", "context_recall", "semantic_similarity"])
print(f"Diagnosed {len(diags)} metric(s):")
for d in diags:
    print(f"  {d.metric}: mean={d.mean_score}, severity={d.severity}, low_samples={len(d.low_samples)}")

assert len(diags) >= 2, f"Expected at least 2 diagnoses, got {len(diags)}"
metrics_hit = {d.metric for d in diags}
assert "faithfulness" in metrics_hit, "faithfulness should be triggered"
assert "noise_sensitivity" in metrics_hit, "noise_sensitivity should be triggered"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "optimization_advice.md"
    write_advice(
        diagnoses=diags,
        llm_markdown="",  # fallback mode (no LLM)
        advice_path=path,
        scenario_name="smoke-test-siemens",
        run_id="2026-06-16T00-00-00",
        judge_model="deepseek-v4-flash",
    )
    content = path.read_text(encoding="utf-8")
    assert "smoke-test-siemens" in content, "scenario name missing from report"
    assert "faithfulness" in content, "faithfulness missing from report"
    assert "noise_sensitivity" in content, "noise_sensitivity missing from report"
    print(f"\nAdvice file ({len(content)} chars) — assertions OK")

# Verify log summary format
summary = _format_log_summary(diags, Path("optimization_advice.md"))
print(f"\nLog summary length: {len(summary)} chars, faithfulness present: {'faithfulness' in summary}")
assert "触发诊断" in summary
assert "faithfulness" in summary

print("\nSmoke check PASSED")
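The smoke script expects `diagnose` to flag a low `faithfulness` mean and a high `noise_sensitivity` mean, and the commit message mentions warning/critical thresholds and top-3 low samples. `rules.py` itself is not part of this excerpt, so the following is only a sketch of how such a threshold rule could work. Everything in it is an assumption for illustration: the names `SketchDiagnosis` and `diagnose_metric`, the `higher_is_better` flag, and the threshold values used in the demo.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchDiagnosis:
    """Hypothetical stand-in for a diagnosis record (not the real Diagnosis class)."""

    metric: str
    mean_score: float
    threshold: float
    severity: str
    low_samples: list[dict[str, Any]] = field(default_factory=list)


def diagnose_metric(
    rows: list[dict[str, Any]],
    metric: str,
    warning: float,
    critical: float,
    higher_is_better: bool = True,
    top_k: int = 3,
) -> SketchDiagnosis | None:
    """Return a diagnosis when the metric mean crosses the warning threshold."""
    # Ignore rows where the metric is missing or NaN.
    scored = [
        r for r in rows
        if isinstance(r.get(metric), (int, float)) and not math.isnan(r[metric])
    ]
    if not scored:
        return None
    mean = sum(r[metric] for r in scored) / len(scored)
    # Flip the sign for lower-is-better metrics so one comparison covers both cases.
    sign = 1.0 if higher_is_better else -1.0
    if sign * mean >= sign * warning:
        return None  # healthy: nothing to report
    severity = "critical" if sign * mean < sign * critical else "warning"
    # Worst samples first: lowest scores, or highest for lower-is-better metrics.
    worst = sorted(scored, key=lambda r: sign * r[metric])[:top_k]
    return SketchDiagnosis(metric, round(mean, 4), warning, severity, worst)


if __name__ == "__main__":
    demo = [{"sample_id": f"s{i}", "faithfulness": 0.3 + i * 0.05} for i in range(5)]
    # Threshold values 0.7 / 0.5 are assumed for the demo only.
    print(diagnose_metric(demo, "faithfulness", warning=0.7, critical=0.5))
```

Treating `noise_sensitivity` as lower-is-better is what lets a mean around 0.44 count as a problem while a `context_recall` of 0.75 passes; the real module may encode direction and thresholds differently.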
113
tests/test_advisor_writer.py
Normal file
@@ -0,0 +1,113 @@
import shutil
import unittest
from pathlib import Path

from rag_eval.advisor.rules import Diagnosis
from rag_eval.advisor.writer import write_advice, _format_log_summary


class TestWriteAdvice(unittest.TestCase):
    def setUp(self):
        self.tmp = Path("tests/.tmp/test_advisor_writer")
        shutil.rmtree(self.tmp, ignore_errors=True)
        self.tmp.mkdir(parents=True, exist_ok=True)
        self.advice_path = self.tmp / "optimization_advice.md"

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    def _make_diagnosis(self, metric="faithfulness", severity="warning"):
        return Diagnosis(
            metric=metric,
            mean_score=0.55,
            threshold=0.7,
            severity=severity,
            root_causes=["原因1", "原因2"],
            suggested_actions=["建议1", "建议2"],
            low_samples=[
                {"sample_id": "s1", "question": "问题1", "answer": "答案1",
                 "ground_truth": "标准1", metric: 0.4},
            ],
        )

    def test_write_creates_file(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\nLLM 建议内容",
            advice_path=self.advice_path,
            scenario_name="test-scenario",
            run_id="2026-01-01T00-00-00",
            judge_model="deepseek-v4-flash",
        )
        self.assertTrue(self.advice_path.exists())

    def test_write_contains_scenario_name_and_run_id(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\nLLM 建议",
            advice_path=self.advice_path,
            scenario_name="siemens-test",
            run_id="2026-01-01T00-00-00",
            judge_model="deepseek-v4-flash",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("siemens-test", content)
        self.assertIn("2026-01-01T00-00-00", content)

    def test_write_contains_llm_markdown(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\n具体建议文本",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("具体建议文本", content)

    def test_write_fallback_when_no_llm_markdown(self):
        """When llm_markdown is empty, writer emits rule-only report."""
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("faithfulness", content)
        self.assertIn("原因1", content)

    def test_log_summary_format(self):
        diags = [
            self._make_diagnosis("faithfulness", "critical"),
            self._make_diagnosis("context_recall", "warning"),
        ]
        summary = _format_log_summary(diags, self.advice_path)
        self.assertIn("faithfulness", summary)
        self.assertIn("critical", summary)
        self.assertIn("context_recall", summary)
        self.assertIn("warning", summary)

    def test_write_empty_diagnoses_still_creates_file(self):
        write_advice(
            diagnoses=[],
            llm_markdown="",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        self.assertTrue(self.advice_path.exists())
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("未发现明显指标异常", content)


if __name__ == "__main__":
    unittest.main()