- rag_eval/advisor/: new package with rules engine, LLM analyzer, writer
- rules.py: 7-metric diagnostic rules (warning/critical thresholds, top-3 low samples)
- llm_analyzer.py: Chinese optimization report via judge_model, graceful fallback
- writer.py: writes optimization_advice.md + log summary
- __init__.py: run_advisor() entry point (no-op when optimization_advisor=False)
- Scenario.optimization_advisor: new bool field (default False)
- ScenarioModel: same field added; loader.py passes it through
- RunArtifactPaths.advice_md: new path field
- factory.py: build_models() now public; build_metric_pipeline() accepts pre-built llm/embeddings
- runner.py: lifts llm, passes to pipeline and advisor; calls run_advisor() at end
- siemens online YAML: optimization_advisor: true enabled
- tests: 9 rules tests + 6 writer tests, all pass
- docs: advisor section added to engine-flow.md and architecture.md

Co-Authored-By: Claude <noreply@anthropic.com>
Optimization Advisor Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
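Goal: Add a rag_eval/advisor/ module that, after every evaluation run, automatically analyzes why metrics scored low and writes a Chinese optimization advice report (optimization_advice.md plus a log summary).
Architecture: The rules engine (rules.py) flags anomalies from each metric's mean score and thresholds and selects low-scoring samples; the LLM analyzer (llm_analyzer.py) reuses the existing judge_model llm instance to generate Chinese Markdown advice; the writer (writer.py) writes the file and logs a summary. The advisor is triggered by the YAML switch optimization_advisor: true and is off by default.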
Tech Stack: Python 3.12, dataclasses, ragas LLM instance (AsyncOpenAI-backed), pandas (score_rows already available), pytest (unit tests)
File Map
New files
- rag_eval/advisor/__init__.py: exposes run_advisor(), the only external entry point
- rag_eval/advisor/rules.py: pure-function rules engine; Diagnosis dataclass + diagnose()
- rag_eval/advisor/llm_analyzer.py: analyze() takes llm + diagnoses and returns a Markdown str
- rag_eval/advisor/writer.py: write_advice() writes the md file and logs a summary
- tests/test_advisor_rules.py: rules engine unit tests
- tests/test_advisor_writer.py: writer unit tests
Modified files
- rag_eval/shared/models.py: Scenario gains optimization_advisor: bool = False; RunArtifactPaths gains advice_md: Path | None = None
- rag_eval/config/schema.py: ScenarioModel gains optimization_advisor: bool = False
- rag_eval/config/loader.py: load_scenario() passes optimization_advisor through to Scenario
- rag_eval/reporting/artifacts.py: build_artifact_paths() fills the advice_md field
- rag_eval/metrics/factory.py: build_models() is used as a public function and build_metric_pipeline() accepts pre-built llm/embeddings, so the runner can hand the same llm to the advisor (see Task 5)
- rag_eval/execution/runner.py: builds llm up front and conditionally calls run_advisor() at the end
- scenarios/online/siemens-pdf-question-bank-online.yaml: adds optimization_advisor: true
- docs/rag-eval-engine-flow.md: adds a description of the advisor flow
- docs/rag-eval-architecture.md: adds an advisor note at the end of §9.4 (metric orchestration)
Task 1: Diagnosis dataclass + rules engine
Files:
- Create: rag_eval/advisor/rules.py
- Create: tests/test_advisor_rules.py
- Step 1: Write failing tests
# tests/test_advisor_rules.py
import math
import unittest

from rag_eval.advisor.rules import Diagnosis, diagnose, METRIC_RULES


class TestDiagnosis(unittest.TestCase):
    def _make_rows(self, metric: str, scores: list[float]) -> list[dict]:
        return [{metric: s, "question": f"q{i}", "answer": f"a{i}",
                 "ground_truth": f"gt{i}", "sample_id": f"s{i}"}
                for i, s in enumerate(scores)]

    def test_no_diagnosis_when_all_scores_above_threshold(self):
        rows = self._make_rows("faithfulness", [0.8, 0.9, 0.85])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(result, [])

    def test_warning_when_mean_below_warning_threshold(self):
        rows = self._make_rows("faithfulness", [0.65, 0.62, 0.68])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0].metric, "faithfulness")
        self.assertEqual(result[0].severity, "warning")
        self.assertAlmostEqual(result[0].mean_score, 0.65, places=2)

    def test_critical_when_mean_below_critical_threshold(self):
        rows = self._make_rows("faithfulness", [0.3, 0.4, 0.45])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(result[0].severity, "critical")

    def test_low_samples_selected_are_bottom_three(self):
        rows = self._make_rows("faithfulness", [0.1, 0.2, 0.3, 0.8, 0.9])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(len(result[0].low_samples), 3)
        scores = [s["faithfulness"] for s in result[0].low_samples]
        self.assertEqual(sorted(scores), [0.1, 0.2, 0.3])

    def test_nan_scores_excluded_from_mean_and_low_samples(self):
        rows = self._make_rows("faithfulness", [0.3, float("nan"), 0.4])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(len(result), 1)
        for s in result[0].low_samples:
            self.assertFalse(math.isnan(s["faithfulness"]))

    def test_noise_sensitivity_direction_inverted(self):
        # noise_sensitivity: higher is worse; threshold > 0.3 is warning
        rows = self._make_rows("noise_sensitivity", [0.4, 0.45, 0.5])
        result = diagnose(rows, metrics=["noise_sensitivity"])
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0].metric, "noise_sensitivity")

    def test_noise_sensitivity_no_diagnosis_when_low(self):
        rows = self._make_rows("noise_sensitivity", [0.1, 0.15, 0.2])
        result = diagnose(rows, metrics=["noise_sensitivity"])
        self.assertEqual(result, [])

    def test_skips_metric_not_in_rows(self):
        rows = [{"faithfulness": 0.3, "question": "q", "answer": "a",
                 "ground_truth": "gt", "sample_id": "s1"}]
        result = diagnose(rows, metrics=["faithfulness", "context_recall"])
        metrics_found = [d.metric for d in result]
        self.assertIn("faithfulness", metrics_found)
        self.assertNotIn("context_recall", metrics_found)

    def test_all_seven_metrics_have_rules(self):
        expected = {"faithfulness", "answer_relevancy", "context_recall",
                    "context_precision", "noise_sensitivity",
                    "factual_correctness", "semantic_similarity"}
        self.assertEqual(set(METRIC_RULES.keys()), expected)


if __name__ == "__main__":
    unittest.main()
- Step 2: Run tests to verify they fail
cd C:\Projects\AIProjects\Siemens-AIPOC\siemens_ragas
python -m pytest tests/test_advisor_rules.py -v 2>&1 | head -20
Expected: ModuleNotFoundError: No module named 'rag_eval.advisor'
- Step 3: Create rules.py
# rag_eval/advisor/rules.py
"""Rule-based diagnostic engine for RAG evaluation metric scores."""
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MetricRule:
    """Threshold configuration and diagnostic text for one metric."""
    warning_threshold: float
    critical_threshold: float
    higher_is_better: bool  # False for noise_sensitivity
    root_causes: list[str]
    suggested_actions: list[str]


METRIC_RULES: dict[str, MetricRule] = {
    "faithfulness": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "生成回答包含检索片段中不支持的陈述(幻觉)",
            "生成阶段未严格遵循 grounding 约束",
            "校验阶段未开启或未生效",
        ],
        suggested_actions=[
            "强化生成 prompt 的 grounding 约束('只依据参考资料作答')",
            "开启校验阶段(validation: by_scenario)",
            "检查低分样本中模型是否引用了片段外的知识",
        ],
    ),
    "answer_relevancy": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "回答偏离问题主旨或包含大量冗余内容",
            "查询改写后问题语义漂移",
            "生成 prompt 格式约束不足",
        ],
        suggested_actions=[
            "优化查询改写 prompt,确保改写后语义不偏移",
            "在生成 prompt 中加入'简洁准确、直接回答问题'的约束",
            "检查低分样本的回答是否存在格式冗余或话题偏移",
        ],
    ),
    "context_recall": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "检索未能召回标准答案所涉及的关键信息",
            "单一查询未能覆盖问题的多个角度",
            "过召回数量不足,关键片段被截断",
        ],
        suggested_actions=[
            "启用多查询扩展(use_multi_query)覆盖不同措辞",
            "对多跳问题启用问题分解(sub_questions)",
            "加大过召回宽度(recall_top_k)",
            "对颗粒度细的问题尝试 Step-back 双路检索",
        ],
    ),
    "context_precision": MetricRule(
        warning_threshold=0.6,
        critical_threshold=0.4,
        higher_is_better=True,
        root_causes=[
            "检索引入过多与问题无关的片段",
            "重排未能将相关片段排在前列",
            "缺少相关性过滤,噪声片段进入上下文",
        ],
        suggested_actions=[
            "启用或优化 listwise 重排,将相关片段排在前列",
            "启用上下文压缩(compression)过滤无关句子",
            "启用相关性过滤(relevance_filter)丢弃明确无关片段",
            "缩小 rerank_keep_k(如从 8 降到 5)",
        ],
    ),
    "noise_sensitivity": MetricRule(
        warning_threshold=0.3,  # higher is worse; trigger when mean > threshold
        critical_threshold=0.5,
        higher_is_better=False,
        root_causes=[
            "回答中包含检索到的噪声片段所引入的错误陈述",
            "相关性过滤未能拦截干扰性片段",
            "生成阶段对噪声片段未加区分地引用",
        ],
        suggested_actions=[
            "启用相关性过滤(relevance_filter)拦截噪声",
            "优化重排,将不相关片段排到截断点之后",
            "在生成 prompt 中强调'来源冲突时并列陈述,不擅自下定论'",
        ],
    ),
    "factual_correctness": MetricRule(
        warning_threshold=0.6,
        critical_threshold=0.4,
        higher_is_better=True,
        root_causes=[
            "回答的事实陈述与标准答案存在偏差",
            "检索未能命中标准答案所依据的关键片段",
            "生成阶段对多个来源综合时产生事实错误",
        ],
        suggested_actions=[
            "重点检查低分样本,确认是检索遗漏还是生成错误",
            "提升 context_recall 以确保关键信息被检索到",
            "对事实型问题将 temperature 降至 0",
        ],
    ),
    "semantic_similarity": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "回答语义与标准答案差距较大",
            "回答过于简短或过于冗长,语义偏移",
            "检索到的片段质量不足,导致生成内容偏离",
        ],
        suggested_actions=[
            "检查低分样本的回答与标准答案的表述差异",
            "优化生成 prompt 使回答更贴近标准表述风格",
            "提升检索质量(context_recall / context_precision)",
        ],
    ),
}


@dataclass
class Diagnosis:
    """Diagnostic result for one metric that triggered a threshold."""
    metric: str
    mean_score: float
    threshold: float  # the triggered threshold
    severity: str  # "warning" | "critical"
    root_causes: list[str] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)
    low_samples: list[dict[str, Any]] = field(default_factory=list)


def _mean_ignoring_nan(values: list[float]) -> float | None:
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return None
    return sum(valid) / len(valid)

def _select_low_samples(
    rows: list[dict[str, Any]],
    metric: str,
    top_n: int,
    higher_is_better: bool,
) -> list[dict[str, Any]]:
    """Return the top_n worst-scoring rows for a metric, excluding NaN."""
    scored: list[tuple[float, dict[str, Any]]] = []
    for row in rows:
        try:
            score = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue  # metric missing, None, or non-numeric: skip like diagnose() does
        if not math.isnan(score):
            scored.append((score, row))
    # Worst first: ascending when higher is better, descending otherwise.
    scored.sort(key=lambda pair: pair[0], reverse=not higher_is_better)
    keep_keys = {"sample_id", "question", "answer", "ground_truth", metric}
    return [{k: v for k, v in row.items() if k in keep_keys} for _, row in scored[:top_n]]

def diagnose(
    score_rows: list[dict[str, Any]],
    metrics: list[str],
    top_low_samples: int = 3,
) -> list[Diagnosis]:
    """Analyse score_rows and return a Diagnosis for each metric below threshold.

    Args:
        score_rows: List of per-sample score dicts (from EvaluationResult.score_rows).
        metrics: Metric names to evaluate (from Scenario.metrics).
        top_low_samples: How many worst-scoring samples to attach per diagnosis.

    Returns:
        List of Diagnosis objects, one per triggered metric. Empty if all OK.
    """
    diagnoses: list[Diagnosis] = []
    for metric in metrics:
        rule = METRIC_RULES.get(metric)
        if rule is None:
            continue  # unknown metric, skip

        values = []
        for row in score_rows:
            raw = row.get(metric)
            if raw is None:
                continue
            try:
                v = float(raw)
            except (TypeError, ValueError):
                continue
            values.append(v)
        if not values:
            continue

        mean = _mean_ignoring_nan(values)
        if mean is None:
            continue

        # Determine severity (direction-aware)
        if rule.higher_is_better:
            if mean < rule.critical_threshold:
                severity = "critical"
                threshold = rule.critical_threshold
            elif mean < rule.warning_threshold:
                severity = "warning"
                threshold = rule.warning_threshold
            else:
                continue  # above warning threshold → no diagnosis
        else:
            # lower is better (noise_sensitivity)
            if mean > rule.critical_threshold:
                severity = "critical"
                threshold = rule.critical_threshold
            elif mean > rule.warning_threshold:
                severity = "warning"
                threshold = rule.warning_threshold
            else:
                continue

        low_samples = _select_low_samples(score_rows, metric, top_low_samples, rule.higher_is_better)
        diagnoses.append(Diagnosis(
            metric=metric,
            mean_score=round(mean, 4),
            threshold=threshold,
            severity=severity,
            root_causes=list(rule.root_causes),
            suggested_actions=list(rule.suggested_actions),
            low_samples=low_samples,
        ))
    return diagnoses
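The direction-aware severity branch can be checked in isolation. The sketch below is a minimal stand-in, not the module itself: classify() is a hypothetical helper that mirrors the threshold comparisons, fed with the faithfulness and noise_sensitivity thresholds from METRIC_RULES.

```python
from __future__ import annotations


def classify(mean: float, warning: float, critical: float,
             higher_is_better: bool) -> str | None:
    """Hypothetical helper mirroring the severity branch of diagnose()."""
    if not higher_is_better:
        # Flip signs so "lower is better" metrics reuse the same comparisons.
        mean, warning, critical = -mean, -warning, -critical
    if mean < critical:
        return "critical"
    if mean < warning:
        return "warning"
    return None  # within the normal range, no diagnosis


# faithfulness: warning below 0.7, critical below 0.5
print(classify(0.65, 0.7, 0.5, higher_is_better=True))
# noise_sensitivity: warning above 0.3, critical above 0.5
print(classify(0.45, 0.3, 0.5, higher_is_better=False))
print(classify(0.20, 0.3, 0.5, higher_is_better=False))
```

A mean that sits exactly on a threshold does not trigger it, because both directions use strict comparisons.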
- Step 4: Create rag_eval/advisor/__init__.py (stub; full version in Task 5)
# rag_eval/advisor/__init__.py
"""Optimization advisor: rule-based diagnosis + LLM-powered recommendations."""
from .rules import Diagnosis, diagnose
__all__ = ["Diagnosis", "diagnose"]
- Step 5: Run tests — expect pass
python -m pytest tests/test_advisor_rules.py -v
Expected: all 9 tests PASS.
- Step 6: Commit
git add rag_eval/advisor/__init__.py rag_eval/advisor/rules.py tests/test_advisor_rules.py
git commit -m "feat(advisor): add rule-based diagnostic engine with 7-metric coverage"
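Score rows that come from a pandas frame may hold NaN, None, or no value at all for a metric. As a rough illustration of low-sample selection that tolerates all three, the sketch below uses a hypothetical worst_samples() function; it is a simplified stand-in and does not trim the returned rows to the report fields.

```python
import math
from typing import Any


def worst_samples(rows: list[dict[str, Any]], metric: str, top_n: int = 3,
                  higher_is_better: bool = True) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the low-sample selection step."""
    scored = []
    for row in rows:
        try:
            score = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue  # metric missing, None, or not numeric
        if not math.isnan(score):
            scored.append((score, row))
    # Worst first: ascending when higher is better, descending otherwise.
    scored.sort(key=lambda pair: pair[0], reverse=not higher_is_better)
    return [row for _, row in scored[:top_n]]


rows = [
    {"sample_id": "s0", "faithfulness": 0.9},
    {"sample_id": "s1", "faithfulness": float("nan")},
    {"sample_id": "s2", "faithfulness": 0.1},
    {"sample_id": "s3", "faithfulness": None},
    {"sample_id": "s4", "faithfulness": 0.4},
    {"sample_id": "s5"},
]
print([r["sample_id"] for r in worst_samples(rows, "faithfulness", top_n=2)])
```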
Task 2: Writer module
Files:
- Create: rag_eval/advisor/writer.py
- Create: tests/test_advisor_writer.py
- Step 1: Write failing tests
# tests/test_advisor_writer.py
import logging
import shutil
import unittest
from pathlib import Path

from rag_eval.advisor.rules import Diagnosis
from rag_eval.advisor.writer import write_advice, _format_log_summary


class TestWriteAdvice(unittest.TestCase):
    def setUp(self):
        self.tmp = Path("tests/.tmp/test_advisor_writer")
        shutil.rmtree(self.tmp, ignore_errors=True)
        self.tmp.mkdir(parents=True, exist_ok=True)
        self.advice_path = self.tmp / "optimization_advice.md"

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    def _make_diagnosis(self, metric="faithfulness", severity="warning"):
        return Diagnosis(
            metric=metric,
            mean_score=0.55,
            threshold=0.7,
            severity=severity,
            root_causes=["原因1", "原因2"],
            suggested_actions=["建议1", "建议2"],
            low_samples=[
                {"sample_id": "s1", "question": "问题1", "answer": "答案1",
                 "ground_truth": "标准1", metric: 0.4},
            ],
        )

    def test_write_creates_file(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\nLLM 建议内容",
            advice_path=self.advice_path,
            scenario_name="test-scenario",
            run_id="2026-01-01T00-00-00",
            judge_model="deepseek-v4-flash",
        )
        self.assertTrue(self.advice_path.exists())

    def test_write_contains_scenario_name_and_run_id(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\nLLM 建议",
            advice_path=self.advice_path,
            scenario_name="siemens-test",
            run_id="2026-01-01T00-00-00",
            judge_model="deepseek-v4-flash",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("siemens-test", content)
        self.assertIn("2026-01-01T00-00-00", content)

    def test_write_contains_llm_markdown(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\n具体建议文本",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("具体建议文本", content)

    def test_write_fallback_when_no_llm_markdown(self):
        """When llm_markdown is empty, writer emits rule-only report."""
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("faithfulness", content)
        self.assertIn("原因1", content)

    def test_log_summary_format(self):
        diags = [
            self._make_diagnosis("faithfulness", "critical"),
            self._make_diagnosis("context_recall", "warning"),
        ]
        summary = _format_log_summary(diags, self.advice_path)
        self.assertIn("faithfulness", summary)
        self.assertIn("critical", summary)
        self.assertIn("context_recall", summary)
        self.assertIn("warning", summary)

    def test_write_empty_diagnoses_still_creates_file(self):
        write_advice(
            diagnoses=[],
            llm_markdown="",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        self.assertTrue(self.advice_path.exists())
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("未发现明显指标异常", content)


if __name__ == "__main__":
    unittest.main()
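The tests above create and delete a fixed directory under tests/.tmp. If a fixed location is not required, tempfile.TemporaryDirectory is one alternative that avoids leftovers and collisions between parallel runs. The sketch below is illustrative only: TempDirExample is a hypothetical test case, and write_text() stands in for write_advice().

```python
import tempfile
import unittest
from pathlib import Path


class TempDirExample(unittest.TestCase):
    """Hypothetical test case; write_text() stands in for write_advice()."""

    def setUp(self):
        # addCleanup removes the directory even when the test fails.
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        self.advice_path = Path(tmp.name) / "optimization_advice.md"

    def test_file_is_created(self):
        self.advice_path.write_text("# report\n", encoding="utf-8")
        self.assertTrue(self.advice_path.exists())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TempDirExample)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```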
- Step 2: Run tests to verify they fail
python -m pytest tests/test_advisor_writer.py -v 2>&1 | head -15
Expected: ModuleNotFoundError: No module named 'rag_eval.advisor.writer' (writer.py does not exist yet)
- Step 3: Create writer.py
# rag_eval/advisor/writer.py
"""Write optimization advice to markdown file and emit log summary."""
from __future__ import annotations

import logging
from pathlib import Path

from .rules import Diagnosis

logger = logging.getLogger("rag_eval.advisor")


def _format_log_summary(diagnoses: list[Diagnosis], advice_path: Path) -> str:
    """Return a single-line log summary of triggered diagnoses."""
    if not diagnoses:
        return "[advisor] 所有指标正常,无需优化建议。"
    parts = [f"{d.metric}({d.mean_score:.2f}, {d.severity})" for d in diagnoses]
    triggered = " ".join(parts)
    return f"[advisor] 触发诊断 {len(diagnoses)} 项: {triggered} → {advice_path}"


def _build_fallback_report(diagnoses: list[Diagnosis]) -> str:
    """Build a rules-only report when LLM analysis is unavailable."""
    if not diagnoses:
        return ""
    lines = ["## 规则诊断(LLM 分析不可用)\n"]
    for d in diagnoses:
        lines.append(f"### {d.metric} [{d.severity}] 均值={d.mean_score:.4f}")
        lines.append("\n**可能原因:**")
        for cause in d.root_causes:
            lines.append(f"- {cause}")
        lines.append("\n**建议动作:**")
        for action in d.suggested_actions:
            lines.append(f"- {action}")
        lines.append("")
    return "\n".join(lines)


def write_advice(
    diagnoses: list[Diagnosis],
    llm_markdown: str,
    advice_path: Path,
    scenario_name: str,
    run_id: str,
    judge_model: str,
) -> None:
    """Write optimization_advice.md and emit a log summary line.

    Args:
        diagnoses: List of Diagnosis from rules.diagnose().
        llm_markdown: LLM-generated Markdown body. Empty string triggers fallback.
        advice_path: Full path to write the .md file.
        scenario_name: Human-readable scenario identifier for the report header.
        run_id: Run identifier string.
        judge_model: Model used for LLM analysis (shown in header).
    """
    advice_path.parent.mkdir(parents=True, exist_ok=True)

    # Header
    from rag_eval.shared.utils import utc_now_iso
    header_lines = [
        f"# 优化建议报告 — {scenario_name}",
        "",
        f"- run_id: `{run_id}`",
        f"- 生成时间: `{utc_now_iso()}`",
        f"- judge_model: `{judge_model}`",
        "",
        "---",
        "",
    ]

    if not diagnoses:
        body = "## ✅ 未发现明显指标异常\n\n所有指标均在正常范围内,当前 RAG 链路表现良好。\n"
    elif llm_markdown:
        body = llm_markdown
    else:
        body = _build_fallback_report(diagnoses)

    content = "\n".join(header_lines) + body
    advice_path.write_text(content, encoding="utf-8")

    summary = _format_log_summary(diagnoses, advice_path)
    logger.info(summary)
    logger.info("[advisor] 优化建议已写出: %s", advice_path)
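To make the shape of the one-line log summary concrete, here is a self-contained sketch. MiniDiagnosis and format_summary() are hypothetical stand-ins that copy only the fields and formatting the summary uses; the run path is invented for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class MiniDiagnosis:
    """Hypothetical subset of Diagnosis: only the fields the summary reads."""
    metric: str
    mean_score: float
    severity: str


def format_summary(diagnoses, advice_path) -> str:
    """Stand-in that follows the same formatting as the log summary."""
    if not diagnoses:
        return "[advisor] 所有指标正常,无需优化建议。"
    parts = [f"{d.metric}({d.mean_score:.2f}, {d.severity})" for d in diagnoses]
    return f"[advisor] 触发诊断 {len(diagnoses)} 项: {' '.join(parts)} → {advice_path}"


summary = format_summary(
    [MiniDiagnosis("faithfulness", 0.45, "critical"),
     MiniDiagnosis("context_recall", 0.62, "warning")],
    PurePosixPath("outputs/run-1/optimization_advice.md"),  # invented path
)
print(summary)
```

Each triggered metric appears as metric(mean, severity), followed by the report path, so a single log line is enough to see what fired and where the details are.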
- Step 4: Run tests — expect pass
python -m pytest tests/test_advisor_writer.py -v
Expected: all 6 tests PASS.
- Step 5: Commit
git add rag_eval/advisor/writer.py tests/test_advisor_writer.py
git commit -m "feat(advisor): add advice writer with fallback rule-only report"
Task 3: LLM analyzer
Files:
- Create: rag_eval/advisor/llm_analyzer.py
No LLM unit tests (network-dependent); tested in Task 7 integration.
- Step 1: Create llm_analyzer.py
# rag_eval/advisor/llm_analyzer.py
"""LLM-powered analysis of rule diagnostics and low-score samples."""
from __future__ import annotations

import logging
from typing import Any

from .rules import Diagnosis

logger = logging.getLogger("rag_eval.advisor")

_PROMPT_TEMPLATE = """\
你是一个 RAG 系统优化专家,正在分析西门子医疗 CT 文档问答系统的评测结果。
请用中文撰写一份优化建议报告,格式为 Markdown。

## 评测诊断摘要
{diagnosis_summary}

## 低分样本示例
{low_sample_text}

## 报告要求
1. 按指标分节(## 指标名 [severity]),先解释"为什么低"(结合低分样本具体分析),再给出"具体怎么改"
2. "具体怎么改"要结合低分样本的实际内容,而不只是泛泛建议
3. 最后写一节 **## 优先优化次序**,按性价比排序(不增加 LLM 调用次数的优化优先)
4. 语言简洁,面向工程师,不要废话,不要重复列表内容

只输出 Markdown 报告正文,不要任何前置说明。
"""


def _build_diagnosis_summary(diagnoses: list[Diagnosis]) -> str:
    lines = []
    for d in diagnoses:
        direction = "(越低越好)" if d.metric == "noise_sensitivity" else ""
        lines.append(
            f"- **{d.metric}** {direction} 均值={d.mean_score:.4f},"
            f"阈值={d.threshold},严重程度={d.severity}"
        )
        lines.append(f"  - 可能原因:{'; '.join(d.root_causes)}")
        lines.append(f"  - 建议动作:{'; '.join(d.suggested_actions)}")
    return "\n".join(lines)


def _build_low_sample_text(diagnoses: list[Diagnosis]) -> str:
    lines = []
    for d in diagnoses:
        if not d.low_samples:
            continue
        lines.append(f"### {d.metric} 低分样本(最多 3 条)")
        for i, s in enumerate(d.low_samples, 1):
            score = s.get(d.metric, "N/A")
            lines.append(f"\n**样本 {i}**(分数={score})")
            lines.append(f"- 问题:{s.get('question', '')}")
            # str(... or '') guards against None or other non-string values before slicing.
            lines.append(f"- 回答:{str(s.get('answer') or '')[:300]}")
            lines.append(f"- 标准答案:{str(s.get('ground_truth') or '')[:200]}")
    return "\n".join(lines)


async def analyze(
    diagnoses: list[Diagnosis],
    llm: Any,
    scenario_name: str,
) -> str:
    """Call the judge LLM to generate a Chinese optimization report.

    Args:
        diagnoses: Non-empty list of Diagnosis from rules.diagnose().
        llm: RAGAS LLM wrapper (has .agenerate() method).
        scenario_name: Used only for logging.

    Returns:
        LLM-generated Markdown string, or "" on failure (triggers writer fallback).
    """
    if not diagnoses:
        return ""

    diagnosis_summary = _build_diagnosis_summary(diagnoses)
    low_sample_text = _build_low_sample_text(diagnoses)
    prompt = _PROMPT_TEMPLATE.format(
        diagnosis_summary=diagnosis_summary,
        low_sample_text=low_sample_text,
    )

    try:
        logger.info("[advisor] calling LLM for optimization analysis scenario=%s", scenario_name)
        # Assumption: the RAGAS LLM wrapper exposes a LangChain-style agenerate() that
        # returns an LLMResult. If the installed wrapper has a different signature, the
        # call raises and the except branch below falls back to the rule-only report.
        from langchain_core.messages import HumanMessage
        result = await llm.agenerate(texts=[[HumanMessage(content=prompt)]])
        text = result.generations[0][0].text.strip()
        logger.info("[advisor] LLM analysis complete chars=%d", len(text))
        return text
    except Exception as exc:
        logger.warning("[advisor] LLM analysis failed (%s: %s) — falling back to rule report", type(exc).__name__, exc)
        return ""
- Step 2: Verify import works
python -c "from rag_eval.advisor.llm_analyzer import analyze; print('OK')"
Expected: OK
- Step 3: Commit
git add rag_eval/advisor/llm_analyzer.py
git commit -m "feat(advisor): add LLM analyzer with graceful fallback on failure"
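Although the real LLM call needs the network, the fallback contract (Markdown on success, an empty string on any failure) can be checked offline with stubs. The sketch below is one possible approach under simplifying assumptions: FailingLLM, EchoLLM, and analyze_with_fallback() are hypothetical, and the stubs return a plain string rather than the result object a real RAGAS wrapper would return.

```python
import asyncio


class FailingLLM:
    """Hypothetical stub whose agenerate() always fails, as an outage would."""

    async def agenerate(self, *args, **kwargs):
        raise ConnectionError("simulated outage")


class EchoLLM:
    """Hypothetical stub that returns a canned report as a plain string."""

    async def agenerate(self, *args, **kwargs):
        return "## faithfulness [warning]\n...\n"


async def analyze_with_fallback(llm, prompt: str) -> str:
    """Same contract as analyze(): Markdown on success, "" on any failure."""
    try:
        text = await llm.agenerate(prompt)
        return text.strip()
    except Exception:
        return ""


print(repr(asyncio.run(analyze_with_fallback(FailingLLM(), "prompt"))))
print(asyncio.run(analyze_with_fallback(EchoLLM(), "prompt")).splitlines()[0])
```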
Task 4: Wire advisor into models, config schema, and loader
Files:
- Modify: rag_eval/shared/models.py
- Modify: rag_eval/config/schema.py
- Modify: rag_eval/config/loader.py
- Modify: rag_eval/reporting/artifacts.py
- Step 1: Add optimization_advisor to Scenario and RunArtifactPaths
In rag_eval/shared/models.py, add one field to Scenario (after source_path) and one to RunArtifactPaths:
# In Scenario dataclass — add after source_path field:
optimization_advisor: bool = False
# In RunArtifactPaths dataclass — add after metadata_json field:
advice_md: Path | None = None
Full updated Scenario dataclass (fields with defaults must follow all required fields, so add the new field at the end):
@dataclass(slots=True)
class Scenario:
    scenario_name: str
    mode: Mode
    dataset: DatasetConfig
    judge_model: str
    embedding_model: str
    metrics: list[str]
    output_dir: Path
    runtime: RuntimeConfig = field(default_factory=RuntimeConfig)
    app_adapter: AppAdapterConfig | None = None
    source_path: Path | None = None
    optimization_advisor: bool = False  # NEW
Full updated RunArtifactPaths:
@dataclass(slots=True)
class RunArtifactPaths:
    root_dir: Path
    scenario_snapshot: Path
    scores_csv: Path
    invalid_csv: Path
    summary_md: Path
    metadata_json: Path
    advice_md: Path | None = None  # NEW
- Step 2: Add field to ScenarioModel in schema.py
In rag_eval/config/schema.py, add to ScenarioModel:
optimization_advisor: bool = False # NEW — enable optimization advisor output
(add after the runtime field)
- Step 3: Pass optimization_advisor through in loader.py
In rag_eval/config/loader.py, in the Scenario(...) constructor call, add:
optimization_advisor=model.optimization_advisor, # NEW
- Step 4: Add advice_md to artifact paths in artifacts.py
In rag_eval/reporting/artifacts.py, update build_artifact_paths():
def build_artifact_paths(output_dir: Path, run_id: str) -> RunArtifactPaths:
    """Build the canonical artifact file paths for a single evaluation run."""
    run_dir = output_dir / run_id
    return RunArtifactPaths(
        root_dir=run_dir,
        scenario_snapshot=run_dir / "scenario.snapshot.yaml",
        scores_csv=run_dir / "scores.csv",
        invalid_csv=run_dir / "invalid.csv",
        summary_md=run_dir / "summary.md",
        metadata_json=run_dir / "metadata.json",
        advice_md=run_dir / "optimization_advice.md",  # NEW
    )
- Step 5: Verify existing tests still pass
python -m pytest tests/test_offline_eval.py tests/test_online_eval.py -v 2>&1 | tail -15
Expected: same pass/fail as before (the 4 pre-existing failures are unrelated).
- Step 6: Commit
git add rag_eval/shared/models.py rag_eval/config/schema.py rag_eval/config/loader.py rag_eval/reporting/artifacts.py
git commit -m "feat(advisor): wire optimization_advisor field through Scenario, ScenarioModel, loader, and RunArtifactPaths"
Task 5: Lift build_models() to runner + wire run_advisor()
Files:
- Modify: rag_eval/metrics/factory.py
- Modify: rag_eval/execution/runner.py
- Modify: rag_eval/advisor/__init__.py
This is the integration wiring. The key change: build_metric_pipeline() currently creates llm internally and returns only MetricPipeline, so the runner never sees the llm. Instead, runner.py calls build_models() first, then passes llm to both build_metric_pipeline() and run_advisor().
- Step 1: Use build_models() as a public function in factory.py
build_models() is already defined in factory.py (lines 30-39). Because the module defines no __all__, it is already importable; until now only build_metric_pipeline() called it. We treat it as public API and update build_metric_pipeline() to accept optional pre-built models:
# rag_eval/metrics/factory.py — full replacement
"""Factories for OpenAI-backed RAGAS models and metric pipelines."""
from __future__ import annotations

from typing import Any

from openai import AsyncOpenAI

from rag_eval.compat import ensure_ragas_import_compat
from rag_eval.settings import EvaluationSettings
from rag_eval.shared.models import Scenario

ensure_ragas_import_compat()

from ragas.embeddings.base import embedding_factory
from ragas.llms import llm_factory
from ragas.metrics.collections import (
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    FactualCorrectness,
    Faithfulness,
    NoiseSensitivity,
    SemanticSimilarity,
)

from .pipeline import MetricPipeline


def build_models(
    judge_model: str,
    embedding_model: str,
    settings: EvaluationSettings,
) -> tuple[Any, Any]:
    """Create the LLM and embedding clients required by the selected RAGAS metrics."""
    client = AsyncOpenAI(**settings.openai_client_kwargs)
    llm = llm_factory(judge_model, client=client)
    embeddings = embedding_factory(provider="openai", model=embedding_model, client=client)
    return llm, embeddings


def build_metric_pipeline(
    scenario: Scenario,
    settings: EvaluationSettings,
    llm: Any | None = None,
    embeddings: Any | None = None,
) -> MetricPipeline:
    """Build a metric pipeline containing only the metrics requested by the scenario.

    If llm and embeddings are provided (pre-built by the caller), they are reused.
    Otherwise, new instances are created from scenario + settings.
    """
    if llm is None or embeddings is None:
        llm, embeddings = build_models(
            scenario.judge_model,
            scenario.embedding_model,
            settings,
        )
    registry: dict[str, Any] = {
        "faithfulness": Faithfulness(llm=llm),
        "answer_relevancy": AnswerRelevancy(llm=llm, embeddings=embeddings),
        "context_recall": ContextRecall(llm=llm),
        "context_precision": ContextPrecision(llm=llm),
        "noise_sensitivity": NoiseSensitivity(llm=llm),
        "factual_correctness": FactualCorrectness(llm=llm),
        "semantic_similarity": SemanticSimilarity(embeddings=embeddings),
    }
    return MetricPipeline(
        metrics={name: registry[name] for name in scenario.metrics},
        metric_timeout_seconds=settings.ragas_metric_timeout_seconds,
    )
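The "reuse if provided, otherwise build" guard is easy to get subtly wrong, so a small sketch may help. The names below (build_models_stub, build_pipeline_stub, build_calls) are hypothetical stand-ins with no RAGAS dependency; only the guard condition matches the function above.

```python
from __future__ import annotations

from typing import Any

build_calls = 0


def build_models_stub() -> tuple[Any, Any]:
    """Hypothetical stand-in for build_models(); counts how often it runs."""
    global build_calls
    build_calls += 1
    return object(), object()


def build_pipeline_stub(llm: Any | None = None,
                        embeddings: Any | None = None) -> tuple[Any, Any]:
    # Same guard as build_metric_pipeline(): build only when something is missing.
    if llm is None or embeddings is None:
        llm, embeddings = build_models_stub()
    return llm, embeddings


llm, emb = build_models_stub()               # the runner builds once
used_llm, _ = build_pipeline_stub(llm, emb)  # the pipeline reuses the same objects
print(used_llm is llm, build_calls)

# Caveat: passing only one of the two rebuilds both, replacing the one supplied.
other_llm, _ = build_pipeline_stub(llm=llm)
print(other_llm is llm, build_calls)
```

With this guard, callers should pass both objects or neither; a caller that supplies only llm gets a freshly built llm back, which defeats the sharing with the advisor.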
- Step 2: Update rag_eval/advisor/__init__.py with full run_advisor()
# rag_eval/advisor/__init__.py
"""Optimization advisor: rule-based diagnosis + LLM-powered recommendations."""
from __future__ import annotations

import asyncio
import logging
from typing import Any

from rag_eval.reporting.artifacts import build_artifact_paths
from rag_eval.shared.models import EvaluationResult, Scenario

from .llm_analyzer import analyze
from .rules import Diagnosis, diagnose
from .writer import write_advice

logger = logging.getLogger("rag_eval.advisor")

__all__ = ["run_advisor", "Diagnosis", "diagnose"]


def run_advisor(
    result: EvaluationResult,
    scenario: Scenario,
    llm: Any,
) -> None:
    """Run the full optimization advisor pipeline after an evaluation completes.

    Skips silently if scenario.optimization_advisor is False.
    Never raises — failures are logged as warnings, not exceptions.

    Args:
        result: Completed EvaluationResult from Evaluator.evaluate().
        scenario: The resolved Scenario (provides metrics, judge_model, output_dir).
        llm: Pre-built RAGAS LLM instance (from build_models()) for LLM analysis.
    """
    if not scenario.optimization_advisor:
        return

    logger.info("[advisor] starting optimization analysis scenario=%s", scenario.scenario_name)
    try:
        artifact_paths = build_artifact_paths(scenario.output_dir, result.run_id)
        if artifact_paths.advice_md is None:
            logger.warning("[advisor] advice_md path not set in RunArtifactPaths — skipping")
            return

        diagnoses = diagnose(result.score_rows, scenario.metrics)
        logger.info("[advisor] rule diagnosis complete: %d metric(s) triggered", len(diagnoses))

        if diagnoses:
            llm_markdown = asyncio.run(analyze(diagnoses, llm, scenario.scenario_name))
        else:
            llm_markdown = ""

        write_advice(
            diagnoses=diagnoses,
            llm_markdown=llm_markdown,
            advice_path=artifact_paths.advice_md,
            scenario_name=scenario.scenario_name,
            run_id=result.run_id,
            judge_model=scenario.judge_model,
        )
    except Exception as exc:
        logger.warning(
            "[advisor] advisor failed (%s: %s) — evaluation result is unaffected",
            type(exc).__name__, exc,
        )
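run_advisor() bridges from synchronous code into the async analyzer with asyncio.run(). That call raises RuntimeError when an event loop is already running in the current thread, and in run_advisor() the outer except would turn this into a logged warning with no report written. The sketch below demonstrates the behaviour with hypothetical stubs (analyze_stub, run_sync, called_from_async_code).

```python
import asyncio


async def analyze_stub() -> str:
    """Hypothetical stand-in for the async analyze() call."""
    return "report"


def run_sync() -> str:
    """Run the coroutine from sync code; return "" if a loop is already running."""
    coro = analyze_stub()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return ""


async def called_from_async_code() -> str:
    # asyncio.run() is rejected here because this coroutine runs inside a loop.
    return run_sync()


print(run_sync())
print(repr(asyncio.run(called_from_async_code())))
```

This matters only if the runner is ever invoked from async code, such as a notebook or an async web handler; the plain CLI path calls it synchronously.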
- Step 3: Update runner.py to lift llm and call run_advisor()
In rag_eval/execution/runner.py, make these changes:
- Add the advisor import and extend the existing rag_eval.metrics.factory import to include build_models:
from rag_eval.advisor import run_advisor
from rag_eval.metrics.factory import build_models, build_metric_pipeline
- Update run_scenario() to build the models once and call run_advisor() at the end:
# rag_eval/execution/runner.py — full replacement
"""High-level scenario runner used by the package and CLI entrypoints."""
from __future__ import annotations
import logging
import sys
from pathlib import Path
from rag_eval.adapters.http import HttpAppAdapter
from rag_eval.adapters.python import PythonFunctionAdapter
from rag_eval.advisor import run_advisor
from rag_eval.config.loader import load_scenario
from rag_eval.metrics.factory import build_models, build_metric_pipeline
from rag_eval.reporting.writers import write_run_artifacts
from rag_eval.settings import EvaluationSettings
from rag_eval.shared.models import Scenario
from .evaluator import Evaluator
logger = logging.getLogger("rag_eval.execution.runner")
def _setup_logging(log_file: Path | None = None, level: int = logging.INFO) -> None:
"""Configure root logger: always write to stderr, optionally also to a file."""
fmt = "%(asctime)s %(levelname)-8s %(name)s %(message)s"
datefmt = "%H:%M:%S"
handlers: list[logging.Handler] = [logging.StreamHandler(sys.stderr)]
if log_file is not None:
log_file.parent.mkdir(parents=True, exist_ok=True)
fh = logging.FileHandler(log_file, encoding="utf-8")
fh.setFormatter(logging.Formatter(fmt, datefmt=datefmt))
handlers.append(fh)
logging.basicConfig(level=level, format=fmt, datefmt=datefmt, handlers=handlers, force=True)
logging.getLogger("ragas").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("openai").setLevel(logging.WARNING)
def build_adapter(scenario: Scenario):
"""Instantiate the adapter required by the resolved scenario, if any."""
if scenario.app_adapter is None:
return None
if scenario.app_adapter.type == "http":
return HttpAppAdapter(scenario.app_adapter)
if scenario.app_adapter.type == "python":
return PythonFunctionAdapter(scenario.app_adapter)
raise ValueError(f"Unsupported adapter type: {scenario.app_adapter.type}")


def run_scenario(
    scenario_path: str,
    settings: EvaluationSettings | None = None,
    log_file: Path | None = None,
    log_level: int = logging.INFO,
):
    """Run one scenario end to end and persist its reporting artifacts."""
    _setup_logging(log_file=log_file, level=log_level)
    logger.info("[runner] run_scenario path=%s", scenario_path)
    settings = settings or EvaluationSettings()
    if not settings.openai_api_key:
        raise EnvironmentError("OPENAI_API_KEY must be set before running the evaluator.")
    scenario = load_scenario(scenario_path)
    logger.info(
        "[runner] scenario loaded: name=%s mode=%s max_samples=%s",
        scenario.scenario_name, scenario.mode, scenario.runtime.max_samples,
    )
    # Build models once; reuse llm in both MetricPipeline and advisor.
    llm, embeddings = build_models(scenario.judge_model, scenario.embedding_model, settings)
    adapter = build_adapter(scenario)
    pipeline = build_metric_pipeline(scenario, settings, llm=llm, embeddings=embeddings)
    evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline, app_adapter=adapter)
    result = evaluator.evaluate()
    write_run_artifacts(result)
    logger.info("[runner] artifacts written for run_id=%s", result.run_id)
    # Optimization advisor — runs only if scenario.optimization_advisor is True.
    run_advisor(result, scenario, llm)
    return result
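The call to `run_advisor(result, scenario, llm)` above is unconditional, so the on/off switch has to live inside the entry point itself. The following is a minimal sketch of that gating and of the LLM failure fallback, assuming the three advisor stages are passed in as plain callables. `run_advisor_sketch`, `ScenarioStub`, and the callable parameters are hypothetical names for illustration, not the actual contents of `rag_eval/advisor/__init__.py`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ScenarioStub:
    """Hypothetical stand-in for Scenario; only the flag matters here."""
    optimization_advisor: bool = False


def run_advisor_sketch(
    result: Any,
    scenario: ScenarioStub,
    llm: Any,
    diagnose: Callable[[Any], list],
    analyze: Callable[[list, Any], str],
    write: Callable[[list, str], None],
) -> bool:
    """Return True if advice was written, False if the advisor was skipped."""
    if not scenario.optimization_advisor:
        return False  # no-op when the flag is off
    diagnoses = diagnose(result)
    try:
        markdown = analyze(diagnoses, llm)
    except Exception:
        # Assumed convention: an empty string tells the writer to use the rules-only report.
        markdown = ""
    write(diagnoses, markdown)
    return True
```

Keeping the flag check inside the entry point means `runner.py` needs no branching of its own, and a failing LLM call cannot break a run whose artifacts are already written.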
- Step 4: Verify existing tests still pass
python -m pytest tests/ -v 2>&1 | tail -20
Expected: the same pass count as before this change; only the 4 pre-existing failures remain.
- Step 5: Commit
git add rag_eval/metrics/factory.py rag_eval/execution/runner.py rag_eval/advisor/__init__.py
git commit -m "feat(advisor): wire run_advisor into runner, lift llm from build_metric_pipeline"
Task 6: Enable advisor in Siemens online YAML
Files:
- Modify: scenarios/online/siemens-pdf-question-bank-online.yaml
- Step 1: Add optimization_advisor field
Read the current file first, then add one line after embedding_model:
scenario_name: siemens-pdf-question-bank-online
mode: online
dataset: ../../datasets/raw/generated/siemens-pdf-question-bank.csv
judge_model: deepseek-v4-flash
embedding_model: text-embedding-v3
optimization_advisor: true   # generate the optimization advice report automatically after the evaluation run
metrics:
  - faithfulness
  - answer_relevancy
  - context_recall
  - context_precision
  # Enabled: robustness / end-to-end metrics (the dataset already includes ground_truth)
  - noise_sensitivity      # robustness: sensitivity to retrieval noise
  - factual_correctness    # end-to-end: factual correctness (against the reference answer)
  - semantic_similarity    # end-to-end: semantic similarity (embedding-based, no LLM call)
output_dir: ../../outputs/online/siemens-pdf-question-bank
runtime:
  batch_size: 4
  app_concurrency: 4
  metric_concurrency: 4
  max_samples: 50
app_adapter:
  type: python
  callable: apps.siemens_pdf_qa.adapter:run
  static_kwargs:
    source_chunks_path: ../../outputs/dataset-builds/siemens-pdf-question-bank/latest/source_chunks.jsonl
    model: deepseek-v4-flash
- Step 2: Verify scenario loads correctly
python -c "
from rag_eval.config.loader import load_scenario
s = load_scenario('scenarios/online/siemens-pdf-question-bank-online.yaml')
print('optimization_advisor:', s.optimization_advisor)
print('metrics:', s.metrics)
"
Expected:
optimization_advisor: True
metrics: ['faithfulness', 'answer_relevancy', 'context_recall', 'context_precision', 'noise_sensitivity', 'factual_correctness', 'semantic_similarity']
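The expected output depends on the loader passing the new field through and on the default staying `False` for scenarios that omit it. A minimal sketch of that pass-through, assuming the YAML has already been parsed into a plain dict; `ScenarioSketch` and `load_scenario_sketch` are hypothetical names, and the real `ScenarioModel` and `load_scenario()` carry many more fields:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioSketch:
    """Hypothetical two-field subset of the resolved Scenario."""
    scenario_name: str
    optimization_advisor: bool = False  # default off, matching the plan


def load_scenario_sketch(raw: dict) -> ScenarioSketch:
    """Copy the flag from the parsed YAML mapping, defaulting to False when absent."""
    return ScenarioSketch(
        scenario_name=raw["scenario_name"],
        optimization_advisor=bool(raw.get("optimization_advisor", False)),
    )


enabled = load_scenario_sketch(
    {"scenario_name": "siemens-pdf-question-bank-online", "optimization_advisor": True}
)
legacy = load_scenario_sketch({"scenario_name": "older-scenario"})
print("optimization_advisor:", enabled.optimization_advisor, legacy.optimization_advisor)
```

Existing scenario files that never mention the field keep their current behavior, which is why only the Siemens online YAML needs editing in this task.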
- Step 3: Commit
git add scenarios/online/siemens-pdf-question-bank-online.yaml
git commit -m "feat(advisor): enable optimization_advisor in siemens online scenario"
Task 7: Run all advisor tests + smoke check
Files: no new module files (Step 2 optionally adds scripts/smoke_advisor.py, which Step 3 commits)
- Step 1: Run full advisor test suite
python -m pytest tests/test_advisor_rules.py tests/test_advisor_writer.py -v
Expected: 15 tests PASS (9 rules + 6 writer).
- Step 2: Smoke-check the full module wiring (no network)
# paste into Python REPL or save as scripts/smoke_advisor.py and run
import sys
import tempfile
from pathlib import Path

sys.path.insert(0, ".")

from rag_eval.advisor.rules import diagnose
from rag_eval.advisor.writer import write_advice, _format_log_summary

# Simulate score_rows with low faithfulness and high noise_sensitivity
rows = [
    {"sample_id": f"s{i}", "question": f"q{i}", "answer": f"a{i}",
     "ground_truth": f"gt{i}", "faithfulness": 0.3 + i*0.05,
     "noise_sensitivity": 0.4 + i*0.02}
    for i in range(5)
]
diags = diagnose(rows, metrics=["faithfulness", "noise_sensitivity"])
print(f"Diagnosed {len(diags)} metric(s):")
for d in diags:
    print(f"  {d.metric}: mean={d.mean_score}, severity={d.severity}, samples={len(d.low_samples)}")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "optimization_advice.md"
    write_advice(
        diagnoses=diags,
        llm_markdown="",  # fallback mode
        advice_path=path,
        scenario_name="smoke-test",
        run_id="2026-01-01T00-00-00",
        judge_model="deepseek-v4-flash",
    )
    # Read the file back before the temporary directory is removed.
    content = path.read_text(encoding="utf-8")
    print(f"\nAdvice file ({len(content)} chars):")
    print(content[:600])

print("\nSmoke check PASSED")
python scripts/smoke_advisor.py
Expected: prints diagnosed metrics, advice content, Smoke check PASSED.
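The smoke script passes `llm_markdown=""` to exercise the fallback path. One way the writer could choose between the LLM report and a rules-only summary is sketched below. `DiagnosisStub` and `render_advice_sketch` are hypothetical names, the stub carries only the fields the smoke script prints, and the real `write_advice()` layout may differ.

```python
from dataclasses import dataclass


@dataclass
class DiagnosisStub:
    """Hypothetical subset of the Diagnosis fields printed by the smoke script."""
    metric: str
    mean_score: float
    severity: str


def render_advice_sketch(diagnoses: list[DiagnosisStub], llm_markdown: str) -> str:
    """Prefer the LLM report; fall back to a rules-only summary when it is empty."""
    if llm_markdown.strip():
        return llm_markdown
    lines = ["# Optimization advice (rules-only fallback)", ""]
    for d in diagnoses:
        lines.append(f"- {d.metric}: mean={d.mean_score:.2f}, severity={d.severity}")
    return "\n".join(lines) + "\n"


fallback = render_advice_sketch([DiagnosisStub("faithfulness", 0.4, "critical")], "")
print(fallback)
```

Treating a whitespace-only string the same as an empty one keeps the fallback robust if the LLM returns nothing useful.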
- Step 3: Commit smoke script
git add scripts/smoke_advisor.py
git commit -m "test(advisor): add smoke-check script for offline wiring verification"
Task 8: Update docs
Files:
- Modify: docs/rag-eval-engine-flow.md
- Modify: docs/rag-eval-architecture.md
- Step 1: Add advisor section to rag-eval-engine-flow.md
Append a new section at the end of docs/rag-eval-engine-flow.md:
````markdown
---

## 15. Optimization Advisor Flow

Related code:

- `rag_eval/advisor/__init__.py`: external entry point `run_advisor()`
- `rag_eval/advisor/rules.py`: rules engine (pure functions, no LLM)
- `rag_eval/advisor/llm_analyzer.py`: LLM analyzer (reuses judge_model)
- `rag_eval/advisor/writer.py`: writes `optimization_advice.md` and a log summary

The advisor runs after `write_run_artifacts()` and only takes effect when the scenario sets `optimization_advisor: true`.

Execution flow:

```text
run_advisor(result, scenario, llm)
  -> rules.diagnose(score_rows, metrics)        # flag abnormal metrics, select low-scoring samples
  -> llm_analyzer.analyze(diagnoses, llm)       # LLM generates Chinese-language advice (falls back automatically on failure)
  -> writer.write_advice(...)                   # write optimization_advice.md + log summary
```

The output artifact is added to the existing run directory:

```text
outputs/online/siemens-pdf-question-bank/<run_id>/
  ...(existing files)
  optimization_advice.md   ← new (generated when optimization_advisor: true)
```
````
- [ ] **Step 2: Add advisor note to rag-eval-architecture.md §9.4**
In `docs/rag-eval-architecture.md`, find the end of section `### 9.4 指标编排` and append:
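````markdown
**Optimization Advisor (implements the §11 optimization strategy):**

After an evaluation run finishes, if the scenario sets `optimization_advisor: true`, the `rag_eval/advisor/` module is invoked automatically:

- The rules engine (`rules.py`) applies a threshold to each of the 7 metrics, flags the ones that trigger, and selects the top-3 low-scoring samples
- The LLM analyzer (`llm_analyzer.py`) uses the low-scoring samples to generate Chinese-language Markdown optimization advice (reuses judge_model; on failure it falls back automatically to a rules-only report)
- The writer (`writer.py`) outputs `optimization_advice.md` and logs a summary

```yaml
# Example scenario configuration
optimization_advisor: true
```
````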
- [ ] **Step 3: Commit docs**
git add docs/rag-eval-engine-flow.md docs/rag-eval-architecture.md
git commit -m "docs: add optimization advisor section to engine-flow and architecture docs"
---
## Self-Review
**Spec coverage check:**
- §2 Decision summary → Task 5 (llm reuse), Task 6 (YAML trigger), Task 1+2+3 (outputs) ✓
- §3.1 Execution flow → Task 5 runner.py wiring ✓
- §3.2 New files → Tasks 1, 2, 3 ✓
- §3.3 Modified files → Task 4 (models/schema/loader/artifacts), Task 5 (factory/runner) ✓
- §3.4 Output artifacts → Task 4 advice_md in RunArtifactPaths ✓
- §4 Rules engine, 7 rules → Task 1 rules.py METRIC_RULES ✓
- §4.3 low_samples top-3 → Task 1 `_select_low_samples` ✓
- §5 LLM analyzer → Task 3 llm_analyzer.py ✓
- §5.4 Failure fallback → Task 3 try/except returns "" → Task 2 writer fallback ✓
- §6.1 File output + §6.2 Log summary → Task 2 writer.py ✓
- §7 YAML configuration → Task 6 ✓
- §8 Test strategy → Tasks 1 (rules tests), 2 (writer tests) ✓
- §9 Non-goals → intentionally not implemented ✓
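
The §4 and §4.3 items above refer to per-metric warning/critical thresholds and top-3 low-sample selection. A minimal sketch of one such rule follows, with assumptions stated plainly: `RuleSketch`, `DiagnosisSketch`, `diagnose_metric`, and the threshold values are hypothetical, not the actual `METRIC_RULES` or `_select_low_samples`. The `higher_is_better` flag is also an assumption, added because a metric such as noise_sensitivity is worse when it is higher.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RuleSketch:
    warning: float
    critical: float
    higher_is_better: bool = True  # assumed flag; False for "lower is better" metrics


@dataclass
class DiagnosisSketch:
    metric: str
    mean_score: float
    severity: str
    low_samples: list[dict] = field(default_factory=list)


def diagnose_metric(rows: list[dict], metric: str, rule: RuleSketch, top_k: int = 3):
    """Return a DiagnosisSketch if the metric mean crosses a threshold, else None."""
    scored = [
        r for r in rows
        if isinstance(r.get(metric), (int, float)) and not math.isnan(r[metric])
    ]
    if not scored:
        return None
    mean = sum(r[metric] for r in scored) / len(scored)
    # Flip the sign so that "smaller is worse" holds for both metric directions.
    sign = 1 if rule.higher_is_better else -1
    if sign * mean <= sign * rule.critical:
        severity = "critical"
    elif sign * mean <= sign * rule.warning:
        severity = "warning"
    else:
        return None
    worst = sorted(scored, key=lambda r: sign * r[metric])[:top_k]
    return DiagnosisSketch(metric, round(mean, 4), severity, worst)


rows = [
    {"sample_id": f"s{i}", "faithfulness": 0.3 + i * 0.05, "noise_sensitivity": 0.4 + i * 0.02}
    for i in range(5)
]
faith = diagnose_metric(rows, "faithfulness", RuleSketch(warning=0.7, critical=0.5))
noise = diagnose_metric(
    rows, "noise_sensitivity", RuleSketch(warning=0.3, critical=0.5, higher_is_better=False)
)
```

With these illustrative thresholds, the faithfulness rule reports the three lowest-scoring samples, while the noise_sensitivity rule reports the three highest-scoring ones.
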
**Type consistency check:**
- `Diagnosis` defined in `rules.py`, imported in `__init__.py`, `llm_analyzer.py`, `writer.py` — all consistent ✓
- `write_advice()` signature matches calls in `__init__.py` ✓
- `build_models()` returns `tuple[Any, Any]`; `build_metric_pipeline()` accepts `llm, embeddings` — consistent ✓
- `run_advisor(result, scenario, llm)` signature matches call in `runner.py` ✓
**Placeholder scan:** No TBD/TODO/fill-in-later found ✓