
wangwei ca586bf9bb docs: add metric and doc weights feature design spec

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-06-18 16:37:18 +08:00


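Metric Weights & Document Chunk Weights Feature Design

Date: 2026-06-18
Status: Approved, pending implementation
Scope: When an evaluation is run from the "New Evaluation" page, support configuring weights for RAGAS metrics and for documents, compute a weighted composite score, and show it in the report.

1. Goals

Metric weights (Metric Weights): allow a float weight to be configured for each RAGAS metric (e.g. faithfulness: 0.35) and compute a weighted composite score, weighted_score, for every question.
Doc weights (Doc Weights): allow a weight to be configured for specific PDF document names (e.g. "322_双源CT.pdf": 2.0); questions from that document contribute proportionally more when the aggregate metric means are computed.
Frontend override: once a scenario is selected on the "New Evaluation" page, show an editable weight panel so the weights in the YAML can be overridden temporarily before the run.
Fully backward compatible: both fields are optional. When they are omitted, behavior falls back to equal weighting, and existing scenario YAML files need no changes.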

2. Data Model

2.1 Scenario YAML (new optional fields)

# Optional. When omitted, every metric weight = 1.0
metric_weights:
  faithfulness: 0.35
  context_recall: 0.25
  context_precision: 0.20
  answer_relevancy: 0.20

# Optional. When omitted, every document weight = 1.0
doc_weights:
  "322_双源CT成像技术.pdf": 2.0
  "323_单源CT对比.pdf": 1.5

2.2 Pydantic Schema (`rag_eval/config/schema.py`)

New fields on ScenarioModel:

metric_weights: dict[str, float] = Field(default_factory=dict)
doc_weights:    dict[str, float] = Field(default_factory=dict)

ConfigDict(extra="ignore") is unchanged, so the new fields do not affect how existing YAML files are loaded.

2.3 Internal Scenario dataclass (`rag_eval/shared/models.py`)

New fields on Scenario:

metric_weights: dict[str, float] = field(default_factory=dict)
doc_weights:    dict[str, float] = field(default_factory=dict)

Both fields are serialized by scenario.snapshot() so that run_reader and the reporting layer can read them.
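
As a rough illustration of how the two fields could travel with a snapshot, here is a minimal sketch. The `name` field stands in for the existing Scenario fields, which this spec does not list, and the `asdict`-based `snapshot()` is an assumption rather than the project's actual implementation.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Scenario:
    # Hypothetical placeholder for the fields Scenario already has.
    name: str
    # New optional fields; each instance gets its own empty dict by default.
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)

    def snapshot(self) -> dict:
        # Assumed behavior: dump every field into a plain dict.
        return asdict(self)


# A scenario defined before this feature existed carries empty weight dicts.
legacy = Scenario(name="legacy")
# A scenario that sets only metric weights leaves doc_weights empty.
weighted = Scenario(name="ct", metric_weights={"faithfulness": 0.35})
```

Using `default_factory=dict` rather than a shared `{}` default keeps one scenario's weights from leaking into another.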

3. Backend: Weight Computation Logic

3.1 New module `rag_eval/metrics/weights.py`

A module of pure functions with no external dependencies, testable in isolation:

def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for key, or default when key is missing."""

def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """
    Given per-metric scores and weights, return the weighted composite score.
    - NaN / None values are ignored
    - An empty metric_weights falls back to the equal-weight mean
    - Returns None when every score is NaN
    Formula: Σ(w_i * s_i) / Σ(w_i), summed over non-NaN terms only
    """

def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """
    Compute the document-weighted mean of each metric.
    - sample_weight = doc_weights.get(row["doc_name"], 1.0)
    - Formula: Σ(sample_weight_j * score_m_j) / Σ(sample_weight_j)
    - An empty doc_weights falls back to the plain arithmetic mean
    """

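The stubs above could be filled in roughly as follows. This is a sketch under stated assumptions, not the final module: the `_is_valid` helper is hypothetical, metrics absent from `metric_weights` are assumed to default to 1.0 via `resolve_weight`, NaN / None rows are assumed to be skipped in `weighted_metric_means` as well (the spec only states this for `compute_weighted_score`), and a zero total weight is assumed to yield None.

```python
from __future__ import annotations

import math


def _is_valid(value) -> bool:
    """Hypothetical helper: False for None and NaN, True otherwise."""
    return value is not None and not (isinstance(value, float) and math.isnan(value))


def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    return weights.get(key, default)


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    num = den = 0.0
    for metric, score in scores.items():
        if not _is_valid(score):
            continue  # NaN / None scores drop out of both sums
        w = resolve_weight(metric_weights, metric)
        num += w * score
        den += w
    return num / den if den > 0 else None


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    means: dict[str, float | None] = {}
    for metric in metrics:
        num = den = 0.0
        for row in score_rows:
            score = row.get(metric)
            if not _is_valid(score):
                continue  # assumption: invalid rows are skipped per metric
            w = resolve_weight(doc_weights, row.get("doc_name", ""))
            num += w * score
            den += w
        means[metric] = num / den if den > 0 else None
    return means
```

For example, scores of 0.8 and 0.6 with weights 0.75 and 0.25 give a composite of about 0.75, and two rows scoring 1.0 and 0.0 where the first document has weight 3.0 also give a mean of about 0.75 instead of the unweighted 0.5.
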
3.2 Evaluator (`rag_eval/execution/evaluator.py`)

_merge_score() adds two columns:

record["weighted_score"] = compute_weighted_score(
    score.metrics, self.scenario.metric_weights
)
record["sample_weight"] = self.scenario.doc_weights.get(
    sample.metadata.get("doc_name", ""), 1.0
)

scores.csv gains two new columns, weighted_score and sample_weight.

3.3 Report Summary (`rag_eval/reporting/summary.py`)

build_summary_markdown() switches to weighted_metric_means() for the per-metric means and adds a row for the overall weighted_score mean:

## Metric Means (weighted)
- faithfulness:     0.8123  (w=0.35)
- context_recall:   0.7654  (w=0.25)
- context_precision: 0.7200  (w=0.20)
- answer_relevancy: 0.7400  (w=0.20)
- **weighted_score: 0.7789**
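
A block like the one above could be rendered along these lines. The helper name `render_metric_means`, the "n/a" placeholder for missing means, and the two-decimal weight format are assumptions inferred from the sample output, not part of the existing build_summary_markdown().

```python
from __future__ import annotations


def render_metric_means(
    means: dict[str, float | None],
    metric_weights: dict[str, float],
    weighted_score_mean: float | None,
) -> str:
    """Hypothetical helper: format metric means and their weights as Markdown."""
    lines = ["## Metric Means (weighted)"]
    for metric, mean in means.items():
        value = "n/a" if mean is None else f"{mean:.4f}"
        weight = metric_weights.get(metric, 1.0)  # unweighted metrics show w=1.00
        lines.append(f"- {metric}: {value}  (w={weight:.2f})")
    if weighted_score_mean is not None:
        lines.append(f"- **weighted_score: {weighted_score_mean:.4f}**")
    return "\n".join(lines)
```

Showing the weight next to each mean lets a reader of the report see which configuration produced the composite score.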

4. yaml_patcher Extension (`webapp/services/yaml_patcher.py`)

apply_profiles_to_scenario() extends its signature with two new optional parameters:

def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    metric_weights: dict[str, float] | None = None,   # new
    doc_weights: dict[str, float] | None = None,       # new
    _resolve_absolute: bool = False,
) -> list[str]:

When metric_weights is not None, it is written to data["metric_weights"] and "metric_weights" is appended to the patched list.
When doc_weights is not None, it is written to data["doc_weights"] and "doc_weights" is appended to the patched list.
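
The two rules above can be sketched at the dict level as follows. `patch_weights` is a hypothetical helper that operates on an already-loaded YAML mapping; the real apply_profiles_to_scenario() also handles the LLM profiles and file I/O, which are omitted here.

```python
from __future__ import annotations


def patch_weights(
    data: dict,
    metric_weights: dict[str, float] | None = None,
    doc_weights: dict[str, float] | None = None,
) -> list[str]:
    """Hypothetical helper: write weight overrides into data, return patched keys."""
    patched: list[str] = []
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")
    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")
    return patched
```

Because the check is `is not None` rather than truthiness, passing an empty dict in this sketch overwrites the YAML value and resets it to equal weighting, while passing None leaves the YAML untouched.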

5. Webapp Models and API Extension

5.1 `webapp/models.py`

New fields on ProfileApplyRequest:

metric_weights: dict[str, float] | None = None
doc_weights:    dict[str, float] | None = None

ProfileApplyResponse is unchanged (patched_fields already carries the new field names).

5.2 `webapp/api/llm_profiles.py` — `apply_profiles()`

Passes metric_weights / doc_weights through to apply_profiles_to_scenario().

6. Frontend: Weight Configuration Panel

6.1 HTML (`index.html`)

Add #weight-config-panel below #llm-assignment-panel (shown once a scenario is selected):

┌──────────────────────────────────────────────────────────────┐
│ Weight config  (optional; leave blank to use scenario YAML)  │
├──────────────────────────────────────────────────────────────┤
│ Metric weights                                               │
│  faithfulness        [____1.0____]                           │
│  context_recall      [____1.0____]                           │
│  ... (generated from the selected scenario's metrics)        │
│                                                              │
│ Doc weights (doc_weights)                                    │
│  [doc name______________] [weight] [+] [✕]                   │
│  [doc name______________] [weight] [+] [✕]                   │
│  + Add doc weight rule                                       │
└──────────────────────────────────────────────────────────────┘

6.2 `runner.js`

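After a scenario is selected, renderScenarioItem() calls Runner._renderWeightPanel(sc) to generate the metric rows dynamically.
_applyProfilesIfNeeded() also reads the weight inputs and appends them to the apply request body.
Runner._collectWeights() collects metric_weights / doc_weights; when every weight is 1.0, they are not sent (skipped).

6.3 CSS (`app.css`)

Add .weight-config-panel, .weight-row, and .weight-input styles, consistent with the existing .llm-role-row style.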

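7. Report Display (`webapp/services/report_builder.py`)

RunSummary.metric_means is computed with weighted_metric_means() instead (doc_weights / metric_weights must be read from scenario.snapshot.yaml).
RunSummary gains a weighted_score_mean: float | None field.
The metric card area in the frontend report.js gains a "Weighted composite score" card that uses the good/warn/bad color scheme.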

8. Test Plan

| Test file | Coverage |
| --- | --- |
| `tests/test_weights.py` | `compute_weighted_score` / `weighted_metric_means` pure functions, including NaN edge cases, empty weights, and all-NaN input |
| `tests/test_dataset_build.py` | No changes (well isolated) |
| `tests/test_offline_eval.py` | `_merge_score` assertions for the new weighted_score / sample_weight columns |
| `tests/webapp/test_llm_profiles_api.py` | Patching tests for `apply_profiles` with metric_weights / doc_weights |

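9. Changed Files

| File | Change |
| --- | --- |
| `rag_eval/config/schema.py` | New fields |
| `rag_eval/shared/models.py` | New fields |
| `rag_eval/config/loader.py` | Pass the new fields through to Scenario |
| `rag_eval/metrics/weights.py` | New file |
| `rag_eval/execution/evaluator.py` | `_merge_score` adds two columns |
| `rag_eval/reporting/summary.py` | Switch to weighted means |
| `webapp/services/yaml_patcher.py` | New metric_weights / doc_weights parameters |
| `webapp/models.py` | New fields on ProfileApplyRequest; RunSummary gains weighted_score_mean |
| `webapp/api/llm_profiles.py` | Pass the new parameters through |
| `webapp/services/report_builder.py` | Weighted mean computation |
| `webapp/static/index.html` | New weight configuration panel |
| `webapp/static/js/runner.js` | Weight panel logic |
| `webapp/static/css/app.css` | New weight panel styles |
| `tests/test_weights.py` | New file |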

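10. Backward Compatibility Guarantees

metric_weights: {} + doc_weights: {} → every weight = 1.0, so behavior is identical to the current behavior.
Existing scenario YAML files do not contain these two fields → Pydantic fills in empty dicts via default_factory=dict.
The two new columns in scores.csv do not affect existing report loading (run_reader only reads the columns it knows).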
