docs: add metric and doc weights feature design spec

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-18 16:37:18 +08:00
parent 9ad2daff73
commit ca586bf9bb
1 changed files with 240 additions and 0 deletions
--- a/docs/superpowers/specs/2026-06-18-metric-doc-weights-design.md
+++ b/docs/superpowers/specs/2026-06-18-metric-doc-weights-design.md
@@ -0,0 +1,240 @@
+# Metric Weights & Document Chunk Weights Feature Design
+
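+**Date**: 2026-06-18  
+**Status**: Approved, pending implementation  
+**Scope**: When an evaluation is run from the "New Evaluation" page, support configuring weights for RAGAS metrics and for documents, compute a weighted composite score, and show it in the report.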
+
+---
+
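+## 1. Goals
+
+1. **Metric weights**: allow a float weight per RAGAS metric (e.g. `faithfulness: 0.35`) and compute a weighted composite score, `weighted_score`, for each question.
+2. **Document weights (Doc Weights)**: allow a weight for specific PDF document names (e.g. `"322_双源CT.pdf": 2.0`); questions drawn from that document contribute proportionally more when metric means are aggregated.
+3. **Frontend override**: after a scenario is selected on the "New Evaluation" page, show an editable weight panel so the weights in the YAML can be overridden temporarily before a run.
+4. **Full backward compatibility**: both fields are optional; when omitted, behavior falls back to equal weighting, and existing scenario YAML files need no changes.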
+
+---
+
+## 2. Data Model
+
+### 2.1 Scenario YAML (new optional fields)
+
+```yaml
+# Optional. When omitted, every metric weight = 1.0
+metric_weights:
+  faithfulness: 0.35
+  context_recall: 0.25
+  context_precision: 0.20
+  answer_relevancy: 0.20
+
+# Optional. When omitted, every document weight = 1.0
+doc_weights:
+  "322_双源CT成像技术.pdf": 2.0
+  "323_单源CT对比.pdf": 1.5
+```
+
+### 2.2 Pydantic Schema (`rag_eval/config/schema.py`)
+
+`ScenarioModel` adds:
+```python
+metric_weights: dict[str, float] = Field(default_factory=dict)
+doc_weights:    dict[str, float] = Field(default_factory=dict)
+```
+
+`ConfigDict(extra="ignore")` is unchanged; the new fields do not affect how existing YAML files are loaded.
+
+### 2.3 Internal Scenario dataclass (`rag_eval/shared/models.py`)
+
+`Scenario` adds:
+```python
+metric_weights: dict[str, float] = field(default_factory=dict)
+doc_weights:    dict[str, float] = field(default_factory=dict)
+```
+
+Both fields are serialized with `scenario.snapshot()` so that `run_reader` and the reporting layer can read them.
+
+---
+
+## 3. Backend: Weight Computation Logic
+
+### 3.1 New module `rag_eval/metrics/weights.py`
+
+A module of pure functions with no external dependencies, testable in isolation:
+
+```python
+def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
+    """Return the weight for key, or default when the key is missing."""
+
+def compute_weighted_score(
+    scores: dict[str, float | None],
+    metric_weights: dict[str, float],
+) -> float | None:
+    """
+    Given per-metric scores and weights, return the weighted composite score.
+    - NaN / None values are ignored
+    - Falls back to an equal-weight mean when metric_weights is empty
+    - Returns None when every score is NaN
+    Formula: Σ(w_i * s_i) / Σ(w_i), summed over non-NaN terms only
+    """
+
+def weighted_metric_means(
+    score_rows: list[dict],
+    metrics: list[str],
+    doc_weights: dict[str, float],
+) -> dict[str, float | None]:
+    """
+    Compute the document-weighted mean of each metric.
+    - sample_weight = doc_weights.get(row["doc_name"], 1.0)
+    - Formula: Σ(sample_weight_j * score_m_j) / Σ(sample_weight_j)
+    - Falls back to the plain arithmetic mean when doc_weights is empty
+    """
+```
+
+### 3.2 Evaluator (`rag_eval/execution/evaluator.py`)
+
+`_merge_score()` adds two columns:
+```python
+record["weighted_score"] = compute_weighted_score(
+    score.metrics, self.scenario.metric_weights
+)
+record["sample_weight"] = self.scenario.doc_weights.get(
+    sample.metadata.get("doc_name", ""), 1.0
+)
+```
+
+`scores.csv` gains two new columns: `weighted_score` and `sample_weight`.
+
+### 3.3 Report Summary (`rag_eval/reporting/summary.py`)
+
+`build_summary_markdown()` switches to `weighted_metric_means()` for the per-metric means
+and adds a row for the overall `weighted_score` mean:
+
+```
+## Metric Means (weighted)
+- faithfulness:     0.8123  (w=0.35)
+- context_recall:   0.7654  (w=0.25)
+- context_precision: 0.7200  (w=0.20)
+- answer_relevancy: 0.7400  (w=0.20)
+- **weighted_score: 0.7789**
+```
+
+---
+
+## 4. yaml_patcher Extension (`webapp/services/yaml_patcher.py`)
+
+The signature of `apply_profiles_to_scenario()` is extended with two new optional parameters:
+
+```python
+def apply_profiles_to_scenario(
+    scenario_path: str,
+    judge_profile: LLMProfile | None,
+    answer_profile: LLMProfile | None,
+    dataset_profile: LLMProfile | None,
+    metric_weights: dict[str, float] | None = None,   # new
+    doc_weights: dict[str, float] | None = None,       # new
+    _resolve_absolute: bool = False,
+) -> list[str]:
+```
+
+- When `metric_weights` is not None, it is written to `data["metric_weights"]` and `"metric_weights"` is appended to the patched list
+- When `doc_weights` is not None, it is written to `data["doc_weights"]` and `"doc_weights"` is appended to the patched list
+
+---
+
+## 5. Webapp Model and API Extensions
+
+### 5.1 `webapp/models.py`
+
+`ProfileApplyRequest` adds:
+```python
+metric_weights: dict[str, float] | None = None
+doc_weights:    dict[str, float] | None = None
+```
+
+`ProfileApplyResponse` is unchanged (`patched_fields` already carries the new field names).
+
+### 5.2 `webapp/api/llm_profiles.py` — `apply_profiles()`
+
+Passes `metric_weights` / `doc_weights` through to `apply_profiles_to_scenario()`.
+
+---
+
+## 6. Frontend: Weight Configuration Panel
+
+### 6.1 HTML (`index.html`)
+
+Add `#weight-config-panel` below `#llm-assignment-panel` (shown once a scenario is selected):
+
+```
+┌────────────────────────────────────────────────────────────┐
+│ Weight config (optional; leave blank to use scenario YAML) │
+├────────────────────────────────────────────────────────────┤
+│ Metric weights                                             │
+│  faithfulness        [____1.0____]                         │
+│  context_recall      [____1.0____]                         │
+│  ... (generated from the selected scenario's metrics)      │
+│                                                            │
+│ Document weights (doc_weights)                             │
+│  [doc name____________] [weight__] [+] [x]                 │
+│  [doc name____________] [weight__] [+] [x]                 │
+│  + Add document weight rule                                │
+└────────────────────────────────────────────────────────────┘
+```
+
+### 6.2 `runner.js`
+
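+- After a scenario is selected, `renderScenarioItem()` calls `Runner._renderWeightPanel(sc)` to generate the metric rows dynamically
+- `_applyProfilesIfNeeded()` also reads the weight inputs and appends them to the `apply` request body
+- `Runner._collectWeights()` collects `metric_weights` / `doc_weights`; when every weight is 1.0, nothing is sent (the step is skipped)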
+
+### 6.3 CSS (`app.css`)
+
+Add `.weight-config-panel`, `.weight-row`, and `.weight-input` styles, consistent with the existing `.llm-role-row` style.
+
+---
+
+## 7. Report Display (`webapp/services/report_builder.py`)
+
+- `RunSummary.metric_means` is computed with `weighted_metric_means()` (this requires reading `doc_weights` / `metric_weights` from `scenario.snapshot.yaml`)
+- `RunSummary` gains a `weighted_score_mean: float | None` field
+- The metric card area in the frontend `report.js` gains a "Weighted Composite Score" card that uses the `good/warn/bad` color scheme
+
+---
+
+## 8. Test Plan
+
+| Test file | Coverage |
+|-----------|----------|
+| `tests/test_weights.py` | Pure functions `compute_weighted_score` / `weighted_metric_means`, including NaN edge cases, empty weights, and all-NaN input |
+| `tests/test_dataset_build.py` | No changes (well isolated) |
+| `tests/test_offline_eval.py` | Assertions for the new `weighted_score` / `sample_weight` columns in `_merge_score` |
+| `tests/webapp/test_llm_profiles_api.py` | Patching tests for `apply_profiles` with `metric_weights` / `doc_weights` |
+
+---
+
+## 9. Changed Files
+
+| File | Change type |
+|------|-------------|
+| `rag_eval/config/schema.py` | New fields |
+| `rag_eval/shared/models.py` | New fields |
+| `rag_eval/config/loader.py` | Pass the new fields through to Scenario |
+| `rag_eval/metrics/weights.py` | **New file** |
+| `rag_eval/execution/evaluator.py` | `_merge_score` adds two columns |
+| `rag_eval/reporting/summary.py` | Switch to weighted means |
+| `webapp/services/yaml_patcher.py` | New metric_weights / doc_weights parameters |
+| `webapp/models.py` | ProfileApplyRequest gains fields; RunSummary gains weighted_score_mean |
+| `webapp/api/llm_profiles.py` | Pass the new parameters through |
+| `webapp/services/report_builder.py` | Weighted mean computation |
+| `webapp/static/index.html` | New weight configuration panel |
+| `webapp/static/js/runner.js` | Weight panel logic |
+| `webapp/static/css/app.css` | New weight panel styles |
+| `tests/test_weights.py` | **New file** |
+
+---
+
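+## 10. Backward Compatibility Guarantees
+
+- `metric_weights: {}` + `doc_weights: {}` → all weights = 1.0, so behavior is identical to the current behavior
+- Existing scenario YAML files lack both fields → Pydantic's `default_factory=dict` fills in empty dicts
+- The two new columns in `scores.csv` do not affect existing report reading (`run_reader` reads only known columns)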
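
To illustrate the contracts that section 3.1 of the spec defines for `rag_eval/metrics/weights.py`, here is a minimal Python sketch. It is not part of this commit, which adds only the design document. Several details are assumptions the spec does not state: metrics missing from `metric_weights` default to weight 1.0 through `resolve_weight`, `weighted_metric_means` skips None/NaN scores the same way `compute_weighted_score` does, rows without a `doc_name` key get weight 1.0, and a zero total weight yields `None`.

```python
from __future__ import annotations

import math


def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for key, or default when the key is missing."""
    return weights.get(key, default)


def _is_valid(value: float | None) -> bool:
    # Helper name is hypothetical; treats None and NaN as "no score".
    return value is not None and not math.isnan(value)


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """Sum(w_i * s_i) / Sum(w_i) over the valid scores only."""
    numerator = 0.0
    denominator = 0.0
    for name, value in scores.items():
        if not _is_valid(value):
            continue
        weight = resolve_weight(metric_weights, name)  # assumed default: 1.0
        numerator += weight * value
        denominator += weight
    # Assumption: no valid score (or zero total weight) means "no result".
    return numerator / denominator if denominator > 0 else None


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """Document-weighted mean of each metric across all rows."""
    means: dict[str, float | None] = {}
    for metric in metrics:
        numerator = 0.0
        denominator = 0.0
        for row in score_rows:
            value = row.get(metric)
            if not _is_valid(value):  # assumption: invalid scores are skipped
                continue
            sample_weight = resolve_weight(doc_weights, row.get("doc_name", ""))
            numerator += sample_weight * value
            denominator += sample_weight
        means[metric] = numerator / denominator if denominator > 0 else None
    return means
```

With empty weight dictionaries both functions reduce to plain arithmetic means, which is the equal-weight fallback that section 10 relies on.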