siemens_ragas/docs/superpowers/specs/2026-06-18-metric-doc-weights-design.md

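# Metric Weights & Document Chunk Weights Feature Design

**Date**: 2026-06-18
**Status**: Approved, pending implementation
**Scope**: When an evaluation is run from the "New Evaluation" page, support configuring weights for RAGAS metrics and for documents, compute a weighted composite score, and show it in the report.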

---

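## 1. Goals

1. **Metric Weights**: allow a float weight per RAGAS metric (e.g. faithfulness: 0.35) and compute a weighted composite score, `weighted_score`, for each question.
2. **Doc Weights**: allow a weight per PDF document name (e.g. `"322_双源CT.pdf": 2.0`); questions from that document contribute to the aggregated metric means in proportion to that weight.
3. **Frontend override**: after a scenario is selected on the "New Evaluation" page, show an editable weight panel so the weights in the YAML can be overridden for a single run.
4. **Full backward compatibility**: both fields are optional. When omitted, behavior falls back to equal weighting, and existing scenario YAML files need no changes.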

---

## 2. Data Model

### 2.1 Scenario YAML (new optional fields)

```yaml
# Optional. When omitted, every metric weight defaults to 1.0
metric_weights:
  faithfulness: 0.35
  context_recall: 0.25
  context_precision: 0.20
  answer_relevancy: 0.20

# Optional. When omitted, every document weight defaults to 1.0
doc_weights:
  "322_双源CT成像技术.pdf": 2.0
  "323_单源CT对比.pdf": 1.5
```

### 2.2 Pydantic Schema (`rag_eval/config/schema.py`)

New fields on `ScenarioModel`:
```python
metric_weights: dict[str, float] = Field(default_factory=dict)
doc_weights:    dict[str, float] = Field(default_factory=dict)
```

`ConfigDict(extra="ignore")` is unchanged; the new fields do not affect loading of existing YAML files.

### 2.3 Internal Scenario dataclass (`rag_eval/shared/models.py`)

New fields on `Scenario`:
```python
metric_weights: dict[str, float] = field(default_factory=dict)
doc_weights:    dict[str, float] = field(default_factory=dict)
```

Both fields are serialized by `scenario.snapshot()` so that `run_reader` and the reporting layer can read them.
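
The following is a minimal sketch of how the two new fields might behave on the dataclass. The `name` and `metrics` fields and the body of `snapshot()` are assumptions for illustration only; the real `Scenario` has more fields and its `snapshot()` may serialize differently.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class Scenario:
    # Hypothetical minimal stand-in for the real Scenario dataclass.
    name: str
    metrics: list[str] = field(default_factory=list)
    # New optional fields: each instance gets its own empty dict by default.
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)

    def snapshot(self) -> dict:
        # Assumption: snapshot() returns a plain dict of all fields.
        return asdict(self)


# A scenario that predates the feature and one that sets a metric weight.
legacy = Scenario(name="legacy", metrics=["faithfulness"])
weighted = Scenario(
    name="weighted",
    metrics=["faithfulness"],
    metric_weights={"faithfulness": 0.35},
)
```

Using `default_factory=dict` rather than a shared `{}` default keeps instances from accidentally sharing one mutable dictionary.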

---

## 3. Backend: Weight Computation Logic

### 3.1 New module `rag_eval/metrics/weights.py`

A module of pure functions with no external dependencies, testable in isolation:

```python
def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for key, or default when the key is missing."""

def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """
    Given per-metric scores and weights, return the weighted composite score.
    - NaN / None values are ignored
    - An empty metric_weights falls back to the equal-weight mean
    - Returns None when every score is NaN
    Formula: Σ(w_i * s_i) / Σ(w_i), summed over non-NaN terms only
    """

def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """
    Compute the document-weighted mean of each metric.
    - sample_weight = doc_weights.get(row["doc_name"], 1.0)
    - Formula: Σ(sample_weight_j * score_m_j) / Σ(sample_weight_j)
    - An empty doc_weights falls back to the plain arithmetic mean
    """
```
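
One possible implementation of the three signatures above is sketched below. It rests on several assumptions the spec does not state: metrics missing from a non-empty `metric_weights` default to 1.0 via `resolve_weight`, `weighted_metric_means` skips NaN / None scores the same way `compute_weighted_score` does, a zero weight sum yields `None`, and a row without `doc_name` is treated as an unlisted document. The `_is_valid` helper is hypothetical.

```python
from __future__ import annotations

import math


def _is_valid(value: float | None) -> bool:
    """Hypothetical helper: True for real numbers, False for None and NaN."""
    return value is not None and not math.isnan(value)


def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    return weights.get(key, default)


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    total = 0.0
    weight_sum = 0.0
    for metric, score in scores.items():
        if not _is_valid(score):
            continue  # NaN / None scores drop out of both sums
        w = resolve_weight(metric_weights, metric)
        total += w * score
        weight_sum += w
    # Assumption: nothing left to average (or all weights 0) gives None.
    return total / weight_sum if weight_sum > 0 else None


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    means: dict[str, float | None] = {}
    for metric in metrics:
        total = 0.0
        weight_sum = 0.0
        for row in score_rows:
            score = row.get(metric)
            if not _is_valid(score):
                continue
            w = resolve_weight(doc_weights, row.get("doc_name", ""))
            total += w * score
            weight_sum += w
        means[metric] = total / weight_sum if weight_sum > 0 else None
    return means


# One question: answer_relevancy is missing, so only two metrics count.
score = compute_weighted_score(
    {"faithfulness": 0.9, "context_recall": 0.5, "answer_relevancy": None},
    {"faithfulness": 0.35, "context_recall": 0.25},
)

# Two questions: the one from a.pdf counts twice as much as the one from b.pdf.
rows = [
    {"doc_name": "a.pdf", "faithfulness": 1.0},
    {"doc_name": "b.pdf", "faithfulness": 0.4},
]
means = weighted_metric_means(rows, ["faithfulness"], {"a.pdf": 2.0})
```

In the first call the score is (0.35 × 0.9 + 0.25 × 0.5) / (0.35 + 0.25); in the second the mean is (2 × 1.0 + 1 × 0.4) / 3.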

### 3.2 Evaluator (`rag_eval/execution/evaluator.py`)

`_merge_score()` adds two columns:
```python
record["weighted_score"] = compute_weighted_score(
    score.metrics, self.scenario.metric_weights
)
record["sample_weight"] = self.scenario.doc_weights.get(
    sample.metadata.get("doc_name", ""), 1.0
)
```

`scores.csv` gains two new columns: `weighted_score` and `sample_weight`.
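
As a rough illustration of the resulting rows, the sketch below attaches `sample_weight` to a couple of records and serializes them with the standard `csv` module. The `question_id` column, the record layout, and the precomputed `weighted_score` values are hypothetical; the real evaluator builds its records from the scenario and sample objects.

```python
import csv
import io

# Hypothetical inputs; in the evaluator these come from self.scenario and the sample.
doc_weights = {"322_双源CT成像技术.pdf": 2.0}
records = [
    {"question_id": "q1", "doc_name": "322_双源CT成像技术.pdf",
     "faithfulness": 0.9, "weighted_score": 0.9},
    {"question_id": "q2", "doc_name": "",
     "faithfulness": 0.6, "weighted_score": 0.6},
]

for record in records:
    # Unlisted or missing document names fall back to weight 1.0.
    record["sample_weight"] = doc_weights.get(record["doc_name"], 1.0)

# Write the rows to an in-memory buffer instead of a real scores.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```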

### 3.3 Report summary (`rag_eval/reporting/summary.py`)

`build_summary_markdown()` switches to `weighted_metric_means()` for the per-metric means
and adds a row for the overall `weighted_score` mean (the values below are illustrative):

```
## Metric Means (weighted)
- faithfulness:     0.8123  (w=0.35)
- context_recall:   0.7654  (w=0.25)
- context_precision: 0.7200  (w=0.20)
- answer_relevancy: 0.7400  (w=0.20)
- **weighted_score: 0.7789**
```
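
The bullet lines of that section could be rendered roughly as follows. This is only a sketch: `render_metric_lines` is a hypothetical helper, not part of `build_summary_markdown()`, and printing `n/a` for a missing mean, showing `w=1.00` for metrics without a configured weight, and skipping column alignment are all assumptions.

```python
from __future__ import annotations


def render_metric_lines(
    means: dict[str, float | None],
    metric_weights: dict[str, float],
    weighted_score_mean: float | None,
) -> list[str]:
    """Hypothetical helper: format per-metric means plus the overall score."""
    lines = []
    for metric, mean in means.items():
        value = "n/a" if mean is None else f"{mean:.4f}"
        weight = metric_weights.get(metric, 1.0)  # unlisted metrics weigh 1.0
        lines.append(f"- {metric}: {value}  (w={weight:.2f})")
    overall = "n/a" if weighted_score_mean is None else f"{weighted_score_mean:.4f}"
    lines.append(f"- **weighted_score: {overall}**")
    return lines


lines = render_metric_lines(
    {"faithfulness": 0.8123, "context_recall": None},
    {"faithfulness": 0.35},
    0.7789,
)
```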

---

## 4. yaml_patcher extension (`webapp/services/yaml_patcher.py`)

The signature of `apply_profiles_to_scenario()` is extended with new optional parameters:

```python
def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    metric_weights: dict[str, float] | None = None,   # new
    doc_weights: dict[str, float] | None = None,       # new
    _resolve_absolute: bool = False,
) -> list[str]:
```

- When `metric_weights` is not None, it is written to `data["metric_weights"]` and `"metric_weights"` is appended to the patched list
- When `doc_weights` is not None, it is written to `data["doc_weights"]` and `"doc_weights"` is appended to the patched list
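
The two rules above can be sketched on an already-parsed YAML mapping. `patch_weights` is a hypothetical helper that isolates only the weight handling; the real function also applies the LLM profiles and handles file I/O.

```python
from __future__ import annotations


def patch_weights(
    data: dict,
    metric_weights: dict[str, float] | None = None,
    doc_weights: dict[str, float] | None = None,
) -> list[str]:
    """Hypothetical helper: write weight overrides into the parsed scenario."""
    patched: list[str] = []
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")
    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")
    return patched


# Only metric_weights is overridden; doc_weights stays as the YAML had it.
data = {"name": "ct", "doc_weights": {"a.pdf": 2.0}}
patched = patch_weights(data, metric_weights={"faithfulness": 0.35})
```

Because the check is `is not None` rather than truthiness, passing an empty dict still overwrites the field, which resets it to equal weighting, while passing `None` leaves the YAML value untouched.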

---

## 5. Webapp Model and API Extensions

### 5.1 `webapp/models.py`

New fields on `ProfileApplyRequest`:
```python
metric_weights: dict[str, float] | None = None
doc_weights:    dict[str, float] | None = None
```

`ProfileApplyResponse` is unchanged (`patched_fields` already carries the new field names).

### 5.2 `webapp/api/llm_profiles.py` — `apply_profiles()`

Passes `metric_weights` / `doc_weights` through to `apply_profiles_to_scenario()`.

---

## 6. Frontend: Weight Configuration Panel

### 6.1 HTML (`index.html`)

Add `#weight-config-panel` below `#llm-assignment-panel` (shown once a scenario is selected):

```
┌───────────────────────────────────────────────────────────────┐
│ Weight config (optional; leave blank to use scenario values)  │
├───────────────────────────────────────────────────────────────┤
│ Metric weights                                                │
│  faithfulness        [____1.0____]                            │
│  context_recall      [____1.0____]                            │
│  ... (generated dynamically from the scenario's metrics)      │
│                                                               │
│ Document weights (doc_weights)                                │
│  [doc name_____________] [weight__] [+] [✕]                   │
│  [doc name_____________] [weight__] [+] [✕]                   │
│  + Add document weight rule                                   │
└───────────────────────────────────────────────────────────────┘
```

### 6.2 `runner.js`

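- After selection, `renderScenarioItem()` calls `Runner._renderWeightPanel(sc)` to generate the metric rows dynamically
- `_applyProfilesIfNeeded()` also reads the weight inputs and appends them to the `apply` request body
- `Runner._collectWeights()` collects `metric_weights` / `doc_weights`; when every weight is 1.0, nothing is sent (the fields are skipped)

### 6.3 CSS (`app.css`)

Add `.weight-config-panel`, `.weight-row`, and `.weight-input` styles, consistent with the existing `.llm-role-row` style.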

---

## 7. Report Display (`webapp/services/report_builder.py`)

- `RunSummary.metric_means` is computed with `weighted_metric_means()` (this requires reading `doc_weights` / `metric_weights` from `scenario.snapshot.yaml`)
- `RunSummary` gains a `weighted_score_mean: float | None` field
- The metric card area in the frontend `report.js` gains a "Composite weighted score" card that uses the `good/warn/bad` color scheme

---

## 8. Test Plan

| Test file | Coverage |
|-----------|----------|
| `tests/test_weights.py` | Pure functions `compute_weighted_score` / `weighted_metric_means`, including NaN edge cases, empty weights, and all-NaN input |
| `tests/test_dataset_build.py` | No changes (well isolated) |
| `tests/test_offline_eval.py` | Assertions on the new `weighted_score` / `sample_weight` columns from `_merge_score` |
| `tests/webapp/test_llm_profiles_api.py` | Patching tests for `apply_profiles` with `metric_weights` / `doc_weights` |

---

## 9. List of Changed Files

| File | Change type |
|------|-------------|
| `rag_eval/config/schema.py` | New fields |
| `rag_eval/shared/models.py` | New fields |
| `rag_eval/config/loader.py` | Pass the new fields through to Scenario |
| `rag_eval/metrics/weights.py` | **New file** |
| `rag_eval/execution/evaluator.py` | `_merge_score` adds two columns |
| `rag_eval/reporting/summary.py` | Switch to weighted means |
| `webapp/services/yaml_patcher.py` | New `metric_weights` / `doc_weights` parameters |
| `webapp/models.py` | New fields on `ProfileApplyRequest`; `RunSummary` gains `weighted_score_mean` |
| `webapp/api/llm_profiles.py` | Pass the new parameters through |
| `webapp/services/report_builder.py` | Weighted mean computation |
| `webapp/static/index.html` | New weight configuration panel |
| `webapp/static/js/runner.js` | Weight panel logic |
| `webapp/static/css/app.css` | New weight panel styles |
| `tests/test_weights.py` | **New file** |

---

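## 10. Backward Compatibility Guarantees

- `metric_weights: {}` + `doc_weights: {}` → all weights = 1.0, so behavior is identical to the current implementation
- Existing scenario YAML files lack both fields → Pydantic `default_factory=dict` fills in empty dicts
- The two new `scores.csv` columns do not affect existing report reading (`run_reader` reads only known columns)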
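
To illustrate the last point, here is a small sketch of a reader that selects columns by name, which is one way the "reads only known columns" behavior could look. `read_known_columns`, `KNOWN_COLUMNS`, and the `question_id` column are hypothetical; the actual `run_reader` code is not shown in this spec.

```python
import csv
import io

# Hypothetical scores.csv content after the change.
NEW_CSV = (
    "question_id,faithfulness,weighted_score,sample_weight\n"
    "q1,0.9,0.9,2.0\n"
    "q2,0.6,0.6,1.0\n"
)

# Columns an older reader already expects (illustrative).
KNOWN_COLUMNS = ("question_id", "faithfulness")


def read_known_columns(text: str) -> list[dict[str, str]]:
    """Keep only the known columns of each row and ignore any extras."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: row[col] for col in KNOWN_COLUMNS} for row in reader]


rows = read_known_columns(NEW_CSV)
```

A reader that indexes columns by position instead of by name would not have this property, so the guarantee depends on name-based access.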