docs: add metric and doc weights feature design spec
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
docs/superpowers/specs/2026-06-18-metric-doc-weights-design.md
@@ -0,0 +1,240 @@
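
# Metric Weights & Document Fragment Weights: Feature Design

**Date**: 2026-06-18
**Status**: Approved, pending implementation
**Scope**: When an evaluation is run from the "New Evaluation" (新建评估) page, allow weights to be configured for RAGAS metrics and for documents, compute a weighted composite score, and show it in the report.

---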
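
## 1. Goals

1. **Metric weights**: allow a floating-point weight to be configured per RAGAS metric (e.g. `faithfulness: 0.35`) and compute a weighted composite score, `weighted_score`, for each question.
2. **Document weights (doc weights)**: allow a weight to be configured per PDF document name (e.g. `"322_双源CT成像技术.pdf": 2.0`); questions drawn from that document contribute proportionally more when the per-metric means are aggregated.
3. **Frontend override**: once a scenario is selected on the "New Evaluation" page, show an editable weight panel so that the weights in the YAML can be overridden temporarily before the run.
4. **Full backward compatibility**: both fields are optional. Omitting them falls back to equal weighting, and existing scenario YAML files need no changes.

---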

## 2. Data model

### 2.1 Scenario YAML (new optional fields)

```yaml
# Optional. When omitted, every metric weight is 1.0.
metric_weights:
  faithfulness: 0.35
  context_recall: 0.25
  context_precision: 0.20
  answer_relevancy: 0.20

# Optional. When omitted, every document weight is 1.0.
doc_weights:
  "322_双源CT成像技术.pdf": 2.0
  "323_单源CT对比.pdf": 1.5
```

### 2.2 Pydantic schema (`rag_eval/config/schema.py`)

New fields on `ScenarioModel`:

```python
# Both default to an empty dict, which means "every weight is 1.0".
metric_weights: dict[str, float] = Field(default_factory=dict)
doc_weights: dict[str, float] = Field(default_factory=dict)
```

`ConfigDict(extra="ignore")` is unchanged, so the new fields do not affect how existing YAML files load.

### 2.3 Internal `Scenario` dataclass (`rag_eval/shared/models.py`)

New fields on `Scenario`:

```python
# Mirrors the schema fields; empty dicts keep the equal-weight behavior.
metric_weights: dict[str, float] = field(default_factory=dict)
doc_weights: dict[str, float] = field(default_factory=dict)
```

Both fields are serialized by `scenario.snapshot()` so that `run_reader` and the reporting layer can read them back.

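
To make the default behavior concrete, here is a minimal sketch of how the two fields might sit on the dataclass. The trimmed-down `Scenario` below, its `name` and `metrics` fields, and the `asdict`-based `snapshot()` are assumptions for illustration; the real class in `rag_eval/shared/models.py` is not shown in this spec. A scenario built without weights snapshots two empty dicts, which downstream code treats as "all weights 1.0".

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Scenario:
    """Hypothetical, trimmed-down stand-in for the real Scenario dataclass."""

    name: str
    metrics: list[str] = field(default_factory=list)
    # New optional fields; an empty dict means "every weight is 1.0".
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)

    def snapshot(self) -> dict:
        # Assumed behavior: dump every field so report code can read weights back.
        return asdict(self)


# A legacy scenario that never mentions weights.
legacy = Scenario(name="legacy", metrics=["faithfulness"])

# A scenario that sets both kinds of weight.
weighted = Scenario(
    name="weighted",
    metrics=["faithfulness"],
    metric_weights={"faithfulness": 0.35},
    doc_weights={"322_双源CT成像技术.pdf": 2.0},
)

print(legacy.snapshot()["metric_weights"])   # {}
print(weighted.snapshot()["doc_weights"])    # {'322_双源CT成像技术.pdf': 2.0}
```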

---

## 3. Backend: weight computation logic

### 3.1 New module `rag_eval/metrics/weights.py`

A module of pure functions with no external dependencies, testable in isolation:

```python
def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for key, or default when the key is missing."""


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """
    Return the weighted composite score for the given metric scores and weights.

    - NaN / None values are ignored
    - An empty metric_weights falls back to the equal-weight mean
    - Returns None when every score is NaN
    Formula: Σ(w_i * s_i) / Σ(w_i), summed over non-NaN terms only
    """


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """
    Compute the document-weighted mean of each metric.

    - sample_weight = doc_weights.get(row["doc_name"], 1.0)
    - Formula: Σ(sample_weight_j * score_m_j) / Σ(sample_weight_j)
    - An empty doc_weights falls back to the plain arithmetic mean
    """
```
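
The stubs above only fix the signatures and contracts. One possible implementation is sketched below; it is not the project's actual code. It rests on a few assumptions: `_is_missing` is a hypothetical helper, weights are non-negative, a zero weight sum yields `None`, and `weighted_metric_means` skips missing scores the same way `compute_weighted_score` does (the spec states this only for the latter).

```python
import math
from typing import Optional


def _is_missing(value: Optional[float]) -> bool:
    """Hypothetical helper: True for None and NaN, the two 'no score' markers."""
    return value is None or math.isnan(value)


def resolve_weight(weights: dict, key: str, default: float = 1.0) -> float:
    """Return the weight for key, or default when the key is missing."""
    return weights.get(key, default)


def compute_weighted_score(scores: dict, metric_weights: dict) -> Optional[float]:
    """Σ(w_i * s_i) / Σ(w_i) over the metrics that have a usable score."""
    total = 0.0
    weight_sum = 0.0
    for name, value in scores.items():
        if _is_missing(value):
            continue  # missing metrics drop out of numerator and denominator
        weight = resolve_weight(metric_weights, name)
        total += weight * value
        weight_sum += weight
    # Assumption: no usable score (or an all-zero weight sum) yields None.
    return total / weight_sum if weight_sum > 0 else None


def weighted_metric_means(score_rows: list, metrics: list, doc_weights: dict) -> dict:
    """Per-metric mean where each row counts sample_weight times."""
    means = {}
    for metric in metrics:
        total = 0.0
        weight_sum = 0.0
        for row in score_rows:
            value = row.get(metric)
            if _is_missing(value):
                continue  # assumption: rows without this metric are skipped
            weight = resolve_weight(doc_weights, row.get("doc_name", ""))
            total += weight * value
            weight_sum += weight
        means[metric] = total / weight_sum if weight_sum > 0 else None
    return means


# One question: faithfulness counts three times as much as context_recall.
print(compute_weighted_score(
    {"faithfulness": 1.0, "context_recall": 0.0},
    {"faithfulness": 3.0, "context_recall": 1.0},
))  # 0.75

# Two questions: the one from a.pdf counts three times as much.
rows = [
    {"doc_name": "a.pdf", "faithfulness": 1.0},
    {"doc_name": "b.pdf", "faithfulness": 0.0},
]
print(weighted_metric_means(rows, ["faithfulness"], {"a.pdf": 3.0}))
# {'faithfulness': 0.75}
```

Dividing by the weights of the usable scores only means a question with one missing metric is scored on the remaining metrics, renormalized, rather than being pulled toward zero.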

### 3.2 Evaluator (`rag_eval/execution/evaluator.py`)

`_merge_score()` adds two columns:

```python
# Per-question composite score across metrics.
record["weighted_score"] = compute_weighted_score(
    score.metrics, self.scenario.metric_weights
)
# Weight of the source document; 1.0 when the document has no rule.
record["sample_weight"] = self.scenario.doc_weights.get(
    sample.metadata.get("doc_name", ""), 1.0
)
```

`scores.csv` gains two new columns, `weighted_score` and `sample_weight`.

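
The `sample_weight` lookup has two fallbacks worth spelling out: a sample with no `doc_name` in its metadata and a document with no rule both get 1.0. The sketch below isolates that lookup; `sample_weight_for` is a hypothetical helper name, and the plain-dict metadata is an assumption about the sample object.

```python
def sample_weight_for(metadata: dict, doc_weights: dict) -> float:
    """Look up a sample's document weight; unknown or missing names get 1.0."""
    return doc_weights.get(metadata.get("doc_name", ""), 1.0)


doc_weights = {"322_双源CT成像技术.pdf": 2.0}

print(sample_weight_for({"doc_name": "322_双源CT成像技术.pdf"}, doc_weights))  # 2.0
print(sample_weight_for({"doc_name": "other.pdf"}, doc_weights))              # 1.0
print(sample_weight_for({}, doc_weights))                                     # 1.0
```

The match is an exact string comparison on the document name, so a rule only applies when its key is spelled exactly as the name stored in the sample metadata.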

### 3.3 Report summary (`rag_eval/reporting/summary.py`)

`build_summary_markdown()` switches to `weighted_metric_means()` for the per-metric means and adds a row for the overall `weighted_score` mean:

```
## Metric Means (weighted)
- faithfulness: 0.8123 (w=0.35)
- context_recall: 0.7654 (w=0.25)
- context_precision: 0.7200 (w=0.20)
- answer_relevancy: 0.7400 (w=0.20)
- **weighted_score: 0.7789**
```

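
The figures in the sample output are illustrative. As a sanity check, combining the four listed metric means with their weights gives about 0.7677 rather than 0.7789. That is not necessarily a contradiction. Assuming the `weighted_score` row is the mean of the per-question scores (the spec does not pin this down), it can differ from the combination of metric means whenever a question has a missing metric and its weights are renormalized. A minimal sketch of the check:

```python
# Metric means and weights copied from the sample report output.
metric_means = {
    "faithfulness": 0.8123,
    "context_recall": 0.7654,
    "context_precision": 0.7200,
    "answer_relevancy": 0.7400,
}
metric_weights = {
    "faithfulness": 0.35,
    "context_recall": 0.25,
    "context_precision": 0.20,
    "answer_relevancy": 0.20,
}

# Σ(w_i * mean_i) / Σ(w_i), the same formula applied to the aggregated means.
weight_sum = sum(metric_weights.values())
combined = sum(metric_weights[m] * metric_means[m] for m in metric_means) / weight_sum

print(f"{combined:.4f}")  # 0.7677
```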

---

## 4. yaml_patcher extension (`webapp/services/yaml_patcher.py`)

The signature of `apply_profiles_to_scenario()` is extended with two optional parameters:

```python
def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    metric_weights: dict[str, float] | None = None,  # new
    doc_weights: dict[str, float] | None = None,  # new
    _resolve_absolute: bool = False,
) -> list[str]:
```

- When `metric_weights` is not `None`, it is written to `data["metric_weights"]` and `"metric_weights"` is appended to the patched list
- When `doc_weights` is not `None`, it is written to `data["doc_weights"]` and `"doc_weights"` is appended to the patched list

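
The two rules above can be sketched in isolation on an already-parsed YAML mapping. `patch_weights` is a hypothetical helper, not part of the real `yaml_patcher`, and it leaves out the profile handling and file I/O. Note that the test is `is not None`, so under this reading an explicit empty dict still overwrites the field and resets it to equal weights, while `None` leaves the YAML value untouched.

```python
from typing import Optional


def patch_weights(
    data: dict,
    metric_weights: Optional[dict] = None,
    doc_weights: Optional[dict] = None,
) -> list:
    """Write the given weights into the parsed scenario and report what changed."""
    patched = []
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)  # copy to avoid aliasing
        patched.append("metric_weights")
    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")
    return patched


# Parsed scenario YAML (illustrative content only).
data = {"name": "demo", "metric_weights": {"faithfulness": 0.35}}

patched = patch_weights(data, metric_weights={"faithfulness": 0.5})
print(patched)                    # ['metric_weights']
print(data["metric_weights"])     # {'faithfulness': 0.5}
print("doc_weights" in data)      # False
```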

---

## 5. Webapp model and API extensions

### 5.1 `webapp/models.py`

New fields on `ProfileApplyRequest`:

```python
# None means "do not override the weights in the scenario YAML".
metric_weights: dict[str, float] | None = None
doc_weights: dict[str, float] | None = None
```

`ProfileApplyResponse` is unchanged (`patched_fields` already carries the new field names).

### 5.2 `webapp/api/llm_profiles.py`: `apply_profiles()`

Passes `metric_weights` / `doc_weights` through to `apply_profiles_to_scenario()`.

---

## 6. Frontend: weight configuration panel

### 6.1 HTML (`index.html`)

Add `#weight-config-panel` below `#llm-assignment-panel` (shown once a scenario is selected):

```
┌─────────────────────────────────────────────┐
│ Weight configuration (optional; leave blank │
│ to use the scenario's original settings)    │
├─────────────────────────────────────────────┤
│ Metric weights                              │
│   faithfulness        [____1.0____]         │
│   context_recall      [____1.0____]         │
│   ... (generated from the scenario metrics) │
│                                             │
│ Document weights (doc_weights)              │
│   [doc name__________] [weight__] [+] [✕]   │
│   [doc name__________] [weight__] [+] [✕]   │
│   + Add document weight rule                │
└─────────────────────────────────────────────┘
```

### 6.2 `runner.js`

- `renderScenarioItem()` calls `Runner._renderWeightPanel(sc)` after selection to generate the metric rows dynamically
- `_applyProfilesIfNeeded()` also reads the weight inputs and appends them to the `apply` request body
- `Runner._collectWeights()` collects `metric_weights` / `doc_weights` and skips sending them when every weight is 1.0

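
The real logic belongs in `runner.js`; the skip rule is sketched here in Python only to match the other examples in this spec. `collect_weights` and its input shapes are hypothetical. The sketch assumes the "all 1.0" check is applied to each field separately and that rows with a blank document name are dropped.

```python
def collect_weights(metric_inputs: dict, doc_rows: list) -> dict:
    """Build the weight part of the apply request body from the panel inputs."""
    body = {}
    # Omit metric_weights when every input still holds the default 1.0.
    if any(weight != 1.0 for weight in metric_inputs.values()):
        body["metric_weights"] = dict(metric_inputs)
    # Drop rows whose document name was left blank.
    doc_weights = {name.strip(): weight for name, weight in doc_rows if name.strip()}
    # Omit doc_weights when no remaining rule differs from 1.0.
    if any(weight != 1.0 for weight in doc_weights.values()):
        body["doc_weights"] = doc_weights
    return body


# Untouched panel: nothing is sent, so the scenario YAML stays in charge.
print(collect_weights({"faithfulness": 1.0, "context_recall": 1.0}, []))  # {}

# Edited panel: both fields are sent; the blank row is ignored.
print(collect_weights(
    {"faithfulness": 0.35, "context_recall": 1.0},
    [("a.pdf", 2.0), ("", 3.0)],
))
```

One consequence of skipping all-1.0 input is that the panel cannot be used to reset YAML weights back to equal weighting; whether that matters depends on how the panel is pre-filled, which this spec leaves open.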

### 6.3 CSS (`app.css`)

Add `.weight-config-panel`, `.weight-row`, and `.weight-input` styles, consistent with the existing `.llm-role-row` style.

---

## 7. Report display (`webapp/services/report_builder.py`)

- `RunSummary.metric_means` is computed with `weighted_metric_means()` (this requires reading `doc_weights` / `metric_weights` from `scenario.snapshot.yaml`)
- `RunSummary` gains a `weighted_score_mean: float | None` field
- The metric card area in the frontend `report.js` gains a "weighted composite score" card that uses the `good/warn/bad` color scheme

---

## 8. Test plan

| Test file | Coverage |
|-----------|----------|
| `tests/test_weights.py` | Pure functions `compute_weighted_score` / `weighted_metric_means`, including NaN edge cases, empty weights, and all-NaN input |
| `tests/test_dataset_build.py` | No changes (well isolated) |
| `tests/test_offline_eval.py` | Assertions for the new `weighted_score` / `sample_weight` columns added by `_merge_score` |
| `tests/webapp/test_llm_profiles_api.py` | Patching tests for `apply_profiles` with `metric_weights` / `doc_weights` |

---

## 9. Changed files

| File | Change |
|------|--------|
| `rag_eval/config/schema.py` | New fields |
| `rag_eval/shared/models.py` | New fields |
| `rag_eval/config/loader.py` | Pass the new fields through to `Scenario` |
| `rag_eval/metrics/weights.py` | **New file** |
| `rag_eval/execution/evaluator.py` | Two new columns in `_merge_score` |
| `rag_eval/reporting/summary.py` | Switch to weighted means |
| `webapp/services/yaml_patcher.py` | New `metric_weights` / `doc_weights` parameters |
| `webapp/models.py` | New fields on `ProfileApplyRequest`; `weighted_score_mean` on `RunSummary` |
| `webapp/api/llm_profiles.py` | Pass the new parameters through |
| `webapp/services/report_builder.py` | Weighted mean computation |
| `webapp/static/index.html` | New weight configuration panel |
| `webapp/static/js/runner.js` | Weight panel logic |
| `webapp/static/css/app.css` | New weight panel styles |
| `tests/test_weights.py` | **New file** |
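
---

## 10. Backward compatibility guarantees

- `metric_weights: {}` + `doc_weights: {}` → every weight is 1.0, and behavior is identical to the current behavior
- Existing scenario YAML files lack both fields → Pydantic's `default_factory=dict` fills in empty dicts
- The two new columns in `scores.csv` do not affect existing report reading (`run_reader` only reads the columns it knows)
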
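
The first guarantee follows directly from the formula: with every weight equal to 1.0, Σ(w·s) / Σ(w) reduces to the plain arithmetic mean. A minimal numeric check, using made-up scores:

```python
import math
from statistics import fmean

# Illustrative per-question scores for one metric.
scores = [0.9, 0.6, 0.75]

# Empty weight dicts resolve every weight to the default of 1.0.
weights = [1.0] * len(scores)

weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# The weighted mean and the ordinary mean coincide.
print(math.isclose(weighted, fmean(scores)))  # True
```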