siemens_ragas/docs/superpowers/plans/2026-06-18-metric-doc-weights.md

# Metric Weights & Document Chunk Weights Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

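**Goal:** Support two optional fields, `metric_weights` and `doc_weights`, in the scenario YAML; compute a weighted composite score and display it on the report page and in the weight configuration panel of the "New Evaluation" page.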

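**Architecture:** A new pure-function module, `rag_eval/metrics/weights.py`, holds all the computation logic; the evaluator writes two new columns (`weighted_score`, `sample_weight`) to `scores.csv`; `yaml_patcher` is extended to write the weight fields; the frontend dynamically renders a weight panel below the LLM role configuration panel, and the report page gains a "Weighted Composite Score" card.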

**Tech Stack:** Python 3.12, Pydantic v2, FastAPI, Vanilla JS (no framework), pytest

## Global Constraints

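- Python 3.12+, PEP 8, 4-space indentation, type annotations required
- All new fields are optional; when they are omitted, behavior is identical to the current behavior (backward compatible)
- Tests use pytest and do not depend on a real LLM or network access
- JS introduces no new dependencies (native DOM API only)
- Weights need not be pre-normalized; the computation normalizes internally as `w / Σw`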
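
The last constraint means that only the ratio between weights matters. A minimal sketch, using a hypothetical `weighted_mean` helper rather than the plan's own functions, illustrates that scaling every weight by the same factor leaves the result unchanged:

```python
def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Weighted mean with internal normalisation (each weight divided by the sum)."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


scores = [1.0, 0.0]
raw = weighted_mean(scores, [3.0, 1.0])           # weights as they might appear in YAML
normalised = weighted_mean(scores, [0.75, 0.25])  # same ratio, already summing to 1
print(raw, normalised)
```

Both calls yield the same value, so a scenario author can write `3` and `1` or `0.75` and `0.25` interchangeably.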

---

## File List

| Action | File | Responsibility |
|------|------|------|
| Create | `rag_eval/metrics/weights.py` | Pure functions for weight computation |
| Create | `tests/test_weights.py` | Unit tests for the weights module |
| Modify | `rag_eval/config/schema.py` | Add two fields to ScenarioModel |
| Modify | `rag_eval/shared/models.py` | Add two fields to the Scenario dataclass |
| Modify | `rag_eval/config/loader.py` | load_scenario passes the new fields through |
| Modify | `rag_eval/execution/evaluator.py` | `_merge_score` adds two columns |
| Modify | `rag_eval/reporting/summary.py` | Switch to weighted means; add a weighted_score row |
| Modify | `webapp/services/yaml_patcher.py` | Add metric_weights/doc_weights parameters |
| Modify | `webapp/models.py` | ProfileApplyRequest gains new fields; ReportData gains weighted_score_mean |
| Modify | `webapp/api/llm_profiles.py` | apply_profiles passes the new parameters through |
| Modify | `webapp/services/report_builder.py` | Read weights; compute weighted means and weighted_score_mean |
| Modify | `webapp/services/run_reader.py` | Read metric_weights/doc_weights from snapshot.yaml |
| Modify | `webapp/static/index.html` | Add the weight configuration panel HTML |
| Modify | `webapp/static/js/runner.js` | Weight panel logic; send weights on apply |
| Modify | `webapp/static/css/app.css` | Weight panel styles |
| Modify | `webapp/static/js/report.js` | Render the weighted_score card in renderMetricCards |
| Modify | `webapp/api/scenarios.py` | ScenarioInfo gains metric_weights/doc_weights fields |
| Modify | `webapp/services/scenario_scanner.py` | Read the weight fields during scanning |
| Modify | `tests/test_offline_eval.py` | Assert that scores.csv contains weighted_score/sample_weight |
| Modify | `tests/webapp/test_llm_profiles_api.py` | Tests for apply_profiles writing weights |

---

## Task 1: Create the core weight computation module (TDD)

**Files:**
- Create: `rag_eval/metrics/weights.py`
- Create: `tests/test_weights.py`

**Interfaces:**
- Produces:
  - `resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float`
  - `compute_weighted_score(scores: dict[str, float | None], metric_weights: dict[str, float]) -> float | None`
  - `weighted_metric_means(score_rows: list[dict], metrics: list[str], doc_weights: dict[str, float]) -> dict[str, float | None]`
  - `compute_overall_weighted_score_mean(score_rows: list[dict], metric_weights: dict[str, float], doc_weights: dict[str, float]) -> float | None`

- [ ] **Step 1: Write failing tests**

Create `tests/test_weights.py`:

```python
"""Unit tests for rag_eval/metrics/weights.py"""
import math
import pytest
from rag_eval.metrics.weights import (
    resolve_weight,
    compute_weighted_score,
    weighted_metric_means,
    compute_overall_weighted_score_mean,
)


class TestResolveWeight:
    def test_returns_value_when_key_present(self):
        assert resolve_weight({"faith": 0.5}, "faith") == 0.5

    def test_returns_default_when_key_missing(self):
        assert resolve_weight({}, "faith") == 1.0

    def test_returns_custom_default_when_key_missing(self):
        assert resolve_weight({}, "faith", default=2.0) == 2.0

    def test_empty_dict_returns_default(self):
        assert resolve_weight({}, "anything") == 1.0


class TestComputeWeightedScore:
    def test_equal_weights_is_simple_mean(self):
        scores = {"faithfulness": 0.8, "context_recall": 0.6}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.7, rel=1e-4)

    def test_explicit_weights(self):
        scores = {"faithfulness": 1.0, "context_recall": 0.0}
        weights = {"faithfulness": 3.0, "context_recall": 1.0}
        # (3*1.0 + 1*0.0) / (3+1) = 0.75
        result = compute_weighted_score(scores, weights)
        assert result == pytest.approx(0.75, rel=1e-4)

    def test_nan_values_excluded(self):
        scores = {"faithfulness": float("nan"), "context_recall": 0.8}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.8, rel=1e-4)

    def test_none_values_excluded(self):
        scores = {"faithfulness": None, "context_recall": 0.6}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.6, rel=1e-4)

    def test_all_nan_returns_none(self):
        scores = {"faithfulness": float("nan"), "context_recall": float("nan")}
        assert compute_weighted_score(scores, {}) is None

    def test_empty_scores_returns_none(self):
        assert compute_weighted_score({}, {}) is None

    def test_missing_metric_in_weights_uses_default_1(self):
        scores = {"faithfulness": 0.8, "context_recall": 0.4}
        weights = {"faithfulness": 2.0}  # context_recall defaults to 1.0
        # (2*0.8 + 1*0.4) / (2+1) = 2.0/3 ≈ 0.6667
        result = compute_weighted_score(scores, weights)
        assert result == pytest.approx(2.0 / 3, rel=1e-4)


class TestWeightedMetricMeans:
    def _rows(self):
        return [
            {"doc_name": "a.pdf", "faithfulness": 1.0, "context_recall": 0.5},
            {"doc_name": "b.pdf", "faithfulness": 0.6, "context_recall": 0.8},
        ]

    def test_equal_weights_gives_arithmetic_mean(self):
        rows = self._rows()
        result = weighted_metric_means(rows, ["faithfulness", "context_recall"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["context_recall"] == pytest.approx(0.65, rel=1e-4)

    def test_doc_weight_amplifies_contribution(self):
        rows = self._rows()
        # doc a.pdf gets weight 3, b.pdf gets 1
        doc_weights = {"a.pdf": 3.0, "b.pdf": 1.0}
        result = weighted_metric_means(rows, ["faithfulness"], doc_weights)
        # (3*1.0 + 1*0.6) / (3+1) = 3.6/4 = 0.9
        assert result["faithfulness"] == pytest.approx(0.9, rel=1e-4)

    def test_nan_rows_skipped_per_metric(self):
        rows = [
            {"doc_name": "a.pdf", "faithfulness": float("nan"), "context_recall": 0.5},
            {"doc_name": "b.pdf", "faithfulness": 0.8, "context_recall": 0.9},
        ]
        result = weighted_metric_means(rows, ["faithfulness", "context_recall"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["context_recall"] == pytest.approx(0.7, rel=1e-4)

    def test_missing_metric_column_returns_none(self):
        rows = [{"doc_name": "a.pdf", "faithfulness": 0.8}]
        result = weighted_metric_means(rows, ["faithfulness", "unknown_metric"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["unknown_metric"] is None

    def test_empty_rows_returns_none_for_all(self):
        result = weighted_metric_means([], ["faithfulness"], {})
        assert result["faithfulness"] is None


class TestComputeOverallWeightedScoreMean:
    def test_basic_weighted_mean_of_weighted_scores(self):
        rows = [
            {"doc_name": "a.pdf", "faithfulness": 1.0, "context_recall": 0.0},
            {"doc_name": "b.pdf", "faithfulness": 0.5, "context_recall": 0.5},
        ]
        metric_weights = {"faithfulness": 1.0, "context_recall": 1.0}
        result = compute_overall_weighted_score_mean(rows, metric_weights, {})
        # sample 1 ws = 0.5, sample 2 ws = 0.5 → mean = 0.5
        assert result == pytest.approx(0.5, rel=1e-4)

    def test_doc_weight_amplifies_sample(self):
        rows = [
            {"doc_name": "important.pdf", "faithfulness": 1.0},
            {"doc_name": "other.pdf", "faithfulness": 0.0},
        ]
        doc_weights = {"important.pdf": 9.0, "other.pdf": 1.0}
        result = compute_overall_weighted_score_mean(rows, {}, doc_weights)
        # ws_1=1.0 w=9, ws_2=0.0 w=1 → (9*1 + 1*0)/(9+1) = 0.9
        assert result == pytest.approx(0.9, rel=1e-4)

    def test_all_nan_returns_none(self):
        rows = [{"doc_name": "a.pdf", "faithfulness": float("nan")}]
        assert compute_overall_weighted_score_mean(rows, {}, {}) is None
```

- [ ] **Step 2: Run tests to verify they fail**

```
python -m pytest tests/test_weights.py -v
```
Expected: `ModuleNotFoundError: No module named 'rag_eval.metrics.weights'`

- [ ] **Step 3: Implement `rag_eval/metrics/weights.py`**

```python
"""Utility functions for weighted metric aggregation.

All functions are pure (no side effects, no I/O) and operate on plain dicts/lists.
Weights do not need to be pre-normalised — normalisation is done internally.
"""

from __future__ import annotations

import math


def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for *key*, or *default* when absent."""
    return float(weights.get(key, default))


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """Return the weighted mean of valid (non-NaN, non-None) metric scores.

    Args:
        scores: mapping of metric_name -> raw score (may be NaN or None).
        metric_weights: optional per-metric weights; absent keys default to 1.0.

    Returns:
        Weighted mean as a float, or None when no valid score exists.
    """
    total_weight = 0.0
    total_score = 0.0
    for metric, score in scores.items():
        if score is None:
            continue
        try:
            v = float(score)
        except (TypeError, ValueError):
            continue
        if math.isnan(v) or math.isinf(v):
            continue
        w = resolve_weight(metric_weights, metric, default=1.0)
        total_weight += w
        total_score += w * v
    if total_weight == 0.0:
        return None
    return total_score / total_weight


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """Compute per-metric weighted means across all score rows.

    Each row's contribution is scaled by the doc_weight for its ``doc_name``.
    Rows with NaN/None for a given metric are excluded from that metric's mean.

    Args:
        score_rows: list of score record dicts (from scores.csv).
        metrics: ordered list of metric names to aggregate.
        doc_weights: mapping doc_name -> weight multiplier; absent keys default to 1.0.

    Returns:
        Dict mapping metric_name -> weighted mean (or None if no valid data).
    """
    totals: dict[str, float] = {m: 0.0 for m in metrics}
    weights_sum: dict[str, float] = {m: 0.0 for m in metrics}

    for row in score_rows:
        doc_name = str(row.get("doc_name", "") or "")
        sample_w = resolve_weight(doc_weights, doc_name, default=1.0)
        for metric in metrics:
            raw = row.get(metric)
            if raw is None:
                continue
            try:
                v = float(raw)
            except (TypeError, ValueError):
                continue
            if math.isnan(v) or math.isinf(v):
                continue
            totals[metric] += sample_w * v
            weights_sum[metric] += sample_w

    return {
        metric: (totals[metric] / weights_sum[metric] if weights_sum[metric] > 0 else None)
        for metric in metrics
    }


def compute_overall_weighted_score_mean(
    score_rows: list[dict],
    metric_weights: dict[str, float],
    doc_weights: dict[str, float],
) -> float | None:
    """Compute the overall weighted-score mean across all samples.

    For each sample:
      1. Compute per-sample weighted_score via ``compute_weighted_score``.
      2. Scale by the doc weight for that sample's ``doc_name``.
    Then return the weighted mean of all per-sample weighted_scores.

    Args:
        score_rows: list of score record dicts.
        metric_weights: per-metric weight multipliers.
        doc_weights: per-doc weight multipliers.

    Returns:
        Float mean, or None when no sample has a valid weighted_score.
    """
    total_weight = 0.0
    total_score = 0.0
    for row in score_rows:
        # Collect only numeric metric columns (exclude meta-columns)
        metric_scores: dict[str, float | None] = {}
        for k, v in row.items():
            if k in _META_COLUMNS:
                continue
            metric_scores[k] = v  # type: ignore[assignment]

        ws = compute_weighted_score(metric_scores, metric_weights)
        if ws is None:
            continue
        doc_name = str(row.get("doc_name", "") or "")
        sample_w = resolve_weight(doc_weights, doc_name, default=1.0)
        total_weight += sample_w
        total_score += sample_w * ws

    return total_score / total_weight if total_weight > 0 else None


# Columns in scores.csv that are sample metadata, not metric scores.
_META_COLUMNS = frozenset({
    "sample_id", "question", "contexts", "answer", "ground_truth",
    "scenario", "language", "retrieval_config", "error",
    "judge_model", "embedding_model", "run_id",
    "difficulty", "question_type", "doc_id", "doc_name",
    "section_path", "page_start", "page_end",
    "source_chunk_ids", "review_status", "review_notes",
    "weighted_score", "sample_weight",
})
```
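
The metadata filter matters because some metadata columns, such as `page_start`, hold numbers: without `_META_COLUMNS` they would pass the `float()` check and be averaged as if they were scores. A small self-contained sketch of the effect (the `mean_of_columns` helper and the reduced `META` set are illustrative stand-ins, not part of the plan):

```python
from __future__ import annotations

import math

# Reduced stand-in for _META_COLUMNS, just large enough for this example.
META = frozenset({"sample_id", "doc_name", "page_start", "weighted_score", "sample_weight"})


def _valid(value: object) -> float | None:
    """Return value as a finite float, or None if it cannot serve as a score."""
    try:
        v = float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None
    return v if math.isfinite(v) else None


def mean_of_columns(row: dict, skip: frozenset[str]) -> float | None:
    """Unweighted mean of every finite numeric column not listed in skip."""
    vals = [v for k, raw in row.items() if k not in skip and (v := _valid(raw)) is not None]
    return sum(vals) / len(vals) if vals else None


row = {"sample_id": "s1", "doc_name": "a.pdf", "page_start": 12,
       "faithfulness": 0.8, "context_recall": 0.6}

naive = mean_of_columns(row, frozenset())  # page_start is averaged in with the scores
filtered = mean_of_columns(row, META)      # only the two metric scores remain
print(naive, filtered)
```

Any new numeric metadata column added to `scores.csv` later would need to be added to the exclusion set for the same reason.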

- [ ] **Step 4: Run tests to verify they pass**

```
python -m pytest tests/test_weights.py -v
```
Expected: all 19 tests PASS

- [ ] **Step 5: Commit**

```
git add rag_eval/metrics/weights.py tests/test_weights.py
git commit -m "feat: add metric/doc weight computation module (weights.py)"
```

---

## Task 2: Extend the Schema, Dataclass, and Loader

**Files:**
- Modify: `rag_eval/config/schema.py`
- Modify: `rag_eval/shared/models.py`
- Modify: `rag_eval/config/loader.py`

**Interfaces:**
- Consumes: no new dependencies
- Produces:
  - `ScenarioModel.metric_weights: dict[str, float]` (default `{}`)
  - `ScenarioModel.doc_weights: dict[str, float]` (default `{}`)
  - `Scenario.metric_weights: dict[str, float]` (default `{}`)
  - `Scenario.doc_weights: dict[str, float]` (default `{}`)
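
The `{}` defaults above are expressed with `default_factory` because a dataclass rejects a literal `{}` default (it raises `ValueError` at class definition time) and because each instance should own its dict. A minimal sketch with a hypothetical `ScenarioSketch` class that shows only the two new fields:

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioSketch:
    """Hypothetical stand-in for Scenario; the real class has many more fields."""

    scenario_name: str
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)


a = ScenarioSketch("a")
b = ScenarioSketch("b")
a.metric_weights["faithfulness"] = 0.7

# Each instance gets its own dict, so mutating one does not leak into the other.
print(a.metric_weights, b.metric_weights)
```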

- [ ] **Step 1: Write failing test**

Add to `tests/test_offline_eval.py` inside `ScenarioAndDatasetTests`:

```python
def test_load_scenario_metric_and_doc_weights(self):
    """load_scenario passes metric_weights and doc_weights into Scenario."""
    import tempfile, yaml
    from rag_eval.config.loader import load_scenario
    payload = {
        "scenario_name": "w-test", "mode": "offline",
        "dataset": "nonexistent.csv", "judge_model": "m",
        "embedding_model": "e", "metrics": ["faithfulness"],
        "output_dir": "out",
        "metric_weights": {"faithfulness": 0.7},
        "doc_weights": {"doc.pdf": 2.0},
    }
    with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w",
                                     encoding="utf-8", delete=False) as f:
        yaml.dump(payload, f, allow_unicode=True)
        tmp_path = f.name
    scenario = load_scenario(tmp_path)
    assert scenario.metric_weights == {"faithfulness": 0.7}
    assert scenario.doc_weights == {"doc.pdf": 2.0}

def test_load_scenario_defaults_to_empty_weights(self):
    """load_scenario defaults metric_weights and doc_weights to empty dicts."""
    import tempfile, yaml
    from rag_eval.config.loader import load_scenario
    payload = {
        "scenario_name": "no-w", "mode": "offline",
        "dataset": "nonexistent.csv", "judge_model": "m",
        "embedding_model": "e", "metrics": ["faithfulness"],
        "output_dir": "out",
    }
    with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w",
                                     encoding="utf-8", delete=False) as f:
        yaml.dump(payload, f, allow_unicode=True)
        tmp_path = f.name
    scenario = load_scenario(tmp_path)
    assert scenario.metric_weights == {}
    assert scenario.doc_weights == {}
```

- [ ] **Step 2: Run tests to verify they fail**

```
python -m pytest tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_metric_and_doc_weights tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_defaults_to_empty_weights -v
```
Expected: FAIL — `Scenario has no attribute 'metric_weights'`

- [ ] **Step 3: Add fields to `rag_eval/config/schema.py`**

In `ScenarioModel`, add after `optimization_advisor`:
```python
metric_weights: dict[str, float] = Field(default_factory=dict)
doc_weights: dict[str, float] = Field(default_factory=dict)
```

- [ ] **Step 4: Add fields to `rag_eval/shared/models.py`**

In `Scenario` dataclass, add after `optimization_advisor: bool = False`:
```python
metric_weights: dict[str, float] = field(default_factory=dict)
doc_weights: dict[str, float] = field(default_factory=dict)
```

- [ ] **Step 5: Update `rag_eval/config/loader.py`**

In `load_scenario()`, in the `Scenario(...)` constructor call, add after `optimization_advisor=model.optimization_advisor,`:
```python
metric_weights=dict(model.metric_weights),
doc_weights=dict(model.doc_weights),
```

- [ ] **Step 6: Run tests to verify they pass**

```
python -m pytest tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_metric_and_doc_weights tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_defaults_to_empty_weights -v
```
Expected: both PASS

- [ ] **Step 7: Commit**

```
git add rag_eval/config/schema.py rag_eval/shared/models.py rag_eval/config/loader.py tests/test_offline_eval.py
git commit -m "feat: add metric_weights and doc_weights to Scenario schema and dataclass"
```

---

## Task 3: Add two columns in the evaluator's `_merge_score`

**Files:**
- Modify: `rag_eval/execution/evaluator.py`

**Interfaces:**
- Consumes: `compute_weighted_score(scores, metric_weights) -> float | None` from `rag_eval.metrics.weights`
- Produces: two new columns in `scores.csv`: `weighted_score: float | NaN`, `sample_weight: float`
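
Storing both columns lets a downstream reader recompute the overall score from `scores.csv` alone, without access to the scenario's weight maps. A hedged sketch of that post-hoc aggregation (the `overall_from_columns` helper is hypothetical; it assumes empty cells, `None`, and NaN all mean "no valid score"):

```python
from __future__ import annotations

import math


def overall_from_columns(rows: list[dict]) -> float | None:
    """Weighted mean of weighted_score, using sample_weight as the row weight."""
    total_w = 0.0
    total = 0.0
    for row in rows:
        raw = row.get("weighted_score")
        if raw is None or raw == "":
            continue  # sample had no valid metric score
        score = float(raw)
        if math.isnan(score):
            continue
        w_raw = row.get("sample_weight")
        weight = 1.0 if w_raw in (None, "") else float(w_raw)
        total_w += weight
        total += weight * score
    return total / total_w if total_w > 0 else None


rows = [
    {"weighted_score": "0.75", "sample_weight": "2.0"},  # strings, as read from CSV
    {"weighted_score": "0.25", "sample_weight": "1.0"},
    {"weighted_score": "", "sample_weight": "1.0"},      # sample with no valid metric
]
print(overall_from_columns(rows))
```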

- [ ] **Step 1: Write failing test**

Add to `tests/test_offline_eval.py` inside `EvaluatorAndReportingTests`:

```python
def test_merge_score_includes_weighted_score_and_sample_weight(self):
    """_merge_score adds weighted_score and sample_weight columns."""
    from unittest.mock import MagicMock
    from rag_eval.execution.evaluator import Evaluator
    from rag_eval.shared.models import (
        MetricScore, NormalizedSample, RuntimeConfig, Scenario, DatasetConfig,
    )
    from pathlib import Path
    scenario = Scenario(
        scenario_name="w-test", mode="offline",
        dataset=DatasetConfig(path=Path("d.csv")),
        judge_model="m", embedding_model="e",
        metrics=["faithfulness", "context_recall"],
        output_dir=Path("out"),
        metric_weights={"faithfulness": 3.0, "context_recall": 1.0},
        doc_weights={"doc.pdf": 2.0},
    )
    evaluator = Evaluator(
        scenario=scenario,
        metric_pipeline=MagicMock(),
        app_adapter=None,
    )
    sample = NormalizedSample(
        sample_id="s1", question="q", contexts=["ctx"],
        answer="a", ground_truth="gt",
        metadata={"doc_name": "doc.pdf"},
    )
    score = MetricScore(metrics={"faithfulness": 1.0, "context_recall": 0.0})
    row = evaluator._merge_score(sample, score)
    # (3*1.0 + 1*0.0) / 4 = 0.75
    assert abs(row["weighted_score"] - 0.75) < 1e-4
    assert row["sample_weight"] == 2.0
```

- [ ] **Step 2: Run test to verify it fails**

```
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_merge_score_includes_weighted_score_and_sample_weight -v
```
Expected: FAIL — `KeyError: 'weighted_score'`

- [ ] **Step 3: Update `rag_eval/execution/evaluator.py`**

Add import at top of file (after existing imports):
```python
from rag_eval.metrics.weights import compute_weighted_score, resolve_weight
```

Replace `_merge_score` method:
```python
def _merge_score(self, sample: NormalizedSample, score: Any) -> dict[str, Any]:
    """Combine sample data, metric results, run metadata, and weight columns."""
    record = sample.to_record()
    record["contexts"] = sample.contexts
    record.update(score.metrics)
    record["error"] = score.error
    record["judge_model"] = self.scenario.judge_model
    record["embedding_model"] = self.scenario.embedding_model
    record["run_id"] = self.scenario.scenario_name
    # Weighted score columns — enable post-hoc weighted aggregation in reporting.
    record["weighted_score"] = compute_weighted_score(
        score.metrics, self.scenario.metric_weights
    )
    doc_name = str(sample.metadata.get("doc_name", "") or "")
    record["sample_weight"] = resolve_weight(
        self.scenario.doc_weights, doc_name, default=1.0
    )
    return record
```

- [ ] **Step 4: Run test to verify it passes**

```
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_merge_score_includes_weighted_score_and_sample_weight -v
```
Expected: PASS

- [ ] **Step 5: Commit**

```
git add rag_eval/execution/evaluator.py tests/test_offline_eval.py
git commit -m "feat: add weighted_score and sample_weight columns to score rows"
```

---

## Task 4: Switch the report summary to weighted means

**Files:**
- Modify: `rag_eval/reporting/summary.py`

**Interfaces:**
- Consumes:
  - `weighted_metric_means(score_rows, metrics, doc_weights) -> dict[str, float | None]`
  - `compute_overall_weighted_score_mean(score_rows, metric_weights, doc_weights) -> float | None`
  - Both from `rag_eval.metrics.weights`
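
The two consumed functions aggregate in different orders: `weighted_metric_means` averages each metric across samples, while `compute_overall_weighted_score_mean` first collapses each sample to a single score. With complete data the two orders agree, but once a score is missing they can diverge, so the overall value cannot in general be derived from the per-metric means. A self-contained sketch with simplified stand-in helpers (not the module's actual code; doc weights are omitted for brevity):

```python
from __future__ import annotations

import math


def wmean(pairs: list[tuple[float, float]]) -> float | None:
    """Weighted mean over (weight, value) pairs, skipping NaN values."""
    kept = [(w, v) for w, v in pairs if not math.isnan(v)]
    total = sum(w for w, _ in kept)
    return sum(w * v for w, v in kept) / total if total > 0 else None


def per_sample_then_mean(rows: list[dict], mw: dict[str, float]) -> float | None:
    """Collapse each row to a weighted score, then average the rows equally."""
    per_row = [wmean([(mw[m], row[m]) for m in mw]) for row in rows]
    return wmean([(1.0, s) for s in per_row if s is not None])


def per_metric_then_mean(rows: list[dict], mw: dict[str, float]) -> float | None:
    """Average each metric across rows, then combine the means by metric weight."""
    means = {m: wmean([(1.0, row[m]) for row in rows]) for m in mw}
    return wmean([(mw[m], v) for m, v in means.items() if v is not None])


weights = {"faithfulness": 3.0, "context_recall": 1.0}
complete = [
    {"faithfulness": 1.0, "context_recall": 0.0},
    {"faithfulness": 0.5, "context_recall": 0.5},
]
with_gap = [
    {"faithfulness": float("nan"), "context_recall": 0.0},
    {"faithfulness": 0.5, "context_recall": 0.5},
]

# Same result for complete data, different results once a score is missing.
print(per_sample_then_mean(complete, weights), per_metric_then_mean(complete, weights))
print(per_sample_then_mean(with_gap, weights), per_metric_then_mean(with_gap, weights))
```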

- [ ] **Step 1: Write failing test**

Add to `tests/test_offline_eval.py` inside `EvaluatorAndReportingTests`:

```python
def test_summary_markdown_shows_weighted_score(self):
    """build_summary_markdown includes weighted_score when metric_weights set."""
    import math
    from rag_eval.reporting.summary import build_summary_markdown
    from rag_eval.shared.models import (
        EvaluationResult, NormalizedSample, DatasetConfig, Scenario, RuntimeConfig,
    )
    from pathlib import Path
    scenario = Scenario(
        scenario_name="ws-test", mode="offline",
        dataset=DatasetConfig(path=Path("d.csv")),
        judge_model="m", embedding_model="e",
        metrics=["faithfulness"],
        output_dir=Path("out"),
        metric_weights={"faithfulness": 1.0},
        doc_weights={},
    )
    sample = NormalizedSample(
        sample_id="s1", question="q", contexts=["c"],
        answer="a", ground_truth="gt",
    )
    result = EvaluationResult(
        scenario=scenario, run_id="r1",
        started_at="2026-01-01T00:00:00", finished_at="2026-01-01T00:01:00",
        valid_samples=[sample], invalid_samples=[],
        score_rows=[{
            "sample_id": "s1", "faithfulness": 0.8,
            "weighted_score": 0.8, "sample_weight": 1.0,
            "doc_name": "", "error": "",
        }],
    )
    md = build_summary_markdown(result)
    assert "weighted_score" in md
    assert "0.8000" in md
```

- [ ] **Step 2: Run test to verify it fails**

```
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_summary_markdown_shows_weighted_score -v
```
Expected: FAIL — `"weighted_score" not in md`

- [ ] **Step 3: Update `rag_eval/reporting/summary.py`**

Replace the entire file:

```python
"""Markdown summary generation for completed evaluation runs."""

from __future__ import annotations

import math

import pandas as pd

from rag_eval.metrics.weights import (
    compute_overall_weighted_score_mean,
    weighted_metric_means,
)
from rag_eval.shared.models import EvaluationResult


def _table_from_frame(frame: pd.DataFrame) -> str:
    """Render a small dataframe as a fixed-width markdown-friendly text table."""
    if frame.empty:
        return "No rows."

    columns = list(frame.columns)
    rows = [[str(value) for value in row] for row in frame.astype(object).values.tolist()]
    widths = []
    for index, column in enumerate(columns):
        column_width = len(str(column))
        row_width = max((len(row[index]) for row in rows), default=0)
        widths.append(max(column_width, row_width))

    header = " | ".join(str(column).ljust(widths[idx]) for idx, column in enumerate(columns))
    separator = "-|-".join("-" * widths[idx] for idx in range(len(columns)))
    body = [
        " | ".join(row[idx].ljust(widths[idx]) for idx in range(len(columns)))
        for row in rows
    ]
    return "\n".join([header, separator, *body])


def build_summary_markdown(result: EvaluationResult) -> str:
    """Build the human-readable markdown summary written for each evaluation run."""
    total = len(result.valid_samples) + len(result.invalid_samples)
    scores = pd.DataFrame(result.score_rows)

    lines = [
        f"# {result.scenario.scenario_name}",
        "",
        f"- run_id: `{result.run_id}`",
        f"- mode: `{result.scenario.mode}`",
        f"- total_samples: `{total}`",
        f"- valid_samples: `{len(result.valid_samples)}`",
        f"- invalid_samples: `{len(result.invalid_samples)}`",
        f"- judge_model: `{result.scenario.judge_model}`",
        f"- embedding_model: `{result.scenario.embedding_model}`",
        "",
        "## Metric Means",
        "",
    ]

    if scores.empty:
        lines.append("No valid samples were scored.")
        return "\n".join(lines) + "\n"

    score_rows_list = scores.to_dict(orient="records")
    w_means = weighted_metric_means(
        score_rows_list, result.scenario.metrics, result.scenario.doc_weights
    )

    has_weights = bool(result.scenario.metric_weights or result.scenario.doc_weights)
    weight_suffix = " (weighted)" if has_weights else ""

    for metric in result.scenario.metrics:
        mean_value = w_means.get(metric)
        w = result.scenario.metric_weights.get(metric, 1.0) if result.scenario.metric_weights else 1.0
        weight_note = f"  (w={w:.2f})" if result.scenario.metric_weights else ""
        if mean_value is not None and not math.isnan(mean_value):
            lines.append(f"- {metric}: `{mean_value:.4f}`{weight_note}")
        else:
            lines.append(f"- {metric}: `n/a`{weight_note}")

    overall_ws = compute_overall_weighted_score_mean(
        score_rows_list, result.scenario.metric_weights, result.scenario.doc_weights
    )
    if overall_ws is not None and not math.isnan(overall_ws):
        lines.append(f"- **weighted_score{weight_suffix}: `{overall_ws:.4f}`**")
    else:
        lines.append(f"- **weighted_score{weight_suffix}: `n/a`**")

    detail_columns = ["sample_id", *result.scenario.metrics, "weighted_score", "error"]
    existing_columns = [c for c in detail_columns if c in scores.columns]
    detail = scores[existing_columns]
    lines.extend([
        "",
        "## Per-sample Scores",
        "",
        "```text",
        _table_from_frame(detail),
        "```",
    ])
    return "\n".join(lines) + "\n"
```

- [ ] **Step 4: Run test to verify it passes**

```
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_summary_markdown_shows_weighted_score -v
```
Expected: PASS

- [ ] **Step 5: Run full offline test suite to check no regressions**

```
python -m pytest tests/test_offline_eval.py tests/test_weights.py -v
```
Expected: all PASS

- [ ] **Step 6: Commit**

```
git add rag_eval/reporting/summary.py tests/test_offline_eval.py
git commit -m "feat: use weighted metric means and add weighted_score row to summary.md"
```

---

## Task 5: Extend yaml_patcher

**Files:**
- Modify: `webapp/services/yaml_patcher.py`
- Modify: `webapp/models.py`
- Modify: `webapp/api/llm_profiles.py`

**Interfaces:**
- Produces:
  - `apply_profiles_to_scenario(..., metric_weights=None, doc_weights=None)` — new optional params
  - `ProfileApplyRequest.metric_weights: dict[str, float] | None`
  - `ProfileApplyRequest.doc_weights: dict[str, float] | None`
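
Both new parameters distinguish `None` (leave the YAML untouched) from an empty dict (write an empty mapping, so every weight falls back to the default of 1.0). A minimal sketch of that convention on a plain dict, with a hypothetical `patch_weights` helper standing in for the real patcher:

```python
from __future__ import annotations


def patch_weights(
    data: dict,
    metric_weights: dict[str, float] | None = None,
    doc_weights: dict[str, float] | None = None,
) -> list[str]:
    """Write only the weight maps that were explicitly supplied; return their names."""
    patched: list[str] = []
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")
    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")
    return patched


scenario = {"scenario_name": "test", "metric_weights": {"faithfulness": 0.7}}

print(patch_weights(scenario))                     # nothing supplied, nothing changes
print(patch_weights(scenario, metric_weights={}))  # empty dict overwrites the old map
print(scenario)
```

A client that only wants to change LLM profiles can therefore omit both fields and leave existing weights in place.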

- [ ] **Step 1: Write failing test**

Add to `tests/webapp/test_llm_profiles_api.py`:

```python
def test_apply_metric_weights_patches_yaml(tmp_path):
    """Applying metric_weights writes them into the YAML."""
    scenario_file = tmp_path / "w-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: m\nembedding_model: e\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n",
        encoding="utf-8",
    )
    from webapp.services.yaml_patcher import apply_profiles_to_scenario
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None, answer_profile=None, dataset_profile=None,
        metric_weights={"faithfulness": 0.7, "context_recall": 0.3},
        _resolve_absolute=True,
    )
    assert "metric_weights" in patched
    data = yaml_lib.safe_load(scenario_file.read_text())
    assert data["metric_weights"]["faithfulness"] == pytest.approx(0.7)


def test_apply_doc_weights_patches_yaml(tmp_path):
    """Applying doc_weights writes them into the YAML."""
    scenario_file = tmp_path / "dw-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: m\nembedding_model: e\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n",
        encoding="utf-8",
    )
    from webapp.services.yaml_patcher import apply_profiles_to_scenario
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None, answer_profile=None, dataset_profile=None,
        doc_weights={"doc.pdf": 2.0},
        _resolve_absolute=True,
    )
    assert "doc_weights" in patched
    data = yaml_lib.safe_load(scenario_file.read_text())
    assert data["doc_weights"]["doc.pdf"] == pytest.approx(2.0)
```

- [ ] **Step 2: Run tests to verify they fail**

```
python -m pytest tests/webapp/test_llm_profiles_api.py::test_apply_metric_weights_patches_yaml tests/webapp/test_llm_profiles_api.py::test_apply_doc_weights_patches_yaml -v
```
Expected: FAIL — `unexpected keyword argument 'metric_weights'`

- [ ] **Step 3: Update `webapp/services/yaml_patcher.py`**

Replace `apply_profiles_to_scenario` signature and body:

```python
def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    metric_weights: dict[str, float] | None = None,
    doc_weights: dict[str, float] | None = None,
    _resolve_absolute: bool = False,
) -> list[str]:
    """Patch the YAML file at *scenario_path* with the supplied profiles and weights.

    Returns a list of dotted field names that were actually patched.
    """
    if _resolve_absolute:
        resolved = Path(scenario_path)
    else:
        resolved = _resolve_scenario_path(scenario_path)

    if not resolved.exists():
        raise FileNotFoundError(f"Scenario file not found: {resolved}")

    data: dict[str, Any] = yaml.safe_load(resolved.read_text(encoding="utf-8")) or {}
    patched: list[str] = []

    if judge_profile is not None:
        data["judge_model"] = judge_profile.model
        patched.append("judge_model")

    if answer_profile is not None:
        adapter = data.get("app_adapter")
        if isinstance(adapter, dict):
            static_kwargs = adapter.setdefault("static_kwargs", {})
            static_kwargs["model"] = answer_profile.model
            patched.append("app_adapter.static_kwargs.model")

    if dataset_profile is not None:
        generation = data.get("generation")
        if isinstance(generation, dict):
            generation["model"] = dataset_profile.model
            patched.append("generation.model")

    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")

    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")

    resolved.write_text(
        yaml.dump(data, allow_unicode=True, default_flow_style=False, sort_keys=False),
        encoding="utf-8",
    )
    return patched
```

- [ ] **Step 4: Update `webapp/models.py` — ProfileApplyRequest**

Add two fields to `ProfileApplyRequest`:
```python
class ProfileApplyRequest(BaseModel):
    """Request body to patch LLM profile selections into a scenario YAML."""

    scenario_path: str
    judge_profile_id: str | None = None
    answer_profile_id: str | None = None
    dataset_profile_id: str | None = None
    metric_weights: dict[str, float] | None = Field(
        default=None,
        description='Metric weight mapping, e.g. {"faithfulness": 0.35}. When null, the YAML is left unchanged.',
    )
    doc_weights: dict[str, float] | None = Field(
        default=None,
        description='Document weight mapping, e.g. {"doc.pdf": 2.0}. When null, the YAML is left unchanged.',
    )
```
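
As a hedged illustration of the request contract, the snippet below builds two JSON bodies for the apply endpoint using only the standard library. The scenario path, the `answer_relevancy` metric name, and the weight values are made up; the point is that a `null` weight field leaves the YAML untouched, whereas an object is written through.

```python
import json

# Hypothetical scenario path and weights, for illustration only.
body_with_weights = {
    "scenario_path": "scenarios/example.yaml",
    "judge_profile_id": None,
    "answer_profile_id": None,
    "dataset_profile_id": None,
    "metric_weights": {"faithfulness": 0.35, "answer_relevancy": 0.65},
    "doc_weights": {"doc.pdf": 2.0},
}

# Python None serialises to JSON null, which the model documents
# as "do not modify the YAML" for both weight fields.
body_without_weights = {
    **body_with_weights,
    "metric_weights": None,
    "doc_weights": None,
}

encoded = json.dumps(body_without_weights, ensure_ascii=False)
decoded = json.loads(encoded)
```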

- [ ] **Step 5: Update `webapp/api/llm_profiles.py` — apply_profiles endpoint**

In `apply_profiles()`, update the call to `apply_profiles_to_scenario`:
```python
patched = apply_profiles_to_scenario(
    scenario_path=request.scenario_path,
    judge_profile=role_profiles["judge"],
    answer_profile=role_profiles["answer"],
    dataset_profile=role_profiles["dataset"],
    metric_weights=request.metric_weights,
    doc_weights=request.doc_weights,
)
```

- [ ] **Step 6: Run tests to verify they pass**

```
python -m pytest tests/webapp/test_llm_profiles_api.py -v
```
Expected: all 15 tests PASS

- [ ] **Step 7: Commit**

```
git add webapp/services/yaml_patcher.py webapp/models.py webapp/api/llm_profiles.py tests/webapp/test_llm_profiles_api.py
git commit -m "feat: yaml_patcher and ProfileApplyRequest support metric_weights and doc_weights"
```

---

## Task 6: Weighting support in report_builder and run_reader

**Files:**
- Modify: `webapp/services/run_reader.py`
- Modify: `webapp/services/report_builder.py`
- Modify: `webapp/models.py` (add `weighted_score_mean`, `metric_weights`, and `doc_weights` to ReportData)

**Interfaces:**
- Consumes:
  - `weighted_metric_means(score_rows, metrics, doc_weights)` from `rag_eval.metrics.weights`
  - `compute_overall_weighted_score_mean(score_rows, metric_weights, doc_weights)` from `rag_eval.metrics.weights`
- Produces:
  - `ReportData.weighted_score_mean: float | None`
  - `ReportData.metric_weights: dict[str, float]`, `ReportData.doc_weights: dict[str, float]`
  - `_read_weights_from_snapshot(run_dir) -> tuple[dict, dict]` in run_reader

- [ ] **Step 1: Update `webapp/models.py` — ReportData**

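Add the `weighted_score_mean` field, plus `metric_weights` and `doc_weights` fields that echo the weights the run was configured with: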
```python
class ReportData(BaseModel):
    """Aggregated report payload rendered by the report detail page."""

    metrics: list[str] = Field(default_factory=list)
    metric_means: dict[str, float | None] = Field(default_factory=dict)
    distributions: dict[str, list[DistributionBin]] = Field(default_factory=dict)
    groupings: dict[str, list[GroupStat]] = Field(default_factory=dict)
    lowest_samples: list[SampleScore] = Field(default_factory=list)
    summary_markdown: str = ""
    advice_markdown: str = ""
    weighted_score_mean: float | None = Field(
        default=None,
        description=(
            "Mean of the weighted overall score (metric_weights and doc_weights applied together). "
            "With equal weights it equals the mean of the per-metric means."
        ),
    )
    metric_weights: dict[str, float] = Field(
        default_factory=dict,
        description="Metric weights used by this run (from scenario.snapshot.yaml).",
    )
    doc_weights: dict[str, float] = Field(
        default_factory=dict,
        description="Document weights used by this run (from scenario.snapshot.yaml).",
    )
```

- [ ] **Step 2: Add `_read_weights_from_snapshot` to `webapp/services/run_reader.py`**

Add after `_read_metrics_from_snapshot`:
```python
def _read_weights_from_snapshot(run_dir: Path) -> tuple[dict[str, float], dict[str, float]]:
    """Read metric_weights and doc_weights from a scenario snapshot if present.

    Returns a (metric_weights, doc_weights) tuple of plain dicts.
    Both default to empty dicts when the snapshot is absent or lacks the fields.
    """
    snapshot = run_dir / "scenario.snapshot.yaml"
    if not snapshot.is_file():
        return {}, {}
    try:
        payload = yaml.safe_load(snapshot.read_text(encoding="utf-8")) or {}
    except (OSError, yaml.YAMLError):
        return {}, {}
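    if not isinstance(payload, dict):
        return {}, {}

    def _numeric(raw: object) -> dict[str, float]:
        # Keep real numbers only; bool is an int subclass, so exclude it explicitly.
        if not isinstance(raw, dict):
            return {}
        return {
            str(k): float(v)
            for k, v in raw.items()
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        }

    return _numeric(payload.get("metric_weights")), _numeric(payload.get("doc_weights"))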
```
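
The weight filtering can be checked in isolation. The stand-alone sketch below uses a hypothetical `numeric_only` helper (not a function in the codebase) to show which values survive, and illustrates a Python subtlety relevant to this kind of filter: `bool` is a subclass of `int`, so a bare `isinstance(v, (int, float))` test accepts `True` and `False`.

```python
def numeric_only(raw: object, allow_bool: bool = False) -> dict[str, float]:
    """Hypothetical helper: keep numeric values, coerce keys to str and values to float."""
    if not isinstance(raw, dict):
        return {}
    return {
        str(k): float(v)
        for k, v in raw.items()
        if isinstance(v, (int, float)) and (allow_bool or not isinstance(v, bool))
    }


# Made-up payload: an int, a numeric string, a bool, and a non-string key.
raw = {"faithfulness": 2, "answer_relevancy": "0.5", "context_recall": True, 7: 1.5}

strict = numeric_only(raw)                    # bools rejected
lenient = numeric_only(raw, allow_bool=True)  # bools coerced to 1.0 / 0.0
empty = numeric_only(None)                    # non-dict input yields {}
```

Note that the string `"0.5"` is dropped rather than parsed, so a YAML author who quotes a weight silently loses it.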

- [ ] **Step 3: Update `webapp/services/report_builder.py`**

Switch `build_report` from `_metric_means` to the weighted helpers. First add the imports:

```python
# Add imports at top:
from rag_eval.metrics.weights import (
    compute_overall_weighted_score_mean,
    weighted_metric_means as _weighted_metric_means,
)
from webapp.services.run_reader import _read_weights_from_snapshot
```

Replace `build_report`:
```python
def build_report(run_dir: Path, metrics: list[str]) -> ReportData:
    """Build the full aggregated report payload for one run directory."""
    frame = run_reader.read_scores_frame(run_dir)
    summary_markdown = run_reader.read_summary_markdown(run_dir)
    advice_markdown = run_reader.read_advice_markdown(run_dir)
    metric_weights, doc_weights = _read_weights_from_snapshot(run_dir)

    if frame.empty or not metrics:
        return ReportData(
            metrics=metrics,
            metric_means={metric: None for metric in metrics},
            summary_markdown=summary_markdown,
            advice_markdown=advice_markdown,
            metric_weights=metric_weights,
            doc_weights=doc_weights,
        )

    score_rows_list = frame.to_dict(orient="records")

    # Use weighted metric means (degrades to arithmetic mean when weights are empty).
    w_means = _weighted_metric_means(score_rows_list, metrics, doc_weights)
    rounded_means = {m: _round_or_none(v) for m, v in w_means.items()}

    overall_ws = compute_overall_weighted_score_mean(
        score_rows_list, metric_weights, doc_weights
    )

    distributions = {
        metric: _distribution(frame, metric)
        for metric in metrics
        if metric in frame.columns
    }

    return ReportData(
        metrics=metrics,
        metric_means=rounded_means,
        distributions=distributions,
        groupings=_groupings(frame, metrics),
        lowest_samples=_lowest_samples(frame, metrics),
        summary_markdown=summary_markdown,
        advice_markdown=advice_markdown,
        weighted_score_mean=_round_or_none(overall_ws),
        metric_weights=metric_weights,
        doc_weights=doc_weights,
    )
```

Also delete the old `_metric_means` function (it is replaced by the weighted version).

- [ ] **Step 4: Run existing webapp tests to check no regressions**

```
python -m pytest tests/webapp/ -v
```
Expected: all PASS

- [ ] **Step 5: Commit**

```
git add webapp/models.py webapp/services/run_reader.py webapp/services/report_builder.py
git commit -m "feat: report_builder uses weighted metric means; ReportData gains weighted_score_mean"
```

---

## Task 7: Expose weight fields to the frontend via scenario_scanner

**Files:**
- Modify: `webapp/models.py` (add fields to ScenarioInfo)
- Modify: `webapp/services/scenario_scanner.py`

**Interfaces:**
- Produces:
  - `ScenarioInfo.metric_weights: dict[str, float]`
  - `ScenarioInfo.doc_weights: dict[str, float]`

- [ ] **Step 1: Add fields to `ScenarioInfo` in `webapp/models.py`**

```python
class ScenarioInfo(BaseModel):
    """One discoverable scenario YAML file that can be evaluated from the UI."""

    path: str
    scenario_name: str = ""
    mode: str = ""
    dataset: str = ""
    judge_model: str = ""
    metrics: list[str] = Field(default_factory=list)
    error: str = ""
    metric_weights: dict[str, float] = Field(default_factory=dict)
    doc_weights: dict[str, float] = Field(default_factory=dict)
```

- [ ] **Step 2: Update `webapp/services/scenario_scanner.py` — `_summarize_scenario`**

After the `metric_list` line, add weight extraction:
```python
raw_metric_weights = payload.get("metric_weights") or {}
raw_doc_weights = payload.get("doc_weights") or {}
# Keep real numbers only; bool is an int subclass, so exclude it explicitly.
metric_weights = {str(k): float(v) for k, v in raw_metric_weights.items()
                  if isinstance(v, (int, float)) and not isinstance(v, bool)}
doc_weights = {str(k): float(v) for k, v in raw_doc_weights.items()
               if isinstance(v, (int, float)) and not isinstance(v, bool)}

return ScenarioInfo(
    path=relative,
    scenario_name=str(payload.get("scenario_name", "")),
    mode=str(payload.get("mode", "")),
    dataset=str(payload.get("dataset", "")),
    judge_model=str(payload.get("judge_model", "")),
    metrics=metric_list,
    metric_weights=metric_weights,
    doc_weights=doc_weights,
)
```

- [ ] **Step 3: Run existing tests**

```
python -m pytest tests/webapp/ tests/test_offline_eval.py -v
```
Expected: all PASS

- [ ] **Step 4: Commit**

```
git add webapp/models.py webapp/services/scenario_scanner.py
git commit -m "feat: ScenarioInfo exposes metric_weights and doc_weights from YAML"
```

---

## Task 8: Frontend weight configuration panel and report card

**Files:**
- Modify: `webapp/static/index.html`
- Modify: `webapp/static/js/runner.js`
- Modify: `webapp/static/css/app.css`
- Modify: `webapp/static/js/report.js`

**Interfaces:**
- Consumes: `ScenarioInfo.metric_weights`, `ScenarioInfo.doc_weights` (from `/api/scenarios`)
- Consumes: `ReportData.weighted_score_mean`, `ReportData.metric_weights` (from `/api/runs/{id}`)
- Produces: weight panel HTML on the 新建评估 (New Evaluation) page; weighted_score card on the 报告 (Report) page

- [ ] **Step 1: Add weight panel HTML to `index.html`**

Add this block immediately after the closing `</div>` of `#llm-assignment-panel` (before `<div class="panel" id="task-panel"`):

```html
<!-- Weight configuration panel (shown once a scenario is selected) -->
<div class="panel weight-config-panel" id="weight-config-panel" hidden>
  <h2>权重配置 <span class="muted" style="font-size:13px;font-weight:400">（可选，留空使用场景原始配置）</span></h2>

  <div class="weight-section">
    <div class="weight-section-title">指标权重 <span class="muted" style="font-size:12px">（数值越大该指标在综合得分中占比越高）</span></div>
    <div id="metric-weight-rows" class="weight-rows"></div>
  </div>

  <div class="weight-section" style="margin-top:16px">
    <div class="weight-section-title">文档权重 <span class="muted" style="font-size:12px">（按 PDF 文件名，数值越大该文档的题目在汇总时贡献越大）</span></div>
    <div id="doc-weight-rows" class="weight-rows"></div>
    <button type="button" class="btn btn-sm" id="add-doc-weight-btn" style="margin-top:8px">＋ 添加文档权重</button>
  </div>
</div>
```

- [ ] **Step 2: Add CSS to `app.css`**

Add at the end of the file; if the file ends with an `@media print` block, insert the new rules just before it:

```css
/* ── Weight configuration panel ─────────────────────── */
.weight-config-panel { margin-top: 12px; }
.weight-section-title { font-size: 13px; font-weight: 600; color: var(--text); margin-bottom: 8px; }
.weight-rows { display: flex; flex-direction: column; gap: 6px; }
.weight-row {
  display: flex; align-items: center; gap: 10px;
  font-size: 13px;
}
.weight-row-label { min-width: 180px; color: var(--slate); font-family: monospace; }
.weight-row-input {
  width: 80px; padding: 4px 8px; border: 1px solid var(--border);
  border-radius: 6px; font-size: 13px; text-align: right;
}
.weight-row-input:focus { outline: none; border-color: #6366f1; }
.doc-weight-name {
  flex: 1; padding: 4px 8px; border: 1px solid var(--border);
  border-radius: 6px; font-size: 13px; min-width: 0;
}
.weight-row-remove { color: var(--bad); cursor: pointer; font-size: 14px; background: none; border: none; padding: 2px 6px; }
.weight-row-remove:hover { background: #fee2e2; border-radius: 4px; }

/* Highlight the weighted_score metric card */
.metric-card.weighted-score-card {
  border: 2px solid #6366f1;
  background: #f5f3ff;
}
.metric-card.weighted-score-card .metric-name { color: #4f46e5; font-weight: 700; }
```

- [ ] **Step 3: Update `runner.js`**

Replace the entire `runner.js` with:

```javascript
// runner.js: 新建评估 (New Evaluation) view. Lists scenarios, configures LLM roles and weights, triggers evaluations, and polls task status.

const Runner = {
  selectedScenario: null,
  selectedScenarioInfo: null,
  pollTimer: null,
  lastRunId: null,

  init() {
    document.getElementById("run-btn").addEventListener("click", () => Runner.trigger());
    document.getElementById("view-report-btn").addEventListener("click", () => {
      if (Runner.lastRunId) {
        App.enableReportNav();
        App.navigate("report", Runner.lastRunId);
      }
    });
    document.getElementById("add-doc-weight-btn").addEventListener("click", () => Runner._addDocWeightRow());
  },

  async loadScenarios() {
    const list = document.getElementById("scenario-list");
    list.innerHTML = '<p class="muted">加载中…</p>';
    try {
      const data = await API.scenarios();
      const scenarios = data.scenarios || [];
      if (scenarios.length === 0) {
        list.innerHTML = '<p class="muted">未在 scenarios/ 下找到场景文件。</p>';
        return;
      }
      list.innerHTML = "";
      scenarios.forEach((sc) => list.appendChild(Runner.renderScenarioItem(sc)));
    } catch (err) {
      list.innerHTML = `<p class="muted">加载失败：${App.escape(err.message)}</p>`;
    }
    Runner._populateProfileSelects();
  },

  async _populateProfileSelects() {
    const cached = Profiles.getAll();
    const profiles = cached.length > 0
      ? cached
      : (await API.profiles().catch(() => ({ profiles: [] }))).profiles;
    ["role-judge", "role-answer", "role-dataset"].forEach(id => {
      const sel = document.getElementById(id);
      sel.innerHTML = '<option value="">— 使用场景原始配置 —</option>';
      profiles.forEach(p => {
        const opt = document.createElement("option");
        opt.value = p.profile_id;
        opt.textContent = `${p.name}  (${p.model})`;
        sel.appendChild(opt);
      });
    });
  },

  renderScenarioItem(sc) {
    const item = document.createElement("div");
    const invalid = !!sc.error;
    item.className = "scenario-item" + (invalid ? " invalid" : "");
    const modeTag = sc.mode
      ? `<span class="tag mode-${App.escape(sc.mode)}">${App.escape(sc.mode)}</span>`
      : "";
    const metricCount = (sc.metrics || []).length;
    item.innerHTML = `
      <div>
        <div class="scenario-name">${App.escape(sc.scenario_name || sc.path)}</div>
        <div class="scenario-path">${App.escape(sc.path)}</div>
        ${sc.error ? `<div class="scenario-path" style="color:#dc2626">${App.escape(sc.error)}</div>` : ""}
      </div>
      <div class="scenario-tags">
        ${modeTag}
        <span class="tag">${metricCount} 指标</span>
      </div>
    `;
    if (!invalid) {
      item.addEventListener("click", () => {
        document.querySelectorAll(".scenario-item").forEach((el) => el.classList.remove("selected"));
        item.classList.add("selected");
        Runner.selectedScenario = sc.path;
        Runner.selectedScenarioInfo = sc;
        document.getElementById("selected-scenario").textContent = sc.path;
        document.getElementById("run-btn").disabled = false;
        document.getElementById("llm-assignment-panel").hidden = false;
        Runner._renderWeightPanel(sc);
        document.getElementById("weight-config-panel").hidden = false;
      });
    }
    return item;
  },

  // Render the metric weight rows dynamically for the selected scenario
  _renderWeightPanel(sc) {
    const metricRows = document.getElementById("metric-weight-rows");
    metricRows.innerHTML = "";
    const metrics = sc.metrics || [];
    const existingWeights = sc.metric_weights || {};
    metrics.forEach(metric => {
      const row = document.createElement("div");
      row.className = "weight-row";
      const currentVal = existingWeights[metric] != null ? existingWeights[metric] : 1.0;
      row.innerHTML = `
        <span class="weight-row-label">${App.escape(metric)}</span>
        <input class="weight-row-input" type="number" min="0" step="0.1"
               data-metric="${App.escape(metric)}" value="${currentVal}" />
      `;
      metricRows.appendChild(row);
    });

    // Populate rows for document weights already present in the scenario
    const docRows = document.getElementById("doc-weight-rows");
    docRows.innerHTML = "";
    const existingDocWeights = sc.doc_weights || {};
    Object.entries(existingDocWeights).forEach(([docName, w]) => {
      Runner._addDocWeightRow(docName, w);
    });
  },

  // Append one document weight input row
  _addDocWeightRow(docName = "", weight = 1.0) {
    const container = document.getElementById("doc-weight-rows");
    const row = document.createElement("div");
    row.className = "weight-row";
    row.innerHTML = `
      <input class="doc-weight-name" type="text" placeholder="PDF 文件名（如 322_双源CT.pdf）" value="${App.escape(docName)}" />
      <input class="weight-row-input" type="number" min="0" step="0.1" value="${weight}" />
      <button class="weight-row-remove" title="删除">✕</button>
    `;
    row.querySelector(".weight-row-remove").addEventListener("click", () => row.remove());
    container.appendChild(row);
  },

  // Collect the current values from the weight panel
  _collectWeights() {
    const metricWeights = {};
    document.querySelectorAll("#metric-weight-rows .weight-row-input").forEach(input => {
      const metric = input.dataset.metric;
      const val = parseFloat(input.value);
      if (metric && !isNaN(val)) metricWeights[metric] = val;
    });

    const docWeights = {};
    document.querySelectorAll("#doc-weight-rows .weight-row").forEach(row => {
      const nameInput = row.querySelector(".doc-weight-name");
      const valInput = row.querySelector(".weight-row-input");
      if (!nameInput || !valInput) return;
      const name = nameInput.value.trim();
      const val = parseFloat(valInput.value);
      if (name && !isNaN(val)) docWeights[name] = val;
    });

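    // Skip sending weights only when the panel is at its defaults (all metric weights 1.0,
    // no doc weights) AND the scenario has no weights yet. If the YAML already carries
    // weights, resetting the panel must still overwrite them, otherwise the old values persist.
    const info = Runner.selectedScenarioInfo || {};
    const hadWeights = Object.keys(info.metric_weights || {}).length > 0
                    || Object.keys(info.doc_weights || {}).length > 0;
    const allDefault = Object.values(metricWeights).every(v => Math.abs(v - 1.0) < 1e-9)
                    && Object.keys(docWeights).length === 0;
    if (allDefault && !hadWeights) return { metricWeights: null, docWeights: null };
    // Remember what is about to be written so later resets are compared against it.
    info.metric_weights = metricWeights;
    info.doc_weights = docWeights;
    return { metricWeights, docWeights };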
  },

  async trigger() {
    if (!Runner.selectedScenario) return;
    const runBtn = document.getElementById("run-btn");
    runBtn.disabled = true;
    const panel = document.getElementById("task-panel");
    const logBox = document.getElementById("task-log");
    const statusBadge = document.getElementById("task-status");
    const reportBtn = document.getElementById("view-report-btn");
    panel.hidden = false;
    reportBtn.hidden = true;
    logBox.textContent = "";
    Runner._setStatus(statusBadge, "queued");
    try {
      await Runner._applyProfilesIfNeeded(logBox);
      const resp = await API.triggerEvaluation(Runner.selectedScenario);
      Runner.poll(resp.task_id);
    } catch (err) {
      Runner._setStatus(statusBadge, "failed");
      logBox.textContent = (logBox.textContent ? logBox.textContent + "\n" : "") + `触发失败：${err.message}`;
      runBtn.disabled = false;
    }
  },

  async _applyProfilesIfNeeded(logBox) {
    const judgeId = document.getElementById("role-judge").value;
    const answerId = document.getElementById("role-answer").value;
    const datasetId = document.getElementById("role-dataset").value;
    const { metricWeights, docWeights } = Runner._collectWeights();

    if (!judgeId && !answerId && !datasetId && !metricWeights && !docWeights) return;

    logBox.textContent = "正在将 LLM 配置和权重写入场景文件…\n";
    const body = {
      scenario_path: Runner.selectedScenario,
      judge_profile_id: judgeId || null,
      answer_profile_id: answerId || null,
      dataset_profile_id: datasetId || null,
      metric_weights: metricWeights,
      doc_weights: docWeights,
    };
    const result = await API.applyProfiles(body);
    const fields = (result.patched_fields || []).join(", ");
    logBox.textContent += fields
      ? `✓ 已更新字段：${fields}\n`
      : "（未找到可更新的字段，继续运行）\n";
  },

  poll(taskId) {
    const logBox = document.getElementById("task-log");
    const statusBadge = document.getElementById("task-status");
    const reportBtn = document.getElementById("view-report-btn");
    const runBtn = document.getElementById("run-btn");
    if (Runner.pollTimer) clearInterval(Runner.pollTimer);
    Runner.pollTimer = setInterval(async () => {
      try {
        const status = await API.taskStatus(taskId);
        logBox.textContent = (status.logs || []).join("\n");
        logBox.scrollTop = logBox.scrollHeight;
        Runner._setStatus(statusBadge, status.status);
        if (status.status === "completed" || status.status === "failed") {
          clearInterval(Runner.pollTimer);
          runBtn.disabled = false;
          if (status.status === "completed" && status.run_id) {
            Runner.lastRunId = status.run_id;
            sessionStorage.setItem("rag_run_id", status.run_id);
            reportBtn.hidden = false;
          }
        }
      } catch (err) {
        clearInterval(Runner.pollTimer);
        logBox.textContent += `\n轮询失败：${err.message}`;
        runBtn.disabled = false;
      }
    }, 1200);
  },

  _setStatus(badge, status) {
    badge.textContent = status;
    badge.className = "badge " + status;
  },
};
```

- [ ] **Step 4: Update `report.js` — renderMetricCards**

In `renderMetricCards`, after the `metrics.forEach` loop that renders individual cards, append this block to show the weighted_score card:

```javascript
// Append at the end of renderMetricCards, after metrics.forEach:
const wsValue = report.weighted_score_mean;
const wsCard = document.createElement("div");
wsCard.className = "metric-card weighted-score-card";
const wsCls = App.scoreClass(wsValue);
const wsText = wsValue === null || wsValue === undefined ? "n/a" : wsValue.toFixed(2);
wsCard.innerHTML = `
  <div class="metric-value ${wsCls}">${wsText}</div>
  <div class="metric-name">综合加权得分</div>
`;
wrap.appendChild(wsCard);
```

- [ ] **Step 5: Verify app loads without JS errors**

Start the webapp:
```
python webmain.py
```
Open http://localhost:8000, navigate to「新建评估」, click a scenario, and verify:
- The weight panel appears below LLM 角色配置
- Each metric is listed with a default weight of 1.0 (or the value from the scenario's `metric_weights`, if set)
- The「添加文档权重」button adds a new row
- Navigating to any report shows the「综合加权得分」card

- [ ] **Step 6: Commit**

```
git add webapp/static/index.html webapp/static/js/runner.js webapp/static/css/app.css webapp/static/js/report.js
git commit -m "feat: add weight config panel to 新建评估 and weighted_score card to report"
```

---

## Task 9: Full regression test run

- [ ] **Step 1: Run all tests**

```
python -m pytest tests/ -v --tb=short
```
Expected: all previously-passing tests still PASS, new tests PASS.

Note: the failures in `webapp.test_*` (module import path issues) and `test_offline_eval::test_normalize_sample_pdf_offline_smoke_row` (missing CSV fixture) predate this feature and are not regressions.

- [ ] **Step 2: Run pipeline and llm-profiles tests explicitly**

```
python -m pytest tests/test_pipeline.py tests/webapp/test_llm_profiles_api.py tests/test_weights.py -v
```
Expected: all PASS

- [ ] **Step 3: Final commit**

```
git add .
git commit -m "feat: metric & doc weights — full implementation complete

- New rag_eval/metrics/weights.py with pure-function weight computation
- Scenario YAML supports metric_weights and doc_weights (optional, backward-compatible)
- scores.csv gains weighted_score and sample_weight columns
- summary.md shows weighted metric means and overall weighted_score
- yaml_patcher writes metric_weights/doc_weights on apply
- report_builder uses weighted means; ReportData gains weighted_score_mean
- 新建评估 page: weight config panel with metric weight inputs and doc weight rows
- 报告详情 page: 综合加权得分 card

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>"
```