wangwei 078097af00 docs: add metric/doc weights implementation plan

2026-06-18 16:43:08 +08:00

Metric Weights & Document Chunk Weights Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

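Goal: Support two optional fields, metric_weights and doc_weights, in the scenario YAML; compute a weighted composite score and display it on the report page and in the weight configuration panel of the "New Evaluation" page.

Architecture: Add a pure-function module, rag_eval/metrics/weights.py, that holds all of the computation logic. The evaluator writes two new columns (weighted_score, sample_weight) to scores.csv. yaml_patcher is extended to write the weight fields. The frontend dynamically renders a weight panel below the LLM role configuration panel, and the report page gains a "Weighted Composite Score" card.

Tech Stack: Python 3.12, Pydantic v2, FastAPI, Vanilla JS (no framework), pytest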

Global Constraints

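Python 3.12+, PEP 8, 4-space indentation, type annotations required
All new fields are optional; with the fields omitted, behavior is identical to the current behavior (backward compatible)
Tests use pytest and do not depend on a real LLM or the network
JS introduces no new dependencies (native DOM API only)
Weight values do not need to be normalized; the computation divides internally, w / Σw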
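
As a quick illustration of the last constraint, the sketch below shows why un-normalized weights are acceptable: dividing by Σw makes the result invariant to a common scale factor. `weighted_mean` is a hypothetical helper used only for this illustration; it is not one of the functions planned for weights.py.

```python
def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Return sum(w * v) / sum(w); the division normalizes the weights implicitly."""
    total_w = sum(weights)
    if total_w == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for w, v in zip(weights, values)) / total_w


scores = [0.8, 0.4]
# The same 2:1 ratio expressed at two different scales.
a = weighted_mean(scores, [2.0, 1.0])
b = weighted_mean(scores, [0.2, 0.1])
```

Both calls give the same value up to floating-point rounding, so users can enter weights such as 3 and 1 or 0.75 and 0.25 interchangeably.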

File List

Action	File	Responsibility
Create	`rag_eval/metrics/weights.py`	Pure functions for weight computation
Create	`tests/test_weights.py`	Unit tests for the weights module
Modify	`rag_eval/config/schema.py`	Add two fields to ScenarioModel
Modify	`rag_eval/shared/models.py`	Add two fields to the Scenario dataclass
Modify	`rag_eval/config/loader.py`	load_scenario passes the new fields through
Modify	`rag_eval/execution/evaluator.py`	`_merge_score` adds two columns
Modify	`rag_eval/reporting/summary.py`	Switch to weighted means; add a weighted_score row
Modify	`webapp/services/yaml_patcher.py`	Add metric_weights/doc_weights parameters
Modify	`webapp/models.py`	Add fields to ProfileApplyRequest; add weighted_score_mean to ReportData
Modify	`webapp/api/llm_profiles.py`	apply_profiles passes the new parameters through
Modify	`webapp/services/report_builder.py`	Read the weights; compute weighted means and weighted_score_mean
Modify	`webapp/services/run_reader.py`	Read metric_weights/doc_weights from snapshot.yaml
Modify	`webapp/static/index.html`	Add the weight configuration panel HTML
Modify	`webapp/static/js/runner.js`	Weight panel logic; send weights on apply
Modify	`webapp/static/css/app.css`	Weight panel styles
Modify	`webapp/static/js/report.js`	Render the weighted_score card in renderMetricCards
Modify	`webapp/api/scenarios.py`	Add metric_weights/doc_weights fields to ScenarioInfo
Modify	`webapp/services/scenario_scanner.py`	Read the weight fields during scanning
Modify	`tests/test_offline_eval.py`	Assert that scores.csv contains weighted_score/sample_weight
Modify	`tests/webapp/test_llm_profiles_api.py`	Tests for apply_profiles writing weights

Task 1: Create the core weight computation module (TDD)

Files:

Create: rag_eval/metrics/weights.py
Create: tests/test_weights.py

Interfaces:

Produces:
- resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float
- compute_weighted_score(scores: dict[str, float | None], metric_weights: dict[str, float]) -> float | None
- weighted_metric_means(score_rows: list[dict], metrics: list[str], doc_weights: dict[str, float]) -> dict[str, float | None]
- compute_overall_weighted_score_mean(score_rows: list[dict], metric_weights: dict[str, float], doc_weights: dict[str, float]) -> float | None

Step 1: Write failing tests

Create tests/test_weights.py:

"""Unit tests for rag_eval/metrics/weights.py"""
import math
import pytest
from rag_eval.metrics.weights import (
    resolve_weight,
    compute_weighted_score,
    weighted_metric_means,
    compute_overall_weighted_score_mean,
)


class TestResolveWeight:
    def test_returns_value_when_key_present(self):
        assert resolve_weight({"faith": 0.5}, "faith") == 0.5

    def test_returns_default_when_key_missing(self):
        assert resolve_weight({}, "faith") == 1.0

    def test_returns_custom_default_when_key_missing(self):
        assert resolve_weight({}, "faith", default=2.0) == 2.0

    def test_empty_dict_returns_default(self):
        assert resolve_weight({}, "anything") == 1.0


class TestComputeWeightedScore:
    def test_equal_weights_is_simple_mean(self):
        scores = {"faithfulness": 0.8, "context_recall": 0.6}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.7, rel=1e-4)

    def test_explicit_weights(self):
        scores = {"faithfulness": 1.0, "context_recall": 0.0}
        weights = {"faithfulness": 3.0, "context_recall": 1.0}
        # (3*1.0 + 1*0.0) / (3+1) = 0.75
        result = compute_weighted_score(scores, weights)
        assert result == pytest.approx(0.75, rel=1e-4)

    def test_nan_values_excluded(self):
        scores = {"faithfulness": float("nan"), "context_recall": 0.8}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.8, rel=1e-4)

    def test_none_values_excluded(self):
        scores = {"faithfulness": None, "context_recall": 0.6}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.6, rel=1e-4)

    def test_all_nan_returns_none(self):
        scores = {"faithfulness": float("nan"), "context_recall": float("nan")}
        assert compute_weighted_score(scores, {}) is None

    def test_empty_scores_returns_none(self):
        assert compute_weighted_score({}, {}) is None

    def test_missing_metric_in_weights_uses_default_1(self):
        scores = {"faithfulness": 0.8, "context_recall": 0.4}
        weights = {"faithfulness": 2.0}  # context_recall defaults to 1.0
        # (2*0.8 + 1*0.4) / (2+1) = 2.0/3 ≈ 0.6667
        result = compute_weighted_score(scores, weights)
        assert result == pytest.approx(2.0 / 3, rel=1e-4)


class TestWeightedMetricMeans:
    def _rows(self):
        return [
            {"doc_name": "a.pdf", "faithfulness": 1.0, "context_recall": 0.5},
            {"doc_name": "b.pdf", "faithfulness": 0.6, "context_recall": 0.8},
        ]

    def test_equal_weights_gives_arithmetic_mean(self):
        rows = self._rows()
        result = weighted_metric_means(rows, ["faithfulness", "context_recall"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["context_recall"] == pytest.approx(0.65, rel=1e-4)

    def test_doc_weight_amplifies_contribution(self):
        rows = self._rows()
        # doc a.pdf gets weight 3, b.pdf gets 1
        doc_weights = {"a.pdf": 3.0, "b.pdf": 1.0}
        result = weighted_metric_means(rows, ["faithfulness"], doc_weights)
        # (3*1.0 + 1*0.6) / (3+1) = 3.6/4 = 0.9
        assert result["faithfulness"] == pytest.approx(0.9, rel=1e-4)

    def test_nan_rows_skipped_per_metric(self):
        rows = [
            {"doc_name": "a.pdf", "faithfulness": float("nan"), "context_recall": 0.5},
            {"doc_name": "b.pdf", "faithfulness": 0.8, "context_recall": 0.9},
        ]
        result = weighted_metric_means(rows, ["faithfulness", "context_recall"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["context_recall"] == pytest.approx(0.7, rel=1e-4)

    def test_missing_metric_column_returns_none(self):
        rows = [{"doc_name": "a.pdf", "faithfulness": 0.8}]
        result = weighted_metric_means(rows, ["faithfulness", "unknown_metric"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["unknown_metric"] is None

    def test_empty_rows_returns_none_for_all(self):
        result = weighted_metric_means([], ["faithfulness"], {})
        assert result["faithfulness"] is None


class TestComputeOverallWeightedScoreMean:
    def test_basic_weighted_mean_of_weighted_scores(self):
        rows = [
            {"doc_name": "a.pdf", "faithfulness": 1.0, "context_recall": 0.0},
            {"doc_name": "b.pdf", "faithfulness": 0.5, "context_recall": 0.5},
        ]
        metric_weights = {"faithfulness": 1.0, "context_recall": 1.0}
        result = compute_overall_weighted_score_mean(rows, metric_weights, {})
        # sample 1 ws = 0.5, sample 2 ws = 0.5 → mean = 0.5
        assert result == pytest.approx(0.5, rel=1e-4)

    def test_doc_weight_amplifies_sample(self):
        rows = [
            {"doc_name": "important.pdf", "faithfulness": 1.0},
            {"doc_name": "other.pdf", "faithfulness": 0.0},
        ]
        doc_weights = {"important.pdf": 9.0, "other.pdf": 1.0}
        result = compute_overall_weighted_score_mean(rows, {}, doc_weights)
        # ws_1=1.0 w=9, ws_2=0.0 w=1 → (9*1 + 1*0)/(9+1) = 0.9
        assert result == pytest.approx(0.9, rel=1e-4)

    def test_all_nan_returns_none(self):
        rows = [{"doc_name": "a.pdf", "faithfulness": float("nan")}]
        assert compute_overall_weighted_score_mean(rows, {}, {}) is None

Step 2: Run tests to verify they fail

python -m pytest tests/test_weights.py -v

Expected: ModuleNotFoundError: No module named 'rag_eval.metrics.weights'

Step 3: Implement rag_eval/metrics/weights.py

"""Utility functions for weighted metric aggregation.

All functions are pure (no side effects, no I/O) and operate on plain dicts/lists.
Weights do not need to be pre-normalised — normalisation is done internally.
"""

from __future__ import annotations

import math


def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for *key*, or *default* when absent."""
    return float(weights.get(key, default))


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """Return the weighted mean of valid (non-NaN, non-None) metric scores.

    Args:
        scores: mapping of metric_name -> raw score (may be NaN or None).
        metric_weights: optional per-metric weights; absent keys default to 1.0.

    Returns:
        Weighted mean as a float, or None when no valid score exists.
    """
    total_weight = 0.0
    total_score = 0.0
    for metric, score in scores.items():
        if score is None:
            continue
        try:
            v = float(score)
        except (TypeError, ValueError):
            continue
        if math.isnan(v) or math.isinf(v):
            continue
        w = resolve_weight(metric_weights, metric, default=1.0)
        total_weight += w
        total_score += w * v
    if total_weight == 0.0:
        return None
    return total_score / total_weight


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """Compute per-metric weighted means across all score rows.

    Each row's contribution is scaled by the doc_weight for its ``doc_name``.
    Rows with NaN/None for a given metric are excluded from that metric's mean.

    Args:
        score_rows: list of score record dicts (from scores.csv).
        metrics: ordered list of metric names to aggregate.
        doc_weights: mapping doc_name -> weight multiplier; absent keys default to 1.0.

    Returns:
        Dict mapping metric_name -> weighted mean (or None if no valid data).
    """
    totals: dict[str, float] = {m: 0.0 for m in metrics}
    weights_sum: dict[str, float] = {m: 0.0 for m in metrics}

    for row in score_rows:
        doc_name = str(row.get("doc_name", "") or "")
        sample_w = resolve_weight(doc_weights, doc_name, default=1.0)
        for metric in metrics:
            raw = row.get(metric)
            if raw is None:
                continue
            try:
                v = float(raw)
            except (TypeError, ValueError):
                continue
            if math.isnan(v) or math.isinf(v):
                continue
            totals[metric] += sample_w * v
            weights_sum[metric] += sample_w

    return {
        metric: (totals[metric] / weights_sum[metric] if weights_sum[metric] > 0 else None)
        for metric in metrics
    }


def compute_overall_weighted_score_mean(
    score_rows: list[dict],
    metric_weights: dict[str, float],
    doc_weights: dict[str, float],
) -> float | None:
    """Compute the overall weighted-score mean across all samples.

    For each sample:
      1. Compute per-sample weighted_score via ``compute_weighted_score``.
      2. Scale by the doc weight for that sample's ``doc_name``.
    Then return the weighted mean of all per-sample weighted_scores.

    Args:
        score_rows: list of score record dicts.
        metric_weights: per-metric weight multipliers.
        doc_weights: per-doc weight multipliers.

    Returns:
        Float mean, or None when no sample has a valid weighted_score.
    """
    total_weight = 0.0
    total_score = 0.0
    for row in score_rows:
        # Collect only numeric metric columns (exclude meta-columns)
        metric_scores: dict[str, float | None] = {}
        for k, v in row.items():
            if k in _META_COLUMNS:
                continue
            metric_scores[k] = v  # type: ignore[assignment]

        ws = compute_weighted_score(metric_scores, metric_weights)
        if ws is None:
            continue
        doc_name = str(row.get("doc_name", "") or "")
        sample_w = resolve_weight(doc_weights, doc_name, default=1.0)
        total_weight += sample_w
        total_score += sample_w * ws

    return total_score / total_weight if total_weight > 0 else None


# Columns in scores.csv that are sample metadata, not metric scores.
_META_COLUMNS = frozenset({
    "sample_id", "question", "contexts", "answer", "ground_truth",
    "scenario", "language", "retrieval_config", "error",
    "judge_model", "embedding_model", "run_id",
    "difficulty", "question_type", "doc_id", "doc_name",
    "section_path", "page_start", "page_end",
    "source_chunk_ids", "review_status", "review_notes",
    "weighted_score", "sample_weight",
})
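
To make the two-level aggregation concrete, here is a self-contained sketch that mirrors the logic above without importing the module. Metric weights are applied within each sample first, then doc weights are applied across samples. `sample_score`, the row layout with a nested `scores` dict, and the example values are illustrative assumptions, not part of the planned API.

```python
from __future__ import annotations

import math


def sample_score(scores: dict[str, float | None], metric_weights: dict[str, float]) -> float | None:
    """Weighted mean over one sample's valid scores; absent weights default to 1.0."""
    pairs = [
        (metric_weights.get(name, 1.0), value)
        for name, value in scores.items()
        if value is not None and math.isfinite(value)
    ]
    total_w = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total_w if total_w else None


rows = [
    {"doc_name": "a.pdf", "scores": {"faithfulness": 1.0, "context_recall": 0.0}},
    {"doc_name": "b.pdf", "scores": {"faithfulness": 0.5, "context_recall": None}},
]
metric_weights = {"faithfulness": 3.0}  # context_recall falls back to 1.0
doc_weights = {"a.pdf": 2.0}            # b.pdf falls back to 1.0

# Level 1: per-sample weighted score. a.pdf: (3*1.0 + 1*0.0) / 4; b.pdf: only faithfulness is valid.
per_sample = [sample_score(r["scores"], metric_weights) for r in rows]
# Level 2: doc-weighted mean of the per-sample scores.
weights = [doc_weights.get(r["doc_name"], 1.0) for r in rows]
overall = sum(w * s for w, s in zip(weights, per_sample)) / sum(weights)
```

Here the per-sample scores are 0.75 and 0.5, and the overall value is (2 * 0.75 + 1 * 0.5) / 3, roughly 0.6667. The missing context_recall score shrinks the denominator for b.pdf instead of counting as zero.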

Step 4: Run tests to verify they pass

python -m pytest tests/test_weights.py -v

Expected: all 19 tests PASS

Step 5: Commit

git add rag_eval/metrics/weights.py tests/test_weights.py
git commit -m "feat: add metric/doc weight computation module (weights.py)"

Task 2: Extend the Schema, Dataclass, and Loader

Files:

Modify: rag_eval/config/schema.py
Modify: rag_eval/shared/models.py
Modify: rag_eval/config/loader.py

Interfaces:

Consumes: no new dependencies
Produces:
- ScenarioModel.metric_weights: dict[str, float] (default {})
- ScenarioModel.doc_weights: dict[str, float] (default {})
- Scenario.metric_weights: dict[str, float] (default {})
- Scenario.doc_weights: dict[str, float] (default {})

Step 1: Write failing test

Add to tests/test_offline_eval.py inside ScenarioAndDatasetTests:

def test_load_scenario_metric_and_doc_weights(self):
    """load_scenario passes metric_weights and doc_weights into Scenario."""
    import tempfile, yaml
    from rag_eval.config.loader import load_scenario
    payload = {
        "scenario_name": "w-test", "mode": "offline",
        "dataset": "nonexistent.csv", "judge_model": "m",
        "embedding_model": "e", "metrics": ["faithfulness"],
        "output_dir": "out",
        "metric_weights": {"faithfulness": 0.7},
        "doc_weights": {"doc.pdf": 2.0},
    }
    with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w",
                                     encoding="utf-8", delete=False) as f:
        yaml.dump(payload, f, allow_unicode=True)
        tmp_path = f.name
    scenario = load_scenario(tmp_path)
    assert scenario.metric_weights == {"faithfulness": 0.7}
    assert scenario.doc_weights == {"doc.pdf": 2.0}

def test_load_scenario_defaults_to_empty_weights(self):
    """load_scenario defaults metric_weights and doc_weights to empty dicts."""
    import tempfile, yaml
    from rag_eval.config.loader import load_scenario
    payload = {
        "scenario_name": "no-w", "mode": "offline",
        "dataset": "nonexistent.csv", "judge_model": "m",
        "embedding_model": "e", "metrics": ["faithfulness"],
        "output_dir": "out",
    }
    with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w",
                                     encoding="utf-8", delete=False) as f:
        yaml.dump(payload, f, allow_unicode=True)
        tmp_path = f.name
    scenario = load_scenario(tmp_path)
    assert scenario.metric_weights == {}
    assert scenario.doc_weights == {}
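
Both tests above create the file with delete=False and never remove it. If leftover temp files are a concern, one option is a temporary directory that cleans up after itself. This is a minimal stdlib sketch under that assumption; the file name and contents are placeholders, and the call to load_scenario is omitted so the block stays self-contained.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder scenario file; a real test would write the full payload here.
    scenario_path = Path(tmp_dir) / "scenario.yaml"
    scenario_path.write_text("scenario_name: w-test\n", encoding="utf-8")
    existed_inside = scenario_path.exists()
    # load_scenario(str(scenario_path)) would be called at this point.

# The directory and everything in it are removed when the block exits.
exists_after = scenario_path.exists()
```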

Step 2: Run tests to verify they fail

python -m pytest tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_metric_and_doc_weights tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_defaults_to_empty_weights -v

Expected: FAIL — Scenario has no attribute 'metric_weights'

Step 3: Add fields to rag_eval/config/schema.py

In ScenarioModel, add after optimization_advisor:

metric_weights: dict[str, float] = Field(default_factory=dict)
doc_weights: dict[str, float] = Field(default_factory=dict)

Step 4: Add fields to rag_eval/shared/models.py

In Scenario dataclass, add after optimization_advisor: bool = False:

metric_weights: dict[str, float] = field(default_factory=dict)
doc_weights: dict[str, float] = field(default_factory=dict)
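
The reason for field(default_factory=dict) rather than a literal {} default can be seen in a small sketch. `ScenarioSketch` is a hypothetical stand-in that carries only the new fields; the real Scenario dataclass has more.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioSketch:
    """Hypothetical stand-in for Scenario, reduced to the new weight fields."""

    scenario_name: str
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)


first = ScenarioSketch("a")
second = ScenarioSketch("b")
# Each instance gets its own dict, so this mutation does not leak into `second`.
first.metric_weights["faithfulness"] = 0.7
```

A bare `= {}` default is rejected by the dataclass decorator with a ValueError, which is why the factory form is needed here.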

Step 5: Update rag_eval/config/loader.py

In load_scenario(), in the Scenario(...) constructor call, add after optimization_advisor=model.optimization_advisor,:

metric_weights=dict(model.metric_weights),
doc_weights=dict(model.doc_weights),

Step 6: Run tests to verify they pass

python -m pytest tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_metric_and_doc_weights tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_defaults_to_empty_weights -v

Expected: both PASS

Step 7: Commit

git add rag_eval/config/schema.py rag_eval/shared/models.py rag_eval/config/loader.py tests/test_offline_eval.py
git commit -m "feat: add metric_weights and doc_weights to Scenario schema and dataclass"

Task 3: Evaluator: _merge_score adds two columns

Files:

Modify: rag_eval/execution/evaluator.py

Interfaces:

Consumes: compute_weighted_score(scores, metric_weights) -> float | None from rag_eval.metrics.weights
Produces: scores.csv gains two columns, weighted_score: float | NaN and sample_weight: float

Step 1: Write failing test

Add to tests/test_offline_eval.py inside EvaluatorAndReportingTests:

def test_merge_score_includes_weighted_score_and_sample_weight(self):
    """_merge_score adds weighted_score and sample_weight columns."""
    from unittest.mock import MagicMock
    from rag_eval.execution.evaluator import Evaluator
    from rag_eval.shared.models import (
        MetricScore, NormalizedSample, RuntimeConfig, Scenario, DatasetConfig,
    )
    from pathlib import Path
    scenario = Scenario(
        scenario_name="w-test", mode="offline",
        dataset=DatasetConfig(path=Path("d.csv")),
        judge_model="m", embedding_model="e",
        metrics=["faithfulness", "context_recall"],
        output_dir=Path("out"),
        metric_weights={"faithfulness": 3.0, "context_recall": 1.0},
        doc_weights={"doc.pdf": 2.0},
    )
    evaluator = Evaluator(
        scenario=scenario,
        metric_pipeline=MagicMock(),
        app_adapter=None,
    )
    sample = NormalizedSample(
        sample_id="s1", question="q", contexts=["ctx"],
        answer="a", ground_truth="gt",
        metadata={"doc_name": "doc.pdf"},
    )
    score = MetricScore(metrics={"faithfulness": 1.0, "context_recall": 0.0})
    row = evaluator._merge_score(sample, score)
    # (3*1.0 + 1*0.0) / 4 = 0.75
    assert abs(row["weighted_score"] - 0.75) < 1e-4
    assert row["sample_weight"] == 2.0

Step 2: Run test to verify it fails

python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_merge_score_includes_weighted_score_and_sample_weight -v

Expected: FAIL — KeyError: 'weighted_score'

Step 3: Update rag_eval/execution/evaluator.py

Add import at top of file (after existing imports):

from rag_eval.metrics.weights import compute_weighted_score, resolve_weight

Replace _merge_score method:

def _merge_score(self, sample: NormalizedSample, score: Any) -> dict[str, Any]:
    """Combine sample data, metric results, run metadata, and weight columns."""
    record = sample.to_record()
    record["contexts"] = sample.contexts
    record.update(score.metrics)
    record["error"] = score.error
    record["judge_model"] = self.scenario.judge_model
    record["embedding_model"] = self.scenario.embedding_model
    record["run_id"] = self.scenario.scenario_name
    # Weighted score columns — enable post-hoc weighted aggregation in reporting.
    record["weighted_score"] = compute_weighted_score(
        score.metrics, self.scenario.metric_weights
    )
    doc_name = str(sample.metadata.get("doc_name", "") or "")
    record["sample_weight"] = resolve_weight(
        self.scenario.doc_weights, doc_name, default=1.0
    )
    return record
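
Because compute_weighted_score can return None, it may help to see how that value travels through a CSV round trip. The sketch below uses the stdlib csv module as an assumption; the project's actual scores.csv writer is not shown in this plan and may behave differently.

```python
import csv
import io
import math

rows = [
    {"sample_id": "s1", "weighted_score": 0.75, "sample_weight": 2.0},
    {"sample_id": "s2", "weighted_score": None, "sample_weight": 1.0},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample_id", "weighted_score", "sample_weight"])
writer.writeheader()
writer.writerows(rows)  # csv writes None as an empty field

buffer.seek(0)
# On the way back, treat an empty field as NaN so downstream code can skip it.
parsed = [
    float(r["weighted_score"]) if r["weighted_score"] else math.nan
    for r in csv.DictReader(buffer)
]
```

The weighted_metric_means and compute_overall_weighted_score_mean functions from Task 1 already skip both None and NaN, so either representation is safe for the reporting step.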

Step 4: Run test to verify it passes

python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_merge_score_includes_weighted_score_and_sample_weight -v

Expected: PASS

Step 5: Commit

git add rag_eval/execution/evaluator.py tests/test_offline_eval.py
git commit -m "feat: add weighted_score and sample_weight columns to score rows"

Task 4: Report summary: switch to weighted means

Files:

Modify: rag_eval/reporting/summary.py

Interfaces:

Consumes:
- weighted_metric_means(score_rows, metrics, doc_weights) -> dict[str, float | None]
- compute_overall_weighted_score_mean(score_rows, metric_weights, doc_weights) -> float | None
- Both from rag_eval.metrics.weights

Step 1: Write failing test

Add to tests/test_offline_eval.py inside EvaluatorAndReportingTests:

def test_summary_markdown_shows_weighted_score(self):
    """build_summary_markdown includes weighted_score when metric_weights set."""
    import math
    from rag_eval.reporting.summary import build_summary_markdown
    from rag_eval.shared.models import (
        EvaluationResult, NormalizedSample, DatasetConfig, Scenario, RuntimeConfig,
    )
    from pathlib import Path
    scenario = Scenario(
        scenario_name="ws-test", mode="offline",
        dataset=DatasetConfig(path=Path("d.csv")),
        judge_model="m", embedding_model="e",
        metrics=["faithfulness"],
        output_dir=Path("out"),
        metric_weights={"faithfulness": 1.0},
        doc_weights={},
    )
    sample = NormalizedSample(
        sample_id="s1", question="q", contexts=["c"],
        answer="a", ground_truth="gt",
    )
    result = EvaluationResult(
        scenario=scenario, run_id="r1",
        started_at="2026-01-01T00:00:00", finished_at="2026-01-01T00:01:00",
        valid_samples=[sample], invalid_samples=[],
        score_rows=[{
            "sample_id": "s1", "faithfulness": 0.8,
            "weighted_score": 0.8, "sample_weight": 1.0,
            "doc_name": "", "error": "",
        }],
    )
    md = build_summary_markdown(result)
    assert "weighted_score" in md
    assert "0.8000" in md

Step 2: Run test to verify it fails

python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_summary_markdown_shows_weighted_score -v

Expected: FAIL — "weighted_score" not in md

Step 3: Update rag_eval/reporting/summary.py

Replace the entire file:

"""Markdown summary generation for completed evaluation runs."""

from __future__ import annotations

import math

import pandas as pd

from rag_eval.metrics.weights import (
    compute_overall_weighted_score_mean,
    weighted_metric_means,
)
from rag_eval.shared.models import EvaluationResult


def _table_from_frame(frame: pd.DataFrame) -> str:
    """Render a small dataframe as a fixed-width markdown-friendly text table."""
    if frame.empty:
        return "No rows."

    columns = list(frame.columns)
    rows = [[str(value) for value in row] for row in frame.astype(object).values.tolist()]
    widths = []
    for index, column in enumerate(columns):
        column_width = len(str(column))
        row_width = max((len(row[index]) for row in rows), default=0)
        widths.append(max(column_width, row_width))

    header = " | ".join(str(column).ljust(widths[idx]) for idx, column in enumerate(columns))
    separator = "-|-".join("-" * widths[idx] for idx in range(len(columns)))
    body = [
        " | ".join(row[idx].ljust(widths[idx]) for idx in range(len(columns)))
        for row in rows
    ]
    return "\n".join([header, separator, *body])


def build_summary_markdown(result: EvaluationResult) -> str:
    """Build the human-readable markdown summary written for each evaluation run."""
    total = len(result.valid_samples) + len(result.invalid_samples)
    scores = pd.DataFrame(result.score_rows)

    lines = [
        f"# {result.scenario.scenario_name}",
        "",
        f"- run_id: `{result.run_id}`",
        f"- mode: `{result.scenario.mode}`",
        f"- total_samples: `{total}`",
        f"- valid_samples: `{len(result.valid_samples)}`",
        f"- invalid_samples: `{len(result.invalid_samples)}`",
        f"- judge_model: `{result.scenario.judge_model}`",
        f"- embedding_model: `{result.scenario.embedding_model}`",
        "",
        "## Metric Means",
        "",
    ]

    if scores.empty:
        lines.append("No valid samples were scored.")
        return "\n".join(lines) + "\n"

    score_rows_list = scores.to_dict(orient="records")
    w_means = weighted_metric_means(
        score_rows_list, result.scenario.metrics, result.scenario.doc_weights
    )

    has_weights = bool(result.scenario.metric_weights or result.scenario.doc_weights)
    weight_suffix = " (weighted)" if has_weights else ""

    for metric in result.scenario.metrics:
        mean_value = w_means.get(metric)
        w = result.scenario.metric_weights.get(metric, 1.0) if result.scenario.metric_weights else 1.0
        weight_note = f"  (w={w:.2f})" if result.scenario.metric_weights else ""
        if mean_value is not None and not math.isnan(mean_value):
            lines.append(f"- {metric}: `{mean_value:.4f}`{weight_note}")
        else:
            lines.append(f"- {metric}: `n/a`{weight_note}")

    overall_ws = compute_overall_weighted_score_mean(
        score_rows_list, result.scenario.metric_weights, result.scenario.doc_weights
    )
    if overall_ws is not None and not math.isnan(overall_ws):
        lines.append(f"- **weighted_score{weight_suffix}: `{overall_ws:.4f}`**")
    else:
        lines.append(f"- **weighted_score{weight_suffix}: `n/a`**")

    detail_columns = ["sample_id", *result.scenario.metrics, "weighted_score", "error"]
    existing_columns = [c for c in detail_columns if c in scores.columns]
    detail = scores[existing_columns]
    lines.extend([
        "",
        "## Per-sample Scores",
        "",
        "```text",
        _table_from_frame(detail),
        "```",
    ])
    return "\n".join(lines) + "\n"
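
For reference, the per-metric lines in the "Metric Means" section follow the pattern sketched below. `format_metric_line` is a hypothetical helper that isolates the formatting from the loop above. It takes `weight=None` to mean "no metric_weights configured", which is an assumption of this sketch, not a signature from the plan.

```python
from __future__ import annotations


def format_metric_line(metric: str, mean: float | None, weight: float | None) -> str:
    """Format one summary bullet; the weight note appears only when a weight is given."""
    note = f"  (w={weight:.2f})" if weight is not None else ""
    value = f"{mean:.4f}" if mean is not None else "n/a"
    return f"- {metric}: `{value}`{note}"


with_weight = format_metric_line("faithfulness", 0.8, 3.0)
without_data = format_metric_line("context_recall", None, None)
```

The first call yields the line with a four-decimal mean and a two-decimal weight note; the second falls back to `n/a` with no note.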

Step 4: Run test to verify it passes

python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_summary_markdown_shows_weighted_score -v

Expected: PASS

Step 5: Run full offline test suite to check no regressions

python -m pytest tests/test_offline_eval.py tests/test_weights.py -v

Expected: all PASS

Step 6: Commit

git add rag_eval/reporting/summary.py tests/test_offline_eval.py
git commit -m "feat: use weighted metric means and add weighted_score row to summary.md"

Task 5: Extend yaml_patcher

Files:

Modify: webapp/services/yaml_patcher.py
Modify: webapp/models.py
Modify: webapp/api/llm_profiles.py

Interfaces:

Produces:
- apply_profiles_to_scenario(..., metric_weights=None, doc_weights=None) — new optional params
- ProfileApplyRequest.metric_weights: dict[str, float] | None
- ProfileApplyRequest.doc_weights: dict[str, float] | None

Step 1: Write failing test

Add to tests/webapp/test_llm_profiles_api.py:

def test_apply_metric_weights_patches_yaml(tmp_path):
    """Applying metric_weights writes them into the YAML."""
    scenario_file = tmp_path / "w-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: m\nembedding_model: e\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n",
        encoding="utf-8",
    )
    from webapp.services.yaml_patcher import apply_profiles_to_scenario
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None, answer_profile=None, dataset_profile=None,
        metric_weights={"faithfulness": 0.7, "context_recall": 0.3},
        _resolve_absolute=True,
    )
    assert "metric_weights" in patched
    data = yaml_lib.safe_load(scenario_file.read_text())
    assert data["metric_weights"]["faithfulness"] == pytest.approx(0.7)


def test_apply_doc_weights_patches_yaml(tmp_path):
    """Applying doc_weights writes them into the YAML."""
    scenario_file = tmp_path / "dw-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: m\nembedding_model: e\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n",
        encoding="utf-8",
    )
    from webapp.services.yaml_patcher import apply_profiles_to_scenario
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None, answer_profile=None, dataset_profile=None,
        doc_weights={"doc.pdf": 2.0},
        _resolve_absolute=True,
    )
    assert "doc_weights" in patched
    data = yaml_lib.safe_load(scenario_file.read_text())
    assert data["doc_weights"]["doc.pdf"] == pytest.approx(2.0)

Step 2: Run tests to verify they fail

python -m pytest tests/webapp/test_llm_profiles_api.py::test_apply_metric_weights_patches_yaml tests/webapp/test_llm_profiles_api.py::test_apply_doc_weights_patches_yaml -v

Expected: FAIL — unexpected keyword argument 'metric_weights'

Step 3: Update webapp/services/yaml_patcher.py

Replace apply_profiles_to_scenario signature and body:

def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    metric_weights: dict[str, float] | None = None,
    doc_weights: dict[str, float] | None = None,
    _resolve_absolute: bool = False,
) -> list[str]:
    """Patch the YAML file at *scenario_path* with the supplied profiles and weights.

    Returns a list of dotted field names that were actually patched.
    """
    if _resolve_absolute:
        resolved = Path(scenario_path)
    else:
        resolved = _resolve_scenario_path(scenario_path)

    if not resolved.exists():
        raise FileNotFoundError(f"Scenario file not found: {resolved}")

    data: dict[str, Any] = yaml.safe_load(resolved.read_text(encoding="utf-8")) or {}
    patched: list[str] = []

    if judge_profile is not None:
        data["judge_model"] = judge_profile.model
        patched.append("judge_model")

    if answer_profile is not None:
        adapter = data.get("app_adapter")
        if isinstance(adapter, dict):
            static_kwargs = adapter.setdefault("static_kwargs", {})
            static_kwargs["model"] = answer_profile.model
            patched.append("app_adapter.static_kwargs.model")

    if dataset_profile is not None:
        generation = data.get("generation")
        if isinstance(generation, dict):
            generation["model"] = dataset_profile.model
            patched.append("generation.model")

    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")

    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")

    resolved.write_text(
        yaml.dump(data, allow_unicode=True, default_flow_style=False, sort_keys=False),
        encoding="utf-8",
    )
    return patched
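
One consequence of the `is not None` checks is worth spelling out: passing None leaves an existing key alone, while passing an empty dict overwrites it. The sketch below isolates that branch on a plain dict, with no file I/O. `patch_weights` is a hypothetical reduction of the function above, not a planned helper.

```python
from __future__ import annotations

from typing import Any


def patch_weights(data: dict[str, Any], metric_weights: dict[str, float] | None) -> list[str]:
    """Mirror the `is not None` check: None skips the key, {} overwrites it."""
    patched: list[str] = []
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")
    return patched


existing: dict[str, Any] = {"scenario_name": "test", "metric_weights": {"faithfulness": 0.7}}

untouched = patch_weights(existing, None)   # nothing patched
snapshot_after_none = dict(existing)        # still holds the original weights
cleared = patch_weights(existing, {})       # key overwritten with an empty mapping
```

Note also that the function as written rewrites the file through yaml.dump on every call, so comments in the original YAML are not preserved after patching.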

Step 4: Update webapp/models.py — ProfileApplyRequest

Add two fields to ProfileApplyRequest:

class ProfileApplyRequest(BaseModel):
    """Request body to patch LLM profile selections into a scenario YAML."""

    scenario_path: str
    judge_profile_id: str | None = None
    answer_profile_id: str | None = None
    dataset_profile_id: str | None = None
    metric_weights: dict[str, float] | None = Field(
        default=None,
        description="Metric weight mapping, e.g. {\"faithfulness\": 0.35}. When null, the YAML is left unchanged.",
    )
    doc_weights: dict[str, float] | None = Field(
        default=None,
        description="Document weight mapping, e.g. {\"doc.pdf\": 2.0}. When null, the YAML is left unchanged.",
    )
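
A request body matching this model might look like the sketch below. The scenario path and weight values are placeholders, and the route that accepts the body is not shown because this plan does not state it.

```python
import json

payload = {
    "scenario_path": "scenarios/example.yaml",  # hypothetical path
    "judge_profile_id": None,
    "metric_weights": {"faithfulness": 0.35, "context_recall": 0.65},
    "doc_weights": None,  # null: leave doc_weights in the YAML untouched
}

# None serializes to JSON null, which the model reads back as None.
body = json.dumps(payload)
```

Omitting a weight field entirely has the same effect as sending null, since both fields default to None.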

Step 5: Update webapp/api/llm_profiles.py — apply_profiles endpoint

In apply_profiles(), update the call to apply_profiles_to_scenario:

patched = apply_profiles_to_scenario(
    scenario_path=request.scenario_path,
    judge_profile=role_profiles["judge"],
    answer_profile=role_profiles["answer"],
    dataset_profile=role_profiles["dataset"],
    metric_weights=request.metric_weights,
    doc_weights=request.doc_weights,
)

Step 6: Run tests to verify they pass

python -m pytest tests/webapp/test_llm_profiles_api.py -v

Expected: all 15 tests PASS

Step 7: Commit

git add webapp/services/yaml_patcher.py webapp/models.py webapp/api/llm_profiles.py tests/webapp/test_llm_profiles_api.py
git commit -m "feat: yaml_patcher and ProfileApplyRequest support metric_weights and doc_weights"

Task 6: Weighted support in report_builder and run_reader

Files:

Modify: webapp/services/run_reader.py
Modify: webapp/services/report_builder.py
Modify: webapp/models.py (add weighted_score_mean to ReportData)

Interfaces:

Consumes:
- weighted_metric_means(score_rows, metrics, doc_weights) from rag_eval.metrics.weights
- compute_overall_weighted_score_mean(score_rows, metric_weights, doc_weights) from rag_eval.metrics.weights
Produces:
- ReportData.weighted_score_mean: float | None
- _read_weights_from_snapshot(run_dir) -> tuple[dict, dict] in run_reader
Step 1: Update webapp/models.py — ReportData

Add weighted_score_mean field:

class ReportData(BaseModel):
    """Aggregated report payload rendered by the report detail page."""

    metrics: list[str] = Field(default_factory=list)
    metric_means: dict[str, float | None] = Field(default_factory=dict)
    distributions: dict[str, list[DistributionBin]] = Field(default_factory=dict)
    groupings: dict[str, list[GroupStat]] = Field(default_factory=dict)
    lowest_samples: list[SampleScore] = Field(default_factory=list)
    summary_markdown: str = ""
    advice_markdown: str = ""
    weighted_score_mean: float | None = Field(
        default=None,
        description="Mean weighted overall score (metric_weights and doc_weights applied together). With equal weights it equals the mean of the per-metric means.",
    )
    metric_weights: dict[str, float] = Field(
        default_factory=dict,
        description="Metric weight configuration used by this run (from scenario.snapshot.yaml).",
    )
    doc_weights: dict[str, float] = Field(
        default_factory=dict,
        description="Document weight configuration used by this run (from scenario.snapshot.yaml).",
    )
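
To make the `weighted_score_mean` description concrete, here is a rough sketch of one way the overall score could be computed. It assumes each row's score is the metric-weighted average of its metric values (w / Σw), and that rows are then averaged by a per-sample weight. The function names are hypothetical; the real logic lives in `rag_eval/metrics/weights.py` and may differ.

```python
from __future__ import annotations


def row_weighted_score(row: dict, metrics: list[str], metric_weights: dict[str, float]) -> float | None:
    """Metric-weighted average for one row; metrics without a weight default to 1.0."""
    numerator = 0.0
    denominator = 0.0
    for metric in metrics:
        score = row.get(metric)
        if score is None:
            continue  # skip metrics that were not scored for this row
        weight = metric_weights.get(metric, 1.0)
        numerator += weight * score
        denominator += weight
    return numerator / denominator if denominator > 0 else None


def overall_weighted_score(rows, metrics, metric_weights, sample_weights) -> float | None:
    """Average the per-row scores, weighting each row by its sample weight."""
    numerator = 0.0
    denominator = 0.0
    for row, sample_weight in zip(rows, sample_weights):
        score = row_weighted_score(row, metrics, metric_weights)
        if score is None:
            continue
        numerator += sample_weight * score
        denominator += sample_weight
    return numerator / denominator if denominator > 0 else None


metrics = ["faithfulness", "answer_relevancy"]
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5},
    {"faithfulness": 0.5, "answer_relevancy": 0.5},
]

# Equal weights: matches the mean of the per-metric means (0.75 and 0.5).
equal = overall_weighted_score(rows, metrics, {}, [1.0, 1.0])

# Tripling the faithfulness weight pulls the score toward that metric.
skewed = overall_weighted_score(rows, metrics, {"faithfulness": 3.0}, [1.0, 1.0])
```

With no missing values, `equal` comes out to 0.625, the same as averaging the two metric means, which is the equal-weight property the field description states.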

Step 2: Add _read_weights_from_snapshot to webapp/services/run_reader.py

Add after _read_metrics_from_snapshot:

def _read_weights_from_snapshot(run_dir: Path) -> tuple[dict[str, float], dict[str, float]]:
    """Read metric_weights and doc_weights from a scenario snapshot if present.

    Returns a (metric_weights, doc_weights) tuple of plain dicts.
    Both default to empty dicts when the snapshot is absent or lacks the fields.
    """
    snapshot = run_dir / "scenario.snapshot.yaml"
    if not snapshot.is_file():
        return {}, {}
    try:
        payload = yaml.safe_load(snapshot.read_text(encoding="utf-8")) or {}
    except (OSError, yaml.YAMLError):
        return {}, {}
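    if not isinstance(payload, dict):
        return {}, {}
    mw = payload.get("metric_weights")
    dw = payload.get("doc_weights")
    # Ignore malformed (non-mapping) values instead of raising AttributeError on .items().
    mw = mw if isinstance(mw, dict) else {}
    dw = dw if isinstance(dw, dict) else {}
    return (
        {str(k): float(v) for k, v in mw.items() if isinstance(v, (int, float))},
        {str(k): float(v) for k, v in dw.items() if isinstance(v, (int, float))},
    )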
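
One subtlety in the numeric filter is that `bool` is a subclass of `int` in Python, so a YAML value such as `true` would pass `isinstance(v, (int, float))` and become `1.0`. The sketch below shows a stricter variant. `coerce_weights` is a hypothetical helper, and rejecting negative or non-finite values is an assumption based on the UI's `min="0"` inputs rather than something the plan requires.

```python
from __future__ import annotations

import math
from typing import Any


def coerce_weights(raw: Any) -> dict[str, float]:
    """Hypothetical helper: keep only finite, non-negative numeric weights."""
    if not isinstance(raw, dict):
        return {}
    cleaned: dict[str, float] = {}
    for key, value in raw.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        number = float(value)
        if math.isfinite(number) and number >= 0:
            cleaned[str(key)] = number
    return cleaned


sample = {
    "faithfulness": 2,           # kept, converted to 2.0
    "answer_relevancy": "high",  # dropped: not numeric
    "context_recall": True,      # dropped: bool
    "context_precision": -1.0,   # dropped: negative (assumed invalid)
}
result = coerce_weights(sample)
```

Whether to adopt the stricter filter is a design choice; silently dropping bad entries keeps report pages loading, at the cost of hiding configuration mistakes.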

Step 3: Update webapp/services/report_builder.py

Replace the _metric_means call in build_report with the weighted helpers. Start by adding these imports:

# Add imports at top:
from rag_eval.metrics.weights import (
    compute_overall_weighted_score_mean,
    weighted_metric_means as _weighted_metric_means,
)
from webapp.services.run_reader import _read_weights_from_snapshot

Replace build_report:

def build_report(run_dir: Path, metrics: list[str]) -> ReportData:
    """Build the full aggregated report payload for one run directory."""
    frame = run_reader.read_scores_frame(run_dir)
    summary_markdown = run_reader.read_summary_markdown(run_dir)
    advice_markdown = run_reader.read_advice_markdown(run_dir)
    metric_weights, doc_weights = _read_weights_from_snapshot(run_dir)

    if frame.empty or not metrics:
        return ReportData(
            metrics=metrics,
            metric_means={metric: None for metric in metrics},
            summary_markdown=summary_markdown,
            advice_markdown=advice_markdown,
            metric_weights=metric_weights,
            doc_weights=doc_weights,
        )

    score_rows_list = frame.to_dict(orient="records")

    # Use weighted metric means (degrades to arithmetic mean when weights are empty).
    w_means = _weighted_metric_means(score_rows_list, metrics, doc_weights)
    rounded_means = {m: _round_or_none(v) for m, v in w_means.items()}

    overall_ws = compute_overall_weighted_score_mean(
        score_rows_list, metric_weights, doc_weights
    )

    distributions = {
        metric: _distribution(frame, metric)
        for metric in metrics
        if metric in frame.columns
    }

    return ReportData(
        metrics=metrics,
        metric_means=rounded_means,
        distributions=distributions,
        groupings=_groupings(frame, metrics),
        lowest_samples=_lowest_samples(frame, metrics),
        summary_markdown=summary_markdown,
        advice_markdown=advice_markdown,
        weighted_score_mean=_round_or_none(overall_ws),
        metric_weights=metric_weights,
        doc_weights=doc_weights,
    )

Also delete the old _metric_means function (it is replaced by the weighted version).
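
The comment in `build_report` says the weighted means degrade to the arithmetic mean when no weights are configured. A small sketch of why that holds: if every sample weight defaults to 1.0, the w / Σw formula reduces to a plain average. The `source_doc` column name and the `weighted_mean` helper are assumptions for illustration, not the real `weighted_metric_means` implementation.

```python
from __future__ import annotations

import math


def weighted_mean(values: list, weights: list[float]) -> float | None:
    """Weighted mean that skips missing (None/NaN) values; None if nothing remains."""
    total = 0.0
    weight_sum = 0.0
    for value, weight in zip(values, weights):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        total += weight * value
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


rows = [
    {"source_doc": "a.pdf", "faithfulness": 1.0},
    {"source_doc": "b.pdf", "faithfulness": 0.0},
]
scores = [row["faithfulness"] for row in rows]

# No doc_weights: every row weighs 1.0, so this is the arithmetic mean.
no_weights: dict[str, float] = {}
plain = weighted_mean(scores, [no_weights.get(r["source_doc"], 1.0) for r in rows])

# Weighting a.pdf three times as heavily shifts the mean toward its score.
doc_weights = {"a.pdf": 3.0}
weighted = weighted_mean(scores, [doc_weights.get(r["source_doc"], 1.0) for r in rows])
```

Here `plain` is 0.5 and `weighted` is 0.75, so existing reports without weights should keep showing the same per-metric means.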

Step 4: Run existing webapp tests to check no regressions

python -m pytest tests/webapp/ -v

Expected: all PASS

Step 5: Commit

git add webapp/models.py webapp/services/run_reader.py webapp/services/report_builder.py
git commit -m "feat: report_builder uses weighted metric means; ReportData gains weighted_score_mean"

Task 7: scenario_scanner reads weight fields for the frontend

Files:

Modify: webapp/models.py (add fields to ScenarioInfo)
Modify: webapp/services/scenario_scanner.py

Interfaces:

Produces:
- ScenarioInfo.metric_weights: dict[str, float]
- ScenarioInfo.doc_weights: dict[str, float]

Step 1: Add fields to ScenarioInfo in webapp/models.py

class ScenarioInfo(BaseModel):
    """One discoverable scenario YAML file that can be evaluated from the UI."""

    path: str
    scenario_name: str = ""
    mode: str = ""
    dataset: str = ""
    judge_model: str = ""
    metrics: list[str] = Field(default_factory=list)
    error: str = ""
    metric_weights: dict[str, float] = Field(default_factory=dict)
    doc_weights: dict[str, float] = Field(default_factory=dict)

Step 2: Update webapp/services/scenario_scanner.py — _summarize_scenario

After the metric_list line, add weight extraction:

raw_metric_weights = payload.get("metric_weights") or {}
raw_doc_weights = payload.get("doc_weights") or {}
metric_weights = {str(k): float(v) for k, v in raw_metric_weights.items()
                  if isinstance(v, (int, float))}
doc_weights = {str(k): float(v) for k, v in raw_doc_weights.items()
               if isinstance(v, (int, float))}

return ScenarioInfo(
    path=relative,
    scenario_name=str(payload.get("scenario_name", "")),
    mode=str(payload.get("mode", "")),
    dataset=str(payload.get("dataset", "")),
    judge_model=str(payload.get("judge_model", "")),
    metrics=metric_list,
    metric_weights=metric_weights,
    doc_weights=doc_weights,
)

Step 3: Run existing tests

python -m pytest tests/webapp/ tests/test_offline_eval.py -v

Expected: all PASS

Step 4: Commit

git add webapp/models.py webapp/services/scenario_scanner.py
git commit -m "feat: ScenarioInfo exposes metric_weights and doc_weights from YAML"

Task 8: Frontend weight configuration panel and report card

Files:

Modify: webapp/static/index.html
Modify: webapp/static/js/runner.js
Modify: webapp/static/css/app.css
Modify: webapp/static/js/report.js

Interfaces:

Consumes: ScenarioInfo.metric_weights, ScenarioInfo.doc_weights (from /api/scenarios)
Consumes: ReportData.weighted_score_mean, ReportData.metric_weights (from /api/runs/{id})
Produces: weight panel HTML on the 新建评估 (New Evaluation) page; weighted_score card on the report page

Step 1: Add weight panel HTML to index.html

Add this block immediately after the closing </div> of #llm-assignment-panel (before <div class="panel" id="task-panel">):

<!-- Weight configuration panel (shown once a scenario is selected) -->
<div class="panel weight-config-panel" id="weight-config-panel" hidden>
  <h2>权重配置 <span class="muted" style="font-size:13px;font-weight:400">（可选，留空使用场景原始配置）</span></h2>

  <div class="weight-section">
    <div class="weight-section-title">指标权重 <span class="muted" style="font-size:12px">（数值越大该指标在综合得分中占比越高）</span></div>
    <div id="metric-weight-rows" class="weight-rows"></div>
  </div>

  <div class="weight-section" style="margin-top:16px">
    <div class="weight-section-title">文档权重 <span class="muted" style="font-size:12px">（按 PDF 文件名，数值越大该文档的题目在汇总时贡献越大）</span></div>
    <div id="doc-weight-rows" class="weight-rows"></div>
    <button class="btn btn-sm" id="add-doc-weight-btn" style="margin-top:8px">＋ 添加文档权重</button>
  </div>
</div>

Step 2: Add CSS to app.css

Add at the end of the file, or just before the @media print block if one exists:

/* ── Weight configuration panel ───────────────────── */
.weight-config-panel { margin-top: 12px; }
.weight-section-title { font-size: 13px; font-weight: 600; color: var(--text); margin-bottom: 8px; }
.weight-rows { display: flex; flex-direction: column; gap: 6px; }
.weight-row {
  display: flex; align-items: center; gap: 10px;
  font-size: 13px;
}
.weight-row-label { min-width: 180px; color: var(--slate); font-family: monospace; }
.weight-row-input {
  width: 80px; padding: 4px 8px; border: 1px solid var(--border);
  border-radius: 6px; font-size: 13px; text-align: right;
}
.weight-row-input:focus { outline: none; border-color: #6366f1; }
.doc-weight-name {
  flex: 1; padding: 4px 8px; border: 1px solid var(--border);
  border-radius: 6px; font-size: 13px; min-width: 0;
}
.weight-row-remove { color: var(--bad); cursor: pointer; font-size: 14px; background: none; border: none; padding: 2px 6px; }
.weight-row-remove:hover { background: #fee2e2; border-radius: 4px; }

/* Highlight the weighted_score metric card */
.metric-card.weighted-score-card {
  border: 2px solid #6366f1;
  background: #f5f3ff;
}
.metric-card.weighted-score-card .metric-name { color: #4f46e5; font-weight: 700; }

Step 3: Update runner.js

Replace the entire runner.js with:

// runner.js: New Evaluation view. Lists scenarios, configures LLM roles and weights, triggers evaluations, and polls task status.

const Runner = {
  selectedScenario: null,
  selectedScenarioInfo: null,
  pollTimer: null,
  lastRunId: null,

  init() {
    document.getElementById("run-btn").addEventListener("click", () => Runner.trigger());
    document.getElementById("view-report-btn").addEventListener("click", () => {
      if (Runner.lastRunId) {
        App.enableReportNav();
        App.navigate("report", Runner.lastRunId);
      }
    });
    document.getElementById("add-doc-weight-btn").addEventListener("click", () => Runner._addDocWeightRow());
  },

  async loadScenarios() {
    const list = document.getElementById("scenario-list");
    list.innerHTML = '<p class="muted">加载中…</p>';
    try {
      const data = await API.scenarios();
      const scenarios = data.scenarios || [];
      if (scenarios.length === 0) {
        list.innerHTML = '<p class="muted">未在 scenarios/ 下找到场景文件。</p>';
        return;
      }
      list.innerHTML = "";
      scenarios.forEach((sc) => list.appendChild(Runner.renderScenarioItem(sc)));
    } catch (err) {
      list.innerHTML = `<p class="muted">加载失败：${App.escape(err.message)}</p>`;
    }
    Runner._populateProfileSelects();
  },

  async _populateProfileSelects() {
    const cached = Profiles.getAll();
    const profiles = cached.length > 0
      ? cached
      : (await API.profiles().catch(() => ({ profiles: [] }))).profiles;
    ["role-judge", "role-answer", "role-dataset"].forEach(id => {
      const sel = document.getElementById(id);
      sel.innerHTML = '<option value="">— 使用场景原始配置 —</option>';
      profiles.forEach(p => {
        const opt = document.createElement("option");
        opt.value = p.profile_id;
        opt.textContent = `${p.name}  (${p.model})`;
        sel.appendChild(opt);
      });
    });
  },

  renderScenarioItem(sc) {
    const item = document.createElement("div");
    const invalid = !!sc.error;
    item.className = "scenario-item" + (invalid ? " invalid" : "");
    const modeTag = sc.mode
      ? `<span class="tag mode-${App.escape(sc.mode)}">${App.escape(sc.mode)}</span>`
      : "";
    const metricCount = (sc.metrics || []).length;
    item.innerHTML = `
      <div>
        <div class="scenario-name">${App.escape(sc.scenario_name || sc.path)}</div>
        <div class="scenario-path">${App.escape(sc.path)}</div>
        ${sc.error ? `<div class="scenario-path" style="color:#dc2626">${App.escape(sc.error)}</div>` : ""}
      </div>
      <div class="scenario-tags">
        ${modeTag}
        <span class="tag">${metricCount} 指标</span>
      </div>
    `;
    if (!invalid) {
      item.addEventListener("click", () => {
        document.querySelectorAll(".scenario-item").forEach((el) => el.classList.remove("selected"));
        item.classList.add("selected");
        Runner.selectedScenario = sc.path;
        Runner.selectedScenarioInfo = sc;
        document.getElementById("selected-scenario").textContent = sc.path;
        document.getElementById("run-btn").disabled = false;
        document.getElementById("llm-assignment-panel").hidden = false;
        Runner._renderWeightPanel(sc);
        document.getElementById("weight-config-panel").hidden = false;
      });
    }
    return item;
  },

  // Render the metric weight rows dynamically for the selected scenario
  _renderWeightPanel(sc) {
    const metricRows = document.getElementById("metric-weight-rows");
    metricRows.innerHTML = "";
    const metrics = sc.metrics || [];
    const existingWeights = sc.metric_weights || {};
    metrics.forEach(metric => {
      const row = document.createElement("div");
      row.className = "weight-row";
      const currentVal = existingWeights[metric] != null ? existingWeights[metric] : 1.0;
      row.innerHTML = `
        <span class="weight-row-label">${App.escape(metric)}</span>
        <input class="weight-row-input" type="number" min="0" step="0.1"
               data-metric="${App.escape(metric)}" value="${currentVal}" />
      `;
      metricRows.appendChild(row);
    });

    // Populate any existing document weights
    const docRows = document.getElementById("doc-weight-rows");
    docRows.innerHTML = "";
    const existingDocWeights = sc.doc_weights || {};
    Object.entries(existingDocWeights).forEach(([docName, w]) => {
      Runner._addDocWeightRow(docName, w);
    });
  },

  // Add one document weight input row
  _addDocWeightRow(docName = "", weight = 1.0) {
    const container = document.getElementById("doc-weight-rows");
    const row = document.createElement("div");
    row.className = "weight-row";
    row.innerHTML = `
      <input class="doc-weight-name" type="text" placeholder="PDF 文件名（如 322_双源CT.pdf）" value="${App.escape(docName)}" />
      <input class="weight-row-input" type="number" min="0" step="0.1" value="${weight}" />
      <button class="weight-row-remove" title="删除">✕</button>
    `;
    row.querySelector(".weight-row-remove").addEventListener("click", () => row.remove());
    container.appendChild(row);
  },

  // Collect the current values from the weight panel
  _collectWeights() {
    const metricWeights = {};
    document.querySelectorAll("#metric-weight-rows .weight-row-input").forEach(input => {
      const metric = input.dataset.metric;
      const val = parseFloat(input.value);
      if (metric && !isNaN(val)) metricWeights[metric] = val;
    });

    const docWeights = {};
    document.querySelectorAll("#doc-weight-rows .weight-row").forEach(row => {
      const nameInput = row.querySelector(".doc-weight-name");
      const valInput = row.querySelector(".weight-row-input");
      if (!nameInput || !valInput) return;
      const name = nameInput.value.trim();
      const val = parseFloat(valInput.value);
      if (name && !isNaN(val)) docWeights[name] = val;
    });

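    // Skip sending only when everything is at the default (all metric weights 1.0, no doc weights)
    // and the scenario has no stored weights that a reset would need to overwrite.
    const stored = Runner.selectedScenarioInfo || {};
    const hasStored = Object.keys(stored.metric_weights || {}).length > 0
                   || Object.keys(stored.doc_weights || {}).length > 0;
    const allDefault = Object.values(metricWeights).every(v => Math.abs(v - 1.0) < 1e-9)
                    && Object.keys(docWeights).length === 0;
    if (allDefault && !hasStored) return { metricWeights: null, docWeights: null };
    return { metricWeights, docWeights };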
  },

  async trigger() {
    if (!Runner.selectedScenario) return;
    const runBtn = document.getElementById("run-btn");
    runBtn.disabled = true;
    const panel = document.getElementById("task-panel");
    const logBox = document.getElementById("task-log");
    const statusBadge = document.getElementById("task-status");
    const reportBtn = document.getElementById("view-report-btn");
    panel.hidden = false;
    reportBtn.hidden = true;
    logBox.textContent = "";
    Runner._setStatus(statusBadge, "queued");
    try {
      await Runner._applyProfilesIfNeeded(logBox);
      const resp = await API.triggerEvaluation(Runner.selectedScenario);
      Runner.poll(resp.task_id);
    } catch (err) {
      Runner._setStatus(statusBadge, "failed");
      logBox.textContent = (logBox.textContent ? logBox.textContent + "\n" : "") + `触发失败：${err.message}`;
      runBtn.disabled = false;
    }
  },

  async _applyProfilesIfNeeded(logBox) {
    const judgeId = document.getElementById("role-judge").value;
    const answerId = document.getElementById("role-answer").value;
    const datasetId = document.getElementById("role-dataset").value;
    const { metricWeights, docWeights } = Runner._collectWeights();

    if (!judgeId && !answerId && !datasetId && !metricWeights && !docWeights) return;

    logBox.textContent = "正在将 LLM 配置和权重写入场景文件…\n";
    const body = {
      scenario_path: Runner.selectedScenario,
      judge_profile_id: judgeId || null,
      answer_profile_id: answerId || null,
      dataset_profile_id: datasetId || null,
      metric_weights: metricWeights,
      doc_weights: docWeights,
    };
    const result = await API.applyProfiles(body);
    const fields = (result.patched_fields || []).join(", ");
    logBox.textContent += fields
      ? `✓ 已更新字段：${fields}\n`
      : "（未找到可更新的字段，继续运行）\n";
  },

  poll(taskId) {
    const logBox = document.getElementById("task-log");
    const statusBadge = document.getElementById("task-status");
    const reportBtn = document.getElementById("view-report-btn");
    const runBtn = document.getElementById("run-btn");
    if (Runner.pollTimer) clearInterval(Runner.pollTimer);
    Runner.pollTimer = setInterval(async () => {
      try {
        const status = await API.taskStatus(taskId);
        logBox.textContent = (status.logs || []).join("\n");
        logBox.scrollTop = logBox.scrollHeight;
        Runner._setStatus(statusBadge, status.status);
        if (status.status === "completed" || status.status === "failed") {
          clearInterval(Runner.pollTimer);
          runBtn.disabled = false;
          if (status.status === "completed" && status.run_id) {
            Runner.lastRunId = status.run_id;
            sessionStorage.setItem("rag_run_id", status.run_id);
            reportBtn.hidden = false;
          }
        }
      } catch (err) {
        clearInterval(Runner.pollTimer);
        logBox.textContent += `\n轮询失败：${err.message}`;
        runBtn.disabled = false;
      }
    }, 1200);
  },

  _setStatus(badge, status) {
    badge.textContent = status;
    badge.className = "badge " + status;
  },
};

Step 4: Update report.js — renderMetricCards

In renderMetricCards, after the metrics.forEach loop that renders individual cards, append this block to show the weighted_score card:

// Append at the end of renderMetricCards, after metrics.forEach:
const wsValue = report.weighted_score_mean;
const wsCard = document.createElement("div");
wsCard.className = "metric-card weighted-score-card";
const wsCls = App.scoreClass(wsValue);
const wsText = wsValue === null || wsValue === undefined ? "n/a" : wsValue.toFixed(2);
wsCard.innerHTML = `
  <div class="metric-value ${wsCls}">${wsText}</div>
  <div class="metric-name">综合加权得分</div>
`;
wrap.appendChild(wsCard);

Step 5: Verify app loads without JS errors

Start the webapp:

python webmain.py

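Open http://localhost:8000, navigate to 「新建评估」 (New Evaluation), click a scenario, and verify:

- The weight panel appears below LLM 角色配置 (LLM role configuration)
- Each metric is listed with its configured weight, or 1.0 when none is set
- The 「添加文档权重」 (add document weight) button adds a new row
- Navigating to any report shows the 「综合加权得分」 (overall weighted score) card

Step 6: Commit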

git add webapp/static/index.html webapp/static/js/runner.js webapp/static/css/app.css webapp/static/js/report.js
git commit -m "feat: add weight config panel to New Evaluation page and weighted_score card to report"

Task 9: Full regression test

Step 1: Run all tests

python -m pytest tests/ -v --tb=short

Expected: all previously-passing tests still PASS, new tests PASS.

Note: failures in webapp.test_* (module import path issues) and in test_offline_eval::test_normalize_sample_pdf_offline_smoke_row (missing CSV fixture) predate this feature and are not regressions.

Step 2: Run pipeline and llm-profiles tests explicitly

python -m pytest tests/test_pipeline.py tests/webapp/test_llm_profiles_api.py tests/test_weights.py -v

Expected: all PASS

Step 3: Final commit

git add .
git commit -m "feat: metric & doc weights — full implementation complete

- New rag_eval/metrics/weights.py with pure-function weight computation
- Scenario YAML supports metric_weights and doc_weights (optional, backward-compatible)
- scores.csv gains weighted_score and sample_weight columns
- summary.md shows weighted metric means and overall weighted_score
- yaml_patcher writes metric_weights/doc_weights on apply
- report_builder uses weighted means; ReportData gains weighted_score_mean
- New Evaluation page (新建评估): weight config panel with numeric metric inputs and doc weight rows
- Report detail page (报告详情): 综合加权得分 (overall weighted score) card

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>"
