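Metric Weights & Document Snippet Weights Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Support two optional fields, metric_weights and doc_weights, in the scenario YAML; compute a weighted composite score and show it on the report page and in the weight configuration panel of the "New Evaluation" page.
Architecture: Add a pure-function module, rag_eval/metrics/weights.py, that holds all computation logic; the evaluator writes two new columns (weighted_score, sample_weight) to scores.csv; yaml_patcher is extended to write the weight fields; the frontend dynamically renders a weight panel below the LLM role configuration panel, and the report page gains a "Weighted Composite Score" card.
Tech Stack: Python 3.12, Pydantic v2, FastAPI, Vanilla JS (no framework), pytest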
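Global Constraints
- Python 3.12+, PEP 8, 4-space indentation, type annotations required
- All new fields are optional; when they are omitted, behavior is identical to the current behavior (backward compatible)
- Tests use pytest and must not depend on a real LLM or the network
- JS must not introduce any new dependencies (native DOM API only)
- Weight values need not be normalized; the computation normalizes internally as w / Σw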
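The last constraint means that only the ratio between weights matters. A minimal sketch of that property follows; normalized_weighted_mean is a hypothetical helper written for this illustration, not the module's API, and it assumes at least one score is present.

```python
def normalized_weighted_mean(values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean using w / sum(w); absent keys default to a weight of 1.0."""
    total_w = sum(weights.get(key, 1.0) for key in values)
    return sum(weights.get(key, 1.0) * value for key, value in values.items()) / total_w


scores = {"faithfulness": 1.0, "context_recall": 0.0}

# Un-normalized weights 3:1 and pre-normalized weights 0.75:0.25 give the same result.
raw = normalized_weighted_mean(scores, {"faithfulness": 3.0, "context_recall": 1.0})
scaled = normalized_weighted_mean(scores, {"faithfulness": 0.75, "context_recall": 0.25})
print(raw, scaled)
```

Because both calls agree, users can enter whatever scale is convenient (percentages, 1 to 10, fractions) in the YAML.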
File List
| Action | File | Responsibility |
|---|---|---|
| Create | rag_eval/metrics/weights.py | Pure functions for weight computation |
| Create | tests/test_weights.py | Unit tests for the weights module |
| Modify | rag_eval/config/schema.py | Add two fields to ScenarioModel |
| Modify | rag_eval/shared/models.py | Add two fields to the Scenario dataclass |
| Modify | rag_eval/config/loader.py | load_scenario passes the new fields through |
| Modify | rag_eval/execution/evaluator.py | _merge_score adds two columns |
| Modify | rag_eval/reporting/summary.py | Switch to weighted means; add a weighted_score row |
| Modify | webapp/services/yaml_patcher.py | Add metric_weights/doc_weights parameters |
| Modify | webapp/models.py | Add fields to ProfileApplyRequest; add weighted_score_mean to ReportData |
| Modify | webapp/api/llm_profiles.py | apply_profiles passes the new parameters through |
| Modify | webapp/services/report_builder.py | Read weights; compute weighted means and weighted_score_mean |
| Modify | webapp/services/run_reader.py | Read metric_weights/doc_weights from scenario.snapshot.yaml |
| Modify | webapp/static/index.html | Add the weight configuration panel HTML |
| Modify | webapp/static/js/runner.js | Weight panel logic; send weights on apply |
| Modify | webapp/static/css/app.css | Weight panel styles |
| Modify | webapp/static/js/report.js | Render the weighted_score card in renderMetricCards |
| Modify | webapp/api/scenarios.py | Add metric_weights/doc_weights fields to ScenarioInfo |
| Modify | webapp/services/scenario_scanner.py | Read the weight fields during scanning |
| Modify | tests/test_offline_eval.py | Assert that scores.csv contains weighted_score/sample_weight |
| Modify | tests/webapp/test_llm_profiles_api.py | Tests for apply_profiles writing weights |
Task 1: Create the core weight computation module (TDD)
Files:
- Create: rag_eval/metrics/weights.py
- Create: tests/test_weights.py
Interfaces:
- Produces:
  - resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float
  - compute_weighted_score(scores: dict[str, float | None], metric_weights: dict[str, float]) -> float | None
  - weighted_metric_means(score_rows: list[dict], metrics: list[str], doc_weights: dict[str, float]) -> dict[str, float | None]
  - compute_overall_weighted_score_mean(score_rows: list[dict], metric_weights: dict[str, float], doc_weights: dict[str, float]) -> float | None
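The signatures above imply three aggregates. As a reading aid they can be written out as follows; the symbols ($x_{i,m}$ for the score of sample $i$ on metric $m$, $w_m$ for a metric weight, $d_i$ for the document weight of sample $i$) are notation introduced here for illustration and do not appear in the plan.

```latex
\begin{align}
s_i &= \frac{\sum_{m \in V_i} w_m \, x_{i,m}}{\sum_{m \in V_i} w_m}
  && \text{(compute\_weighted\_score, per sample)} \\
\bar{x}_m &= \frac{\sum_{i \in R_m} d_i \, x_{i,m}}{\sum_{i \in R_m} d_i}
  && \text{(weighted\_metric\_means, per metric)} \\
S &= \frac{\sum_{i \in I} d_i \, s_i}{\sum_{i \in I} d_i}
  && \text{(compute\_overall\_weighted\_score\_mean)}
\end{align}
```

Here $V_i$ is the set of metrics with a non-None, finite score for sample $i$, $R_m$ is the set of samples with such a score for metric $m$, and $I$ is the set of samples for which $s_i$ is defined. Absent weights default to 1, and each aggregate is None when its denominator is zero.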
- [ ] Step 1: Write failing tests
Create tests/test_weights.py:
```python
"""Unit tests for rag_eval/metrics/weights.py"""
import math

import pytest

from rag_eval.metrics.weights import (
    resolve_weight,
    compute_weighted_score,
    weighted_metric_means,
    compute_overall_weighted_score_mean,
)


class TestResolveWeight:
    def test_returns_value_when_key_present(self):
        assert resolve_weight({"faith": 0.5}, "faith") == 0.5

    def test_returns_default_when_key_missing(self):
        assert resolve_weight({}, "faith") == 1.0

    def test_returns_custom_default_when_key_missing(self):
        assert resolve_weight({}, "faith", default=2.0) == 2.0

    def test_empty_dict_returns_default(self):
        assert resolve_weight({}, "anything") == 1.0


class TestComputeWeightedScore:
    def test_equal_weights_is_simple_mean(self):
        scores = {"faithfulness": 0.8, "context_recall": 0.6}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.7, rel=1e-4)

    def test_explicit_weights(self):
        scores = {"faithfulness": 1.0, "context_recall": 0.0}
        weights = {"faithfulness": 3.0, "context_recall": 1.0}
        # (3*1.0 + 1*0.0) / (3+1) = 0.75
        result = compute_weighted_score(scores, weights)
        assert result == pytest.approx(0.75, rel=1e-4)

    def test_nan_values_excluded(self):
        scores = {"faithfulness": float("nan"), "context_recall": 0.8}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.8, rel=1e-4)

    def test_none_values_excluded(self):
        scores = {"faithfulness": None, "context_recall": 0.6}
        result = compute_weighted_score(scores, {})
        assert result == pytest.approx(0.6, rel=1e-4)

    def test_all_nan_returns_none(self):
        scores = {"faithfulness": float("nan"), "context_recall": float("nan")}
        assert compute_weighted_score(scores, {}) is None

    def test_empty_scores_returns_none(self):
        assert compute_weighted_score({}, {}) is None

    def test_missing_metric_in_weights_uses_default_1(self):
        scores = {"faithfulness": 0.8, "context_recall": 0.4}
        weights = {"faithfulness": 2.0}  # context_recall defaults to 1.0
        # (2*0.8 + 1*0.4) / (2+1) = 2.0/3 ≈ 0.6667
        result = compute_weighted_score(scores, weights)
        assert result == pytest.approx(2.0 / 3, rel=1e-4)


class TestWeightedMetricMeans:
    def _rows(self):
        return [
            {"doc_name": "a.pdf", "faithfulness": 1.0, "context_recall": 0.5},
            {"doc_name": "b.pdf", "faithfulness": 0.6, "context_recall": 0.8},
        ]

    def test_equal_weights_gives_arithmetic_mean(self):
        rows = self._rows()
        result = weighted_metric_means(rows, ["faithfulness", "context_recall"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["context_recall"] == pytest.approx(0.65, rel=1e-4)

    def test_doc_weight_amplifies_contribution(self):
        rows = self._rows()
        # doc a.pdf gets weight 3, b.pdf gets 1
        doc_weights = {"a.pdf": 3.0, "b.pdf": 1.0}
        result = weighted_metric_means(rows, ["faithfulness"], doc_weights)
        # (3*1.0 + 1*0.6) / (3+1) = 3.6/4 = 0.9
        assert result["faithfulness"] == pytest.approx(0.9, rel=1e-4)

    def test_nan_rows_skipped_per_metric(self):
        rows = [
            {"doc_name": "a.pdf", "faithfulness": float("nan"), "context_recall": 0.5},
            {"doc_name": "b.pdf", "faithfulness": 0.8, "context_recall": 0.9},
        ]
        result = weighted_metric_means(rows, ["faithfulness", "context_recall"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["context_recall"] == pytest.approx(0.7, rel=1e-4)

    def test_missing_metric_column_returns_none(self):
        rows = [{"doc_name": "a.pdf", "faithfulness": 0.8}]
        result = weighted_metric_means(rows, ["faithfulness", "unknown_metric"], {})
        assert result["faithfulness"] == pytest.approx(0.8, rel=1e-4)
        assert result["unknown_metric"] is None

    def test_empty_rows_returns_none_for_all(self):
        result = weighted_metric_means([], ["faithfulness"], {})
        assert result["faithfulness"] is None


class TestComputeOverallWeightedScoreMean:
    def test_basic_weighted_mean_of_weighted_scores(self):
        rows = [
            {"doc_name": "a.pdf", "faithfulness": 1.0, "context_recall": 0.0},
            {"doc_name": "b.pdf", "faithfulness": 0.5, "context_recall": 0.5},
        ]
        metric_weights = {"faithfulness": 1.0, "context_recall": 1.0}
        result = compute_overall_weighted_score_mean(rows, metric_weights, {})
        # sample 1 ws = 0.5, sample 2 ws = 0.5 → mean = 0.5
        assert result == pytest.approx(0.5, rel=1e-4)

    def test_doc_weight_amplifies_sample(self):
        rows = [
            {"doc_name": "important.pdf", "faithfulness": 1.0},
            {"doc_name": "other.pdf", "faithfulness": 0.0},
        ]
        doc_weights = {"important.pdf": 9.0, "other.pdf": 1.0}
        result = compute_overall_weighted_score_mean(rows, {}, doc_weights)
        # ws_1=1.0 w=9, ws_2=0.0 w=1 → (9*1 + 1*0)/(9+1) = 0.9
        assert result == pytest.approx(0.9, rel=1e-4)

    def test_all_nan_returns_none(self):
        rows = [{"doc_name": "a.pdf", "faithfulness": float("nan")}]
        assert compute_overall_weighted_score_mean(rows, {}, {}) is None
```
- [ ] Step 2: Run tests to verify they fail
```bash
python -m pytest tests/test_weights.py -v
```
Expected: ModuleNotFoundError: No module named 'rag_eval.metrics.weights'
- [ ] Step 3: Implement rag_eval/metrics/weights.py
```python
"""Utility functions for weighted metric aggregation.

All functions are pure (no side effects, no I/O) and operate on plain dicts/lists.
Weights do not need to be pre-normalised; normalisation is done internally.
"""
from __future__ import annotations

import math

# Columns in scores.csv that are sample metadata, not metric scores.
# Defined before the functions that read it so the module reads top-down.
_META_COLUMNS = frozenset({
    "sample_id", "question", "contexts", "answer", "ground_truth",
    "scenario", "language", "retrieval_config", "error",
    "judge_model", "embedding_model", "run_id",
    "difficulty", "question_type", "doc_id", "doc_name",
    "section_path", "page_start", "page_end",
    "source_chunk_ids", "review_status", "review_notes",
    "weighted_score", "sample_weight",
})


def resolve_weight(weights: dict[str, float], key: str, default: float = 1.0) -> float:
    """Return the weight for *key*, or *default* when absent."""
    return float(weights.get(key, default))


def compute_weighted_score(
    scores: dict[str, float | None],
    metric_weights: dict[str, float],
) -> float | None:
    """Return the weighted mean of valid (non-NaN, non-None) metric scores.

    Args:
        scores: mapping of metric_name -> raw score (may be NaN or None).
        metric_weights: optional per-metric weights; absent keys default to 1.0.

    Returns:
        Weighted mean as a float, or None when no valid score exists.
    """
    total_weight = 0.0
    total_score = 0.0
    for metric, score in scores.items():
        if score is None:
            continue
        try:
            v = float(score)
        except (TypeError, ValueError):
            continue
        if math.isnan(v) or math.isinf(v):
            continue
        w = resolve_weight(metric_weights, metric, default=1.0)
        total_weight += w
        total_score += w * v
    if total_weight == 0.0:
        return None
    return total_score / total_weight


def weighted_metric_means(
    score_rows: list[dict],
    metrics: list[str],
    doc_weights: dict[str, float],
) -> dict[str, float | None]:
    """Compute per-metric weighted means across all score rows.

    Each row's contribution is scaled by the doc_weight for its ``doc_name``.
    Rows with NaN/None for a given metric are excluded from that metric's mean.

    Args:
        score_rows: list of score record dicts (from scores.csv).
        metrics: ordered list of metric names to aggregate.
        doc_weights: mapping doc_name -> weight multiplier; absent keys default to 1.0.

    Returns:
        Dict mapping metric_name -> weighted mean (or None if no valid data).
    """
    totals: dict[str, float] = {m: 0.0 for m in metrics}
    weights_sum: dict[str, float] = {m: 0.0 for m in metrics}
    for row in score_rows:
        doc_name = str(row.get("doc_name", "") or "")
        sample_w = resolve_weight(doc_weights, doc_name, default=1.0)
        for metric in metrics:
            raw = row.get(metric)
            if raw is None:
                continue
            try:
                v = float(raw)
            except (TypeError, ValueError):
                continue
            if math.isnan(v) or math.isinf(v):
                continue
            totals[metric] += sample_w * v
            weights_sum[metric] += sample_w
    return {
        metric: (totals[metric] / weights_sum[metric] if weights_sum[metric] > 0 else None)
        for metric in metrics
    }


def compute_overall_weighted_score_mean(
    score_rows: list[dict],
    metric_weights: dict[str, float],
    doc_weights: dict[str, float],
) -> float | None:
    """Compute the overall weighted-score mean across all samples.

    For each sample:
      1. Compute per-sample weighted_score via ``compute_weighted_score``.
      2. Scale by the doc weight for that sample's ``doc_name``.
    Then return the weighted mean of all per-sample weighted_scores.

    Args:
        score_rows: list of score record dicts.
        metric_weights: per-metric weight multipliers.
        doc_weights: per-doc weight multipliers.

    Returns:
        Float mean, or None when no sample has a valid weighted_score.
    """
    total_weight = 0.0
    total_score = 0.0
    for row in score_rows:
        # Treat every non-metadata column as a metric; non-numeric values are
        # skipped later inside compute_weighted_score.
        metric_scores: dict[str, float | None] = {}
        for k, v in row.items():
            if k in _META_COLUMNS:
                continue
            metric_scores[k] = v  # type: ignore[assignment]
        ws = compute_weighted_score(metric_scores, metric_weights)
        if ws is None:
            continue
        doc_name = str(row.get("doc_name", "") or "")
        sample_w = resolve_weight(doc_weights, doc_name, default=1.0)
        total_weight += sample_w
        total_score += sample_w * ws
    return total_score / total_weight if total_weight > 0 else None
```
- [ ] Step 4: Run tests to verify they pass
```bash
python -m pytest tests/test_weights.py -v
```
Expected: all 19 tests PASS
- [ ] Step 5: Commit
```bash
git add rag_eval/metrics/weights.py tests/test_weights.py
git commit -m "feat: add metric/doc weight computation module (weights.py)"
```
Task 2: Extend the Schema, Dataclass, and Loader
Files:
- Modify: rag_eval/config/schema.py
- Modify: rag_eval/shared/models.py
- Modify: rag_eval/config/loader.py
Interfaces:
- Consumes: no new dependencies
- Produces:
  - ScenarioModel.metric_weights: dict[str, float] (default {})
  - ScenarioModel.doc_weights: dict[str, float] (default {})
  - Scenario.metric_weights: dict[str, float] (default {})
  - Scenario.doc_weights: dict[str, float] (default {})
- [ ] Step 1: Write failing test
Add to tests/test_offline_eval.py inside ScenarioAndDatasetTests:
```python
def test_load_scenario_metric_and_doc_weights(self):
    """load_scenario passes metric_weights and doc_weights into Scenario."""
    import tempfile, yaml
    from rag_eval.config.loader import load_scenario

    payload = {
        "scenario_name": "w-test", "mode": "offline",
        "dataset": "nonexistent.csv", "judge_model": "m",
        "embedding_model": "e", "metrics": ["faithfulness"],
        "output_dir": "out",
        "metric_weights": {"faithfulness": 0.7},
        "doc_weights": {"doc.pdf": 2.0},
    }
    with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w",
                                     encoding="utf-8", delete=False) as f:
        yaml.dump(payload, f, allow_unicode=True)
        tmp_path = f.name
    scenario = load_scenario(tmp_path)
    assert scenario.metric_weights == {"faithfulness": 0.7}
    assert scenario.doc_weights == {"doc.pdf": 2.0}

def test_load_scenario_defaults_to_empty_weights(self):
    """load_scenario defaults metric_weights and doc_weights to empty dicts."""
    import tempfile, yaml
    from rag_eval.config.loader import load_scenario

    payload = {
        "scenario_name": "no-w", "mode": "offline",
        "dataset": "nonexistent.csv", "judge_model": "m",
        "embedding_model": "e", "metrics": ["faithfulness"],
        "output_dir": "out",
    }
    with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w",
                                     encoding="utf-8", delete=False) as f:
        yaml.dump(payload, f, allow_unicode=True)
        tmp_path = f.name
    scenario = load_scenario(tmp_path)
    assert scenario.metric_weights == {}
    assert scenario.doc_weights == {}
```
- [ ] Step 2: Run tests to verify they fail
```bash
python -m pytest tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_metric_and_doc_weights tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_defaults_to_empty_weights -v
```
Expected: FAIL with "Scenario has no attribute 'metric_weights'"
- [ ] Step 3: Add fields to rag_eval/config/schema.py
In ScenarioModel, add after optimization_advisor:
```python
    metric_weights: dict[str, float] = Field(default_factory=dict)
    doc_weights: dict[str, float] = Field(default_factory=dict)
```
- [ ] Step 4: Add fields to rag_eval/shared/models.py
In the Scenario dataclass, add after optimization_advisor: bool = False:
```python
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)
```
- [ ] Step 5: Update rag_eval/config/loader.py
In load_scenario(), in the Scenario(...) constructor call, add after optimization_advisor=model.optimization_advisor,:
```python
    metric_weights=dict(model.metric_weights),
    doc_weights=dict(model.doc_weights),
```
- [ ] Step 6: Run tests to verify they pass
```bash
python -m pytest tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_metric_and_doc_weights tests/test_offline_eval.py::ScenarioAndDatasetTests::test_load_scenario_defaults_to_empty_weights -v
```
Expected: both PASS
- [ ] Step 7: Commit
```bash
git add rag_eval/config/schema.py rag_eval/shared/models.py rag_eval/config/loader.py tests/test_offline_eval.py
git commit -m "feat: add metric_weights and doc_weights to Scenario schema and dataclass"
```
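Both new dataclass fields use field(default_factory=dict) rather than a literal {} default. A small sketch of why: the standard dataclasses module rejects a mutable dict default with a ValueError, and a factory gives every instance its own dict. ScenarioSketch is a hypothetical stand-in that keeps only the fields needed for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioSketch:
    """Hypothetical stand-in for Scenario, reduced to the two new fields."""
    scenario_name: str
    metric_weights: dict[str, float] = field(default_factory=dict)
    doc_weights: dict[str, float] = field(default_factory=dict)


first = ScenarioSketch("a")
second = ScenarioSketch("b")

# Mutating one instance's weights must not leak into another instance.
first.metric_weights["faithfulness"] = 0.7
print(first.metric_weights, second.metric_weights)
```

The dict(model.metric_weights) copy in the loader serves the same purpose: the Scenario does not share a mutable mapping with the Pydantic model.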
Task 3: Evaluator (_merge_score adds two columns)
Files:
- Modify: rag_eval/execution/evaluator.py
Interfaces:
- Consumes: compute_weighted_score(scores, metric_weights) -> float | None and resolve_weight(weights, key, default) -> float, both from rag_eval.metrics.weights
- Produces: scores.csv gains the columns weighted_score: float | NaN (None in the in-memory row when no metric score is valid) and sample_weight: float
- [ ] Step 1: Write failing test
Add to tests/test_offline_eval.py inside EvaluatorAndReportingTests:
```python
def test_merge_score_includes_weighted_score_and_sample_weight(self):
    """_merge_score adds weighted_score and sample_weight columns."""
    from unittest.mock import MagicMock
    from rag_eval.execution.evaluator import Evaluator
    from rag_eval.shared.models import (
        MetricScore, NormalizedSample, RuntimeConfig, Scenario, DatasetConfig,
    )
    from pathlib import Path

    scenario = Scenario(
        scenario_name="w-test", mode="offline",
        dataset=DatasetConfig(path=Path("d.csv")),
        judge_model="m", embedding_model="e",
        metrics=["faithfulness", "context_recall"],
        output_dir=Path("out"),
        metric_weights={"faithfulness": 3.0, "context_recall": 1.0},
        doc_weights={"doc.pdf": 2.0},
    )
    evaluator = Evaluator(
        scenario=scenario,
        metric_pipeline=MagicMock(),
        app_adapter=None,
    )
    sample = NormalizedSample(
        sample_id="s1", question="q", contexts=["ctx"],
        answer="a", ground_truth="gt",
        metadata={"doc_name": "doc.pdf"},
    )
    score = MetricScore(metrics={"faithfulness": 1.0, "context_recall": 0.0})
    row = evaluator._merge_score(sample, score)
    # (3*1.0 + 1*0.0) / 4 = 0.75
    assert abs(row["weighted_score"] - 0.75) < 1e-4
    assert row["sample_weight"] == 2.0
```
- [ ] Step 2: Run test to verify it fails
```bash
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_merge_score_includes_weighted_score_and_sample_weight -v
```
Expected: FAIL with KeyError: 'weighted_score'
- [ ] Step 3: Update rag_eval/execution/evaluator.py
Add import at top of file (after existing imports):
```python
from rag_eval.metrics.weights import compute_weighted_score, resolve_weight
```
Replace the _merge_score method:
```python
def _merge_score(self, sample: NormalizedSample, score: Any) -> dict[str, Any]:
    """Combine sample data, metric results, run metadata, and weight columns."""
    record = sample.to_record()
    record["contexts"] = sample.contexts
    record.update(score.metrics)
    record["error"] = score.error
    record["judge_model"] = self.scenario.judge_model
    record["embedding_model"] = self.scenario.embedding_model
    record["run_id"] = self.scenario.scenario_name
    # Weighted score columns: enable post-hoc weighted aggregation in reporting.
    record["weighted_score"] = compute_weighted_score(
        score.metrics, self.scenario.metric_weights
    )
    doc_name = str(sample.metadata.get("doc_name", "") or "")
    record["sample_weight"] = resolve_weight(
        self.scenario.doc_weights, doc_name, default=1.0
    )
    return record
```
- [ ] Step 4: Run test to verify it passes
```bash
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_merge_score_includes_weighted_score_and_sample_weight -v
```
Expected: PASS
- [ ] Step 5: Commit
```bash
git add rag_eval/execution/evaluator.py tests/test_offline_eval.py
git commit -m "feat: add weighted_score and sample_weight columns to score rows"
```
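With both columns stored per row, the overall score can be recomputed from scores.csv alone, without re-reading the scenario weights. The sketch below illustrates this under two assumptions that the plan does not confirm: that a missing weighted_score is written as an empty cell, and that the CSV uses exactly these column names. overall_from_columns is a hypothetical helper, and the CSV text is made-up sample data.

```python
import csv
import io
import math
from typing import Optional


def overall_from_columns(csv_text: str) -> Optional[float]:
    """Recompute the overall score from weighted_score and sample_weight columns."""
    total_w = 0.0
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            ws = float(row.get("weighted_score", ""))
        except (TypeError, ValueError):
            continue  # empty cell: the sample had no valid metric score
        if math.isnan(ws):
            continue
        w = float(row.get("sample_weight") or 1.0)
        total_w += w
        total += w * ws
    return total / total_w if total_w > 0 else None


sample_csv = (
    "sample_id,weighted_score,sample_weight\n"
    "s1,1.0,9.0\n"
    "s2,0.0,1.0\n"
    "s3,,1.0\n"
)
overall = overall_from_columns(sample_csv)
print(overall)
```

The third row is skipped entirely, so its sample_weight does not dilute the result.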
Task 4: Report summary (switch to weighted means)
Files:
- Modify: rag_eval/reporting/summary.py
Interfaces:
- Consumes:
  - weighted_metric_means(score_rows, metrics, doc_weights) -> dict[str, float | None]
  - compute_overall_weighted_score_mean(score_rows, metric_weights, doc_weights) -> float | None
  - Both from rag_eval.metrics.weights
- [ ] Step 1: Write failing test
Add to tests/test_offline_eval.py inside EvaluatorAndReportingTests:
```python
def test_summary_markdown_shows_weighted_score(self):
    """build_summary_markdown includes weighted_score when metric_weights set."""
    import math
    from rag_eval.reporting.summary import build_summary_markdown
    from rag_eval.shared.models import (
        EvaluationResult, NormalizedSample, DatasetConfig, Scenario, RuntimeConfig,
    )
    from pathlib import Path

    scenario = Scenario(
        scenario_name="ws-test", mode="offline",
        dataset=DatasetConfig(path=Path("d.csv")),
        judge_model="m", embedding_model="e",
        metrics=["faithfulness"],
        output_dir=Path("out"),
        metric_weights={"faithfulness": 1.0},
        doc_weights={},
    )
    sample = NormalizedSample(
        sample_id="s1", question="q", contexts=["c"],
        answer="a", ground_truth="gt",
    )
    result = EvaluationResult(
        scenario=scenario, run_id="r1",
        started_at="2026-01-01T00:00:00", finished_at="2026-01-01T00:01:00",
        valid_samples=[sample], invalid_samples=[],
        score_rows=[{
            "sample_id": "s1", "faithfulness": 0.8,
            "weighted_score": 0.8, "sample_weight": 1.0,
            "doc_name": "", "error": "",
        }],
    )
    md = build_summary_markdown(result)
    assert "weighted_score" in md
    assert "0.8000" in md
```
- [ ] Step 2: Run test to verify it fails
```bash
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_summary_markdown_shows_weighted_score -v
```
Expected: FAIL because "weighted_score" is not in md
- [ ] Step 3: Update rag_eval/reporting/summary.py
Replace the entire file:
````python
"""Markdown summary generation for completed evaluation runs."""
from __future__ import annotations

import math

import pandas as pd

from rag_eval.metrics.weights import (
    compute_overall_weighted_score_mean,
    weighted_metric_means,
)
from rag_eval.shared.models import EvaluationResult


def _table_from_frame(frame: pd.DataFrame) -> str:
    """Render a small dataframe as a fixed-width markdown-friendly text table."""
    if frame.empty:
        return "No rows."
    columns = list(frame.columns)
    rows = [[str(value) for value in row] for row in frame.astype(object).values.tolist()]
    widths = []
    for index, column in enumerate(columns):
        column_width = len(str(column))
        row_width = max((len(row[index]) for row in rows), default=0)
        widths.append(max(column_width, row_width))
    header = " | ".join(str(column).ljust(widths[idx]) for idx, column in enumerate(columns))
    separator = "-|-".join("-" * widths[idx] for idx in range(len(columns)))
    body = [
        " | ".join(row[idx].ljust(widths[idx]) for idx in range(len(columns)))
        for row in rows
    ]
    return "\n".join([header, separator, *body])


def build_summary_markdown(result: EvaluationResult) -> str:
    """Build the human-readable markdown summary written for each evaluation run."""
    total = len(result.valid_samples) + len(result.invalid_samples)
    scores = pd.DataFrame(result.score_rows)
    lines = [
        f"# {result.scenario.scenario_name}",
        "",
        f"- run_id: `{result.run_id}`",
        f"- mode: `{result.scenario.mode}`",
        f"- total_samples: `{total}`",
        f"- valid_samples: `{len(result.valid_samples)}`",
        f"- invalid_samples: `{len(result.invalid_samples)}`",
        f"- judge_model: `{result.scenario.judge_model}`",
        f"- embedding_model: `{result.scenario.embedding_model}`",
        "",
        "## Metric Means",
        "",
    ]
    if scores.empty:
        lines.append("No valid samples were scored.")
        return "\n".join(lines) + "\n"

    score_rows_list = scores.to_dict(orient="records")
    # Per-metric means are weighted by doc_weights only; metric_weights affect
    # the overall weighted_score line below.
    w_means = weighted_metric_means(
        score_rows_list, result.scenario.metrics, result.scenario.doc_weights
    )
    has_weights = bool(result.scenario.metric_weights or result.scenario.doc_weights)
    weight_suffix = " (weighted)" if has_weights else ""
    for metric in result.scenario.metrics:
        mean_value = w_means.get(metric)
        w = result.scenario.metric_weights.get(metric, 1.0) if result.scenario.metric_weights else 1.0
        weight_note = f" (w={w:.2f})" if result.scenario.metric_weights else ""
        if mean_value is not None and not math.isnan(mean_value):
            lines.append(f"- {metric}: `{mean_value:.4f}`{weight_note}")
        else:
            lines.append(f"- {metric}: `n/a`{weight_note}")

    overall_ws = compute_overall_weighted_score_mean(
        score_rows_list, result.scenario.metric_weights, result.scenario.doc_weights
    )
    if overall_ws is not None and not math.isnan(overall_ws):
        lines.append(f"- **weighted_score{weight_suffix}: `{overall_ws:.4f}`**")
    else:
        lines.append(f"- **weighted_score{weight_suffix}: `n/a`**")

    detail_columns = ["sample_id", *result.scenario.metrics, "weighted_score", "error"]
    existing_columns = [c for c in detail_columns if c in scores.columns]
    detail = scores[existing_columns]
    lines.extend([
        "",
        "## Per-sample Scores",
        "",
        "```text",
        _table_from_frame(detail),
        "```",
    ])
    return "\n".join(lines) + "\n"
````
- [ ] Step 4: Run test to verify it passes
```bash
python -m pytest tests/test_offline_eval.py::EvaluatorAndReportingTests::test_summary_markdown_shows_weighted_score -v
```
Expected: PASS
- [ ] Step 5: Run full offline test suite to check no regressions
```bash
python -m pytest tests/test_offline_eval.py tests/test_weights.py -v
```
Expected: all PASS
- [ ] Step 6: Commit
```bash
git add rag_eval/reporting/summary.py tests/test_offline_eval.py
git commit -m "feat: use weighted metric means and add weighted_score row to summary.md"
```
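The per-metric lines that build_summary_markdown emits follow one small pattern: four decimal places for a valid mean, n/a otherwise, and a two-decimal weight note only when metric weights are configured. A self-contained sketch of that formatting rule is shown here; metric_line is a hypothetical helper name used for illustration and is not part of the plan.

```python
import math
from typing import Optional


def metric_line(metric: str, mean_value: Optional[float], weight: Optional[float] = None) -> str:
    """Format one summary bullet; pass weight=None when no metric weights are set."""
    note = f" (w={weight:.2f})" if weight is not None else ""
    if mean_value is not None and not math.isnan(mean_value):
        shown = f"{mean_value:.4f}"
    else:
        shown = "n/a"
    return f"- {metric}: `{shown}`{note}"


print(metric_line("faithfulness", 0.8, 1.0))
print(metric_line("context_recall", None))
```

This is why the Step 1 test can assert on the literal string "0.8000".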
Task 5: Extend yaml_patcher
Files:
- Modify: webapp/services/yaml_patcher.py
- Modify: webapp/models.py
- Modify: webapp/api/llm_profiles.py
Interfaces:
- Produces:
  - apply_profiles_to_scenario(..., metric_weights=None, doc_weights=None): new optional params
  - ProfileApplyRequest.metric_weights: dict[str, float] | None
  - ProfileApplyRequest.doc_weights: dict[str, float] | None
- [ ] Step 1: Write failing test
Add to tests/webapp/test_llm_profiles_api.py:
```python
# Assumes the test module already imports pytest and PyYAML as yaml_lib.
def test_apply_metric_weights_patches_yaml(tmp_path):
    """Applying metric_weights writes them into the YAML."""
    scenario_file = tmp_path / "w-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: m\nembedding_model: e\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n",
        encoding="utf-8",
    )
    from webapp.services.yaml_patcher import apply_profiles_to_scenario
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None, answer_profile=None, dataset_profile=None,
        metric_weights={"faithfulness": 0.7, "context_recall": 0.3},
        _resolve_absolute=True,
    )
    assert "metric_weights" in patched
    data = yaml_lib.safe_load(scenario_file.read_text(encoding="utf-8"))
    assert data["metric_weights"]["faithfulness"] == pytest.approx(0.7)


def test_apply_doc_weights_patches_yaml(tmp_path):
    """Applying doc_weights writes them into the YAML."""
    scenario_file = tmp_path / "dw-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: m\nembedding_model: e\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n",
        encoding="utf-8",
    )
    from webapp.services.yaml_patcher import apply_profiles_to_scenario
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None, answer_profile=None, dataset_profile=None,
        doc_weights={"doc.pdf": 2.0},
        _resolve_absolute=True,
    )
    assert "doc_weights" in patched
    data = yaml_lib.safe_load(scenario_file.read_text(encoding="utf-8"))
    assert data["doc_weights"]["doc.pdf"] == pytest.approx(2.0)
```
- [ ] Step 2: Run tests to verify they fail
```bash
python -m pytest tests/webapp/test_llm_profiles_api.py::test_apply_metric_weights_patches_yaml tests/webapp/test_llm_profiles_api.py::test_apply_doc_weights_patches_yaml -v
```
Expected: FAIL with an unexpected keyword argument 'metric_weights'
- [ ] Step 3: Update webapp/services/yaml_patcher.py
Replace the apply_profiles_to_scenario signature and body:
```python
def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    metric_weights: dict[str, float] | None = None,
    doc_weights: dict[str, float] | None = None,
    _resolve_absolute: bool = False,
) -> list[str]:
    """Patch the YAML file at *scenario_path* with the supplied profiles and weights.

    Returns a list of dotted field names that were actually patched.
    """
    if _resolve_absolute:
        resolved = Path(scenario_path)
    else:
        resolved = _resolve_scenario_path(scenario_path)
    if not resolved.exists():
        raise FileNotFoundError(f"Scenario file not found: {resolved}")
    data: dict[str, Any] = yaml.safe_load(resolved.read_text(encoding="utf-8")) or {}
    patched: list[str] = []
    if judge_profile is not None:
        data["judge_model"] = judge_profile.model
        patched.append("judge_model")
    if answer_profile is not None:
        adapter = data.get("app_adapter")
        if isinstance(adapter, dict):
            static_kwargs = adapter.setdefault("static_kwargs", {})
            static_kwargs["model"] = answer_profile.model
            patched.append("app_adapter.static_kwargs.model")
    if dataset_profile is not None:
        generation = data.get("generation")
        if isinstance(generation, dict):
            generation["model"] = dataset_profile.model
            patched.append("generation.model")
    # None means "leave the YAML untouched"; an empty dict overwrites with {}.
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")
    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")
    resolved.write_text(
        yaml.dump(data, allow_unicode=True, default_flow_style=False, sort_keys=False),
        encoding="utf-8",
    )
    return patched
```
- [ ] Step 4: Update webapp/models.py (ProfileApplyRequest)
Add two fields to ProfileApplyRequest:
```python
class ProfileApplyRequest(BaseModel):
    """Request body to patch LLM profile selections into a scenario YAML."""
    scenario_path: str
    judge_profile_id: str | None = None
    answer_profile_id: str | None = None
    dataset_profile_id: str | None = None
    metric_weights: dict[str, float] | None = Field(
        default=None,
        description="Metric weight mapping, e.g. {\"faithfulness\": 0.35}. When null, the YAML is left unchanged.",
    )
    doc_weights: dict[str, float] | None = Field(
        default=None,
        description="Document weight mapping, e.g. {\"doc.pdf\": 2.0}. When null, the YAML is left unchanged.",
    )
```
- [ ] Step 5: Update webapp/api/llm_profiles.py (apply_profiles endpoint)
In apply_profiles(), update the call to apply_profiles_to_scenario:
```python
patched = apply_profiles_to_scenario(
    scenario_path=request.scenario_path,
    judge_profile=role_profiles["judge"],
    answer_profile=role_profiles["answer"],
    dataset_profile=role_profiles["dataset"],
    metric_weights=request.metric_weights,
    doc_weights=request.doc_weights,
)
```
- [ ] Step 6: Run tests to verify they pass
```bash
python -m pytest tests/webapp/test_llm_profiles_api.py -v
```
Expected: all 15 tests PASS
- [ ] Step 7: Commit
```bash
git add webapp/services/yaml_patcher.py webapp/models.py webapp/api/llm_profiles.py tests/webapp/test_llm_profiles_api.py
git commit -m "feat: yaml_patcher and ProfileApplyRequest support metric_weights and doc_weights"
```
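One consequence of the `is not None` checks in this task is worth spelling out: omitting a weight field (null) leaves the YAML as it was, while sending an empty mapping overwrites the stored weights with {}. The sketch below reproduces only that branch on an in-memory dict; patch_weights is a hypothetical analogue written for this illustration, not the real apply_profiles_to_scenario.

```python
from typing import Any, Optional


def patch_weights(
    data: dict[str, Any],
    metric_weights: Optional[dict[str, float]] = None,
    doc_weights: Optional[dict[str, float]] = None,
) -> list[str]:
    """Hypothetical in-memory analogue of the weight-patching branch."""
    patched: list[str] = []
    if metric_weights is not None:
        data["metric_weights"] = dict(metric_weights)
        patched.append("metric_weights")
    if doc_weights is not None:
        data["doc_weights"] = dict(doc_weights)
        patched.append("doc_weights")
    return patched


scenario = {"scenario_name": "test", "metric_weights": {"faithfulness": 0.7}}

# None (the default): existing weights are left as they are.
untouched = patch_weights(scenario)
kept = dict(scenario["metric_weights"])

# Empty dict: the stored weights are replaced by an empty mapping.
cleared = patch_weights(scenario, metric_weights={})
print(untouched, kept, cleared, scenario["metric_weights"])
```

A frontend that wants a "reset to equal weights" action would therefore send {} rather than omit the field.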
Task 6: Weighting support in report_builder and run_reader
Files:
- Modify: webapp/services/run_reader.py
- Modify: webapp/services/report_builder.py
- Modify: webapp/models.py (add weighted_score_mean to ReportData)
Interfaces:
- Consumes:
  - weighted_metric_means(score_rows, metrics, doc_weights) from rag_eval.metrics.weights
  - compute_overall_weighted_score_mean(score_rows, metric_weights, doc_weights) from rag_eval.metrics.weights
- Produces:
  - ReportData.weighted_score_mean: float | None
  - _read_weights_from_snapshot(run_dir) -> tuple[dict, dict] in run_reader
- [ ] Step 1: Update webapp/models.py (ReportData)
Add the weighted_score_mean field, plus the two weight mappings:
```python
class ReportData(BaseModel):
    """Aggregated report payload rendered by the report detail page."""
    metrics: list[str] = Field(default_factory=list)
    metric_means: dict[str, float | None] = Field(default_factory=dict)
    distributions: dict[str, list[DistributionBin]] = Field(default_factory=dict)
    groupings: dict[str, list[GroupStat]] = Field(default_factory=dict)
    lowest_samples: list[SampleScore] = Field(default_factory=list)
    summary_markdown: str = ""
    advice_markdown: str = ""
    weighted_score_mean: float | None = Field(
        default=None,
        description=(
            "Mean weighted composite score (metric_weights and doc_weights applied together). "
            "With equal weights and no missing scores, this equals the mean of the per-metric means."
        ),
    )
    metric_weights: dict[str, float] = Field(
        default_factory=dict,
        description="Metric weight configuration used by this run (from scenario.snapshot.yaml).",
    )
    doc_weights: dict[str, float] = Field(
        default_factory=dict,
        description="Document weight configuration used by this run (from scenario.snapshot.yaml).",
    )
```
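The weighted_score_mean description above equates the equal-weight score with the mean of the per-metric means. That identity holds when every sample has every metric; once a score is missing, the two can diverge, because the overall score averages per-sample means. The following sketch uses made-up rows and local helper names (_valid, mean) that are assumptions for illustration only.

```python
import math
from typing import Optional


def _valid(value: Optional[float]) -> bool:
    """True for a usable score (not None, not NaN)."""
    return value is not None and not math.isnan(value)


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


rows = [
    {"faithfulness": 1.0, "context_recall": None},  # one metric missing
    {"faithfulness": 0.0, "context_recall": 0.0},
]
metrics = ["faithfulness", "context_recall"]

# Overall score as defined in the plan: mean of per-sample means (equal weights).
per_sample = [mean([r[m] for m in metrics if _valid(r[m])]) for r in rows]
overall = mean(per_sample)

# Alternative reading: mean of the per-metric means.
per_metric = [mean([r[m] for r in rows if _valid(r[m])]) for m in metrics]
mean_of_means = mean(per_metric)

print(overall, mean_of_means)
```

Here the first quantity is 0.5 and the second is 0.25, so a report that shows both the metric cards and the composite card may not look internally consistent on runs with failed metric evaluations.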
- [ ] Step 2: Add _read_weights_from_snapshot to webapp/services/run_reader.py
Add after _read_metrics_from_snapshot:
```python
def _read_weights_from_snapshot(run_dir: Path) -> tuple[dict[str, float], dict[str, float]]:
    """Read metric_weights and doc_weights from a scenario snapshot if present.

    Returns a (metric_weights, doc_weights) tuple of plain dicts.
    Both default to empty dicts when the snapshot is absent or lacks the fields.
    """
    snapshot = run_dir / "scenario.snapshot.yaml"
    if not snapshot.is_file():
        return {}, {}
    try:
        payload = yaml.safe_load(snapshot.read_text(encoding="utf-8")) or {}
    except (OSError, yaml.YAMLError):
        return {}, {}
    mw = payload.get("metric_weights") or {}
    dw = payload.get("doc_weights") or {}
    return (
        {str(k): float(v) for k, v in mw.items() if isinstance(v, (int, float))},
        {str(k): float(v) for k, v in dw.items() if isinstance(v, (int, float))},
    )
```
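The isinstance(v, (int, float)) filter above also admits booleans (bool is a subclass of int in Python) and NaN, and it assumes the YAML value is a mapping. If stricter handling is wanted, a helper along the following lines could be used. coerce_weights is a hypothetical name, and rejecting negative weights is an assumption of this sketch that the plan does not state.

```python
import math
from typing import Any


def coerce_weights(raw: Any) -> dict[str, float]:
    """Hypothetical helper: keep only finite, non-negative, non-boolean numeric weights."""
    if not isinstance(raw, dict):
        return {}
    cleaned: dict[str, float] = {}
    for key, value in raw.items():
        # bool passes an int check, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        number = float(value)
        if math.isfinite(number) and number >= 0:
            cleaned[str(key)] = number
    return cleaned


result = coerce_weights(
    {"faithfulness": 2, "flag": True, "bad": "x", "nan": float("nan"), "neg": -1.0}
)
print(result)
```

Only the well-formed entry survives, and a non-mapping value such as a YAML list degrades to an empty dict instead of raising.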
- Step 3: Update
webapp/services/report_builder.py
Replace _metric_means call and build_report to use weighted versions:
# Add imports at top:
from rag_eval.metrics.weights import (
compute_overall_weighted_score_mean,
weighted_metric_means as _weighted_metric_means,
)
from webapp.services.run_reader import _read_weights_from_snapshot
Replace build_report:
def build_report(run_dir: Path, metrics: list[str]) -> ReportData:
    """Build the full aggregated report payload for one run directory."""
    frame = run_reader.read_scores_frame(run_dir)
    summary_markdown = run_reader.read_summary_markdown(run_dir)
    advice_markdown = run_reader.read_advice_markdown(run_dir)
    metric_weights, doc_weights = _read_weights_from_snapshot(run_dir)
    if frame.empty or not metrics:
        return ReportData(
            metrics=metrics,
            metric_means={metric: None for metric in metrics},
            summary_markdown=summary_markdown,
            advice_markdown=advice_markdown,
            metric_weights=metric_weights,
            doc_weights=doc_weights,
        )
    score_rows_list = frame.to_dict(orient="records")
    # Use weighted metric means (degrades to arithmetic mean when weights are empty).
    w_means = _weighted_metric_means(score_rows_list, metrics, doc_weights)
    rounded_means = {m: _round_or_none(v) for m, v in w_means.items()}
    overall_ws = compute_overall_weighted_score_mean(
        score_rows_list, metric_weights, doc_weights
    )
    distributions = {
        metric: _distribution(frame, metric)
        for metric in metrics
        if metric in frame.columns
    }
    return ReportData(
        metrics=metrics,
        metric_means=rounded_means,
        distributions=distributions,
        groupings=_groupings(frame, metrics),
        lowest_samples=_lowest_samples(frame, metrics),
        summary_markdown=summary_markdown,
        advice_markdown=advice_markdown,
        weighted_score_mean=_round_or_none(overall_ws),
        metric_weights=metric_weights,
        doc_weights=doc_weights,
    )
Also delete the old _metric_means function (it is replaced by the weighted version).
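To illustrate what "degrades to arithmetic mean when weights are empty" means for a single metric, here is a minimal sketch. It is not the weighted_metric_means implementation from rag_eval/metrics/weights.py. The function name doc_weighted_mean and the source_doc column name are assumptions made for this example, as is skipping NaN cells, which frame.to_dict(orient="records") can produce for missing scores.

```python
from __future__ import annotations

import math


def doc_weighted_mean(
    rows: list[dict[str, object]],
    metric: str,
    doc_weights: dict[str, float],
    doc_key: str = "source_doc",  # hypothetical column name
) -> float | None:
    """Mean of one metric where each row counts with its document's weight."""
    total = 0.0
    weight_sum = 0.0
    for row in rows:
        value = row.get(metric)
        # Skip missing, non-numeric, and NaN scores instead of poisoning the sum.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or math.isnan(value):
            continue
        w = doc_weights.get(str(row.get(doc_key, "")), 1.0)
        total += w * value
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else None


rows = [
    {"source_doc": "a.pdf", "faithfulness": 1.0},
    {"source_doc": "b.pdf", "faithfulness": 0.5},
    {"source_doc": "b.pdf", "faithfulness": float("nan")},
]
plain = doc_weighted_mean(rows, "faithfulness", {})
weighted = doc_weighted_mean(rows, "faithfulness", {"a.pdf": 3.0})
```

With no doc weights every row has weight 1.0, so the result is the plain mean of the valid scores; giving a.pdf a weight of 3.0 pulls the mean toward its score.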
- Step 4: Run existing webapp tests to check no regressions
python -m pytest tests/webapp/ -v
Expected: all PASS
- Step 5: Commit
git add webapp/models.py webapp/services/run_reader.py webapp/services/report_builder.py
git commit -m "feat: report_builder uses weighted metric means; ReportData gains weighted_score_mean"
Task 7: scenario_scanner reads weight fields for the frontend
Files:
- Modify: webapp/models.py (add fields to ScenarioInfo)
- Modify: webapp/services/scenario_scanner.py
Interfaces:
- Produces: ScenarioInfo.metric_weights: dict[str, float], ScenarioInfo.doc_weights: dict[str, float]
- Step 1: Add fields to ScenarioInfo in webapp/models.py
class ScenarioInfo(BaseModel):
    """One discoverable scenario YAML file that can be evaluated from the UI."""

    path: str
    scenario_name: str = ""
    mode: str = ""
    dataset: str = ""
    judge_model: str = ""
    metrics: list[str] = Field(default_factory=list)
    error: str = ""
    metric_weights: dict[str, float] = Field(default_factory=dict)
    doc_weights: dict[str, float] = Field(default_factory=dict)
- Step 2: Update _summarize_scenario in webapp/services/scenario_scanner.py
After the metric_list line, add weight extraction:
    # Treat a missing or non-mapping value as "no weights configured".
    raw_metric_weights = payload.get("metric_weights")
    raw_doc_weights = payload.get("doc_weights")
    if not isinstance(raw_metric_weights, dict):
        raw_metric_weights = {}
    if not isinstance(raw_doc_weights, dict):
        raw_doc_weights = {}
    # bool is a subclass of int, so exclude it explicitly.
    metric_weights = {
        str(k): float(v)
        for k, v in raw_metric_weights.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }
    doc_weights = {
        str(k): float(v)
        for k, v in raw_doc_weights.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }
    return ScenarioInfo(
        path=relative,
        scenario_name=str(payload.get("scenario_name", "")),
        mode=str(payload.get("mode", "")),
        dataset=str(payload.get("dataset", "")),
        judge_model=str(payload.get("judge_model", "")),
        metrics=metric_list,
        metric_weights=metric_weights,
        doc_weights=doc_weights,
    )
- Step 3: Run existing tests
python -m pytest tests/webapp/ tests/test_offline_eval.py -v
Expected: all PASS
- Step 4: Commit
git add webapp/models.py webapp/services/scenario_scanner.py
git commit -m "feat: ScenarioInfo exposes metric_weights and doc_weights from YAML"
Task 8: Frontend weight configuration panel and report card
Files:
- Modify: webapp/static/index.html
- Modify: webapp/static/js/runner.js
- Modify: webapp/static/css/app.css
- Modify: webapp/static/js/report.js
Interfaces:
- Consumes: ScenarioInfo.metric_weights, ScenarioInfo.doc_weights (from /api/scenarios)
- Consumes: ReportData.weighted_score_mean, ReportData.metric_weights (from /api/runs/{id})
- Produces: weight panel HTML in 新建评估 (New Evaluation); weighted_score card on the report page
- Step 1: Add weight panel HTML to index.html
Add this block immediately after the closing </div> of #llm-assignment-panel (before <div class="panel" id="task-panel">):
<!-- Weight configuration panel (shown once a scenario is selected) -->
<div class="panel weight-config-panel" id="weight-config-panel" hidden>
<h2>权重配置 <span class="muted" style="font-size:13px;font-weight:400">(可选,留空使用场景原始配置)</span></h2>
<div class="weight-section">
<div class="weight-section-title">指标权重 <span class="muted" style="font-size:12px">(数值越大该指标在综合得分中占比越高)</span></div>
<div id="metric-weight-rows" class="weight-rows"></div>
</div>
<div class="weight-section" style="margin-top:16px">
<div class="weight-section-title">文档权重 <span class="muted" style="font-size:12px">(按 PDF 文件名,数值越大该文档的题目在汇总时贡献越大)</span></div>
<div id="doc-weight-rows" class="weight-rows"></div>
<button class="btn btn-sm" id="add-doc-weight-btn" style="margin-top:8px">+ 添加文档权重</button>
</div>
</div>
- Step 2: Add CSS to app.css
Add at the end of the file, before any @media print block:
/* ── Weight configuration panel ─────────────────── */
.weight-config-panel { margin-top: 12px; }
.weight-section-title { font-size: 13px; font-weight: 600; color: var(--text); margin-bottom: 8px; }
.weight-rows { display: flex; flex-direction: column; gap: 6px; }
.weight-row {
display: flex; align-items: center; gap: 10px;
font-size: 13px;
}
.weight-row-label { min-width: 180px; color: var(--slate); font-family: monospace; }
.weight-row-input {
width: 80px; padding: 4px 8px; border: 1px solid var(--border);
border-radius: 6px; font-size: 13px; text-align: right;
}
.weight-row-input:focus { outline: none; border-color: #6366f1; }
.doc-weight-name {
flex: 1; padding: 4px 8px; border: 1px solid var(--border);
border-radius: 6px; font-size: 13px; min-width: 0;
}
.weight-row-remove { color: var(--bad); cursor: pointer; font-size: 14px; background: none; border: none; padding: 2px 6px; }
.weight-row-remove:hover { background: #fee2e2; border-radius: 4px; }
/* Highlight the weighted_score metric card */
.metric-card.weighted-score-card {
border: 2px solid #6366f1;
background: #f5f3ff;
}
.metric-card.weighted-score-card .metric-name { color: #4f46e5; font-weight: 700; }
- Step 3: Update runner.js
Replace the entire runner.js with:
// runner.js: New Evaluation view. Lists scenarios, configures LLM roles and weights, triggers evaluations, polls task status.
const Runner = {
selectedScenario: null,
selectedScenarioInfo: null,
pollTimer: null,
lastRunId: null,
init() {
document.getElementById("run-btn").addEventListener("click", () => Runner.trigger());
document.getElementById("view-report-btn").addEventListener("click", () => {
if (Runner.lastRunId) {
App.enableReportNav();
App.navigate("report", Runner.lastRunId);
}
});
document.getElementById("add-doc-weight-btn").addEventListener("click", () => Runner._addDocWeightRow());
},
async loadScenarios() {
const list = document.getElementById("scenario-list");
list.innerHTML = '<p class="muted">加载中…</p>';
try {
const data = await API.scenarios();
const scenarios = data.scenarios || [];
if (scenarios.length === 0) {
list.innerHTML = '<p class="muted">未在 scenarios/ 下找到场景文件。</p>';
return;
}
list.innerHTML = "";
scenarios.forEach((sc) => list.appendChild(Runner.renderScenarioItem(sc)));
} catch (err) {
list.innerHTML = `<p class="muted">加载失败:${App.escape(err.message)}</p>`;
}
Runner._populateProfileSelects();
},
async _populateProfileSelects() {
const cached = Profiles.getAll();
const profiles = cached.length > 0
? cached
: (await API.profiles().catch(() => ({ profiles: [] }))).profiles;
["role-judge", "role-answer", "role-dataset"].forEach(id => {
const sel = document.getElementById(id);
sel.innerHTML = '<option value="">— 使用场景原始配置 —</option>';
profiles.forEach(p => {
const opt = document.createElement("option");
opt.value = p.profile_id;
opt.textContent = `${p.name} (${p.model})`;
sel.appendChild(opt);
});
});
},
renderScenarioItem(sc) {
const item = document.createElement("div");
const invalid = !!sc.error;
item.className = "scenario-item" + (invalid ? " invalid" : "");
const modeTag = sc.mode
? `<span class="tag mode-${App.escape(sc.mode)}">${App.escape(sc.mode)}</span>`
: "";
const metricCount = (sc.metrics || []).length;
item.innerHTML = `
<div>
<div class="scenario-name">${App.escape(sc.scenario_name || sc.path)}</div>
<div class="scenario-path">${App.escape(sc.path)}</div>
${sc.error ? `<div class="scenario-path" style="color:#dc2626">${App.escape(sc.error)}</div>` : ""}
</div>
<div class="scenario-tags">
${modeTag}
<span class="tag">${metricCount} 指标</span>
</div>
`;
if (!invalid) {
item.addEventListener("click", () => {
document.querySelectorAll(".scenario-item").forEach((el) => el.classList.remove("selected"));
item.classList.add("selected");
Runner.selectedScenario = sc.path;
Runner.selectedScenarioInfo = sc;
document.getElementById("selected-scenario").textContent = sc.path;
document.getElementById("run-btn").disabled = false;
document.getElementById("llm-assignment-panel").hidden = false;
Runner._renderWeightPanel(sc);
document.getElementById("weight-config-panel").hidden = false;
});
}
return item;
},
// Render one metric weight row per metric of the selected scenario (dynamic)
_renderWeightPanel(sc) {
const metricRows = document.getElementById("metric-weight-rows");
metricRows.innerHTML = "";
const metrics = sc.metrics || [];
const existingWeights = sc.metric_weights || {};
metrics.forEach(metric => {
const row = document.createElement("div");
row.className = "weight-row";
const currentVal = existingWeights[metric] != null ? existingWeights[metric] : 1.0;
row.innerHTML = `
<span class="weight-row-label">${App.escape(metric)}</span>
<input class="weight-row-input" type="number" min="0" step="0.1"
data-metric="${App.escape(metric)}" value="${currentVal}" />
`;
metricRows.appendChild(row);
});
// Populate rows for any doc weights already present in the scenario
const docRows = document.getElementById("doc-weight-rows");
docRows.innerHTML = "";
const existingDocWeights = sc.doc_weights || {};
Object.entries(existingDocWeights).forEach(([docName, w]) => {
Runner._addDocWeightRow(docName, w);
});
},
// Append one doc weight input row
_addDocWeightRow(docName = "", weight = 1.0) {
const container = document.getElementById("doc-weight-rows");
const row = document.createElement("div");
row.className = "weight-row";
row.innerHTML = `
<input class="doc-weight-name" type="text" placeholder="PDF 文件名(如 322_双源CT.pdf)" value="${App.escape(docName)}" />
<input class="weight-row-input" type="number" min="0" step="0.1" value="${weight}" />
<button class="weight-row-remove" title="删除">✕</button>
`;
row.querySelector(".weight-row-remove").addEventListener("click", () => row.remove());
container.appendChild(row);
},
// Collect the current values from the weight panel
_collectWeights() {
const metricWeights = {};
document.querySelectorAll("#metric-weight-rows .weight-row-input").forEach(input => {
const metric = input.dataset.metric;
const val = parseFloat(input.value);
if (metric && !isNaN(val)) metricWeights[metric] = val;
});
const docWeights = {};
document.querySelectorAll("#doc-weight-rows .weight-row").forEach(row => {
const nameInput = row.querySelector(".doc-weight-name");
const valInput = row.querySelector(".weight-row-input");
if (!nameInput || !valInput) return;
const name = nameInput.value.trim();
const val = parseFloat(valInput.value);
if (name && !isNaN(val)) docWeights[name] = val;
});
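// Skip sending when every metric weight is 1.0 and there are no doc weights,
// but only if the scenario had no weights configured. Otherwise, assuming the
// backend treats null as "leave unchanged", a reset to 1.0 would be dropped
// and the old YAML weights would still apply.
const info = Runner.selectedScenarioInfo || {};
const hadWeights = Object.keys(info.metric_weights || {}).length > 0
|| Object.keys(info.doc_weights || {}).length > 0;
const allDefault = Object.values(metricWeights).every(v => Math.abs(v - 1.0) < 1e-9)
&& Object.keys(docWeights).length === 0;
if (allDefault && !hadWeights) return { metricWeights: null, docWeights: null };
return { metricWeights, docWeights };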
},
async trigger() {
if (!Runner.selectedScenario) return;
const runBtn = document.getElementById("run-btn");
runBtn.disabled = true;
const panel = document.getElementById("task-panel");
const logBox = document.getElementById("task-log");
const statusBadge = document.getElementById("task-status");
const reportBtn = document.getElementById("view-report-btn");
panel.hidden = false;
reportBtn.hidden = true;
logBox.textContent = "";
Runner._setStatus(statusBadge, "queued");
try {
await Runner._applyProfilesIfNeeded(logBox);
const resp = await API.triggerEvaluation(Runner.selectedScenario);
Runner.poll(resp.task_id);
} catch (err) {
Runner._setStatus(statusBadge, "failed");
logBox.textContent = (logBox.textContent ? logBox.textContent + "\n" : "") + `触发失败:${err.message}`;
runBtn.disabled = false;
}
},
async _applyProfilesIfNeeded(logBox) {
const judgeId = document.getElementById("role-judge").value;
const answerId = document.getElementById("role-answer").value;
const datasetId = document.getElementById("role-dataset").value;
const { metricWeights, docWeights } = Runner._collectWeights();
if (!judgeId && !answerId && !datasetId && !metricWeights && !docWeights) return;
logBox.textContent = "正在将 LLM 配置和权重写入场景文件…\n";
const body = {
scenario_path: Runner.selectedScenario,
judge_profile_id: judgeId || null,
answer_profile_id: answerId || null,
dataset_profile_id: datasetId || null,
metric_weights: metricWeights,
doc_weights: docWeights,
};
const result = await API.applyProfiles(body);
const fields = (result.patched_fields || []).join(", ");
logBox.textContent += fields
? `✓ 已更新字段:${fields}\n`
: "(未找到可更新的字段,继续运行)\n";
},
poll(taskId) {
const logBox = document.getElementById("task-log");
const statusBadge = document.getElementById("task-status");
const reportBtn = document.getElementById("view-report-btn");
const runBtn = document.getElementById("run-btn");
if (Runner.pollTimer) clearInterval(Runner.pollTimer);
Runner.pollTimer = setInterval(async () => {
try {
const status = await API.taskStatus(taskId);
logBox.textContent = (status.logs || []).join("\n");
logBox.scrollTop = logBox.scrollHeight;
Runner._setStatus(statusBadge, status.status);
if (status.status === "completed" || status.status === "failed") {
clearInterval(Runner.pollTimer);
runBtn.disabled = false;
if (status.status === "completed" && status.run_id) {
Runner.lastRunId = status.run_id;
sessionStorage.setItem("rag_run_id", status.run_id);
reportBtn.hidden = false;
}
}
} catch (err) {
clearInterval(Runner.pollTimer);
logBox.textContent += `\n轮询失败:${err.message}`;
runBtn.disabled = false;
}
}, 1200);
},
_setStatus(badge, status) {
badge.textContent = status;
badge.className = "badge " + status;
},
};
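The rule that decides whether the panel sends weights at all has one edge case that is easy to miss, so it is restated here as a small Python sketch that could also back a server-side test. It assumes that a null (None) value means "leave the scenario YAML untouched", which this plan does not state explicitly. The function weights_payload and its scenario_has_weights flag are hypothetical.

```python
from __future__ import annotations


def weights_payload(
    metric_weights: dict[str, float],
    doc_weights: dict[str, float],
    scenario_has_weights: bool,
) -> tuple[dict[str, float] | None, dict[str, float] | None]:
    """Return the (metric_weights, doc_weights) pair to send, or (None, None) to skip."""
    all_default = (
        all(abs(w - 1.0) < 1e-9 for w in metric_weights.values())
        and not doc_weights
    )
    # Skipping is only safe when the scenario has nothing to overwrite:
    # if the YAML already carries weights, an all-1.0 panel is a real reset.
    if all_default and not scenario_has_weights:
        return None, None
    return metric_weights, doc_weights


untouched = weights_payload({"faithfulness": 1.0}, {}, scenario_has_weights=False)
reset = weights_payload({"faithfulness": 1.0}, {}, scenario_has_weights=True)
custom = weights_payload({"faithfulness": 2.0}, {}, scenario_has_weights=False)
```

The second call is the interesting one: the panel shows only default values, yet the weights must still be sent so that previously configured values are replaced.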
- Step 4: Update renderMetricCards in report.js
In renderMetricCards, after the metrics.forEach loop that renders individual cards, append this block to show the weighted_score card:
// Append at the end of renderMetricCards, after metrics.forEach:
const wsValue = report.weighted_score_mean;
const wsCard = document.createElement("div");
wsCard.className = "metric-card weighted-score-card";
const wsCls = App.scoreClass(wsValue);
const wsText = wsValue === null || wsValue === undefined ? "n/a" : wsValue.toFixed(2);
wsCard.innerHTML = `
<div class="metric-value ${wsCls}">${wsText}</div>
<div class="metric-name">综合加权得分</div>
`;
wrap.appendChild(wsCard);
- Step 5: Verify app loads without JS errors
Start the webapp:
python webmain.py
Open http://localhost:8000, navigate to「新建评估」, click a scenario and verify:
- The weight panel appears below LLM 角色配置
- Each metric is listed with its configured weight, or 1.0 when none is set
- The「添加文档权重」button adds a new row
- Navigate to any report and verify that the「综合加权得分」card appears
- Step 6: Commit
git add webapp/static/index.html webapp/static/js/runner.js webapp/static/css/app.css webapp/static/js/report.js
git commit -m "feat: add weight config panel to 新建评估 and weighted_score card to report"
Task 9: Full regression test
- Step 1: Run all tests
python -m pytest tests/ -v --tb=short
Expected: all previously-passing tests still PASS, new tests PASS.
Note: failures in webapp.test_* (module import path issues) and test_offline_eval::test_normalize_sample_pdf_offline_smoke_row (missing CSV fixture) predate this feature and are not regressions.
- Step 2: Run pipeline and llm-profiles tests explicitly
python -m pytest tests/test_pipeline.py tests/webapp/test_llm_profiles_api.py tests/test_weights.py -v
Expected: all PASS
- Step 3: Final commit
git add .
git commit -m "feat: metric & doc weights — full implementation complete
- New rag_eval/metrics/weights.py with pure-function weight computation
- Scenario YAML supports metric_weights and doc_weights (optional, backward-compatible)
- scores.csv gains weighted_score and sample_weight columns
- summary.md shows weighted metric means and overall weighted_score
- yaml_patcher writes metric_weights/doc_weights on apply
- report_builder uses weighted means; ReportData gains weighted_score_mean
- 新建评估 page: weight config panel with metric weight inputs and doc weight rows
- 报告详情 page: 综合加权得分 card
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>"