更新

feat(webapp): add session persistence via URL hash routing + sessionStorage
- app.js: hash-based router (#runs / #new / #profiles / #report/{runId}) - navigate() pushes history entries for back/forward support - _restoreSession() reads hash on load and popstate - sessionStorage fallback for same-tab refreshes - run-card highlights selected run (.run-card.selected) - runner.js: use App.navigate() for report redirect; persist lastRunId to sessionStorage - index.html: report nav button starts disabled (enabled on run select/restore) - app.css: .run-card.selected with petrol border + ring Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-16 18:12:33 +08:00 · 2026-06-16 17:55:07 +08:00 · 2026-06-16 17:26:37 +08:00 · 2026-06-16 17:12:32 +08:00 · 2026-06-16 17:06:19 +08:00 · 2026-06-16 17:03:25 +08:00
49 changed files with 5448 additions and 83 deletions
--- a/.env.example
+++ b/.env.example
@@ -1,11 +1,22 @@
 # ===== LLM 连接配置（RAGAS 评测 + 生成） =====
 # 所有模型共用同一个 OpenAI 兼容 endpoint
 # 在 Web 控制台的「LLM 配置」页面可以保存多个命名配置，
 # 并在运行评估时按角色（Judge / Answer / Dataset）分别选择覆盖。
 OPENAI_API_KEY=your-api-key
 OPENAI_BASE_URL=http://6.86.80.4:30080/v1
 OPENAI_TIMEOUT_SECONDS=180
 # 默认评测模型（可在场景 YAML 或 Web 控制台 LLM 配置中覆盖）
 RAGAS_JUDGE_MODEL=deepseek-v4-flash
 RAGAS_EMBEDDING_MODEL=text-embedding-v3
 # 评估并发控制（启用 7 个指标时建议 RAGAS_METRIC_TIMEOUT_SECONDS=300）
 BATCH_SIZE=8
 RAGAS_METRIC_TIMEOUT_SECONDS=300
-# ===== 阿里云文档解析 =====
+# ===== 阿里云文档解析（dataset build 功能需要） =====
 ALIBABA_ACCESS_KEY_ID=
 ALIBABA_ACCESS_KEY_SECRET=
 ALIBABA_ENDPOINT=docmind-api.cn-hangzhou.aliyuncs.com
@@ -14,6 +25,8 @@ ALIYUN_PARSE_TIMEOUT_SECONDS=900
 ALIYUN_PARSE_LAYOUT_STEP_SIZE=50
 ALIYUN_LLM_ENHANCEMENT=true
 ALIYUN_ENHANCEMENT_MODE=VLM
-DOCUMENT_PARSE_ARTIFACT_PREFIX=artifacts
+DOCUMENT_PARSE_ARTIFACT_PREFIX=outputs/dataset-builds
 PARSER_FAILURE_MODE=fail
 # 生成题库时使用的模型（可在 Web 控制台 LLM 配置中按场景覆盖）
 DATASET_GENERATOR_MODEL=qwen3.6-plus
--- a/configs/llm_profiles.json
+++ b/configs/llm_profiles.json
@@ -0,0 +1,64 @@
 {
  "profiles": [
    {
      "profile_id": "c8e185a64fa0",
      "name": "glm-5",
      "model": "glm-5",
      "base_url": "http://6.86.80.4:30080/v1",
      "api_key": "sk-fVr9KmDZNC4pGDBQj0EUWz9bDmFzNxjYC9EzZpe2bVDsxtz8",
      "timeout_seconds": 600,
      "created_at": "2026-06-16T09:16:22.438297+00:00",
      "updated_at": "2026-06-16T09:19:03.089865+00:00"
    },
    {
      "profile_id": "54ddfe5aeb46",
      "name": "deepseek-v4-pro",
      "model": "deepseek-v4-pro",
      "base_url": "http://6.86.80.4:30080/v1",
      "api_key": "sk-fVr9KmDZNC4pGDBQj0EUWz9bDmFzNxjYC9EzZpe2bVDsxtz8",
      "timeout_seconds": 600,
      "created_at": "2026-06-16T09:17:08.473904+00:00",
      "updated_at": "2026-06-16T09:19:07.504082+00:00"
    },
    {
      "profile_id": "25d035eef194",
      "name": "qwen3.5-flash",
      "model": "qwen3.5-flash",
      "base_url": "http://6.86.80.4:30080/v1",
      "api_key": "sk-fVr9KmDZNC4pGDBQj0EUWz9bDmFzNxjYC9EzZpe2bVDsxtz8",
      "timeout_seconds": 600,
      "created_at": "2026-06-16T09:18:24.265619+00:00",
      "updated_at": "2026-06-16T09:18:24.265619+00:00"
    },
    {
      "profile_id": "ff1d0f417a5d",
      "name": "deepseek-v4-flash",
      "model": "deepseek-v4-flash",
      "base_url": "http://6.86.80.4:30080/v1",
      "api_key": "sk-fVr9KmDZNC4pGDBQj0EUWz9bDmFzNxjYC9EzZpe2bVDsxtz8",
      "timeout_seconds": 600,
      "created_at": "2026-06-16T09:18:57.091549+00:00",
      "updated_at": "2026-06-16T09:18:57.091549+00:00"
    },
    {
      "profile_id": "5b04c49df9df",
      "name": "text-embedding-v4",
      "model": "text-embedding-v4",
      "base_url": "http://6.86.80.4:30080/v1",
      "api_key": "sk-fVr9KmDZNC4pGDBQj0EUWz9bDmFzNxjYC9EzZpe2bVDsxtz8",
      "timeout_seconds": 600,
      "created_at": "2026-06-16T09:19:49.104004+00:00",
      "updated_at": "2026-06-16T09:19:49.104004+00:00"
    },
    {
      "profile_id": "b4f7c82859d5",
      "name": "text-embedding-v3",
      "model": "text-embedding-v3",
      "base_url": "http://6.86.80.4:30080/v1",
      "api_key": "sk-fVr9KmDZNC4pGDBQj0EUWz9bDmFzNxjYC9EzZpe2bVDsxtz8",
      "timeout_seconds": 600,
      "created_at": "2026-06-16T09:20:18.266540+00:00",
      "updated_at": "2026-06-16T09:20:18.266540+00:00"
    }
  ]
 }
--- a/docs/rag-eval-architecture.md
+++ b/docs/rag-eval-architecture.md
@@ -318,6 +318,10 @@ metrics:
  - answer_relevancy
  - context_recall
  - context_precision
  # 可选：鲁棒性 / 端到端指标（需数据集含 ground_truth），完整列表见 §9.4
  # - noise_sensitivity
  # - factual_correctness
  # - semantic_similarity
 output_dir: runs/legal-assistant-offline-baseline
 runtime:
  batch_size: 4
@@ -338,7 +342,7 @@ runtime:
 - `embedding_model`
  - 负责向量相关指标的模型
 - `metrics`
-  - 本次启用的指标列表
+  - 本次启用的指标列表（完整可选项与依赖见 §9.4）
 - `output_dir`
  - 本次运行结果输出目录
 - `runtime.batch_size`
@@ -399,6 +403,32 @@ app_adapter:
 - embedding model
 - 指标实例
 当前支持的指标（`rag_eval/metrics/registry.py` 中的 `SUPPORTED_METRICS`）：
 | 指标名 | 层面 | 依赖 |
 |---|---|---|
 | `faithfulness` | 生成 | judge model |
 | `answer_relevancy` | 生成 | judge model + embedding |
 | `context_recall` | 检索 | judge model + ground_truth |
 | `context_precision` | 检索 | judge model + ground_truth |
 | `noise_sensitivity` | 鲁棒性 | judge model + ground_truth |
 | `factual_correctness` | 端到端 | judge model + ground_truth |
 | `semantic_similarity` | 端到端 | embedding + ground_truth（无 LLM 调用） |
 后四项以 `ground_truth`（标准答案）为参照，数据集必须提供该字段。新增指标统一在 `registry.py` / `factory.py` / `pipeline.py` 三处对齐装配。
 **Optimization Advisor（§11 优化策略落地）：**
 评测结束后，若场景配置 `optimization_advisor: true`，则自动调用 `rag_eval/advisor/` 模块：
 - 规则引擎（`rules.py`）对 7 个指标各自设阈值，识别触发项并选取 top-3 低分样本
 - LLM 分析器（`llm_analyzer.py`）结合低分样本生成中文 Markdown 优化建议（复用 judge_model，失败自动降级为纯规则报告）
 - 写出层（`writer.py`）输出 `optimization_advice.md` 并打日志摘要
 ```yaml
 # 场景配置示例
 optimization_advisor: true
 ```
 ### 9.5 并发控制
 执行层负责并发上限，不把并发策略散落到各指标实现中。
--- a/docs/rag-eval-engine-flow.md
+++ b/docs/rag-eval-engine-flow.md
@@ -316,11 +316,21 @@ adapter 层的目标是：**把不同类型的目标应用，统一成同一套
 当前支持的指标包括：
 核心检索 / 生成指标（始终可用）：
 - `faithfulness`
 - `answer_relevancy`
 - `context_recall`
 - `context_precision`
 鲁棒性 / 端到端指标（架构设计 §10.2，需数据集含 `ground_truth`）：
 - `noise_sensitivity` —— 鲁棒性：对检索噪声的敏感度
 - `factual_correctness` —— 端到端：回答相对标准答案的事实正确性
 - `semantic_similarity` —— 端到端：回答与标准答案的语义相似度（基于 embedding，无 LLM 调用）
 所有指标都通过同一套装配点接入：`registry.py`（校验白名单）、`factory.py`（实例化）、`pipeline.py`（`ascore` 入参分发），新增指标只需在这三处对齐即可。
 所以 metric pipeline 的职责可以总结为：
 **把标准样本转换成结构化评分结果。**
@@ -414,3 +424,39 @@ main.py
 - 可以把每次实验的资产稳定留住
 这也是它和一次性离线脚本的根本区别。
 ---
 ## 15. Optimization Advisor 链路
 相关代码：
 - `rag_eval/advisor/__init__.py` — 外部入口 `run_advisor()`
 - `rag_eval/advisor/rules.py` — 规则引擎（纯函数，无 LLM），7 条指标诊断规则
 - `rag_eval/advisor/llm_analyzer.py` — LLM 分析器（复用 judge_model llm 实例，失败自动降级）
 - `rag_eval/advisor/writer.py` — 写出 `optimization_advice.md` + 日志摘要
 Advisor 在 `write_run_artifacts()` 之后触发，仅当场景配置 `optimization_advisor: true` 时生效，默认关闭。
 执行链路：
 ```text
 run_advisor(result, scenario, llm)
  -> rules.diagnose(score_rows, metrics)     # 识别异常指标，选取 top-3 低分样本
  -> llm_analyzer.analyze(diagnoses, llm)   # LLM 生成中文建议（失败自动降级为纯规则报告）
  -> writer.write_advice(...)               # 写 optimization_advice.md + 日志摘要
 ```
 输出产物追加在现有 run 目录：
 ```text
 outputs/online/siemens-pdf-question-bank/<run_id>/
  scenario.snapshot.yaml
  scores.csv
  invalid.csv
  summary.md
  metadata.json
  optimization_advice.md    <- 新增（optimization_advisor: true 时生成）
 ```
 规则引擎对 7 个指标各自设 warning / critical 双档阈值，`noise_sensitivity` 为"越低越好"（方向相反）。所有诊断均附带 top-3 低分样本，喂给 LLM 生成针对具体内容的中文建议。
--- a/docs/superpowers/plans/2026-06-16-llm-profile-manager.md
+++ b/docs/superpowers/plans/2026-06-16-llm-profile-manager.md
--- a/docs/superpowers/plans/2026-06-16-optimization-advisor.md
+++ b/docs/superpowers/plans/2026-06-16-optimization-advisor.md
--- a/docs/superpowers/specs/2026-06-16-optimization-advisor-design.md
+++ b/docs/superpowers/specs/2026-06-16-optimization-advisor-design.md
@@ -0,0 +1,225 @@
 # 优化顾问模块设计 Spec
 - 日期：2026-06-16
 - 状态：已确认，进入实现。
 ## 1. 目标
 在现有 RAG 评测流程结束后，新增一个**优化顾问模块**（Optimization Advisor），根据本次评测的多项指标分数与低分样本，自动诊断指标偏低的原因并给出针对性的优化建议，输出为中文 Markdown 报告 + 日志摘要。
 对应架构设计 §11（优化策略）：将"指标到动作的映射"（§11.2）从文档形式落地为代码自动执行。
 ---
 ## 2. 决策摘要
 | 决策点 | 选择 |
 |---|---|
 | 输出形式 | `optimization_advice.md`（文件）+ 控制台/日志摘要（双输出） |
 | 生成机制 | 规则引擎定位异常指标 → LLM 结合低分样本二次解读（两层） |
 | 触发方式 | YAML 场景文件显式声明 `optimization_advisor: true`，默认关闭 |
 | LLM 实例 | 复用 `build_models()` 已创建的 `llm` 实例，不重建 client |
 | 包位置 | `rag_eval/advisor/`（独立包，对外暴露 `run_advisor()` 单一入口） |
 ---
 ## 3. 架构
 ### 3.1 执行链路
 ```
 run_scenario()
  → load_scenario()           # 读 YAML，解析 optimization_advisor 字段
  → build_models()            # 已有：创建 llm, embeddings
  → build_metric_pipeline()   # 已有
  → Evaluator.evaluate()      # 已有：打分 → EvaluationResult
  → write_run_artifacts()     # 已有：scores.csv / summary.md / ...
  → run_advisor(              # 新增（3 行）
        result, scenario, llm, artifact_paths
    )
      → rules.diagnose(score_rows)           # 规则引擎：返回 Diagnosis 列表
      → llm_analyzer.analyze(diags, samples) # LLM：生成中文 Markdown 建议
      → writer.write(advice, paths)          # 写文件 + 打日志
 ```
 ### 3.2 新增文件
 ```
 rag_eval/advisor/
  __init__.py          ← 暴露 run_advisor()，外部唯一入口
  rules.py             ← 纯函数规则引擎，无 LLM，可单独单测
  llm_analyzer.py      ← 接收 llm 实例 + 诊断结构 → 中文 Markdown
  writer.py            ← 写 optimization_advice.md，打日志摘要
 ```
 ### 3.3 修改文件（最小改动）
 | 文件 | 改动 |
 |---|---|
 | `rag_eval/shared/models.py` | `Scenario` 加 `optimization_advisor: bool = False` 字段 |
 | `rag_eval/config/schema.py` | `ScenarioModel` 加同名字段 + 透传到 `Scenario` |
 | `rag_eval/config/loader.py` | 透传 `optimization_advisor` 到 `Scenario` 构造 |
 | `rag_eval/reporting/artifacts.py` | `RunArtifactPaths` 加 `advice_md: Path` 字段 + `build_artifact_paths()` 加赋值 |
 | `rag_eval/execution/runner.py` | `run_scenario()` 末尾：`build_models` 返回 llm 传入，条件调用 `run_advisor()` |
 ### 3.4 输出产物
 ```
 outputs/online/siemens-pdf-question-bank/<run_id>/
  scenario.snapshot.yaml
  scores.csv
  invalid.csv
  summary.md
  metadata.json
  optimization_advice.md    ← 新增（optimization_advisor: true 时生成）
 ```
 ---
 ## 4. 规则引擎（rules.py）
 ### 4.1 数据结构
 ```python
@dataclass
 class Diagnosis:
    metric: str           # 指标名
    mean_score: float     # 本次均值
    threshold: float      # 警戒阈值
    severity: str         # "warning" | "critical"
    root_causes: list[str]  # 可能原因（来自架构设计 §11.2）
    suggested_actions: list[str]  # 对应可调阶段
    low_samples: list[dict]  # 分数最低的 N 条样本（含 question/answer/ground_truth）
 ```
 ### 4.2 七条指标诊断规则
 阈值参考 RAG 评测最佳实践，分 warning / critical 两档：
 | 指标 | warning | critical | 根因方向 | 对应优化阶段（§11.2） |
 |---|---|---|---|---|
 | `faithfulness` | < 0.7 | < 0.5 | 生成未严格基于检索片段 / 幻觉 | 生成 prompt grounding、开启校验 |
 | `answer_relevancy` | < 0.7 | < 0.5 | 回答偏离问题 / 格式冗余 | 查询改写、生成 prompt 格式 |
 | `context_recall` | < 0.7 | < 0.5 | 检索遗漏关键信息 | 多查询、问题分解、Step-back、加大过召回 |
 | `context_precision` | < 0.6 | < 0.4 | 检索引入过多噪声 / 排序差 | 后检索重排、压缩、相关性过滤 |
 | `noise_sensitivity` | > 0.3 | > 0.5 | 回答被噪声片段干扰（越低越好） | 后检索相关性过滤、重排 |
 | `factual_correctness` | < 0.6 | < 0.4 | 回答事实与标准答案偏差大 | 检索与生成综合优化 |
 | `semantic_similarity` | < 0.7 | < 0.5 | 回答语义与标准答案差距大 | 生成 prompt、检索质量 |
 > 注：`noise_sensitivity` 越低越好（0=完全不受噪声影响），其阈值方向与其余相反。
 ### 4.3 低分样本选取
 每个触发诊断的指标，取该指标分数最低的 **top-3** 样本（排除 NaN）附入 `Diagnosis.low_samples`，字段包含 `sample_id / question / answer / ground_truth / <metric_score>`。
 ---
 ## 5. LLM 分析器（llm_analyzer.py）
 ### 5.1 输入
 - `diagnoses: list[Diagnosis]` — 规则引擎输出（仅触发阈值的指标）
 - `llm` — 已有 RAGAS LLM 实例（scenario 的 judge_model）
 - `scenario_name: str` — 用于报告标题
 ### 5.2 Prompt 设计
 使用**一次 LLM 调用**，把所有触发诊断的指标和低分样本一起发送：
 ```
 你是一个 RAG 系统优化专家，正在分析西门子医疗 CT 文档问答系统的评测结果。
 请用中文撰写一份优化建议报告，格式为 Markdown。
 ## 评测诊断摘要
 {for each diagnosis: 指标名、均值、阈值、可能原因、建议动作}
 ## 低分样本示例
 {for each diagnosis: top-3 低分样本的 question / answer / ground_truth}
 ## 要求
 1. 按指标分节（## 指标名），先解释"为什么低"，再给出"具体怎么改"
 2. "具体怎么改"要结合低分样本的具体内容，而不只是泛泛建议
 3. 最后写一节 ## 优先优化次序，按性价比排序（参考：不增加调用次数的优先）
 4. 语言简洁，面向工程师，不要废话
 ```
 ### 5.3 输出
 LLM 返回的 Markdown 字符串，直接写入 `optimization_advice.md`（在报告头部追加运行元信息）。
 ### 5.4 失败降级
 LLM 调用失败（超时/异常）时：降级为**纯规则报告**（只输出规则引擎的诊断结构，不含 LLM 解读），文件照常写出，错误信息写入报告末尾，不阻断整个评测流程。
 ---
 ## 6. 写出层（writer.py）
 ### 6.1 文件写出
 `optimization_advice.md` 结构：
 ```markdown
 # 优化建议报告 — <scenario_name>
 - run_id: `<run_id>`
 - 生成时间: `<timestamp>`
 - judge_model: `<model>`
 ---
 <LLM 生成的 Markdown 正文>
 ```
 ### 6.2 日志摘要
 `run_advisor()` 完成后向 `logger.info` 打印一条精简摘要（单行，适合 `run_eval.bat` 结束后一眼扫到）：
 ```
 [advisor] 触发诊断 3 项: faithfulness(0.42, critical) context_recall(0.58, warning) noise_sensitivity(0.41, critical)
 [advisor] 优化建议已写出: outputs/online/.../optimization_advice.md
 ```
 ---
 ## 7. YAML 配置
 场景文件新增一个顶层字段：
 ```yaml
 optimization_advisor: true   # 默认 false；true 时评测结束后自动生成优化建议
 ```
 后续若需精细配置（阈值覆盖、top-N 低分样本数），可扩展为：
 ```yaml
 optimization_advisor:
  enabled: true
  top_low_samples: 3          # 每个指标取几条低分样本（默认 3）
  # thresholds:               # 可选：覆盖默认阈值
  #   faithfulness: 0.65
 ```
 本轮实现仅支持 `optimization_advisor: true/false`，扩展接口预留但不实现。
 ---
 ## 8. 测试策略
 | 测试 | 文件 | 说明 |
 |---|---|---|
 | 规则引擎单测 | `tests/test_advisor_rules.py` | 纯函数，无 LLM，覆盖每条规则的 warning/critical 触发、NaN 跳过、low_samples 选取 |
 | writer 单测 | `tests/test_advisor_writer.py` | mock Diagnosis 列表，验证 md 文件写出格式和日志输出 |
 | 集成（可选） | 现有 `tests/test_online_eval.py` | 验证 `optimization_advisor: true` 场景下 advice_md 存在 |
 LLM 分析器不写单测（依赖网络），由集成场景覆盖。
 ---
 ## 9. 不覆盖（本轮边界）
 - 不支持跨版本对比分析（只分析本次 run）
 - 不支持批量场景聚合建议
 - 不建设 Web UI 展示
 - LLM 分析器 prompt 本轮不做多语言适配（直接中文）
 - advisor 阈值本轮硬编码在 `rules.py`，不从 YAML 读取
--- a/main.py
+++ b/main.py
@@ -1,6 +1,8 @@
 from __future__ import annotations
 import argparse
 import logging
 from pathlib import Path
 from rag_eval.dataset_builder.runner import run_dataset_build
 from rag_eval.execution.runner import run_scenario
@@ -18,18 +20,33 @@ def parse_args() -> argparse.Namespace:
        "--dataset-build-config",
        help="Path to a YAML dataset build config file.",
    )
    parser.add_argument(
        "--log-file",
        default=None,
        help="Write evaluation logs to this file (in addition to stderr). "
             "Example: logs/eval.log",
    )
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Logging verbosity level (default: INFO). Use DEBUG for per-metric detail.",
    )
    return parser.parse_args()
 def main() -> None:
    """Dispatch the CLI call to the requested workflow."""
    args = parse_args()
    log_level = getattr(logging, args.log_level.upper(), logging.INFO)
    log_file = Path(args.log_file) if args.log_file else None
    if args.dataset_build_config:
        result = run_dataset_build(args.dataset_build_config)
        print(f"Completed dataset build: {result.artifact_paths.root_dir}")
        return
-    result = run_scenario(args.scenario)
+    result = run_scenario(args.scenario, log_file=log_file, log_level=log_level)
    print(f"Completed run: {result.scenario.output_dir}")
--- a/rag_eval/advisor/init.py
+++ b/rag_eval/advisor/init.py
@@ -0,0 +1,67 @@
 """Optimization advisor: rule-based diagnosis + LLM-powered recommendations."""
 from __future__ import annotations
 import asyncio
 import logging
 from typing import Any
 from rag_eval.reporting.artifacts import build_artifact_paths
 from rag_eval.shared.models import EvaluationResult, Scenario
 from .llm_analyzer import analyze
 from .rules import Diagnosis, diagnose
 from .writer import write_advice
 logger = logging.getLogger("rag_eval.advisor")
 __all__ = ["run_advisor", "Diagnosis", "diagnose"]
 def run_advisor(
    result: EvaluationResult,
    scenario: Scenario,
    llm: Any,
 ) -> None:
    """Run the full optimization advisor pipeline after an evaluation completes.
    Skips silently if scenario.optimization_advisor is False.
    Never raises — failures are logged as warnings, not exceptions.
    Args:
        result: Completed EvaluationResult from Evaluator.evaluate().
        scenario: The resolved Scenario (provides metrics, judge_model, output_dir).
        llm: Pre-built RAGAS LLM instance (from build_models()) for LLM analysis.
    """
    if not scenario.optimization_advisor:
        return
    logger.info("[advisor] starting optimization analysis  scenario=%s", scenario.scenario_name)
    try:
        artifact_paths = build_artifact_paths(scenario.output_dir, result.run_id)
        if artifact_paths.advice_md is None:
            logger.warning("[advisor] advice_md path not set in RunArtifactPaths — skipping")
            return
        diagnoses = diagnose(result.score_rows, scenario.metrics)
        logger.info("[advisor] rule diagnosis complete: %d metric(s) triggered", len(diagnoses))
        if diagnoses:
            llm_markdown = asyncio.run(analyze(diagnoses, llm, scenario.scenario_name))
        else:
            llm_markdown = ""
        write_advice(
            diagnoses=diagnoses,
            llm_markdown=llm_markdown,
            advice_path=artifact_paths.advice_md,
            scenario_name=scenario.scenario_name,
            run_id=result.run_id,
            judge_model=scenario.judge_model,
        )
    except Exception as exc:
        logger.warning(
            "[advisor] advisor failed (%s: %s) — evaluation result is unaffected",
            type(exc).__name__, exc,
        )
--- a/rag_eval/advisor/llm_analyzer.py
+++ b/rag_eval/advisor/llm_analyzer.py
@@ -0,0 +1,100 @@
 """LLM-powered analysis of rule diagnostics and low-score samples."""
 from __future__ import annotations
 import logging
 from typing import Any
 from .rules import Diagnosis
 logger = logging.getLogger("rag_eval.advisor")
 _PROMPT_TEMPLATE = """\
 你是一个 RAG 系统优化专家，正在分析西门子医疗 CT 文档问答系统的评测结果。
 请用中文撰写一份优化建议报告，格式为 Markdown。
 ## 评测诊断摘要
 {diagnosis_summary}
 ## 低分样本示例
 {low_sample_text}
 ## 报告要求
 1. 按指标分节（## 指标名  [severity]），先解释"为什么低"（结合低分样本具体分析），再给出"具体怎么改"
 2. "具体怎么改"要结合低分样本的实际内容，而不只是泛泛建议
 3. 最后写一节 **## 优先优化次序**，按性价比排序（不增加 LLM 调用次数的优化优先）
 4. 语言简洁，面向工程师，不要废话，不要重复列表内容
 只输出 Markdown 报告正文，不要任何前置说明。
 """
 def _build_diagnosis_summary(diagnoses: list[Diagnosis]) -> str:
    lines = []
    for d in diagnoses:
        direction = "（越低越好）" if d.metric == "noise_sensitivity" else ""
        lines.append(
            f"- **{d.metric}** {direction} 均值={d.mean_score:.4f}，"
            f"阈值={d.threshold}，严重程度={d.severity}"
        )
        lines.append(f"  - 可能原因：{'; '.join(d.root_causes)}")
        lines.append(f"  - 建议动作：{'; '.join(d.suggested_actions)}")
    return "\n".join(lines)
 def _build_low_sample_text(diagnoses: list[Diagnosis]) -> str:
    lines = []
    for d in diagnoses:
        if not d.low_samples:
            continue
        lines.append(f"### {d.metric} 低分样本（最多 3 条）")
        for i, s in enumerate(d.low_samples, 1):
            score = s.get(d.metric, "N/A")
            lines.append(f"\n**样本 {i}**（分数={score}）")
            lines.append(f"- 问题：{s.get('question', '')}")
            lines.append(f"- 回答：{s.get('answer', '')[:300]}")
            lines.append(f"- 标准答案：{s.get('ground_truth', '')[:200]}")
    return "\n".join(lines)
 async def analyze(
    diagnoses: list[Diagnosis],
    llm: Any,
    scenario_name: str,
 ) -> str:
    """Call the judge LLM to generate a Chinese optimization report.
    Args:
        diagnoses: Non-empty list of Diagnosis from rules.diagnose().
        llm: RAGAS LLM wrapper (has .agenerate() method).
        scenario_name: Used only for logging.
    Returns:
        LLM-generated Markdown string, or "" on failure (triggers writer fallback).
    """
    if not diagnoses:
        return ""
    diagnosis_summary = _build_diagnosis_summary(diagnoses)
    low_sample_text = _build_low_sample_text(diagnoses)
    prompt = _PROMPT_TEMPLATE.format(
        diagnosis_summary=diagnosis_summary,
        low_sample_text=low_sample_text,
    )
    try:
        logger.info("[advisor] calling LLM for optimization analysis  scenario=%s", scenario_name)
        from langchain_core.messages import HumanMessage
        # Use the underlying langchain chat model directly (RAGAS LangchainLLMWrapper wraps BaseChatModel)
        response = await llm.langchain_llm.ainvoke([HumanMessage(content=prompt)])
        text = response.content.strip()
        logger.info("[advisor] LLM analysis complete  chars=%d", len(text))
        return text
    except Exception as exc:
        logger.warning(
            "[advisor] LLM analysis failed (%s: %s) — falling back to rule report",
            type(exc).__name__, exc,
        )
        return ""
--- a/rag_eval/advisor/rules.py
+++ b/rag_eval/advisor/rules.py
@@ -0,0 +1,236 @@
 """Rule-based diagnostic engine for RAG evaluation metric scores."""
 from __future__ import annotations
 import math
 from dataclasses import dataclass, field
 from typing import Any
@dataclass
 class MetricRule:
    """Threshold configuration and diagnostic text for one metric."""
    warning_threshold: float
    critical_threshold: float
    higher_is_better: bool  # False for noise_sensitivity
    root_causes: list[str]
    suggested_actions: list[str]
 METRIC_RULES: dict[str, MetricRule] = {
    "faithfulness": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "生成回答包含检索片段中不支持的陈述（幻觉）",
            "生成阶段未严格遵循 grounding 约束",
            "校验阶段未开启或未生效",
        ],
        suggested_actions=[
            "强化生成 prompt 的 grounding 约束（'只依据参考资料作答'）",
            "开启校验阶段（validation: by_scenario）",
            "检查低分样本中模型是否引用了片段外的知识",
        ],
    ),
    "answer_relevancy": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "回答偏离问题主旨或包含大量冗余内容",
            "查询改写后问题语义漂移",
            "生成 prompt 格式约束不足",
        ],
        suggested_actions=[
            "优化查询改写 prompt，确保改写后语义不偏移",
            "在生成 prompt 中加入'简洁准确、直接回答问题'的约束",
            "检查低分样本的回答是否存在格式冗余或话题偏移",
        ],
    ),
    "context_recall": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "检索未能召回标准答案所涉及的关键信息",
            "单一查询未能覆盖问题的多个角度",
            "过召回数量不足，关键片段被截断",
        ],
        suggested_actions=[
            "启用多查询扩展（use_multi_query）覆盖不同措辞",
            "对多跳问题启用问题分解（sub_questions）",
            "加大过召回宽度（recall_top_k）",
            "对颗粒度细的问题尝试 Step-back 双路检索",
        ],
    ),
    "context_precision": MetricRule(
        warning_threshold=0.6,
        critical_threshold=0.4,
        higher_is_better=True,
        root_causes=[
            "检索引入过多与问题无关的片段",
            "重排未能将相关片段排在前列",
            "缺少相关性过滤，噪声片段进入上下文",
        ],
        suggested_actions=[
            "启用或优化 listwise 重排，将相关片段排在前列",
            "启用上下文压缩（compression）过滤无关句子",
            "启用相关性过滤（relevance_filter）丢弃明确无关片段",
            "缩小 rerank_keep_k（如从 8 降到 5）",
        ],
    ),
    "noise_sensitivity": MetricRule(
        warning_threshold=0.3,   # higher is worse; trigger when mean > threshold
        critical_threshold=0.5,
        higher_is_better=False,
        root_causes=[
            "回答中包含检索到的噪声片段所引入的错误陈述",
            "相关性过滤未能拦截干扰性片段",
            "生成阶段对噪声片段未加区分地引用",
        ],
        suggested_actions=[
            "启用相关性过滤（relevance_filter）拦截噪声",
            "优化重排，将不相关片段排到截断点之后",
            "在生成 prompt 中强调'来源冲突时并列陈述，不擅自下定论'",
        ],
    ),
    "factual_correctness": MetricRule(
        warning_threshold=0.6,
        critical_threshold=0.4,
        higher_is_better=True,
        root_causes=[
            "回答的事实陈述与标准答案存在偏差",
            "检索未能命中标准答案所依据的关键片段",
            "生成阶段对多个来源综合时产生事实错误",
        ],
        suggested_actions=[
            "重点检查低分样本，确认是检索遗漏还是生成错误",
            "提升 context_recall 以确保关键信息被检索到",
            "对事实型问题将 temperature 降至 0",
        ],
    ),
    "semantic_similarity": MetricRule(
        warning_threshold=0.7,
        critical_threshold=0.5,
        higher_is_better=True,
        root_causes=[
            "回答语义与标准答案差距较大",
            "回答过于简短或过于冗长，语义偏移",
            "检索到的片段质量不足，导致生成内容偏离",
        ],
        suggested_actions=[
            "检查低分样本的回答与标准答案的表述差异",
            "优化生成 prompt 使回答更贴近标准表述风格",
            "提升检索质量（context_recall / context_precision）",
        ],
    ),
 }
@dataclass
 class Diagnosis:
    """Diagnostic result for one metric that triggered a threshold."""
    metric: str
    mean_score: float
    threshold: float          # the triggered threshold
    severity: str             # "warning" | "critical"
    root_causes: list[str] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)
    low_samples: list[dict[str, Any]] = field(default_factory=list)
 def _mean_ignoring_nan(values: list[float]) -> float | None:
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return None
    return sum(valid) / len(valid)
 def _select_low_samples(
    rows: list[dict[str, Any]],
    metric: str,
    top_n: int,
    higher_is_better: bool,
 ) -> list[dict[str, Any]]:
    """Return the top_n worst-scoring rows for a metric, excluding NaN."""
    valid = [r for r in rows if metric in r and not math.isnan(float(r[metric]))]
    sorted_rows = sorted(valid, key=lambda r: float(r[metric]), reverse=not higher_is_better)
    worst = sorted_rows[:top_n]
    keep_keys = {"sample_id", "question", "answer", "ground_truth", metric}
    return [{k: v for k, v in row.items() if k in keep_keys} for row in worst]
 def diagnose(
    score_rows: list[dict[str, Any]],
    metrics: list[str],
    top_low_samples: int = 3,
 ) -> list[Diagnosis]:
    """Analyse score_rows and return a Diagnosis for each metric below threshold.
    Args:
        score_rows: List of per-sample score dicts (from EvaluationResult.score_rows).
        metrics: Metric names to evaluate (from Scenario.metrics).
        top_low_samples: How many worst-scoring samples to attach per diagnosis.
    Returns:
        List of Diagnosis objects, one per triggered metric. Empty if all OK.
    """
    diagnoses: list[Diagnosis] = []
    for metric in metrics:
        rule = METRIC_RULES.get(metric)
        if rule is None:
            continue  # unknown metric, skip
        values = []
        for row in score_rows:
            raw = row.get(metric)
            if raw is None:
                continue
            try:
                v = float(raw)
            except (TypeError, ValueError):
                continue
            values.append(v)
        if not values:
            continue
        mean = _mean_ignoring_nan(values)
        if mean is None:
            continue
        # Determine severity (direction-aware)
        if rule.higher_is_better:
            if mean < rule.critical_threshold:
                severity = "critical"
                threshold = rule.critical_threshold
            elif mean < rule.warning_threshold:
                severity = "warning"
                threshold = rule.warning_threshold
            else:
                continue  # above warning threshold → no diagnosis
        else:
            # lower is better (noise_sensitivity)
            if mean > rule.critical_threshold:
                severity = "critical"
                threshold = rule.critical_threshold
            elif mean > rule.warning_threshold:
                severity = "warning"
                threshold = rule.warning_threshold
            else:
                continue
        low_samples = _select_low_samples(score_rows, metric, top_low_samples, rule.higher_is_better)
        diagnoses.append(Diagnosis(
            metric=metric,
            mean_score=round(mean, 4),
            threshold=threshold,
            severity=severity,
            root_causes=list(rule.root_causes),
            suggested_actions=list(rule.suggested_actions),
            low_samples=low_samples,
        ))
    return diagnoses
--- a/rag_eval/advisor/writer.py
+++ b/rag_eval/advisor/writer.py
@@ -0,0 +1,82 @@
 """Write optimization advice to markdown file and emit log summary."""
 from __future__ import annotations
 import logging
 from pathlib import Path
 from .rules import Diagnosis
 logger = logging.getLogger("rag_eval.advisor")
 def _format_log_summary(diagnoses: list[Diagnosis], advice_path: Path) -> str:
    """Return a single-line log summary of triggered diagnoses."""
    if not diagnoses:
        return "[advisor] 所有指标正常，无需优化建议。"
    parts = [f"{d.metric}({d.mean_score:.2f}, {d.severity})" for d in diagnoses]
    triggered = " ".join(parts)
    return f"[advisor] 触发诊断 {len(diagnoses)} 项: {triggered}  →  {advice_path}"
 def _build_fallback_report(diagnoses: list[Diagnosis]) -> str:
    """Build a rules-only report when LLM analysis is unavailable."""
    if not diagnoses:
        return ""
    lines = ["## 规则诊断（LLM 分析不可用）\n"]
    for d in diagnoses:
        lines.append(f"### {d.metric}  [{d.severity}]  均值={d.mean_score:.4f}")
        lines.append("\n**可能原因：**")
        for cause in d.root_causes:
            lines.append(f"- {cause}")
        lines.append("\n**建议动作：**")
        for action in d.suggested_actions:
            lines.append(f"- {action}")
        lines.append("")
    return "\n".join(lines)
 def write_advice(
    diagnoses: list[Diagnosis],
    llm_markdown: str,
    advice_path: Path,
    scenario_name: str,
    run_id: str,
    judge_model: str,
 ) -> None:
    """Write optimization_advice.md and emit a log summary line.
    Args:
        diagnoses: List of Diagnosis from rules.diagnose().
        llm_markdown: LLM-generated Markdown body. Empty string triggers fallback.
        advice_path: Full path to write the .md file.
        scenario_name: Human-readable scenario identifier for the report header.
        run_id: Run identifier string.
        judge_model: Model used for LLM analysis (shown in header).
    """
    advice_path.parent.mkdir(parents=True, exist_ok=True)
    from rag_eval.shared.utils import utc_now_iso
    header_lines = [
        f"# 优化建议报告 — {scenario_name}",
        "",
        f"- run_id: `{run_id}`",
        f"- 生成时间: `{utc_now_iso()}`",
        f"- judge_model: `{judge_model}`",
        "",
        "---",
        "",
    ]
    if not diagnoses:
        body = "## ✅ 未发现明显指标异常\n\n所有指标均在正常范围内，当前 RAG 链路表现良好。\n"
    elif llm_markdown:
        body = llm_markdown
    else:
        body = _build_fallback_report(diagnoses)
    content = "\n".join(header_lines) + body
    advice_path.write_text(content, encoding="utf-8")
    summary = _format_log_summary(diagnoses, advice_path)
    logger.info(summary)
    logger.info("[advisor] 优化建议已写出: %s", advice_path)
--- a/rag_eval/config/loader.py
+++ b/rag_eval/config/loader.py
@@ -61,6 +61,7 @@ def load_scenario(path: str | Path) -> Scenario:
            max_samples=model.runtime.max_samples,
        ),
        source_path=scenario_path,
        optimization_advisor=model.optimization_advisor,
    )
    # Run cross-field checks after all relative paths have been resolved.
    validate_scenario(scenario)
--- a/rag_eval/config/schema.py
+++ b/rag_eval/config/schema.py
@@ -54,6 +54,7 @@ class ScenarioModel(BaseModel):
    metrics: list[str]
    output_dir: str
    runtime: RuntimeConfigModel = Field(default_factory=RuntimeConfigModel)
    optimization_advisor: bool = False
    @field_validator("metrics")
    @classmethod
--- a/rag_eval/execution/evaluator.py
+++ b/rag_eval/execution/evaluator.py
@@ -3,6 +3,8 @@
 from __future__ import annotations
 import asyncio
 import logging
 import time
 from typing import Any
 from rag_eval.adapters.base import AppAdapter
@@ -13,6 +15,8 @@ from rag_eval.metrics.pipeline import MetricPipeline
 from rag_eval.shared.models import EvaluationResult, InvalidSample, NormalizedSample, Scenario
 from rag_eval.shared.utils import utc_now_iso
 logger = logging.getLogger("rag_eval.execution.evaluator")
 class Evaluator:
    """Coordinate dataset loading, optional app execution, and metric scoring."""
@@ -31,27 +35,61 @@ class Evaluator:
    def evaluate(self) -> EvaluationResult:
        """Execute the full evaluation flow and return the collected results."""
        started_at = utc_now_iso()
        scenario_name = self.scenario.scenario_name
        mode = self.scenario.mode
        logger.info("=" * 60)
        logger.info("[eval] START  scenario=%s  mode=%s", scenario_name, mode)
        logger.info("[eval] dataset=%s", self.scenario.dataset.path)
        logger.info("[eval] metrics=%s", list(self.scenario.metrics))
        logger.info("[eval] judge=%s  embed=%s", self.scenario.judge_model, self.scenario.embedding_model)
        raw_records = load_dataset_records(self.scenario.dataset.path)
        logger.info("[eval] raw_records=%d", len(raw_records))
        samples, invalid_samples = normalize_records(
            raw_records,
            mode=self.scenario.mode,
            max_samples=self.scenario.runtime.max_samples,
        )
        logger.info("[eval] normalized: valid=%d  invalid=%d", len(samples), len(invalid_samples))
        if self.scenario.mode == "online":
-            # Online mode enriches each sample by calling the target application first.
+            logger.info("[eval] online mode: calling app adapter for %d samples ...", len(samples))
            t0 = time.monotonic()
            samples, online_invalids = asyncio.run(self._enrich_online_samples(samples))
            elapsed = time.monotonic() - t0
            invalid_samples.extend(online_invalids)
            logger.info(
                "[eval] adapter done: enriched=%d  adapter_invalids=%d  elapsed=%.1fs",
                len(samples), len(online_invalids), elapsed,
            )
        logger.info("[eval] scoring %d samples with metric pipeline ...", len(samples))
        t0 = time.monotonic()
        metric_scores = asyncio.run(
            self.metric_pipeline.score_samples(
                samples,
                max_concurrency=self.scenario.runtime.metric_limit(),
            )
        )
        elapsed = time.monotonic() - t0
        logger.info("[eval] metric scoring done  elapsed=%.1fs", elapsed)
        finished_at = utc_now_iso()
        score_rows = [self._merge_score(sample, score) for sample, score in zip(samples, metric_scores)]
        # Summary of NaN rates per metric
        import math
        for metric_name in self.scenario.metrics:
            nan_count = sum(1 for row in score_rows if math.isnan(float(row.get(metric_name, float("nan")) or float("nan"))))
            logger.info("[eval] %-22s  NaN=%d/%d (%.0f%%)",
                        metric_name, nan_count, len(score_rows),
                        100 * nan_count / len(score_rows) if score_rows else 0)
        run_id = finished_at.replace(":", "-")
        logger.info("[eval] DONE  run_id=%s  total_valid=%d  total_invalid=%d",
                    run_id, len(samples), len(invalid_samples))
        logger.info("=" * 60)
        return EvaluationResult(
            scenario=self.scenario,
            run_id=run_id,
@@ -72,13 +110,27 @@ class Evaluator:
        valid: list[NormalizedSample] = []
        invalid: list[InvalidSample] = []
        total = len(samples)
-        async def enrich_with_capture(sample: NormalizedSample) -> NormalizedSample | InvalidSample:
+        async def enrich_with_capture(idx: int, sample: NormalizedSample) -> NormalizedSample | InvalidSample:
            """Convert adapter exceptions into invalid samples instead of aborting the run."""
            sid = sample.sample_id[:12]
            logger.debug("[adapter] [%d/%d] calling adapter  sample=%s  question=%r",
                         idx + 1, total, sid, (sample.question or "")[:60])
            t0 = time.monotonic()
            try:
-                return await self.app_adapter.enrich_sample(sample)
+                result = await self.app_adapter.enrich_sample(sample)
                elapsed = time.monotonic() - t0
                ans_len = len(result.answer or "")
                ctx_count = len(result.contexts or [])
                logger.info("[adapter] [%d/%d] OK  sample=%-12s  ans_len=%d  ctx_count=%d  elapsed=%.1fs",
                            idx + 1, total, sid, ans_len, ctx_count, elapsed)
                return result
            except Exception as exc:
                elapsed = time.monotonic() - t0
                error_type = type(exc).__name__
                logger.warning("[adapter] [%d/%d] FAIL  sample=%-12s  %s: %s  (elapsed=%.1fs)",
                               idx + 1, total, sid, error_type, exc, elapsed)
                return InvalidSample(
                    sample_id=sample.sample_id,
                    error=f"adapter failed [{error_type}]: {exc}",
@@ -86,8 +138,8 @@ class Evaluator:
                )
        factories = [
-            (lambda sample=sample: enrich_with_capture(sample))
+            (lambda _idx=i, _sample=sample: enrich_with_capture(_idx, _sample))
-            for sample in samples
+            for i, sample in enumerate(samples)
        ]
        results = await gather_with_limit(factories, self.scenario.runtime.app_limit())
@@ -102,6 +154,8 @@ class Evaluator:
            if not sample.contexts:
                errors.append("adapter returned empty contexts")
            if errors:
                logger.warning("[adapter] incomplete payload  sample=%s  errors=%s",
                               sample.sample_id[:12], errors)
                invalid.append(
                    InvalidSample(
                        sample_id=sample.sample_id,
@@ -111,6 +165,9 @@ class Evaluator:
                )
                continue
            valid.append(sample)
        logger.info("[adapter] enrichment summary: valid=%d  invalid=%d  of total=%d",
                    len(valid), len(invalid), total)
        return valid, invalid
    def _merge_score(self, sample: NormalizedSample, score: Any) -> dict[str, Any]:
--- a/rag_eval/execution/runner.py
+++ b/rag_eval/execution/runner.py
@@ -2,16 +2,42 @@
 from __future__ import annotations
 import logging
 import sys
 from pathlib import Path
 from rag_eval.adapters.http import HttpAppAdapter
 from rag_eval.adapters.python import PythonFunctionAdapter
 from rag_eval.advisor import run_advisor
 from rag_eval.config.loader import load_scenario
-from rag_eval.metrics.factory import build_metric_pipeline
+from rag_eval.metrics.factory import build_models, build_metric_pipeline
 from rag_eval.reporting.writers import write_run_artifacts
 from rag_eval.settings import EvaluationSettings
 from rag_eval.shared.models import Scenario
 from .evaluator import Evaluator
 logger = logging.getLogger("rag_eval.execution.runner")
 def _setup_logging(log_file: Path | None = None, level: int = logging.INFO) -> None:
    """Configure root logger: always write to stderr, optionally also to a file."""
    fmt = "%(asctime)s  %(levelname)-8s  %(name)s  %(message)s"
    datefmt = "%H:%M:%S"
    handlers: list[logging.Handler] = [logging.StreamHandler(sys.stderr)]
    if log_file is not None:
        log_file.parent.mkdir(parents=True, exist_ok=True)
        fh = logging.FileHandler(log_file, encoding="utf-8")
        fh.setFormatter(logging.Formatter(fmt, datefmt=datefmt))
        handlers.append(fh)
    logging.basicConfig(level=level, format=fmt, datefmt=datefmt, handlers=handlers, force=True)
    # Also show ragas internal logs at WARNING so we can see LLM errors
    logging.getLogger("ragas").setLevel(logging.WARNING)
    logging.getLogger("httpx").setLevel(logging.WARNING)
    logging.getLogger("openai").setLevel(logging.WARNING)
 def build_adapter(scenario: Scenario):
    """Instantiate the adapter required by the resolved scenario, if any."""
@@ -27,16 +53,32 @@ def build_adapter(scenario: Scenario):
 def run_scenario(
    scenario_path: str,
    settings: EvaluationSettings | None = None,
    log_file: Path | None = None,
    log_level: int = logging.INFO,
 ):
    """Run one scenario end to end and persist its reporting artifacts."""
    _setup_logging(log_file=log_file, level=log_level)
    logger.info("[runner] run_scenario  path=%s", scenario_path)
    settings = settings or EvaluationSettings()
    if not settings.openai_api_key:
        raise EnvironmentError("OPENAI_API_KEY must be set before running the evaluator.")
    scenario = load_scenario(scenario_path)
    logger.info("[runner] scenario loaded: name=%s  mode=%s  max_samples=%s",
                scenario.scenario_name, scenario.mode, scenario.runtime.max_samples)
    # Build models once; reuse llm in both MetricPipeline and advisor.
    llm, embeddings = build_models(scenario.judge_model, scenario.embedding_model, settings)
    adapter = build_adapter(scenario)
-    pipeline = build_metric_pipeline(scenario, settings)
+    pipeline = build_metric_pipeline(scenario, settings, llm=llm, embeddings=embeddings)
    evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline, app_adapter=adapter)
    result = evaluator.evaluate()
    write_run_artifacts(result)
    logger.info("[runner] artifacts written for run_id=%s", result.run_id)
    # Optimization advisor — runs only if scenario.optimization_advisor is True.
    run_advisor(result, scenario, llm)
    return result
--- a/rag_eval/metrics/factory.py
+++ b/rag_eval/metrics/factory.py
@@ -18,7 +18,10 @@ from ragas.metrics.collections import (
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    FactualCorrectness,
    Faithfulness,
    NoiseSensitivity,
    SemanticSimilarity,
 )
 from .pipeline import MetricPipeline
@@ -39,19 +42,34 @@ def build_models(
 def build_metric_pipeline(
    scenario: Scenario,
    settings: EvaluationSettings,
    llm: Any | None = None,
    embeddings: Any | None = None,
 ) -> MetricPipeline:
-    """Build a metric pipeline containing only the metrics requested by the scenario."""
+    """Build a metric pipeline containing only the metrics requested by the scenario.
    If llm and embeddings are provided (pre-built by the caller), they are reused.
    Otherwise, new instances are created from scenario + settings.
    """
    if llm is None or embeddings is None:
        llm, embeddings = build_models(
            scenario.judge_model,
            scenario.embedding_model,
            settings,
        )
    # Build the full registry once, then slice it by configured metric names.
    registry: dict[str, Any] = {
        "faithfulness": Faithfulness(llm=llm),
        "answer_relevancy": AnswerRelevancy(llm=llm, embeddings=embeddings),
        "context_recall": ContextRecall(llm=llm),
        "context_precision": ContextPrecision(llm=llm),
        # Robustness / end-to-end metrics (架构设计 §10.2).
        # NoiseSensitivity mode='relevant': sensitivity to noise from relevant contexts.
        "noise_sensitivity": NoiseSensitivity(llm=llm),
        # FactualCorrectness mode='f1': balances claim precision and recall vs. ground truth.
        "factual_correctness": FactualCorrectness(llm=llm),
        # SemanticSimilarity: embedding cosine between answer and ground truth (no LLM call).
        "semantic_similarity": SemanticSimilarity(embeddings=embeddings),
    }
    return MetricPipeline(
        metrics={name: registry[name] for name in scenario.metrics},
--- a/rag_eval/metrics/pipeline.py
+++ b/rag_eval/metrics/pipeline.py
@@ -3,12 +3,16 @@
 from __future__ import annotations
 import asyncio
 import logging
 import math
 import time
 from dataclasses import dataclass
 from typing import Any
 from rag_eval.shared.models import MetricScore, NormalizedSample
 logger = logging.getLogger("rag_eval.metrics.pipeline")
@dataclass(slots=True)
 class MetricPipeline:
@@ -22,12 +26,43 @@ class MetricPipeline:
        results = {name: math.nan for name in self.metrics}
        errors: list[str] = []
        sid = sample.sample_id[:12]
        ans_len = len(sample.answer or "")
        ctx_count = len(sample.contexts or [])
        logger.debug(
            "[score] sample=%s  ans_len=%d  ctx_count=%d  question=%r",
            sid, ans_len, ctx_count,
            (sample.question or "")[:80],
        )
        for name, metric in self.metrics.items():
            t0 = time.monotonic()
            try:
                result = await self._run_metric(name, metric, sample)
-                results[name] = float(result.value)
+                score_val = float(result.value)
                results[name] = score_val
                elapsed = time.monotonic() - t0
                logger.info(
                    "[metric OK ] sample=%-12s  %-20s  score=%.4f  elapsed=%.1fs",
                    sid, name, score_val, elapsed,
                )
            except asyncio.TimeoutError:
                elapsed = time.monotonic() - t0
                msg = f"timeout after {self.metric_timeout_seconds}s"
                errors.append(f"{name}: {msg}")
                logger.warning(
                    "[metric TMO] sample=%-12s  %-20s  TIMEOUT after %.1fs",
                    sid, name, elapsed,
                )
            except Exception as exc:
                elapsed = time.monotonic() - t0
                exc_type = type(exc).__name__
                errors.append(f"{name}: {exc}")
                logger.warning(
                    "[metric ERR] sample=%-12s  %-20s  %s: %s  (elapsed=%.1fs)",
                    sid, name, exc_type, exc, elapsed,
                )
        return MetricScore(metrics=results, error=" | ".join(errors))
    async def _run_metric(self, name: str, metric: Any, sample: NormalizedSample) -> Any:
@@ -59,6 +94,23 @@ class MetricPipeline:
                reference=sample.ground_truth,
                retrieved_contexts=sample.contexts,
            )
        elif name == "noise_sensitivity":
            coroutine = metric.ascore(
                user_input=sample.question,
                response=sample.answer,
                reference=sample.ground_truth,
                retrieved_contexts=sample.contexts,
            )
        elif name == "factual_correctness":
            coroutine = metric.ascore(
                response=sample.answer,
                reference=sample.ground_truth,
            )
        elif name == "semantic_similarity":
            coroutine = metric.ascore(
                reference=sample.ground_truth,
                response=sample.answer,
            )
        else:
            raise ValueError(f"Unsupported metric: {name}")
@@ -72,11 +124,22 @@ class MetricPipeline:
        max_concurrency: int,
    ) -> list[MetricScore]:
        """Score all samples while respecting the configured concurrency limit."""
        total = len(samples)
        logger.info("[pipeline] scoring %d samples  concurrency=%d  timeout=%ss",
                    total, max_concurrency, self.metric_timeout_seconds)
        semaphore = asyncio.Semaphore(max(1, max_concurrency))
        completed = 0
-        async def guarded(sample: NormalizedSample) -> MetricScore:
+        async def guarded(idx: int, sample: NormalizedSample) -> MetricScore:
            """Throttle a single sample-scoring coroutine with the shared semaphore."""
            nonlocal completed
            async with semaphore:
-                return await self.score_sample(sample)
+                result = await self.score_sample(sample)
                completed += 1
                nan_metrics = [k for k, v in result.metrics.items() if math.isnan(v)]
                status = f"NaN={nan_metrics}" if nan_metrics else "all OK"
                logger.info("[pipeline] progress %d/%d  sample=%-12s  %s",
                            completed, total, sample.sample_id[:12], status)
                return result
-        return await asyncio.gather(*(guarded(sample) for sample in samples))
+        return await asyncio.gather(*(guarded(i, s) for i, s in enumerate(samples)))
--- a/rag_eval/metrics/registry.py
+++ b/rag_eval/metrics/registry.py
@@ -1,8 +1,13 @@
 """Supported metric names recognized by scenario validation and pipeline setup."""
 SUPPORTED_METRICS = {
    # Core retrieval / generation metrics (always available).
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "context_precision",
    # Robustness and end-to-end metrics (see 架构设计 §10.2).
    "noise_sensitivity",      # 鲁棒性：对检索噪声的敏感度
    "factual_correctness",    # 端到端：回答相对标准答案的事实正确性
    "semantic_similarity",    # 端到端：回答与标准答案的语义相似度（embedding，无 LLM 调用）
 }
--- a/rag_eval/reporting/artifacts.py
+++ b/rag_eval/reporting/artifacts.py
@@ -17,4 +17,5 @@ def build_artifact_paths(output_dir: Path, run_id: str) -> RunArtifactPaths:
        invalid_csv=run_dir / "invalid.csv",
        summary_md=run_dir / "summary.md",
        metadata_json=run_dir / "metadata.json",
        advice_md=run_dir / "optimization_advice.md",
    )
--- a/rag_eval/shared/models.py
+++ b/rag_eval/shared/models.py
@@ -76,6 +76,7 @@ class Scenario:
    runtime: RuntimeConfig = field(default_factory=RuntimeConfig)
    app_adapter: AppAdapterConfig | None = None
    source_path: Path | None = None
    optimization_advisor: bool = False
    def snapshot(self) -> dict[str, Any]:
        """Serialize the scenario into a reporting-friendly dictionary snapshot."""
@@ -159,3 +160,4 @@ class RunArtifactPaths:
    invalid_csv: Path
    summary_md: Path
    metadata_json: Path
    advice_md: Path | None = None
--- a/run_eval.bat
+++ b/run_eval.bat
@@ -0,0 +1,107 @@
@echo off
 setlocal enabledelayedexpansion
 :: ============================================================
 ::  run_eval.bat  -  Run a RAGAS evaluation scenario with logs
 ::
 ::  Usage:
 ::    run_eval.bat                          (uses default online scenario)
 ::    run_eval.bat offline                  (runs offline smoke scenario)
 ::    run_eval.bat path\to\scenario.yaml    (any custom scenario)
 ::    run_eval.bat offline DEBUG            (second arg = log level)
 :: ============================================================
 cd /d "%~dp0"
 echo.
 echo ============================================================
 echo   Siemens RAGAS  -  Evaluation Runner
 echo ============================================================
 echo.
 :: ----------------------------------------------------------------
 :: 1. Resolve scenario path  (arg1)
 :: ----------------------------------------------------------------
 set "SCENARIO=%~1"
 if "%SCENARIO%"=="" set "SCENARIO=online"
 if /i "%SCENARIO%"=="online" (
    set "SCENARIO=scenarios\online\siemens-pdf-question-bank-online.yaml"
 )
 if /i "%SCENARIO%"=="offline" (
    set "SCENARIO=scenarios\offline\siemens-pdf-offline-smoke.yaml"
 )
 if not exist "%SCENARIO%" (
    echo [ERROR] Scenario file not found: %SCENARIO%
    echo.
    echo Usage examples:
    echo   run_eval.bat                    - online eval (default)
    echo   run_eval.bat offline            - offline smoke
    echo   run_eval.bat path\to\file.yaml  - custom scenario
    goto :error
 )
 echo [OK] Scenario : %SCENARIO%
 :: ----------------------------------------------------------------
 :: 2. Resolve log level  (arg2, default INFO)
 :: ----------------------------------------------------------------
 set "LOG_LEVEL=%~2"
 if "%LOG_LEVEL%"=="" set "LOG_LEVEL=INFO"
 echo [OK] Log level: %LOG_LEVEL%
 :: ----------------------------------------------------------------
 :: 3. Create logs dir and build timestamped log filename
 :: ----------------------------------------------------------------
 if not exist "logs" mkdir logs
 for /f "tokens=1-3 delims=/-" %%a in ("%DATE%") do (
    set "YMD=%%c-%%a-%%b"
 )
 for /f "tokens=1-3 delims=:." %%a in ("%TIME: =0%") do (
    set "HMS=%%a%%b%%c"
 )
 set "LOG_FILE=logs\eval_%YMD%_%HMS%.log"
 echo [OK] Log file : %LOG_FILE%
 echo.
 echo ============================================================
 echo   Starting evaluation...
 echo   (Logs also written to %LOG_FILE%)
 echo   Press Ctrl+C to abort
 echo ============================================================
 echo.
 :: ----------------------------------------------------------------
 :: 4. Run evaluation with UTF-8 and logging
 :: ----------------------------------------------------------------
 set PYTHONIOENCODING=utf-8
 set PYTHONPATH=.
 python main.py ^
    --scenario "%SCENARIO%" ^
    --log-file "%LOG_FILE%" ^
    --log-level %LOG_LEVEL%
 if errorlevel 1 (
    echo.
    echo [ERROR] Evaluation failed. Check log: %LOG_FILE%
    goto :error
 )
 echo.
 echo ============================================================
 echo   Evaluation complete!
 echo   Log saved to: %LOG_FILE%
 echo   Open the web console to view results: start.bat
 echo ============================================================
 echo.
 pause
 exit /b 0
 :error
 echo.
 echo ============================================================
 echo   Evaluation failed. See error above or check log file.
 echo ============================================================
 pause
 exit /b 1
--- a/run_eval.ps1
+++ b/run_eval.ps1
@@ -0,0 +1,96 @@
 # run_eval.ps1 - Siemens RAGAS Evaluation Runner
 # Usage:
 #   .\run_eval.ps1                         # online eval (default)
 #   .\run_eval.ps1 offline                 # offline smoke
 #   .\run_eval.ps1 path\to\scenario.yaml   # custom scenario
 #   .\run_eval.ps1 online DEBUG            # second arg = log level (DEBUG/INFO/WARNING)
 # Or: powershell -ExecutionPolicy Bypass -File run_eval.ps1 [scenario] [log-level]
 param(
    [string]$Scenario = "online",
    [string]$LogLevel = "INFO"
 )
 $ErrorActionPreference = "Stop"
 Set-Location $PSScriptRoot
 Write-Host ""
 Write-Host "============================================================" -ForegroundColor Cyan
 Write-Host "  Siemens RAGAS  -  Evaluation Runner" -ForegroundColor Cyan
 Write-Host "============================================================" -ForegroundColor Cyan
 Write-Host ""
 # ----------------------------------------------------------------
 # 1. Resolve scenario path
 # ----------------------------------------------------------------
 $scenarioMap = @{
    "online"  = "scenarios\online\siemens-pdf-question-bank-online.yaml"
    "offline" = "scenarios\offline\siemens-pdf-offline-smoke.yaml"
 }
 if ($scenarioMap.ContainsKey($Scenario.ToLower())) {
    $Scenario = $scenarioMap[$Scenario.ToLower()]
 }
 if (-not (Test-Path $Scenario)) {
    Write-Host "[ERROR] Scenario file not found: $Scenario" -ForegroundColor Red
    Write-Host ""
    Write-Host "Usage examples:"
    Write-Host "  .\run_eval.ps1                    - online eval (default)"
    Write-Host "  .\run_eval.ps1 offline            - offline smoke"
    Write-Host "  .\run_eval.ps1 path\to\file.yaml  - custom scenario"
    Read-Host "Press Enter to exit"
    exit 1
 }
 Write-Host "[OK] Scenario : $Scenario" -ForegroundColor Green
 # ----------------------------------------------------------------
 # 2. Validate log level
 # ----------------------------------------------------------------
 $validLevels = @("DEBUG", "INFO", "WARNING", "ERROR")
 if ($validLevels -notcontains $LogLevel.ToUpper()) {
    Write-Host "[WARN] Unknown log level '$LogLevel', defaulting to INFO" -ForegroundColor Yellow
    $LogLevel = "INFO"
 }
 Write-Host "[OK] Log level: $LogLevel" -ForegroundColor Green
 # ----------------------------------------------------------------
 # 3. Create logs dir with timestamped filename
 # ----------------------------------------------------------------
 if (-not (Test-Path "logs")) { New-Item -ItemType Directory "logs" | Out-Null }
 $timestamp = Get-Date -Format "yyyy-MM-dd_HHmmss"
 $logFile = "logs\eval_$timestamp.log"
 Write-Host "[OK] Log file : $logFile" -ForegroundColor Green
 Write-Host ""
 Write-Host "============================================================" -ForegroundColor Cyan
 Write-Host "  Starting evaluation..." -ForegroundColor Cyan
 Write-Host "  Logs also written to: $logFile" -ForegroundColor Cyan
 Write-Host "  Press Ctrl+C to abort" -ForegroundColor Yellow
 Write-Host "============================================================" -ForegroundColor Cyan
 Write-Host ""
 # ----------------------------------------------------------------
 # 4. Run evaluation
 # ----------------------------------------------------------------
 $env:PYTHONIOENCODING = "utf-8"
 $env:PYTHONPATH = "."
 & python main.py `
    --scenario $Scenario `
    --log-file $logFile `
    --log-level $LogLevel.ToUpper()
 if ($LASTEXITCODE -ne 0) {
    Write-Host ""
    Write-Host "[ERROR] Evaluation failed. Check log: $logFile" -ForegroundColor Red
    Read-Host "Press Enter to exit"
    exit 1
 }
 Write-Host ""
 Write-Host "============================================================" -ForegroundColor Green
 Write-Host "  Evaluation complete!" -ForegroundColor Green
 Write-Host "  Log saved to: $logFile" -ForegroundColor Green
 Write-Host "  Open the web console to view results: start.bat" -ForegroundColor Cyan
 Write-Host "============================================================" -ForegroundColor Green
 Write-Host ""
 Read-Host "Press Enter to exit"
--- a/scenarios/offline/siemens-pdf-offline-smoke.yaml
+++ b/scenarios/offline/siemens-pdf-offline-smoke.yaml
@@ -9,6 +9,10 @@ metrics:
  - answer_relevancy
  - context_recall
  - context_precision
  # 可选：鲁棒性 / 端到端指标（数据集已含 ground_truth，取消注释即可启用）
  # - noise_sensitivity      # 鲁棒性：对检索噪声的敏感度
  # - factual_correctness    # 端到端：事实正确性（相对标准答案）
  # - semantic_similarity    # 端到端：语义相似度（embedding，无 LLM 调用）
 output_dir: ../../outputs/siemens-pdf-offline-smoke
 runtime:
  batch_size: 4
--- a/scenarios/online/sample-pdf-question-bank-online.yaml
+++ b/scenarios/online/sample-pdf-question-bank-online.yaml
@@ -1,7 +1,7 @@
 scenario_name: sample-pdf-question-bank-online
 mode: online
 dataset: ../../datasets/raw/generated/sample-pdf-question-bank.csv
-judge_model: deepseek-v4-pro
+judge_model: qwen3.5-flash
 embedding_model: text-embedding-v3
 metrics:
 - faithfulness
@@ -19,4 +19,4 @@ app_adapter:
  callable: apps.pdf_question_bank.adapter:run
  static_kwargs:
    source_chunks_path: ../../outputs/dataset-builds/sample-pdf-question-bank/latest/source_chunks.jsonl
-    model: deepseek-v4-flash
+    model: glm-5
--- a/scenarios/online/siemens-pdf-question-bank-online.yaml
+++ b/scenarios/online/siemens-pdf-question-bank-online.yaml
@@ -3,20 +3,24 @@ mode: online
 dataset: ../../datasets/raw/generated/siemens-pdf-question-bank.csv
 judge_model: deepseek-v4-flash
 embedding_model: text-embedding-v3
 optimization_advisor: true
 metrics:
 - faithfulness
 - answer_relevancy
 - context_recall
 - context_precision
 - noise_sensitivity
 - factual_correctness
 - semantic_similarity
 output_dir: ../../outputs/online/siemens-pdf-question-bank
 runtime:
-  batch_size: 4
+  batch_size: 3
-  app_concurrency: 4
+  app_concurrency: 3
-  metric_concurrency: 4
+  metric_concurrency: 3
-  max_samples: 50
+  max_samples: 10
 app_adapter:
  type: python
  callable: apps.siemens_pdf_qa.adapter:run
  static_kwargs:
    source_chunks_path: ../../outputs/dataset-builds/siemens-pdf-question-bank/latest/source_chunks.jsonl
-    model: deepseek-v4-flash
+    model: glm-5
--- a/scripts/smoke_advisor.py
+++ b/scripts/smoke_advisor.py
@@ -0,0 +1,59 @@
 """Offline smoke-check for the advisor module wiring (no network required)."""
 import math
 import sys
 import tempfile
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from rag_eval.advisor.rules import diagnose
 from rag_eval.advisor.writer import write_advice, _format_log_summary
 # Simulate score_rows with low faithfulness and high noise_sensitivity
 rows = [
    {
        "sample_id": f"s{i}",
        "question": f"问题{i}：西门子CT扫描的Flash技术原理是什么？",
        "answer": f"答案{i}：Flash技术采用双源CT扫描",
        "ground_truth": f"标准答案{i}：Flash扫描利用双源CT和大螺距实现超低辐射剂量扫描",
        "faithfulness": 0.3 + i * 0.05,
        "noise_sensitivity": 0.4 + i * 0.02,
        "context_recall": 0.75,
        "semantic_similarity": 0.65,
    }
    for i in range(5)
 ]
 diags = diagnose(rows, metrics=["faithfulness", "noise_sensitivity", "context_recall", "semantic_similarity"])
 print(f"Diagnosed {len(diags)} metric(s):")
 for d in diags:
    print(f"  {d.metric}: mean={d.mean_score}, severity={d.severity}, low_samples={len(d.low_samples)}")
 assert len(diags) >= 2, f"Expected at least 2 diagnoses, got {len(diags)}"
 metrics_hit = {d.metric for d in diags}
 assert "faithfulness" in metrics_hit, "faithfulness should be triggered"
 assert "noise_sensitivity" in metrics_hit, "noise_sensitivity should be triggered"
 with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "optimization_advice.md"
    write_advice(
        diagnoses=diags,
        llm_markdown="",  # fallback mode (no LLM)
        advice_path=path,
        scenario_name="smoke-test-siemens",
        run_id="2026-06-16T00-00-00",
        judge_model="deepseek-v4-flash",
    )
    content = path.read_text(encoding="utf-8")
    assert "smoke-test-siemens" in content, "scenario name missing from report"
    assert "faithfulness" in content, "faithfulness missing from report"
    assert "noise_sensitivity" in content, "noise_sensitivity missing from report"
    print(f"\nAdvice file ({len(content)} chars) — assertions OK")
 # Verify log summary format
 summary = _format_log_summary(diags, Path("optimization_advice.md"))
 print(f"\nLog summary length: {len(summary)} chars, faithfulness present: {'faithfulness' in summary}")
 assert "触发诊断" in summary
 assert "faithfulness" in summary
 print("\nSmoke check PASSED")
--- a/start.bat
+++ b/start.bat
@@ -56,7 +56,17 @@ if errorlevel 1 (
 )
 :: ----------------------------------------------------------------
-:: 4. Seed demo data if no runs exist yet
+:: 4. Ensure configs/ directory exists for LLM profile storage
 :: ----------------------------------------------------------------
 if not exist "configs" (
    mkdir configs
    echo [OK] Created configs/ directory for LLM profile storage.
 ) else (
    echo [OK] configs/ directory ready.
 )
 :: ----------------------------------------------------------------
 :: 5. Seed demo data if no runs exist yet
 :: ----------------------------------------------------------------
 if not exist "outputs\kba-knowledge-base-offline-baseline" (
    echo [INFO] No run data found. Generating demo data...
@@ -71,7 +81,7 @@ if not exist "outputs\kba-knowledge-base-offline-baseline" (
 )
 :: ----------------------------------------------------------------
-:: 5. Pick an available port
+:: 6. Pick an available port
 :: ----------------------------------------------------------------
 set PORT=8800
 netstat -ano 2>nul | findstr ":8800" | findstr "LISTENING" >nul 2>&1
--- a/start.ps1
+++ b/start.ps1
@@ -58,7 +58,17 @@ if ($LASTEXITCODE -ne 0) {
 }
 # ----------------------------------------------------------------
-# 4. Seed demo data if missing
+# 4. Ensure configs/ directory exists for LLM profile storage
 # ----------------------------------------------------------------
 if (-not (Test-Path "configs")) {
    New-Item -ItemType Directory "configs" | Out-Null
    Write-Host "[OK] Created configs/ directory for LLM profile storage." -ForegroundColor Green
 } else {
    Write-Host "[OK] configs/ directory ready." -ForegroundColor Green
 }
 # ----------------------------------------------------------------
 # 5. Seed demo data if missing
 # ----------------------------------------------------------------
 if (-not (Test-Path "outputs\kba-knowledge-base-offline-baseline")) {
    Write-Host "[INFO] No run data found. Generating demo data..." -ForegroundColor Yellow
@@ -73,7 +83,7 @@ if (-not (Test-Path "outputs\kba-knowledge-base-offline-baseline")) {
 }
 # ----------------------------------------------------------------
-# 5. Pick an available port
+# 6. Pick an available port
 # ----------------------------------------------------------------
 $PORT = 8800
 $inUse = netstat -ano 2>$null | Select-String ":$PORT\s" | Select-String "LISTENING"
--- a/tests/init.py
+++ b/tests/init.py
--- a/tests/test_advisor_rules.py
+++ b/tests/test_advisor_rules.py
@@ -0,0 +1,72 @@
 import math
 import unittest
 from rag_eval.advisor.rules import Diagnosis, diagnose, METRIC_RULES
 class TestDiagnosis(unittest.TestCase):
    def _make_rows(self, metric: str, scores: list[float]) -> list[dict]:
        return [{metric: s, "question": f"q{i}", "answer": f"a{i}",
                 "ground_truth": f"gt{i}", "sample_id": f"s{i}"}
                for i, s in enumerate(scores)]
    def test_no_diagnosis_when_all_scores_above_threshold(self):
        rows = self._make_rows("faithfulness", [0.8, 0.9, 0.85])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(result, [])
    def test_warning_when_mean_below_warning_threshold(self):
        rows = self._make_rows("faithfulness", [0.65, 0.62, 0.68])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0].metric, "faithfulness")
        self.assertEqual(result[0].severity, "warning")
        self.assertAlmostEqual(result[0].mean_score, 0.65, places=2)
    def test_critical_when_mean_below_critical_threshold(self):
        rows = self._make_rows("faithfulness", [0.3, 0.4, 0.45])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(result[0].severity, "critical")
    def test_low_samples_selected_are_bottom_three(self):
        rows = self._make_rows("faithfulness", [0.1, 0.2, 0.3, 0.8, 0.9])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(len(result[0].low_samples), 3)
        scores = [s["faithfulness"] for s in result[0].low_samples]
        self.assertEqual(sorted(scores), [0.1, 0.2, 0.3])
    def test_nan_scores_excluded_from_mean_and_low_samples(self):
        rows = self._make_rows("faithfulness", [0.3, float("nan"), 0.4])
        result = diagnose(rows, metrics=["faithfulness"])
        self.assertEqual(len(result), 1)
        for s in result[0].low_samples:
            self.assertFalse(math.isnan(s["faithfulness"]))
    def test_noise_sensitivity_direction_inverted(self):
        # noise_sensitivity: higher is worse; threshold > 0.3 is warning
        rows = self._make_rows("noise_sensitivity", [0.4, 0.45, 0.5])
        result = diagnose(rows, metrics=["noise_sensitivity"])
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0].metric, "noise_sensitivity")
    def test_noise_sensitivity_no_diagnosis_when_low(self):
        rows = self._make_rows("noise_sensitivity", [0.1, 0.15, 0.2])
        result = diagnose(rows, metrics=["noise_sensitivity"])
        self.assertEqual(result, [])
    def test_skips_metric_not_in_rows(self):
        rows = [{"faithfulness": 0.3, "question": "q", "answer": "a",
                 "ground_truth": "gt", "sample_id": "s1"}]
        result = diagnose(rows, metrics=["faithfulness", "context_recall"])
        metrics_found = [d.metric for d in result]
        self.assertIn("faithfulness", metrics_found)
        self.assertNotIn("context_recall", metrics_found)
    def test_all_seven_metrics_have_rules(self):
        expected = {"faithfulness", "answer_relevancy", "context_recall",
                    "context_precision", "noise_sensitivity",
                    "factual_correctness", "semantic_similarity"}
        self.assertEqual(set(METRIC_RULES.keys()), expected)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_advisor_writer.py
+++ b/tests/test_advisor_writer.py
@@ -0,0 +1,113 @@
 import shutil
 import unittest
 from pathlib import Path
 from rag_eval.advisor.rules import Diagnosis
 from rag_eval.advisor.writer import write_advice, _format_log_summary
 class TestWriteAdvice(unittest.TestCase):
    def setUp(self):
        self.tmp = Path("tests/.tmp/test_advisor_writer")
        shutil.rmtree(self.tmp, ignore_errors=True)
        self.tmp.mkdir(parents=True, exist_ok=True)
        self.advice_path = self.tmp / "optimization_advice.md"
    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)
    def _make_diagnosis(self, metric="faithfulness", severity="warning"):
        return Diagnosis(
            metric=metric,
            mean_score=0.55,
            threshold=0.7,
            severity=severity,
            root_causes=["原因1", "原因2"],
            suggested_actions=["建议1", "建议2"],
            low_samples=[
                {"sample_id": "s1", "question": "问题1", "answer": "答案1",
                 "ground_truth": "标准1", metric: 0.4},
            ],
        )
    def test_write_creates_file(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\nLLM 建议内容",
            advice_path=self.advice_path,
            scenario_name="test-scenario",
            run_id="2026-01-01T00-00-00",
            judge_model="deepseek-v4-flash",
        )
        self.assertTrue(self.advice_path.exists())
    def test_write_contains_scenario_name_and_run_id(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\nLLM 建议",
            advice_path=self.advice_path,
            scenario_name="siemens-test",
            run_id="2026-01-01T00-00-00",
            judge_model="deepseek-v4-flash",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("siemens-test", content)
        self.assertIn("2026-01-01T00-00-00", content)
    def test_write_contains_llm_markdown(self):
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="## faithfulness\n\n具体建议文本",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("具体建议文本", content)
    def test_write_fallback_when_no_llm_markdown(self):
        """When llm_markdown is empty, writer emits rule-only report."""
        diag = self._make_diagnosis()
        write_advice(
            diagnoses=[diag],
            llm_markdown="",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("faithfulness", content)
        self.assertIn("原因1", content)
    def test_log_summary_format(self):
        diags = [
            self._make_diagnosis("faithfulness", "critical"),
            self._make_diagnosis("context_recall", "warning"),
        ]
        summary = _format_log_summary(diags, self.advice_path)
        self.assertIn("faithfulness", summary)
        self.assertIn("critical", summary)
        self.assertIn("context_recall", summary)
        self.assertIn("warning", summary)
    def test_write_empty_diagnoses_still_creates_file(self):
        write_advice(
            diagnoses=[],
            llm_markdown="",
            advice_path=self.advice_path,
            scenario_name="test",
            run_id="rid",
            judge_model="model",
        )
        self.assertTrue(self.advice_path.exists())
        content = self.advice_path.read_text(encoding="utf-8")
        self.assertIn("未发现明显指标异常", content)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/webapp/init.py
+++ b/tests/webapp/init.py
--- a/tests/webapp/test_llm_profiles_api.py
+++ b/tests/webapp/test_llm_profiles_api.py
@@ -0,0 +1,139 @@
 """Integration tests for /api/llm-profiles endpoints."""
 import pytest
 from fastapi.testclient import TestClient
@pytest.fixture()
 def client(tmp_path, monkeypatch):
    """TestClient with a fresh ProfileManager backed by a temp file."""
    store = tmp_path / "profiles.json"
    import webapp.services.profile_manager as pm_mod
    from webapp.services.profile_manager import ProfileManager
    fresh_mgr = ProfileManager(store_path=store)
    monkeypatch.setattr(pm_mod, "profile_manager", fresh_mgr)
    import webapp.api.llm_profiles as api_mod
    monkeypatch.setattr(api_mod, "profile_manager", fresh_mgr)
    from webapp.server import create_app
    return TestClient(create_app())
 def test_list_empty(client):
    resp = client.get("/api/llm-profiles")
    assert resp.status_code == 200
    assert resp.json()["profiles"] == []
 def test_create_and_list(client):
    body = {"name": "Test", "model": "m1", "base_url": "http://x/v1", "api_key": "k"}
    resp = client.post("/api/llm-profiles", json=body)
    assert resp.status_code == 201
    data = resp.json()
    assert data["name"] == "Test"
    assert data["profile_id"] != ""
    resp2 = client.get("/api/llm-profiles")
    assert len(resp2.json()["profiles"]) == 1
 def test_update_profile(client):
    body = {"name": "Old", "model": "m1", "base_url": "http://x/v1", "api_key": "k"}
    pid = client.post("/api/llm-profiles", json=body).json()["profile_id"]
    upd = {"name": "New", "model": "m2", "base_url": "http://x/v1", "api_key": "k", "timeout_seconds": 60}
    resp = client.put(f"/api/llm-profiles/{pid}", json=upd)
    assert resp.status_code == 200
    assert resp.json()["name"] == "New"
    assert resp.json()["timeout_seconds"] == 60
 def test_delete_profile(client):
    body = {"name": "Del", "model": "m", "base_url": "http://x/v1", "api_key": "k"}
    pid = client.post("/api/llm-profiles", json=body).json()["profile_id"]
    resp = client.delete(f"/api/llm-profiles/{pid}")
    assert resp.status_code == 200
    assert resp.json()["deleted"] is True
    assert len(client.get("/api/llm-profiles").json()["profiles"]) == 0
 def test_update_nonexistent(client):
    resp = client.put("/api/llm-profiles/nope",
                      json={"name": "X", "model": "m", "base_url": "http://x/v1", "api_key": "k"})
    assert resp.status_code == 404
 def test_delete_nonexistent(client):
    resp = client.delete("/api/llm-profiles/nope")
    assert resp.status_code == 404
 # ---------------------------------------------------------------------------
 # YAML patcher tests
 # ---------------------------------------------------------------------------
 import yaml as yaml_lib
 from webapp.services.yaml_patcher import apply_profiles_to_scenario
 from webapp.models import LLMProfile
 def test_apply_judge_profile(tmp_path):
    """Applying a judge profile patches judge_model in the YAML."""
    scenario_file = tmp_path / "test-scenario.yaml"
    scenario_file.write_text(
        "scenario_name: test\nmode: offline\njudge_model: old-model\nembedding_model: emb\n"
        "dataset: data.csv\nmetrics:\n- faithfulness\noutput_dir: outputs/test\n",
        encoding="utf-8",
    )
    judge_p = LLMProfile(
        profile_id="x", name="J", model="new-model",
        base_url="http://x/v1", api_key="k", created_at="t", updated_at="t",
    )
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=judge_p,
        answer_profile=None,
        dataset_profile=None,
        _resolve_absolute=True,
    )
    assert "judge_model" in patched
    data = yaml_lib.safe_load(scenario_file.read_text())
    assert data["judge_model"] == "new-model"
 def test_apply_answer_profile(tmp_path):
    """Applying an answer profile patches app_adapter.static_kwargs.model."""
    scenario_file = tmp_path / "online.yaml"
    scenario_file.write_text(
        "scenario_name: online\nmode: online\njudge_model: j\nembedding_model: emb\n"
        "dataset: d.csv\nmetrics:\n- faithfulness\noutput_dir: out\n"
        "app_adapter:\n  type: python\n  callable: apps.foo:run\n"
        "  static_kwargs:\n    model: old\n    source_chunks_path: chunks.jsonl\n",
        encoding="utf-8",
    )
    answer_p = LLMProfile(
        profile_id="y", name="A", model="new-answer-model",
        base_url="http://x/v1", api_key="k", created_at="t", updated_at="t",
    )
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None,
        answer_profile=answer_p,
        dataset_profile=None,
        _resolve_absolute=True,
    )
    assert "app_adapter.static_kwargs.model" in patched
    data = yaml_lib.safe_load(scenario_file.read_text())
    assert data["app_adapter"]["static_kwargs"]["model"] == "new-answer-model"
 def test_apply_no_profiles_returns_empty(tmp_path):
    """When no profiles are given, no fields are patched."""
    scenario_file = tmp_path / "noop.yaml"
    scenario_file.write_text("scenario_name: noop\njudge_model: m\n", encoding="utf-8")
    patched = apply_profiles_to_scenario(
        scenario_path=str(scenario_file),
        judge_profile=None,
        answer_profile=None,
        dataset_profile=None,
        _resolve_absolute=True,
    )
    assert patched == []
--- a/tests/webapp/test_profile_manager.py
+++ b/tests/webapp/test_profile_manager.py
@@ -0,0 +1,100 @@
 import pytest
 from webapp.models import LLMProfile, ProfileApplyRequest, ProfileApplyResponse
 def test_llm_profile_defaults():
    p = LLMProfile(
        profile_id="abc",
        name="Test",
        model="gpt-4",
        base_url="http://localhost/v1",
        api_key="sk-test",
    )
    assert p.timeout_seconds == 30
    assert p.created_at != ""
    assert p.updated_at != ""
 def test_profile_apply_request_fields():
    req = ProfileApplyRequest(
        scenario_path="scenarios/offline/sample.yaml",
        judge_profile_id="id1",
        answer_profile_id="id2",
        dataset_profile_id=None,
    )
    assert req.judge_profile_id == "id1"
    assert req.dataset_profile_id is None
 def test_profile_apply_response():
    resp = ProfileApplyResponse(scenario_path="scenarios/offline/sample.yaml", patched_fields=["judge_model"])
    assert "judge_model" in resp.patched_fields
 # ---------------------------------------------------------------------------
 # ProfileManager service tests
 # ---------------------------------------------------------------------------
 import json
 from webapp.services.profile_manager import ProfileManager
 def _make_manager(tmp_path):
    store = tmp_path / "profiles.json"
    return ProfileManager(store_path=store)
 def test_create_profile(tmp_path):
    mgr = _make_manager(tmp_path)
    p = mgr.create(name="Local", model="deepseek-v4-flash",
                   base_url="http://localhost/v1", api_key="sk-x")
    assert p.profile_id != ""
    assert p.name == "Local"
 def test_list_profiles(tmp_path):
    mgr = _make_manager(tmp_path)
    mgr.create(name="A", model="m1", base_url="http://a/v1", api_key="k1")
    mgr.create(name="B", model="m2", base_url="http://b/v1", api_key="k2")
    profiles = mgr.list_all()
    assert len(profiles) == 2
 def test_get_profile(tmp_path):
    mgr = _make_manager(tmp_path)
    created = mgr.create(name="X", model="m", base_url="http://x/v1", api_key="k")
    fetched = mgr.get(created.profile_id)
    assert fetched is not None
    assert fetched.name == "X"
 def test_update_profile(tmp_path):
    mgr = _make_manager(tmp_path)
    p = mgr.create(name="Old", model="m", base_url="http://x/v1", api_key="k")
    updated = mgr.update(p.profile_id, name="New", model="m2",
                         base_url="http://x/v1", api_key="k", timeout_seconds=60)
    assert updated is not None
    assert updated.name == "New"
    assert updated.model == "m2"
    assert updated.timeout_seconds == 60
 def test_delete_profile(tmp_path):
    mgr = _make_manager(tmp_path)
    p = mgr.create(name="Del", model="m", base_url="http://x/v1", api_key="k")
    assert mgr.delete(p.profile_id) is True
    assert mgr.get(p.profile_id) is None
 def test_persistence(tmp_path):
    store = tmp_path / "profiles.json"
    mgr1 = ProfileManager(store_path=store)
    p = mgr1.create(name="Persist", model="m", base_url="http://x/v1", api_key="k")
    mgr2 = ProfileManager(store_path=store)
    assert mgr2.get(p.profile_id) is not None
 def test_get_nonexistent(tmp_path):
    mgr = _make_manager(tmp_path)
    assert mgr.get("does-not-exist") is None
 def test_delete_nonexistent(tmp_path):
    mgr = _make_manager(tmp_path)
    assert mgr.delete("does-not-exist") is False
--- a/webapp/api/llm_profiles.py
+++ b/webapp/api/llm_profiles.py
@@ -0,0 +1,96 @@
 """CRUD routes for LLM profiles plus the scenario-patching apply endpoint."""
 from __future__ import annotations
 from fastapi import APIRouter, HTTPException
 from webapp.models import (
    CreateProfileRequest,
    LLMProfile,
    ProfileApplyRequest,
    ProfileApplyResponse,
 )
 from webapp.services.profile_manager import profile_manager
 from webapp.services.yaml_patcher import apply_profiles_to_scenario
 router = APIRouter(prefix="/api/llm-profiles", tags=["llm-profiles"])
@router.get("", response_model=dict)
 def list_profiles() -> dict:
    """Return all saved LLM profiles."""
    return {"profiles": [p.model_dump() for p in profile_manager.list_all()]}
@router.post("", status_code=201, response_model=LLMProfile)
 def create_profile(request: CreateProfileRequest) -> LLMProfile:
    """Create a new LLM profile."""
    return profile_manager.create(
        name=request.name,
        model=request.model,
        base_url=request.base_url,
        api_key=request.api_key,
        timeout_seconds=request.timeout_seconds,
    )
@router.put("/{profile_id}", response_model=LLMProfile)
 def update_profile(profile_id: str, request: CreateProfileRequest) -> LLMProfile:
    """Update an existing LLM profile by id."""
    updated = profile_manager.update(
        profile_id=profile_id,
        name=request.name,
        model=request.model,
        base_url=request.base_url,
        api_key=request.api_key,
        timeout_seconds=request.timeout_seconds,
    )
    if updated is None:
        raise HTTPException(status_code=404, detail=f"Profile not found: {profile_id}")
    return updated
@router.delete("/{profile_id}", response_model=dict)
 def delete_profile(profile_id: str) -> dict:
    """Delete an LLM profile by id."""
    deleted = profile_manager.delete(profile_id)
    if not deleted:
        raise HTTPException(status_code=404, detail=f"Profile not found: {profile_id}")
    return {"deleted": True}
@router.post("/apply", response_model=ProfileApplyResponse)
 def apply_profiles(request: ProfileApplyRequest) -> ProfileApplyResponse:
    """Patch selected LLM profiles into the target scenario YAML file."""
    role_profiles: dict[str, LLMProfile | None] = {
        "judge": profile_manager.get(request.judge_profile_id) if request.judge_profile_id else None,
        "answer": profile_manager.get(request.answer_profile_id) if request.answer_profile_id else None,
        "dataset": profile_manager.get(request.dataset_profile_id) if request.dataset_profile_id else None,
    }
    missing = [
        role
        for role, pid in [
            ("judge", request.judge_profile_id),
            ("answer", request.answer_profile_id),
            ("dataset", request.dataset_profile_id),
        ]
        if pid and role_profiles[role] is None
    ]
    if missing:
        raise HTTPException(
            status_code=400,
            detail=f"Profile(s) not found for roles: {', '.join(missing)}",
        )
    patched = apply_profiles_to_scenario(
        scenario_path=request.scenario_path,
        judge_profile=role_profiles["judge"],
        answer_profile=role_profiles["answer"],
        dataset_profile=role_profiles["dataset"],
    )
    return ProfileApplyResponse(
        scenario_path=request.scenario_path,
        patched_fields=patched,
    )
--- a/webapp/models.py
+++ b/webapp/models.py
@@ -2,11 +2,16 @@
 from __future__ import annotations
 from datetime import datetime, timezone
 from typing import Any
 from pydantic import BaseModel, Field
 def _utcnow_iso() -> str:
    return datetime.now(timezone.utc).isoformat()
 class RunSummary(BaseModel):
    """Compact description of a single evaluation run for list views."""
@@ -68,6 +73,7 @@ class ReportData(BaseModel):
    groupings: dict[str, list[GroupStat]] = Field(default_factory=dict)
    lowest_samples: list[SampleScore] = Field(default_factory=list)
    summary_markdown: str = ""
    advice_markdown: str = ""  # optimization_advice.md content (empty if not generated)
 class RunDetail(BaseModel):
@@ -114,6 +120,45 @@ class TriggerEvaluationResponse(BaseModel):
    task_id: str
 class LLMProfile(BaseModel):
    """A named LLM connection configuration that can be reused across tasks."""
    profile_id: str
    name: str
    model: str
    base_url: str
    api_key: str
    timeout_seconds: int = 30
    created_at: str = Field(default_factory=_utcnow_iso)
    updated_at: str = Field(default_factory=_utcnow_iso)
 class CreateProfileRequest(BaseModel):
    """Request body for creating or updating an LLM profile."""
    name: str
    model: str
    base_url: str
    api_key: str
    timeout_seconds: int = 30
 class ProfileApplyRequest(BaseModel):
    """Request body to patch LLM profile selections into a scenario YAML."""
    scenario_path: str
    judge_profile_id: str | None = None
    answer_profile_id: str | None = None
    dataset_profile_id: str | None = None
 class ProfileApplyResponse(BaseModel):
    """Response after patching a scenario YAML with profile settings."""
    scenario_path: str
    patched_fields: list[str] = Field(default_factory=list)
 def jsonable(value: Any) -> Any:
    """Convert NaN/inf floats into None so the payload stays valid JSON."""
    import math
--- a/webapp/server.py
+++ b/webapp/server.py
@@ -13,7 +13,7 @@ from fastapi import FastAPI
 from fastapi.responses import FileResponse
 from fastapi.staticfiles import StaticFiles
-from webapp.api import evaluations, runs, scenarios
+from webapp.api import evaluations, llm_profiles, runs, scenarios
 STATIC_DIR = Path(__file__).resolve().parent / "static"
@@ -29,6 +29,7 @@ def create_app() -> FastAPI:
    app.include_router(runs.router)
    app.include_router(scenarios.router)
    app.include_router(evaluations.router)
    app.include_router(llm_profiles.router)
    @app.get("/api/health", tags=["meta"])
    def health() -> dict[str, str]:
--- a/webapp/services/profile_manager.py
+++ b/webapp/services/profile_manager.py
@@ -0,0 +1,137 @@
 """In-memory + JSON-file LLM profile manager.
 Profiles are kept in a dict keyed by profile_id and written to a JSON file
 on every mutation, so they survive server restarts. The pattern mirrors
 TaskManager but without threading concerns beyond a simple lock (profiles
 are only mutated by API calls in FastAPI request handlers).
 """
 from __future__ import annotations
 import json
 import threading
 import uuid
 from datetime import datetime, timezone
 from pathlib import Path
 from webapp.models import LLMProfile
 _DEFAULT_STORE = Path(__file__).resolve().parents[2] / "configs" / "llm_profiles.json"
 def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()
 class ProfileManager:
    """Manages LLM profiles with in-memory cache and JSON file persistence."""
    def __init__(self, store_path: Path = _DEFAULT_STORE) -> None:
        self._store_path = store_path
        self._lock = threading.Lock()
        self._profiles: dict[str, LLMProfile] = {}
        self._load()
    # ------------------------------------------------------------------ #
    # Public API
    # ------------------------------------------------------------------ #
    def list_all(self) -> list[LLMProfile]:
        """Return all profiles sorted by creation time."""
        with self._lock:
            return sorted(self._profiles.values(), key=lambda p: p.created_at)
    def get(self, profile_id: str) -> LLMProfile | None:
        """Return one profile by id, or None if not found."""
        with self._lock:
            return self._profiles.get(profile_id)
    def create(
        self,
        name: str,
        model: str,
        base_url: str,
        api_key: str,
        timeout_seconds: int = 30,
    ) -> LLMProfile:
        """Create and persist a new profile, returning it."""
        now = _now_iso()
        profile = LLMProfile(
            profile_id=uuid.uuid4().hex[:12],
            name=name,
            model=model,
            base_url=base_url,
            api_key=api_key,
            timeout_seconds=timeout_seconds,
            created_at=now,
            updated_at=now,
        )
        with self._lock:
            self._profiles[profile.profile_id] = profile
            self._persist()
        return profile
    def update(
        self,
        profile_id: str,
        name: str,
        model: str,
        base_url: str,
        api_key: str,
        timeout_seconds: int = 30,
    ) -> LLMProfile | None:
        """Update an existing profile in-place; returns None if not found."""
        with self._lock:
            existing = self._profiles.get(profile_id)
            if existing is None:
                return None
            updated = existing.model_copy(update={
                "name": name,
                "model": model,
                "base_url": base_url,
                "api_key": api_key,
                "timeout_seconds": timeout_seconds,
                "updated_at": _now_iso(),
            })
            self._profiles[profile_id] = updated
            self._persist()
        return updated
    def delete(self, profile_id: str) -> bool:
        """Remove a profile; returns True if deleted, False if not found."""
        with self._lock:
            if profile_id not in self._profiles:
                return False
            del self._profiles[profile_id]
            self._persist()
        return True
    # ------------------------------------------------------------------ #
    # Persistence helpers
    # ------------------------------------------------------------------ #
    def _load(self) -> None:
        """Load profiles from the JSON store file, ignoring missing/corrupt files."""
        if not self._store_path.exists():
            return
        try:
            data = json.loads(self._store_path.read_text(encoding="utf-8"))
            for raw in data.get("profiles", []):
                p = LLMProfile.model_validate(raw)
                self._profiles[p.profile_id] = p
        except Exception:  # noqa: BLE001
            pass  # Corrupt store — start fresh
    def _persist(self) -> None:
        """Write current profiles to the JSON store file (called under lock)."""
        self._store_path.parent.mkdir(parents=True, exist_ok=True)
        payload = {"profiles": [p.model_dump() for p in self._profiles.values()]}
        self._store_path.write_text(
            json.dumps(payload, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
 # Module-level singleton shared by FastAPI routes.
 profile_manager = ProfileManager()
--- a/webapp/services/report_builder.py
+++ b/webapp/services/report_builder.py
@@ -164,12 +164,14 @@ def build_report(run_dir: Path, metrics: list[str]) -> ReportData:
    """Build the full aggregated report payload for one run directory."""
    frame = run_reader.read_scores_frame(run_dir)
    summary_markdown = run_reader.read_summary_markdown(run_dir)
    advice_markdown = run_reader.read_advice_markdown(run_dir)
    if frame.empty or not metrics:
        return ReportData(
            metrics=metrics,
            metric_means={metric: None for metric in metrics},
            summary_markdown=summary_markdown,
            advice_markdown=advice_markdown,
        )
    distributions = {
@@ -185,4 +187,5 @@ def build_report(run_dir: Path, metrics: list[str]) -> ReportData:
        groupings=_groupings(frame, metrics),
        lowest_samples=_lowest_samples(frame, metrics),
        summary_markdown=summary_markdown,
        advice_markdown=advice_markdown,
    )
--- a/webapp/services/run_reader.py
+++ b/webapp/services/run_reader.py
@@ -220,3 +220,14 @@ def read_summary_markdown(run_dir: Path) -> str:
        return summary_path.read_text(encoding="utf-8")
    except OSError:
        return ""
 def read_advice_markdown(run_dir: Path) -> str:
    """Return the optimization_advice.md for a run, or an empty string if not generated."""
    advice_path = run_dir / "optimization_advice.md"
    if not advice_path.is_file():
        return ""
    try:
        return advice_path.read_text(encoding="utf-8")
    except OSError:
        return ""
--- a/webapp/services/yaml_patcher.py
+++ b/webapp/services/yaml_patcher.py
@@ -0,0 +1,74 @@
 """Patch LLM profile settings into scenario YAML files in-place.
 Only the fields that correspond to a provided (non-None) profile are touched.
 All other fields and structure are preserved as much as PyYAML allows
 (comments are lost on round-trip, which is an accepted trade-off).
 """
 from __future__ import annotations
 from pathlib import Path
 from typing import Any
 import yaml
 from webapp.models import LLMProfile
 def _repo_root() -> Path:
    return Path(__file__).resolve().parents[2]
 def _resolve_scenario_path(path_str: str) -> Path:
    """Resolve a scenario path; absolute paths are used as-is."""
    candidate = Path(path_str)
    if candidate.is_absolute():
        return candidate
    return (_repo_root() / candidate).resolve()
 def apply_profiles_to_scenario(
    scenario_path: str,
    judge_profile: LLMProfile | None,
    answer_profile: LLMProfile | None,
    dataset_profile: LLMProfile | None,
    _resolve_absolute: bool = False,
 ) -> list[str]:
    """Patch the YAML file at *scenario_path* with the supplied profiles.
    Returns a list of dotted field names that were actually patched.
    Setting *_resolve_absolute=True* skips repo-root resolution (used in tests).
    """
    if _resolve_absolute:
        resolved = Path(scenario_path)
    else:
        resolved = _resolve_scenario_path(scenario_path)
    if not resolved.exists():
        raise FileNotFoundError(f"Scenario file not found: {resolved}")
    data: dict[str, Any] = yaml.safe_load(resolved.read_text(encoding="utf-8")) or {}
    patched: list[str] = []
    if judge_profile is not None:
        data["judge_model"] = judge_profile.model
        patched.append("judge_model")
    if answer_profile is not None:
        adapter = data.get("app_adapter")
        if isinstance(adapter, dict):
            static_kwargs = adapter.setdefault("static_kwargs", {})
            static_kwargs["model"] = answer_profile.model
            patched.append("app_adapter.static_kwargs.model")
    if dataset_profile is not None:
        generation = data.get("generation")
        if isinstance(generation, dict):
            generation["model"] = dataset_profile.model
            patched.append("generation.model")
    resolved.write_text(
        yaml.dump(data, allow_unicode=True, default_flow_style=False, sort_keys=False),
        encoding="utf-8",
    )
    return patched
--- a/webapp/static/css/app.css
+++ b/webapp/static/css/app.css
@@ -265,3 +265,69 @@ table.group-table td { border-bottom: 1px solid #f1f5f9; font-variant-numeric: t
  .sidebar { width: 64px; }
  .brand-sub, .nav-item span:not(.nav-ico), .sidebar-foot span:last-child { display: none; }
 }
 /* ---------- LLM 配置管理页 ---------- */
 .profile-grid { display: grid; grid-template-columns: repeat(auto-fill, minmax(300px, 1fr)); gap: 16px; }
 .profile-card {
  background: var(--surface); border: 1px solid var(--line); border-radius: var(--radius);
  padding: 16px; box-shadow: var(--shadow);
 }
 .profile-card-head { display: flex; justify-content: space-between; align-items: center; margin-bottom: 10px; }
 .profile-card-name { font-size: 15px; font-weight: 600; }
 .profile-card-actions { display: flex; gap: 6px; }
 .profile-card-field { font-size: 12px; color: var(--slate); margin-top: 4px; }
 .field-label { font-weight: 600; color: var(--ink); }
 /* Form */
 .profile-form { display: flex; flex-direction: column; gap: 12px; margin-top: 14px; max-width: 560px; }
 .form-row { display: flex; flex-direction: column; gap: 4px; }
 .form-label { font-size: 13px; font-weight: 600; }
 .req { color: var(--bad); }
 .form-input {
  border: 1px solid var(--line); border-radius: 6px; padding: 8px 10px;
  font-size: 13px; font-family: inherit; width: 100%;
 }
 .form-input:focus { outline: none; border-color: var(--petrol); }
 .form-input-sm { max-width: 120px; }
 .form-actions { display: flex; gap: 10px; align-items: center; margin-top: 4px; }
 .form-error { font-size: 12px; color: var(--bad); }
 .btn-sm { padding: 4px 10px; font-size: 12px; }
 .btn-danger { color: var(--bad); border-color: var(--bad); }
 .btn-danger:hover { background: #fee2e2; }
 /* 选中态 run 卡片 */
 .run-card.selected {
  border-color: var(--petrol);
  box-shadow: 0 0 0 2px rgba(0,153,153,0.25), var(--shadow);
 }
 /* ---------- LLM 角色配置面板 ---------- */
 .llm-assignment-panel { border-left: 3px solid var(--petrol); }
 .llm-role-rows { display: flex; flex-direction: column; gap: 10px; }
 .llm-role-row { display: flex; align-items: center; gap: 14px; }
 .llm-role-label { font-size: 13px; font-weight: 600; min-width: 180px; color: var(--ink); }
 .llm-role-select { min-width: 240px; }
 /* ---------- ⑤ 优化建议面板 ---------- */
 .advice-panel { border-left: 3px solid #7c3aed; }
 .advice-header {
  display: flex; align-items: center; gap: 10px;
  margin-bottom: 14px;
 }
 .advice-badge {
  background: #7c3aed; color: #fff;
  font-size: 11px; font-weight: 700; letter-spacing: 0.5px;
  padding: 3px 8px; border-radius: 4px; text-transform: uppercase;
 }
 .advice-model { font-size: 12px; color: var(--slate); }
 .advice-body { line-height: 1.7; color: var(--ink); }
 .advice-md h1 { font-size: 16px; font-weight: 700; margin: 16px 0 8px; color: var(--ink); }
 .advice-md h2 {
  font-size: 14px; font-weight: 700; margin: 20px 0 8px;
  padding-bottom: 4px; border-bottom: 1px solid var(--line); color: var(--ink-soft);
 }
 .advice-md h3 { font-size: 13px; font-weight: 600; margin: 12px 0 6px; color: var(--slate); }
 .advice-md hr { border: none; border-top: 1px solid var(--line); margin: 14px 0; }
 .advice-md ul { padding-left: 20px; margin: 6px 0; }
 .advice-md li { margin: 3px 0; font-size: 13px; }
 .advice-md strong { color: var(--ink); font-weight: 600; }
--- a/webapp/static/index.html
+++ b/webapp/static/index.html
@@ -22,9 +22,12 @@
        <button class="nav-item" data-view="new">
          <span class="nav-ico">＋</span><span>新建评估</span>
        </button>
-        <button class="nav-item" data-view="report" data-requires-run="1">
+        <button class="nav-item" data-view="report" data-requires-run="1" disabled>
          <span class="nav-ico">▤</span><span>报告详情</span>
        </button>
        <button class="nav-item" data-view="profiles">
          <span class="nav-ico">⚙</span><span>LLM 配置</span>
        </button>
      </nav>
      <div class="sidebar-foot">
        <span class="dot" id="health-dot"></span>
@@ -59,6 +62,33 @@
            <span class="selected-scenario muted" id="selected-scenario">未选择场景</span>
          </div>
        </div>
        <!-- LLM 角色配置面板（选中场景后显示） -->
        <div class="panel llm-assignment-panel" id="llm-assignment-panel" hidden>
          <h2>LLM 角色配置 <span class="muted" style="font-size:13px;font-weight:400">（可选）</span></h2>
          <p class="muted" style="margin-bottom:14px">为不同任务角色选择已保存的 LLM 配置，留空则使用场景文件中的原始配置。</p>
          <div class="llm-role-rows">
            <div class="llm-role-row">
              <label class="llm-role-label">评测打分 Judge LLM</label>
              <select class="select llm-role-select" id="role-judge">
                <option value="">— 使用场景原始配置 —</option>
              </select>
            </div>
            <div class="llm-role-row">
              <label class="llm-role-label">生成答案 Answer LLM</label>
              <select class="select llm-role-select" id="role-answer">
                <option value="">— 使用场景原始配置 —</option>
              </select>
            </div>
            <div class="llm-role-row">
              <label class="llm-role-label">生成题库 Dataset LLM</label>
              <select class="select llm-role-select" id="role-dataset">
                <option value="">— 使用场景原始配置 —</option>
              </select>
            </div>
          </div>
        </div>
        <div class="panel" id="task-panel" hidden>
          <div class="task-head">
            <h2>评估进度</h2>
@@ -105,6 +135,68 @@
          <!-- ④ 最低分样本逐条复核 -->
          <div class="section-label">④ 最低分样本（点击展开逐条复核）</div>
          <div class="lowest-table" id="lowest-table"></div>
          <!-- ⑤ 优化建议（optimization_advisor: true 时显示） -->
          <div id="advice-section" hidden>
            <div class="section-label">⑤ 优化建议 OPTIMIZATION ADVICE</div>
            <div class="panel advice-panel">
              <div class="advice-header">
                <span class="advice-badge">AI 诊断报告</span>
                <span class="advice-model" id="advice-model-label"></span>
              </div>
              <div class="advice-body" id="advice-body"></div>
            </div>
          </div>
        </div>
      </section>
      <!-- LLM 配置视图 -->
      <section class="view" id="view-profiles" hidden>
        <div class="panel">
          <div class="panel-head">
            <h2>LLM 配置管理</h2>
            <button class="btn btn-primary" id="add-profile-btn">＋ 新建配置</button>
          </div>
          <p class="muted">保存常用 LLM 连接参数，在运行评估时按角色选择。</p>
        </div>
        <!-- 新建 / 编辑表单（默认隐藏） -->
        <div class="panel" id="profile-form-panel" hidden>
          <h2 id="profile-form-title">新建 LLM 配置</h2>
          <div class="profile-form">
            <input type="hidden" id="edit-profile-id" />
            <div class="form-row">
              <label class="form-label">配置名称 <span class="req">*</span></label>
              <input class="form-input" id="pf-name" placeholder="例：DeepSeek Flash（内网）" />
            </div>
            <div class="form-row">
              <label class="form-label">模型名称 <span class="req">*</span></label>
              <input class="form-input" id="pf-model" placeholder="例：deepseek-v4-flash" />
            </div>
            <div class="form-row">
              <label class="form-label">Base URL <span class="req">*</span></label>
              <input class="form-input" id="pf-base-url" placeholder="例：http://6.86.80.4:30080/v1" />
            </div>
            <div class="form-row">
              <label class="form-label">API Key <span class="req">*</span></label>
              <input class="form-input" id="pf-api-key" type="password" placeholder="sk-…" />
            </div>
            <div class="form-row">
              <label class="form-label">超时（秒）</label>
              <input class="form-input form-input-sm" id="pf-timeout" type="number" value="30" min="5" max="300" />
            </div>
            <div class="form-actions">
              <button class="btn btn-primary" id="save-profile-btn">保存</button>
              <button class="btn" id="cancel-profile-btn">取消</button>
              <span class="form-error muted" id="profile-form-error"></span>
            </div>
          </div>
        </div>
        <div id="profile-cards" class="profile-grid"></div>
        <div class="empty" id="profiles-empty" hidden>
          <p>尚未添加任何 LLM 配置。</p>
          <p class="muted">点击「新建配置」添加第一个。</p>
        </div>
      </section>
    </main>
@@ -112,6 +204,7 @@
  <script src="/static/js/api.js"></script>
  <script src="/static/js/report.js"></script>
  <script src="/static/js/profiles.js"></script>
  <script src="/static/js/runner.js"></script>
  <script src="/static/js/app.js"></script>
 </body>
--- a/webapp/static/js/api.js
+++ b/webapp/static/js/api.js
@@ -43,4 +43,26 @@ const API = {
    return API.post("/api/evaluations", { scenario_path: scenarioPath });
  },
  taskStatus(taskId) { return API.get(`/api/evaluations/${encodeURIComponent(taskId)}`); },
  // LLM Profile API
  profiles() { return API.get("/api/llm-profiles"); },
  createProfile(body) { return API.post("/api/llm-profiles", body); },
  updateProfile(id, body) {
    return fetch(`/api/llm-profiles/${encodeURIComponent(id)}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    }).then(async r => {
      if (!r.ok) { const d = await API._extractError(r); throw new Error(d); }
      return r.json();
    });
  },
  deleteProfile(id) {
    return fetch(`/api/llm-profiles/${encodeURIComponent(id)}`, { method: "DELETE" })
      .then(async r => {
        if (!r.ok) { const d = await API._extractError(r); throw new Error(d); }
        return r.json();
      });
  },
  applyProfiles(body) { return API.post("/api/llm-profiles/apply", body); },
 };
--- a/webapp/static/js/app.js
+++ b/webapp/static/js/app.js
@@ -1,28 +1,59 @@
 // app.js — 视图路由、运行列表渲染、健康检查。整个控制台的入口编排。
 // 会话保持：URL hash 路由（#runs / #new / #profiles / #report/{runId}）
 //            + sessionStorage 兜底，F5 刷新 / 浏览器前进后退均可恢复。
 const App = {
  currentRunId: null,
-  views: ["runs", "new", "report"],
+  activeView: null,
-  titles: { runs: "运行列表", new: "新建评估", report: "报告详情" },
+  views: ["runs", "new", "report", "profiles"],
  titles: { runs: "运行列表", new: "新建评估", report: "报告详情", profiles: "LLM 配置" },
-  // 初始化：绑定导航、加载首屏、启动健康检查。
+  // 初始化：绑定导航、从 URL/sessionStorage 恢复上次位置、启动健康检查。
  init() {
    document.querySelectorAll(".nav-item").forEach((btn) => {
-      btn.addEventListener("click", () => App.switchView(btn.dataset.view));
+      btn.addEventListener("click", () => App.navigate(btn.dataset.view));
    });
    document.getElementById("refresh-btn").addEventListener("click", () => App.refreshCurrent());
    Runner.init();
-    App.switchView("runs");
+    Profiles.init();
    // 恢复上次会话（优先 URL hash，其次 sessionStorage）
    App._restoreSession();
    App.checkHealth();
    setInterval(App.checkHealth, 15000);
    // 浏览器前进 / 后退按钮
    window.addEventListener("popstate", () => App._restoreSession());
  },
-  // 切换主视图，并同步导航高亮与标题。
+  // ----------------------------------------------------------------
-  switchView(view) {
+  // 路由 —— 有历史记录的主动导航（更新 URL hash）
-    if (view === "report" && !App.currentRunId) {
+  // ----------------------------------------------------------------
-      // 没有选中的运行时，报告页显示占位。
+  navigate(view, runId) {
    if (runId !== undefined) App.currentRunId = runId;
    const hash = App._buildHash(view, App.currentRunId);
    if (location.hash !== `#${hash}`) {
      history.pushState({ view, runId: App.currentRunId }, "", `#${hash}`);
    }
    App._doSwitch(view);
  },
  // 供内部调用（不产生历史记录），例如刷新同一视图
  switchView(view) {
    App._doSwitch(view);
  },
  // 刷新当前视图数据
  refreshCurrent() {
    App._doSwitch(App.activeView || "runs");
  },
  // ----------------------------------------------------------------
  // 内部：实际切换 DOM + 触发数据加载
  // ----------------------------------------------------------------
  _doSwitch(view) {
    App.views.forEach((name) => {
      const el = document.getElementById(`view-${name}`);
      if (el) el.hidden = name !== view;
@@ -33,17 +64,53 @@ const App = {
    document.getElementById("view-title").textContent = App.titles[view] || view;
    App.activeView = view;
    // 持久化到 sessionStorage（URL 共享场景的备份）
    sessionStorage.setItem("rag_view", view);
    if (App.currentRunId) sessionStorage.setItem("rag_run_id", App.currentRunId);
    if (view === "runs")     App.loadRuns();
    if (view === "new")      Runner.loadScenarios();
    if (view === "report")   Report.render(App.currentRunId);
    if (view === "profiles") Profiles.load();
  },
-  // 刷新当前视图的数据。
+  // ----------------------------------------------------------------
-  refreshCurrent() {
+  // Hash 工具
-    App.switchView(App.activeView || "runs");
+  // ----------------------------------------------------------------
  _buildHash(view, runId) {
    if (view === "report" && runId) {
      return `report/${encodeURIComponent(runId)}`;
    }
    return view || "runs";
  },
-  // 加载并渲染运行列表。
+  _parseHash() {
    const raw = location.hash.replace(/^#\/?/, "");
    if (!raw) return { view: null, runId: null };
    if (raw.startsWith("report/")) {
      const runId = decodeURIComponent(raw.slice("report/".length));
      return { view: "report", runId };
    }
    const view = App.views.includes(raw) ? raw : null;
    return { view, runId: null };
  },
  // 会话恢复：hash → sessionStorage → 默认 runs
  _restoreSession() {
    const { view: hView, runId: hRunId } = App._parseHash();
    const view   = hView  || sessionStorage.getItem("rag_view")   || "runs";
    const runId  = hRunId || sessionStorage.getItem("rag_run_id") || null;
    if (runId) {
      App.currentRunId = runId;
      App.enableReportNav();
    }
    App._doSwitch(view);
  },
  // ----------------------------------------------------------------
  // 运行列表
  // ----------------------------------------------------------------
  async loadRuns() {
    const container = document.getElementById("runs-container");
    const empty = document.getElementById("runs-empty");
@@ -64,14 +131,16 @@ const App = {
    }
  },
  // 构造一张运行卡片。
  renderRunCard(run) {
    const card = document.createElement("div");
-    card.className = "run-card";
+    card.className = "run-card" + (run.run_id === App.currentRunId ? " selected" : "");
    card.addEventListener("click", () => {
-      App.currentRunId = run.run_id;
+      // 更新选中高亮
      document.querySelectorAll(".run-card").forEach((c) => c.classList.remove("selected"));
      card.classList.add("selected");
      App.enableReportNav();
-      App.switchView("report");
+      App.navigate("report", run.run_id);
    });
    const chips = (run.metrics || [])
@@ -79,7 +148,7 @@ const App = {
        const val = run.metric_means ? run.metric_means[m] : null;
        const cls = App.scoreClass(val);
        const text = val === null || val === undefined ? "n/a" : val.toFixed(2);
-        return `<span class="metric-chip">${App.escape(App.shortMetric(m))} <b class="${cls}">${text}</b></span>`;
+        return `<span class="metric-chip" title="${App.escape(m)}">${App.escape(App.shortMetric(m))} <b class="${cls}">${text}</b></span>`;
      })
      .join("");
@@ -96,13 +165,14 @@ const App = {
    return card;
  },
-  // 启用报告导航项（选中运行后）。
+  // ----------------------------------------------------------------
  // 工具方法
  // ----------------------------------------------------------------
  enableReportNav() {
    const btn = document.querySelector('.nav-item[data-view="report"]');
    if (btn) btn.disabled = false;
  },
  // 根据分值返回 good/warn/bad/na 配色类。
  scoreClass(value) {
    if (value === null || value === undefined) return "na";
    if (value >= 0.8) return "good";
@@ -110,31 +180,30 @@ const App = {
    return "bad";
  },
  // 指标名缩写，节省卡片横向空间。
  shortMetric(name) {
    const map = {
      faithfulness:        "faith.",
      answer_relevancy:    "ans.rel.",
      context_recall:      "ctx.recall",
      context_precision:   "ctx.prec.",
      noise_sensitivity:   "noise.sens.",
      factual_correctness: "fact.corr.",
      semantic_similarity: "sem.sim.",
    };
    return map[name] || name;
  },
  // 截取时间戳到分钟，便于阅读。
  shortTime(iso) {
    if (!iso) return "—";
    return String(iso).replace("T", " ").slice(0, 16);
  },
  // 简单 HTML 转义，防止注入。
  escape(text) {
    const div = document.createElement("div");
    div.textContent = text == null ? "" : String(text);
    return div.innerHTML;
  },
  // 健康检查，更新左下角状态点。
  async checkHealth() {
    const dot   = document.getElementById("health-dot");
    const label = document.getElementById("health-text");
--- a/webapp/static/js/profiles.js
+++ b/webapp/static/js/profiles.js
@@ -0,0 +1,118 @@
 // profiles.js — LLM 配置管理页面逻辑
 const Profiles = {
  _data: [],
  // 初始化：绑定按钮事件
  init() {
    document.getElementById("add-profile-btn").addEventListener("click", () => Profiles.showForm());
    document.getElementById("save-profile-btn").addEventListener("click", () => Profiles.save());
    document.getElementById("cancel-profile-btn").addEventListener("click", () => Profiles.hideForm());
  },
  // 加载并渲染 Profile 列表
  async load() {
    const grid = document.getElementById("profile-cards");
    const empty = document.getElementById("profiles-empty");
    grid.innerHTML = '<p class="muted">加载中…</p>';
    try {
      const data = await API.profiles();
      Profiles._data = data.profiles || [];
      grid.innerHTML = "";
      if (Profiles._data.length === 0) {
        empty.hidden = false;
      } else {
        empty.hidden = true;
        Profiles._data.forEach(p => grid.appendChild(Profiles.renderCard(p)));
      }
    } catch (err) {
      grid.innerHTML = `<p class="muted">加载失败：${App.escape(err.message)}</p>`;
    }
  },
  // 渲染单个 Profile 卡片
  renderCard(p) {
    const card = document.createElement("div");
    card.className = "profile-card";
    card.dataset.id = p.profile_id;
    card.innerHTML = `
      <div class="profile-card-head">
        <div class="profile-card-name">${App.escape(p.name)}</div>
        <div class="profile-card-actions">
          <button class="btn btn-sm" data-action="edit">编辑</button>
          <button class="btn btn-sm btn-danger" data-action="delete">删除</button>
        </div>
      </div>
      <div class="profile-card-field"><span class="field-label">模型</span> <code>${App.escape(p.model)}</code></div>
      <div class="profile-card-field"><span class="field-label">Base URL</span> <code>${App.escape(p.base_url)}</code></div>
      <div class="profile-card-field"><span class="field-label">超时</span> ${p.timeout_seconds}s</div>
    `;
    card.querySelector("[data-action=edit]").addEventListener("click", () => Profiles.showForm(p));
    card.querySelector("[data-action=delete]").addEventListener("click", () => Profiles.remove(p.profile_id, p.name));
    return card;
  },
  // 显示新建或编辑表单
  showForm(profile = null) {
    const panel = document.getElementById("profile-form-panel");
    const title = document.getElementById("profile-form-title");
    panel.hidden = false;
    title.textContent = profile ? "编辑 LLM 配置" : "新建 LLM 配置";
    document.getElementById("edit-profile-id").value = profile ? profile.profile_id : "";
    document.getElementById("pf-name").value = profile ? profile.name : "";
    document.getElementById("pf-model").value = profile ? profile.model : "";
    document.getElementById("pf-base-url").value = profile ? profile.base_url : "";
    document.getElementById("pf-api-key").value = profile ? profile.api_key : "";
    document.getElementById("pf-timeout").value = profile ? profile.timeout_seconds : 30;
    document.getElementById("profile-form-error").textContent = "";
    panel.scrollIntoView({ behavior: "smooth", block: "start" });
  },
  hideForm() {
    document.getElementById("profile-form-panel").hidden = true;
  },
  // 保存（新建 or 更新）
  async save() {
    const id = document.getElementById("edit-profile-id").value;
    const body = {
      name: document.getElementById("pf-name").value.trim(),
      model: document.getElementById("pf-model").value.trim(),
      base_url: document.getElementById("pf-base-url").value.trim(),
      api_key: document.getElementById("pf-api-key").value.trim(),
      timeout_seconds: parseInt(document.getElementById("pf-timeout").value, 10) || 30,
    };
    const errEl = document.getElementById("profile-form-error");
    if (!body.name || !body.model || !body.base_url || !body.api_key) {
      errEl.textContent = "请填写所有必填字段（名称、模型、Base URL、API Key）";
      return;
    }
    try {
      if (id) {
        await API.updateProfile(id, body);
      } else {
        await API.createProfile(body);
      }
      Profiles.hideForm();
      await Profiles.load();
    } catch (err) {
      errEl.textContent = `保存失败：${err.message}`;
    }
  },
  // 删除 Profile
  async remove(profileId, name) {
    if (!confirm(`确认删除配置「${name}」？`)) return;
    try {
      await API.deleteProfile(profileId);
      await Profiles.load();
    } catch (err) {
      alert(`删除失败：${err.message}`);
    }
  },
  // 获取当前已加载的 profiles（供 runner.js 使用）
  getAll() {
    return Profiles._data;
  },
 };
--- a/webapp/static/js/report.js
+++ b/webapp/static/js/report.js
@@ -26,6 +26,7 @@ const Report = {
      Report.renderDistribution(detail.report);
      Report.renderGroupings(detail.report);
      Report.renderLowest(detail.report);
      Report.renderAdvice(detail.summary, detail.report);
      content.style.opacity = "1";
    } catch (err) {
      empty.hidden = false;
@@ -186,8 +187,7 @@ const Report = {
  },
  // ④ 最低分样本逐条复核表（点击展开）。
-  renderLowest(report) {
+  renderLowest(report) {    const wrap = document.getElementById("lowest-table");
    const wrap = document.getElementById("lowest-table");
    const samples = report.lowest_samples || [];
    wrap.innerHTML = "";
    if (samples.length === 0) {
@@ -255,4 +255,35 @@ const Report = {
      </div>
    `;
  },
  // ⑤ 优化建议（仅 optimization_advice.md 存在时渲染）。
  renderAdvice(summary, report) {
    const section = document.getElementById("advice-section");
    const body = document.getElementById("advice-body");
    const modelLabel = document.getElementById("advice-model-label");
    const md = report.advice_markdown || "";
    if (!md.trim()) {
      section.hidden = true;
      return;
    }
    section.hidden = false;
    modelLabel.textContent = summary.judge_model ? `judge: ${summary.judge_model}` : "";
    // 简单 Markdown → HTML 转换（标题、列表、分隔线、粗体）
    const escaped = md
      .replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
    const html = escaped
      .replace(/^#{3}\s+(.+)$/gm, "<h3>$1</h3>")
      .replace(/^#{2}\s+(.+)$/gm, "<h2>$1</h2>")
      .replace(/^#{1}\s+(.+)$/gm, "<h1>$1</h1>")
      .replace(/^---+$/gm, "<hr>")
      .replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>")
      .replace(/^- (.+)$/gm, "<li>$1</li>")
      .replace(/(<li>[^]*?<\/li>\n?)+/g, (m) => `<ul>${m}</ul>`)
      .replace(/\n\n+/g, "\n<br>\n");
    body.innerHTML = `<div class="advice-md">${html}</div>`;
  },
 };
--- a/webapp/static/js/runner.js
+++ b/webapp/static/js/runner.js
@@ -1,17 +1,17 @@
-// runner.js — 新建评估视图：列出场景、触发评估、轮询任务状态与日志。
+// runner.js — 新建评估视图：列出场景、LLM角色配置、触发评估、轮询任务状态与日志。
 const Runner = {
  selectedScenario: null,
  pollTimer: null,
  lastRunId: null,
  // 绑定运行按钮。
  init() {
    document.getElementById("run-btn").addEventListener("click", () => Runner.trigger());
    document.getElementById("view-report-btn").addEventListener("click", () => {
      if (Runner.lastRunId) {
        App.currentRunId = Runner.lastRunId;
        App.enableReportNav();
-        App.switchView("report");
+        App.navigate("report", Runner.lastRunId);
      }
    });
  },
@@ -32,6 +32,27 @@ const Runner = {
    } catch (err) {
      list.innerHTML = `<p class="muted">加载失败：${App.escape(err.message)}</p>`;
    }
    // 同时加载 profiles 供角色选择
    Runner._populateProfileSelects();
  },
  // 填充三个角色下拉框
  async _populateProfileSelects() {
    const cached = Profiles.getAll();
    const profiles = cached.length > 0
      ? cached
      : (await API.profiles().catch(() => ({ profiles: [] }))).profiles;
    ["role-judge", "role-answer", "role-dataset"].forEach(id => {
      const sel = document.getElementById(id);
      sel.innerHTML = '<option value="">— 使用场景原始配置 —</option>';
      profiles.forEach(p => {
        const opt = document.createElement("option");
        opt.value = p.profile_id;
        opt.textContent = `${p.name}  (${p.model})`;
        sel.appendChild(opt);
      });
    });
  },
  // 构造单个场景条目。
@@ -64,12 +85,14 @@ const Runner = {
        Runner.selectedScenario = sc.path;
        document.getElementById("selected-scenario").textContent = sc.path;
        document.getElementById("run-btn").disabled = false;
        // 显示 LLM 角色面板
        document.getElementById("llm-assignment-panel").hidden = false;
      });
    }
    return item;
  },
-  // 触发评估并开始轮询。
+  // 触发评估：先 apply profiles（若选了），再触发任务。
  async trigger() {
    if (!Runner.selectedScenario) return;
    const runBtn = document.getElementById("run-btn");
@@ -85,15 +108,41 @@ const Runner = {
    Runner._setStatus(statusBadge, "queued");
    try {
      // Step 1: apply LLM profiles to YAML if any selected
      await Runner._applyProfilesIfNeeded(logBox);
      // Step 2: trigger evaluation
      const resp = await API.triggerEvaluation(Runner.selectedScenario);
      Runner.poll(resp.task_id);
    } catch (err) {
      Runner._setStatus(statusBadge, "failed");
-      logBox.textContent = `触发失败：${err.message}`;
+      logBox.textContent = (logBox.textContent ? logBox.textContent + "\n" : "") + `触发失败：${err.message}`;
      runBtn.disabled = false;
    }
  },
  // 如果用户选了 profile，就先 apply 写回 YAML
  async _applyProfilesIfNeeded(logBox) {
    const judgeId = document.getElementById("role-judge").value;
    const answerId = document.getElementById("role-answer").value;
    const datasetId = document.getElementById("role-dataset").value;
    if (!judgeId && !answerId && !datasetId) return; // 全空，跳过
    logBox.textContent = "正在将 LLM 配置写入场景文件…\n";
    const body = {
      scenario_path: Runner.selectedScenario,
      judge_profile_id: judgeId || null,
      answer_profile_id: answerId || null,
      dataset_profile_id: datasetId || null,
    };
    const result = await API.applyProfiles(body);
    const fields = (result.patched_fields || []).join(", ");
    logBox.textContent += fields
      ? `✓ 已更新字段：${fields}\n`
      : "（未找到可更新的字段，继续运行）\n";
  },
  // 周期性轮询任务状态，刷新日志与徽标。
  poll(taskId) {
    const logBox = document.getElementById("task-log");
@@ -114,6 +163,7 @@ const Runner = {
          runBtn.disabled = false;
          if (status.status === "completed" && status.run_id) {
            Runner.lastRunId = status.run_id;
            sessionStorage.setItem("rag_run_id", status.run_id);
            reportBtn.hidden = false;
          }
        }
Author	SHA1	Message	Date
wangwei	24956bbf75	更新	2026-06-16 18:12:33 +08:00
wangwei	ca01e44ad2	feat(webapp): add session persistence via URL hash routing + sessionStorage - app.js: hash-based router (#runs / #new / #profiles / #report/{runId}) - navigate() pushes history entries for back/forward support - _restoreSession() reads hash on load and popstate - sessionStorage fallback for same-tab refreshes - run-card highlights selected run (.run-card.selected) - runner.js: use App.navigate() for report redirect; persist lastRunId to sessionStorage - index.html: report nav button starts disabled (enabled on run select/restore) - app.css: .run-card.selected with petrol border + ring Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 17:55:07 +08:00
wangwei	1a2cc534b8	feat(webapp): add optimization advice section to report UI - index.html: add section ⑤ advice block (hidden by default, shown when advice_markdown present) - report.js: add renderAdvice() called in render(), simple Markdown→HTML converter - app.js: add noise_sensitivity / factual_correctness / semantic_similarity to shortMetric map - app.css: add .advice-panel, .advice-badge, .advice-md styles (purple left-border theme) Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 17:26:37 +08:00
wangwei	91c0dab4f9	fix(advisor): fix LLM API call, wire advice_markdown to webapp, update .env.example timeouts - llm_analyzer.py: use llm.langchain_llm.ainvoke() (correct RAGAS 0.4.3 API) - webapp/models.py: add advice_markdown field to ReportData - webapp/services/run_reader.py: add read_advice_markdown() reading optimization_advice.md - webapp/services/report_builder.py: pass advice_markdown into ReportData - .env.example: OPENAI_TIMEOUT_SECONDS 30→180, RAGAS_METRIC_TIMEOUT_SECONDS 45→300 Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 17:12:32 +08:00
wangwei	f5c2dce64a	feat(advisor): add optimization advisor module - rag_eval/advisor/: new package with rules engine, LLM analyzer, writer - rules.py: 7-metric diagnostic rules (warning/critical thresholds, top-3 low samples) - llm_analyzer.py: Chinese optimization report via judge_model, graceful fallback - writer.py: writes optimization_advice.md + log summary - __init__.py: run_advisor() entry point (no-op when optimization_advisor=False) - Scenario.optimization_advisor: new bool field (default False) - ScenarioModel: same field added, loader.py透传 - RunArtifactPaths.advice_md: new path field - factory.py: build_models() now public; build_metric_pipeline() accepts pre-built llm/embeddings - runner.py: lifts llm, passes to pipeline and advisor; calls run_advisor() at end - siemens online YAML: optimization_advisor: true enabled - tests: 9 rules tests + 6 writer tests, all pass - docs: advisor section added to engine-flow.md and architecture.md Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 17:06:19 +08:00
wangwei	d68399d39b	chore: update startup scripts and .env.example for LLM profile feature	2026-06-16 17:03:25 +08:00
wangwei	719c3b4ca4	test: ensure test package structure and all webapp tests pass	2026-06-16 16:27:54 +08:00
wangwei	5b60ed12ea	feat: add LLM role-assignment panel to 新建评估 view	2026-06-16 16:27:00 +08:00
wangwei	dc8baf8662	feat: add LLM配置 management page (profiles view)	2026-06-16 16:25:20 +08:00
wangwei	e329f59139	feat: add yaml_patcher service to apply LLM profiles to scenario YAML	2026-06-16 16:21:19 +08:00
wangwei	b19054bd66	feat: add /api/llm-profiles CRUD router	2026-06-16 16:18:40 +08:00
wangwei	5d09deb420	feat: add ProfileManager service with JSON persistence	2026-06-16 16:14:31 +08:00
wangwei	b98af29449	feat: add LLMProfile pydantic models	2026-06-16 16:10:37 +08:00
wangwei	4173a40d93	feat(scripts): add run_eval.bat / run_eval.ps1 evaluation launcher scripts Both scripts support: - Shortcut args: online (default), offline, or any custom .yaml path - Second arg: log level (DEBUG/INFO/WARNING/ERROR), default INFO - Auto-timestamped log file saved to logs\eval_<date>_<time>.log - Sets PYTHONIOENCODING=utf-8 and PYTHONPATH=. automatically - Friendly error/success banners with log file path Usage: run_eval.bat # online eval run_eval.bat offline DEBUG # offline eval with DEBUG logs .\run_eval.ps1 online DEBUG # PowerShell equivalent Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 11:16:53 +08:00
wangwei	629304aa6d	feat(logging): add structured evaluation logs for metric-level debugging - pipeline.py: log each metric score/timeout/error with sample_id, elapsed time, and score value; log NaN list per sample; progress counter N/total after each sample completes - evaluator.py: log eval start, dataset counts, adapter enrichment progress (per-sample OK/FAIL with elapsed), metric scoring summary, and per-metric NaN rate at end of run - runner.py: _setup_logging() helper writes to stderr + optional file; ragas/httpx/openai noisy loggers throttled to WARNING - main.py: add --log-file and --log-level CLI flags Usage: python main.py --scenario scenarios/online/... --log-file logs/eval.log --log-level DEBUG Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 10:48:41 +08:00