first commit

2026-06-12 14:02:15 +08:00
commit 9cbdc1d95d
69 changed files with 9486 additions and 0 deletions
--- a/.env.example
+++ b/.env.example
@@ -0,0 +1,19 @@
+OPENAI_API_KEY=your-api-key
+OPENAI_BASE_URL=http://6.86.80.4:30080/v1
+RAGAS_JUDGE_MODEL=deepseek-v4-flash
+RAGAS_EMBEDDING_MODEL=text-embedding-v3
+BATCH_SIZE=8
+
+
+# ===== Alibaba Cloud document parsing =====
+ALIBABA_ACCESS_KEY_ID=
+ALIBABA_ACCESS_KEY_SECRET=
+ALIBABA_ENDPOINT=docmind-api.cn-hangzhou.aliyuncs.com
+ALIYUN_PARSE_POLL_INTERVAL_SECONDS=5
+ALIYUN_PARSE_TIMEOUT_SECONDS=900
+ALIYUN_PARSE_LAYOUT_STEP_SIZE=50
+ALIYUN_LLM_ENHANCEMENT=true
+ALIYUN_ENHANCEMENT_MODE=VLM
+DOCUMENT_PARSE_ARTIFACT_PREFIX=artifacts
+PARSER_FAILURE_MODE=fail
+DATASET_GENERATOR_MODEL=qwen3.6-plus
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,21 @@
+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+
+# Virtual environments
+.venv
+
+# Local environment configuration
+.env
+.env.*
+!.env.example
+
+# outputs
+outputs/
+
+# datasets
+datasets/
--- a/.python-version
+++ b/.python-version
@@ -0,0 +1 @@
+3.12
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,29 @@
+# Repository Guidelines
+
+## Project Structure & Module Organization
+`rag_eval/` contains the core platform code: `config/` loads YAML scenarios, `datasets/` normalizes input records, `metrics/` builds the RAGAS pipeline, `execution/` runs evaluations, and `reporting/` writes run artifacts. Use `main.py` as the primary CLI entrypoint. Keep sample integrations in `apps/` and scenario definitions in `scenarios/`. Store source datasets under `datasets/raw/` or `datasets/normalized/`, architecture notes in `docs/`, generated outputs in `outputs/`, and automated checks in `tests/`.
+
+## Build, Test, and Development Commands
+Set up the environment with `uv sync`, then run `Copy-Item .env.example .env` to create a local `.env` and fill in the required OpenAI settings.
+
+Run the standard offline sample with:
+```powershell
+.\.venv\Scripts\python.exe main.py --scenario scenarios/offline/sample-offline.yaml
+```
+
+Run tests with:
+```powershell
+.\.venv\Scripts\python.exe -m unittest discover -s tests
+```
+
+## Coding Style & Naming Conventions
+Target Python 3.12+ and follow PEP 8 with 4-space indentation. Prefer type hints on public functions and keep modules focused on one responsibility. Every Python file should include function comments for each function and concise code comments for non-obvious logic blocks. Use `snake_case` for files, functions, variables, and YAML filenames; use `PascalCase` for classes such as `EvaluationSettings` or `MetricPipeline`. Keep adapters small and return a normalized shape: `answer`, `contexts`, and optional `raw_response`.
+
+## Testing Guidelines
+Tests currently use the standard `unittest` framework. Add new coverage under `tests/` with names matching `test_*.py`, and mirror the production module or workflow being exercised. Favor deterministic tests with mocked external calls; do not rely on live model APIs in CI-style checks. For run-artifact tests, write only to temporary directories under `tests/.tmp/`.
+
+## Commit & Pull Request Guidelines
+The current history starts with short, imperative subjects (`first commit`); continue using concise commit lines such as `Add HTTP adapter validation`. Keep each commit scoped to one logical change. Pull requests should describe the scenario impacted, summarize behavior changes, list test coverage, and include sample output paths or screenshots when reporting artifacts or docs change.
+
+## Configuration & Data Handling
+Do not commit real secrets in `.env`. Treat `outputs/` and generated run directories as disposable artifacts unless a result is intentionally curated for review. When adding datasets, prefer sanitized examples and document any required schema fields in the relevant scenario or README.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,199 @@
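+# RAG Evaluation Platform Skeleton
+
+## 1. Project Positioning
+
+This repository has been refactored from a single-script offline evaluation into an extensible **RAG evaluation platform skeleton**. The core goals are unchanged: unify the offline and online evaluation entry points, keep run artifacts locally, and make comparisons across applications, datasets, and scenarios a standard workflow.
+
+The architectural boundaries come from [docs/rag-eval-architecture.md](/C:/Users/A200477427/Learnings/ragas-template/docs/rag-eval-architecture.md), and the current code implements a first version of the project structure described there.
+
+For a quick look at how a single evaluation run flows through the code, continue with:
+
+- [docs/rag-eval-engine-flow.md](/C:/Users/A200477427/Learnings/ragas-template/docs/rag-eval-engine-flow.md)
+
+## 2. Current Structure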
+
+```text
+.
+├── apps/
+│   └── sample_python/
+├── datasets/
+│   ├── normalized/
+│   └── raw/
+├── scenarios/
+│   └── offline/
+├── rag_eval/
+│   ├── adapters/
+│   ├── config/
+│   ├── datasets/
+│   ├── execution/
+│   ├── metrics/
+│   ├── reporting/
+│   └── shared/
+├── runs/
+├── docs/
+├── tests/
+└── main.py
+```
+
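+Core capabilities implemented so far:
+
+- `YAML` scenario loading and validation
+- Offline dataset loading, normalization, and routing of invalid samples
+- The `PDF -> online dataset draft` dataset build pipeline
+- `ragas` metric pipeline assembly
+- A unified `Evaluator` execution flow
+- Standard local artifact output under `runs/<scenario>/`
+
+## 3. A Simple Offline Dataset Example
+
+The repository already includes a minimal offline sample: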
+
+- dataset: [datasets/normalized/sample_offline_rag_eval.csv](/C:/Users/A200477427/Learnings/ragas-template/datasets/normalized/sample_offline_rag_eval.csv)
+- scenario: [scenarios/offline/sample-offline.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/offline/sample-offline.yaml)
+
+There is also an offline smoke sample paired with the PDF dataset build:
+
+- dataset: [datasets/normalized/sample_pdf_offline_smoke.csv](/C:/Users/A200477427/Learnings/ragas-template/datasets/normalized/sample_pdf_offline_smoke.csv)
+- scenario: [scenarios/offline/sample-pdf-offline-smoke.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/offline/sample-pdf-offline-smoke.yaml)
+
+This dataset is a purely local file with no dependency on any online data source. It contains 3 standard offline samples whose fields are exactly those the platform requires:
+
+- `question`
+- `contexts`
+- `answer`
+- `ground_truth`
+- plus the optional `sample_id / scenario / language / retrieval_config`
+
+## 4. How to Run
+
+First, prepare the environment variables:
+
+```powershell
+Copy-Item .env.example .env
+```
+
+`.env` must set at least:
+
+- `OPENAI_API_KEY`
+- `OPENAI_BASE_URL`
+
+If you already have working OpenAI-compatible models, fill in your existing gateway and key here. The current defaults are:
+
+- `OPENAI_BASE_URL=http://6.86.80.4:30080/v1`
+- `RAGAS_JUDGE_MODEL=deepseek-v4-flash`
+- `RAGAS_EMBEDDING_MODEL=text-embedding-v3`
+
+The unified entry point is the recommended way to run:
+
+```powershell
+uv run main.py --scenario scenarios/offline/sample-offline.yaml
+# or
+.\.venv\Scripts\python.exe main.py --scenario scenarios/offline/sample-offline.yaml
+```
+
+To directly verify the offline smoke evaluation that corresponds to `sample-pdf-build.yaml`, run:
+
+```powershell
+uv run main.py --scenario scenarios/offline/sample-pdf-offline-smoke.yaml
+# or
+.\.venv\Scripts\python.exe main.py --scenario scenarios/offline/sample-pdf-offline-smoke.yaml
+```
+
+When the run finishes, results are written to:
+
+```text
+runs/sample-offline-baseline/<run_id>/
+├── scenario.snapshot.yaml
+├── scores.csv
+├── invalid.csv
+├── summary.md
+└── metadata.json
+```
+
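+## 5. Online Integration Hooks
+
+The skeleton already reserves two first-class adapter types:
+
+- `http`
+- `python`
+
+A local Python adapter example lives in [apps/sample_python/adapter.py](/C:/Users/A200477427/Learnings/ragas-template/apps/sample_python/adapter.py). To plug in your own RAG application later, you only need to adapt its real logic to this shape: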
+
+```python
+def run(question: str, **kwargs) -> dict:
+    return {
+        "answer": "...",
+        "contexts": ["...", "..."],
+        "raw_response": {...},
+    }
+```
+
+Note that [apps/sample_python/adapter.py](/C:/Users/A200477427/Learnings/ragas-template/apps/sample_python/adapter.py) is only a minimal contract example.
+To evaluate the PDF question bank produced by `dataset_build` directly, the repository now provides a dedicated implementation:
+
+- adapter: [apps/pdf_question_bank/adapter.py](/C:/Users/A200477427/Learnings/ragas-template/apps/pdf_question_bank/adapter.py)
+- online scenario: [scenarios/online/sample-pdf-question-bank-online.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/online/sample-pdf-question-bank-online.yaml)
+
+This adapter reads `source_chunk_ids` from each question bank row, resolves the evidence chunks from `source_chunks.jsonl`, passes those chunks directly as `contexts`, and calls the local OpenAI-compatible model to generate the `answer`.
+
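+## 6. Run Artifacts
+
+Every run writes a standard set of local artifacts:
+
+- `scenario.snapshot.yaml`: snapshot of the configuration actually used for this run
+- `scores.csv`: per-sample scoring results
+- `invalid.csv`: invalid samples
+- `summary.md`: summary report
+- `metadata.json`: machine-readable metadata
+
+This means that when you later compare different models, prompts, or retrieval strategies, you no longer need to track parameters by hand.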
+
+## 7. Building a PDF Question Bank
+
+The repository now also supports parsing PDF documents into a draft online question bank that can be reviewed manually. The recommended reading order is:
+
+- Read this section for a quick start
+- Read [docs/sample-pdf-question-bank-workflow.md](/C:/Users/A200477427/Learnings/ragas-template/docs/sample-pdf-question-bank-workflow.md) for the end-to-end example
+
+The sample configuration is at:
+
+- [scenarios/dataset_build/sample-pdf-build.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/dataset_build/sample-pdf-build.yaml)
+
+How to run:
+
+```powershell
+uv run main.py --dataset-build-config scenarios/dataset_build/sample-pdf-build.yaml
+# or
+.\.venv\Scripts\python.exe main.py --dataset-build-config scenarios/dataset_build/sample-pdf-build.yaml
+```
+
+This pipeline will:
+
+- Scan a single PDF or a directory of PDFs
+- Call the Alibaba Cloud parsing service and normalize the result into `source chunks`
+- Call an LLM to generate a draft online evaluation question bank
+- Write timestamped run artifacts plus the stable entry points `latest/source_chunks.jsonl`, `latest/dataset_draft.csv`, and `latest/metadata.json`
+
+By default the generated draft dataset only requires `question` and `ground_truth`; when it later enters `online` evaluation, the adapter fills in `answer` and `contexts`.
+
+This online pipeline now has a working sample:
+
+```powershell
+uv run main.py --scenario scenarios/online/sample-pdf-question-bank-online.yaml
+# or
+.\.venv\Scripts\python.exe main.py --scenario scenarios/online/sample-pdf-question-bank-online.yaml
+```
+
+This scenario will:
+
+- Read [datasets/raw/generated/sample-pdf-question-bank.csv](/C:/Users/A200477427/Learnings/ragas-template/datasets/raw/generated/sample-pdf-question-bank.csv)
+- Use `apps/pdf_question_bank/adapter.py`
+- Bind explicitly to the sample build's stable entry point `outputs/dataset-builds/sample-pdf-question-bank/latest/source_chunks.jsonl`
+- Generate `answer` online
+- Score the generated result against `ground_truth`
+
+To run the sample end to end, from environment variables through dataset build and online eval to inspecting the results, see:
+
+- [docs/sample-pdf-question-bank-workflow.md](/C:/Users/A200477427/Learnings/ragas-template/docs/sample-pdf-question-bank-workflow.md)
+
+If you only want to get offline evaluation running quickly without regenerating the question bank first, use the [scenarios/offline/sample-pdf-offline-smoke.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/offline/sample-pdf-offline-smoke.yaml) scenario mentioned above. It is a smoke dataset frozen from the sample PDF build output: it uses `source chunks` as `contexts` and reuses `ground_truth` as `answer`.
--- a/apps/pdf_question_bank/__init__.py
+++ b/apps/pdf_question_bank/__init__.py
@@ -0,0 +1 @@
+"""Local-document QA adapter package for dataset-build question banks."""
--- a/apps/pdf_question_bank/adapter.py
+++ b/apps/pdf_question_bank/adapter.py
@@ -0,0 +1,161 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+from openai import OpenAI
+
+from rag_eval.settings import EvaluationSettings
+from rag_eval.shared.utils import parse_contexts
+
+
+_CHUNK_CACHE: dict[Path, dict[str, dict[str, Any]]] = {}
+
+
+def _resolve_source_chunks_path(source_chunks_path: str) -> Path:
+    """Resolve the configured chunk artifact path, with fallback for missing latest aliases."""
+    resolved_path = Path(source_chunks_path).resolve()
+    if resolved_path.exists():
+        return resolved_path
+
+    if resolved_path.parent.name != "latest":
+        raise FileNotFoundError(resolved_path)
+
+    artifact_root = resolved_path.parent.parent
+    if not artifact_root.exists():
+        raise FileNotFoundError(resolved_path)
+
+    candidate_runs = sorted(
+        [
+            entry for entry in artifact_root.iterdir()
+            if entry.is_dir() and entry.name != "latest"
+        ],
+        key=lambda path: path.name,
+        reverse=True,
+    )
+    for run_dir in candidate_runs:
+        candidate = run_dir / resolved_path.name
+        if candidate.exists():
+            return candidate
+
+    raise FileNotFoundError(resolved_path)
+
+
+def _load_source_chunks(source_chunks_path: str) -> dict[str, dict[str, Any]]:
+    """Load source chunk rows from JSONL and cache them by absolute file path."""
+    resolved_path = _resolve_source_chunks_path(source_chunks_path)
+    cached = _CHUNK_CACHE.get(resolved_path)
+    if cached is not None:
+        return cached
+
+    chunk_lookup: dict[str, dict[str, Any]] = {}
+    with resolved_path.open(encoding="utf-8") as handle:
+        for line_number, line in enumerate(handle, start=1):
+            text = line.strip()
+            if not text:
+                continue
+            payload = json.loads(text)
+            chunk_id = str(payload.get("chunk_id", "")).strip()
+            if not chunk_id:
+                raise ValueError(
+                    f"source_chunks.jsonl row {line_number} is missing chunk_id: {resolved_path}"
+                )
+            chunk_lookup[chunk_id] = payload
+
+    _CHUNK_CACHE[resolved_path] = chunk_lookup
+    return chunk_lookup
+
+
+def _resolve_chunk_ids(raw_chunk_ids: Any) -> list[str]:
+    """Parse the serialized source chunk id column into a non-empty list."""
+    chunk_ids = parse_contexts(raw_chunk_ids)
+    normalized = [chunk_id for chunk_id in chunk_ids if chunk_id]
+    if not normalized:
+        raise ValueError("source_chunk_ids is required for pdf question bank samples.")
+    return normalized
+
+
+def _build_messages(question: str, contexts: list[str], metadata: dict[str, Any]) -> list[dict[str, str]]:
+    """Construct an evidence-grounded prompt for answer generation."""
+    evidence_lines = [
+        f"[chunk {index}] {context}"
+        for index, context in enumerate(contexts, start=1)
+    ]
+    metadata_lines = [
+        f"doc_id: {metadata.get('doc_id', '')}",
+        f"doc_name: {metadata.get('doc_name', '')}",
+        f"section_path: {metadata.get('section_path', '')}",
+    ]
+    system_prompt = (
+        "You answer questions only from the provided evidence chunks. "
+        "Do not use outside knowledge. If the evidence is insufficient, say so plainly. "
+        "Do not invent missing facts, citations, steps, or numbers."
+    )
+    user_prompt = "\n".join(
+        [
+            "Question:",
+            question,
+            "",
+            "Sample metadata:",
+            *metadata_lines,
+            "",
+            "Evidence chunks:",
+            *evidence_lines,
+            "",
+            "Return a concise answer grounded only in the evidence above.",
+        ]
+    )
+    return [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+
+
+def run(
+    question: str,
+    *,
+    source_chunks_path: str,
+    model: str | None = None,
+    client: OpenAI | None = None,
+    **kwargs: Any,
+) -> dict[str, Any]:
+    """Answer one question by resolving cited chunks and querying an OpenAI-compatible model."""
+    chunk_ids = _resolve_chunk_ids(kwargs.get("source_chunk_ids"))
+    chunk_lookup = _load_source_chunks(source_chunks_path)
+
+    missing_ids = [chunk_id for chunk_id in chunk_ids if chunk_id not in chunk_lookup]
+    if missing_ids:
+        raise ValueError(
+            "source_chunk_ids not found in source chunks artifact: " + ", ".join(missing_ids)
+        )
+
+    resolved_chunks = [chunk_lookup[chunk_id] for chunk_id in chunk_ids]
+    contexts = [str(chunk.get("text", "")).strip() for chunk in resolved_chunks if str(chunk.get("text", "")).strip()]
+    if not contexts:
+        raise ValueError("resolved source chunks did not contain usable text contexts.")
+
+    settings = EvaluationSettings()
+    target_model = (model or settings.ragas_judge_model).strip()
+    if not target_model:
+        raise ValueError("A model name is required for pdf question bank adapter.")
+
+    llm_client = client or OpenAI(**settings.openai_client_kwargs)
+    completion = llm_client.chat.completions.create(
+        model=target_model,
+        messages=_build_messages(question, contexts, kwargs),
+        temperature=0,
+    )
+    answer = str(completion.choices[0].message.content or "").strip()
+
+    return {
+        "answer": answer,
+        "contexts": contexts,
+        "raw_response": {
+            "resolved_chunk_ids": chunk_ids,
+            "resolved_doc_id": kwargs.get("doc_id", ""),
+            "resolved_doc_name": kwargs.get("doc_name", ""),
+            "model": target_model,
+            "response_text": answer,
+        },
+    }
--- a/apps/sample_python/README.md
+++ b/apps/sample_python/README.md
@@ -0,0 +1,12 @@
+# Sample Python Adapter
+
+This directory shows the minimal shape expected by the `python` app adapter:
+
+```python
+def run(question: str, **kwargs) -> dict:
+    return {
+        "answer": "...",
+        "contexts": ["...", "..."],
+        "raw_response": {...},
+    }
+```
--- a/apps/sample_python/adapter.py
+++ b/apps/sample_python/adapter.py
@@ -0,0 +1,14 @@
+from __future__ import annotations
+
+
+def run(question: str, **kwargs) -> dict:
+    answer = f"Sample adapter answer for: {question}"
+    contexts = [
+        "This is a local Python adapter example.",
+        "Replace this with your real retrieval and generation logic.",
+    ]
+    return {
+        "answer": answer,
+        "contexts": contexts,
+        "raw_response": {"question": question, "kwargs": kwargs},
+    }
--- a/docs/models.json
+++ b/docs/models.json
@@ -0,0 +1,168 @@
+{
+  "data": [
+    {
+      "id": "text-embedding-v3",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3-vl-embedding",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3-vl-plus",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3-omni-flash",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "deepseek-v3.2",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "glm-5",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "zhipu_4v",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3.5-plus",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "glm-5.1",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "kimi-k2.6",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "deepseek-v4-pro",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "deepseek",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3.6-flash",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3.5-flash",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3-vl-flash",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3.6-max-preview",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "text-embedding-v4",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "kimi-k2.5",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "moonshot",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "qwen3.6-plus",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "custom",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    },
+    {
+      "id": "deepseek-v4-flash",
+      "object": "model",
+      "created": 1626777600,
+      "owned_by": "deepseek",
+      "supported_endpoint_types": [
+        "openai"
+      ]
+    }
+  ],
+  "object": "list",
+  "success": true
+}
--- a/docs/pdf-to-dataset-requirement-design.md
+++ b/docs/pdf-to-dataset-requirement-design.md
@@ -0,0 +1,621 @@
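+# Requirement Design: Converting PDF Documents into an Online Evaluation Question Bank
+
+## 1. Background and Problem
+
+The repository already has a complete main evaluation pipeline:
+
+- `main.py --scenario ...`
+- Load the dataset
+- Normalize samples
+- Call the adapter
+- Score with `ragas`
+- Write local run artifacts
+
+However, this pipeline assumes that the evaluation data already exists. In practice, evaluation samples for a RAG project usually originate from large volumes of raw documents, such as PDF specifications, policies, manuals, regulations, or technical notes. Without a stable "document -> question bank dataset" generation pipeline, the platform remains stuck at manual data preparation.
+
+The external sample project you provided has already validated two key capabilities:
+
+- PDFs can be parsed asynchronously with the Alibaba Cloud document parsing service
+- The parsed result can be normalized into structure nodes, semantic blocks, and traceable chunks
+
+Therefore, the goal of this requirement design is not to build a complete knowledge base system, but to add a minimal viable data production pipeline to the current evaluation platform so that raw PDFs can be turned into a reviewable online evaluation question bank.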
+
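+## 2. Goals
+
+The V1 goals of this requirement design are:
+
+- Support batch generation of an online evaluation question bank from a single PDF or a directory of PDFs
+- Base parsing on the Alibaba Cloud document parsing service
+- Generate questions within single-document boundaries, with no cross-document mixing
+- Output a draft dataset that plugs directly into the current `online` evaluation mode
+- Preserve the evidence chain (page numbers, sections, source chunks) in the output to support manual review
+- Reuse the repository's local-files-first principle and introduce no database dependency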
+
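+## 3. Non-Goals
+
+V1 explicitly does not cover the following:
+
+- No vector store ingestion
+- No document upload Web UI
+- No multi-tenant document center
+- No cross-document synthesis questions
+- No multi-format input such as Office documents, image bundles, or web scraping
+- No automatic generation of a final gold dataset
+- No output of the `answer / contexts` fields that offline evaluation requires
+- No reuse of domain models from the external project that are directly coupled to vectorized storage or knowledge base writes
+
+In other words, the V1 goal is to **first turn PDF documents into a "manually reviewable draft online evaluation question bank"**, not to evolve into a complete RAG data factory in one step.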
+
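+## 4. User Value
+
+With this pipeline in place, the platform can support the following workflow:
+
+1. The user prepares a batch of PDF documents
+2. The system calls the Alibaba Cloud document parsing service to obtain structured layout results
+3. The system normalizes the layout results into traceable chunks
+4. The system uses an LLM to generate draft questions and reference answers from the chunks
+5. The user manually reviews the exported dataset
+6. The user plugs the reviewed dataset directly into an existing `online` evaluation scenario
+
+The platform then provides all of the following:
+
+- Data production
+- Data review
+- Online evaluation integration
+- Retention of result artifacts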
+
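+## 5. V1 Core Decisions
+
+This design fixes the following product decisions:
+
+- Input document scope: `PDF` only
+- Question generation scope: `single-document questions` only
+- Target dataset type: `online question bank`
+- Release mode: `generate a draft first, then review manually`
+- Parsing service: Alibaba Cloud document parsing
+- Default policy on parse failure: `fail`
+
+These decisions are not left to ad hoc judgment during implementation.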
+
+## 6. Target Architecture Overview
+
+A new "dataset build" pipeline is added alongside the existing evaluation architecture:
+
+```text
+PDF files
+  -> dataset build config
+  -> aliyun parser gateway
+  -> layout normalization
+  -> source chunks
+  -> LLM question generation
+  -> draft online dataset
+  -> review
+  -> existing online evaluation flow
+```
+
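+This pipeline relates to the existing evaluation pipeline as follows:
+
+- `dataset build` is responsible for "how evaluation input is produced"
+- The existing `rag_eval/execution/` is responsible for "how produced evaluation input is scored"
+
+The two responsibilities stay separate and do not leak into each other.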
+
+## 7. Module Boundary Design
+
+### 7.1 CLI Entry Points
+
+The existing evaluation entry point stays unchanged:
+
+```powershell
+python main.py --scenario scenarios/offline/sample-offline.yaml
+```
+
+A new entry point is added for building a dataset:
+
+```powershell
+python main.py --dataset-build-config scenarios/dataset_build/sample-pdf-build.yaml
+```
+
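+The two entry points are mutually exclusive, so a single command never takes on both "build the question bank" and "run the evaluation".
+
+### 7.2 New Module Directory
+
+A new main package is added: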
+
+```text
+rag_eval/
+  dataset_builder/
+    __init__.py
+    models.py
+    schema.py
+    runner.py
+    writers.py
+    sources.py
+    parser/
+      __init__.py
+      aliyun_docmind_gateway.py
+      aliyun_document_parser.py
+      aliyun_layout_normalizer.py
+    generator/
+      __init__.py
+      question_generator.py
+      validators.py
+```
+
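+Responsibilities are divided as follows:
+
+- `schema.py`: validates the dataset build YAML
+- `models.py`: defines internal models such as the job, parse result, source chunk, and generated sample
+- `runner.py`: orchestrates one complete build job
+- `writers.py`: writes the dataset and local artifacts
+- `sources.py`: discovers input PDF files
+- `parser/`: adapts the Alibaba Cloud document parsing capability
+- `generator/`: calls the LLM to generate draft questions and validates the output
+
+### 7.3 Reuse Strategy for the External Sample Code
+
+The following may be referenced and migrated from the external project:
+
+- `aliyun_docmind_gateway.py`
+- `aliyun_document_parser.py`
+- `aliyun_layout_normalizer.py`
+
+Reuse is limited to:
+
+- Submitting and polling asynchronous parse tasks
+- Fetching layouts
+- Extracting structure nodes
+- Merging semantic blocks
+- Building traceable chunks
+
+The following responsibilities are not migrated:
+
+- Vector store ingestion
+- Embedding persistence
+- The external knowledge base's chunk domain model
+- The knowledge base indexing flow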
+
+## 8. Configuration Design
+
+### 8.1 New YAML Type
+
+A new kind of configuration file is added, for example:
+
+```text
+scenarios/
+  dataset_build/
+    sample-pdf-build.yaml
+```
+
+### 8.2 Configuration Fields
+
+The V1 dataset build YAML structure is fixed as follows:
+
+```yaml
+job_name: legal-pdf-question-bank
+input:
+  path: ../../datasets/raw/pdfs
+  glob: "*.pdf"
+parser:
+  provider: aliyun_docmind
+  failure_mode: fail
+generation:
+  model: qwen-long
+  output_type: online_question_bank
+  review_mode: draft_with_manual_review
+  max_questions_per_document: 10
+  max_source_chunks_per_question: 3
+output:
+  dataset_path: ../../datasets/raw/generated/legal_question_bank.csv
+  artifact_dir: ../../outputs/dataset-builds/legal-pdf-question-bank
+runtime:
+  max_documents: 20
+```
+
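+### 8.3 Field Constraints
+
+- `job_name`: required; the name of this build job
+- `input.path`: required; a single file or a directory
+- `input.glob`: optional; defaults to `*.pdf`
+- `parser.provider`: fixed to `aliyun_docmind` in V1
+- `parser.failure_mode`: `fail | skip`; defaults to `fail`
+- `generation.model`: optional; overrides the default generation model
+- `generation.output_type`: fixed to `online_question_bank` in V1
+- `generation.review_mode`: fixed to `draft_with_manual_review` in V1
+- `generation.max_questions_per_document`: positive integer; defaults to `10`
+- `generation.max_source_chunks_per_question`: positive integer; defaults to `3`
+- `output.dataset_path`: required; output path of the final dataset
+- `output.artifact_dir`: required; root directory for run artifacts
+- `runtime.max_documents`: optional; limits the number of documents processed in one run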
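Editorial sketch for the field constraints above: a minimal validator that applies the documented defaults and rejects out-of-range values. It is not the repository's `schema.py`; `normalize_build_config` and `_positive_int` are hypothetical names, the input is assumed to be the dict obtained after loading the YAML, and treating an omitted `output_type` or `review_mode` as its fixed V1 value is an assumption.

```python
from typing import Any


def _positive_int(value: Any, name: str) -> int:
    """Return value unchanged if it is a positive int, otherwise raise."""
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return value


def normalize_build_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Apply the documented defaults and reject values outside the V1 constraints."""
    job_name = str(raw.get("job_name") or "").strip()
    if not job_name:
        raise ValueError("job_name is required")

    input_cfg = raw.get("input") or {}
    parser_cfg = raw.get("parser") or {}
    gen_cfg = raw.get("generation") or {}
    output_cfg = raw.get("output") or {}
    runtime_cfg = raw.get("runtime") or {}

    if not input_cfg.get("path"):
        raise ValueError("input.path is required")
    for key in ("dataset_path", "artifact_dir"):
        if not output_cfg.get(key):
            raise ValueError(f"output.{key} is required")

    # Fields that V1 pins to a single value or a small enum: (value, allowed).
    fixed = {
        "parser.provider": (parser_cfg.get("provider", "aliyun_docmind"), ("aliyun_docmind",)),
        "parser.failure_mode": (parser_cfg.get("failure_mode", "fail"), ("fail", "skip")),
        "generation.output_type": (
            gen_cfg.get("output_type", "online_question_bank"),
            ("online_question_bank",),
        ),
        "generation.review_mode": (
            gen_cfg.get("review_mode", "draft_with_manual_review"),
            ("draft_with_manual_review",),
        ),
    }
    for name, (value, allowed) in fixed.items():
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")

    max_documents = runtime_cfg.get("max_documents")
    return {
        "job_name": job_name,
        "input": {"path": input_cfg["path"], "glob": input_cfg.get("glob", "*.pdf")},
        "parser": {
            "provider": fixed["parser.provider"][0],
            "failure_mode": fixed["parser.failure_mode"][0],
        },
        "generation": {
            "model": gen_cfg.get("model"),  # None means: fall back to env or code default
            "output_type": fixed["generation.output_type"][0],
            "review_mode": fixed["generation.review_mode"][0],
            "max_questions_per_document": _positive_int(
                gen_cfg.get("max_questions_per_document", 10),
                "generation.max_questions_per_document",
            ),
            "max_source_chunks_per_question": _positive_int(
                gen_cfg.get("max_source_chunks_per_question", 3),
                "generation.max_source_chunks_per_question",
            ),
        },
        "output": {
            "dataset_path": output_cfg["dataset_path"],
            "artifact_dir": output_cfg["artifact_dir"],
        },
        "runtime": {
            "max_documents": (
                None if max_documents is None
                else _positive_int(max_documents, "runtime.max_documents")
            )
        },
    }
```

Returning a fully populated dict keeps later stages free of default handling; a real implementation would more likely map this onto the `DatasetBuildJob` model.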
+
+## 9. Environment Variable Design
+
+### 9.1 Alibaba Cloud Parsing Configuration
+
+`rag_eval/settings.py` adds reads for the following environment variables:
+
+- `ALIBABA_ACCESS_KEY_ID`
+- `ALIBABA_ACCESS_KEY_SECRET`
+- `ALIBABA_ENDPOINT`
+- `ALIYUN_PARSE_POLL_INTERVAL_SECONDS`
+- `ALIYUN_PARSE_TIMEOUT_SECONDS`
+- `ALIYUN_PARSE_LAYOUT_STEP_SIZE`
+- `ALIYUN_LLM_ENHANCEMENT`
+- `ALIYUN_ENHANCEMENT_MODE`
+- `DOCUMENT_PARSE_ARTIFACT_PREFIX`
+- `PARSER_FAILURE_MODE`
+
+### 9.2 Question Bank Generation Model Configuration
+
+A new environment variable is added:
+
+- `DATASET_GENERATOR_MODEL`
+
+The default precedence is:
+
+1. `generation.model` in the dataset build YAML
+2. `DATASET_GENERATOR_MODEL` in `.env`
+3. The code default
+
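+### 9.3 Secret Management Requirements
+
+The design documents reference environment variable names only and never record a plaintext AK/SK in the repository. The keys already exposed in the conversation must be rotated separately; this is a security action that has to happen before implementation.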
+
+## 10. Core Data Model Design
+
+### 10.1 `DatasetBuildJob`
+
+Represents one PDF -> dataset generation job.
+
+Core fields:
+
+- `job_name`
+- `input_path`
+- `input_glob`
+- `parser_provider`
+- `failure_mode`
+- `generation_model`
+- `output_type`
+- `review_mode`
+- `dataset_path`
+- `artifact_dir`
+- `runtime`
+
+### 10.2 `ParsedDocument`
+
+Represents one PDF after parsing and normalization.
+
+Core fields:
+
+- `doc_id`
+- `doc_name`
+- `raw_text`
+- `structure_nodes`
+- `semantic_blocks`
+- `source_chunks`
+- `metadata`
+
+### 10.3 `SourceChunk`
+
+`SourceChunk` is the most important evidence unit in V1. It is used to generate questions and to support manual review.
+
+The fields are fixed as:
+
+- `chunk_id`
+- `doc_id`
+- `doc_name`
+- `text`
+- `page_start`
+- `page_end`
+- `section_path`
+- `section_title`
+- `source_layout_ids`
+
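+Design principles:
+
+- Every chunk must be traceable back to its source page numbers
+- Every chunk must be traceable back to its section path
+- Every chunk must be traceable back to its original layout ids
+- Chunks serve only question bank generation and evidence tracing; they carry no vectorization responsibility
+
+### 10.4 `DraftQuestionSample`
+
+Represents one draft online evaluation sample awaiting review.
+
+The fields are fixed as: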
+
+- `sample_id`
+- `question`
+- `ground_truth`
+- `scenario`
+- `language`
+- `doc_id`
+- `doc_name`
+- `section_path`
+- `page_start`
+- `page_end`
+- `source_chunk_ids`
+- `question_type`
+- `difficulty`
+- `review_status`
+- `review_notes`
+
+### 10.5 Enum Constraints
+
+- `review_status`: `draft | approved | rejected | needs_edit`
+- `question_type`: `fact | summary | procedure | comparison`
+- `difficulty`: `easy | medium | hard`
+
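+## 11. Document Parsing Design
+
+### 11.1 Parsing Input Scope
+
+V1 accepts only:
+
+- A single `.pdf` file
+- Or a directory containing multiple `.pdf` files
+
+In directory mode, files are scanned with `input.glob`, which defaults to `*.pdf`.
+
+### 11.2 Parsing Flow
+
+Each PDF is processed in these fixed steps:
+
+1. Discover the file
+2. Create the Alibaba Cloud Docmind client
+3. Submit an asynchronous parse task
+4. Poll until success, failure, or timeout
+5. Fetch the full layout page by page
+6. Normalize into structure nodes, semantic blocks, and source chunks
+7. Write intermediate artifacts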
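Editorial sketch of step 4 (poll until success, failure, or timeout). The status strings `"success"`, `"fail"`, and `"processing"` are placeholders rather than confirmed Docmind API values, and `poll_until_done` is a hypothetical helper; the defaults mirror `ALIYUN_PARSE_POLL_INTERVAL_SECONDS=5` and `ALIYUN_PARSE_TIMEOUT_SECONDS=900` from `.env.example`.

```python
import time
from collections.abc import Callable

TERMINAL_STATUSES = ("success", "fail")  # placeholder values, not the real API enum


def poll_until_done(
    get_status: Callable[[], str],
    *,
    interval_seconds: float = 5.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status repeatedly until it reports a terminal status or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"parse task still {status!r} after {timeout_seconds}s")
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` lets the timeout path be tested with mocked responses and no real waiting, in line with the mocked-parser tests called for in section 16.2.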
+
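+### 11.3 Layout Normalization Rules
+
+The following core rules are carried over from the external sample code:
+
+- Detect heading levels
+- Skip table-of-contents pages
+- Merge consecutive paragraph text
+- Extract tables as searchable plain text
+- Keep caption-like text
+- Split long text into fixed-size windows
+- Inject section header and page traceability information into every chunk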
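Editorial sketch of the last two rules (fixed-window splitting plus header and page injection). The window size of 500 characters, the helper names, and the choice to give every piece the page range of its parent block are all assumptions; the external normalizer may differ.

```python
from typing import Any


def split_fixed_window(text: str, window: int) -> list[str]:
    """Split text into consecutive pieces of at most `window` characters."""
    if window <= 0:
        raise ValueError("window must be positive")
    cleaned = text.strip()
    return [cleaned[start:start + window] for start in range(0, len(cleaned), window)]


def build_chunks(
    text: str,
    *,
    section_path: str,
    page_start: int,
    page_end: int,
    window: int = 500,
) -> list[dict[str, Any]]:
    """Turn one semantic block into chunks that carry their section and page range."""
    chunks: list[dict[str, Any]] = []
    for piece in split_fixed_window(text, window):
        chunks.append(
            {
                # Prefix the section path so the chunk stays readable out of context.
                "text": f"{section_path}\n{piece}" if section_path else piece,
                "section_path": section_path,
                # Coarse tracing: each piece inherits the whole block's page range.
                "page_start": page_start,
                "page_end": page_end,
            }
        )
    return chunks
```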
+
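+### 11.4 Error Handling
+
+Two failure modes are supported:
+
+- `fail`: if any document fails to parse, the whole job fails
+- `skip`: record the failed document and continue with the rest
+
+The V1 default is `fail`.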
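Editorial sketch of the two failure modes. `parse_documents` is a hypothetical helper and `parse_one` stands in for whatever callable wraps the Alibaba Cloud parser; the shape of the failure records is an assumption.

```python
from collections.abc import Callable, Iterable
from typing import Any


def parse_documents(
    paths: Iterable[str],
    parse_one: Callable[[str], Any],
    failure_mode: str = "fail",
) -> tuple[list[Any], list[dict[str, str]]]:
    """Parse every path, either aborting on the first error or recording it."""
    if failure_mode not in ("fail", "skip"):
        raise ValueError(f"unsupported failure_mode: {failure_mode!r}")

    parsed: list[Any] = []
    failures: list[dict[str, str]] = []
    for path in paths:
        try:
            parsed.append(parse_one(path))
        except Exception as exc:
            if failure_mode == "fail":
                raise  # `fail`: one bad document fails the whole job
            # `skip`: remember the failure and keep going
            failures.append({"path": str(path), "error": str(exc)})
    return parsed, failures
```

The returned `failures` list is the kind of data that would feed `parse_failures.csv`.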
+
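+## 12. Question Bank Generation Design
+
+### 12.1 Generation Unit
+
+The generation unit is fixed as "a group of section-aware source chunks within a single document".
+
+Constraints:
+
+- A question may only reference a single `doc_id`
+- A question may reference at most `3` chunks
+- Mixing evidence across documents is not allowed
+
+### 12.2 Generation Output
+
+Every candidate question must produce: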
+
+- `question`
+- `ground_truth`
+- `source_chunk_ids`
+- `question_type`
+- `difficulty`
+
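+### 12.3 Quantity Control
+
+V1 defaults:
+
+- At most `10` questions per document
+- At most `1` question per chunk group
+
+The implementation must balance coverage and diversity so that questions do not all cluster in the opening sections of a document.
+
+### 12.4 Review Mode
+
+V1 does not publish a final dataset automatically; it only outputs a draft.
+
+Draft rules:
+
+- `review_status` is always initialized to `draft`
+- `review_notes` is initially empty
+- Reviewers may revise the question, the answer, and the review status directly in the CSV
+
+### 12.5 Automatic Validation
+
+A candidate question must pass the following checks before entering the final dataset:
+
+- `question` is non-empty
+- `ground_truth` is non-empty
+- `source_chunk_ids` is non-empty
+- Every referenced chunk actually exists
+- All referenced chunks come from the same document
+- `question_type` and `difficulty` fall within the allowed enums
+
+Candidates that fail automatic validation do not enter the final draft CSV.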
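Editorial sketch of the checks above. `validate_candidate` is a hypothetical helper; it assumes `source_chunk_ids` is already a list of strings and that `chunk_lookup` maps `chunk_id` to a dict containing `doc_id`. The enum values come from section 10.5.

```python
from typing import Any

ALLOWED_QUESTION_TYPES = ("fact", "summary", "procedure", "comparison")
ALLOWED_DIFFICULTIES = ("easy", "medium", "hard")


def validate_candidate(
    candidate: dict[str, Any],
    chunk_lookup: dict[str, dict[str, Any]],
) -> list[str]:
    """Return a list of validation errors; an empty list means the candidate passes."""
    errors: list[str] = []
    if not str(candidate.get("question") or "").strip():
        errors.append("question is empty")
    if not str(candidate.get("ground_truth") or "").strip():
        errors.append("ground_truth is empty")

    chunk_ids = list(candidate.get("source_chunk_ids") or [])
    if not chunk_ids:
        errors.append("source_chunk_ids is empty")
    missing = [chunk_id for chunk_id in chunk_ids if chunk_id not in chunk_lookup]
    if missing:
        errors.append("unknown chunk ids: " + ", ".join(missing))

    # Only chunks that exist can be checked for the single-document rule.
    doc_ids = {
        chunk_lookup[chunk_id].get("doc_id")
        for chunk_id in chunk_ids
        if chunk_id in chunk_lookup
    }
    if len(doc_ids) > 1:
        errors.append("source chunks span multiple documents")

    if candidate.get("question_type") not in ALLOWED_QUESTION_TYPES:
        errors.append("question_type is not an allowed value")
    if candidate.get("difficulty") not in ALLOWED_DIFFICULTIES:
        errors.append("difficulty is not an allowed value")
    return errors
```

Collecting all errors instead of stopping at the first makes rejected candidates easier to diagnose when they are logged.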
+
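+### 12.6 Deduplication Rules
+
+Deduplication within a single document works as follows:
+
+- Questions whose normalized text is identical are deduplicated
+- Among candidates that reference exactly the same chunks and have semantically similar reference answers, only one is kept
+
+The V1 deduplication goal is to control obvious duplicates, not to pursue complex clustering algorithms.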
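Editorial sketch of the first deduplication rule only (identical normalized question text within one document). The normalization used here (NFKC, case folding, whitespace collapsing) is an assumption, as the design does not define it; the second rule needs a similarity measure and is not covered.

```python
import re
import unicodedata
from typing import Any


def normalize_question(text: str) -> str:
    """Fold width and case differences and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def dedupe_by_question(candidates: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first candidate for each (doc_id, normalized question) pair."""
    seen: set[tuple[str, str]] = set()
    kept: list[dict[str, Any]] = []
    for candidate in candidates:
        key = (
            str(candidate.get("doc_id", "")),
            normalize_question(str(candidate.get("question", ""))),
        )
        if key in seen:
            continue
        seen.add(key)
        kept.append(candidate)
    return kept
```

Keying on `doc_id` as well keeps the rule scoped to a single document, so the same wording in two different PDFs is preserved.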
+
+## 13. Output Artifact Design
+
+### 13.1 Dataset Output
+
+By default the final dataset is written to:
+
+```text
+datasets/raw/generated/<job_name>.csv
+```
+
+This can be overridden by `output.dataset_path` in the YAML.
+
+### 13.2 Run Artifact Directory
+
+The run artifacts of each build job are written to:
+
+```text
+outputs/dataset-builds/<job_name>/<run_id>/
+```
+
+### 13.3 Required Artifacts
+
+Every run writes at least the following files:
+
+- `documents.jsonl`
+- `semantic_blocks.jsonl`
+- `source_chunks.jsonl`
+- `dataset_draft.csv`
+- `parse_failures.csv`
+- `metadata.json`
+
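+Their meanings are:
+
+- `documents.jsonl`: per-document parse summary
+- `semantic_blocks.jsonl`: per-semantic-block intermediate results
+- `source_chunks.jsonl`: per-chunk evidence results
+- `dataset_draft.csv`: the generated question bank draft
+- `parse_failures.csv`: list of failed documents
+- `metadata.json`: run metadata, configuration snapshot, and statistics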
+
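+## 14. Compatibility Fix for the Existing Evaluation Pipeline
+
+### 14.1 Current Problem
+
+The repository's design documents already state that `online` mode usually needs only: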
+
+- `question`
+- `ground_truth`
+
+The adapter then fills in the following at evaluation time:
+
+- `answer`
+- `contexts`
+
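+However, the current `rag_eval/datasets/normalizers.py` still treats `contexts / answer / ground_truth` uniformly as required. This is inconsistent with the documented target architecture and directly blocks the online question bank generated by this design from being used.
+
+### 14.2 Fix Principle
+
+The implementation must change dataset validation to branch by mode:
+
+- `offline` mode requires `question / contexts / answer / ground_truth`
+- `online` mode requires `question / ground_truth`
+- `online` mode allows `contexts / answer` to be empty in the initial dataset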
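Editorial sketch of mode-based validation. It is not the actual `rag_eval/datasets/normalizers.py`; `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names, and treating blank strings and empty lists as missing is an assumption.

```python
from typing import Any

REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "offline": ("question", "contexts", "answer", "ground_truth"),
    "online": ("question", "ground_truth"),
}


def _is_blank(value: Any) -> bool:
    """Treat None, whitespace-only strings, and empty sequences as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple)):
        return len(value) == 0
    return False


def missing_required_fields(row: dict[str, Any], mode: str) -> list[str]:
    """Return the required fields that are absent or blank for the given mode."""
    if mode not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported mode: {mode!r}")
    return [field for field in REQUIRED_FIELDS[mode] if _is_blank(row.get(field))]
```

With this split, a draft row holding only `question` and `ground_truth` passes in `online` mode and is flagged in `offline` mode.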
+
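+### 14.3 Design Impact
+
+This fix is not an optional optimization; it is a precondition for this requirement to work at all. Without it, the generated online question bank cannot enter the current main evaluation flow.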
+
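+## 15. Flow Design
+
+A complete dataset build job runs as follows:
+
+1. Read `--dataset-build-config`
+2. Validate the YAML and create a `DatasetBuildJob`
+3. Scan the input PDFs
+4. Process each document sequentially or with bounded concurrency
+5. Call Alibaba Cloud document parsing
+6. Normalize the layout and create a `ParsedDocument`
+7. Extract `SourceChunk` records
+8. Call the LLM on the `SourceChunk` records to generate the question bank draft
+9. Run structural validation and deduplication on the candidate questions
+10. Write `dataset_draft.csv`
+11. Write intermediate artifacts and the failure list
+12. After manual review, use the reviewed version as input to `online` evaluation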
+
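+## 16. Test Design
+
+### 16.1 Configuration Tests
+
+Cover the following:
+
+- `--scenario` and `--dataset-build-config` are mutually exclusive
+- Missing required fields
+- Invalid enum values
+- Input path does not exist
+- Input directory contains no PDFs
+
+### 16.2 Parsing Tests
+
+Use mocked Alibaba Cloud responses to cover:
+
+- Successful submission
+- Successful status polling
+- Status polling timeout
+- Task failure
+- Empty layouts returned
+
+Also cover the layout normalization rules:
+
+- Skipping table-of-contents pages
+- Heading level inheritance
+- Table flattening
+- Caption extraction
+- Long text chunking
+
+### 16.3 Question Bank Generation Tests
+
+Use mocked LLM output to cover:
+
+- Normal structured generation
+- Empty question
+- Missing ground truth
+- Reference to a nonexistent chunk
+- Cross-document reference
+- Deduplication of repeated questions
+
+### 16.4 End-to-End Tests
+
+At least one end-to-end flow test with a mocked parser and a mocked generator is required, verifying:
+
+- Single PDF input
+- Multiple PDF input
+- `fail` mode
+- `skip` mode
+- All artifacts are written successfully
+
+### 16.5 Evaluation Regression Tests
+
+Add tests to ensure that:
+
+- An online question bank containing only `question / ground_truth / metadata` can be loaded
+- After the adapter fills in `answer / contexts`, the existing evaluator can still run all metrics to completion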
+
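+## 17. Suggested Implementation Order
+
+To reduce risk, the implementation should proceed in this order:
+
+1. Extend `main.py` and the configuration layer to add the dataset build command entry point
+2. Extend `settings.py` and the dependencies to wire in the Alibaba Cloud parsing configuration
+3. Migrate the parser gateway and the layout normalizer
+4. Implement the `dataset_builder` models, runner, and writers
+5. Implement LLM question bank generation and output validation
+6. Fix the existing online dataset validation logic
+7. Add tests, a sample YAML, and documentation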
+
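+## 18. Conclusion
+
+This requirement design fixes a clear, implementable V1 scope:
+
+- Parse PDFs with Alibaba Cloud
+- Turn the parse results into traceable source chunks
+- Use an LLM to generate a draft online evaluation question bank from single-document content
+- Rely on manual review to guarantee final quality
+- Plug the reviewed question bank directly into the existing online evaluation flow
+
+The design deliberately narrows the input formats, the question type boundaries, and the depth of automation. The aim is not caution for its own sake, but to make sure the whole pipeline closes the loop within the current repository architecture and leaves no gaps that would force ad hoc decisions during implementation.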
--- a/docs/rag-eval-architecture.md
+++ b/docs/rag-eval-architecture.md
@@ -0,0 +1,623 @@
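+# RAG Evaluation Platform Architecture Design
+
+## 1. Background and Problem
+
+The repository already has a runnable offline evaluation prototype whose capabilities have converged on the unified `main.py --scenario ...` entry point and the layered `rag_eval/` modules. This prototype is suitable for answering the following questions:
+
+- Can offline-exported RAG samples be normalized?
+- Can the `ragas` metrics run reliably?
+- Can the basic scoring results be written out as CSV?
+
+But once the evaluation target grows from "one data file, one evaluation" to "multiple RAG applications, multiple datasets, multiple rounds of experiments, and comparisons across configurations", a single-script structure quickly shows its limits:
+
+- **Responsibilities are too tightly coupled**: argument parsing, data loading, sample normalization, model creation, metric execution, and result persistence are all mixed into one script
+- **Online mode is hard to add**: the current implementation assumes the input already contains `answer` and `contexts`, which does not suit direct integration with a running RAG application
+- **Experiments are hard to reproduce**: the output is mostly a single CSV, with no run snapshot, metadata, or standard summary
+- **Approaches are hard to compare**: there is no unified scenario configuration, so application versions, models, and metric combinations cannot be compared systematically
+- **Multi-application integration is hard to support**: there is no stable application adapter layer, so HTTP service applications and local Python applications cannot share one flow
+
+The repository therefore needs to evolve from an "offline evaluation script" into a "RAG evaluation platform skeleton".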
+
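+## 2. Design Goals
+
+The target architecture must meet the following design goals:
+
+- **Platform-oriented**: upgrade from a one-off script to an engineering structure that can evolve over the long term
+- **Extensible**: adding application integration methods, dataset formats, or metrics does not require rewriting the main flow
+- **Reproducible**: every run keeps a complete configuration snapshot and result artifacts
+- **Comparable**: different applications, models, retrieval strategies, and prompt designs can be compared side by side reliably
+- **Unified dual mode**: offline and online evaluation share the same core data flow
+- **Multi-target integration**: supports combinations of multiple applications, datasets, models, and scenarios
+- **Local first**: the first phase centers on local file artifacts and introduces no database dependency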
+
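+## 3. Non-Goals
+
+This round of architecture design explicitly does not cover:
+
+- No database-centered evaluation platform
+- No frontend console or Web UI
+- No remote task scheduling or distributed execution system
+- No server-side dependency as a prerequisite for running evaluations
+- No implementation of the full directory skeleton and module code in this round; only the design boundaries are fixed
+
+In other words, the goal at this stage is to **first settle the design of a local-file-driven platform skeleton**, not to build a complete product in one step.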
+
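+## 4. Target Architecture Overview
+
+The target platform is divided into six layers by responsibility:
+
+### 4.1 Configuration Layer
+
+Reads, validates, and normalizes YAML scenario configuration and produces a unified `Scenario` object.
+
+Responsibilities include:
+
+- Parsing the scenario configuration file
+- Applying default values
+- Validating the mode, models, metrics, and runtime parameters
+- Generating the run snapshot
+
+### 4.2 Application Integration Layer
+
+Connects to external RAG applications or local Python RAG functions and converts their results into a standard output structure.
+
+Responsibilities include:
+
+- Defining the unified request input
+- Hiding protocol differences between applications
+- Returning a standard response that can be mapped to `answer` and `contexts`
+
+### 4.3 Dataset Layer
+
+Loads raw samples, normalized samples, and online question sets, and converts them all into the platform's standard evaluation sample structure.
+
+Responsibilities include:
+
+- Reading data sources such as CSV, Excel, and JSONL
+- Normalizing fields
+- Validating required fields
+- Outputting a standard sample collection that can enter the evaluation core layer
+
+### 4.4 Metrics Layer
+
+Assembles evaluation metrics according to the scenario configuration to form an executable metric pipeline.
+
+Responsibilities include:
+
+- Building the judge model and embedding model
+- Enabling the metrics specified by the scenario
+- Hiding differences between underlying evaluation libraries
+
+### 4.5 Execution Layer
+
+Chains configuration, data, application, and metrics into one complete run.
+
+Responsibilities include:
+
+- Data preparation
+- Online application calls or offline result loading
+- Sample normalization
+- Concurrent scoring
+- Error capture and graceful degradation
+- Result merging
+
+### 4.6 Results Layer
+
+Persists all results of a run as local file assets.
+
+Responsibilities include:
+
+- Writing score details
+- Writing invalid samples
+- Writing the scenario snapshot
+- Writing the summary report
+- Writing metadata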
+
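+## 5. Domain Model
+
+To keep implementation boundaries clear, the platform's core objects are fixed as the following six types.
+
+### 5.1 `AppAdapter`
+
+Represents a RAG application adapter that the evaluation framework can call.
+
+Responsibilities:
+
+- Accept the standard question input
+- Call the actual RAG application
+- Return a normalized response
+
+Minimum abstract capability:
+
+- Input: `question` plus optional context parameters
+- Output: must be mappable to `answer` and `contexts`
+
+### 5.2 `Dataset`
+
+Represents the dataset definition used by one evaluation.
+
+Responsibilities:
+
+- Describe the data source
+- Load raw samples
+- Output standard evaluation samples
+
+It must support all of the following:
+
+- Offline imported samples
+- Online question samples
+- Possible future multi-file or sharded datasets
+
+### 5.3 `Scenario`
+
+Represents the configuration of one complete experiment and is the platform's unified experiment entry point.
+
+Responsibilities:
+
+- Declare the evaluation mode
+- Bind the application, dataset, models, and metrics
+- Define the output directory and runtime parameters
+
+A `Scenario` must be generated from YAML scenario configuration rather than assembled piecemeal in code.
+
+### 5.4 `Evaluator`
+
+Represents the executor of one evaluation.
+
+Responsibilities:
+
+- Drive one complete run according to the scenario
+- Call the dataset, application adapter, and metric pipeline
+- Produce run results and error information
+
+### 5.5 `MetricPipeline`
+
+Represents a group of metric execution units assembled per scenario.
+
+Responsibilities:
+
+- Initialize the judge model and embedding model
+- Execute multiple metrics through a unified interface
+- Output structured scores
+
+### 5.6 `RunArtifact`
+
+Represents the set of result assets persisted by one run.
+
+Responsibilities:
+
+- Fix the result file layout
+- Store the configuration snapshot, scores, abnormal samples, summary report, and metadata
+- Serve as the smallest unit for later reproduction, auditing, and comparison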
+
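+## 6. Application Integration Layer Design
+
+The application integration layer decouples "how to call the RAG application" from "how to evaluate". Going forward, only two adapter types are first-class citizens: `HTTP Adapter` and `Python Function Adapter`.
+
+### 6.1 HTTP Adapter
+
+For externally deployed RAG services.
+
+Input constraints:
+
+- The standard question `question`
+- Optional context parameters, such as tenant, knowledge base, session configuration, and retrieval parameters
+
+Output constraints:
+
+- Must be mappable to `answer`
+- Must be mappable to `contexts`
+
+Recommended unified response semantics: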
+
+```json
+{
+  "answer": "string",
+  "contexts": ["context 1", "context 2"],
+  "raw_response": {}
+}
+```
+
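+Notes:
+
+- The real HTTP response may differ from this
+- But the adapter layer must convert the actual response into the internal platform semantics above
+- `raw_response` may optionally be kept for debugging and auditing, but core evaluation fields must not depend on it
+
+### 6.2 Python Function Adapter
+
+For RAG applications in local Python form, such as local functions, SDK wrappers, or retrieval QA logic already wrapped in a notebook.
+
+The input and output constraints mirror those of the HTTP Adapter:
+
+- Input: `question` plus optional context parameters
+- Output: must be mappable to `answer` and `contexts`
+
+Recommended function semantics: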
+
+```python
+def run(question: str, **kwargs) -> dict:
+    return {
+        "answer": "...",
+        "contexts": ["...", "..."],
+        "raw_response": {...},
+    }
+```
+
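+Design principles:
+
+- The upper-level evaluation executor should not care whether the underlying call is an HTTP call or a Python function call
+- Anything that satisfies the unified input/output contract can plug into the same evaluation pipeline
+
+## 7. Dataset Layer Design
+
+The goal of the dataset layer is to unify samples from different sources into one standard format.
+
+### 7.1 Raw Samples
+
+Raw samples are inputs that the platform has not yet normalized. They may come from:
+
+- CSV or Excel exported from business systems
+- Question sets used for online evaluation
+- Intermediate files produced by later data-cleaning scripts
+
+Raw samples may carry extra fields, but they should not enter the evaluation core layer directly.
+
+### 7.2 Normalized Evaluation Samples
+
+Regardless of the data source, every sample must be converted into the same standard structure before entering the evaluation core layer. The minimum core fields are fixed as:
+
+- `question`
+- `contexts`
+- `answer`
+- `ground_truth`
+
+Optional metadata fields may include:
+
+- `sample_id`
+- `scenario`
+- `language`
+- `retrieval_config`
+
+Constraints:
+
+- `question`: string
+- `contexts`: ordered list of text passages
+- `answer`: string
+- `ground_truth`: string
+
+### 7.3 Unified Format for Online-Generated and Offline-Imported Samples
+
+The dual-mode constraints are:
+
+- **Offline mode**: the input file already contains `question / contexts / answer / ground_truth`
+- **Online mode**: the question set first provides `question` and the necessary metadata; the application call then fills in `answer / contexts`, which are combined with the reference answer or annotation to form a standard sample
+
+Unified requirements:
+
+- Online mode must not bypass the standard sample structure and go straight to the metric execution layer
+- Offline mode must not use a separate asset format before results are written
+
+This guarantees that the metrics, execution, and results layers can be fully shared.
+
+## 8. Scenario Configuration Design
+
+YAML is the future unified experiment entry point. The README gives only a minimal example; the detailed field definitions are fixed in this section.
+
+### 8.1 Minimal Skeleton Fields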
+
+```yaml
+scenario_name: legal-assistant-offline-baseline
+mode: offline
+app_adapter: null
+dataset: datasets/normalized/legal_assistant_baseline.csv
+judge_model: deepseek-v4-flash
+embedding_model: text-embedding-v3
+metrics:
+  - faithfulness
+  - answer_relevancy
+  - context_recall
+  - context_precision
+output_dir: runs/legal-assistant-offline-baseline
+runtime:
+  batch_size: 4
+```
+
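+### 8.2 Field Reference
+
+- `scenario_name`
+  - Scenario name, used to identify one experiment
+- `mode: offline | online`
+  - Evaluation mode
+- `app_adapter`
+  - Application adapter definition; may be `null` in offline mode and is required in online mode
+- `dataset`
+  - Dataset path or a reference to a dataset definition
+- `judge_model`
+  - The model that performs inference for judge-based metrics
+- `embedding_model`
+  - The model used for embedding-based metrics
+- `metrics`
+  - The list of metrics enabled for this run
+- `output_dir`
+  - Output directory for this run's results
+- `runtime.batch_size`
+  - Concurrent batch size
+
+### 8.3 `app_adapter` Forms in Online Mode
+
+The following two declaration forms are recommended: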
+
+```yaml
+app_adapter:
+  type: http
+  endpoint: https://example-rag/api/ask
+  method: POST
+  timeout_seconds: 30
+```
+
+```yaml
+app_adapter:
+  type: python
+  callable: apps.legal_assistant.adapter:run
+```
+
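+Notes:
+
+- This is a configuration interface constraint and does not mean the repository already implements the parsing
+- Fields may be extended later, but the type boundary is fixed in this round as `http` and `python`
+
+## 9. Evaluation Execution Layer Design
+
+The evaluation execution layer turns one experiment into a stable, auditable run. The execution order is fixed as follows.
+
+### 9.1 Data Preparation
+
+Load the dataset definition according to the `Scenario` and read the raw samples.
+
+### 9.2 Application Call or Offline Loading
+
+- `offline` mode: read offline samples that already contain the core evaluation fields
+- `online` mode: read the question set and call the `AppAdapter` to obtain `answer` and `contexts`
+
+### 9.3 Sample Normalization
+
+All samples are mapped to the standard structure:
+
+- `question`
+- `contexts`
+- `answer`
+- `ground_truth`
+
+Invalid samples are diverted at this stage, rather than being traced back after they fail during metric execution.
+
+### 9.4 Metric Orchestration
+
+Create a `MetricPipeline` from `Scenario.metrics` and assemble:
+
+- judge model
+- embedding model
+- Metric instances
+
+### 9.5 Concurrency Control
+
+The execution layer owns the concurrency limit; concurrency policy is not scattered across the metric implementations.
+
+Minimum requirements:
+
+- Maximum concurrency can be controlled through `runtime.batch_size`
+- Online calls and metric scoring may later have independent concurrency policies
+- Concurrency policy is scheduled centrally by the execution layer
+
+### 9.6 Error Capture
+
+Errors must be captured layer by layer:
+
+- Data loading errors
+- Application call errors
+- Metric execution errors
+- Result writing errors
+
+Principles:
+
+- A single failed sample should not fail the whole batch by default
+- Failure information must appear in the result assets
+- Errors should not be printed to the console and then lost
+
+### 9.7 Result Merging
+
+The final output should include:
+
+- Raw sample fields
+- Normalized sample fields
+- Metric scores
+- Error fields
+- Run metadata
+
+## 10. Result Asset Design
+
+The platform uses a "run directory" model throughout and introduces no database prerequisite. Each run's output is fixed as follows: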
+
+```text
+runs/<run_id>/
+├── scenario.snapshot.yaml
+├── scores.csv
+├── invalid.csv
+├── summary.md
+└── metadata.json
+```
+
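+The responsibilities of each file are fixed as follows.
+
+### 10.1 `runs/<run_id>/scenario.snapshot.yaml`
+
+Stores the scenario snapshot actually used by this evaluation, so the run can be reproduced even if the original scenario file later changes.
+
+### 10.2 `runs/<run_id>/scores.csv`
+
+Stores per-sample evaluation results, including at least:
+
+- Standard sample fields
+- Metric scores
+- Error information
+- Metadata fields such as model and run time
+
+### 10.3 `runs/<run_id>/invalid.csv`
+
+Stores samples that failed normalization, lack key fields, or have an invalid format, for tracking data quality issues.
+
+### 10.4 `runs/<run_id>/summary.md`
+
+Stores a human-readable summary, for example:
+
+- Total, valid, and invalid sample counts
+- Mean of each metric
+- Grouped statistics
+- Observations on low-scoring samples
+
+### 10.5 `runs/<run_id>/metadata.json`
+
+Stores machine-readable metadata, for example:
+
+- `run_id`
+- `scenario_name`
+- `mode`
+- `judge_model`
+- `embedding_model`
+- `started_at`
+- `finished_at`
+- Code version or Git commit (optional, later)
+
+Unified constraints:
+
+- Whether data comes from online or offline mode, results are always written to local file assets
+- Until a database is introduced, no implementation branch may write only to a database without also writing local files
+
+## 11. Recommended Code Directory Structure
+
+The recommended code skeleton is shown below. It serves as an implementation boundary constraint that later refactoring and new modules should follow directly.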
+
+```text
+.
+├── apps/
+│   ├── <app_name>/
+│   │   ├── adapter.py
+│   │   └── README.md
+├── datasets/
+│   ├── raw/
+│   ├── normalized/
+│   └── samples/
+├── scenarios/
+│   ├── offline/
+│   └── online/
+├── rag_eval/
+│   ├── config/
+│   │   ├── loader.py
+│   │   ├── schema.py
+│   │   └── validators.py
+│   ├── adapters/
+│   │   ├── base.py
+│   │   ├── http.py
+│   │   └── python.py
+│   ├── datasets/
+│   │   ├── loader.py
+│   │   ├── normalizers.py
+│   │   └── validators.py
+│   ├── metrics/
+│   │   ├── factory.py
+│   │   ├── pipeline.py
+│   │   └── registry.py
+│   ├── execution/
+│   │   ├── evaluator.py
+│   │   ├── runner.py
+│   │   ├── concurrency.py
+│   │   └── errors.py
+│   ├── reporting/
+│   │   ├── artifacts.py
+│   │   ├── summary.py
+│   │   └── writers.py
+│   ├── shared/
+│   │   ├── types.py
+│   │   ├── models.py
+│   │   └── utils.py
+│   ├── compat.py
+│   └── settings.py
+├── runs/
+├── docs/
+├── tests/
+├── main.py
+└── README.md
+```
+
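+The module layering constraints are fixed as follows:
+
+- `rag_eval/config/`: configuration loading and validation
+- `rag_eval/adapters/`: HTTP / Python application integration
+- `rag_eval/datasets/`: loading, normalization, validation
+- `rag_eval/metrics/`: metric assembly and extension
+- `rag_eval/execution/`: run orchestration, concurrency, error handling
+- `rag_eval/reporting/`: result writing, summaries, reports
+- `rag_eval/shared/`: common types and utilities
+
+Later implementations should not merge these responsibilities back into a single entry script.
+
+## 12. Evolution Roadmap
+
+The repository has completed its first round of convergence from single-script offline evaluation to a unified CLI. The focus going forward is no longer removing old entry points but continuing to improve scenarios, integration, result governance, and testing.
+
+The recommended breakdown is:
+
+### 12.1 Phase 1: Extract Shared Offline-Mode Capabilities (Done)
+
+The following capabilities have been moved into shared modules:
+
+- Input file loading
+- Sample normalization
+- Metric assembly
+- Result writing
+
+As a result, offline mode has left the "single file does everything" structure and entered the unified module layering.
+
+### 12.2 Phase 2: Introduce `Scenario` and YAML Configuration (Done)
+
+The main entry point now reads YAML scenario configuration.
+
+### 12.3 Phase 3: Introduce Application Adapters (First Version Done)
+
+Online mode now has its first-class adapters:
+
+- HTTP Adapter
+- Python Function Adapter
+
+Online and offline modes now share the same normalization and scoring flow.
+
+### 12.4 Phase 4: Unified CLI Entry Point (Done)
+
+The unified CLI entry point is: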
+
+```powershell
+python main.py --scenario scenarios/offline/sample.yaml
+```
+
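+The old compatibility entry point has been removed; the repository uses the scenario-driven entry point exclusively.
+
+### 12.5 Phase 5: Improve Result Governance and Testing
+
+Still to be completed:
+
+- Putting the `runs/` directory convention into practice
+- Summary report generation
+- Unified online/offline regression tests
+- Multi-application and multi-scenario comparison samples
+
+## 13. Conclusion
+
+The core decisions of this design document are fixed:
+
+- YAML is the single experiment entry point
+- Both online and offline modes are supported
+- Samples enter the evaluation core layer only in the standard sample structure
+- Results are written to local `run` directory assets
+- Code is split into configuration, integration, dataset, metrics, execution, and results layers
+
+Later engineering work should follow these boundaries directly, without reopening the overall skeleton for discussion.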
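Note on sections 7.2 and 9.3 of the architecture document above: the standard sample structure and the "divert invalid samples during normalization" rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `NormalizedSample` implementation. `StandardSample`, `find_problem`, `split_samples`, and the `invalid_reason` column are hypothetical names, and treating blank strings and empty context lists as invalid is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class StandardSample:
    """The four core fields fixed by section 7.2."""

    question: str
    contexts: list[str]
    answer: str
    ground_truth: str


def find_problem(record: dict[str, Any]) -> str | None:
    """Return a reason string when the record is invalid, otherwise None."""
    for name in ("question", "answer", "ground_truth"):
        value = record.get(name)
        # Assumption: a blank string counts as a missing field.
        if not isinstance(value, str) or not value.strip():
            return f"{name} must be a non-empty string"
    contexts = record.get("contexts")
    if not isinstance(contexts, list) or not contexts:
        return "contexts must be a non-empty list"
    if not all(isinstance(item, str) and item.strip() for item in contexts):
        return "contexts must contain only non-empty strings"
    return None


def split_samples(
    records: list[dict[str, Any]],
) -> tuple[list[StandardSample], list[dict[str, Any]]]:
    """Divert invalid records before any metric runs, keeping the reason alongside them."""
    valid: list[StandardSample] = []
    invalid: list[dict[str, Any]] = []
    for record in records:
        reason = find_problem(record)
        if reason is None:
            valid.append(
                StandardSample(
                    question=record["question"],
                    contexts=list(record["contexts"]),
                    answer=record["answer"],
                    ground_truth=record["ground_truth"],
                )
            )
        else:
            invalid.append({**record, "invalid_reason": reason})
    return valid, invalid
```

Keeping the reason next to the original record means the invalid list can be written out as is, which matches the stated purpose of `invalid.csv`: tracking data quality issues rather than silently dropping rows.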
--- a/docs/rag-eval-engine-flow.md
+++ b/docs/rag-eval-engine-flow.md
@@ -0,0 +1,416 @@
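+# RAG Evaluation Engine Flow
+
+## 1. What This Document Covers
+
+`docs/rag-eval-architecture.md` mainly answers "why the layers are split this way and where the module boundaries are".  
+This document answers a different question: **how one evaluation actually runs in the code**.
+
+If any of the following questions are still unclear to you, this document is for you:
+
+- What exactly does `rag_eval/` evaluate
+- What role does `apps/` play in the overall architecture
+- What is the difference between `offline` and `online` mode
+- How are dataset, adapter, metrics, and reporting chained together
+
+---
+
+## 2. The Whole Engine in One Sentence
+
+At its core, the system is a standardized evaluation pipeline: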
+
+```text
+scenario -> dataset -> normalize -> app adapter -> metrics -> reporting -> run artifacts
+```
+
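+Or, broken into plainer steps:
+
+1. Read an evaluation scenario configuration
+2. Read the data to be evaluated
+3. Normalize the raw data into the unified sample structure
+4. If needed, call your RAG application to fill in `answer` and `contexts`
+5. Compute scores with `ragas` metrics
+6. Write the results to a local run directory
+
+---
+
+## 3. Directory Responsibilities
+
+### `rag_eval/`
+
+This is the **evaluation engine itself**. It is responsible for:
+
+- Loading the scenario
+- Loading the dataset
+- Calling the app adapter
+- Running metric scoring
+- Writing result assets
+
+### `apps/`
+
+This is the **integration layer for the applications under evaluation**, not the evaluation framework itself.
+
+It holds examples or adapter code showing "how your RAG application is called by the framework". For example: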
+
+```text
+apps/
+└── sample_python/
+    ├── adapter.py
+    └── README.md
+```
+
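+`apps/sample_python/` does not provide evaluation logic; it demonstrates:
+
+- If your application is a local Python function
+- What interface it should expose
+- What the return value should look like
+
+### `scenarios/`
+
+This is the evaluation configuration layer, which uses YAML to declare:
+
+- Evaluation mode
+- Dataset path
+- judge / embedding models
+- Which metrics to run
+- Output directory
+- Whether to call an `http` or `python` adapter
+
+### `datasets/`
+
+Evaluation input data lives here. It is usually split into:
+
+- `raw/`: raw input
+- `normalized/`: cleaned-up standard evaluation samples
+
+### `outputs/` or `runs/`
+
+The result assets generated by each evaluation live here, for example:
+
+- `scores.csv`
+- `invalid.csv`
+- `summary.md`
+- `metadata.json`
+
+---
+
+## 4. Main Entry Flow
+
+The unified main flow always ends up in: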
+
+- [rag_eval/execution/runner.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/execution/runner.py:1)
+
+The core entry function is `run_scenario()`, which chains all the submodules together.
+
+The simplified execution order is:
+
+```text
+run_scenario()
+  -> load_scenario()
+  -> build_adapter()
+  -> build_metric_pipeline()
+  -> Evaluator.evaluate()
+  -> write_run_artifacts()
+```
+
+You can think of it as the orchestration layer of the whole system.
+
+---
+
+## 5. Scenario Flow
+
+A scenario is the "master configuration" of one evaluation task.
+
+Relevant code:
+
+- [rag_eval/config/loader.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/config/loader.py:1)
+- [rag_eval/config/schema.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/config/schema.py:1)
+- [rag_eval/config/validators.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/config/validators.py:1)
+
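+It defines:
+
+- `mode`: `offline` or `online`
+- `dataset`
+- `judge_model`
+- `embedding_model`
+- `metrics`
+- `output_dir`
+- `runtime`
+- `app_adapter`
+
+Its role can be summed up in one sentence:  
+**The scenario decides "how this evaluation runs".**
+
+---
+
+## 6. Dataset Flow
+
+Once data enters the evaluation engine, the raw CSV is not scored directly; it is normalized first.
+
+Relevant code: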
+
+- [rag_eval/datasets/loader.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/datasets/loader.py:1)
+- [rag_eval/datasets/validators.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/datasets/validators.py:1)
+- [rag_eval/datasets/normalizers.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/datasets/normalizers.py:1)
+
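+It does three things:
+
+1. Reads CSV / Excel / JSONL
+2. Validates required fields
+3. Converts records into the unified internal object `NormalizedSample`
+
+The unified sample structure is defined in: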
+
+- [rag_eval/shared/models.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/shared/models.py:1)
+
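+The most important fields are:
+
+- `question`
+- `contexts`
+- `answer`
+- `ground_truth`
+
+This unified structure is what allows metrics and reporting to share a single flow downstream.
+
+---
+
+## 7. The Real Difference Between Offline and Online
+
+This is the most important fork in the whole system.
+
+### Offline Mode
+
+In offline mode, the dataset already contains the complete evaluation fields:
+
+- `question`
+- `contexts`
+- `answer`
+- `ground_truth`
+
+So the flow is: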
+
+```text
+load dataset -> normalize -> score metrics -> write artifacts
+```
+
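+This mode does not need to call your application.
+
+### Online Mode
+
+In online mode, the dataset often contains only:
+
+- `question`
+- Some metadata
+
+`answer` and `contexts` must be fetched by calling your RAG application at evaluation time.
+
+So the flow becomes: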
+
+```text
+load dataset -> normalize -> call app adapter -> enrich samples -> score metrics -> write artifacts
+```
+
+The core step that online mode adds over offline mode is the **adapter call**.
+
+---
+
+## 8. What `apps/sample_python/` Is For
+
+This directory is a **Python adapter example**.
+
+Relevant files:
+
+- [apps/sample_python/adapter.py](/C:/Users/A200477427/Learnings/ragas-template/apps/sample_python/adapter.py:1)
+- [apps/sample_python/README.md](/C:/Users/A200477427/Learnings/ragas-template/apps/sample_python/README.md:1)
+
+It demonstrates that if your RAG application is local Python code, the framework expects you to provide a function like this:
+
+```python
+def run(question: str, **kwargs) -> dict:
+    return {
+        "answer": "...",
+        "contexts": ["...", "..."],
+        "raw_response": {...},
+    }
+```
+
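+In other words, `apps/sample_python/` is not part of the evaluation engine; it is a reference integration template for the "application under evaluation".
+
+Its purpose is to:
+
+- Show how a Python application plugs in
+- Give the `python` adapter a minimal runnable example
+- Let you later swap in your real RAG logic
+
+---
+
+## 9. Adapter Flow
+
+The goal of the adapter layer is to **unify different kinds of target applications under one input/output protocol.**
+
+Relevant code: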
+
+- [rag_eval/adapters/base.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/adapters/base.py:1)
+- [rag_eval/adapters/http.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/adapters/http.py:1)
+- [rag_eval/adapters/python.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/adapters/python.py:1)
+
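+Two kinds of adapter are currently supported:
+
+### `python`
+
+For local Python applications.  
+The framework dynamically loads the function from the `module:function` target in the scenario and then calls it.
+
+### `http`
+
+For standalone HTTP services.  
+The framework builds the request, parses the response, and maps it to the unified structure.
+
+Whichever adapter is used, it must return the unified result:
+
+- `answer`
+- `contexts`
+- `raw_response` (optional)
+
+This step matters because the metrics layer should not care whether the underlying target is an HTTP service or a Python function.
+
+---
+
+## 10. Evaluator Flow
+
+The core of evaluation execution is in: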
+
+- [rag_eval/execution/evaluator.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/execution/evaluator.py:1)
+
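+`Evaluator.evaluate()` roughly does the following:
+
+1. Records the start time
+2. Loads the dataset
+3. Normalizes the samples
+4. In `online` mode, first calls the adapter to fill in the samples
+5. Calls the metric pipeline to score them
+6. Merges sample fields with the scores
+7. Returns an `EvaluationResult`
+
+You can think of the `Evaluator` as:
+
+**The overall executor of one evaluation run**
+
+---
+
+## 11. Metric Pipeline Flow
+
+Relevant code: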
+
+- [rag_eval/metrics/factory.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/metrics/factory.py:1)
+- [rag_eval/metrics/pipeline.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/metrics/pipeline.py:1)
+- [rag_eval/metrics/registry.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/metrics/registry.py:1)
+
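+Its job is not to "decide what to evaluate" but to "actually run the metrics that were chosen".
+
+Specifically, this includes:
+
+- Initializing the OpenAI client
+- Creating the judge model / embedding model
+- Assembling the matching ragas metrics according to the scenario
+- Running per-sample or batch scoring concurrently
+
+The currently supported metrics are:
+
+- `faithfulness`
+- `answer_relevancy`
+- `context_recall`
+- `context_precision`
+
+So the job of the metric pipeline can be summarized as:
+
+**Turning standard samples into structured scores.**
+
+---
+
+## 12. Reporting Flow
+
+Relevant code: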
+
+- [rag_eval/reporting/artifacts.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/reporting/artifacts.py:1)
+- [rag_eval/reporting/summary.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/reporting/summary.py:1)
+- [rag_eval/reporting/writers.py](/C:/Users/A200477427/Learnings/ragas-template/rag_eval/reporting/writers.py:1)
+
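+When evaluation finishes, the results are not just printed to the terminal; they are persisted as standard assets.
+
+The standard output generally includes:
+
+- `scenario.snapshot.yaml`
+- `scores.csv`
+- `invalid.csv`
+- `summary.md`
+- `metadata.json`
+
+This is a key strength of the architecture, because it gives every run:
+
+- Reproducibility
+- Auditability
+- Comparability
+
+---
+
+## 13. The Two Complete Flows
+
+### Complete Offline Flow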
+
+```text
+main.py
+  -> run_scenario()
+  -> load_scenario()
+  -> load_dataset_records()
+  -> normalize_records()
+  -> build_metric_pipeline()
+  -> score_samples()
+  -> write_run_artifacts()
+```
+
+Characteristics:
+
+- Does not call the application under evaluation
+- Scores ready-made samples directly
+
+### Complete Online Flow
+
+```text
+main.py
+  -> run_scenario()
+  -> load_scenario()
+  -> build_adapter()
+  -> load_dataset_records()
+  -> normalize_records()
+  -> adapter.enrich_sample()
+  -> build_metric_pipeline()
+  -> score_samples()
+  -> write_run_artifacts()
+```
+
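+Characteristics:
+
+- Calls the target application first
+- Then scores the `answer / contexts` generated at run time
+
+---
+
+## 14. How to Think About This Architecture
+
+If you keep only one mental model, keep this one:
+
+- `rag_eval/` handles "how to evaluate"
+- `apps/` handles "how the application under evaluation plugs in"
+- `datasets/` handles "what the evaluation input is"
+- `scenarios/` handles "how this evaluation is configured"
+- `reporting/` handles "how results are persisted"
+
+From an engineering perspective, the core value of this architecture is not that it "can run one evaluation", but that:
+
+- It can be run repeatedly
+- It can be run against a different application
+- It can be run on different data
+- It can be run with a different model
+- The assets of every experiment are reliably retained
+
+This is also its fundamental difference from a one-off offline script.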
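Note on the online flow described above: the adapter is called once per sample, and the architecture document asks that the execution layer cap concurrency and that one failing sample not fail the whole run. A minimal sketch of both ideas with `asyncio` follows. `enrich_all`, `fake_app`, and the `error` field are hypothetical names chosen for illustration; the real `Evaluator` may be structured differently.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable

AppCall = Callable[[str], Awaitable[dict[str, Any]]]


async def enrich_all(
    questions: list[str],
    call_app: AppCall,
    max_concurrency: int,
) -> list[dict[str, Any]]:
    """Call the app once per question with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def enrich_one(question: str) -> dict[str, Any]:
        async with semaphore:
            try:
                response = await call_app(question)
            except Exception as exc:
                # Keep the failure as data so it can reach the result assets.
                return {"question": question, "answer": "", "contexts": [], "error": str(exc)}
        return {
            "question": question,
            "answer": response.get("answer", ""),
            "contexts": response.get("contexts", []),
            "error": None,
        }

    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*(enrich_one(q) for q in questions))


async def fake_app(question: str) -> dict[str, Any]:
    """Hypothetical stand-in for a RAG application."""
    if question == "boom":
        raise RuntimeError("app unavailable")
    return {"answer": f"answer to {question}", "contexts": [f"context for {question}"]}


results = asyncio.run(enrich_all(["q1", "boom", "q2"], fake_app, max_concurrency=2))
```

Catching the exception inside the per-sample coroutine, instead of around `gather`, is what keeps the other samples running when one call fails.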
--- a/docs/sample-pdf-question-bank-workflow.md
+++ b/docs/sample-pdf-question-bank-workflow.md
@@ -0,0 +1,222 @@
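+# `sample-pdf-question-bank` End-to-End Guide
+
+This document corresponds to a real case in the repository: first parse PDFs into a question bank, then use the online evaluator to generate answers from the evidence chunks and score them.
+
+The complete flow is: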
+
+```text
+PDFs
+-> dataset_build
+-> sample-pdf-question-bank.csv + latest/source_chunks.jsonl
+-> online adapter
+-> answer + contexts
+-> ragas metrics
+```
+
+## 1. Prepare the Environment
+
+First copy the environment variable template:
+
+```powershell
+Copy-Item .env.example .env
+```
+
+This case depends on two kinds of capability:
+
+- An OpenAI-compatible model
+  - `OPENAI_API_KEY`
+  - `OPENAI_BASE_URL`
+- Alibaba Cloud document parsing
+  - `ALIBABA_ACCESS_KEY_ID`
+  - `ALIBABA_ACCESS_KEY_SECRET`
+  - `ALIBABA_ENDPOINT`
+
+The following model settings are also used by default:
+
+- `DATASET_GENERATOR_MODEL=qwen3.6-plus`
+- `RAGAS_JUDGE_MODEL=deepseek-v4-flash`
+- `RAGAS_EMBEDDING_MODEL=text-embedding-v3`
+
+Without `OPENAI_API_KEY`, neither question bank generation in the dataset build nor the online eval can run.
+Without Alibaba Cloud credentials, the PDF parsing stage cannot run.
+
+## 2. Run the Dataset Build
+
+Use the sample config:
+
+- config: [scenarios/dataset_build/sample-pdf-build.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/dataset_build/sample-pdf-build.yaml)
+
+Run the command:
+
+```powershell
+uv run main.py --dataset-build-config scenarios/dataset_build/sample-pdf-build.yaml
+# or
+.\.venv\Scripts\python.exe main.py --dataset-build-config scenarios/dataset_build/sample-pdf-build.yaml
+```
+
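+This step does four things:
+
+1. Scans the PDFs under `datasets/raw/pdfs`
+2. Calls Alibaba Cloud parsing to generate structured `source chunks`
+3. Calls the LLM to generate a draft question bank
+4. Writes the stable dataset and the detailed run assets
+
+## 3. Inspect the Build Outputs
+
+After the run, you will see two kinds of output.
+
+The first kind is the stable entry point, used by the later online eval and by tutorials: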
+
+- question bank CSV:
+  [datasets/raw/generated/sample-pdf-question-bank.csv](/C:/Users/A200477427/Learnings/ragas-template/datasets/raw/generated/sample-pdf-question-bank.csv)
+- latest source chunks:
+  [source_chunks.jsonl](/C:/Users/A200477427/Learnings/ragas-template/outputs/dataset-builds/sample-pdf-question-bank/latest/source_chunks.jsonl)
+- latest dataset draft:
+  [dataset_draft.csv](/C:/Users/A200477427/Learnings/ragas-template/outputs/dataset-builds/sample-pdf-question-bank/latest/dataset_draft.csv)
+- latest metadata:
+  [metadata.json](/C:/Users/A200477427/Learnings/ragas-template/outputs/dataset-builds/sample-pdf-question-bank/latest/metadata.json)
+
+The second kind is timestamped run-level assets, used for auditing and troubleshooting:
+
+- Directory pattern:
+  `outputs/dataset-builds/sample-pdf-question-bank/<run_id>/`
+
+Common files include:
+
+- `documents.jsonl`
+- `semantic_blocks.jsonl`
+- `source_chunks.jsonl`
+- `dataset_draft.csv`
+- `parse_failures.csv`
+- `metadata.json`
+
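+Understanding this distinction is important:
+
+- The stable entry point is for "continuing downstream"
+- The timestamped directory is for "looking back at one specific build"
+
+## 4. What Is in the Question Bank CSV
+
+The sample build outputs an online-ready question bank, not the offline evaluation format.
+
+The core fields include: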
+
+- `question`
+- `ground_truth`
+- `source_chunk_ids`
+- `doc_id`
+- `doc_name`
+
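+It deliberately does not pre-fill `answer`.
+The reason is that this flow targets online evaluation: at evaluation time, the adapter uses `source_chunk_ids` to fetch evidence from `source_chunks.jsonl` and then calls the model to generate `answer`.
+
+## 5. Run the Online Eval
+
+Use the sample online scenario: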
+
+- scenario: [scenarios/online/sample-pdf-question-bank-online.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/online/sample-pdf-question-bank-online.yaml)
+- adapter: [apps/pdf_question_bank/adapter.py](/C:/Users/A200477427/Learnings/ragas-template/apps/pdf_question_bank/adapter.py)
+
+Run the command:
+
+```powershell
+uv run main.py --scenario scenarios/online/sample-pdf-question-bank-online.yaml
+# or
+.\.venv\Scripts\python.exe main.py --scenario scenarios/online/sample-pdf-question-bank-online.yaml
+```
+
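+This scenario has two key points:
+
+1. The dataset points to the stable question bank CSV
+2. `app_adapter.static_kwargs.source_chunks_path` points to the stable `latest/source_chunks.jsonl`
+
+So as long as you have rerun the sample dataset build, the online scenario needs no manual edits to timestamped paths.
+
+## 6. What the Online Adapter Does
+
+`apps/pdf_question_bank/adapter.py` follows a fixed procedure:
+
+1. Read `source_chunk_ids` from the question bank row
+2. Open `source_chunks.jsonl`
+3. Resolve only the referenced chunks
+4. Use the text of these chunks verbatim as `contexts`
+5. Prompt the model with this evidence to generate `answer`
+6. Write `resolved_chunk_ids` and the model response into `raw_response`
+
+So the evaluation relationship is:
+
+- `ground_truth` is the reference answer
+- `answer` is the answer generated at run time
+- `contexts` are the evidence chunks explicitly referenced by the question
+
+This flow performs no separate retrieval.
+It evaluates "whether the application/model can reliably generate the correct answer when given explicit evidence".
+
+## 7. Where to Find the Results
+
+After the online eval finishes, results are written to:
+
+- `outputs/online/sample-pdf-question-bank/<run_id>/`
+
+Common files include: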
+
+- `scores.csv`
+- `invalid.csv`
+- `summary.md`
+- `metadata.json`
+
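+Look at these first:
+
+- `scores.csv`: per-question metric scores
+- `invalid.csv`: which samples were dropped because of adapter failures or empty results
+- `summary.md`: the summary view
+
+## 8. Common Issues
+
+### `source_chunk_ids` Cannot Be Found
+
+This usually means the question bank CSV and `source_chunks.jsonl` are not products of the same build.
+
+The correct fix:
+
+1. Rerun `sample-pdf-build.yaml`
+2. Confirm that `outputs/dataset-builds/sample-pdf-question-bank/latest/source_chunks.jsonl` has been updated
+3. Run the online scenario again
+
+### Dataset Build Succeeds but Online Eval Results Are Invalid
+
+Check `invalid.csv` first.
+In the current implementation, the following cases are marked invalid:
+
+- The adapter generates an empty `answer`
+- The adapter returns empty `contexts`
+- The adapter raises an exception while resolving chunks or calling the model
+
+### You Only Want a Quick Offline Smoke Run Without Rebuilding the Question Bank
+
+Run this directly: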
+
+- [scenarios/offline/sample-pdf-offline-smoke.yaml](/C:/Users/A200477427/Learnings/ragas-template/scenarios/offline/sample-pdf-offline-smoke.yaml)
+
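+This case is a frozen offline smoke dataset and does not depend on the online adapter.
+
+## 9. What to Change for Your Own PDFs
+
+To reuse this pattern for your own PDFs, these are the minimum changes:
+
+1. Copy a dataset build YAML
+2. Change `input.path` to point to your PDF or PDF directory
+3. Change `output.dataset_path` to your question bank CSV
+4. Change `output.artifact_dir` to your build asset root directory
+5. Copy an online scenario
+6. Change `dataset` to point to your question bank CSV
+7. Change `app_adapter.static_kwargs.source_chunks_path` to point to your `artifact_dir/latest/source_chunks.jsonl`
+
+You do not need to change the field structure of the question bank CSV.
+You also do not need to pre-fill `answer` in the CSV.
+
+If your online answer logic is still "generate the answer from explicit evidence chunks", you can keep reusing: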
+
+- [apps/pdf_question_bank/adapter.py](/C:/Users/A200477427/Learnings/ragas-template/apps/pdf_question_bank/adapter.py)
+
+If your application is an HTTP service or has its own RAG flow, switch to the corresponding adapter.
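Note on section 6 of the workflow document above: steps 1 to 4 of the adapter procedure (read `source_chunk_ids`, open `source_chunks.jsonl`, resolve only the referenced chunks, return their text as `contexts`) can be sketched as follows. This is not the code in `apps/pdf_question_bank/adapter.py`. The `;` separator and the `chunk_id` and `text` field names are assumptions made for illustration, since the diff shown here does not define the JSONL schema.

```python
from __future__ import annotations

import json
from pathlib import Path


def resolve_contexts(source_chunk_ids: str, source_chunks_path: Path) -> list[str]:
    """Return the text of each referenced chunk, in the order the IDs are listed."""
    # Assumption: IDs are stored in one CSV cell, separated by semicolons.
    wanted = [item.strip() for item in source_chunk_ids.split(";") if item.strip()]
    found: dict[str, str] = {}
    with source_chunks_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            chunk = json.loads(line)
            # Assumption: each JSONL row carries `chunk_id` and `text`.
            if chunk["chunk_id"] in wanted:
                found[chunk["chunk_id"]] = chunk["text"]
    missing = [chunk_id for chunk_id in wanted if chunk_id not in found]
    if missing:
        # Typically means the CSV and the JSONL come from different builds.
        raise KeyError(f"source_chunk_ids not found: {missing}")
    return [found[chunk_id] for chunk_id in wanted]
```

Raising on a missing ID, rather than returning a shorter list, surfaces the "CSV and JSONL are from different builds" problem from section 8 as an adapter exception instead of a silently weaker context set.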
--- a/main.py
+++ b/main.py
@@ -0,0 +1,37 @@
+from __future__ import annotations
+
+import argparse
+
+from rag_eval.dataset_builder.runner import run_dataset_build
+from rag_eval.execution.runner import run_scenario
+
+
+def parse_args() -> argparse.Namespace:
+    """Parse CLI arguments for either evaluation or dataset build workflows."""
+    parser = argparse.ArgumentParser(description="Run a RAG evaluation scenario or a dataset build.")
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument(
+        "--scenario",
+        help="Path to a YAML scenario file.",
+    )
+    group.add_argument(
+        "--dataset-build-config",
+        help="Path to a YAML dataset build config file.",
+    )
+    return parser.parse_args()
+
+
+def main() -> None:
+    """Dispatch the CLI call to the requested workflow."""
+    args = parse_args()
+    if args.dataset_build_config:
+        result = run_dataset_build(args.dataset_build_config)
+        print(f"Completed dataset build: {result.artifact_paths.root_dir}")
+        return
+
+    result = run_scenario(args.scenario)
+    print(f"Completed run: {result.scenario.output_dir}")
+
+
+if __name__ == "__main__":
+    main()
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,19 @@
+[project]
+name = "ragas-template"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "alibabacloud-credentials>=1.0.3",
+    "alibabacloud-docmind-api20220711>=1.0.8",
+    "alibabacloud-tea-openapi>=0.4.1",
+    "alibabacloud-tea-util>=0.3.13",
+    "datasets>=4.0.0",
+    "langchain-community==0.4.2",
+    "langchain-openai==1.2.2",
+    "openai>=1.0.0",
+    "pandas>=3.0.0",
+    "pydantic-settings>=2.14.1",
+    "ragas==0.4.3",
+]
--- a/rag_eval/__init__.py
+++ b/rag_eval/__init__.py
@@ -0,0 +1,5 @@
+"""Public package exports for the RAG evaluation toolkit."""
+
+from .execution.runner import run_scenario
+
+__all__ = ["run_scenario"]
--- a/rag_eval/adapters/__init__.py
+++ b/rag_eval/adapters/__init__.py
@@ -0,0 +1,7 @@
+"""Adapter implementations that connect evaluation flows to target applications."""
+
+from .base import AppAdapter
+from .http import HttpAppAdapter
+from .python import PythonFunctionAdapter
+
+__all__ = ["AppAdapter", "HttpAppAdapter", "PythonFunctionAdapter"]
--- a/rag_eval/adapters/base.py
+++ b/rag_eval/adapters/base.py
@@ -0,0 +1,37 @@
+"""Shared adapter interfaces for online application execution."""
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from typing import Any
+
+from rag_eval.shared.models import NormalizedSample
+
+
+class AppAdapter(ABC):
+    """Abstract base class for adapters that fetch answers and contexts from apps."""
+
+    @abstractmethod
+    async def run(self, question: str, **kwargs: Any) -> dict[str, Any]:
+        """Execute the target application for a single question."""
+        raise NotImplementedError
+
+    async def enrich_sample(self, sample: NormalizedSample) -> NormalizedSample:
+        """Merge adapter output into an existing normalized sample."""
+        response = await self.run(question=sample.question, **sample.metadata)
+        answer = str(response.get("answer") or "").strip()
+        contexts = response.get("contexts") or []
+        # Drop empty context fragments so downstream metrics receive clean lists.
+        normalized_contexts = [str(item).strip() for item in contexts if str(item).strip()]
+        return NormalizedSample(
+            sample_id=sample.sample_id,
+            question=sample.question,
+            contexts=normalized_contexts,
+            answer=answer,
+            ground_truth=sample.ground_truth,
+            scenario=sample.scenario,
+            language=sample.language,
+            retrieval_config=sample.retrieval_config,
+            metadata={**sample.metadata, "raw_response": response.get("raw_response")},
+            raw=sample.raw,
+        )
--- a/rag_eval/adapters/http.py
+++ b/rag_eval/adapters/http.py
@@ -0,0 +1,45 @@
+"""HTTP adapter implementation for online evaluation scenarios."""
+
+from __future__ import annotations
+
+from typing import Any
+
+import httpx
+
+from rag_eval.shared.models import AppAdapterConfig
+
+from .base import AppAdapter
+
+
+class HttpAppAdapter(AppAdapter):
+    """Call an HTTP endpoint and map its JSON response into the normalized adapter shape."""
+
+    def __init__(self, config: AppAdapterConfig):
+        """Store the HTTP adapter configuration for later requests."""
+        self.config = config
+
+    async def run(self, question: str, **kwargs: Any) -> dict[str, Any]:
+        """Send one HTTP request and return the normalized response payload."""
+        payload = dict(self.config.request_template)
+        payload["question"] = question
+        payload.update(self.config.static_kwargs)
+        payload.update(kwargs)
+
+        async with httpx.AsyncClient(timeout=self.config.timeout_seconds) as client:
+            response = await client.request(
+                self.config.method.upper(),
+                self.config.endpoint or "",
+                json=payload,
+            )
+            response.raise_for_status()
+            body = response.json()
+
+        # Allow scenario config to rename answer/context fields without custom code.
+        mapping = self.config.response_mapping or {}
+        answer_key = mapping.get("answer", "answer")
+        contexts_key = mapping.get("contexts", "contexts")
+        return {
+            "answer": body.get(answer_key, ""),
+            "contexts": body.get(contexts_key, []),
+            "raw_response": body,
+        }
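Note on `response_mapping` in the HTTP adapter above: the renaming step can be exercised in isolation, without a network call, to see what a scenario would need to declare for a service whose field names differ from `answer` and `contexts`. The function below restates the mapping logic from `HttpAppAdapter.run`; the `reply` and `passages` field names and the sample body are hypothetical.

```python
from __future__ import annotations

from typing import Any


def map_response(
    body: dict[str, Any],
    response_mapping: dict[str, str] | None = None,
) -> dict[str, Any]:
    """Rename service-specific fields into the adapter's `answer` / `contexts` shape."""
    mapping = response_mapping or {}
    answer_key = mapping.get("answer", "answer")
    contexts_key = mapping.get("contexts", "contexts")
    return {
        "answer": body.get(answer_key, ""),
        "contexts": body.get(contexts_key, []),
        "raw_response": body,
    }


# Hypothetical service response that uses its own field names.
service_body = {"reply": "Paris", "passages": ["Paris is the capital of France."]}
mapped = map_response(service_body, {"answer": "reply", "contexts": "passages"})
```

With no mapping configured, the same function falls back to the default `answer` and `contexts` keys, so services that already follow the recommended response shape need no extra configuration.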
--- a/rag_eval/adapters/python.py
+++ b/rag_eval/adapters/python.py
@@ -0,0 +1,38 @@
+"""Python callable adapter for in-process application integrations."""
+
+from __future__ import annotations
+
+from importlib import import_module
+from typing import Any, Callable
+
+from rag_eval.shared.models import AppAdapterConfig
+
+from .base import AppAdapter
+
+
+class PythonFunctionAdapter(AppAdapter):
+    """Wrap a configured Python callable so it can participate in online evaluation."""
+
+    def __init__(self, config: AppAdapterConfig):
+        """Load and cache the configured callable during adapter initialization."""
+        self.config = config
+        self._callable = self._load_callable(config.callable or "")
+
+    @staticmethod
+    def _load_callable(target: str) -> Callable[..., dict[str, Any]]:
+        """Resolve a `module:function` target into a callable object."""
+        module_name, _, attr_name = target.partition(":")
+        if not module_name or not attr_name:
+            raise ValueError("Python adapter callable must use module:function syntax.")
+        module = import_module(module_name)
+        fn = getattr(module, attr_name)
+        if not callable(fn):
+            raise TypeError(f"Configured callable is not callable: {target}")
+        return fn
+
+    async def run(self, question: str, **kwargs: Any) -> dict[str, Any]:
+        """Invoke the configured callable and enforce the adapter response contract."""
+        result = self._callable(question=question, **{**self.config.static_kwargs, **kwargs})
+        if not isinstance(result, dict):
+            raise TypeError("Python adapter callable must return a dict.")
+        return result
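Note on the `module:function` syntax used by the Python adapter above: the resolution step can be tried against a standard-library target to see how a scenario's `callable` string turns into a function object. The sketch restates the logic of `_load_callable` as a free function; `load_callable` and the `json:dumps` target are chosen only for illustration.

```python
from __future__ import annotations

from importlib import import_module
from typing import Any, Callable


def load_callable(target: str) -> Callable[..., Any]:
    """Resolve a `module:function` string into the callable it names."""
    module_name, _, attr_name = target.partition(":")
    if not module_name or not attr_name:
        raise ValueError("Target must use module:function syntax.")
    fn = getattr(import_module(module_name), attr_name)
    if not callable(fn):
        raise TypeError(f"Configured target is not callable: {target}")
    return fn


# A standard-library target stands in for something like `apps.sample_python.adapter:run`.
dumps = load_callable("json:dumps")
serialized = dumps({"answer": "ok"})
```

Resolving the target once at adapter construction, as the class above does, means a typo in the scenario file fails before any sample is processed rather than midway through a run.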
--- a/rag_eval/compat.py
+++ b/rag_eval/compat.py
@@ -0,0 +1,39 @@
+"""Compatibility helpers for optional third-party import paths."""
+
+from __future__ import annotations
+
+import sys
+import types
+
+
+def ensure_ragas_import_compat() -> None:
+    """Patch optional langchain module paths that ragas imports eagerly.
+
+    The local environment ships a `langchain_community` build that still exposes
+    `langchain_community.llms.vertexai` but no longer provides
+    `langchain_community.chat_models.vertexai`. Ragas imports the chat module at
+    import time even when only OpenAI is used. Inject a minimal module so ragas
+    can import without mutating site-packages.
+    """
+
+    module_name = "langchain_community.chat_models.vertexai"
+    if module_name in sys.modules:
+        return
+
+    try:
+        import langchain_community.chat_models.vertexai  # type: ignore  # noqa: F401
+
+        return
+    except ModuleNotFoundError:
+        pass
+
+    # Inject a minimal shim so ragas can import successfully in stripped builds.
+    shim = types.ModuleType(module_name)
+
+    class ChatVertexAI:  # pragma: no cover - only used for import compatibility
+        """Compatibility shim for environments that do not ship ChatVertexAI."""
+
+        pass
+
+    shim.ChatVertexAI = ChatVertexAI
+    sys.modules[module_name] = shim
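Note on the shim technique in `compat.py` above: registering a placeholder module in `sys.modules` makes a later import of that name succeed without touching site-packages. The sketch below shows the same idea with a deliberately fake top-level module name so it runs anywhere. `ensure_module`, `Placeholder`, and `hypothetical_missing_dependency` are illustrative names, not part of this repository or of any real package.

```python
from __future__ import annotations

import importlib
import sys
import types


def ensure_module(module_name: str, **attributes: object) -> types.ModuleType:
    """Return the real module when importable, otherwise register and return a shim."""
    if module_name in sys.modules:
        return sys.modules[module_name]
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        shim = types.ModuleType(module_name)
        for name, value in attributes.items():
            setattr(shim, name, value)
        # Later imports of this name now resolve to the shim.
        sys.modules[module_name] = shim
        return shim


class Placeholder:
    """Stand-in class that exists only to satisfy an eager import."""


ensure_module("hypothetical_missing_dependency", ChatModel=Placeholder)
resolved = importlib.import_module("hypothetical_missing_dependency")
```

A shim like this only satisfies the import; any code path that actually instantiates the placeholder would still misbehave, which is why `compat.py` limits it to a class that the OpenAI-only flow never uses.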
--- a/rag_eval/config/__init__.py
+++ b/rag_eval/config/__init__.py
@@ -0,0 +1,5 @@
+"""Scenario configuration loading utilities."""
+
+from .loader import load_scenario
+
+__all__ = ["load_scenario"]
--- a/rag_eval/config/loader.py
+++ b/rag_eval/config/loader.py
@@ -0,0 +1,67 @@
+"""Scenario file loading and conversion into internal runtime models."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import yaml
+
+from rag_eval.shared.models import AppAdapterConfig, DatasetConfig, RuntimeConfig, Scenario
+
+from .schema import ScenarioModel
+from .validators import validate_scenario
+
+
+def _resolve_static_kwargs_paths(base_dir: Path, raw_kwargs: dict[str, object]) -> dict[str, object]:
+    """Resolve adapter static kwargs that look like relative file-system paths."""
+    resolved: dict[str, object] = {}
+    for key, value in raw_kwargs.items():
+        if key.endswith("_path") and isinstance(value, str):
+            candidate = Path(value)
+            resolved[key] = candidate if candidate.is_absolute() else (base_dir / candidate).resolve()
+            continue
+        resolved[key] = value
+    return resolved
+
+
+def load_scenario(path: str | Path) -> Scenario:
+    """Load, validate, and resolve a scenario file into the internal scenario model."""
+    scenario_path = Path(path).resolve()
+    payload = yaml.safe_load(scenario_path.read_text(encoding="utf-8")) or {}
+    model = ScenarioModel.model_validate(payload)
+    base_dir = scenario_path.parent
+
+    app_adapter = None
+    if model.app_adapter is not None:
+        # Convert the validated Pydantic model into the lightweight runtime dataclass.
+        app_adapter = AppAdapterConfig(
+            type=model.app_adapter.type,
+            endpoint=model.app_adapter.endpoint,
+            method=model.app_adapter.method,
+            timeout_seconds=model.app_adapter.timeout_seconds,
+            callable=model.app_adapter.callable,
+            request_template=model.app_adapter.request_template,
+            response_mapping=model.app_adapter.response_mapping,
+            static_kwargs=_resolve_static_kwargs_paths(base_dir, model.app_adapter.static_kwargs),
+        )
+
+    scenario = Scenario(
+        scenario_name=model.scenario_name,
+        mode=model.mode,
+        app_adapter=app_adapter,
+        dataset=DatasetConfig(path=model.resolve_path(base_dir, model.dataset)),
+        judge_model=model.judge_model,
+        embedding_model=model.embedding_model,
+        metrics=model.metrics,
+        output_dir=model.resolve_path(base_dir, model.output_dir),
+        runtime=RuntimeConfig(
+            batch_size=model.runtime.batch_size,
+            app_concurrency=model.runtime.app_concurrency,
+            metric_concurrency=model.runtime.metric_concurrency,
+            max_samples=model.runtime.max_samples,
+        ),
+        source_path=scenario_path,
+    )
+    # Run cross-field checks after all relative paths have been resolved.
+    validate_scenario(scenario)
+    return scenario
--- a/rag_eval/config/schema.py
+++ b/rag_eval/config/schema.py
@@ -0,0 +1,78 @@
+"""Pydantic schemas used to validate raw scenario configuration files."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any, Literal
+
+from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator
+
+
+class RuntimeConfigModel(BaseModel):
+    """Schema for runtime concurrency and sampling settings."""
+    model_config = ConfigDict(extra="ignore")
+
+    batch_size: int = 4
+    app_concurrency: int | None = None
+    metric_concurrency: int | None = None
+    max_samples: int | None = None
+
+
+class AppAdapterConfigModel(BaseModel):
+    """Schema for adapter-specific configuration in online scenarios."""
+    model_config = ConfigDict(extra="ignore")
+
+    type: Literal["http", "python"]
+    endpoint: str | None = None
+    method: str = "POST"
+    timeout_seconds: int = 30
+    callable: str | None = None
+    request_template: dict[str, Any] = Field(default_factory=dict)
+    response_mapping: dict[str, str] = Field(default_factory=dict)
+    static_kwargs: dict[str, Any] = Field(default_factory=dict)
+
+    @model_validator(mode="after")
+    def validate_shape(self) -> "AppAdapterConfigModel":
+        """Enforce the fields required by each adapter type."""
+        if self.type == "http" and not self.endpoint:
+            raise ValueError("HTTP adapter requires endpoint.")
+        if self.type == "python" and not self.callable:
+            raise ValueError("Python adapter requires callable.")
+        return self
+
+
+class ScenarioModel(BaseModel):
+    """Schema for a user-authored evaluation scenario file."""
+    model_config = ConfigDict(extra="ignore")
+
+    scenario_name: str
+    mode: Literal["offline", "online"]
+    app_adapter: AppAdapterConfigModel | None = None
+    dataset: str
+    judge_model: str
+    embedding_model: str
+    metrics: list[str]
+    output_dir: str
+    runtime: RuntimeConfigModel = Field(default_factory=RuntimeConfigModel)
+
+    @field_validator("metrics")
+    @classmethod
+    def ensure_metrics_not_empty(cls, value: list[str]) -> list[str]:
+        """Reject scenarios that do not request any metrics."""
+        if not value:
+            raise ValueError("metrics must not be empty.")
+        return value
+
+    @model_validator(mode="after")
+    def validate_mode_requirements(self) -> "ScenarioModel":
+        """Ensure online scenarios define the adapter they depend on."""
+        if self.mode == "online" and self.app_adapter is None:
+            raise ValueError("online mode requires app_adapter.")
+        return self
+
+    def resolve_path(self, base_dir: Path, raw_path: str) -> Path:
+        """Resolve relative paths against the scenario file directory."""
+        candidate = Path(raw_path)
+        if candidate.is_absolute():
+            return candidate
+        return (base_dir / candidate).resolve()
--- a/rag_eval/config/validators.py
+++ b/rag_eval/config/validators.py
@@ -0,0 +1,20 @@
+"""Cross-field validation helpers for resolved runtime scenarios."""
+
+from __future__ import annotations
+
+from rag_eval.metrics.registry import SUPPORTED_METRICS
+from rag_eval.shared.models import Scenario
+
+
+def validate_scenario(scenario: Scenario) -> None:
+    """Validate metric selection and mode-specific runtime constraints."""
+    unsupported = [name for name in scenario.metrics if name not in SUPPORTED_METRICS]
+    if unsupported:
+        supported = ", ".join(sorted(SUPPORTED_METRICS))
+        raise ValueError(
+            f"Unsupported metrics: {', '.join(unsupported)}. Supported metrics: {supported}"
+        )
+    if scenario.mode == "offline" and scenario.app_adapter is not None:
+        raise ValueError("offline mode should not define app_adapter.")
+    if scenario.runtime.batch_size < 1:
+        raise ValueError("runtime.batch_size must be >= 1.")
--- a/rag_eval/dataset_builder/__init__.py
+++ b/rag_eval/dataset_builder/__init__.py
@@ -0,0 +1,5 @@
+"""Dataset build workflow for converting PDFs into reviewable online question banks."""
+
+from .runner import run_dataset_build
+
+__all__ = ["run_dataset_build"]
--- a/rag_eval/dataset_builder/generator/__init__.py
+++ b/rag_eval/dataset_builder/generator/__init__.py
@@ -0,0 +1,5 @@
+"""Question generation components for draft online datasets."""
+
+from .question_generator import OpenAIQuestionGenerator, QuestionGenerator
+
+__all__ = ["OpenAIQuestionGenerator", "QuestionGenerator"]
--- a/rag_eval/dataset_builder/generator/question_generator.py
+++ b/rag_eval/dataset_builder/generator/question_generator.py
@@ -0,0 +1,173 @@
+"""LLM-backed question generator for dataset build jobs."""
+
+from __future__ import annotations
+
+import json
+from abc import ABC, abstractmethod
+from typing import Any
+
+from openai import OpenAI
+
+from rag_eval.dataset_builder.models import DraftQuestionSample, ParsedDocument, SourceChunk
+from rag_eval.settings import EvaluationSettings
+
+
+class QuestionGenerator(ABC):
+    """Abstract interface for generating draft questions from parsed documents."""
+
+    @abstractmethod
+    def generate(
+        self,
+        document: ParsedDocument,
+        *,
+        max_questions: int,
+        max_chunks_per_question: int,
+        job_name: str,
+    ) -> list[DraftQuestionSample]:
+        """Generate draft question samples for one parsed document."""
+        raise NotImplementedError
+
+
+class OpenAIQuestionGenerator(QuestionGenerator):
+    """Generate draft questions with an OpenAI-compatible chat completion API."""
+
+    def __init__(self, settings: EvaluationSettings, model: str, client: OpenAI | None = None):
+        """Initialize the OpenAI-compatible client and target generation model."""
+        if not settings.openai_api_key:
+            raise EnvironmentError("OPENAI_API_KEY must be set before generating draft questions.")
+        self.client = client or OpenAI(**settings.openai_client_kwargs)
+        self.model = model
+
+    def _build_prompt(
+        self,
+        document: ParsedDocument,
+        *,
+        max_questions: int,
+        max_chunks_per_question: int,
+    ) -> str:
+        """Build a constrained JSON-generation prompt for one document."""
+        chunk_lines: list[str] = []
+        for chunk in document.source_chunks:
+            chunk_lines.append(
+                json.dumps(
+                    {
+                        "chunk_id": chunk.chunk_id,
+                        "section_path": chunk.section_path,
+                        "page_start": chunk.page_start,
+                        "page_end": chunk.page_end,
+                        "text": chunk.text,
+                    },
+                    ensure_ascii=False,
+                )
+            )
+
+        instructions = {
+            "task": "Generate reviewable online evaluation draft questions from one document only.",
+            "rules": [
+                "Return JSON only.",
+                f"Generate at most {max_questions} samples.",
+                f"Each sample may cite at most {max_chunks_per_question} chunk ids.",
+                "Every sample must stay within this document and use existing chunk ids only.",
+                "Allowed question_type values: fact, summary, procedure, comparison.",
+                "Allowed difficulty values: easy, medium, hard.",
+            ],
+            "output_schema": {
+                "samples": [
+                    {
+                        "question": "string",
+                        "ground_truth": "string",
+                        "source_chunk_ids": ["chunk-id"],
+                        "question_type": "fact|summary|procedure|comparison",
+                        "difficulty": "easy|medium|hard",
+                    }
+                ]
+            },
+            "document": {
+                "doc_id": document.doc_id,
+                "doc_name": document.doc_name,
+                "chunks": chunk_lines,
+            },
+        }
+        return json.dumps(instructions, ensure_ascii=False, indent=2)
+
+    def _build_sample(
+        self,
+        *,
+        document: ParsedDocument,
+        payload: dict[str, Any],
+        index: int,
+        job_name: str,
+    ) -> DraftQuestionSample:
+        """Convert one model output object into the internal draft sample model."""
+        chunk_lookup: dict[str, SourceChunk] = {item.chunk_id: item for item in document.source_chunks}
+        source_chunk_ids = [str(item).strip() for item in payload.get("source_chunk_ids") or [] if str(item).strip()]
+        chunks = [chunk_lookup[item] for item in source_chunk_ids if item in chunk_lookup]
+
+        section_path = chunks[0].section_path if chunks else ""
+        page_start = min((chunk.page_start for chunk in chunks), default=0)
+        page_end = max((chunk.page_end for chunk in chunks), default=0)
+        language = "zh" if any("\u4e00" <= char <= "\u9fff" for char in str(payload.get("question") or "")) else "en"
+        return DraftQuestionSample(
+            sample_id=f"{document.doc_id}-q{index}",
+            question=str(payload.get("question") or "").strip(),
+            ground_truth=str(payload.get("ground_truth") or "").strip(),
+            scenario=job_name,
+            language=language,
+            doc_id=document.doc_id,
+            doc_name=document.doc_name,
+            section_path=section_path,
+            page_start=page_start,
+            page_end=page_end,
+            source_chunk_ids=source_chunk_ids,
+            question_type=str(payload.get("question_type", "fact")).strip() or "fact",
+            difficulty=str(payload.get("difficulty", "medium")).strip() or "medium",
+        )
+
+    @staticmethod
+    def _parse_response_payload(content: str) -> list[dict[str, Any]]:
+        """Parse the model response into a list of sample payload dictionaries."""
+        try:
+            payload = json.loads(content or "{}")
+        except json.JSONDecodeError as exc:
+            raise ValueError("Question generator returned invalid JSON.") from exc
+
+        if not isinstance(payload, dict):
+            raise ValueError("Question generator response must be a JSON object.")
+        samples = payload.get("samples") or []
+        if not isinstance(samples, list):
+            raise ValueError("Question generator response field 'samples' must be a list.")
+
+        normalized_samples: list[dict[str, Any]] = []
+        for item in samples:
+            if isinstance(item, dict):
+                normalized_samples.append(item)
+        return normalized_samples
+
+    def generate(
+        self,
+        document: ParsedDocument,
+        *,
+        max_questions: int,
+        max_chunks_per_question: int,
+        job_name: str,
+    ) -> list[DraftQuestionSample]:
+        """Generate draft questions for one parsed document."""
+        prompt = self._build_prompt(
+            document,
+            max_questions=max_questions,
+            max_chunks_per_question=max_chunks_per_question,
+        )
+        response = self.client.chat.completions.create(
+            model=self.model,
+            messages=[
+                {"role": "system", "content": "You generate structured draft question banks from source documents."},
+                {"role": "user", "content": prompt},
+            ],
+            response_format={"type": "json_object"},
+        )
+        content = response.choices[0].message.content or "{}"
+        payload = self._parse_response_payload(content)
+        return [
+            self._build_sample(document=document, payload=item, index=index, job_name=job_name)
+            for index, item in enumerate(payload[:max_questions], start=1)
+        ]
--- a/rag_eval/dataset_builder/generator/validators.py
+++ b/rag_eval/dataset_builder/generator/validators.py
@@ -0,0 +1,87 @@
+"""Validation and deduplication helpers for generated draft question samples."""
+
+from __future__ import annotations
+
+import re
+from difflib import SequenceMatcher
+
+from rag_eval.dataset_builder.models import DraftQuestionSample, ParsedDocument
+
+
+ALLOWED_QUESTION_TYPES = {"fact", "summary", "procedure", "comparison"}
+ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}
+
+
+def validate_draft_sample(
+    sample: DraftQuestionSample,
+    *,
+    document: ParsedDocument,
+    max_source_chunks_per_question: int | None = None,
+) -> list[str]:
+    """Validate one generated sample against the document and enum constraints."""
+    errors: list[str] = []
+    if not sample.question.strip():
+        errors.append("question is empty")
+    if not sample.ground_truth.strip():
+        errors.append("ground_truth is empty")
+    if not sample.source_chunk_ids:
+        errors.append("source_chunk_ids is empty")
+    if (
+        max_source_chunks_per_question is not None
+        and len(sample.source_chunk_ids) > max_source_chunks_per_question
+    ):
+        errors.append(
+            f"source_chunk_ids exceeds limit: {len(sample.source_chunk_ids)} > {max_source_chunks_per_question}"
+        )
+
+    existing_chunk_ids = {chunk.chunk_id for chunk in document.source_chunks}
+    for chunk_id in sample.source_chunk_ids:
+        if chunk_id not in existing_chunk_ids:
+            errors.append(f"unknown source chunk: {chunk_id}")
+
+    if sample.doc_id != document.doc_id:
+        errors.append("sample doc_id does not match source document")
+    if sample.question_type not in ALLOWED_QUESTION_TYPES:
+        errors.append(f"unsupported question_type: {sample.question_type}")
+    if sample.difficulty not in ALLOWED_DIFFICULTIES:
+        errors.append(f"unsupported difficulty: {sample.difficulty}")
+    return errors
+
+
+def normalize_question_text(text: str) -> str:
+    """Normalize question text for exact-match deduplication."""
+    return re.sub(r"\s+", " ", text).strip().lower()
+
+
+def dedupe_samples(samples: list[DraftQuestionSample]) -> list[DraftQuestionSample]:
+    """Drop duplicate questions and enforce one output per chunk group per document."""
+    deduped: list[DraftQuestionSample] = []
+    seen_questions: set[tuple[str, str]] = set()
+    seen_chunk_groups: set[tuple[str, tuple[str, ...]]] = set()
+    seen_chunk_answers: list[tuple[str, tuple[str, ...], str]] = []
+
+    for sample in samples:
+        question_key = (sample.doc_id, normalize_question_text(sample.question))
+        if question_key in seen_questions:
+            continue
+
+        chunk_key = tuple(sample.source_chunk_ids)
+        chunk_group_key = (sample.doc_id, chunk_key)
+        if chunk_group_key in seen_chunk_groups:
+            continue
+        answer_key = normalize_question_text(sample.ground_truth)
+        duplicate = False
+        for existing_doc_id, existing_chunk_key, existing_answer in seen_chunk_answers:
+            if existing_doc_id != sample.doc_id or existing_chunk_key != chunk_key:
+                continue
+            if SequenceMatcher(None, existing_answer, answer_key).ratio() >= 0.9:
+                duplicate = True
+                break
+        if duplicate:
+            continue
+
+        seen_questions.add(question_key)
+        seen_chunk_groups.add(chunk_group_key)
+        seen_chunk_answers.append((sample.doc_id, chunk_key, answer_key))
+        deduped.append(sample)
+    return deduped
--- a/rag_eval/dataset_builder/models.py
+++ b/rag_eval/dataset_builder/models.py
@@ -0,0 +1,203 @@
+"""Internal data models for the PDF-to-dataset build workflow."""
+
+from __future__ import annotations
+
+from dataclasses import asdict, dataclass, field
+from pathlib import Path
+from typing import Any, Literal
+
+
+ReviewStatus = Literal["draft", "approved", "rejected", "needs_edit"]
+QuestionType = Literal["fact", "summary", "procedure", "comparison"]
+Difficulty = Literal["easy", "medium", "hard"]
+FailureMode = Literal["fail", "skip"]
+
+
+@dataclass(slots=True)
+class DatasetBuildRuntime:
+    """Runtime controls for one dataset build job."""
+
+    max_documents: int | None = None
+
+
+@dataclass(slots=True)
+class DatasetBuildJob:
+    """Resolved dataset build configuration consumed by the build runner."""
+
+    job_name: str
+    input_path: Path
+    input_glob: str
+    parser_provider: str
+    failure_mode: FailureMode
+    generation_model: str
+    output_type: str
+    review_mode: str
+    max_questions_per_document: int
+    max_source_chunks_per_question: int
+    dataset_path: Path
+    artifact_dir: Path
+    runtime: DatasetBuildRuntime = field(default_factory=DatasetBuildRuntime)
+    source_path: Path | None = None
+
+    def snapshot(self) -> dict[str, Any]:
+        """Serialize the job into JSON-friendly metadata."""
+        payload = asdict(self)
+        payload["input_path"] = self.input_path.as_posix()
+        payload["dataset_path"] = self.dataset_path.as_posix()
+        payload["artifact_dir"] = self.artifact_dir.as_posix()
+        if self.source_path is not None:
+            payload["source_path"] = self.source_path.as_posix()
+        return payload
+
+
+@dataclass(slots=True)
+class StructureNode:
+    """One normalized structure heading extracted from layout results."""
+
+    node_id: str
+    level: int
+    title: str
+    page_start: int
+    page_end: int
+    section_path: str
+
+
+@dataclass(slots=True)
+class SemanticBlock:
+    """One merged semantic block used as an intermediate artifact before chunking."""
+
+    block_id: str
+    doc_id: str
+    doc_name: str
+    text: str
+    page_start: int
+    page_end: int
+    section_path: str
+    section_title: str
+    source_layout_ids: list[str]
+
+    def to_record(self) -> dict[str, Any]:
+        """Convert the block into a flat artifact record."""
+        return asdict(self)
+
+
+@dataclass(slots=True)
+class SourceChunk:
+    """Evidence chunk used for question generation and human review."""
+
+    chunk_id: str
+    doc_id: str
+    doc_name: str
+    text: str
+    page_start: int
+    page_end: int
+    section_path: str
+    section_title: str
+    source_layout_ids: list[str]
+
+    def to_record(self) -> dict[str, Any]:
+        """Convert the chunk into a flat artifact record."""
+        return asdict(self)
+
+
+@dataclass(slots=True)
+class ParsedDocument:
+    """Normalized parsed document ready for question generation."""
+
+    doc_id: str
+    doc_name: str
+    raw_text: str
+    structure_nodes: list[StructureNode]
+    semantic_blocks: list[SemanticBlock]
+    source_chunks: list[SourceChunk]
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    def to_record(self) -> dict[str, Any]:
+        """Convert the parsed document into a summary artifact record."""
+        return {
+            "doc_id": self.doc_id,
+            "doc_name": self.doc_name,
+            "raw_text": self.raw_text,
+            "structure_nodes": [asdict(item) for item in self.structure_nodes],
+            "metadata": self.metadata,
+            "semantic_block_count": len(self.semantic_blocks),
+            "source_chunk_count": len(self.source_chunks),
+        }
+
+
+@dataclass(slots=True)
+class DraftQuestionSample:
+    """One draft online evaluation sample pending manual review."""
+
+    sample_id: str
+    question: str
+    ground_truth: str
+    scenario: str
+    language: str
+    doc_id: str
+    doc_name: str
+    section_path: str
+    page_start: int
+    page_end: int
+    source_chunk_ids: list[str]
+    question_type: QuestionType
+    difficulty: Difficulty
+    review_status: ReviewStatus = "draft"
+    review_notes: str = ""
+
+    def to_record(self) -> dict[str, Any]:
+        """Convert the draft sample into a flat CSV row."""
+        return {
+            "sample_id": self.sample_id,
+            "question": self.question,
+            "ground_truth": self.ground_truth,
+            "scenario": self.scenario,
+            "language": self.language,
+            "doc_id": self.doc_id,
+            "doc_name": self.doc_name,
+            "section_path": self.section_path,
+            "page_start": self.page_start,
+            "page_end": self.page_end,
+            "source_chunk_ids": self.source_chunk_ids,
+            "question_type": self.question_type,
+            "difficulty": self.difficulty,
+            "review_status": self.review_status,
+            "review_notes": self.review_notes,
+        }
+
+
+@dataclass(slots=True)
+class ParseFailure:
+    """One document parse failure recorded for reporting and skip-mode execution."""
+
+    file_path: str
+    error: str
+
+    def to_record(self) -> dict[str, str]:
+        """Convert the failure into a flat CSV row."""
+        return asdict(self)
+
+
+@dataclass(slots=True)
+class DatasetBuildArtifactPaths:
+    """Canonical file paths produced by one dataset build run."""
+
+    root_dir: Path
+    documents_jsonl: Path
+    semantic_blocks_jsonl: Path
+    source_chunks_jsonl: Path
+    dataset_draft_csv: Path
+    parse_failures_csv: Path
+    metadata_json: Path
+
+
+@dataclass(slots=True)
+class DatasetBuildResult:
+    """Aggregate result object returned after a dataset build completes."""
+
+    job: DatasetBuildJob
+    run_id: str
+    artifact_paths: DatasetBuildArtifactPaths
+    documents: list[ParsedDocument]
+    draft_samples: list[DraftQuestionSample]
+    parse_failures: list[ParseFailure]
--- a/rag_eval/dataset_builder/offline_converter.py
+++ b/rag_eval/dataset_builder/offline_converter.py
@@ -0,0 +1,78 @@
+"""Utilities for converting draft online datasets into offline smoke-test datasets."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from rag_eval.shared.utils import ensure_directory
+
+
+def _load_jsonl(path: Path) -> list[dict[str, Any]]:
+    """Load a JSONL file into a list of dictionaries."""
+    rows: list[dict[str, Any]] = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            text = line.strip()
+            if not text:
+                continue
+            rows.append(json.loads(text))
+    return rows
+
+
+def build_offline_smoke_dataset(
+    *,
+    draft_dataset_path: Path,
+    source_chunks_path: Path,
+    output_path: Path,
+) -> Path:
+    """Derive an offline-evaluable dataset by reusing ground truth as answer and chunk text as contexts."""
+    draft_frame = pd.read_csv(draft_dataset_path)
+    chunk_rows = _load_jsonl(source_chunks_path)
+    chunk_lookup = {str(row["chunk_id"]): row for row in chunk_rows}
+
+    output_rows: list[dict[str, Any]] = []
+    for _, row in draft_frame.iterrows():
+        chunk_ids = row.get("source_chunk_ids")
+        if isinstance(chunk_ids, str):
+            parsed_chunk_ids = json.loads(chunk_ids)
+        elif isinstance(chunk_ids, list):
+            parsed_chunk_ids = chunk_ids
+        else:
+            parsed_chunk_ids = []
+
+        contexts = [
+            str(chunk_lookup[chunk_id]["text"]).strip()
+            for chunk_id in parsed_chunk_ids
+            if chunk_id in chunk_lookup and str(chunk_lookup[chunk_id]["text"]).strip()
+        ]
+        ground_truth = "" if pd.isna(row.get("ground_truth")) else str(row.get("ground_truth")).strip()
+        output_rows.append(
+            {
+                "sample_id": row.get("sample_id", ""),
+                "question": row.get("question", ""),
+                "contexts": json.dumps(contexts, ensure_ascii=False),
+                "answer": ground_truth,
+                "ground_truth": ground_truth,
+                "scenario": row.get("scenario", ""),
+                "language": row.get("language", ""),
+                "retrieval_config": "offline-smoke-from-pdf-build",
+                "doc_id": row.get("doc_id", ""),
+                "doc_name": row.get("doc_name", ""),
+                "section_path": row.get("section_path", ""),
+                "page_start": row.get("page_start", ""),
+                "page_end": row.get("page_end", ""),
+                "source_chunk_ids": row.get("source_chunk_ids", ""),
+                "question_type": row.get("question_type", ""),
+                "difficulty": row.get("difficulty", ""),
+                "review_status": row.get("review_status", ""),
+                "review_notes": row.get("review_notes", ""),
+            }
+        )
+
+    ensure_directory(output_path.parent)
+    pd.DataFrame(output_rows).to_csv(output_path, index=False)
+    return output_path
--- a/rag_eval/dataset_builder/parser/__init__.py
+++ b/rag_eval/dataset_builder/parser/__init__.py
@@ -0,0 +1,7 @@
+"""Parser integrations and layout normalization helpers for dataset build jobs."""
+
+from .aliyun_document_parser import AliyunDocumentParser
+from .aliyun_docmind_gateway import AliyunDocmindGateway
+from .aliyun_layout_normalizer import normalize_layouts
+
+__all__ = ["AliyunDocumentParser", "AliyunDocmindGateway", "normalize_layouts"]
--- a/rag_eval/dataset_builder/parser/aliyun_docmind_gateway.py
+++ b/rag_eval/dataset_builder/parser/aliyun_docmind_gateway.py
@@ -0,0 +1,202 @@
+"""Gateway abstraction for Alibaba Cloud document parsing workflows."""
+
+from __future__ import annotations
+
+import time
+from pathlib import Path
+from typing import Any
+
+try:
+    from alibabacloud_docmind_api20220711 import models as docmind_models
+    from alibabacloud_docmind_api20220711.client import Client as DocmindClient
+    from alibabacloud_tea_openapi import models as openapi_models
+    from alibabacloud_tea_util import models as runtime_models
+except ImportError:
+    # Keep Alibaba SDK optional so offline flows and tests can import this module.
+    DocmindClient = None
+    docmind_models = None
+    openapi_models = None
+    runtime_models = None
+
+try:
+    from alibabacloud_credentials.client import Client as CredentialClient
+except ImportError:
+    CredentialClient = None
+
+from rag_eval.settings import EvaluationSettings
+
+
+class AliyunDocmindGateway:
+    """Thin gateway interface around the external Alibaba document parser service."""
+
+    def __init__(self, settings: EvaluationSettings):
+        """Store parser-related settings needed by the gateway implementation."""
+        self.settings = settings
+        self._client = None
+        self._models = None
+        self._runtime_models = None
+
+    def _load_sdk(self) -> tuple[Any, Any, Any, Any]:
+        """Return the optionally imported Alibaba SDK modules, raising ImportError if any are missing."""
+        if (
+            DocmindClient is None
+            or openapi_models is None
+            or docmind_models is None
+            or runtime_models is None
+        ):
+            raise ImportError(
+                "Alibaba Cloud Docmind SDK is not installed. "
+                "Install alibabacloud-docmind-api20220711, "
+                "alibabacloud-tea-openapi, alibabacloud-tea-util, and "
+                "alibabacloud-credentials."
+            )
+        return DocmindClient, openapi_models, docmind_models, runtime_models
+
+    def _resolve_credentials(self) -> tuple[str, str]:
+        """Resolve AccessKey credentials from settings or the Alibaba credentials client."""
+        if self.settings.alibaba_access_key_id and self.settings.alibaba_access_key_secret:
+            return self.settings.alibaba_access_key_id, self.settings.alibaba_access_key_secret
+
+        if CredentialClient is None:
+            raise ImportError(
+                "Alibaba Cloud credentials SDK is not installed and no explicit "
+                "ALIBABA_ACCESS_KEY_ID / ALIBABA_ACCESS_KEY_SECRET were provided."
+            )
+
+        credential_client = CredentialClient()
+        credential = credential_client.get_credential()
+        return credential.get_access_key_id(), credential.get_access_key_secret()
+
+    def _init_client(self) -> Any:
+        """Create and cache the underlying Alibaba SDK client."""
+        if self._client is not None:
+            return self._client
+
+        client_class, openapi_models, docmind_models, runtime_models = self._load_sdk()
+        access_key_id, access_key_secret = self._resolve_credentials()
+        endpoint = (self.settings.alibaba_endpoint or "docmind-api.cn-hangzhou.aliyuncs.com").strip()
+        config = openapi_models.Config(
+            access_key_id=access_key_id,
+            access_key_secret=access_key_secret,
+        )
+        config.endpoint = endpoint
+        config.region_id = "cn-hangzhou"
+        config.type = "access_key"
+
+        self._client = client_class(config)
+        self._models = docmind_models
+        self._runtime_models = runtime_models
+        return self._client
+
+    @staticmethod
+    def _to_plain_dict(value: Any) -> dict[str, Any]:
+        """Convert SDK response objects into ordinary dictionaries."""
+        if value is None:
+            return {}
+        if isinstance(value, dict):
+            return value
+        if hasattr(value, "to_map"):
+            return value.to_map()
+        if hasattr(value, "__dict__"):
+            return {
+                key: getattr(value, key)
+                for key in vars(value)
+                if not key.startswith("_")
+            }
+        return {}
+
+    @staticmethod
+    def _extract_layouts(payload: Any) -> list[dict[str, Any]]:
+        """Convert layout collections from SDK payloads into plain dictionaries."""
+        if payload is None:
+            return []
+        if isinstance(payload, dict):
+            layouts = payload.get("layouts") or payload.get("Layouts") or []
+        else:
+            layouts = getattr(payload, "layouts", None) or getattr(payload, "Layouts", None) or []
+        normalized: list[dict[str, Any]] = []
+        for item in layouts:
+            normalized.append(AliyunDocmindGateway._to_plain_dict(item))
+        return normalized
+
+    def submit_parse_task(self, pdf_path: Path) -> str:
+        """Submit one PDF parse task and return the remote task identifier."""
+        client = self._init_client()
+        runtime = self._runtime_models.RuntimeOptions()
+        file_name = pdf_path.name
+        with pdf_path.open("rb") as handle:
+            request = self._models.SubmitDocParserJobAdvanceRequest(
+                file_url_object=handle,
+                file_name=file_name,
+                file_name_extension=pdf_path.suffix.lstrip(".").lower() or "pdf",
+                llm_enhancement=self.settings.aliyun_llm_enhancement,
+                enhancement_mode=self.settings.aliyun_enhancement_mode,
+            )
+            response = client.submit_doc_parser_job_advance(request, runtime)
+
+        payload = self._to_plain_dict(getattr(getattr(response, "body", None), "data", None))
+        task_id = payload.get("id") or payload.get("Id")
+        if not task_id:
+            raise RuntimeError(f"Aliyun submit_doc_parser_job_advance returned no task id for {pdf_path.name}")
+        return str(task_id)
+
+    def get_task_status(self, task_id: str) -> dict[str, Any]:
+        """Fetch the current parse task status from the remote service."""
+        client = self._init_client()
+        request = self._models.QueryDocParserStatusRequest(id=task_id)
+        response = client.query_doc_parser_status(request)
+        payload = self._to_plain_dict(getattr(getattr(response, "body", None), "data", None))
+        status = payload.get("status") or payload.get("Status")
+        if status is not None and "status" not in payload:
+            payload["status"] = status
+        return payload
+
+    def fetch_layouts(self, task_id: str) -> list[dict[str, Any]]:
+        """Fetch normalized layout pages for a completed parse task."""
+        client = self._init_client()
+        layout_num = 0
+        layout_step_size = min(max(1, self.settings.aliyun_parse_layout_step_size), 3000)
+        collected: list[dict[str, Any]] = []
+
+        while True:
+            request = self._models.GetDocParserResultRequest(
+                id=task_id,
+                layout_step_size=layout_step_size,
+                layout_num=layout_num,
+            )
+            response = client.get_doc_parser_result(request)
+            payload = getattr(getattr(response, "body", None), "data", None)
+            layouts = self._extract_layouts(payload)
+            if not layouts:
+                break
+            collected.extend(layouts)
+            layout_num += len(layouts)
+            if len(layouts) < layout_step_size:
+                break
+        return collected
+
+    def parse_document(self, pdf_path: Path) -> dict[str, Any]:
+        """Run the submit/poll/fetch cycle and return a raw parse payload."""
+        task_id = self.submit_parse_task(pdf_path)
+        started_at = time.monotonic()
+        poll_interval = max(1, self.settings.aliyun_parse_poll_interval_seconds)
+        timeout_seconds = max(1, self.settings.aliyun_parse_timeout_seconds)
+
+        while True:
+            status = self.get_task_status(task_id)
+            state = str(status.get("status", "")).lower()
+            if state in {"succeeded", "success", "finished"}:
+                layouts = self.fetch_layouts(task_id)
+                return {
+                    "task_id": task_id,
+                    "status": state,
+                    "doc_id": status.get("doc_id") or pdf_path.stem,
+                    "doc_name": status.get("doc_name") or pdf_path.name,
+                    "layouts": layouts,
+                    "metadata": status,
+                }
+            if state in {"fail", "failed", "error"}:
+                raise RuntimeError(f"Aliyun parse task failed for {pdf_path.name}: {status}")
+            if time.monotonic() - started_at > timeout_seconds:
+                raise TimeoutError(f"Aliyun parse task timed out for {pdf_path.name}")
+            time.sleep(poll_interval)
--- a/rag_eval/dataset_builder/parser/aliyun_document_parser.py
+++ b/rag_eval/dataset_builder/parser/aliyun_document_parser.py
@@ -0,0 +1,38 @@
+"""Document parser that normalizes Alibaba layout results into internal models."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from rag_eval.dataset_builder.models import ParsedDocument
+
+from .aliyun_docmind_gateway import AliyunDocmindGateway
+from .aliyun_layout_normalizer import normalize_layouts
+
+
+class AliyunDocumentParser:
+    """Parse PDFs through the Alibaba gateway and normalize the returned layouts."""
+
+    def __init__(self, gateway: AliyunDocmindGateway):
+        """Store the gateway dependency used for remote parsing."""
+        self.gateway = gateway
+
+    def parse(self, pdf_path: Path) -> ParsedDocument:
+        """Parse one PDF file into a normalized parsed-document model."""
+        payload = self.gateway.parse_document(pdf_path)
+        layouts = payload.get("layouts") or []
+        if not layouts:
+            raise ValueError(f"No layouts returned for document: {pdf_path.name}")
+
+        document = normalize_layouts(
+            doc_id=str(payload.get("doc_id") or pdf_path.stem),
+            doc_name=str(payload.get("doc_name") or pdf_path.name),
+            layouts=list(layouts),
+        )
+        document.metadata.update(
+            {
+                "task_id": payload.get("task_id"),
+                "provider": "aliyun_docmind",
+            }
+        )
+        return document
--- a/rag_eval/dataset_builder/parser/aliyun_layout_normalizer.py
+++ b/rag_eval/dataset_builder/parser/aliyun_layout_normalizer.py
@@ -0,0 +1,181 @@
+"""Normalization helpers that convert raw layout results into source chunks."""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from rag_eval.dataset_builder.models import ParsedDocument, SemanticBlock, SourceChunk, StructureNode
+
+
+def _clean_text(value: Any) -> str:
+    """Normalize free-form layout text into a compact string."""
+    if value is None:
+        return ""
+    return re.sub(r"\s+", " ", str(value)).strip()
+
+
+def _is_catalog_entry(item_type: str, text: str) -> bool:
+    """Detect table-of-contents style entries that should be skipped."""
+    lowered = text.lower()
+    return item_type == "toc" or text.startswith("目录") or lowered.startswith("table of contents")
+
+
+def _flatten_table(item: dict[str, Any]) -> str:
+    """Convert a table layout node into a searchable plain-text representation."""
+    rows = item.get("rows") or []
+    flattened_rows: list[str] = []
+    for row in rows:
+        cells = [text for cell in row if (text := _clean_text(cell))]
+        if cells:
+            flattened_rows.append(" | ".join(cells))
+    return "\n".join(flattened_rows)
+
+
+def _split_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
+    """Split long text into overlapping windows so each chunk stays reviewable."""
+    if len(text) <= max_chars:
+        return [text]
+
+    windows: list[str] = []
+    start = 0
+    while start < len(text):
+        end = min(len(text), start + max_chars)
+        windows.append(text[start:end].strip())
+        if end >= len(text):
+            break
+        start = max(end - overlap, start + 1)
+    return [window for window in windows if window]
+
+
+def normalize_layouts(
+    *,
+    doc_id: str,
+    doc_name: str,
+    layouts: list[dict[str, Any]],
+    max_chunk_chars: int = 1200,
+    overlap_chars: int = 150,
+) -> ParsedDocument:
+    """Convert raw layouts into structure nodes, semantic blocks, and source chunks."""
+    structure_nodes: list[StructureNode] = []
+    semantic_blocks: list[SemanticBlock] = []
+    source_chunks: list[SourceChunk] = []
+    section_stack: list[tuple[int, str]] = []
+
+    current_block_text: list[str] = []
+    current_block_layout_ids: list[str] = []
+    current_page_start: int | None = None
+    current_page_end: int | None = None
+    current_section_path = ""
+    current_section_title = ""
+
+    def flush_block() -> None:
+        """Finalize the in-progress semantic block and emit source chunks."""
+        nonlocal current_block_text, current_block_layout_ids, current_page_start, current_page_end
+        nonlocal current_section_path, current_section_title
+
+        text = _clean_text(" ".join(current_block_text))
+        if not text or current_page_start is None or current_page_end is None:
+            current_block_text = []
+            current_block_layout_ids = []
+            current_page_start = None
+            current_page_end = None
+            return
+
+        block_id = f"{doc_id}-block-{len(semantic_blocks) + 1}"
+        block = SemanticBlock(
+            block_id=block_id,
+            doc_id=doc_id,
+            doc_name=doc_name,
+            text=text,
+            page_start=current_page_start,
+            page_end=current_page_end,
+            section_path=current_section_path,
+            section_title=current_section_title,
+            source_layout_ids=list(current_block_layout_ids),
+        )
+        semantic_blocks.append(block)
+
+        chunk_parts = _split_text(text, max_chars=max_chunk_chars, overlap=overlap_chars)
+        for index, part in enumerate(chunk_parts, start=1):
+            heading_prefix = current_section_title.strip()
+            chunk_text = f"{heading_prefix}\n{part}".strip() if heading_prefix and not part.startswith(heading_prefix) else part
+            source_chunks.append(
+                SourceChunk(
+                    chunk_id=f"{block_id}-chunk-{index}",
+                    doc_id=doc_id,
+                    doc_name=doc_name,
+                    text=chunk_text,
+                    page_start=current_page_start,
+                    page_end=current_page_end,
+                    section_path=current_section_path,
+                    section_title=current_section_title,
+                    source_layout_ids=list(current_block_layout_ids),
+                )
+            )
+
+        current_block_text = []
+        current_block_layout_ids = []
+        current_page_start = None
+        current_page_end = None
+
+    for index, item in enumerate(layouts, start=1):
+        item_type = str(item.get("type", "paragraph")).lower()
+        page = int(item.get("page", 1))
+        layout_id = str(item.get("layout_id") or f"layout-{index}")
+        level = int(item.get("level", 1))
+
+        if item_type == "table":
+            text = _flatten_table(item)
+        else:
+            text = _clean_text(item.get("text"))
+
+        if not text or _is_catalog_entry(item_type, text):
+            continue
+
+        if item_type == "heading":
+            flush_block()
+            while section_stack and section_stack[-1][0] >= level:
+                section_stack.pop()
+            section_stack.append((level, text))
+            section_titles = [title for _, title in section_stack]
+            current_section_title = text
+            current_section_path = " > ".join(section_titles)
+            structure_nodes.append(
+                StructureNode(
+                    node_id=f"{doc_id}-node-{len(structure_nodes) + 1}",
+                    level=level,
+                    title=text,
+                    page_start=page,
+                    page_end=page,
+                    section_path=current_section_path,
+                )
+            )
+            continue
+
+        if item_type == "caption":
+            text = f"图注: {text}"
+
+        if current_page_start is None:
+            current_page_start = page
+        current_page_end = page
+        current_block_text.append(text)
+        current_block_layout_ids.append(layout_id)
+
+    flush_block()
+    raw_text = "\n".join(chunk.text for chunk in source_chunks)
+    metadata = {
+        "layout_count": len(layouts),
+        "structure_node_count": len(structure_nodes),
+        "semantic_block_count": len(semantic_blocks),
+        "source_chunk_count": len(source_chunks),
+    }
+    return ParsedDocument(
+        doc_id=doc_id,
+        doc_name=doc_name,
+        raw_text=raw_text,
+        structure_nodes=structure_nodes,
+        semantic_blocks=semantic_blocks,
+        source_chunks=source_chunks,
+        metadata=metadata,
+    )
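Review note on `aliyun_layout_normalizer.py`: two small mechanisms carry most of the logic in `normalize_layouts`, the overlapping window split and the heading stack that produces `section_path`. The sketch below restates both as standalone functions so their behavior can be checked on tiny inputs. The names `split_text` and `section_paths` and the sample data are made up for this example; the loop bodies mirror the code in the diff.

```python
def split_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    """Split text into windows of at most max_chars that overlap by `overlap`."""
    if len(text) <= max_chars:
        return [text]
    windows: list[str] = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        windows.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step back by `overlap`, but always advance by at least one character.
        start = max(end - overlap, start + 1)
    return [window for window in windows if window]


def section_paths(headings: list[tuple[int, str]]) -> list[str]:
    """Return the 'A > B > C' path for each (level, title) heading in order."""
    stack: list[tuple[int, str]] = []
    paths: list[str] = []
    for level, title in headings:
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(" > ".join(name for _, name in stack))
    return paths


sample = "".join(str(i % 10) for i in range(25))
windows = split_text(sample, max_chars=10, overlap=3)

paths = section_paths(
    [(1, "Intro"), (2, "Scope"), (2, "Terms"), (1, "Design"), (3, "Details")]
)
```

With `max_chars=10` and `overlap=3`, each window repeats the last three characters of the previous one, which is what lets a sentence cut at a window boundary still appear whole in one of the two chunks.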
--- a/rag_eval/dataset_builder/runner.py
+++ b/rag_eval/dataset_builder/runner.py
@@ -0,0 +1,142 @@
+"""Orchestration layer for PDF-to-dataset build jobs."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import yaml
+
+from rag_eval.settings import EvaluationSettings
+from rag_eval.shared.utils import ensure_directory, utc_now_iso
+
+from .generator.question_generator import OpenAIQuestionGenerator, QuestionGenerator
+from .generator.validators import dedupe_samples, validate_draft_sample
+from .models import DatasetBuildJob, DatasetBuildResult, DatasetBuildRuntime, ParseFailure
+from .parser.aliyun_document_parser import AliyunDocumentParser
+from .parser.aliyun_docmind_gateway import AliyunDocmindGateway
+from .schema import DatasetBuildConfigModel
+from .sources import discover_pdf_files
+from .writers import build_artifact_paths, write_dataset_build_artifacts
+
+
+def load_dataset_build_job(path: str | Path, settings: EvaluationSettings | None = None) -> DatasetBuildJob:
+    """Load and validate a dataset build YAML file."""
+    settings = settings or EvaluationSettings()
+    config_path = Path(path).resolve()
+    payload = yaml.safe_load(config_path.read_text(encoding="utf-8")) or {}
+    model = DatasetBuildConfigModel.model_validate(payload)
+    base_dir = config_path.parent
+
+    generation_model = (
+        model.generation.model
+        or settings.dataset_generator_model
+        or "qwen3.6-plus"
+    )
+    parser_payload = payload.get("parser") or {}
+    failure_mode = parser_payload.get("failure_mode") or settings.parser_failure_mode or "fail"
+    return DatasetBuildJob(
+        job_name=model.job_name,
+        input_path=model.resolve_path(base_dir, model.input.path),
+        input_glob=model.input.glob,
+        parser_provider=model.parser.provider,
+        failure_mode=failure_mode,
+        generation_model=generation_model,
+        output_type=model.generation.output_type,
+        review_mode=model.generation.review_mode,
+        max_questions_per_document=model.generation.max_questions_per_document,
+        max_source_chunks_per_question=model.generation.max_source_chunks_per_question,
+        dataset_path=model.resolve_path(base_dir, model.output.dataset_path),
+        artifact_dir=model.resolve_path(base_dir, model.output.artifact_dir),
+        runtime=DatasetBuildRuntime(max_documents=model.runtime.max_documents),
+        source_path=config_path,
+    )
+
+
+def _create_parser(job: DatasetBuildJob, settings: EvaluationSettings) -> AliyunDocumentParser:
+    """Create the configured document parser implementation."""
+    if job.parser_provider != "aliyun_docmind":
+        raise ValueError(f"Unsupported parser provider: {job.parser_provider}")
+    gateway = AliyunDocmindGateway(settings)
+    return AliyunDocumentParser(gateway)
+
+
+def _create_generator(job: DatasetBuildJob, settings: EvaluationSettings) -> QuestionGenerator:
+    """Create the configured draft question generator implementation."""
+    return OpenAIQuestionGenerator(settings=settings, model=job.generation_model)
+
+
+def run_dataset_build(
+    config_path: str | Path,
+    *,
+    settings: EvaluationSettings | None = None,
+    parser: AliyunDocumentParser | None = None,
+    generator: QuestionGenerator | None = None,
+) -> DatasetBuildResult:
+    """Run one dataset build job end to end and persist all required artifacts."""
+    settings = settings or EvaluationSettings()
+    job = load_dataset_build_job(config_path, settings=settings)
+    pdf_files = discover_pdf_files(job.input_path, job.input_glob)
+    if job.runtime.max_documents is not None:
+        pdf_files = pdf_files[: job.runtime.max_documents]
+
+    parser = parser or _create_parser(job, settings)
+    generator = generator or _create_generator(job, settings)
+
+    run_id = utc_now_iso().replace(":", "-")
+    artifact_root = job.artifact_dir / run_id
+    ensure_directory(artifact_root)
+    artifact_paths = build_artifact_paths(artifact_root)
+
+    documents = []
+    failures: list[ParseFailure] = []
+    draft_samples = []
+
+    for pdf_path in pdf_files:
+        try:
+            document = parser.parse(pdf_path)
+        except Exception as exc:
+            failure = ParseFailure(file_path=pdf_path.as_posix(), error=str(exc))
+            failures.append(failure)
+            if job.failure_mode == "fail":
+                result = DatasetBuildResult(
+                    job=job,
+                    run_id=run_id,
+                    artifact_paths=artifact_paths,
+                    documents=documents,
+                    draft_samples=draft_samples,
+                    parse_failures=failures,
+                )
+                write_dataset_build_artifacts(result)
+                raise
+            continue
+
+        documents.append(document)
+        generated = generator.generate(
+            document,
+            max_questions=job.max_questions_per_document,
+            max_chunks_per_question=job.max_source_chunks_per_question,
+            job_name=job.job_name,
+        )
+        valid_generated = []
+        for sample in generated:
+            errors = validate_draft_sample(
+                sample,
+                document=document,
+                max_source_chunks_per_question=job.max_source_chunks_per_question,
+            )
+            if not errors:
+                valid_generated.append(sample)
+        draft_samples.extend(
+            dedupe_samples(valid_generated)[: job.max_questions_per_document]
+        )
+
+    result = DatasetBuildResult(
+        job=job,
+        run_id=run_id,
+        artifact_paths=artifact_paths,
+        documents=documents,
+        draft_samples=draft_samples,
+        parse_failures=failures,
+    )
+    write_dataset_build_artifacts(result)
+    return result
--- a/rag_eval/dataset_builder/schema.py
+++ b/rag_eval/dataset_builder/schema.py
@@ -0,0 +1,82 @@
+"""Pydantic schemas for dataset build YAML configuration files."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Literal
+
+from pydantic import BaseModel, ConfigDict, Field, model_validator
+
+
+class DatasetBuildInputModel(BaseModel):
+    """Schema for input PDF discovery settings."""
+
+    model_config = ConfigDict(extra="ignore")
+
+    path: str
+    glob: str = "*.pdf"
+
+
+class DatasetBuildParserModel(BaseModel):
+    """Schema for parser selection and failure handling."""
+
+    model_config = ConfigDict(extra="ignore")
+
+    provider: Literal["aliyun_docmind"]
+    failure_mode: Literal["fail", "skip"] | None = None
+
+
+class DatasetBuildGenerationModel(BaseModel):
+    """Schema for question generation controls."""
+
+    model_config = ConfigDict(extra="ignore")
+
+    model: str | None = None
+    output_type: Literal["online_question_bank"]
+    review_mode: Literal["draft_with_manual_review"]
+    max_questions_per_document: int = Field(default=10, gt=0)
+    max_source_chunks_per_question: int = Field(default=3, gt=0)
+
+
+class DatasetBuildOutputModel(BaseModel):
+    """Schema for dataset build output locations."""
+
+    model_config = ConfigDict(extra="ignore")
+
+    dataset_path: str
+    artifact_dir: str
+
+
+class DatasetBuildRuntimeModel(BaseModel):
+    """Schema for runtime limits; currently only a cap on processed documents."""
+
+    model_config = ConfigDict(extra="ignore")
+
+    max_documents: int | None = Field(default=None, gt=0)
+
+
+class DatasetBuildConfigModel(BaseModel):
+    """Top-level schema for a dataset build job."""
+
+    model_config = ConfigDict(extra="ignore")
+
+    job_name: str
+    input: DatasetBuildInputModel
+    parser: DatasetBuildParserModel
+    generation: DatasetBuildGenerationModel
+    output: DatasetBuildOutputModel
+    runtime: DatasetBuildRuntimeModel = Field(default_factory=DatasetBuildRuntimeModel)
+
+    @model_validator(mode="after")
+    def validate_job_name(self) -> "DatasetBuildConfigModel":
+        """Reject blank job names that would break artifact paths."""
+        if not self.job_name.strip():
+            raise ValueError("job_name must not be empty.")
+        return self
+
+    def resolve_path(self, base_dir: Path, raw_path: str) -> Path:
+        """Resolve relative paths against the config file directory."""
+        candidate = Path(raw_path)
+        if candidate.is_absolute():
+            return candidate
+        return (base_dir / candidate).resolve()
--- a/rag_eval/dataset_builder/sources.py
+++ b/rag_eval/dataset_builder/sources.py
@@ -0,0 +1,21 @@
+"""Input source discovery helpers for dataset build jobs."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+
+def discover_pdf_files(input_path: Path, pattern: str = "*.pdf") -> list[Path]:
+    """Return all PDF files from a single file path or a directory scan."""
+    if not input_path.exists():
+        raise FileNotFoundError(f"Input path does not exist: {input_path}")
+
+    if input_path.is_file():
+        if input_path.suffix.lower() != ".pdf":
+            raise ValueError(f"Input file is not a PDF: {input_path}")
+        return [input_path]
+
+    files = sorted(path for path in input_path.glob(pattern) if path.is_file() and path.suffix.lower() == ".pdf")
+    if not files:
+        raise ValueError(f"No PDF files found under {input_path} with pattern {pattern}")
+    return files
--- a/rag_eval/dataset_builder/writers.py
+++ b/rag_eval/dataset_builder/writers.py
@@ -0,0 +1,147 @@
+"""Artifact writers for dataset build runs."""
+
+from __future__ import annotations
+
+import csv
+import json
+import shutil
+from pathlib import Path
+from typing import Any
+
+from rag_eval.shared.utils import ensure_directory
+
+from .models import DatasetBuildArtifactPaths, DatasetBuildResult
+
+
+def build_artifact_paths(root_dir: Path) -> DatasetBuildArtifactPaths:
+    """Construct canonical output paths for one dataset build run."""
+    return DatasetBuildArtifactPaths(
+        root_dir=root_dir,
+        documents_jsonl=root_dir / "documents.jsonl",
+        semantic_blocks_jsonl=root_dir / "semantic_blocks.jsonl",
+        source_chunks_jsonl=root_dir / "source_chunks.jsonl",
+        dataset_draft_csv=root_dir / "dataset_draft.csv",
+        parse_failures_csv=root_dir / "parse_failures.csv",
+        metadata_json=root_dir / "metadata.json",
+    )
+
+
+def _write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
+    """Write a list of dictionaries as JSON Lines."""
+    with path.open("w", encoding="utf-8") as handle:
+        for row in rows:
+            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
+
+
+def _write_csv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str] | None = None) -> None:
+    """Write flat records into a CSV file, including list values as JSON strings."""
+    normalized_rows: list[dict[str, Any]] = []
+    resolved_fieldnames = list(fieldnames or [])
+    for row in rows:
+        normalized_row: dict[str, Any] = {}
+        for key, value in row.items():
+            if key not in resolved_fieldnames:
+                resolved_fieldnames.append(key)
+            if isinstance(value, list):
+                normalized_row[key] = json.dumps(value, ensure_ascii=False)
+            else:
+                normalized_row[key] = value
+        normalized_rows.append(normalized_row)
+
+    with path.open("w", encoding="utf-8", newline="") as handle:
+        writer = csv.DictWriter(handle, fieldnames=resolved_fieldnames or ["placeholder"])
+        writer.writeheader()
+        if normalized_rows:
+            writer.writerows(normalized_rows)
+
+
+def _write_latest_alias_assets(result: DatasetBuildResult) -> None:
+    """Publish stable alias files so sample scenarios can target the latest build output."""
+    latest_dir = result.job.artifact_dir / "latest"
+    ensure_directory(latest_dir)
+
+    # Keep the canonical run directory and also expose a stable entrypoint for tutorials.
+    shutil.copyfile(result.artifact_paths.source_chunks_jsonl, latest_dir / "source_chunks.jsonl")
+    shutil.copyfile(result.artifact_paths.dataset_draft_csv, latest_dir / "dataset_draft.csv")
+    shutil.copyfile(result.artifact_paths.metadata_json, latest_dir / "metadata.json")
+
+
+def write_dataset_build_artifacts(result: DatasetBuildResult) -> None:
+    """Persist dataset build outputs and metadata to disk."""
+    artifact_paths = result.artifact_paths
+    ensure_directory(artifact_paths.root_dir)
+    ensure_directory(result.job.dataset_path.parent)
+
+    _write_jsonl(artifact_paths.documents_jsonl, [item.to_record() for item in result.documents])
+    _write_jsonl(
+        artifact_paths.semantic_blocks_jsonl,
+        [block.to_record() for item in result.documents for block in item.semantic_blocks],
+    )
+    _write_jsonl(
+        artifact_paths.source_chunks_jsonl,
+        [chunk.to_record() for item in result.documents for chunk in item.source_chunks],
+    )
+
+    draft_rows = [sample.to_record() for sample in result.draft_samples]
+    _write_csv(
+        artifact_paths.dataset_draft_csv,
+        draft_rows,
+        fieldnames=[
+            "sample_id",
+            "question",
+            "ground_truth",
+            "scenario",
+            "language",
+            "doc_id",
+            "doc_name",
+            "section_path",
+            "page_start",
+            "page_end",
+            "source_chunk_ids",
+            "question_type",
+            "difficulty",
+            "review_status",
+            "review_notes",
+        ],
+    )
+    _write_csv(
+        result.job.dataset_path,
+        draft_rows,
+        fieldnames=[
+            "sample_id",
+            "question",
+            "ground_truth",
+            "scenario",
+            "language",
+            "doc_id",
+            "doc_name",
+            "section_path",
+            "page_start",
+            "page_end",
+            "source_chunk_ids",
+            "question_type",
+            "difficulty",
+            "review_status",
+            "review_notes",
+        ],
+    )
+    _write_csv(
+        artifact_paths.parse_failures_csv,
+        [item.to_record() for item in result.parse_failures],
+        fieldnames=["file_path", "error"],
+    )
+
+    metadata = {
+        "run_id": result.run_id,
+        "job": result.job.snapshot(),
+        "stats": {
+            "documents_processed": len(result.documents),
+            "draft_samples": len(result.draft_samples),
+            "parse_failures": len(result.parse_failures),
+        },
+    }
+    artifact_paths.metadata_json.write_text(
+        json.dumps(metadata, ensure_ascii=False, indent=2),
+        encoding="utf-8",
+    )
+    _write_latest_alias_assets(result)
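Review note on `_write_csv`: list values such as `source_chunk_ids` are stored as JSON strings inside a CSV cell, so any consumer has to decode them again. The sketch below shows that round trip with the standard library only. The sample row and the in-memory buffer are assumptions made for this example; the real writer targets files on disk.

```python
import csv
import io
import json

rows = [
    {
        "sample_id": "s1",
        "source_chunk_ids": ["doc-block-1-chunk-1", "doc-block-1-chunk-2"],
    }
]

# Write: encode list values as JSON strings, as _write_csv does.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["sample_id", "source_chunk_ids"])
writer.writeheader()
for row in rows:
    writer.writerow(
        {
            key: json.dumps(value, ensure_ascii=False) if isinstance(value, list) else value
            for key, value in row.items()
        }
    )

# Read: decode the JSON column back into a Python list.
buffer.seek(0)
loaded = [
    {**record, "source_chunk_ids": json.loads(record["source_chunk_ids"])}
    for record in csv.DictReader(buffer)
]
```

The CSV quoting rules take care of the commas and double quotes inside the JSON text, so the cell survives spreadsheet tools as long as reviewers do not edit that column by hand.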
--- a/rag_eval/execution/__init__.py
+++ b/rag_eval/execution/__init__.py
@@ -0,0 +1,5 @@
+"""Execution entrypoints for running evaluation scenarios."""
+
+from .runner import run_scenario
+
+__all__ = ["run_scenario"]
--- a/rag_eval/execution/concurrency.py
+++ b/rag_eval/execution/concurrency.py
@@ -0,0 +1,23 @@
+"""Async helpers for executing bounded concurrent workloads."""
+
+from __future__ import annotations
+
+import asyncio
+from typing import Awaitable, Callable, TypeVar
+
+T = TypeVar("T")
+
+
+async def gather_with_limit(
+    factories: list[Callable[[], Awaitable[T]]],
+    limit: int,
+) -> list[T]:
+    """Run async factory callables with a maximum concurrency limit."""
+    semaphore = asyncio.Semaphore(max(1, limit))
+
+    async def guarded(factory: Callable[[], Awaitable[T]]) -> T:
+        """Wrap one factory invocation with semaphore-based throttling."""
+        async with semaphore:
+            return await factory()
+
+    return await asyncio.gather(*(guarded(factory) for factory in factories))
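Review note on `gather_with_limit`: the helper takes zero-argument factories rather than coroutines so that each coroutine is only created once a semaphore slot is free. The sketch below copies the helper and measures peak concurrency with a counter. The `demo` and `work` functions and the counter are assumptions made for this example.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def gather_with_limit(
    factories: list[Callable[[], Awaitable[T]]],
    limit: int,
) -> list[T]:
    """Run async factory callables with a maximum concurrency limit."""
    semaphore = asyncio.Semaphore(max(1, limit))

    async def guarded(factory: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(factory) for factory in factories))


async def demo() -> tuple[list[int], int]:
    active = 0
    peak = 0

    async def work(value: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # record the highest number of concurrent calls
        await asyncio.sleep(0.01)
        active -= 1
        return value * 2

    # Bind `value` as a default argument so each lambda keeps its own item.
    factories = [(lambda value=value: work(value)) for value in range(6)]
    results = await gather_with_limit(factories, limit=2)
    return results, peak


results, peak = asyncio.run(demo())
```

`asyncio.gather` returns results in the order of its arguments, so the output order matches the input order even though only two calls run at a time.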
--- a/rag_eval/execution/errors.py
+++ b/rag_eval/execution/errors.py
@@ -0,0 +1,6 @@
+"""Custom exceptions raised during scenario execution."""
+
+class ScenarioExecutionError(RuntimeError):
+    """Raised when a scenario cannot be executed successfully."""
+
+    pass
--- a/rag_eval/execution/evaluator.py
+++ b/rag_eval/execution/evaluator.py
@@ -0,0 +1,125 @@
+"""Core evaluation workflow for offline and online scenarios."""
+
+from __future__ import annotations
+
+import asyncio
+from typing import Any
+
+from rag_eval.adapters.base import AppAdapter
+from rag_eval.datasets.loader import load_dataset_records
+from rag_eval.datasets.normalizers import normalize_records
+from rag_eval.execution.concurrency import gather_with_limit
+from rag_eval.metrics.pipeline import MetricPipeline
+from rag_eval.shared.models import EvaluationResult, InvalidSample, NormalizedSample, Scenario
+from rag_eval.shared.utils import utc_now_iso
+
+
+class Evaluator:
+    """Coordinate dataset loading, optional app execution, and metric scoring."""
+
+    def __init__(
+        self,
+        scenario: Scenario,
+        metric_pipeline: MetricPipeline,
+        app_adapter: AppAdapter | None = None,
+    ):
+        """Create an evaluator for one resolved scenario."""
+        self.scenario = scenario
+        self.metric_pipeline = metric_pipeline
+        self.app_adapter = app_adapter
+
+    def evaluate(self) -> EvaluationResult:
+        """Execute the full evaluation flow and return the collected results."""
+        started_at = utc_now_iso()
+        raw_records = load_dataset_records(self.scenario.dataset.path)
+        samples, invalid_samples = normalize_records(
+            raw_records,
+            mode=self.scenario.mode,
+            max_samples=self.scenario.runtime.max_samples,
+        )
+
+        if self.scenario.mode == "online":
+            # Online mode enriches each sample by calling the target application first.
+            samples, online_invalids = asyncio.run(self._enrich_online_samples(samples))
+            invalid_samples.extend(online_invalids)
+
+        metric_scores = asyncio.run(
+            self.metric_pipeline.score_samples(
+                samples,
+                max_concurrency=self.scenario.runtime.metric_limit(),
+            )
+        )
+        finished_at = utc_now_iso()
+        score_rows = [self._merge_score(sample, score) for sample, score in zip(samples, metric_scores)]
+        run_id = finished_at.replace(":", "-")
+        return EvaluationResult(
+            scenario=self.scenario,
+            run_id=run_id,
+            started_at=started_at,
+            finished_at=finished_at,
+            valid_samples=samples,
+            invalid_samples=invalid_samples,
+            score_rows=score_rows,
+        )
+
+    async def _enrich_online_samples(
+        self,
+        samples: list[NormalizedSample],
+    ) -> tuple[list[NormalizedSample], list[InvalidSample]]:
+        """Populate answers and contexts by calling the configured application adapter."""
+        if self.app_adapter is None:
+            raise ValueError("online mode requires an app adapter.")
+
+        valid: list[NormalizedSample] = []
+        invalid: list[InvalidSample] = []
+
+        async def enrich_with_capture(sample: NormalizedSample) -> NormalizedSample | InvalidSample:
+            """Convert adapter exceptions into invalid samples instead of aborting the run."""
+            try:
+                return await self.app_adapter.enrich_sample(sample)
+            except Exception as exc:
+                error_type = type(exc).__name__
+                return InvalidSample(
+                    sample_id=sample.sample_id,
+                    error=f"adapter failed [{error_type}]: {exc}",
+                    raw=sample.raw,
+                )
+
+        factories = [
+            (lambda sample=sample: enrich_with_capture(sample))
+            for sample in samples
+        ]
+        results = await gather_with_limit(factories, self.scenario.runtime.app_limit())
+
+        for sample in results:
+            if isinstance(sample, InvalidSample):
+                invalid.append(sample)
+                continue
+            # Treat incomplete adapter payloads as invalid so reporting stays explicit.
+            errors: list[str] = []
+            if not sample.answer:
+                errors.append("adapter returned empty answer")
+            if not sample.contexts:
+                errors.append("adapter returned empty contexts")
+            if errors:
+                invalid.append(
+                    InvalidSample(
+                        sample_id=sample.sample_id,
+                        error="; ".join(errors),
+                        raw=sample.raw,
+                    )
+                )
+                continue
+            valid.append(sample)
+        return valid, invalid
+
+    def _merge_score(self, sample: NormalizedSample, score: Any) -> dict[str, Any]:
+        """Combine sample data, metric results, and run metadata into one output row."""
+        record = sample.to_record()
+        record["contexts"] = sample.contexts
+        record.update(score.metrics)
+        record["error"] = score.error
+        record["judge_model"] = self.scenario.judge_model
+        record["embedding_model"] = self.scenario.embedding_model
+        record["run_id"] = self.scenario.scenario_name
+        return record
--- a/rag_eval/execution/runner.py
+++ b/rag_eval/execution/runner.py
@@ -0,0 +1,42 @@
+"""High-level scenario runner used by the package and CLI entrypoints."""
+
+from __future__ import annotations
+
+from rag_eval.adapters.http import HttpAppAdapter
+from rag_eval.adapters.python import PythonFunctionAdapter
+from rag_eval.config.loader import load_scenario
+from rag_eval.metrics.factory import build_metric_pipeline
+from rag_eval.reporting.writers import write_run_artifacts
+from rag_eval.settings import EvaluationSettings
+from rag_eval.shared.models import Scenario
+
+from .evaluator import Evaluator
+
+
+def build_adapter(scenario: Scenario) -> HttpAppAdapter | PythonFunctionAdapter | None:
+    """Instantiate the adapter required by the resolved scenario, if any."""
+    if scenario.app_adapter is None:
+        return None
+    if scenario.app_adapter.type == "http":
+        return HttpAppAdapter(scenario.app_adapter)
+    if scenario.app_adapter.type == "python":
+        return PythonFunctionAdapter(scenario.app_adapter)
+    raise ValueError(f"Unsupported adapter type: {scenario.app_adapter.type}")
+
+
+def run_scenario(
+    scenario_path: str,
+    settings: EvaluationSettings | None = None,
+):
+    """Run one scenario end to end and persist its reporting artifacts."""
+    settings = settings or EvaluationSettings()
+    if not settings.openai_api_key:
+        raise EnvironmentError("OPENAI_API_KEY must be set before running the evaluator.")
+
+    scenario = load_scenario(scenario_path)
+    adapter = build_adapter(scenario)
+    pipeline = build_metric_pipeline(scenario, settings)
+    evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline, app_adapter=adapter)
+    result = evaluator.evaluate()
+    write_run_artifacts(result)
+    return result
--- a/rag_eval/metrics/__init__.py
+++ b/rag_eval/metrics/__init__.py
@@ -0,0 +1,5 @@
+"""Metric pipeline construction helpers."""
+
+from .factory import build_metric_pipeline
+
+__all__ = ["build_metric_pipeline"]
--- a/rag_eval/metrics/factory.py
+++ b/rag_eval/metrics/factory.py
@@ -0,0 +1,59 @@
+"""Factories for OpenAI-backed RAGAS models and metric pipelines."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from openai import AsyncOpenAI
+
+from rag_eval.compat import ensure_ragas_import_compat
+from rag_eval.settings import EvaluationSettings
+from rag_eval.shared.models import Scenario
+
+ensure_ragas_import_compat()
+
+from ragas.embeddings.base import embedding_factory
+from ragas.llms import llm_factory
+from ragas.metrics.collections import (
+    AnswerRelevancy,
+    ContextPrecision,
+    ContextRecall,
+    Faithfulness,
+)
+
+from .pipeline import MetricPipeline
+
+
+def build_models(
+    judge_model: str,
+    embedding_model: str,
+    settings: EvaluationSettings,
+) -> tuple[Any, Any]:
+    """Create the LLM and embedding clients required by the selected RAGAS metrics."""
+    client = AsyncOpenAI(**settings.openai_client_kwargs)
+    llm = llm_factory(judge_model, client=client)
+    embeddings = embedding_factory(provider="openai", model=embedding_model, client=client)
+    return llm, embeddings
+
+
+def build_metric_pipeline(
+    scenario: Scenario,
+    settings: EvaluationSettings,
+) -> MetricPipeline:
+    """Build a metric pipeline containing only the metrics requested by the scenario."""
+    llm, embeddings = build_models(
+        scenario.judge_model,
+        scenario.embedding_model,
+        settings,
+    )
+    # Build the full registry once, then slice it by configured metric names.
+    registry: dict[str, Any] = {
+        "faithfulness": Faithfulness(llm=llm),
+        "answer_relevancy": AnswerRelevancy(llm=llm, embeddings=embeddings),
+        "context_recall": ContextRecall(llm=llm),
+        "context_precision": ContextPrecision(llm=llm),
+    }
+    return MetricPipeline(
+        metrics={name: registry[name] for name in scenario.metrics},
+        metric_timeout_seconds=settings.ragas_metric_timeout_seconds,
+    )
--- a/rag_eval/metrics/pipeline.py
+++ b/rag_eval/metrics/pipeline.py
@@ -0,0 +1,82 @@
+"""Execution pipeline for scoring normalized samples with RAGAS metrics."""
+
+from __future__ import annotations
+
+import asyncio
+import math
+from dataclasses import dataclass
+from typing import Any
+
+from rag_eval.shared.models import MetricScore, NormalizedSample
+
+
+@dataclass(slots=True)
+class MetricPipeline:
+    """Score one or many normalized samples against a configured metric set."""
+
+    metrics: dict[str, Any]
+    metric_timeout_seconds: float | None = None
+
+    async def score_sample(self, sample: NormalizedSample) -> MetricScore:
+        """Score a single sample and capture metric-level failures without aborting."""
+        results = {name: math.nan for name in self.metrics}
+        errors: list[str] = []
+
+        for name, metric in self.metrics.items():
+            try:
+                result = await self._run_metric(name, metric, sample)
+                results[name] = float(result.value)
+            except Exception as exc:
+                errors.append(f"{name}: {type(exc).__name__}: {exc}")
+        return MetricScore(metrics=results, error=" | ".join(errors))
+
+    async def _run_metric(self, name: str, metric: Any, sample: NormalizedSample) -> Any:
+        """Dispatch one metric call with the argument shape expected by that metric."""
+        timeout = None
+        if self.metric_timeout_seconds is not None:
+            timeout = max(1.0, float(self.metric_timeout_seconds))
+
+        if name == "faithfulness":
+            coroutine = metric.ascore(
+                user_input=sample.question,
+                response=sample.answer,
+                retrieved_contexts=sample.contexts,
+            )
+        elif name == "answer_relevancy":
+            coroutine = metric.ascore(
+                user_input=sample.question,
+                response=sample.answer,
+            )
+        elif name == "context_recall":
+            coroutine = metric.ascore(
+                user_input=sample.question,
+                retrieved_contexts=sample.contexts,
+                reference=sample.ground_truth,
+            )
+        elif name == "context_precision":
+            coroutine = metric.ascore(
+                user_input=sample.question,
+                reference=sample.ground_truth,
+                retrieved_contexts=sample.contexts,
+            )
+        else:
+            raise ValueError(f"Unsupported metric: {name}")
+
+        if timeout is None:
+            return await coroutine
+        return await asyncio.wait_for(coroutine, timeout=timeout)
+
+    async def score_samples(
+        self,
+        samples: list[NormalizedSample],
+        max_concurrency: int,
+    ) -> list[MetricScore]:
+        """Score all samples while respecting the configured concurrency limit."""
+        semaphore = asyncio.Semaphore(max(1, max_concurrency))
+
+        async def guarded(sample: NormalizedSample) -> MetricScore:
+            """Throttle a single sample-scoring coroutine with the shared semaphore."""
+            async with semaphore:
+                return await self.score_sample(sample)
+
+        return await asyncio.gather(*(guarded(sample) for sample in samples))
--- a/rag_eval/metrics/registry.py
+++ b/rag_eval/metrics/registry.py
@@ -0,0 +1,8 @@
+"""Supported metric names recognized by scenario validation and pipeline setup."""
+
+SUPPORTED_METRICS = {
+    "faithfulness",
+    "answer_relevancy",
+    "context_recall",
+    "context_precision",
+}
--- a/rag_eval/reporting/__init__.py
+++ b/rag_eval/reporting/__init__.py
@@ -0,0 +1,5 @@
+"""Reporting helpers that write evaluation outputs to disk."""
+
+from .writers import write_run_artifacts
+
+__all__ = ["write_run_artifacts"]
--- a/rag_eval/reporting/artifacts.py
+++ b/rag_eval/reporting/artifacts.py
@@ -0,0 +1,20 @@
+"""Helpers for deriving file-system paths for run artifacts."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from rag_eval.shared.models import RunArtifactPaths
+
+
+def build_artifact_paths(output_dir: Path, run_id: str) -> RunArtifactPaths:
+    """Build the canonical artifact file paths for a single evaluation run."""
+    run_dir = output_dir / run_id
+    return RunArtifactPaths(
+        root_dir=run_dir,
+        scenario_snapshot=run_dir / "scenario.snapshot.yaml",
+        scores_csv=run_dir / "scores.csv",
+        invalid_csv=run_dir / "invalid.csv",
+        summary_md=run_dir / "summary.md",
+        metadata_json=run_dir / "metadata.json",
+    )
--- a/rag_eval/reporting/summary.py
+++ b/rag_eval/reporting/summary.py
@@ -0,0 +1,78 @@
+"""Markdown summary generation for completed evaluation runs."""
+
+from __future__ import annotations
+
+import math
+
+import pandas as pd
+
+from rag_eval.shared.models import EvaluationResult
+
+
+def _table_from_frame(frame: pd.DataFrame) -> str:
+    """Render a small dataframe as a fixed-width markdown-friendly text table."""
+    if frame.empty:
+        return "No rows."
+
+    columns = list(frame.columns)
+    rows = [[str(value) for value in row] for row in frame.astype(object).values.tolist()]
+    widths = []
+    for index, column in enumerate(columns):
+        column_width = len(str(column))
+        row_width = max((len(row[index]) for row in rows), default=0)
+        widths.append(max(column_width, row_width))
+
+    header = " | ".join(str(column).ljust(widths[idx]) for idx, column in enumerate(columns))
+    separator = "-|-".join("-" * widths[idx] for idx in range(len(columns)))
+    body = [
+        " | ".join(row[idx].ljust(widths[idx]) for idx in range(len(columns)))
+        for row in rows
+    ]
+    return "\n".join([header, separator, *body])
+
+
+def build_summary_markdown(result: EvaluationResult) -> str:
+    """Build the human-readable markdown summary written for each evaluation run."""
+    total = len(result.valid_samples) + len(result.invalid_samples)
+    scores = pd.DataFrame(result.score_rows)
+
+    lines = [
+        f"# {result.scenario.scenario_name}",
+        "",
+        f"- run_id: `{result.run_id}`",
+        f"- mode: `{result.scenario.mode}`",
+        f"- total_samples: `{total}`",
+        f"- valid_samples: `{len(result.valid_samples)}`",
+        f"- invalid_samples: `{len(result.invalid_samples)}`",
+        f"- judge_model: `{result.scenario.judge_model}`",
+        f"- embedding_model: `{result.scenario.embedding_model}`",
+        "",
+        "## Metric Means",
+        "",
+    ]
+
+    if scores.empty:
+        lines.append("No valid samples were scored.")
+        return "\n".join(lines) + "\n"
+
+    for metric in result.scenario.metrics:
+        mean_value = scores[metric].mean(numeric_only=True)
+        if isinstance(mean_value, float) and not math.isnan(mean_value):
+            lines.append(f"- {metric}: `{mean_value:.4f}`")
+        else:
+            lines.append(f"- {metric}: `n/a`")
+
+    # Keep the summary self-sufficient by including every scored sample and its errors.
+    detail_columns = ["sample_id", *result.scenario.metrics, "error"]
+    detail = scores[detail_columns]
+    lines.extend(
+        [
+            "",
+            "## Per-sample Scores",
+            "",
+            "```text",
+            _table_from_frame(detail),
+            "```",
+        ]
+    )
+    return "\n".join(lines) + "\n"
--- a/rag_eval/reporting/writers.py
+++ b/rag_eval/reporting/writers.py
@@ -0,0 +1,52 @@
+"""Writers that persist evaluation outputs as local run artifacts."""
+
+from __future__ import annotations
+
+import json
+
+import pandas as pd
+import yaml
+
+from rag_eval.reporting.artifacts import build_artifact_paths
+from rag_eval.reporting.summary import build_summary_markdown
+from rag_eval.shared.models import EvaluationResult
+from rag_eval.shared.utils import ensure_directory
+
+
+def write_run_artifacts(result: EvaluationResult) -> None:
+    """Write all standard run artifacts for a completed evaluation result."""
+    artifact_paths = build_artifact_paths(result.scenario.output_dir, result.run_id)
+    ensure_directory(artifact_paths.root_dir)
+
+    artifact_paths.scenario_snapshot.write_text(
+        yaml.safe_dump(result.scenario.snapshot(), sort_keys=False, allow_unicode=True),
+        encoding="utf-8",
+    )
+
+    pd.DataFrame(result.score_rows).to_csv(artifact_paths.scores_csv, index=False)
+    pd.DataFrame(
+        [sample.to_record() for sample in result.invalid_samples]
+    ).to_csv(artifact_paths.invalid_csv, index=False)
+
+    artifact_paths.summary_md.write_text(
+        build_summary_markdown(result),
+        encoding="utf-8",
+    )
+
+    # Keep a compact machine-readable summary alongside the larger CSV and markdown outputs.
+    metadata = {
+        "run_id": result.run_id,
+        "scenario_name": result.scenario.scenario_name,
+        "mode": result.scenario.mode,
+        "judge_model": result.scenario.judge_model,
+        "embedding_model": result.scenario.embedding_model,
+        "started_at": result.started_at,
+        "finished_at": result.finished_at,
+        "dataset": result.scenario.dataset.path.as_posix(),
+        "valid_samples": len(result.valid_samples),
+        "invalid_samples": len(result.invalid_samples),
+    }
+    artifact_paths.metadata_json.write_text(
+        json.dumps(metadata, ensure_ascii=False, indent=2),
+        encoding="utf-8",
+    )
--- a/rag_eval/sample_rag_eval_dataset.csv
+++ b/rag_eval/sample_rag_eval_dataset.csv
@@ -0,0 +1,3 @@
+sample_id,question,contexts,answer,ground_truth,scenario,language,retrieval_config
+leave-policy-001,How many annual leave days does an employee with 6 years of service receive?,"[""Employees with 1 to 9 completed years of service receive 5 days of annual leave."",""Employees with 10 to 19 completed years of service receive 10 days of annual leave.""]","An employee with 6 years of service receives 5 annual leave days.","Employees with 1 to 9 completed years of service receive 5 annual leave days.",policy,en,"top_k=2;chunk_size=300"
+leave-policy-002,入职满12年的员工年假有几天？,"[""员工入司满1年不满10年的，年休假5天。"", ""员工入司满10年不满20年的，年休假10天。""]","根据规定，入职满12年的员工有10天年假。","员工入司满10年不满20年的，年休假10天。",policy,zh,"top_k=2;chunk_size=300"
--- a/rag_eval/settings.py
+++ b/rag_eval/settings.py
@@ -0,0 +1,68 @@
+"""Runtime settings loaded from environment variables for evaluation runs."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from pydantic import Field
+from pydantic_settings import BaseSettings, SettingsConfigDict
+
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+
+class EvaluationSettings(BaseSettings):
+    """Application settings shared by the CLI, adapters, and metric pipeline."""
+    model_config = SettingsConfigDict(
+        env_file=REPO_ROOT / ".env",
+        env_file_encoding="utf-8",
+        extra="ignore",
+    )
+
+    openai_api_key: str | None = Field(default=None, alias="OPENAI_API_KEY")
+    openai_base_url: str = Field(default="http://6.86.80.4:30080/v1", alias="OPENAI_BASE_URL")
+    ragas_judge_model: str = Field(default="deepseek-v4-flash", alias="RAGAS_JUDGE_MODEL")
+    ragas_embedding_model: str = Field(
+        default="text-embedding-v3",
+        alias="RAGAS_EMBEDDING_MODEL",
+    )
+    openai_timeout_seconds: float = Field(default=30.0, alias="OPENAI_TIMEOUT_SECONDS")
+    ragas_metric_timeout_seconds: float = Field(default=45.0, alias="RAGAS_METRIC_TIMEOUT_SECONDS")
+    batch_size: int = Field(default=8, alias="BATCH_SIZE")
+    alibaba_access_key_id: str | None = Field(default=None, alias="ALIBABA_ACCESS_KEY_ID")
+    alibaba_access_key_secret: str | None = Field(default=None, alias="ALIBABA_ACCESS_KEY_SECRET")
+    alibaba_endpoint: str | None = Field(default=None, alias="ALIBABA_ENDPOINT")
+    aliyun_parse_poll_interval_seconds: int = Field(
+        default=5,
+        alias="ALIYUN_PARSE_POLL_INTERVAL_SECONDS",
+    )
+    aliyun_parse_timeout_seconds: int = Field(
+        default=600,
+        alias="ALIYUN_PARSE_TIMEOUT_SECONDS",
+    )
+    aliyun_parse_layout_step_size: int = Field(
+        default=50,
+        alias="ALIYUN_PARSE_LAYOUT_STEP_SIZE",
+    )
+    aliyun_llm_enhancement: bool = Field(default=False, alias="ALIYUN_LLM_ENHANCEMENT")
+    aliyun_enhancement_mode: str = Field(default="balanced", alias="ALIYUN_ENHANCEMENT_MODE")
+    document_parse_artifact_prefix: str = Field(
+        default="outputs/dataset-builds",
+        alias="DOCUMENT_PARSE_ARTIFACT_PREFIX",
+    )
+    parser_failure_mode: str = Field(default="fail", alias="PARSER_FAILURE_MODE")
+    dataset_generator_model: str | None = Field(default=None, alias="DATASET_GENERATOR_MODEL")
+
+    @property
+    def openai_client_kwargs(self) -> dict[str, str | float]:
+        """Return keyword arguments for the OpenAI client when credentials are available."""
+        if not self.openai_api_key:
+            return {}
+
+        client_kwargs: dict[str, str | float] = {
+            "api_key": self.openai_api_key,
+            "timeout": max(1.0, float(self.openai_timeout_seconds)),
+        }
+        if self.openai_base_url.strip():
+            client_kwargs["base_url"] = self.openai_base_url.strip()
+        return client_kwargs
--- a/rag_eval/shared/__init__.py
+++ b/rag_eval/shared/__init__.py
@@ -0,0 +1,25 @@
+"""Shared data models and utilities used across evaluation subsystems."""
+
+from .models import (
+    AppAdapterConfig,
+    DatasetConfig,
+    EvaluationResult,
+    InvalidSample,
+    MetricScore,
+    NormalizedSample,
+    RunArtifactPaths,
+    RuntimeConfig,
+    Scenario,
+)
+
+__all__ = [
+    "AppAdapterConfig",
+    "DatasetConfig",
+    "EvaluationResult",
+    "InvalidSample",
+    "MetricScore",
+    "NormalizedSample",
+    "RunArtifactPaths",
+    "RuntimeConfig",
+    "Scenario",
+]
--- a/rag_eval/shared/models.py
+++ b/rag_eval/shared/models.py
@@ -0,0 +1,161 @@
+"""Shared runtime data models exchanged across the evaluation pipeline."""
+
+from __future__ import annotations
+
+from dataclasses import asdict, dataclass, field
+from pathlib import Path
+from typing import Any, Literal
+
+
+Mode = Literal["offline", "online"]
+AdapterType = Literal["http", "python"]
+
+
+def _serialize_paths(value: Any) -> Any:
+    """Convert Path instances nested inside snapshot payloads into POSIX strings."""
+    if isinstance(value, Path):
+        return value.as_posix()
+    if isinstance(value, dict):
+        return {key: _serialize_paths(item) for key, item in value.items()}
+    if isinstance(value, list):
+        return [_serialize_paths(item) for item in value]
+    return value
+
+
+@dataclass(slots=True)
+class RuntimeConfig:
+    """Concurrency and sampling controls for one evaluation run."""
+
+    batch_size: int = 4
+    app_concurrency: int | None = None
+    metric_concurrency: int | None = None
+    max_samples: int | None = None
+
+    def metric_limit(self) -> int:
+        """Return the effective metric-scoring concurrency limit."""
+        return self.metric_concurrency or self.batch_size
+
+    def app_limit(self) -> int:
+        """Return the effective application-call concurrency limit."""
+        return self.app_concurrency or self.batch_size
+
+
+@dataclass(slots=True)
+class AppAdapterConfig:
+    """Resolved adapter configuration used by online scenarios."""
+
+    type: AdapterType
+    endpoint: str | None = None
+    method: str = "POST"
+    timeout_seconds: int = 30
+    callable: str | None = None
+    request_template: dict[str, Any] = field(default_factory=dict)
+    response_mapping: dict[str, str] = field(default_factory=dict)
+    static_kwargs: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass(slots=True)
+class DatasetConfig:
+    """Dataset location information for a scenario."""
+
+    path: Path
+    format: str | None = None
+
+
+@dataclass(slots=True)
+class Scenario:
+    """Resolved evaluation scenario consumed by the execution pipeline."""
+
+    scenario_name: str
+    mode: Mode
+    dataset: DatasetConfig
+    judge_model: str
+    embedding_model: str
+    metrics: list[str]
+    output_dir: Path
+    runtime: RuntimeConfig = field(default_factory=RuntimeConfig)
+    app_adapter: AppAdapterConfig | None = None
+    source_path: Path | None = None
+
+    def snapshot(self) -> dict[str, Any]:
+        """Serialize the scenario into a reporting-friendly dictionary snapshot."""
+        return _serialize_paths(asdict(self))
+
+
+@dataclass(slots=True)
+class NormalizedSample:
+    """Canonical sample shape used by adapters, metrics, and reporting."""
+
+    sample_id: str
+    question: str
+    contexts: list[str]
+    answer: str
+    ground_truth: str
+    scenario: str = ""
+    language: str = ""
+    retrieval_config: str = ""
+    metadata: dict[str, Any] = field(default_factory=dict)
+    raw: dict[str, Any] = field(default_factory=dict)
+
+    def to_record(self) -> dict[str, Any]:
+        """Convert the sample into a flat record for CSV and artifact generation."""
+        record = {
+            "sample_id": self.sample_id,
+            "question": self.question,
+            "contexts": self.contexts,
+            "answer": self.answer,
+            "ground_truth": self.ground_truth,
+            "scenario": self.scenario,
+            "language": self.language,
+            "retrieval_config": self.retrieval_config,
+        }
+        record.update(self.metadata)
+        return record
+
+
+@dataclass(slots=True)
+class InvalidSample:
+    """A dataset or adapter sample that could not be evaluated."""
+
+    sample_id: str
+    error: str
+    raw: dict[str, Any]
+
+    def to_record(self) -> dict[str, Any]:
+        """Convert the invalid sample into a flat reporting row."""
+        record = {"sample_id": self.sample_id, "error": self.error}
+        record.update(self.raw)
+        return record
+
+
+@dataclass(slots=True)
+class MetricScore:
+    """Metric values and accumulated errors for one evaluated sample."""
+
+    metrics: dict[str, float | None]
+    error: str = ""
+
+
+@dataclass(slots=True)
+class EvaluationResult:
+    """Aggregate result object returned after a scenario completes."""
+
+    scenario: Scenario
+    run_id: str
+    started_at: str
+    finished_at: str
+    valid_samples: list[NormalizedSample]
+    invalid_samples: list[InvalidSample]
+    score_rows: list[dict[str, Any]]
+
+
+@dataclass(slots=True)
+class RunArtifactPaths:
+    """Canonical file-system paths for all artifacts produced by one run."""
+
+    root_dir: Path
+    scenario_snapshot: Path
+    scores_csv: Path
+    invalid_csv: Path
+    summary_md: Path
+    metadata_json: Path
--- a/rag_eval/shared/utils.py
+++ b/rag_eval/shared/utils.py
@@ -0,0 +1,49 @@
+"""General-purpose helpers shared across configuration, datasets, and reporting."""
+
+from __future__ import annotations
+
+import ast
+import json
+import math
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+
+def utc_now_iso() -> str:
+    """Return the current UTC timestamp in ISO 8601 format."""
+    return datetime.now(timezone.utc).isoformat()
+
+
+def ensure_directory(path: Path) -> None:
+    """Create a directory path if it does not already exist."""
+    path.mkdir(parents=True, exist_ok=True)
+
+
+def parse_contexts(value: Any) -> list[str]:
+    """Normalize a context payload into a list of non-empty strings."""
+    if isinstance(value, list):
+        return [str(item).strip() for item in value if str(item).strip()]
+    if value is None or (isinstance(value, float) and math.isnan(value)):
+        return []
+
+    text = str(value).strip()
+    if not text:
+        return []
+
+    # Accept serialized lists from CSV exports before falling back to plain text.
+    for parser in (json.loads, ast.literal_eval):
+        try:
+            parsed = parser(text)
+        except (ValueError, SyntaxError, TypeError):
+            continue
+        if isinstance(parsed, list):
+            return [str(item).strip() for item in parsed if str(item).strip()]
+
+    # Preserve paragraph-style context dumps by splitting on blank lines first.
+    if "\n\n" in text:
+        chunks = [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]
+        if chunks:
+            return chunks
+
+    return [text]
--- a/scenarios/dataset_build/real-multi-pdf-build.yaml
+++ b/scenarios/dataset_build/real-multi-pdf-build.yaml
@@ -0,0 +1,17 @@
+job_name: real-multi-pdf-question-bank
+input:
+  path: ../../datasets/raw/pdfs
+  glob: "*.pdf"
+parser:
+  provider: aliyun_docmind
+  failure_mode: fail
+generation:
+  output_type: online_question_bank
+  review_mode: draft_with_manual_review
+  max_questions_per_document: 4
+  max_source_chunks_per_question: 3
+output:
+  dataset_path: ../../datasets/raw/generated/real-multi-pdf-question-bank.csv
+  artifact_dir: ../../outputs/dataset-builds/real-multi-pdf-question-bank
+runtime:
+  max_documents: 3
--- a/scenarios/dataset_build/real-pdf-build.yaml
+++ b/scenarios/dataset_build/real-pdf-build.yaml
@@ -0,0 +1,17 @@
+job_name: real-pdf-question-bank
+input:
+  path: ../../datasets/raw/pdfs
+  glob: "*.pdf"
+parser:
+  provider: aliyun_docmind
+  failure_mode: fail
+generation:
+  output_type: online_question_bank
+  review_mode: draft_with_manual_review
+  max_questions_per_document: 5
+  max_source_chunks_per_question: 3
+output:
+  dataset_path: ../../datasets/raw/generated/real-pdf-question-bank.csv
+  artifact_dir: ../../outputs/dataset-builds/real-pdf-question-bank
+runtime:
+  max_documents: 1
--- a/scenarios/dataset_build/sample-pdf-build.yaml
+++ b/scenarios/dataset_build/sample-pdf-build.yaml
@@ -0,0 +1,18 @@
+job_name: sample-pdf-question-bank
+input:
+  path: ../../datasets/raw/pdfs
+  glob: "*.pdf"
+parser:
+  provider: aliyun_docmind
+  failure_mode: fail
+generation:
+  model: qwen3.6-plus
+  output_type: online_question_bank
+  review_mode: draft_with_manual_review
+  max_questions_per_document: 10
+  max_source_chunks_per_question: 3
+output:
+  dataset_path: ../../datasets/raw/generated/sample-pdf-question-bank.csv
+  artifact_dir: ../../outputs/dataset-builds/sample-pdf-question-bank
+runtime:
+  max_documents: 20
--- a/scenarios/offline/real-pdf-offline-smoke.yaml
+++ b/scenarios/offline/real-pdf-offline-smoke.yaml
@@ -0,0 +1,15 @@
+scenario_name: real-pdf-offline-smoke
+mode: offline
+app_adapter: null
+dataset: ../../datasets/normalized/real_multi_pdf_offline_smoke.csv
+judge_model: deepseek-v4-flash
+embedding_model: text-embedding-v3
+metrics:
+  - faithfulness
+  - answer_relevancy
+  - context_recall
+  - context_precision
+output_dir: ../../outputs/real-pdf-offline-smoke
+runtime:
+  batch_size: 4
+  max_samples: 6
--- a/scenarios/offline/sample-offline.yaml
+++ b/scenarios/offline/sample-offline.yaml
@@ -0,0 +1,15 @@
+scenario_name: sample-offline-baseline
+mode: offline
+app_adapter: null
+dataset: ../../datasets/normalized/sample_offline_rag_eval.csv
+judge_model: deepseek-v4-flash
+embedding_model: text-embedding-v3
+metrics:
+  - faithfulness
+  - answer_relevancy
+  - context_recall
+  - context_precision
+output_dir: ../../outputs/sample-offline-baseline
+runtime:
+  batch_size: 4
+  max_samples: 3
--- a/scenarios/offline/sample-pdf-offline-smoke.yaml
+++ b/scenarios/offline/sample-pdf-offline-smoke.yaml
@@ -0,0 +1,15 @@
+scenario_name: sample-pdf-offline-smoke
+mode: offline
+app_adapter: null
+dataset: ../../datasets/normalized/sample_pdf_offline_smoke.csv
+judge_model: deepseek-v4-flash
+embedding_model: text-embedding-v3
+metrics:
+  - faithfulness
+  - answer_relevancy
+  - context_recall
+  - context_precision
+output_dir: ../../outputs/sample-pdf-offline-smoke
+runtime:
+  batch_size: 4
+  max_samples: 3
--- a/scenarios/online/sample-pdf-question-bank-online.yaml
+++ b/scenarios/online/sample-pdf-question-bank-online.yaml
@@ -0,0 +1,22 @@
+scenario_name: sample-pdf-question-bank-online
+mode: online
+dataset: ../../datasets/raw/generated/sample-pdf-question-bank.csv
+judge_model: deepseek-v4-pro
+embedding_model: text-embedding-v3
+metrics:
+  - faithfulness
+  - answer_relevancy
+  - context_recall
+  - context_precision
+output_dir: ../../outputs/online/sample-pdf-question-bank
+runtime:
+  batch_size: 2
+  app_concurrency: 2
+  metric_concurrency: 2
+  max_samples: 45
+app_adapter:
+  type: python
+  callable: apps.pdf_question_bank.adapter:run
+  static_kwargs:
+    source_chunks_path: ../../outputs/dataset-builds/sample-pdf-question-bank/latest/source_chunks.jsonl
+    model: deepseek-v4-flash
--- a/tests/test_dataset_build.py
+++ b/tests/test_dataset_build.py
@@ -0,0 +1,779 @@
+import csv
+import json
+import shutil
+import unittest
+from pathlib import Path
+from unittest import mock
+
+from pydantic import ValidationError
+
+from rag_eval.dataset_builder.generator.question_generator import OpenAIQuestionGenerator
+from rag_eval.dataset_builder.generator.validators import dedupe_samples, validate_draft_sample
+from rag_eval.dataset_builder.models import DraftQuestionSample, ParsedDocument, SourceChunk
+from rag_eval.dataset_builder.parser.aliyun_document_parser import AliyunDocumentParser
+from rag_eval.dataset_builder.parser.aliyun_docmind_gateway import AliyunDocmindGateway
+from rag_eval.dataset_builder.parser.aliyun_layout_normalizer import normalize_layouts
+from rag_eval.dataset_builder.runner import load_dataset_build_job, run_dataset_build
+from rag_eval.dataset_builder.schema import DatasetBuildConfigModel
+from rag_eval.dataset_builder.sources import discover_pdf_files
+from rag_eval.settings import EvaluationSettings
+
+
+class FakeParser:
+    def __init__(self, documents_by_name, failures=None):
+        self.documents_by_name = documents_by_name
+        self.failures = failures or set()
+
+    def parse(self, pdf_path: Path):
+        if pdf_path.name in self.failures:
+            raise RuntimeError(f"parse failed for {pdf_path.name}")
+        return self.documents_by_name[pdf_path.name]
+
+
+class FakeGenerator:
+    def __init__(self, outputs_by_doc_id):
+        self.outputs_by_doc_id = outputs_by_doc_id
+
+    def generate(self, document, *, max_questions, max_chunks_per_question, job_name):
+        return list(self.outputs_by_doc_id.get(document.doc_id, []))
+
+
+class FakeGateway(AliyunDocmindGateway):
+    def __init__(self, settings, *, statuses=None, layouts=None):
+        super().__init__(settings)
+        self.statuses = list(statuses or [])
+        self.layouts = list(layouts or [])
+
+    def submit_parse_task(self, pdf_path: Path) -> str:
+        return "task-1"
+
+    def get_task_status(self, task_id: str):
+        if self.statuses:
+            return self.statuses.pop(0)
+        return {"status": "succeeded", "doc_id": "doc-1", "doc_name": "doc1.pdf"}
+
+    def fetch_layouts(self, task_id: str):
+        return list(self.layouts)
+
+
+class DatasetBuildTests(unittest.TestCase):
+    def setUp(self) -> None:
+        root = Path("tests/.tmp").resolve()
+        root.mkdir(parents=True, exist_ok=True)
+        self.temp_dir = root / self._testMethodName
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+        self.temp_dir.mkdir(parents=True, exist_ok=True)
+        self.input_dir = self.temp_dir / "pdfs"
+        self.input_dir.mkdir(parents=True, exist_ok=True)
+        (self.input_dir / "doc1.pdf").write_bytes(b"%PDF-1.4 doc1")
+        (self.input_dir / "doc2.pdf").write_bytes(b"%PDF-1.4 doc2")
+
+        self.config_path = self.temp_dir / "dataset-build.yaml"
+        self.config_path.write_text(
+            "\n".join(
+                [
+                    "job_name: sample-build",
+                    "input:",
+                    f"  path: {self.input_dir.as_posix()}",
+                    "  glob: '*.pdf'",
+                    "parser:",
+                    "  provider: aliyun_docmind",
+                    "  failure_mode: skip",
+                    "generation:",
+                    "  output_type: online_question_bank",
+                    "  review_mode: draft_with_manual_review",
+                    "  max_questions_per_document: 3",
+                    "  max_source_chunks_per_question: 2",
+                    "output:",
+                    f"  dataset_path: {(self.temp_dir / 'generated' / 'draft.csv').as_posix()}",
+                    f"  artifact_dir: {(self.temp_dir / 'outputs').as_posix()}",
+                    "runtime:",
+                    "  max_documents: 2",
+                ]
+            ),
+            encoding="utf-8",
+        )
+
+    def tearDown(self) -> None:
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+
+    def _make_document(self, doc_id: str, doc_name: str) -> ParsedDocument:
+        chunk = SourceChunk(
+            chunk_id=f"{doc_id}-chunk-1",
+            doc_id=doc_id,
+            doc_name=doc_name,
+            text="Section content for review.",
+            page_start=1,
+            page_end=2,
+            section_path="Chapter 1 > Scope",
+            section_title="Scope",
+            source_layout_ids=["layout-1"],
+        )
+        return ParsedDocument(
+            doc_id=doc_id,
+            doc_name=doc_name,
+            raw_text=chunk.text,
+            structure_nodes=[],
+            semantic_blocks=[],
+            source_chunks=[chunk],
+            metadata={},
+        )
+
+    def test_load_dataset_build_job_resolves_paths_and_defaults(self) -> None:
+        settings = EvaluationSettings.model_construct(dataset_generator_model="env-model")
+        job = load_dataset_build_job(self.config_path, settings=settings)
+        self.assertEqual(job.job_name, "sample-build")
+        self.assertEqual(job.generation_model, "env-model")
+        self.assertTrue(job.dataset_path.is_absolute())
+        self.assertEqual(job.failure_mode, "skip")
+
+    def test_load_dataset_build_job_prefers_yaml_generation_model(self) -> None:
+        config_path = self.temp_dir / "dataset-build-with-model.yaml"
+        config_path.write_text(
+            self.config_path.read_text(encoding="utf-8").replace(
+                "generation:\n",
+                "generation:\n  model: yaml-model\n",
+            ),
+            encoding="utf-8",
+        )
+        settings = EvaluationSettings.model_construct(dataset_generator_model="env-model")
+        job = load_dataset_build_job(config_path, settings=settings)
+        self.assertEqual(job.generation_model, "yaml-model")
+
+    def test_load_dataset_build_job_uses_env_default_failure_mode(self) -> None:
+        config_path = self.temp_dir / "dataset-build-without-failure-mode.yaml"
+        config_path.write_text(
+            self.config_path.read_text(encoding="utf-8").replace("  failure_mode: skip\n", ""),
+            encoding="utf-8",
+        )
+        settings = EvaluationSettings.model_construct(
+            dataset_generator_model="env-model",
+            parser_failure_mode="skip",
+        )
+        job = load_dataset_build_job(config_path, settings=settings)
+        self.assertEqual(job.failure_mode, "skip")
+
+    def test_discover_pdf_files_rejects_missing_or_empty_input(self) -> None:
+        with self.assertRaises(FileNotFoundError):
+            discover_pdf_files(self.temp_dir / "missing")
+
+        empty_dir = self.temp_dir / "empty"
+        empty_dir.mkdir()
+        with self.assertRaises(ValueError):
+            discover_pdf_files(empty_dir)
+
+    def test_discover_pdf_files_accepts_single_pdf_file(self) -> None:
+        pdf_path = self.input_dir / "doc1.pdf"
+        files = discover_pdf_files(pdf_path)
+        self.assertEqual(files, [pdf_path])
+
+    def test_dataset_build_schema_rejects_missing_required_fields(self) -> None:
+        with self.assertRaises(ValidationError):
+            DatasetBuildConfigModel.model_validate(
+                {
+                    "job_name": "sample-build",
+                    "parser": {"provider": "aliyun_docmind"},
+                    "generation": {
+                        "output_type": "online_question_bank",
+                        "review_mode": "draft_with_manual_review",
+                    },
+                    "output": {
+                        "dataset_path": "draft.csv",
+                        "artifact_dir": "outputs",
+                    },
+                }
+            )
+
+    def test_dataset_build_schema_rejects_invalid_enums(self) -> None:
+        with self.assertRaises(ValidationError):
+            DatasetBuildConfigModel.model_validate(
+                {
+                    "job_name": "sample-build",
+                    "input": {"path": self.input_dir.as_posix()},
+                    "parser": {"provider": "other-provider", "failure_mode": "ignore"},
+                    "generation": {
+                        "output_type": "other-output",
+                        "review_mode": "auto_publish",
+                    },
+                    "output": {
+                        "dataset_path": "draft.csv",
+                        "artifact_dir": "outputs",
+                    },
+                }
+            )
+
+    def test_normalize_layouts_applies_core_rules(self) -> None:
+        layouts = [
+            {"type": "toc", "text": "目录", "page": 1, "layout_id": "toc-1"},  # "Table of contents"
+            {"type": "heading", "text": "第一章 总则", "page": 2, "layout_id": "h1", "level": 1},  # "Chapter 1 General Provisions"
+            {"type": "paragraph", "text": "第一段。", "page": 2, "layout_id": "p1"},  # "First paragraph."
+            {"type": "caption", "text": "系统示意图", "page": 2, "layout_id": "c1"},  # "System diagram"
+            {
+                "type": "table",
+                "rows": [["字段", "说明"], ["名称", "项目名称"]],  # [["Field", "Description"], ["Name", "Project name"]]
+                "page": 3,
+                "layout_id": "t1",
+            },
+        ]
+        document = normalize_layouts(doc_id="doc-1", doc_name="sample.pdf", layouts=layouts, max_chunk_chars=80, overlap_chars=10)
+        self.assertEqual(len(document.structure_nodes), 1)
+        self.assertEqual(document.structure_nodes[0].section_path, "第一章 总则")  # "Chapter 1 General Provisions"
+        self.assertEqual(len(document.semantic_blocks), 1)
+        self.assertIn("图注:", document.semantic_blocks[0].text)  # "Figure caption:" prefix
+        self.assertIn("字段 | 说明", document.semantic_blocks[0].text)  # "Field | Description" table row
+        self.assertEqual(document.source_chunks[0].page_start, 2)
+        self.assertEqual(document.source_chunks[0].page_end, 3)
+
+    def test_normalize_layouts_splits_long_text_into_multiple_chunks(self) -> None:
+        long_text = "A" * 220
+        layouts = [
+            {"type": "heading", "text": "Chapter 1", "page": 1, "layout_id": "h1", "level": 1},
+            {"type": "paragraph", "text": long_text, "page": 1, "layout_id": "p1"},
+        ]
+        document = normalize_layouts(
+            doc_id="doc-1",
+            doc_name="sample.pdf",
+            layouts=layouts,
+            max_chunk_chars=100,
+            overlap_chars=20,
+        )
+        self.assertGreaterEqual(len(document.source_chunks), 3)
+        self.assertTrue(all(chunk.section_title == "Chapter 1" for chunk in document.source_chunks))
+
+    def test_validate_and_dedupe_generated_samples(self) -> None:
+        document = self._make_document("doc-1", "doc1.pdf")
+        valid = DraftQuestionSample(
+            sample_id="doc-1-q1",
+            question="这份文档的范围是什么？",  # "What is the scope of this document?"
+            ground_truth="文档说明了适用范围。",  # "The document describes the applicable scope."
+            scenario="sample-build",
+            language="zh",
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            section_path="Chapter 1 > Scope",
+            page_start=1,
+            page_end=2,
+            source_chunk_ids=["doc-1-chunk-1"],
+            question_type="summary",
+            difficulty="easy",
+        )
+        invalid = DraftQuestionSample(
+            sample_id="doc-1-q2",
+            question="",
+            ground_truth="",
+            scenario="sample-build",
+            language="zh",
+            doc_id="doc-2",
+            doc_name="doc1.pdf",
+            section_path="",
+            page_start=0,
+            page_end=0,
+            source_chunk_ids=["missing-chunk"],
+            question_type="invalid",
+            difficulty="invalid",
+        )
+        duplicate = DraftQuestionSample(
+            sample_id="doc-1-q3",
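+            question="  这份文档的范围是什么？ ",  # same question as `valid`, padded with whitespace
+            ground_truth="文档说明了适用范围",  # same answer as `valid`, without the trailing full stop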
+            scenario="sample-build",
+            language="zh",
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            section_path="Chapter 1 > Scope",
+            page_start=1,
+            page_end=2,
+            source_chunk_ids=["doc-1-chunk-1"],
+            question_type="summary",
+            difficulty="easy",
+        )
+        self.assertEqual(validate_draft_sample(valid, document=document), [])
+        self.assertTrue(validate_draft_sample(invalid, document=document))
+        self.assertEqual(len(dedupe_samples([valid, duplicate])), 1)
+
+    def test_validate_rejects_too_many_source_chunks(self) -> None:
+        document = self._make_document("doc-1", "doc1.pdf")
+        document.source_chunks.append(
+            SourceChunk(
+                chunk_id="doc-1-chunk-2",
+                doc_id="doc-1",
+                doc_name="doc1.pdf",
+                text="More content",
+                page_start=2,
+                page_end=3,
+                section_path="Chapter 1 > Scope",
+                section_title="Scope",
+                source_layout_ids=["layout-2"],
+            )
+        )
+        sample = DraftQuestionSample(
+            sample_id="doc-1-q1",
+            question="What is the scope?",
+            ground_truth="It defines scope.",
+            scenario="sample-build",
+            language="en",
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            section_path="Chapter 1 > Scope",
+            page_start=1,
+            page_end=3,
+            source_chunk_ids=["doc-1-chunk-1", "doc-1-chunk-2"],
+            question_type="fact",
+            difficulty="easy",
+        )
+        errors = validate_draft_sample(
+            sample,
+            document=document,
+            max_source_chunks_per_question=1,
+        )
+        self.assertTrue(any("exceeds limit" in error for error in errors))
+
+    def test_dedupe_keeps_only_one_question_per_chunk_group(self) -> None:
+        sample_a = DraftQuestionSample(
+            sample_id="doc-1-q1",
+            question="What is the scope?",
+            ground_truth="It defines the scope.",
+            scenario="sample-build",
+            language="en",
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            section_path="Chapter 1 > Scope",
+            page_start=1,
+            page_end=2,
+            source_chunk_ids=["doc-1-chunk-1"],
+            question_type="fact",
+            difficulty="easy",
+        )
+        sample_b = DraftQuestionSample(
+            sample_id="doc-1-q2",
+            question="How is the scope described?",
+            ground_truth="The scope is described in the first section.",
+            scenario="sample-build",
+            language="en",
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            section_path="Chapter 1 > Scope",
+            page_start=1,
+            page_end=2,
+            source_chunk_ids=["doc-1-chunk-1"],
+            question_type="summary",
+            difficulty="medium",
+        )
+        self.assertEqual(len(dedupe_samples([sample_a, sample_b])), 1)
+
+    def test_aliyun_gateway_parse_success_failure_and_timeout(self) -> None:
+        settings = EvaluationSettings.model_construct(
+            aliyun_parse_poll_interval_seconds=1,
+            aliyun_parse_timeout_seconds=1,
+        )
+        pdf_path = self.input_dir / "doc1.pdf"
+
+        success_gateway = FakeGateway(
+            settings,
+            statuses=[{"status": "running"}, {"status": "succeeded", "doc_id": "doc-1", "doc_name": "doc1.pdf"}],
+            layouts=[{"type": "paragraph", "text": "hello", "page": 1, "layout_id": "p1"}],
+        )
+        with mock.patch("rag_eval.dataset_builder.parser.aliyun_docmind_gateway.time.sleep", return_value=None), mock.patch(
+            "rag_eval.dataset_builder.parser.aliyun_docmind_gateway.time.monotonic",
+            side_effect=[0.0, 0.1, 0.2],
+        ):
+            payload = success_gateway.parse_document(pdf_path)
+        self.assertEqual(payload["doc_id"], "doc-1")
+        self.assertEqual(len(payload["layouts"]), 1)
+
+        failure_gateway = FakeGateway(settings, statuses=[{"status": "failed", "message": "bad file"}])
+        with self.assertRaises(RuntimeError):
+            failure_gateway.parse_document(pdf_path)
+
+        timeout_gateway = FakeGateway(settings, statuses=[{"status": "running"}, {"status": "running"}])
+        with mock.patch("rag_eval.dataset_builder.parser.aliyun_docmind_gateway.time.sleep", return_value=None), mock.patch(
+            "rag_eval.dataset_builder.parser.aliyun_docmind_gateway.time.monotonic",
+            side_effect=[0.0, 2.0],
+        ):
+            with self.assertRaises(TimeoutError):
+                timeout_gateway.parse_document(pdf_path)
+
+    def test_aliyun_gateway_reports_missing_sdk(self) -> None:
+        settings = EvaluationSettings.model_construct(
+            aliyun_parse_poll_interval_seconds=1,
+            aliyun_parse_timeout_seconds=1,
+        )
+        gateway = AliyunDocmindGateway(settings)
+
+        with mock.patch("rag_eval.dataset_builder.parser.aliyun_docmind_gateway.DocmindClient", None), mock.patch(
+            "rag_eval.dataset_builder.parser.aliyun_docmind_gateway.docmind_models", None
+        ), mock.patch("rag_eval.dataset_builder.parser.aliyun_docmind_gateway.openapi_models", None), mock.patch(
+            "rag_eval.dataset_builder.parser.aliyun_docmind_gateway.runtime_models", None
+        ):
+            with self.assertRaises(ImportError):
+                gateway._load_sdk()
+
+    def test_document_parser_rejects_empty_layouts(self) -> None:
+        settings = EvaluationSettings.model_construct(
+            aliyun_parse_poll_interval_seconds=1,
+            aliyun_parse_timeout_seconds=1,
+        )
+        gateway = FakeGateway(
+            settings,
+            statuses=[{"status": "succeeded", "doc_id": "doc-1", "doc_name": "doc1.pdf"}],
+            layouts=[],
+        )
+        parser = AliyunDocumentParser(gateway)
+        with self.assertRaises(ValueError):
+            parser.parse(self.input_dir / "doc1.pdf")
+
+    def test_run_dataset_build_skip_mode_writes_all_artifacts(self) -> None:
+        doc1 = self._make_document("doc-1", "doc1.pdf")
+        parser = FakeParser(
+            {"doc1.pdf": doc1},
+            failures={"doc2.pdf"},
+        )
+        generator = FakeGenerator(
+            {
+                "doc-1": [
+                    DraftQuestionSample(
+                        sample_id="doc-1-q1",
+                        question="What is the scope?",
+                        ground_truth="It defines the scope.",
+                        scenario="sample-build",
+                        language="en",
+                        doc_id="doc-1",
+                        doc_name="doc1.pdf",
+                        section_path="Chapter 1 > Scope",
+                        page_start=1,
+                        page_end=2,
+                        source_chunk_ids=["doc-1-chunk-1"],
+                        question_type="fact",
+                        difficulty="easy",
+                    )
+                ]
+            }
+        )
+
+        result = run_dataset_build(
+            self.config_path,
+            settings=EvaluationSettings.model_construct(dataset_generator_model="stub-model"),
+            parser=parser,
+            generator=generator,
+        )
+
+        self.assertEqual(len(result.documents), 1)
+        self.assertEqual(len(result.parse_failures), 1)
+        self.assertEqual(len(result.draft_samples), 1)
+        self.assertTrue(result.artifact_paths.documents_jsonl.exists())
+        self.assertTrue(result.artifact_paths.semantic_blocks_jsonl.exists())
+        self.assertTrue(result.artifact_paths.source_chunks_jsonl.exists())
+        self.assertTrue(result.artifact_paths.dataset_draft_csv.exists())
+        self.assertTrue(result.artifact_paths.parse_failures_csv.exists())
+        self.assertTrue(result.artifact_paths.metadata_json.exists())
+        self.assertTrue(result.job.dataset_path.exists())
+        latest_dir = result.job.artifact_dir / "latest"
+        self.assertTrue((latest_dir / "source_chunks.jsonl").exists())
+        self.assertTrue((latest_dir / "dataset_draft.csv").exists())
+        self.assertTrue((latest_dir / "metadata.json").exists())
+
+        with result.artifact_paths.parse_failures_csv.open(encoding="utf-8") as handle:
+            rows = list(csv.DictReader(handle))
+        self.assertEqual(len(rows), 1)
+        self.assertIn("doc2.pdf", rows[0]["file_path"])
+
+        metadata = json.loads(result.artifact_paths.metadata_json.read_text(encoding="utf-8"))
+        self.assertEqual(metadata["stats"]["documents_processed"], 1)
+        self.assertEqual(metadata["stats"]["parse_failures"], 1)
+        latest_metadata = json.loads((latest_dir / "metadata.json").read_text(encoding="utf-8"))
+        self.assertEqual(latest_metadata["run_id"], result.run_id)
+
+        with result.artifact_paths.source_chunks_jsonl.open(encoding="utf-8") as handle:
+            run_chunks = handle.read()
+        with (latest_dir / "source_chunks.jsonl").open(encoding="utf-8") as handle:
+            latest_chunks = handle.read()
+        self.assertEqual(latest_chunks, run_chunks)
+
+    def test_run_dataset_build_single_pdf_input(self) -> None:
+        single_pdf_config = self.temp_dir / "single-pdf-build.yaml"
+        single_pdf_config.write_text(
+            self.config_path.read_text(encoding="utf-8").replace(
+                f"  path: {self.input_dir.as_posix()}",
+                f"  path: {(self.input_dir / 'doc1.pdf').as_posix()}",
+            ),
+            encoding="utf-8",
+        )
+        parser = FakeParser({"doc1.pdf": self._make_document("doc-1", "doc1.pdf")})
+        generator = FakeGenerator(
+            {
+                "doc-1": [
+                    DraftQuestionSample(
+                        sample_id="doc-1-q1",
+                        question="What is the scope?",
+                        ground_truth="It defines the scope.",
+                        scenario="sample-build",
+                        language="en",
+                        doc_id="doc-1",
+                        doc_name="doc1.pdf",
+                        section_path="Chapter 1 > Scope",
+                        page_start=1,
+                        page_end=2,
+                        source_chunk_ids=["doc-1-chunk-1"],
+                        question_type="fact",
+                        difficulty="easy",
+                    )
+                ]
+            }
+        )
+        result = run_dataset_build(
+            single_pdf_config,
+            settings=EvaluationSettings.model_construct(dataset_generator_model="stub-model"),
+            parser=parser,
+            generator=generator,
+        )
+        self.assertEqual(len(result.documents), 1)
+        self.assertEqual(result.documents[0].doc_name, "doc1.pdf")
+        self.assertEqual(len(result.draft_samples), 1)
+
+    def test_run_dataset_build_caps_questions_per_document(self) -> None:
+        doc1 = self._make_document("doc-1", "doc1.pdf")
+        parser = FakeParser({"doc1.pdf": doc1, "doc2.pdf": self._make_document("doc-2", "doc2.pdf")})
+        generator = FakeGenerator(
+            {
+                "doc-1": [
+                    DraftQuestionSample(
+                        sample_id=f"doc-1-q{index}",
+                        question=f"Question {index}?",
+                        ground_truth=f"Answer {index}.",
+                        scenario="sample-build",
+                        language="en",
+                        doc_id="doc-1",
+                        doc_name="doc1.pdf",
+                        section_path="Chapter 1 > Scope",
+                        page_start=1,
+                        page_end=2,
+                        source_chunk_ids=[f"doc-1-chunk-{index}"],
+                        question_type="fact",
+                        difficulty="easy",
+                    )
+                    for index in range(1, 5)
+                ]
+            }
+        )
+        # Rebuild the doc with enough chunk ids for validation to pass.
+        doc1.source_chunks = [
+            SourceChunk(
+                chunk_id=f"doc-1-chunk-{index}",
+                doc_id="doc-1",
+                doc_name="doc1.pdf",
+                text=f"Chunk {index}",
+                page_start=index,
+                page_end=index,
+                section_path="Chapter 1 > Scope",
+                section_title="Scope",
+                source_layout_ids=[f"layout-{index}"],
+            )
+            for index in range(1, 5)
+        ]
+
+        result = run_dataset_build(
+            self.config_path,
+            settings=EvaluationSettings.model_construct(dataset_generator_model="stub-model"),
+            parser=parser,
+            generator=generator,
+        )
+        self.assertLessEqual(len([item for item in result.draft_samples if item.doc_id == "doc-1"]), 3)
+
+    def test_run_dataset_build_filters_questions_exceeding_chunk_limit(self) -> None:
+        doc1 = self._make_document("doc-1", "doc1.pdf")
+        doc1.source_chunks.append(
+            SourceChunk(
+                chunk_id="doc-1-chunk-2",
+                doc_id="doc-1",
+                doc_name="doc1.pdf",
+                text="Chunk 2",
+                page_start=2,
+                page_end=2,
+                section_path="Chapter 1 > Scope",
+                section_title="Scope",
+                source_layout_ids=["layout-2"],
+            )
+        )
+        parser = FakeParser({"doc1.pdf": doc1}, failures={"doc2.pdf"})
+        generator = FakeGenerator(
+            {
+                "doc-1": [
+                    DraftQuestionSample(
+                        sample_id="doc-1-q1",
+                        question="Too many chunks?",
+                        ground_truth="This cites two chunks.",
+                        scenario="sample-build",
+                        language="en",
+                        doc_id="doc-1",
+                        doc_name="doc1.pdf",
+                        section_path="Chapter 1 > Scope",
+                        page_start=1,
+                        page_end=2,
+                        source_chunk_ids=["doc-1-chunk-1", "doc-1-chunk-2"],
+                        question_type="fact",
+                        difficulty="easy",
+                    )
+                ]
+            }
+        )
+
+        strict_config = self.temp_dir / "dataset-build-strict.yaml"
+        strict_config.write_text(
+            self.config_path.read_text(encoding="utf-8").replace(
+                "  max_source_chunks_per_question: 2",
+                "  max_source_chunks_per_question: 1",
+            ),
+            encoding="utf-8",
+        )
+        result = run_dataset_build(
+            strict_config,
+            settings=EvaluationSettings.model_construct(dataset_generator_model="stub-model"),
+            parser=parser,
+            generator=generator,
+        )
+        self.assertEqual(len(result.draft_samples), 0)
+
+    def test_run_dataset_build_fail_mode_raises(self) -> None:
+        fail_config = self.temp_dir / "dataset-build-fail.yaml"
+        fail_config.write_text(self.config_path.read_text(encoding="utf-8").replace("failure_mode: skip", "failure_mode: fail"), encoding="utf-8")
+        parser = FakeParser({}, failures={"doc1.pdf"})
+        generator = FakeGenerator({})
+
+        with self.assertRaises(RuntimeError):
+            run_dataset_build(
+                fail_config,
+                settings=EvaluationSettings.model_construct(dataset_generator_model="stub-model"),
+                parser=parser,
+                generator=generator,
+            )
+
+
+class QuestionGeneratorTests(unittest.TestCase):
+    def _make_document(self) -> ParsedDocument:
+        return ParsedDocument(
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            raw_text="source text",
+            structure_nodes=[],
+            semantic_blocks=[],
+            source_chunks=[
+                SourceChunk(
+                    chunk_id="doc-1-chunk-1",
+                    doc_id="doc-1",
+                    doc_name="doc1.pdf",
+                    text="Scope content",
+                    page_start=1,
+                    page_end=1,
+                    section_path="Chapter 1 > Scope",
+                    section_title="Scope",
+                    source_layout_ids=["layout-1"],
+                ),
+                SourceChunk(
+                    chunk_id="doc-1-chunk-2",
+                    doc_id="doc-1",
+                    doc_name="doc1.pdf",
+                    text="Procedure content",
+                    page_start=2,
+                    page_end=2,
+                    section_path="Chapter 2 > Process",
+                    section_title="Process",
+                    source_layout_ids=["layout-2"],
+                ),
+            ],
+            metadata={},
+        )
+
+    def _make_fake_client(self, content: str):
+        class FakeResponse:
+            def __init__(self, payload: str):
+                self.choices = [type("Choice", (), {"message": type("Message", (), {"content": payload})()})()]
+
+        class FakeCompletions:
+            def __init__(self, payload: str):
+                self.payload = payload
+
+            def create(self, **kwargs):
+                return FakeResponse(self.payload)
+
+        return type(
+            "FakeClient",
+            (),
+            {"chat": type("Chat", (), {"completions": FakeCompletions(content)})()},
+        )()
+
+    def test_question_generator_builds_samples_from_json_response(self) -> None:
+        settings = EvaluationSettings.model_construct(openai_api_key="test-key")
+        content = json.dumps(
+            {
+                "samples": [
+                    {
+                        "question": "What is the scope?",
+                        "ground_truth": "It defines the scope.",
+                        "source_chunk_ids": ["doc-1-chunk-1"],
+                        "question_type": "fact",
+                        "difficulty": "easy",
+                    },
+                    {
+                        "question": "Summarize the process.",
+                        "ground_truth": "It explains the process.",
+                        "source_chunk_ids": ["doc-1-chunk-2"],
+                        "question_type": "summary",
+                        "difficulty": "medium",
+                    },
+                ]
+            }
+        )
+        generator = OpenAIQuestionGenerator(
+            settings=settings,
+            model="stub-model",
+            client=self._make_fake_client(content),
+        )
+        samples = generator.generate(
+            self._make_document(),
+            max_questions=1,
+            max_chunks_per_question=2,
+            job_name="sample-build",
+        )
+        self.assertEqual(len(samples), 1)
+        self.assertEqual(samples[0].sample_id, "doc-1-q1")
+        self.assertEqual(samples[0].section_path, "Chapter 1 > Scope")
+
+    def test_question_generator_rejects_invalid_json(self) -> None:
+        settings = EvaluationSettings.model_construct(openai_api_key="test-key")
+        generator = OpenAIQuestionGenerator(
+            settings=settings,
+            model="stub-model",
+            client=self._make_fake_client("not-json"),
+        )
+        with self.assertRaises(ValueError):
+            generator.generate(
+                self._make_document(),
+                max_questions=1,
+                max_chunks_per_question=2,
+                job_name="sample-build",
+            )
+
+    def test_question_generator_rejects_non_list_samples(self) -> None:
+        settings = EvaluationSettings.model_construct(openai_api_key="test-key")
+        content = json.dumps({"samples": {"question": "bad-shape"}})
+        generator = OpenAIQuestionGenerator(
+            settings=settings,
+            model="stub-model",
+            client=self._make_fake_client(content),
+        )
+        with self.assertRaises(ValueError):
+            generator.generate(
+                self._make_document(),
+                max_questions=1,
+                max_chunks_per_question=2,
+                job_name="sample-build",
+            )
+
+
+class MainCliParseTests(unittest.TestCase):
+    def test_cli_options_are_mutually_exclusive(self) -> None:
+        import main
+
+        with mock.patch("sys.argv", ["main.py", "--scenario", "a.yaml", "--dataset-build-config", "b.yaml"]):
+            with self.assertRaises(SystemExit):
+                main.parse_args()
--- a/tests/test_offline_eval.py
+++ b/tests/test_offline_eval.py
@@ -0,0 +1,311 @@
+import asyncio
+import os
+import unittest
+from pathlib import Path
+from unittest import mock
+
+import pandas as pd
+from pydantic_settings import SettingsConfigDict
+
+from rag_eval.config.loader import load_scenario
+from rag_eval.datasets.normalizers import normalize_records
+from rag_eval.execution.evaluator import Evaluator
+from rag_eval.metrics.pipeline import MetricPipeline
+from rag_eval.reporting.summary import build_summary_markdown
+from rag_eval.reporting.writers import write_run_artifacts
+from rag_eval.settings import EvaluationSettings
+from rag_eval.shared.models import EvaluationResult
+
+
+class EnvOnlySettings(EvaluationSettings):
+    model_config = SettingsConfigDict(env_file=None, extra="ignore")
+
+
+class FakeMetric:
+    def __init__(self, value: float):
+        self.value = value
+
+    async def ascore(self, **kwargs):
+        class Result:
+            def __init__(self, value: float):
+                self.value = value
+
+        return Result(self.value)
+
+
+class SlowMetric:
+    async def ascore(self, **kwargs):
+        await asyncio.sleep(0.05)
+        return type("Result", (), {"value": 1.0})()
+
+
+class OpenAIConfigTests(unittest.TestCase):
+    def test_openai_client_kwargs_default_base_url(self) -> None:
+        with mock.patch.dict(os.environ, {"OPENAI_API_KEY": "test-key"}, clear=True):
+            settings = EnvOnlySettings()
+            self.assertEqual(
+                settings.openai_client_kwargs,
+                {"api_key": "test-key", "base_url": "http://6.86.80.4:30080/v1", "timeout": 30.0},
+            )
+
+    def test_openai_client_kwargs_with_base_url(self) -> None:
+        with mock.patch.dict(
+            os.environ,
+            {
+                "OPENAI_API_KEY": "test-key",
+                "OPENAI_BASE_URL": "https://proxy.example/v1",
+            },
+            clear=True,
+        ):
+            settings = EnvOnlySettings()
+            self.assertEqual(
+                settings.openai_client_kwargs,
+                {"api_key": "test-key", "base_url": "https://proxy.example/v1", "timeout": 30.0},
+            )
+
+    def test_settings_defaults(self) -> None:
+        with mock.patch.dict(os.environ, {}, clear=True):
+            settings = EnvOnlySettings()
+            self.assertEqual(settings.openai_base_url, "http://6.86.80.4:30080/v1")
+            self.assertEqual(settings.ragas_judge_model, "deepseek-v4-flash")
+            self.assertEqual(settings.ragas_embedding_model, "text-embedding-v3")
+            self.assertEqual(settings.openai_timeout_seconds, 30.0)
+            self.assertEqual(settings.ragas_metric_timeout_seconds, 45.0)
+            self.assertEqual(settings.batch_size, 8)
+
+
+class ScenarioAndDatasetTests(unittest.TestCase):
+    def test_load_scenario_resolves_relative_paths(self) -> None:
+        scenario = load_scenario("scenarios/offline/sample-offline.yaml")
+        self.assertEqual(scenario.mode, "offline")
+        self.assertTrue(scenario.dataset.path.name.endswith(".csv"))
+        self.assertEqual(scenario.output_dir.name, "sample-offline-baseline")
+
+    def test_scenario_snapshot_serializes_path_static_kwargs(self) -> None:
+        scenario = load_scenario("scenarios/online/sample-pdf-question-bank-online.yaml")
+        snapshot = scenario.snapshot()
+        self.assertIsInstance(snapshot["app_adapter"]["static_kwargs"]["source_chunks_path"], str)
+        self.assertTrue(
+            snapshot["app_adapter"]["static_kwargs"]["source_chunks_path"].endswith("source_chunks.jsonl")
+        )
+
+    def test_load_sample_pdf_offline_smoke_scenario(self) -> None:
+        scenario = load_scenario("scenarios/offline/sample-pdf-offline-smoke.yaml")
+        self.assertEqual(scenario.mode, "offline")
+        self.assertEqual(scenario.dataset.path.name, "sample_pdf_offline_smoke.csv")
+        self.assertEqual(scenario.output_dir.name, "sample-pdf-offline-smoke")
+
+    def test_normalize_records_splits_valid_and_invalid(self) -> None:
+        records = [
+            {
+                "question": "Q1",
+                "contexts": '["C1"]',
+                "answer": "A1",
+                "ground_truth": "G1",
+            },
+            {
+                "question": "",
+                "contexts": "",
+                "answer": "",
+                "ground_truth": "",
+            },
+        ]
+        valid, invalid = normalize_records(records)
+        self.assertEqual(len(valid), 1)
+        self.assertEqual(len(invalid), 1)
+        self.assertEqual(valid[0].contexts, ["C1"])
+
+    def test_normalize_sample_pdf_offline_smoke_row(self) -> None:
+        frame = pd.read_csv("datasets/normalized/sample_pdf_offline_smoke.csv")
+        valid, invalid = normalize_records(frame.to_dict(orient="records"))
+        self.assertEqual(len(invalid), 0)
+        self.assertEqual(len(valid), 3)
+        self.assertTrue(valid[0].answer)
+        self.assertTrue(valid[0].ground_truth)
+        self.assertTrue(valid[0].contexts)
+
+
+class EvaluatorAndReportingTests(unittest.TestCase):
+    def test_metric_pipeline_scores_sample(self) -> None:
+        pipeline = MetricPipeline(
+            metrics={
+                "faithfulness": FakeMetric(0.1),
+                "answer_relevancy": FakeMetric(0.2),
+                "context_recall": FakeMetric(0.3),
+                "context_precision": FakeMetric(0.4),
+            }
+        )
+        valid, _ = normalize_records(
+            [
+                {
+                    "question": "What is RAG?",
+                    "contexts": ["RAG combines retrieval and generation."],
+                    "answer": "RAG combines retrieval and generation.",
+                    "ground_truth": "RAG combines retrieval and generation.",
+                }
+            ]
+        )
+        score = __import__("asyncio").run(pipeline.score_sample(valid[0]))
+        self.assertEqual(score.metrics["faithfulness"], 0.1)
+        self.assertEqual(score.metrics["context_precision"], 0.4)
+
+    def test_metric_pipeline_captures_metric_timeout_without_aborting(self) -> None:
+        pipeline = MetricPipeline(
+            metrics={
+                "faithfulness": SlowMetric(),
+                "answer_relevancy": FakeMetric(0.2),
+            },
+            metric_timeout_seconds=0.01,
+        )
+        valid, _ = normalize_records(
+            [
+                {
+                    "question": "What is RAG?",
+                    "contexts": ["RAG combines retrieval and generation."],
+                    "answer": "RAG combines retrieval and generation.",
+                    "ground_truth": "RAG combines retrieval and generation.",
+                }
+            ]
+        )
+        score = __import__("asyncio").run(pipeline.score_sample(valid[0]))
+        self.assertNotEqual(score.metrics.get("faithfulness"), 1.0)
+        self.assertEqual(score.metrics["answer_relevancy"], 0.2)
+        self.assertIn("faithfulness", score.error)
+
+    def test_evaluator_and_reporting_write_run_assets(self) -> None:
+        import shutil
+        temp_root = Path("tests/.tmp/run-assets")
+        temp_root.mkdir(parents=True, exist_ok=True)
+        # Clear leftovers from a previous run before writing new artifacts.
+        for child in temp_root.iterdir():
+            if child.is_dir():
+                shutil.rmtree(child)
+            else:
+                child.unlink()
+        output_root = temp_root
+        try:
+            scenario = load_scenario("scenarios/offline/sample-offline.yaml")
+            scenario.output_dir = output_root
+
+            pipeline = MetricPipeline(
+                metrics={
+                    "faithfulness": FakeMetric(0.1),
+                    "answer_relevancy": FakeMetric(0.2),
+                    "context_recall": FakeMetric(0.3),
+                    "context_precision": FakeMetric(0.4),
+                }
+            )
+            evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline)
+            result = evaluator.evaluate()
+            write_run_artifacts(result)
+
+            run_dir = output_root / result.run_id
+            self.assertTrue((run_dir / "scenario.snapshot.yaml").exists())
+            self.assertTrue((run_dir / "scores.csv").exists())
+            self.assertTrue((run_dir / "invalid.csv").exists())
+            self.assertTrue((run_dir / "summary.md").exists())
+            self.assertTrue((run_dir / "metadata.json").exists())
+
+            scores = pd.read_csv(run_dir / "scores.csv")
+            self.assertEqual(len(scores), 3)
+            self.assertIn("faithfulness", scores.columns)
+        finally:
+            import shutil
+
+            shutil.rmtree(temp_root, ignore_errors=True)
+
+    def test_summary_markdown_lists_all_scored_samples_and_errors(self) -> None:
+        scenario = load_scenario("scenarios/offline/sample-offline.yaml")
+        valid, invalid = normalize_records(
+            [
+                {
+                    "sample_id": "sample-1",
+                    "question": "Q1",
+                    "contexts": ["C1"],
+                    "answer": "A1",
+                    "ground_truth": "G1",
+                },
+                {
+                    "sample_id": "sample-2",
+                    "question": "Q2",
+                    "contexts": ["C2"],
+                    "answer": "A2",
+                    "ground_truth": "G2",
+                },
+                {
+                    "sample_id": "sample-3",
+                    "question": "Q3",
+                    "contexts": ["C3"],
+                    "answer": "A3",
+                    "ground_truth": "G3",
+                },
+                {
+                    "sample_id": "sample-4",
+                    "question": "Q4",
+                    "contexts": ["C4"],
+                    "answer": "A4",
+                    "ground_truth": "G4",
+                },
+            ]
+        )
+        summary = build_summary_markdown(
+            EvaluationResult(
+                scenario=scenario,
+                run_id="test-run",
+                started_at="2026-06-10T00:00:00+00:00",
+                finished_at="2026-06-10T00:01:00+00:00",
+                valid_samples=valid,
+                invalid_samples=invalid,
+                score_rows=[
+                    {
+                        "sample_id": "sample-1",
+                        "faithfulness": 1.0,
+                        "answer_relevancy": 0.9,
+                        "context_recall": 1.0,
+                        "context_precision": 0.8,
+                        "error": "",
+                    },
+                    {
+                        "sample_id": "sample-2",
+                        "faithfulness": 0.8,
+                        "answer_relevancy": 0.7,
+                        "context_recall": 0.9,
+                        "context_precision": 0.6,
+                        "error": "faithfulness: timeout",
+                    },
+                    {
+                        "sample_id": "sample-3",
+                        "faithfulness": 0.7,
+                        "answer_relevancy": 0.6,
+                        "context_recall": 0.8,
+                        "context_precision": 0.5,
+                        "error": "",
+                    },
+                    {
+                        "sample_id": "sample-4",
+                        "faithfulness": 0.6,
+                        "answer_relevancy": 0.5,
+                        "context_recall": 0.7,
+                        "context_precision": 0.4,
+                        "error": "context_precision: failed",
+                    },
+                ],
+            )
+        )
+
+        self.assertIn("## Per-sample Scores", summary)
+        self.assertIn("sample-1", summary)
+        self.assertIn("sample-2", summary)
+        self.assertIn("sample-3", summary)
+        self.assertIn("sample-4", summary)
+        self.assertIn("faithfulness", summary)
+        self.assertIn("answer_relevancy", summary)
+        self.assertIn("context_recall", summary)
+        self.assertIn("context_precision", summary)
+        self.assertIn("error", summary)
+        self.assertIn("faithfulness: timeout", summary)
+        self.assertIn("context_precision: failed", summary)
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/tests/test_online_eval.py
+++ b/tests/test_online_eval.py
@@ -0,0 +1,351 @@
+import shutil
+import unittest
+from pathlib import Path
+from unittest import mock
+
+import pandas as pd
+
+from rag_eval.adapters.base import AppAdapter
+from rag_eval.config.loader import load_scenario
+from rag_eval.datasets.normalizers import normalize_records
+from rag_eval.execution.evaluator import Evaluator
+from rag_eval.metrics.pipeline import MetricPipeline
+from rag_eval.shared.models import AppAdapterConfig, DatasetConfig, RuntimeConfig, Scenario
+from apps.pdf_question_bank import adapter as pdf_question_bank_adapter
+
+
+class FakeMetric:
+    def __init__(self, value: float):
+        self.value = value
+
+    async def ascore(self, **kwargs):
+        class Result:
+            def __init__(self, value: float):
+                self.value = value
+
+        return Result(self.value)
+
+
+class FakeOnlineAdapter(AppAdapter):
+    async def run(self, question: str, **kwargs):
+        return {
+            "answer": f"answer for {question}",
+            "contexts": [f"context for {question}"],
+            "raw_response": {"question": question, "metadata": kwargs},
+        }
+
+
+class ExplodingOnlineAdapter(AppAdapter):
+    async def run(self, question: str, **kwargs):
+        raise RuntimeError("boom")
+
+
+class OnlineDatasetTests(unittest.TestCase):
+    def setUp(self) -> None:
+        root = Path("tests/.tmp").resolve()
+        root.mkdir(parents=True, exist_ok=True)
+        self.temp_dir = root / self._testMethodName
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+        self.temp_dir.mkdir(parents=True, exist_ok=True)
+
+    def tearDown(self) -> None:
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+
+    def test_online_records_allow_missing_answer_and_contexts_before_adapter(self) -> None:
+        records = [
+            {
+                "sample_id": "sample-1",
+                "question": "What is the policy scope?",
+                "ground_truth": "It covers all employees.",
+                "doc_id": "doc-1",
+                "source_chunk_ids": '["doc-1-chunk-1"]',
+            }
+        ]
+        valid, invalid = normalize_records(records, mode="online")
+        self.assertEqual(len(valid), 1)
+        self.assertEqual(len(invalid), 0)
+        self.assertEqual(valid[0].answer, "")
+        self.assertEqual(valid[0].contexts, [])
+        self.assertEqual(valid[0].metadata["source_chunk_ids"], '["doc-1-chunk-1"]')
+
+    def test_online_evaluator_enriches_dataset_and_scores(self) -> None:
+        dataset_path = self.temp_dir / "online.csv"
+        pd.DataFrame(
+            [
+                {
+                    "sample_id": "sample-1",
+                    "question": "What is the policy scope?",
+                    "ground_truth": "It covers all employees.",
+                    "doc_id": "doc-1",
+                    "section_path": "Policy > Scope",
+                    "source_chunk_ids": '["doc-1-chunk-1"]',
+                }
+            ]
+        ).to_csv(dataset_path, index=False)
+
+        scenario = Scenario(
+            scenario_name="online-test",
+            mode="online",
+            dataset=DatasetConfig(path=dataset_path),
+            judge_model="judge-model",
+            embedding_model="embedding-model",
+            metrics=["faithfulness"],
+            output_dir=self.temp_dir / "outputs",
+            runtime=RuntimeConfig(batch_size=1),
+            app_adapter=AppAdapterConfig(type="python", callable="tests.fake:run"),
+        )
+        pipeline = MetricPipeline(metrics={"faithfulness": FakeMetric(0.8)})
+        evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline, app_adapter=FakeOnlineAdapter())
+
+        result = evaluator.evaluate()
+        self.assertEqual(len(result.valid_samples), 1)
+        self.assertEqual(len(result.invalid_samples), 0)
+        self.assertEqual(result.valid_samples[0].answer, "answer for What is the policy scope?")
+        self.assertEqual(result.valid_samples[0].contexts, ["context for What is the policy scope?"])
+        self.assertEqual(result.score_rows[0]["faithfulness"], 0.8)
+
+    def test_online_evaluator_captures_adapter_exception_type_in_invalid_rows(self) -> None:
+        dataset_path = self.temp_dir / "online.csv"
+        pd.DataFrame(
+            [
+                {
+                    "sample_id": "sample-1",
+                    "question": "What is the policy scope?",
+                    "ground_truth": "It covers all employees.",
+                    "doc_id": "doc-1",
+                    "source_chunk_ids": '["doc-1-chunk-1"]',
+                }
+            ]
+        ).to_csv(dataset_path, index=False)
+
+        scenario = Scenario(
+            scenario_name="online-test",
+            mode="online",
+            dataset=DatasetConfig(path=dataset_path),
+            judge_model="judge-model",
+            embedding_model="embedding-model",
+            metrics=["faithfulness"],
+            output_dir=self.temp_dir / "outputs",
+            runtime=RuntimeConfig(batch_size=1),
+            app_adapter=AppAdapterConfig(type="python", callable="tests.fake:run"),
+        )
+        pipeline = MetricPipeline(metrics={"faithfulness": FakeMetric(0.8)})
+        evaluator = Evaluator(scenario=scenario, metric_pipeline=pipeline, app_adapter=ExplodingOnlineAdapter())
+
+        result = evaluator.evaluate()
+        self.assertEqual(len(result.valid_samples), 0)
+        self.assertEqual(len(result.invalid_samples), 1)
+        self.assertEqual(result.invalid_samples[0].error, "adapter failed [RuntimeError]: boom")
+
+
+class FakeCompletionResponse:
+    def __init__(self, content: str):
+        self.choices = [type("Choice", (), {"message": type("Message", (), {"content": content})()})()]
+
+
+class FakeCompletions:
+    def __init__(self, content: str):
+        self.content = content
+        self.calls: list[dict] = []
+
+    def create(self, **kwargs):
+        self.calls.append(kwargs)
+        return FakeCompletionResponse(self.content)
+
+
+class PdfQuestionBankAdapterTests(unittest.TestCase):
+    def setUp(self) -> None:
+        root = Path("tests/.tmp").resolve()
+        root.mkdir(parents=True, exist_ok=True)
+        self.temp_dir = root / self._testMethodName
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+        self.temp_dir.mkdir(parents=True, exist_ok=True)
+        self.source_chunks_path = self.temp_dir / "source_chunks.jsonl"
+        self.source_chunks_path.write_text(
+            "\n".join(
+                [
+                    '{"chunk_id":"doc-1-chunk-1","doc_id":"doc-1","doc_name":"doc1.pdf","text":"Scope covers all employees.","page_start":1,"page_end":1,"section_path":"Policy > Scope","section_title":"Scope","source_layout_ids":["layout-1"]}',
+                    '{"chunk_id":"doc-1-chunk-2","doc_id":"doc-1","doc_name":"doc1.pdf","text":"Managers approve exceptions.","page_start":2,"page_end":2,"section_path":"Policy > Exceptions","section_title":"Exceptions","source_layout_ids":["layout-2"]}',
+                ]
+            ),
+            encoding="utf-8",
+        )
+        pdf_question_bank_adapter._CHUNK_CACHE.clear()
+
+    def tearDown(self) -> None:
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+        pdf_question_bank_adapter._CHUNK_CACHE.clear()
+
+    def test_adapter_loads_chunks_and_returns_resolved_contexts(self) -> None:
+        completions = FakeCompletions("It covers all employees.")
+        client = type(
+            "FakeClient",
+            (),
+            {"chat": type("Chat", (), {"completions": completions})()},
+        )()
+
+        result = pdf_question_bank_adapter.run(
+            question="What is the policy scope?",
+            source_chunks_path=str(self.source_chunks_path),
+            model="stub-model",
+            client=client,
+            source_chunk_ids='["doc-1-chunk-1"]',
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+            section_path="Policy > Scope",
+        )
+
+        self.assertEqual(result["answer"], "It covers all employees.")
+        self.assertEqual(result["contexts"], ["Scope covers all employees."])
+        self.assertEqual(result["raw_response"]["resolved_chunk_ids"], ["doc-1-chunk-1"])
+        self.assertEqual(result["raw_response"]["model"], "stub-model")
+        self.assertEqual(len(completions.calls), 1)
+        self.assertEqual(completions.calls[0]["model"], "stub-model")
+        self.assertEqual(completions.calls[0]["temperature"], 0)
+        self.assertIn("Evidence chunks:", completions.calls[0]["messages"][1]["content"])
+
+    def test_adapter_supports_multiple_chunk_ids(self) -> None:
+        completions = FakeCompletions("Combined answer.")
+        client = type(
+            "FakeClient",
+            (),
+            {"chat": type("Chat", (), {"completions": completions})()},
+        )()
+
+        result = pdf_question_bank_adapter.run(
+            question="What does the policy say?",
+            source_chunks_path=str(self.source_chunks_path),
+            model="stub-model",
+            client=client,
+            source_chunk_ids='["doc-1-chunk-1", "doc-1-chunk-2"]',
+            doc_id="doc-1",
+            doc_name="doc1.pdf",
+        )
+
+        self.assertEqual(
+            result["contexts"],
+            ["Scope covers all employees.", "Managers approve exceptions."],
+        )
+        self.assertEqual(
+            result["raw_response"]["resolved_chunk_ids"],
+            ["doc-1-chunk-1", "doc-1-chunk-2"],
+        )
+
+    def test_adapter_rejects_missing_source_chunk_ids(self) -> None:
+        with self.assertRaisesRegex(ValueError, "source_chunk_ids is required"):
+            pdf_question_bank_adapter.run(
+                question="What is the policy scope?",
+                source_chunks_path=str(self.source_chunks_path),
+                model="stub-model",
+                client=mock.Mock(),
+            )
+
+    def test_adapter_rejects_unknown_chunk_id(self) -> None:
+        with self.assertRaisesRegex(ValueError, "source_chunk_ids not found"):
+            pdf_question_bank_adapter.run(
+                question="What is the policy scope?",
+                source_chunks_path=str(self.source_chunks_path),
+                model="stub-model",
+                client=mock.Mock(),
+                source_chunk_ids='["missing-chunk"]',
+            )
+
+    def test_adapter_falls_back_to_latest_run_directory_when_latest_alias_is_missing(self) -> None:
+        artifact_root = self.temp_dir / "sample-pdf-question-bank"
+        run_dir = artifact_root / "2026-06-10T02-01-32.508056+00-00"
+        run_dir.mkdir(parents=True, exist_ok=True)
+        latest_path = artifact_root / "latest" / "source_chunks.jsonl"
+        run_chunks_path = run_dir / "source_chunks.jsonl"
+        run_chunks_path.write_text(self.source_chunks_path.read_text(encoding="utf-8"), encoding="utf-8")
+
+        completions = FakeCompletions("It covers all employees.")
+        client = type(
+            "FakeClient",
+            (),
+            {"chat": type("Chat", (), {"completions": completions})()},
+        )()
+
+        result = pdf_question_bank_adapter.run(
+            question="What is the policy scope?",
+            source_chunks_path=str(latest_path),
+            model="stub-model",
+            client=client,
+            source_chunk_ids='["doc-1-chunk-1"]',
+            doc_id="doc-1",
+        )
+
+        self.assertEqual(result["contexts"], ["Scope covers all employees."])
+        self.assertEqual(result["raw_response"]["resolved_chunk_ids"], ["doc-1-chunk-1"])
+
+    def test_online_evaluator_handles_dataset_build_rows_with_python_adapter(self) -> None:
+        dataset_path = self.temp_dir / "question_bank.csv"
+        pd.DataFrame(
+            [
+                {
+                    "sample_id": "sample-1",
+                    "question": "What is the policy scope?",
+                    "ground_truth": "It covers all employees.",
+                    "doc_id": "doc-1",
+                    "doc_name": "doc1.pdf",
+                    "section_path": "Policy > Scope",
+                    "source_chunk_ids": '["doc-1-chunk-1"]',
+                }
+            ]
+        ).to_csv(dataset_path, index=False)
+
+        completions = FakeCompletions("It covers all employees.")
+        client = type(
+            "FakeClient",
+            (),
+            {"chat": type("Chat", (), {"completions": completions})()},
+        )()
+
+        scenario = Scenario(
+            scenario_name="online-question-bank-test",
+            mode="online",
+            dataset=DatasetConfig(path=dataset_path),
+            judge_model="judge-model",
+            embedding_model="embedding-model",
+            metrics=["faithfulness"],
+            output_dir=self.temp_dir / "outputs",
+            runtime=RuntimeConfig(batch_size=1),
+            app_adapter=AppAdapterConfig(
+                type="python",
+                callable="apps.pdf_question_bank.adapter:run",
+                static_kwargs={
+                    "source_chunks_path": str(self.source_chunks_path),
+                    "model": "stub-model",
+                    "client": client,
+                },
+            ),
+        )
+        pipeline = MetricPipeline(metrics={"faithfulness": FakeMetric(0.8)})
+        from rag_eval.adapters.python import PythonFunctionAdapter
+
+        evaluator = Evaluator(
+            scenario=scenario,
+            metric_pipeline=pipeline,
+            app_adapter=PythonFunctionAdapter(scenario.app_adapter),
+        )
+
+        result = evaluator.evaluate()
+        self.assertEqual(len(result.valid_samples), 1)
+        self.assertEqual(len(result.invalid_samples), 0)
+        self.assertEqual(result.valid_samples[0].answer, "It covers all employees.")
+        self.assertEqual(result.valid_samples[0].contexts, ["Scope covers all employees."])
+        self.assertEqual(result.score_rows[0]["faithfulness"], 0.8)
+        self.assertEqual(
+            result.valid_samples[0].metadata["raw_response"]["resolved_chunk_ids"],
+            ["doc-1-chunk-1"],
+        )
+
+    def test_load_sample_pdf_online_scenario(self) -> None:
+        scenario = load_scenario("scenarios/online/sample-pdf-question-bank-online.yaml")
+        self.assertEqual(scenario.mode, "online")
+        self.assertEqual(scenario.dataset.path.name, "sample-pdf-question-bank.csv")
+        self.assertEqual(scenario.output_dir.name, "sample-pdf-question-bank")
+        self.assertEqual(scenario.runtime.max_samples, 45)
+        self.assertEqual(scenario.app_adapter.callable, "apps.pdf_question_bank.adapter:run")
+        self.assertTrue(
+            str(scenario.app_adapter.static_kwargs["source_chunks_path"]).endswith("source_chunks.jsonl")
+        )
--- a/uv.lock
+++ b/uv.lock
@@ -0,0 +1 @@
+"""Local-document QA adapter package for dataset-build question banks."""