docs: add Dify score API integration design spec

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-22 14:51:52 +08:00
parent ccf25eb1f9
commit eee96eb158
1 changed files with 138 additions and 0 deletions
--- a/docs/superpowers/specs/2026-06-22-dify-score-api-design.md
+++ b/docs/superpowers/specs/2026-06-22-dify-score-api-design.md
@@ -0,0 +1,138 @@
+# Dify Integration: Single-Question Real-Time Scoring API Design
+
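+**Date**: 2026-06-22  
+**Status**: Approved, pending implementation  
+**Scope**: Add a `POST /api/score` endpoint to the existing FastAPI service so that a Dify external Tool can call it to score a single question-answer record against RAGAS metrics in real time.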
+
+---
+
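+## 1. Goal
+
+After answering a question, the Dify Agent sends `(question, answer, contexts, ground_truth)` to the siemens_ragas service and receives the score for each RAGAS metric in real time, for quality monitoring or Agent self-improvement.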
+
+---
+
+## 2. API Specification
+
+### `POST /api/score`
+
+**Request body:**
+
+```json
+{
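+  "question":          "What is the temporal resolution of dual-source CT?",
+  "answer":            "The single-segment temporal resolution of dual-source CT is 75 ms.",
+  "contexts":          "Fragment 1: Dual-source CT uses two tube-detector systems... |||| Fragment 2: Single-segment acquisition rotates 135 degrees...",
+  "ground_truth":      "The single-segment temporal resolution of dual-source CT is 75 ms, which requires a 135-degree rotation.",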
+  "context_separator": " |||| ",
+  "metrics":           ["faithfulness", "answer_relevancy"],
+  "judge_model":       "deepseek-v4-flash",
+  "embedding_model":   "text-embedding-v3"
+}
+```
+
+**Field descriptions:**
+
+| Field | Type | Required | Description |
+|------|------|------|------|
+| `question` | str | ✅ | Question text |
+| `answer` | str | ✅ | Answer to be scored |
+| `contexts` | str | ✅ | Retrieved contexts; multiple passages are joined with `context_separator` |
+| `ground_truth` | str | ❌ | Reference answer; when absent, the metrics that depend on it (context_recall, factual_correctness, semantic_similarity, noise_sensitivity) are skipped |
+| `context_separator` | str | ❌ | Defaults to `" \|\|\|\| "` (four vertical bars with one space on each side) |
+| `metrics` | list[str] | ❌ | Defaults to `["faithfulness", "answer_relevancy", "context_recall", "context_precision"]` |
+| `judge_model` | str | ❌ | Defaults to `RAGAS_JUDGE_MODEL` from `.env` |
+| `embedding_model` | str | ❌ | Defaults to `RAGAS_EMBEDDING_MODEL` from `.env` |
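
The request carries `contexts` as one joined string, while the scorer described later takes `list[str]`, so the endpoint has to split it first. A minimal sketch of that step, assuming a hypothetical helper name `split_contexts` and assuming that surrounding whitespace and empty pieces are dropped (the spec does not say either way):

```python
DEFAULT_SEPARATOR = " |||| "  # default value from the field table


def split_contexts(contexts: str, separator: str = DEFAULT_SEPARATOR) -> list[str]:
    """Split the joined `contexts` string into individual passages.

    Hypothetical helper for illustration; not part of the spec.
    """
    if not separator:
        # str.split("") raises ValueError anyway; fail with a clearer message.
        raise ValueError("context_separator must not be empty")
    # Assumption of this sketch: strip each piece and drop empty ones.
    return [part.strip() for part in contexts.split(separator) if part.strip()]
```

A caller that joins passages with a different delimiter passes it explicitly, for example `split_contexts(raw, separator="\n---\n")`.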
+
+**Response body (200 OK):**
+
+```json
+{
+  "scores": {
+    "faithfulness":     0.8750,
+    "answer_relevancy": 0.9200
+  },
+  "weighted_score": 0.8975,
+  "latency_ms": 3420
+}
+```
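
The spec does not state how `weighted_score` is derived. One reading that reproduces the example value is an equally weighted mean of the returned scores, since (0.8750 + 0.9200) / 2 = 0.8975. The sketch below assumes that reading; the function name, the optional `weights` mapping, and the rule of ignoring `None`/NaN scores are all assumptions rather than part of the design.

```python
import math
from typing import Optional


def weighted_score(
    scores: dict[str, Optional[float]],
    weights: Optional[dict[str, float]] = None,
) -> Optional[float]:
    """Weighted mean over the metrics that produced a numeric score.

    Assumption: metrics without an explicit weight get weight 1.0, and
    None/NaN scores are left out of both numerator and denominator.
    """
    weights = weights or {}
    total = 0.0
    weight_sum = 0.0
    for name, value in scores.items():
        if value is None or math.isnan(value):
            continue
        w = weights.get(name, 1.0)
        total += w * value
        weight_sum += w
    if weight_sum == 0:
        return None  # nothing to average
    return round(total / weight_sum, 4)
```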
+
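+**Error responses:**
+
+| Status code | Scenario |
+|--------|------|
+| 400 | A required field is missing, or a name in `metrics` is not valid |
+| 401 | `SCORE_API_TOKEN` is configured but the request does not carry a valid Bearer token |
+| 422 | The request body JSON is malformed (Pydantic validation) |
+| 500 | Internal RAGAS scoring error; the response includes an `error` field |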
+
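+**Authentication (optional):**  
+If `SCORE_API_TOKEN` in `.env` is non-empty, requests must carry the header `Authorization: Bearer <token>`. If it is empty, no authentication is performed (intranet deployment scenario).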
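
The optional-token rule can be sketched as a small framework-independent check. The function name `is_authorized` is hypothetical, and the constant-time comparison through `hmac.compare_digest` is a choice made for this sketch rather than something the spec requires.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], configured_token: Optional[str]) -> bool:
    """Return True when the request may proceed.

    Hypothetical helper: `configured_token` stands for SCORE_API_TOKEN.
    """
    if not configured_token:
        # Token not configured: authentication is disabled.
        return True
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Compare as bytes so non-ASCII tokens do not raise TypeError.
    return hmac.compare_digest(presented.strip().encode(), configured_token.encode())
```

The route would return 401 whenever this check yields `False`.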
+
+---
+
+## 3. Architecture and File Changes
+
+### New files
+
+| File | Responsibility |
+|------|------|
+| `webapp/api/score.py` | Route definition, request validation, calls InlineScorer |
+| `webapp/services/inline_scorer.py` | LLM client cache plus a wrapper around the RAGAS scoring logic |
+
+### Modified files
+
+| File | Change |
+|------|------|
+| `webapp/models.py` | Add `ScoreRequest` and `ScoreResponse` |
+| `webapp/server.py` | Register `score.router`, update `openapi_tags` |
+| `rag_eval/settings.py` | Add a `score_api_token: str \| None` field |
+
+---
+
+## 4. `inline_scorer.py` Design
+
+```python
+class InlineScorer:
+    """Run RAGAS single-question scoring synchronously; caches LLM clients internally."""
+
+    def score(
+        self,
+        question: str,
+        answer: str,
+        contexts: list[str],
+        ground_truth: str | None,
+        metrics: list[str],
+        judge_model: str,
+        embedding_model: str,
+        settings: EvaluationSettings,
+    ) -> dict[str, float | None]:
+        """Return a {metric_name: score} dict; NaN is recorded as None."""
+```
+
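+**Client caching strategy:**  
+`(llm, embeddings)` objects are cached under the key `(judge_model, embedding_model)` so that the AsyncOpenAI connection is not rebuilt on every request. The cache is a module-level singleton (`_scorer_cache: dict`) and is made thread-safe with a `threading.Lock`.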
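
A minimal sketch of such a cache follows. `_scorer_cache` is the name used above; `_cache_lock`, `get_clients`, and the `factory` callback are hypothetical names introduced for illustration, with `factory` standing in for whatever code actually builds the `(llm, embeddings)` pair.

```python
import threading
from typing import Any, Callable

# Module-level singleton cache keyed by (judge_model, embedding_model).
_scorer_cache: dict[tuple[str, str], tuple[Any, Any]] = {}
_cache_lock = threading.Lock()


def get_clients(
    judge_model: str,
    embedding_model: str,
    factory: Callable[[str, str], tuple[Any, Any]],
) -> tuple[Any, Any]:
    """Return the cached (llm, embeddings) pair, building it on first use."""
    key = (judge_model, embedding_model)
    with _cache_lock:
        # Holding the lock during construction keeps two threads from
        # building the same pair twice, at the cost of serialising builds.
        if key not in _scorer_cache:
            _scorer_cache[key] = factory(judge_model, embedding_model)
        return _scorer_cache[key]
```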
+
+**Scoring execution:**  
+Reuse `build_metric_pipeline` to build a `MetricPipeline`, then run `asyncio.run(pipeline.score_sample(sample))`. This matches the existing pattern in `evaluator.py`.
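
The synchronous-wrapper pattern can be sketched as below. `score_sample` here is a hypothetical stand-in coroutine, not the real `MetricPipeline.score_sample`. One point worth keeping in mind: `asyncio.run` raises `RuntimeError` when the calling thread already has a running event loop, so this pattern fits a plain `def` route (which FastAPI executes in a thread pool) rather than an `async def` route.

```python
import asyncio


async def score_sample(sample: dict) -> dict:
    """Hypothetical stand-in for pipeline.score_sample."""
    await asyncio.sleep(0)  # yield once, as a real async call would
    return {"faithfulness": 1.0 if sample.get("answer") else 0.0}


def score_sync(sample: dict) -> dict:
    """Run the coroutine to completion on a fresh event loop."""
    return asyncio.run(score_sample(sample))
```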
+
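+**Metric skipping when ground_truth is empty:**  
+`context_recall`, `factual_correctness`, `semantic_similarity`, and `noise_sensitivity` require ground_truth. If the request does not provide it, these metrics are removed from the metrics list automatically and their fields in the response are returned as `null`.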
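
The skip rule and the NaN-to-`None` convention can be sketched together. The names `NEEDS_GROUND_TRUTH`, `plan_metrics`, and `assemble_scores` are hypothetical; only the four metric names come from the text.

```python
import math
from typing import Optional

# Metrics that cannot be computed without a reference answer.
NEEDS_GROUND_TRUTH = {
    "context_recall",
    "factual_correctness",
    "semantic_similarity",
    "noise_sensitivity",
}


def plan_metrics(metrics: list[str], ground_truth: Optional[str]) -> tuple[list[str], list[str]]:
    """Split the requested metrics into (runnable, skipped)."""
    if ground_truth:
        return list(metrics), []
    runnable = [m for m in metrics if m not in NEEDS_GROUND_TRUTH]
    skipped = [m for m in metrics if m in NEEDS_GROUND_TRUTH]
    return runnable, skipped


def assemble_scores(raw: dict[str, Optional[float]], skipped: list[str]) -> dict[str, Optional[float]]:
    """Map NaN to None and add a None entry for every skipped metric."""
    scores = {
        name: (None if value is None or math.isnan(value) else value)
        for name, value in raw.items()
    }
    for name in skipped:
        scores[name] = None  # serialised as JSON null
    return scores
```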
+
+---
+
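+## 5. Dify-Side Configuration
+
+1. In Dify, create a new tool under "Tools" → "Custom Tools"
+2. Fill in the OpenAPI Schema (aligned with the `/api/score` endpoint)
+3. Authentication: API Key (Bearer) or none
+4. Reference the tool in an Agent / Workflow node and map the `question`, `answer`, and `contexts` variables to the tool inputs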
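
Before wiring up the tool in Dify, it can help to build the same request by hand and check the endpoint directly. The sketch below only constructs the request; the base URL and the helper name `build_score_request` are assumptions, and nothing is sent until the caller passes the result to `urllib.request.urlopen`.

```python
import json
import urllib.request
from typing import Optional


def build_score_request(base_url: str, payload: dict, token: Optional[str] = None) -> urllib.request.Request:
    """Build a POST /api/score request equivalent to what the tool would send."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when SCORE_API_TOKEN is configured on the server.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/score",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

For example, `urllib.request.urlopen(build_score_request("http://localhost:8000", payload))` would send it, assuming the service listens at that address.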
+
+---
+
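+## 6. Out of Scope
+
+- Batch scoring endpoint (asynchronous job)
+- Dify Workflow node plugin (requires the Dify plugin development framework)
+- Persisting scoring results to scores.csv
+- Integration with the existing report_builder for display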