
wangwei eee96eb158 docs: add Dify score API integration design spec

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-06-22 14:51:52 +08:00


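Dify Integration: Real-Time Single-Item Scoring API Design

Date: 2026-06-22
Status: Approved, pending implementation
Scope: Add a POST /api/score endpoint to the existing FastAPI service. Dify calls it as an external Tool to score a single question-answer record against RAGAS metrics in real time.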

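1. Goal

Let a Dify Agent, after answering a question, send (question, answer, contexts, ground_truth) to the siemens_ragas service and receive the score for each RAGAS metric in real time, for quality monitoring or Agent self-improvement.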

2. API specification

`POST /api/score`

Request body:

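{
  "question":          "What is the temporal resolution of dual-source CT?",
  "answer":            "The single-segment temporal resolution of dual-source CT is 75 ms.",
  "contexts":          "Passage 1: Dual-source CT uses two tube-detector systems... |||| Passage 2: Single-segment acquisition rotates 135 degrees...",
  "ground_truth":      "The single-segment temporal resolution of dual-source CT is 75 ms and requires a 135-degree rotation.",
  "context_separator": " |||| ",
  "metrics":           ["faithfulness", "answer_relevancy"],
  "judge_model":       "deepseek-v4-flash",
  "embedding_model":   "text-embedding-v3"
}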

Fields:

Field	Type	Required	Description
`question`	str	✅	Question text
`answer`	str	✅	Answer to be scored
`contexts`	str	✅	Retrieved contexts; multiple passages are joined with `context_separator`
`ground_truth`	str	❌	Reference answer; when absent, metrics that depend on it are skipped (context_recall, factual_correctness, semantic_similarity, noise_sensitivity)
`context_separator`	str	❌	Defaults to `" |||| "` (four pipes with one space on each side)
`metrics`	list[str]	❌	Defaults to `["faithfulness", "answer_relevancy", "context_recall", "context_precision"]`
`judge_model`	str	❌	Defaults to `RAGAS_JUDGE_MODEL` from `.env`
`embedding_model`	str	❌	Defaults to `RAGAS_EMBEDDING_MODEL` from `.env`
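
The `contexts` field travels as one string, so both sides need to agree on how passages are packed and unpacked. A minimal sketch of that round trip follows. The helper names `join_contexts` and `split_contexts` are hypothetical, and stripping whitespace and dropping empty pieces are assumptions the spec does not state.

```python
from __future__ import annotations

# Documented default: four pipes with one space on each side.
DEFAULT_SEPARATOR = " |||| "


def join_contexts(chunks: list[str], separator: str = DEFAULT_SEPARATOR) -> str:
    """Client side (hypothetical helper): pack passages into one `contexts` string."""
    return separator.join(chunks)


def split_contexts(contexts: str, separator: str = DEFAULT_SEPARATOR) -> list[str]:
    """Server side (hypothetical helper): recover the passage list.

    Assumption: surrounding whitespace is stripped and empty pieces are dropped.
    """
    if not separator:
        raise ValueError("context_separator must not be empty")
    return [part.strip() for part in contexts.split(separator) if part.strip()]
```

A passage that itself contains the separator would be split in two, which is one reason the separator is configurable per request.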

Response body (200 OK):

{
  "scores": {
    "faithfulness":     0.8750,
    "answer_relevancy": 0.9200
  },
  "weighted_score": 0.8975,
  "latency_ms": 3420
}
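
The spec does not say how `weighted_score` is derived. The sketch below assumes a weighted mean over the metrics that produced a score, with a weight of 1.0 for any metric without an explicit weight and rounding to four decimals. Under that assumption the example response above works out to (0.8750 + 0.9200) / 2 = 0.8975. The function name and the `weights` parameter are illustrative, not part of the design.

```python
from __future__ import annotations

import math


def weighted_score(
    scores: dict[str, float | None],
    weights: dict[str, float] | None = None,
    ndigits: int = 4,
) -> float | None:
    """Weighted mean of the usable scores (assumed formula, not from the spec)."""
    weights = weights or {}
    # Skip metrics that returned None or NaN so they do not drag the mean down.
    usable = {
        name: value
        for name, value in scores.items()
        if value is not None and not math.isnan(value)
    }
    total_weight = sum(weights.get(name, 1.0) for name in usable)
    if not usable or total_weight == 0:
        return None
    total = sum(value * weights.get(name, 1.0) for name, value in usable.items())
    return round(total / total_weight, ndigits)
```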

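Error responses:

Status code	Scenario
400	A required field is missing, or a name in metrics is invalid
401	`SCORE_API_TOKEN` is configured but the request carries no valid Bearer token
422	Malformed JSON request body (Pydantic validation)
500	Internal RAGAS scoring failure; the response includes an error field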
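
As an illustration of the 400 case for invalid metric names, the check could look like the sketch below. `KNOWN_METRICS` lists only the metric names that appear in this document, so the actual supported set is an assumption. `BadRequest` and `validate_metrics` are hypothetical names; the route handler would translate the exception into a 400 response.

```python
from __future__ import annotations

# Assumption: only the metric names mentioned in this spec are listed here.
KNOWN_METRICS = frozenset({
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "context_precision",
    "factual_correctness",
    "semantic_similarity",
    "noise_sensitivity",
})


class BadRequest(ValueError):
    """Hypothetical exception that the route layer maps to HTTP 400."""


def validate_metrics(metrics: list[str]) -> list[str]:
    """Reject unknown metric names; keep request order and drop duplicates."""
    unknown = sorted(set(metrics) - KNOWN_METRICS)
    if unknown:
        raise BadRequest(f"unknown metrics: {', '.join(unknown)}")
    return list(dict.fromkeys(metrics))
```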

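Authentication (optional):
If SCORE_API_TOKEN in .env is non-empty, requests must carry the header Authorization: Bearer <token>. If it is empty, no authentication is performed (intended for intranet deployments).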
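
The optional-auth rule can be expressed as one pure function that is easy to test apart from FastAPI. This is a sketch under assumptions: `is_authorized` is a hypothetical name, and the constant-time comparison via `hmac.compare_digest` is a suggested hardening rather than something the spec requires.

```python
from __future__ import annotations

import hmac


def is_authorized(authorization_header: str | None, expected_token: str | None) -> bool:
    """Return True when the request may proceed.

    An empty or unset expected token disables authentication, which mirrors
    the intranet deployment case described above.
    """
    if not expected_token:
        return True
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Compare as bytes so non-ASCII tokens are handled; constant-time comparison.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

The route would return 401 whenever this function returns False.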

3. Architecture and file changes

New files

File	Responsibility
`webapp/api/score.py`	Route definition, request validation, calls InlineScorer
`webapp/services/inline_scorer.py`	LLM client cache plus a wrapper around the RAGAS scoring logic

Modified files

File	Change
`webapp/models.py`	Add `ScoreRequest` and `ScoreResponse`
`webapp/server.py`	Register `score.router`; update `openapi_tags`
`rag_eval/settings.py`	Add `score_api_token: str

4. `inline_scorer.py` design

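from __future__ import annotations  # annotations stay lazy, so this stub imports on its own


class InlineScorer:
    """Run RAGAS single-item scoring synchronously; caches LLM clients internally."""

    def score(
        self,
        question: str,
        answer: str,
        contexts: list[str],
        ground_truth: str | None,
        metrics: list[str],
        judge_model: str,
        embedding_model: str,
        settings: EvaluationSettings,
    ) -> dict[str, float | None]:
        """Return a {metric_name: score} dict; NaN is recorded as None."""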

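Client caching strategy:
Cache the (llm, embeddings) objects keyed by (judge_model, embedding_model), so that AsyncOpenAI connections are not rebuilt on every request. The cache is a module-level singleton (_scorer_cache: dict) and is made thread-safe with a threading.Lock.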
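
A minimal sketch of that cache, assuming the client construction is injected as a `factory` callable so the example stays free of any real LLM library. `get_clients` and `factory` are hypothetical names; `_scorer_cache` follows the name used in the design.

```python
from __future__ import annotations

import threading
from typing import Any, Callable

# Module-level singleton cache: (judge_model, embedding_model) -> (llm, embeddings).
_scorer_cache: dict[tuple[str, str], tuple[Any, Any]] = {}
_cache_lock = threading.Lock()


def get_clients(
    judge_model: str,
    embedding_model: str,
    factory: Callable[[str, str], tuple[Any, Any]],
) -> tuple[Any, Any]:
    """Return the cached (llm, embeddings) pair, building it on first use."""
    key = (judge_model, embedding_model)
    # Holding the lock during construction guarantees one build per key,
    # at the cost of briefly serializing first-time requests.
    with _cache_lock:
        if key not in _scorer_cache:
            _scorer_cache[key] = factory(judge_model, embedding_model)
        return _scorer_cache[key]
```

If client construction turns out to be slow, a per-key lock would reduce contention, but the single lock is simpler and probably sufficient for a handful of model pairs.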

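Scoring execution:
Reuse build_metric_pipeline to build a MetricPipeline, then run asyncio.run(pipeline.score_sample(sample)). This matches the pattern in the existing evaluator.py. asyncio.run cannot be called from a thread that already has a running event loop, so the call has to come from a thread without one, for example a synchronous def route handler, which FastAPI runs in a worker thread.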
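
The sync-over-async pattern can be sketched as follows. `FakePipeline` is a stand-in for the project's MetricPipeline, whose real interface is not shown in this document; the sketch assumes `score_sample` is a coroutine returning a mapping of metric name to float. `score_sync` is a hypothetical helper name. The NaN-to-None mapping matches the `score` docstring in the class above.

```python
from __future__ import annotations

import asyncio
import math
from typing import Any


class FakePipeline:
    """Stand-in for MetricPipeline (assumed interface, for illustration only)."""

    async def score_sample(self, sample: dict[str, Any]) -> dict[str, float]:
        await asyncio.sleep(0)  # pretend to await an LLM call
        return {"faithfulness": 0.875, "answer_relevancy": float("nan")}


def score_sync(pipeline: Any, sample: dict[str, Any]) -> dict[str, float | None]:
    """Run the async scorer to completion from synchronous code."""
    # asyncio.run creates a fresh event loop; it raises RuntimeError if this
    # thread already has a running loop.
    raw = asyncio.run(pipeline.score_sample(sample))
    return {
        name: None if value is None or math.isnan(value) else float(value)
        for name, value in raw.items()
    }
```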

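Skipping metrics when ground_truth is empty:
context_recall, factual_correctness, semantic_similarity, and noise_sensitivity require ground_truth. If the request does not provide it, these metrics are removed from the metrics list automatically and their fields in the response are returned as null.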
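
The skip rule amounts to partitioning the requested metrics and then writing the skipped ones back as null. A minimal sketch, in which `plan_metrics` and `merge_skipped` are hypothetical helper names and treating a whitespace-only ground_truth as missing is an assumption:

```python
from __future__ import annotations

# Metrics that need a reference answer, as listed in the design above.
REFERENCE_METRICS = frozenset({
    "context_recall",
    "factual_correctness",
    "semantic_similarity",
    "noise_sensitivity",
})


def plan_metrics(requested: list[str], ground_truth: str | None) -> tuple[list[str], list[str]]:
    """Split the requested metrics into (runnable, skipped)."""
    if ground_truth and ground_truth.strip():
        return list(requested), []
    runnable = [name for name in requested if name not in REFERENCE_METRICS]
    skipped = [name for name in requested if name in REFERENCE_METRICS]
    return runnable, skipped


def merge_skipped(scores: dict[str, float | None], skipped: list[str]) -> dict[str, float | None]:
    """Add the skipped metrics back as None so they serialize to JSON null."""
    merged = dict(scores)
    for name in skipped:
        merged[name] = None
    return merged
```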

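5. Configuring the Dify side

In Dify, go to "Tools" → "Custom Tools" and create a new tool
Fill in the OpenAPI Schema (aligned with the /api/score endpoint)
Authentication: API Key (Bearer) or none
In an Agent / Workflow node, reference the tool and map the question, answer, and contexts variables to the tool inputs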
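
For the OpenAPI Schema step, a hand-written schema could be generated with a few lines of Python and pasted into Dify. This is only a sketch: the base URL, the `operationId`, the title, and the version are placeholders, and the field list mirrors the request table in section 2. Once the endpoint exists, the schema that FastAPI serves at /openapi.json may be a better starting point.

```python
import json

# Hypothetical base URL; use whatever address Dify can reach the service at.
BASE_URL = "http://ragas.example.internal:8000"


def build_tool_schema(base_url: str = BASE_URL) -> dict:
    """Build a minimal OpenAPI document describing POST /api/score."""
    string = {"type": "string"}
    request_schema = {
        "type": "object",
        "required": ["question", "answer", "contexts"],
        "properties": {
            "question": string,
            "answer": string,
            "contexts": string,
            "ground_truth": string,
            "context_separator": string,
            "metrics": {"type": "array", "items": string},
            "judge_model": string,
            "embedding_model": string,
        },
    }
    return {
        "openapi": "3.1.0",
        "info": {"title": "RAGAS score API", "version": "0.1.0"},  # placeholders
        "servers": [{"url": base_url}],
        "paths": {
            "/api/score": {
                "post": {
                    "operationId": "score_answer",  # hypothetical tool name
                    "summary": "Score one question-answer record with RAGAS metrics",
                    "requestBody": {
                        "required": True,
                        "content": {"application/json": {"schema": request_schema}},
                    },
                    "responses": {"200": {"description": "Per-metric scores"}},
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_tool_schema(), indent=2))
```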

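6. Out of scope

Batch scoring endpoint (asynchronous job)
Dify Workflow node plugin (requires the Dify plugin development framework)
Persisting scores to scores.csv
Display integration with the existing report_builder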
