feat(session-async): add /api/score/session_async with incremental session report aggregation

- New POST /api/score/session_async endpoint: calls with the same session_id append to one shared report
- New GET /api/score/sessions/{session_id}: returns call_count, metric_means, and all job records
- New GET /api/score/session/jobs/{job_id}: returns the status of an individual call
- SessionScoreJobManager: deterministic run_id derived from session_id, per-session mutex for CSV append, advisor regenerated on every call
- Added models: SessionScoreRequest (extends ScoreRequest with session_id), SessionScoreJobResponse, SessionStatus
- 24 new tests, all passing

chore(weighted-score): comment out overall weighted score (综合加权得分) display and computation

- report.js: hide the overall weighted score (综合加权得分) card on the report detail page
- score_jobs.js: hide the overall (综合) chip in the async job list
- report_builder.py: overall_ws=None (computation disabled)
- summary.py: weighted_score summary line disabled
- evaluator.py: weighted_score/sample_weight columns no longer written to scores.csv
- score.py /api/score: weighted_score always returns null
- score_job_manager.py + session_score_manager.py: weighted=None
- Updated 3 tests to match the new behaviour (6 pre-existing failures unchanged)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
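
For readers unfamiliar with the pattern, the SessionScoreJobManager behaviour described in this message (a deterministic run_id derived from session_id, plus a per-session mutex around the CSV append) could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `run_id_for_session`, `_lock_for`, and `append_scores`, the SHA-256 based id format, and the directory layout are all hypothetical and are not taken from the actual code in this commit.

```python
import csv
import hashlib
import threading
from pathlib import Path
from typing import Dict, List

# Hypothetical module-level state; the real manager may hold this on an instance.
_session_locks: Dict[str, threading.Lock] = {}
_registry_lock = threading.Lock()


def run_id_for_session(session_id: str) -> str:
    """Derive a stable run_id so every call in a session maps to the same report."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:12]
    return f"session-{digest}"  # assumed format, for illustration only


def _lock_for(session_id: str) -> threading.Lock:
    """Return the mutex for one session, creating it on first use."""
    with _registry_lock:
        return _session_locks.setdefault(session_id, threading.Lock())


def append_scores(report_root: Path, session_id: str, rows: List[dict]) -> Path:
    """Append score rows to the session's shared scores.csv under its mutex."""
    run_dir = report_root / run_id_for_session(session_id)
    csv_path = run_dir / "scores.csv"
    if not rows:
        return csv_path
    with _lock_for(session_id):
        run_dir.mkdir(parents=True, exist_ok=True)
        write_header = not csv_path.exists()
        with csv_path.open("a", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            if write_header:
                writer.writeheader()  # only the first call writes the header
            writer.writerows(rows)
    return csv_path
```

Locking per session rather than globally means concurrent calls for different sessions do not block each other, while two calls sharing a session_id cannot interleave their writes to the same file. Note that this sketch assumes every call in a session supplies the same columns.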
2026-06-26 16:09:33 +08:00
parent e1751447df
commit 754a30ad59
36 changed files with 2004 additions and 51 deletions
--- a/rag_eval/reporting/summary.py
+++ b/rag_eval/reporting/summary.py
@@ -75,15 +75,16 @@ def build_summary_markdown(result: EvaluationResult) -> str:
        else:
            lines.append(f"- {metric}: `n/a`{weight_note}")

-    if has_weights:
-        overall_ws = compute_overall_weighted_score_mean(
-            score_rows_list, result.scenario.metric_weights, result.scenario.doc_weights
-        )
-        weight_suffix = " (加权)"
-        if overall_ws is not None and not math.isnan(overall_ws):
-            lines.append(f"- **weighted_score{weight_suffix}: `{overall_ws:.4f}`**")
-        else:
-            lines.append(f"- **weighted_score{weight_suffix}: `n/a`**")
+    # Overall weighted score (temporarily disabled)
+    # if has_weights:
+    #     overall_ws = compute_overall_weighted_score_mean(
+    #         score_rows_list, result.scenario.metric_weights, result.scenario.doc_weights
+    #     )
+    #     weight_suffix = " (加权)"
+    #     if overall_ws is not None and not math.isnan(overall_ws):
+    #         lines.append(f"- **weighted_score{weight_suffix}: `{overall_ws:.4f}`**")
+    #     else:
+    #         lines.append(f"- **weighted_score{weight_suffix}: `n/a`**")

    detail_columns = ["sample_id", *result.scenario.metrics, "weighted_score", "error"]
    existing_columns = [c for c in detail_columns if c in scores.columns]