siemens_ragas

Author	SHA1	Message	Date
wangwei	e1751447df	feat(advisor): add 0.85 advisory threshold triggering LLM suggestions - Add advisory_threshold=0.85 field to MetricRule (higher-is-better metrics) - diagnose() now emits severity='low' for scores in (warning_threshold, 0.85) - noise_sensitivity (lower-is-better) keeps its existing two-tier thresholds - writer.py: severity labels mapped to Chinese (严重 "critical" / 警告 "warning" / 待优化 "low") - llm_analyzer.py: prompt explains low/warning/critical tiers in Chinese - Tests: 5 new cases for 'low' severity, updated log summary assertions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-25 11:35:49 +08:00
wangwei	a781ba1e4a	config: set default judge_model=gpt-5, embedding_model=text-embedding-3-small gpt-5.4/5.5/5.2/5.4-mini/5.4-nano are incompatible with RAGAS 0.4.3 because they require max_completion_tokens instead of max_tokens. gpt-5 / gpt-4.1 support max_tokens and json_object mode required by RAGAS. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-23 15:29:01 +08:00
wangwei	761faf9c42	feat: add ScoreRequest/ScoreResponse models and SCORE_API_TOKEN setting Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-22 15:00:05 +08:00
wangwei	480f6d66ea	feat: use weighted metric means and add weighted_score row to summary.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-18 16:59:56 +08:00
wangwei	d371ef7d24	feat: add weighted_score and sample_weight columns to score rows Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-18 16:53:45 +08:00
wangwei	8617eaa5aa	feat: add metric_weights and doc_weights to Scenario schema and dataclass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-18 16:50:33 +08:00
wangwei	e0b064587f	feat: add metric/doc weight computation module (weights.py) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-18 16:47:47 +08:00
wangwei	24956bbf75	Update	2026-06-16 18:12:33 +08:00
wangwei	91c0dab4f9	fix(advisor): fix LLM API call, wire advice_markdown to webapp, update .env.example timeouts - llm_analyzer.py: use llm.langchain_llm.ainvoke() (correct RAGAS 0.4.3 API) - webapp/models.py: add advice_markdown field to ReportData - webapp/services/run_reader.py: add read_advice_markdown() reading optimization_advice.md - webapp/services/report_builder.py: pass advice_markdown into ReportData - .env.example: OPENAI_TIMEOUT_SECONDS 30→180, RAGAS_METRIC_TIMEOUT_SECONDS 45→300 Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 17:12:32 +08:00
wangwei	f5c2dce64a	feat(advisor): add optimization advisor module - rag_eval/advisor/: new package with rules engine, LLM analyzer, writer - rules.py: 7-metric diagnostic rules (warning/critical thresholds, top-3 low samples) - llm_analyzer.py: Chinese optimization report via judge_model, graceful fallback - writer.py: writes optimization_advice.md + log summary - __init__.py: run_advisor() entry point (no-op when optimization_advisor=False) - Scenario.optimization_advisor: new bool field (default False) - ScenarioModel: same field added, passed through by loader.py - RunArtifactPaths.advice_md: new path field - factory.py: build_models() now public; build_metric_pipeline() accepts pre-built llm/embeddings - runner.py: lifts llm, passes to pipeline and advisor; calls run_advisor() at end - siemens online YAML: optimization_advisor: true enabled - tests: 9 rules tests + 6 writer tests, all pass - docs: advisor section added to engine-flow.md and architecture.md Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 17:06:19 +08:00
wangwei	629304aa6d	feat(logging): add structured evaluation logs for metric-level debugging - pipeline.py: log each metric score/timeout/error with sample_id, elapsed time, and score value; log NaN list per sample; progress counter N/total after each sample completes - evaluator.py: log eval start, dataset counts, adapter enrichment progress (per-sample OK/FAIL with elapsed), metric scoring summary, and per-metric NaN rate at end of run - runner.py: _setup_logging() helper writes to stderr + optional file; ragas/httpx/openai noisy loggers throttled to WARNING - main.py: add --log-file and --log-level CLI flags Usage: python main.py --scenario scenarios/online/... --log-file logs/eval.log --log-level DEBUG Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-16 10:48:41 +08:00
wangwei	1ff4a3943a	feat(dataset-builder): add retry logic and ASCII-safe logging for Siemens PDF pipeline - question_generator.py: add max_retries=3/retry_delay=5s loop with exponential backoff on LLM timeout or server errors; encode filenames with ascii/replace before printing to avoid UnicodeEncodeError on Windows cp1252 consoles - runner.py: encode PDF filenames ASCII-safe for progress messages; catch generation failures per-document and skip (or re-raise) based on failure_mode, preventing one bad doc from aborting the whole build Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-15 23:06:33 +08:00
wangwei	e89695e490	Add RAGAS evaluation web console (FastAPI + vanilla JS) - webapp/: FastAPI backend with runs/scenarios/evaluations API routers; services for run_reader, report_builder, scenario_scanner, task_manager (lazy ragas import — server boots even without ragas); Pydantic models - webapp/static/: single-page console (layout A: left-nav + main area); report detail with metric cards, Chart.js distribution histogram, grouping table, lowest-score sample review; trigger evaluation + log polling - webmain.py: uvicorn entry point (alongside existing main.py CLI) - start.bat: Windows one-click launcher with env checks and auto-browser open - rag_eval/datasets/: implement missing loader + normalizer modules (load_dataset_records, normalize_records) required by evaluator - scripts/seed_sample_run.py: generate realistic demo run artifacts - .gitignore: exclude datasets/ data files but keep rag_eval/datasets/ source Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>	2026-06-15 15:53:57 +08:00
Guangfei.Zhao	9cbdc1d95d	first commit	2026-06-12 14:02:15 +08:00

14 Commits
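
Note on e1751447df: the three-tier severity logic described in that commit message can be sketched roughly as follows. This is an illustration only, not the repository's code. `MetricRule`, `warning_threshold`, `advisory_threshold=0.85`, and the severity names come from the commit message; the `critical_threshold` field name, the `classify` helper (a stand-in for `diagnose()`), the threshold values 0.5 and 0.7, and the handling of scores exactly on a boundary are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricRule:
    """Thresholds for a higher-is-better metric (sketch, not the real class)."""
    name: str
    critical_threshold: float          # hypothetical field name and value
    warning_threshold: float           # value below is an assumed example
    advisory_threshold: float = 0.85   # default taken from the commit message


def classify(rule: MetricRule, score: float) -> Optional[str]:
    """Return 'critical', 'warning', 'low', or None for a healthy score.

    'low' covers the open interval (warning_threshold, advisory_threshold),
    as the commit message states; boundary handling elsewhere is assumed.
    """
    if score <= rule.critical_threshold:
        return "critical"
    if score <= rule.warning_threshold:
        return "warning"
    if score < rule.advisory_threshold:
        return "low"
    return None


# Example rule with assumed thresholds.
rule = MetricRule(name="faithfulness", critical_threshold=0.5, warning_threshold=0.7)
for s in (0.4, 0.7, 0.8, 0.85):
    print(s, classify(rule, s))
```

Per the same commit, a lower-is-better metric such as noise_sensitivity keeps its two-tier thresholds, so it would need a separate rule with the comparisons reversed and no 'low' tier.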