refine agent guidance and project split

2026-03-26 17:11:31 +08:00
parent 19e1e1e2cf
commit 33d3759bef
2 changed files with 215 additions and 3 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,154 @@
+# AGENTS.md
+
+This file guides agentic coding assistants operating in this repository.
+Current workspace: `C:\Users\A200477427\Learnings\AIOps-Docs`.
+
+## 1) Repository Overview
+- Repository type: docs-first (no in-repo application runtime yet).
+- Main artifacts: architecture/proposal markdown documents and diagrams.
+- Key markdown docs:
+  - `README.md`
+  - `AIOps_Product_Architecture_and_Commercialization.md`
+  - `AIOps_Architecture_Diagram_Explanation.md`
+  - `AIOps_Project_Proposal.md`
+- Key binary assets:
+  - `AIOps_Practical_Route_Architecture.png`
+  - `Implementable_AIOps_Platform_Route_A_Architecture.png`
+  - `AIOps_智能运维平台项目立项书.pdf`
+
+## 2) Rule Sources and Precedence
+Follow instructions in this order:
+1. System/developer/runtime instructions from the active harness.
+2. Repository Cursor/Copilot rules (if present).
+3. This `AGENTS.md` file.
+4. Existing repository patterns.
+
+Rule-file status checked in this repo:
+- `.cursorrules`: not found.
+- `.cursor/rules/`: not found.
+- `.github/copilot-instructions.md`: not found.
+
+Agent requirement:
+- Re-check the three rule locations listed above before major edits.
+- If any are added later, treat them as higher priority than this file.
+
+## 3) Optimization Goals
+- Keep terminology consistent across all long-form docs.
+- Prefer focused, incremental edits over broad rewrites.
+- Keep markdown content and diagram references synchronized.
+- Avoid inventing tooling/process assumptions not present in repo.
+
+## 4) Build/Lint/Test Commands
+Important: this repository currently has no formal build system, linter config, or test suite.
+
+### 4.1 Current command reality
+- Build command: not applicable.
+- Lint command: not enforced by repository config.
+- Test command: not applicable.
+- Single-test command: not applicable (no tests exist yet).
+
+### 4.2 Optional local documentation checks
+Run only when Node.js is available in the environment.
+
+```bash
+# Lint all markdown files (optional)
+npx markdownlint-cli2 "**/*.md"
+
+# Lint one markdown file (single-file check)
+npx markdownlint-cli2 "README.md"
+```
+
+### 4.3 Future test templates (if tests are added)
+Update this section with actual project commands when a framework is introduced.
+
+```bash
+# Pytest: run all tests
+pytest
+
+# Pytest: run one test function
+pytest tests/test_example.py::test_specific_case
+```
+
+## 5) Documentation Style Guidelines
+
+### 5.1 Language and tone
+- Default language is Chinese with technical English terms preserved.
+- Keep writing professional, specific, and implementation-oriented.
+- Prefer concrete statements over vague strategic wording.
+- Keep terms and definitions stable within each document.
+
+### 5.2 Structure and formatting
+- Use one top-level `#` heading per file.
+- Use clear heading hierarchy (`##`, `###`) with logical progression.
+- Keep numeric section prefixes when a document already uses them.
+- Use `---` only when it improves readability.
+
+### 5.3 Terminology consistency
+- Wrap module/service identifiers in backticks.
+- Keep canonical terms consistent across docs:
+  - `Incident`
+  - `RCA Result`
+  - `NormalizedEvent`
+  - `IncidentContext`
+  - `execution-gateway`
+  - `policy-engine`
+- Keep status flow wording consistent:
+  - `new -> triaged -> diagnosing -> remediating -> resolved/closed`
+
+### 5.4 Lists, tables, and links
+- Use bullets for principles, scope, constraints, and key points.
+- Use ordered lists for procedures and time-based sequences.
+- Use tables for role splits, capability matrices, and KPI definitions.
+- Keep table headers explicit and unambiguous.
+- Use relative links for local docs/images and verify exact filenames.
+
+## 6) Code Style Guidelines (for future code/scripts)
+This repo is docs-only today.
+If code/scripts are added, apply these defaults unless project configs override them.
+
+### 6.1 Imports
+- Do not use wildcard imports.
+- Group imports as: standard library, third-party, local modules.
+- Keep one import per line unless language idioms require grouping.
+- Remove unused imports in the same change.
+
+### 6.2 Formatting
+- Prefer automated formatters once configured.
+- Keep functions small and focused.
+- Avoid dead code and commented-out legacy blocks.
+
+### 6.3 Types and contracts
+- Add explicit types for public functions and interfaces.
+- Prefer narrow, precise types over broad untyped structures.
+- Validate external input at boundaries.
+- Keep schema/type names aligned with AIOps domain terminology.
+
+### 6.4 Naming
+- Use descriptive, domain-accurate names.
+- Avoid unclear abbreviations (except standard terms like RCA/KPI/SLA/MTTR).
+- Python naming: `snake_case` for functions/variables, `PascalCase` for classes.
+- TypeScript/JS naming: `camelCase` for functions/variables, `PascalCase` for types/classes.
+
+### 6.5 Error handling and logging
+- Fail fast on invalid state; do not silently swallow exceptions.
+- Return actionable errors with debugging context.
+- Separate user-facing errors from internal diagnostic details.
+- Prefer structured logs including incident id/service/action when available.
+- For automation actions, record actor, timestamp, and result.
+
+## 7) Change Management
+- Do not rename core docs unless explicitly requested.
+- Preserve historical rationale in proposal/architecture documents.
+- If canonical terms change in one doc, update related docs in the same change.
+- Keep diagram references and explanation text mutually consistent.
+- Prefer small, reviewable changes grouped by topic.
+
+## 8) Agent Completion Checklist
+Before finishing a task, verify:
+- Only necessary files changed.
+- Terminology is consistent with existing AIOps docs.
+- Local links and image references resolve.
+- Any optional checks you ran are reported with outcomes.
+- If no tests exist, state that explicitly.
+
+If repository structure or tooling changes, update this file in the same PR.
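
The status flow in section 5.3 (`new -> triaged -> diagnosing -> remediating -> resolved/closed`) can be read as a small transition table. The sketch below is illustrative only: the repository contains no code, `ALLOWED_TRANSITIONS` and `advance` are hypothetical names, and it assumes a strictly linear flow in which `remediating` may end in either `resolved` or `closed`. Real rules (reopening, skipping a stage) may differ.

```python
# Hypothetical sketch: names and transition rules are assumptions derived
# only from the status flow wording in AGENTS.md section 5.3.
ALLOWED_TRANSITIONS = {
    "new": {"triaged"},
    "triaged": {"diagnosing"},
    "diagnosing": {"remediating"},
    "remediating": {"resolved", "closed"},
    "resolved": set(),  # terminal in this sketch
    "closed": set(),    # terminal in this sketch
}


def advance(current: str, target: str) -> str:
    """Return target if the move is allowed; fail fast otherwise."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


if __name__ == "__main__":
    status = "new"
    for nxt in ("triaged", "diagnosing", "remediating", "resolved"):
        status = advance(status, nxt)
    print(status)
```

Raising on an illegal move follows the "fail fast on invalid state" guidance in section 6.5 rather than silently ignoring the request.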
--- a/AIOps_Product_Architecture_and_Commercialization.md
+++ b/AIOps_Product_Architecture_and_Commercialization.md
@@ -222,10 +222,61 @@
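 - The platform is responsible for `state changes`, `execution authorization`, `audit archiving`, and `SLA statistics`.
 - Dify must not modify production resources directly; every execution request must pass through `execution-gateway` and `policy-engine`.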

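+### Suggested project repository split (recommended for MVP)
+
+The current architecture is best advanced as three project domains, which keeps boundaries clear and supports parallel development and later replacement or evolution:
+
+- `aiops-platform`: main platform repository, hosting the frontend, platform backend, main Incident flow, approvals, audit, timeline, and execution console.
+- `aiops-tools`: tools and execution adapter repository, hosting query tools, the execution gateway, security governance, and adapters for underlying systems.
+- `aiops-workflow`: agent workflow and RAG asset repository, hosting Dify workflows, prompts, knowledge base configuration, structured output schemas, and evaluation samples.
+
+Note that this is a suggested split by project domain, not a requirement to build three heavyweight microservice stacks from day one. The MVP can be built as three repositories while keeping deployment moderately consolidated, prioritizing integration-testing efficiency and delivery speed.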
+
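+### Responsibility boundaries of the three projects
+
+#### `aiops-platform`
+
+- Owns "how incidents flow, who approves, and how results are presented".
+- Acts as the sole control layer and holds core business data such as `incident`, `incident_timeline`, `remediation_action`, and `audit_log`.
+- Calls `aiops-workflow` to obtain structured RCA and action recommendations.
+- Calls `aiops-tools` to obtain execution results or to trigger controlled actions after approval.
+- Does not own the details of AI workflow orchestration and is not directly coupled to the underlying Prometheus / Loki / K8s API protocols.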
+
+#### `aiops-tools`
+
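+- Owns "how to query data safely and how to execute actions safely".
+- Provides standardized capabilities such as `query_metrics`, `query_logs`, `query_k8s`, `query_changes`, and `execute_action`.
+- Owns security governance capabilities: authentication/authorization, rate limiting, timeouts, retries, idempotency, audit instrumentation, and rollback entry points.
+- Integrates with real systems such as Prometheus, Loki/ELK, the K8s API, and Ansible.
+- Does not own the Incident lifecycle, the UI pages, or the final RCA orchestration logic.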
+
+#### `aiops-workflow`
+
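+- Owns "how to diagnose, how to organize evidence, and how to output structured RCA and recommendations".
+- Uses Dify workflows as the core vehicle, managing diagnostic workflows, prompts, RAG retrieval strategies, output schemas, and evaluation samples.
+- Completes the evidence chain by calling the query tools exposed by `aiops-tools`.
+- Does not hold the main Incident state and has no approval or execution-control responsibilities.
+- Must not hold production execution permissions directly; execution recommendations must go back to the platform for a decision.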
+
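+### Technical route: workflow before multi-agent
+
+For the current MVP stage, a "Dify workflow first" route is recommended over introducing complex multi-agent collaboration from the start, for the following reasons:
+
+- Workflows are a better fit for the fixed inputs, structured outputs, and auditability requirements of the current AIOps scenarios.
+- Workflows make it easier to control latency, cost, failure degradation, and consistency of the structured schema.
+- Workflows are a better fit for stable chains such as `evidence -> conclusion -> confidence -> actions`.
+- Multi-agent setups suit complex, exploratory diagnosis better, but they significantly increase debugging cost, latency, and result uncertainty.
+
+The principles for the current stage of this project are therefore:
+
+- In engineering terms, a `skill` is implemented first as a clearly defined workflow with stable inputs and outputs.
+- Complex diagnostic capabilities can gradually introduce agent-style nodes in later stages, but they are not the default implementation for the MVP.
+- Upgrading a local capability to multi-agent is considered only when complex cross-domain failures, conflicting evidence, or multi-round role collaboration make it genuinely necessary.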
+
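 ### Suggested component responsibilities

 - Platform layer (in-house frontend and backend, corresponding to Layer C / Layer E / part of Layer F in the diagram): Incident API, state machine, execution console, audit center.
-- Dify layer (corresponding to the Dify Platform in Layer D of the diagram): diagnostic Agent, RAG management, tool-call orchestration.
+- Workflow layer (corresponding to the Dify Platform in Layer D of the diagram): diagnostic workflows, RAG management, tool-call orchestration, structured output control.
 - Tools layer (Python services, corresponding to the query tools and executor adapter layer): metrics/logs/k8s/change queries and the executor gateway.

 ### Detailed design of the three-layer projects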
@@ -256,11 +307,12 @@
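 - Sprint 2: execution console + approval flow + full timeline replay.
 - Sprint 3: audit center + KPI dashboard + exception handling and alerting.

-#### 2) Dify layer (Agent and RAG hub)
+#### 2) Workflow layer (Agent Workflow and RAG hub)

 **Project goals**

 - Provide explainable, traceable diagnostic capabilities that output structured RCA and action recommendations.
+- Codify diagnostic paths, tool-call order, and output schemas as workflows, prioritizing stability and auditability.

 **Core subsystems**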

@@ -283,6 +335,12 @@
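 - Sprint 2: improve RAG (tagging, reranking, citations) and stabilize structured output.
 - Sprint 3: add a low-confidence degradation strategy and evaluation-set regression (RCA accuracy, hallucination rate).

+**Implementation principles**
+
+- A `skill` is captured first as a standard workflow rather than free-form multi-agent collaboration.
+- Query tools may be called directly by a workflow; execution capabilities must go back to the platform for a decision.
+- Output must strictly conform to the schema agreed with the platform, so that free text never enters the automation chain directly.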
+
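
The requirement that workflow output strictly conform to an agreed schema can be illustrated with a boundary check on the platform side. The sketch below is an assumption-laden example, not the platform's actual schema: the field names simply mirror the `evidence -> conclusion -> confidence -> actions` chain, the `[0, 1]` confidence range is assumed, and `RcaResult` and `parse_rca_result` are hypothetical names.

```python
# Hypothetical sketch: the real platform schema is not shown in the source.
# Field names mirror the chain evidence -> conclusion -> confidence -> actions.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class RcaResult:
    evidence: List[str]
    conclusion: str
    confidence: float  # assumed to be a probability-like score in [0, 1]
    actions: List[str]


def parse_rca_result(payload: Dict[str, Any]) -> RcaResult:
    """Validate a workflow output dict before it enters the automation chain."""
    expected = {"evidence", "conclusion", "confidence", "actions"}
    if set(payload) != expected:
        raise ValueError(f"unexpected fields: {sorted(set(payload) ^ expected)}")
    for name in ("evidence", "actions"):
        value = payload[name]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{name} must be a list of strings")
    conclusion = payload["conclusion"]
    if not isinstance(conclusion, str) or not conclusion.strip():
        raise ValueError("conclusion must be a non-empty string")
    confidence = payload["confidence"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    return RcaResult(payload["evidence"], conclusion, float(confidence), payload["actions"])
```

Rejecting unknown or missing fields outright, instead of ignoring them, keeps free text from slipping through under an unexpected key.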
 #### 3) Tools layer (Python services)

 **Project goals**
@@ -304,7 +362,7 @@
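 #### Three-layer integration order (must be followed in sequence)

 1. The platform layer first completes the Incident state machine and the diagnosis trigger entry point.
-2. The Dify layer returns structured RCA (automatic execution is not connected yet).
+2. The Workflow layer returns structured RCA (automatic execution is not connected yet).
 3. Once the tools layer completes the query tools, fill in the evidence chain.
 4. Finally, connect the execution gateway and approval flow, and enable low-risk automated actions.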
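
The last integration step, where only low-risk actions run automatically and everything else waits for approval, could be gated roughly as follows. This is a sketch under stated assumptions: the risk levels, the `ActionRequest` and `authorize` names, the `auto-policy` actor, and the rule that low risk needs no approver are all illustrative and do not come from the documents. The returned record carries the actor, timestamp, and result fields that AGENTS.md section 6.5 asks automation actions to log.

```python
# Hypothetical sketch: risk levels, names, and the "low risk runs
# automatically" rule are illustrative assumptions, not the project's policy.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional

RISK_LEVELS = {"low", "medium", "high"}


@dataclass(frozen=True)
class ActionRequest:
    incident_id: str
    action: str
    risk: str
    approved_by: Optional[str] = None  # set by the platform approval flow


def authorize(request: ActionRequest) -> Dict[str, Any]:
    """Return an audit record stating whether the action may run."""
    if request.risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {request.risk!r}")
    if request.risk == "low":
        # Low-risk actions are allowed without a human approver.
        allowed, actor = True, request.approved_by or "auto-policy"
    else:
        # Anything else runs only after an approver has been recorded.
        allowed, actor = request.approved_by is not None, request.approved_by
    return {
        "incident_id": request.incident_id,
        "action": request.action,
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "result": "allowed" if allowed else "pending_approval",
    }
```

For example, `authorize(ActionRequest("INC-1", "rollback_release", "high"))` yields a `pending_approval` record until the request carries an `approved_by` value.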