diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..0bddc04
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,154 @@
+# AGENTS.md
+
+This file guides agentic coding assistants operating in this repository.
+Current workspace: `C:\Users\A200477427\Learnings\AIOps-Docs`.
+
+## 1) Repository Overview
+- Repository type: docs-first (no in-repo application runtime yet).
+- Main artifacts: architecture/proposal markdown documents and diagrams.
+- Key markdown docs:
+  - `README.md`
+  - `AIOps_Product_Architecture_and_Commercialization.md`
+  - `AIOps_Architecture_Diagram_Explanation.md`
+  - `AIOps_Project_Proposal.md`
+- Key binary assets:
+  - `AIOps_Practical_Route_Architecture.png`
+  - `Implementable_AIOps_Platform_Route_A_Architecture.png`
+  - `AIOps_智能运维平台项目立项书.pdf`
+
+## 2) Rule Sources and Precedence
+Follow instructions in this order:
+1. System/developer/runtime instructions from the active harness.
+2. Repository Cursor/Copilot rules (if present).
+3. This `AGENTS.md` file.
+4. Existing repository patterns.
+
+Rule-file status checked in this repo:
+- `.cursorrules`: not found.
+- `.cursor/rules/`: not found.
+- `.github/copilot-instructions.md`: not found.
+
+Agent requirement:
+- Re-check the 3 rule locations before major edits.
+- If any are added later, treat them as higher priority than this file.
+
+## 3) Optimization Goals
+- Keep terminology consistent across all long-form docs.
+- Prefer focused, incremental edits over broad rewrites.
+- Keep markdown content and diagram references synchronized.
+- Avoid inventing tooling/process assumptions not present in repo.
+
+## 4) Build/Lint/Test Commands
+Important: this repository currently has no formal build system, linter config, or test suite.
+
+### 4.1 Current command reality
+- Build command: not applicable.
+- Lint command: not enforced by repository config.
+- Test command: not applicable.
+- Single-test command: not applicable (no tests exist yet).
+
+### 4.2 Optional local documentation checks
+Run only when Node.js is available in the environment.
+
+```bash
+# Lint all markdown files (optional)
+npx markdownlint-cli2 "**/*.md"
+
+# Lint one markdown file (single-file check)
+npx markdownlint-cli2 "README.md"
+```
+
+### 4.3 Future test templates (if tests are added)
+Update this section with actual project commands when a framework is introduced.
+
+```bash
+# Pytest: run all tests
+pytest
+
+# Pytest: run one test function
+pytest tests/test_example.py::test_specific_case
+```
+
+## 5) Documentation Style Guidelines
+
+### 5.1 Language and tone
+- Default language is Chinese with technical English terms preserved.
+- Keep writing professional, specific, and implementation oriented.
+- Prefer concrete statements over vague strategic wording.
+- Keep terms and definitions stable within each document.
+
+### 5.2 Structure and formatting
+- Use one top-level `#` heading per file.
+- Use clear heading hierarchy (`##`, `###`) with logical progression.
+- Keep numeric section prefixes when a document already uses them.
+- Use `---` only when it improves readability.
+
+### 5.3 Terminology consistency
+- Wrap module/service identifiers in backticks.
+- Keep canonical terms consistent across docs:
+  - `Incident`
+  - `RCA Result`
+  - `NormalizedEvent`
+  - `IncidentContext`
+  - `execution-gateway`
+  - `policy-engine`
+- Keep status flow wording consistent:
+  - `new -> triaged -> diagnosing -> remediating -> resolved/closed`
+
+### 5.4 Lists, tables, and links
+- Use bullets for principles, scope, constraints, and key points.
+- Use ordered lists for procedures and time-based sequences.
+- Use tables for role splits, capability matrices, and KPI definitions.
+- Keep table headers explicit and unambiguous.
+- Use relative links for local docs/images and verify exact filenames.
+
+## 6) Code Style Guidelines (for future code/scripts)
+This repo is docs-only today.
+If code/scripts are added, apply these defaults unless project configs override them.
+
+### 6.1 Imports
+- Do not use wildcard imports.
+- Group imports as: standard library, third-party, local modules.
+- Keep one import per line unless language idioms require grouping.
+- Remove unused imports in the same change.
+
+### 6.2 Formatting
+- Prefer automated formatters once configured.
+- Keep functions small and focused.
+- Avoid dead code and commented-out legacy blocks.
+
+### 6.3 Types and contracts
+- Add explicit types for public functions and interfaces.
+- Prefer narrow, precise types over broad untyped structures.
+- Validate external input at boundaries.
+- Keep schema/type names aligned with AIOps domain terminology.
+
+### 6.4 Naming
+- Use descriptive, domain-accurate names.
+- Avoid unclear abbreviations (except standard terms like RCA/KPI/SLA/MTTR).
+- Python naming: `snake_case` for functions/variables, `PascalCase` for classes.
+- TypeScript/JS naming: `camelCase` for functions/variables, `PascalCase` for types/classes.
+
+### 6.5 Error handling and logging
+- Fail fast on invalid state; do not silently swallow exceptions.
+- Return actionable errors with debugging context.
+- Separate user-facing errors from internal diagnostic details.
+- Prefer structured logs including incident id/service/action when available.
+- For automation actions, record actor, timestamp, and result.
+
+## 7) Change Management
+- Do not rename core docs unless explicitly requested.
+- Preserve historical rationale in proposal/architecture documents.
+- If canonical terms change in one doc, update related docs in the same change.
+- Keep diagram references and explanation text mutually consistent.
+- Prefer small, reviewable changes grouped by topic.
+
+## 8) Agent Completion Checklist
+Before finishing a task, verify:
+- Only necessary files changed.
+- Terminology is consistent with existing AIOps docs.
+- Local links and image references resolve.
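+- Any optional checks you ran are reported with outcomes.
+- If no tests exist, state that explicitly.
+
+If repository structure or tooling changes, update this file in the same PR.
diff --git a/AIOps_Product_Architecture_and_Commercialization.md b/AIOps_Product_Architecture_and_Commercialization.md
index 908fdda..f57c636 100644
--- a/AIOps_Product_Architecture_and_Commercialization.md
+++ b/AIOps_Product_Architecture_and_Commercialization.md
@@ -222,10 +222,61 @@
 - The platform owns: `state changes`, `execution authorization`, `audit archiving`, `SLA statistics`.
 - Dify must not modify production resources directly; every execution request must pass through `execution-gateway` and `policy-engine`.
 
+### Recommended Repository Split (MVP)
+
+The current architecture is best advanced as 3 project domains, which keeps boundaries clear, allows parallel development, and makes later replacement and evolution easier:
+
+- `aiops-platform`: the main platform repository, hosting the frontend, the platform backend, the main Incident flow, approvals, auditing, the timeline, and the execution console.
+- `aiops-tools`: the tools and execution adapter repository, hosting query tools, the execution gateway, security governance, and adapters for underlying systems.
+- `aiops-workflow`: the Agent workflow and RAG asset repository, hosting Dify workflows, prompts, knowledge base configuration, structured output schemas, and evaluation samples.
+
+Note that this is a recommended split by project domain, not a requirement to build 3 heavyweight microservices from day one. During the MVP, the work can be organized as 3 repositories while deployment stays moderately consolidated, prioritizing integration efficiency and delivery speed.
+
+### Responsibility Boundaries of the Three Projects
+
+#### `aiops-platform`
+
+- Owns "how incidents flow, who approves, and how results are presented".
+- Acts as the sole control layer and holds the core business data, including `incident`, `incident_timeline`, `remediation_action`, and `audit_log`.
+- Calls `aiops-workflow` to obtain structured RCA and action recommendations.
+- Calls `aiops-tools` to obtain execution results or to trigger controlled actions after approval.
+- Does not own the details of AI workflow orchestration and is not directly coupled to the underlying Prometheus / Loki / K8s API protocols.
+
+#### `aiops-tools`
+
+- Owns "how to query data safely and how to execute actions safely".
+- Provides standardized capabilities such as `query_metrics`, `query_logs`, `query_k8s`, `query_changes`, and `execute_action`.
+- Owns the security governance capabilities: authentication and authorization, rate limiting, timeouts, retries, idempotency, audit instrumentation, and rollback entry points.
+- Integrates with real systems such as Prometheus, Loki/ELK, the K8s API, and Ansible.
+- Does not own the Incident lifecycle, the UI pages, or the final RCA orchestration logic.
+
+#### `aiops-workflow`
+
+- Owns "how to diagnose, how to organize evidence, and how to output structured RCA and recommendations".
+- Uses Dify workflows as its core vehicle and manages diagnostic workflows, prompts, RAG retrieval strategies, output schemas, and evaluation samples.
+- Fills in the evidence chain by calling the query tools exposed by `aiops-tools`.
+- Does not hold the main Incident state and has no approval or execution-control responsibilities.
+- Must not hold production execution permissions directly; execution recommendations must go back to the platform for a decision.
+
+### Technical Route: Workflow Before Multi-Agent
+
+In the current MVP stage, the recommended route is "Dify workflow first" rather than introducing complex multi-agent collaboration from the start, for the following reasons:
+
+- Workflows are a better fit for the fixed inputs, structured outputs, and auditability that the current AIOps scenarios require.
+- Workflows make it easier to control latency, cost, failure degradation, and structured schema consistency.
+- Workflows are a better fit for stable chains such as `evidence -> conclusion -> confidence -> actions`.
+- Multi-agent setups suit complex exploratory diagnosis better, but they significantly increase debugging cost, latency, and result uncertainty.
+
+The principles for the current stage of this project are therefore:
+
+- In engineering terms, a `skill` is implemented first as a workflow with a clear definition and stable inputs and outputs.
+- Complex diagnostic capabilities can introduce agent-style nodes gradually in later stages, but these are not the default MVP implementation.
+- Only when complex cross-domain failures, conflicting evidence, or multi-round role collaboration genuinely require it should a local capability be upgraded to multi-agent.
+
 ### Recommended Component Division
 
 - Platform layer (frontend and backend built in-house, corresponding to Layer C / Layer E / part of Layer F in the diagram): Incident API, state machine, execution console, audit center.
-- Dify layer (corresponding to the Dify Platform in Layer D of the diagram): diagnostic Agent, RAG management, tool-call orchestration.
+- Workflow layer (corresponding to the Dify Platform in Layer D of the diagram): diagnostic workflows, RAG management, tool-call orchestration, structured output control.
 - Tools layer (Python service, corresponding to the query tools and executor adapter layer): metrics/logs/k8s/change queries and the executor gateway.
 
 ### Detailed Design of the Three-Layer Projects
@@ -256,11 +307,12 @@
 - Sprint 2: execution console + approval flow + full timeline replay.
 - Sprint 3: audit center + KPI dashboard + exception handling and alerting.
 
-#### 2) Dify Layer (Agent and RAG Hub)
+#### 2) Workflow Layer (Agent Workflow and RAG Hub)
 
 **Project Goals**
 
 - Provide explainable, traceable diagnostic capabilities that output structured RCA and action recommendations.
+- Fix the diagnostic path, tool-call order, and output schema as workflows, prioritizing stability and auditability.
 
 **Core Subsystems**
 
@@ -283,6 +335,12 @@
 - Sprint 2: improve RAG (tags, reranking, citations) and stabilize structured output.
 - Sprint 3: add a low-confidence fallback strategy and evaluation-set regression (RCA accuracy, hallucination rate).
 
+**Implementation Principles**
+
+- A `skill` is consolidated first into a standard workflow rather than free-form multi-agent collaboration.
+- Query tools may be called directly by workflows; execution capabilities must go back to the platform for a decision.
+- Output must strictly conform to the schema agreed with the platform, so that free text never enters the automation chain directly.
+
 #### 3) Tools Layer (Python Service)
 
 **Project Goals**
@@ -304,7 +362,7 @@
 #### Three-Layer Integration Order (Must Be Followed in Sequence)
 
 1. The platform layer first completes the Incident state machine and the diagnosis trigger entry point.
-2. The Dify layer returns structured RCA (automatic execution not connected yet).
+2. The Workflow layer returns structured RCA (automatic execution not connected yet).
 3. Once the tools layer has completed the query tools, fill in the evidence chain.
 4. Finally, connect the execution gateway and approval flow, and enable low-risk automatic actions.
 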
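
Review note: the Workflow-layer principles in this change require output to match the platform schema, a low-confidence fallback, and that execution suggestions return to the platform for a decision. A minimal sketch of such a gate on the platform side might look like the following. The field names are assumed from the `evidence -> conclusion -> confidence -> actions` chain; `RcaResult`, `parse_rca_result`, `next_step`, the step labels, and the `0.6` threshold are all hypothetical and do not come from the changed documents.

```python
from dataclasses import dataclass
from typing import Any

# Assumed field names, taken from the chain named in the diff:
# evidence -> conclusion -> confidence -> actions.
REQUIRED_FIELDS = ("evidence", "conclusion", "confidence", "actions")

# Assumed cut-off for the low-confidence fallback; the diff gives no value.
LOW_CONFIDENCE_THRESHOLD = 0.6


@dataclass(frozen=True)
class RcaResult:
    evidence: list
    conclusion: str
    confidence: float
    actions: list


def parse_rca_result(payload: Any) -> RcaResult:
    """Accept only a well-formed structured payload; reject free text."""
    if not isinstance(payload, dict):
        raise ValueError("RCA output must be a structured object, not free text")
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"RCA output is missing fields: {missing}")
    evidence, conclusion = payload["evidence"], payload["conclusion"]
    confidence, actions = payload["confidence"], payload["actions"]
    if not isinstance(evidence, list) or not evidence:
        raise ValueError("evidence must be a non-empty list")
    if not isinstance(conclusion, str) or not conclusion.strip():
        raise ValueError("conclusion must be a non-empty string")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    if not isinstance(actions, list):
        raise ValueError("actions must be a list")
    return RcaResult(list(evidence), conclusion, float(confidence), list(actions))


def next_step(result: RcaResult) -> str:
    """Decide what the platform does with a validated result.

    Nothing is executed here: suggested actions only ever reach approval.
    """
    if result.confidence < LOW_CONFIDENCE_THRESHOLD:
        return "manual_review"
    if result.actions:
        return "pending_approval"
    return "record_only"
```

In this sketch a workflow response is parsed once at the platform boundary, and only the returned label drives the Incident flow, so a malformed or free-text response fails before it can reach the approval or execution path.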