AIRegulation-DocAnalysis/docs/superpowers/plans/2026-06-05-perception-intelligence.md
2026-06-08 11:16:28 +08:00


Regulatory Signals Intelligence Enhancement — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace MockEventStore with real regulatory data from CATARC / 国标委 / EUR-Lex / UN-ECE, add LLM-driven structure extraction + impact assessment + semantic diff, and expose all of this through a manual-trigger crawl UI.

Architecture: New BaseEventStore ABC → PostgresEventStore implementation (psycopg2, same pattern as PostgresDocumentRepository) → CrawlService orchestrates 4 crawlers + LlmPipeline → 3 new API endpoints (SSE crawl progress, single-event process, diff detail) → bootstrap.py selects store by DOCUMENT_REPOSITORY_BACKEND → frontend adds crawl bar + detail tabs.

Tech Stack: httpx (already in requirements), BeautifulSoup4 + lxml (new), psycopg2-binary (already present), existing LLM factory (app.services.llm.llm_factory), existing OpenAICompatibleEmbeddingProvider for semantic diff, FastAPI SSE (existing pattern from perception.py + async_utils.iter_in_thread).
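The store selection described in the architecture can be pictured roughly as follows. This is a minimal sketch, not the actual bootstrap.py: the classes are trimmed stand-ins, `select_event_store` is a hypothetical helper name, and `"postgres"` is an assumed value for DOCUMENT_REPOSITORY_BACKEND.

```python
from __future__ import annotations

import os
from abc import ABC, abstractmethod


class BaseEventStore(ABC):
    """Trimmed stand-in for the plan's port interface."""

    @abstractmethod
    def all(self) -> list[dict]: ...


class MockEventStore(BaseEventStore):
    def all(self) -> list[dict]:
        return [{"id": "mock-001"}]


class PostgresEventStore(BaseEventStore):
    # Stand-in only: the real class opens a connection pool in __init__.
    def all(self) -> list[dict]:
        return []


def select_event_store(backend: str | None = None) -> BaseEventStore:
    """Pick a store implementation from the configured backend name.

    Assumption: "postgres" is the value that enables the PostgreSQL store;
    anything else falls back to the in-memory mock.
    """
    value = (backend or os.getenv("DOCUMENT_REPOSITORY_BACKEND", "")).strip().lower()
    if value == "postgres":
        return PostgresEventStore()
    return MockEventStore()
```

Callers depend only on BaseEventStore, so switching backends should not require changes outside the bootstrap module.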


File Map

| Action | Path | Purpose |
|--------|------|---------|
| Create | backend/app/infrastructure/perception/base_event_store.py | ABC with all/get/filter/stats/upsert/get_by_standard_code |
| Modify | backend/app/infrastructure/perception/mock_event_store.py | Inherit BaseEventStore |
| Create | backend/app/infrastructure/perception/postgres_event_store.py | PostgreSQL-backed store |
| Create | backend/app/infrastructure/perception/crawlers/__init__.py | Package init |
| Create | backend/app/infrastructure/perception/crawlers/base.py | RawEvent dataclass + BaseCrawler ABC |
| Create | backend/app/infrastructure/perception/crawlers/catarc_crawler.py | CATARC scraper |
| Create | backend/app/infrastructure/perception/crawlers/guobiao_crawler.py | 国标委 JSON API crawler |
| Create | backend/app/infrastructure/perception/crawlers/eurlex_crawler.py | EUR-Lex RSS + CELLAR |
| Create | backend/app/infrastructure/perception/llm_pipeline.py | Extract / assess / diff |
| Create | backend/app/application/perception/crawl_service.py | Orchestrates crawlers + pipeline |
| Modify | backend/app/application/perception/services.py | Type hint: BaseEventStore instead of MockEventStore |
| Modify | backend/app/api/routes/perception.py | Add 3 new endpoints |
| Modify | backend/app/shared/bootstrap.py | Wire new classes; add get_crawl_service() |
| Modify | backend/app/config/settings.py | 3 new perception settings |
| Modify | backend/.env + .env.example | New env vars |
| Modify | backend/requirements.txt | Add beautifulsoup4, lxml |
| Modify | frontend/src/pages/Perception/PerceptionPage.tsx | Crawl bar + detail tabs |
| Create | backend/tests/perception/__init__.py | Test package |
| Create | backend/tests/perception/test_base_event_store.py | BaseEventStore contract tests |
| Create | backend/tests/perception/test_postgres_event_store.py | PostgresEventStore unit tests (mock psycopg2) |
| Create | backend/tests/perception/test_crawlers.py | Crawler unit tests (mock httpx) |
| Create | backend/tests/perception/test_llm_pipeline.py | Pipeline unit tests (mock LLM + embed) |
| Create | backend/tests/perception/test_crawl_service.py | CrawlService integration tests |

Task 1: BaseEventStore ABC + MockEventStore implements it

Files:

  • Create: backend/app/infrastructure/perception/base_event_store.py

  • Modify: backend/app/infrastructure/perception/mock_event_store.py

  • Create: backend/tests/perception/__init__.py

  • Create: backend/tests/perception/test_base_event_store.py

  • Step 1: Write the failing test

# backend/tests/perception/__init__.py
# (empty)
# backend/tests/perception/test_base_event_store.py
"""Contract tests: any BaseEventStore implementation must pass these."""
from app.infrastructure.perception.base_event_store import BaseEventStore
from app.infrastructure.perception.mock_event_store import MockEventStore


def _store() -> BaseEventStore:
    return MockEventStore()


def test_is_base_event_store():
    assert isinstance(_store(), BaseEventStore)


def test_all_returns_list():
    result = _store().all()
    assert isinstance(result, list)
    assert len(result) > 0


def test_get_known_id():
    store = _store()
    first = store.all()[0]
    result = store.get(first["id"])
    assert result is not None
    assert result["id"] == first["id"]


def test_get_unknown_returns_none():
    assert _store().get("does-not-exist") is None


def test_filter_by_impact():
    store = _store()
    highs = store.filter(impact_level="high", limit=100)
    assert all(e["impact_level"] == "high" for e in highs)


def test_filter_limit():
    store = _store()
    result = store.filter(limit=3)
    assert len(result) <= 3


def test_stats_keys():
    stats = _store().stats()
    for key in ("total", "high_impact", "medium_impact", "recent_90d"):
        assert key in stats, f"missing key: {key}"


def test_upsert_and_get():
    store = _store()
    event = {
        "id": "test-upsert-001",
        "source": "TEST",
        "source_label": "Test Source",
        "standard_code": "TST-001",
        "title": "Test Event",
        "summary": "A test event",
        "full_text_url": "https://example.com",
        "status": "draft",
        "impact_level": "low",
        "published_at": "2026-01-01",
        "effective_at": None,
        "category": "test",
        "tags": ["test"],
        "content_hash": "abc123",
        "previous_hash": None,
    }
    store.upsert(event)
    result = store.get("test-upsert-001")
    assert result is not None
    assert result["title"] == "Test Event"


def test_get_by_standard_code():
    store = _store()
    first = store.all()[0]
    result = store.get_by_standard_code(first["standard_code"])
    assert result is not None
    assert result["standard_code"] == first["standard_code"]
  • Step 2: Run test to verify it fails
cd backend && PYTHONPATH=. pytest tests/perception/test_base_event_store.py -v

Expected: ImportError on base_event_store

  • Step 3: Create BaseEventStore ABC
# backend/app/infrastructure/perception/base_event_store.py
"""Abstract base class for regulatory event stores."""

from __future__ import annotations

from abc import ABC, abstractmethod


class BaseEventStore(ABC):
    """Port interface for regulatory event persistence."""

    @abstractmethod
    def all(self) -> list[dict]:
        """Return all events, most-recent first."""

    @abstractmethod
    def get(self, event_id: str) -> dict | None:
        """Return a single event by ID, or None."""

    @abstractmethod
    def filter(
        self,
        *,
        source: str | None = None,
        impact_level: str | None = None,
        limit: int = 50,
    ) -> list[dict]:
        """Return filtered events sorted by published_at descending."""

    @abstractmethod
    def stats(self) -> dict:
        """Return {total, high_impact, medium_impact, recent_90d}."""

    @abstractmethod
    def upsert(self, event: dict) -> None:
        """Insert or update an event record."""

    @abstractmethod
    def get_by_standard_code(self, standard_code: str) -> dict | None:
        """Return the most-recent event with matching standard_code, or None."""
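The point of making the store an ABC is that an implementation missing any method fails at construction time rather than at first use. A minimal sketch of that behaviour, using a two-method stand-in for the interface (`HalfStore`, `DictStore` and `can_instantiate` are hypothetical names used only for illustration):

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseEventStore(ABC):
    """Two-method stand-in for the full interface above."""

    @abstractmethod
    def get(self, event_id: str) -> dict | None: ...

    @abstractmethod
    def upsert(self, event: dict) -> None: ...


class HalfStore(BaseEventStore):
    """Forgets to implement upsert(), so it stays abstract."""

    def get(self, event_id: str) -> dict | None:
        return None


class DictStore(BaseEventStore):
    """Complete implementation backed by a plain dict."""

    def __init__(self) -> None:
        self._events: dict[str, dict] = {}

    def get(self, event_id: str) -> dict | None:
        return self._events.get(event_id)

    def upsert(self, event: dict) -> None:
        self._events[event["id"]] = dict(event)


def can_instantiate(cls: type) -> bool:
    """Return False when Python refuses to build an abstract class."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Here `can_instantiate(HalfStore)` is False and `can_instantiate(DictStore)` is True, which is why Step 4 has to add both upsert() and get_by_standard_code() to MockEventStore before the contract tests can even construct it.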
  • Step 4: Patch MockEventStore to inherit BaseEventStore and add new methods

Open backend/app/infrastructure/perception/mock_event_store.py.

Add at the top (after existing imports):

from app.infrastructure.perception.base_event_store import BaseEventStore

Change class definition from:

class MockEventStore:

to:

class MockEventStore(BaseEventStore):

Add these two methods at the end of MockEventStore, after stats():

    def upsert(self, event: dict) -> None:
        """Insert or update event in the in-memory list (used in tests)."""
        existing = _EVENT_INDEX.get(event["id"])
        if existing:
            existing.update(event)
        else:
            MOCK_EVENTS.append(event)
            _EVENT_INDEX[event["id"]] = event

    def get_by_standard_code(self, standard_code: str) -> dict | None:
        """Return most-recent event with matching standard_code."""
        matches = [e for e in MOCK_EVENTS if e.get("standard_code") == standard_code]
        if not matches:
            return None
        # `or ""` keeps the comparison safe when published_at is present but None.
        return max(matches, key=lambda e: e.get("published_at") or "")
  • Step 5: Run tests — expect PASS
cd backend && PYTHONPATH=. pytest tests/perception/test_base_event_store.py -v

Expected: 9 tests PASS


Task 2: PostgresEventStore

Files:

  • Create: backend/app/infrastructure/perception/postgres_event_store.py

  • Create: backend/tests/perception/test_postgres_event_store.py

  • Step 1: Write the failing test (mock psycopg2)

# backend/tests/perception/test_postgres_event_store.py
"""Unit tests for PostgresEventStore using a mocked psycopg2 pool."""
from __future__ import annotations
from unittest.mock import MagicMock, patch

# Patch psycopg2 before importing the module under test
import sys
mock_psycopg2 = MagicMock()
mock_psycopg2.extras = MagicMock()
sys.modules.setdefault("psycopg2", mock_psycopg2)
sys.modules.setdefault("psycopg2.extras", mock_psycopg2.extras)
sys.modules.setdefault("psycopg2.pool", MagicMock())

from app.infrastructure.perception.base_event_store import BaseEventStore


SAMPLE_ROW = {
    "id": "pg-001",
    "source": "国标委",
    "source_label": "国家标准化管理委员会",
    "standard_code": "GB 18384-2025",
    "title": "电动汽车安全要求",
    "summary": "新增要求",
    "full_text_url": "https://openstd.samr.gov.cn",
    "status": "enacted",
    "impact_level": "high",
    "published_at": "2025-11-15",
    "effective_at": "2026-07-01",
    "category": "电动汽车安全",
    "tags": ["电池安全"],
    "obligations": None,
    "deadlines": None,
    "scope": None,
    "penalties": None,
    "content_hash": "abc123",
    "previous_hash": None,
    "change_summary": None,
    "changed_sections": None,
    "affected_docs": None,
    "crawled_at": "2026-06-05T10:00:00+00:00",
    "processed_at": None,
    "raw_storage_key": None,
}


def _make_store_with_pool(mock_pool):
    # Patch the names where the module under test looks them up. Patching
    # psycopg2.pool.ThreadedConnectionPool only affects the first import, so
    # later tests would silently reuse the first test's pool.
    from app.infrastructure.perception import postgres_event_store as module

    with patch.object(module, "ThreadedConnectionPool", return_value=mock_pool):
        with patch.object(module.PostgresEventStore, "_ensure_schema"):
            return module.PostgresEventStore()


def _cursor_returning(rows):
    cursor = MagicMock()
    cursor.__enter__ = lambda s: s
    cursor.__exit__ = MagicMock(return_value=False)
    cursor.fetchall.return_value = rows
    cursor.fetchone.return_value = rows[0] if rows else None
    return cursor


def test_is_base_event_store():
    mock_pool = MagicMock()
    store = _make_store_with_pool(mock_pool)
    assert isinstance(store, BaseEventStore)


def test_filter_returns_list():
    mock_pool = MagicMock()
    conn = MagicMock()
    conn.__enter__ = lambda s: s
    conn.__exit__ = MagicMock(return_value=False)
    cursor = _cursor_returning([SAMPLE_ROW])
    conn.cursor.return_value = cursor
    mock_pool.getconn.return_value = conn
    store = _make_store_with_pool(mock_pool)
    result = store.filter(limit=10)
    assert isinstance(result, list)


def test_stats_returns_correct_keys():
    mock_pool = MagicMock()
    conn = MagicMock()
    conn.__enter__ = lambda s: s
    conn.__exit__ = MagicMock(return_value=False)
    # stats runs 4 queries
    cursor = MagicMock()
    cursor.__enter__ = lambda s: s
    cursor.__exit__ = MagicMock(return_value=False)
    cursor.fetchone.return_value = {"count": 5}
    conn.cursor.return_value = cursor
    mock_pool.getconn.return_value = conn
    store = _make_store_with_pool(mock_pool)
    stats = store.stats()
    for key in ("total", "high_impact", "medium_impact", "recent_90d"):
        assert key in stats
  • Step 2: Run test to verify it fails
cd backend && PYTHONPATH=. pytest tests/perception/test_postgres_event_store.py -v

Expected: ImportError on postgres_event_store

  • Step 3: Implement PostgresEventStore
# backend/app/infrastructure/perception/postgres_event_store.py
"""PostgreSQL-backed regulatory event store."""

from __future__ import annotations

import json
from contextlib import contextmanager
from datetime import date, datetime, timedelta
from typing import Any

import psycopg2
import psycopg2.extras
from psycopg2.pool import ThreadedConnectionPool

from app.config.settings import settings
from app.infrastructure.perception.base_event_store import BaseEventStore

_CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS regulation_events (
    id               TEXT PRIMARY KEY,
    source           TEXT NOT NULL,
    source_label     TEXT,
    standard_code    TEXT NOT NULL,
    title            TEXT NOT NULL,
    summary          TEXT,
    full_text_url    TEXT,
    status           TEXT,
    impact_level     TEXT,
    published_at     DATE,
    effective_at     DATE,
    category         TEXT,
    tags             TEXT[],
    obligations      JSONB,
    deadlines        JSONB,
    scope            TEXT,
    penalties        TEXT,
    content_hash     TEXT,
    previous_hash    TEXT,
    change_summary   TEXT,
    changed_sections JSONB,
    affected_docs    JSONB,
    crawled_at       TIMESTAMPTZ DEFAULT now(),
    processed_at     TIMESTAMPTZ,
    raw_storage_key  TEXT
);
CREATE INDEX IF NOT EXISTS reg_events_source_date
    ON regulation_events (source, published_at DESC);
CREATE INDEX IF NOT EXISTS reg_events_impact_date
    ON regulation_events (impact_level, published_at DESC);
"""

_ALL_COLUMNS = (
    "id", "source", "source_label", "standard_code", "title", "summary",
    "full_text_url", "status", "impact_level", "published_at", "effective_at",
    "category", "tags", "obligations", "deadlines", "scope", "penalties",
    "content_hash", "previous_hash", "change_summary", "changed_sections",
    "affected_docs", "crawled_at", "processed_at", "raw_storage_key",
)


def _row_to_dict(row: dict[str, Any]) -> dict:
    """Convert a psycopg2 RealDictRow to a plain dict with JSON fields decoded and dates as ISO strings."""
    d = dict(row)
    for field in ("obligations", "deadlines", "changed_sections", "affected_docs"):
        val = d.get(field)
        if isinstance(val, str):
            d[field] = json.loads(val)
    for date_field in ("published_at", "effective_at"):
        val = d.get(date_field)
        if isinstance(val, date):
            d[date_field] = val.isoformat()
    for ts_field in ("crawled_at", "processed_at"):
        val = d.get(ts_field)
        if isinstance(val, datetime):
            d[ts_field] = val.isoformat()
    return d


class PostgresEventStore(BaseEventStore):
    """Regulatory event store backed by PostgreSQL."""

    def __init__(self) -> None:
        self._pool = ThreadedConnectionPool(
            minconn=1,
            maxconn=5,
            host=settings.postgres_host,
            port=settings.postgres_port,
            user=settings.postgres_user,
            password=settings.postgres_password,
            dbname=settings.postgres_db,
        )
        self._ensure_schema()

    def _ensure_schema(self) -> None:
        with self._conn() as conn:
            with conn.cursor() as cur:
                cur.execute(_CREATE_TABLE)
            conn.commit()

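    @contextmanager
    def _conn(self):
        """Borrow a pooled connection; commit on success, roll back on error."""
        conn = self._pool.getconn()
        try:
            yield conn
            conn.commit()
        except Exception:
            # Never hand an aborted transaction back to the pool.
            conn.rollback()
            raise
        finally:
            self._pool.putconn(conn)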

    def all(self) -> list[dict]:
        with self._conn() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(
                    "SELECT * FROM regulation_events ORDER BY published_at DESC NULLS LAST"
                )
                return [_row_to_dict(r) for r in cur.fetchall()]

    def get(self, event_id: str) -> dict | None:
        with self._conn() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(
                    "SELECT * FROM regulation_events WHERE id = %s", (event_id,)
                )
                row = cur.fetchone()
                return _row_to_dict(row) if row else None

    def filter(
        self,
        *,
        source: str | None = None,
        impact_level: str | None = None,
        limit: int = 50,
    ) -> list[dict]:
        conditions: list[str] = []
        params: list[Any] = []
        if source:
            conditions.append("source = %s")
            params.append(source)
        if impact_level:
            conditions.append("impact_level = %s")
            params.append(impact_level)
        where = ("WHERE " + " AND ".join(conditions)) if conditions else ""
        params.append(limit)
        sql = f"""
            SELECT * FROM regulation_events
            {where}
            ORDER BY published_at DESC NULLS LAST
            LIMIT %s
        """
        with self._conn() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(sql, params)
                return [_row_to_dict(r) for r in cur.fetchall()]

    def stats(self) -> dict:
        cutoff = (date.today() - timedelta(days=90)).isoformat()
        with self._conn() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute("SELECT COUNT(*) AS count FROM regulation_events")
                total = (cur.fetchone() or {}).get("count", 0)
                cur.execute(
                    "SELECT COUNT(*) AS count FROM regulation_events WHERE impact_level = 'high'"
                )
                high = (cur.fetchone() or {}).get("count", 0)
                cur.execute(
                    "SELECT COUNT(*) AS count FROM regulation_events WHERE impact_level = 'medium'"
                )
                medium = (cur.fetchone() or {}).get("count", 0)
                cur.execute(
                    "SELECT COUNT(*) AS count FROM regulation_events WHERE published_at >= %s",
                    (cutoff,),
                )
                recent = (cur.fetchone() or {}).get("count", 0)
        return {
            "total": int(total),
            "high_impact": int(high),
            "medium_impact": int(medium),
            "recent_90d": int(recent),
        }

    def upsert(self, event: dict) -> None:
        """Insert or update a regulation event."""
        cols = [c for c in _ALL_COLUMNS if c in event]
        placeholders = ", ".join(f"%({c})s" for c in cols)
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c != "id")
        sql = f"""
            INSERT INTO regulation_events ({', '.join(cols)})
            VALUES ({placeholders})
            ON CONFLICT (id) DO UPDATE SET {updates}
        """
        row: dict[str, Any] = {}
        for c in cols:
            val = event.get(c)
            if c in ("obligations", "deadlines", "changed_sections", "affected_docs") and val is not None:
                row[c] = json.dumps(val, ensure_ascii=False)
            elif c == "tags" and isinstance(val, list):
                row[c] = val  # psycopg2 handles list→array
            else:
                row[c] = val
        with self._conn() as conn:
            with conn.cursor() as cur:
                cur.execute(sql, row)
            conn.commit()

    def get_by_standard_code(self, standard_code: str) -> dict | None:
        with self._conn() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(
                    """SELECT * FROM regulation_events
                       WHERE standard_code = %s
                       ORDER BY published_at DESC NULLS LAST
                       LIMIT 1""",
                    (standard_code,),
                )
                row = cur.fetchone()
                return _row_to_dict(row) if row else None
  • Step 4: Run tests — expect PASS
cd backend && PYTHONPATH=. pytest tests/perception/test_postgres_event_store.py -v

Expected: 3 tests PASS
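The SQL that upsert() assembles is easier to review when built by a pure function. The sketch below mirrors the column filtering and ON CONFLICT clause from the implementation above, with two additions that are assumptions rather than part of the plan: it rejects events without an id, and it falls back to DO NOTHING when id is the only column (an empty SET list would be invalid SQL). `build_upsert_sql` and the trimmed `COLUMNS` tuple are hypothetical names.

```python
from __future__ import annotations

# Trimmed column whitelist; the real store uses _ALL_COLUMNS.
COLUMNS = ("id", "source", "standard_code", "title", "tags")


def build_upsert_sql(event: dict, columns: tuple[str, ...] = COLUMNS) -> str:
    """Build an INSERT ... ON CONFLICT statement with named placeholders.

    Only whitelisted column names reach the SQL text, so keys supplied by
    the caller cannot inject identifiers.
    """
    cols = [c for c in columns if c in event]
    if "id" not in cols:
        raise ValueError("event must include 'id'")
    placeholders = ", ".join(f"%({c})s" for c in cols)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c != "id")
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    return (
        f"INSERT INTO regulation_events ({', '.join(cols)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT (id) {action}"
    )
```

Because the function touches no database, it can be asserted on directly in a unit test without mocking psycopg2.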


Task 3: Crawler base + CATARC and 国标委 crawlers

Files:

  • Create: backend/app/infrastructure/perception/crawlers/__init__.py

  • Create: backend/app/infrastructure/perception/crawlers/base.py

  • Create: backend/app/infrastructure/perception/crawlers/catarc_crawler.py

  • Create: backend/app/infrastructure/perception/crawlers/guobiao_crawler.py

  • Create: backend/tests/perception/test_crawlers.py

  • Step 1: Write failing test

# backend/tests/perception/test_crawlers.py
"""Unit tests for crawlers — mock httpx responses."""
from __future__ import annotations
from unittest.mock import MagicMock, patch
import pytest

from app.infrastructure.perception.crawlers.base import RawEvent, BaseCrawler


def test_raw_event_fields():
    ev = RawEvent(
        source="TEST",
        source_label="Test",
        standard_code="TST-001",
        title="Test",
        summary="Summary",
        full_text_url="https://example.com",
        status="enacted",
        published_at="2026-01-01",
        effective_at=None,
        category="test",
        tags=["a"],
        raw_text="full text here",
    )
    assert ev.source == "TEST"
    assert ev.tags == ["a"]


CATARC_HTML = """
<html><body>
<table>
<tr>
  <td><a href="/std/detail/123">GB 18384-2025</a></td>
  <td>电动汽车安全要求</td>
  <td>2025-11-15</td>
  <td>现行</td>
</tr>
<tr>
  <td><a href="/std/detail/456">GB/T 40429-2026</a></td>
  <td>汽车驾驶自动化分级</td>
  <td>2026-02-01</td>
  <td>即将实施</td>
</tr>
</table>
</body></html>
"""


def test_catarc_crawler_parses_html():
    from app.infrastructure.perception.crawlers.catarc_crawler import CatarcCrawler

    mock_resp = MagicMock()
    mock_resp.status_code = 200
    mock_resp.text = CATARC_HTML
    mock_resp.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_resp):
        crawler = CatarcCrawler()
        events = crawler.fetch(limit=10)

    assert isinstance(events, list)
    assert len(events) >= 1
    assert all(isinstance(e, RawEvent) for e in events)
    codes = [e.standard_code for e in events]
    assert "GB 18384-2025" in codes


GUOBIAO_JSON = {
    "rows": [
        {
            "std_code": "GB 18384-2025",
            "std_name": "电动汽车安全要求",
            "release_date": "2025-11-15",
            "implement_date": "2026-07-01",
            "std_status": "现行",
            "std_type": "强制性",
        },
    ]
}


def test_guobiao_crawler_parses_json():
    from app.infrastructure.perception.crawlers.guobiao_crawler import GuobiaoMandatoryCrawler

    mock_resp = MagicMock()
    mock_resp.status_code = 200
    mock_resp.json.return_value = GUOBIAO_JSON
    mock_resp.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_resp):
        crawler = GuobiaoMandatoryCrawler()
        events = crawler.fetch(limit=10)

    assert len(events) >= 1
    assert events[0].source == "国标委"
    assert events[0].standard_code == "GB 18384-2025"
  • Step 2: Run test to verify it fails
cd backend && PYTHONPATH=. pytest tests/perception/test_crawlers.py -v

Expected: ImportError

  • Step 3: Create crawler base
# backend/app/infrastructure/perception/crawlers/__init__.py
# (empty)
# backend/app/infrastructure/perception/crawlers/base.py
"""Shared contracts for regulatory source crawlers."""

from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RawEvent:
    """Raw regulatory event returned by a crawler before enrichment."""

    source: str
    source_label: str
    standard_code: str
    title: str
    summary: str
    full_text_url: str
    status: str           # 'enacted' | 'draft' | 'consultation'
    published_at: str     # YYYY-MM-DD string
    effective_at: str | None
    category: str
    tags: list[str] = field(default_factory=list)
    raw_text: str = ""    # full crawled text for hashing + LLM


class BaseCrawler(ABC):
    """Abstract regulatory source crawler."""

    @abstractmethod
    def fetch(self, limit: int = 50) -> list[RawEvent]:
        """Fetch up to `limit` recent events from the data source."""
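RawEvent carries raw_text "for hashing + LLM", while the store schema expects content_hash and previous_hash. One way the conversion might look is sketched below, under stated assumptions: SHA-256 as the hash function, a `source:standard_code` ID scheme, and the helper name `to_store_record` are all hypothetical and not taken from the plan.

```python
from __future__ import annotations

import hashlib
from dataclasses import asdict, dataclass, field


@dataclass
class RawEvent:
    """Trimmed copy of the plan's dataclass (fewer fields)."""

    source: str
    standard_code: str
    title: str
    published_at: str
    tags: list[str] = field(default_factory=list)
    raw_text: str = ""


def to_store_record(raw: RawEvent, previous_hash: str | None = None) -> dict:
    """Turn a crawled event into a dict shaped for BaseEventStore.upsert()."""
    content_hash = hashlib.sha256(raw.raw_text.encode("utf-8")).hexdigest()
    record = asdict(raw)
    record.pop("raw_text")  # the raw text is hashed, not stored in this sketch
    record["id"] = f"{raw.source}:{raw.standard_code}"  # hypothetical ID scheme
    record["content_hash"] = content_hash
    record["previous_hash"] = previous_hash
    return record
```

A stable hash lets a later crawl detect change by comparing the new content_hash with the stored one for the same standard code.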
  • Step 4: Create CATARC crawler
# backend/app/infrastructure/perception/crawlers/catarc_crawler.py
"""Crawler for CATARC automotive standard catalogue."""

from __future__ import annotations

import hashlib

import httpx
from bs4 import BeautifulSoup
from loguru import logger

from app.infrastructure.perception.crawlers.base import BaseCrawler, RawEvent

_BASE_URL = "https://www.catarc.org.cn/bzzxd/qcbz/index.html"
_HOST = "https://www.catarc.org.cn"

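# Status strings appearing on the CATARC site mapped to our vocabulary
# ('enacted' | 'draft' | 'consultation', see RawEvent.status).
_STATUS_MAP = {
    "现行": "enacted",           # in force
    "即将实施": "enacted",       # published, not yet in force
    "废止": "enacted",           # repealed; the vocabulary has no 'repealed' value
    "征求意见": "consultation",  # public consultation
    "报批": "draft",             # submitted for approval
}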


class CatarcCrawler(BaseCrawler):
    """Scrape the CATARC automotive standard list page."""

    def fetch(self, limit: int = 50) -> list[RawEvent]:
        events: list[RawEvent] = []
        # Guards against the site ignoring ?page= and serving the same rows again.
        seen_codes: set[str] = set()
        page = 1
        while len(events) < limit:
            url = f"{_BASE_URL}?page={page}"
            try:
                resp = httpx.get(url, timeout=30, follow_redirects=True)
                resp.raise_for_status()
            except Exception as exc:
                logger.warning("CATARC fetch failed page={} err={}", page, exc)
                break

            soup = BeautifulSoup(resp.text, "lxml")
            rows = soup.select("table tr")
            if not rows:
                break

            batch: list[RawEvent] = []
            for row in rows:
                cells = row.find_all("td")
                if len(cells) < 3:
                    continue  # header rows (<th>) and malformed rows
                link = cells[0].find("a")
                standard_code = link.get_text(strip=True) if link else cells[0].get_text(strip=True)
                if not standard_code or standard_code in seen_codes:
                    continue
                seen_codes.add(standard_code)
                title = cells[1].get_text(strip=True) or standard_code
                published_at = _parse_date(cells[2].get_text(strip=True))
                status_text = cells[3].get_text(strip=True) if len(cells) > 3 else ""
                status = _STATUS_MAP.get(status_text, "enacted")
                detail_url = (_HOST + link["href"]) if link and link.get("href") else url
                raw_text = f"{standard_code} {title}"
                batch.append(RawEvent(
                    source="CATARC",
                    source_label="全国汽车标准化技术委员会",
                    standard_code=standard_code,
                    title=title,
                    summary=title,
                    full_text_url=detail_url,
                    status=status,
                    published_at=published_at,
                    effective_at=None,
                    category="汽车标准",
                    tags=_extract_tags(standard_code, title),
                    raw_text=raw_text,
                ))

            # Stop when a page contributes nothing new (end of list or repeated page).
            if not batch:
                break
            events.extend(batch)
            page += 1

        return events[:limit]


def _parse_date(text: str) -> str:
    """Return YYYY-MM-DD from common Chinese date formats, or today's date."""
    import re
    from datetime import date
    text = text.strip()
    m = re.search(r"(\d{4})[/-](\d{1,2})[/-](\d{1,2})", text)
    if m:
        y, mo, d = m.group(1), m.group(2).zfill(2), m.group(3).zfill(2)
        return f"{y}-{mo}-{d}"
    m2 = re.search(r"(\d{4})年(\d{1,2})月(\d{1,2})日?", text)
    if m2:
        y, mo, d = m2.group(1), m2.group(2).zfill(2), m2.group(3).zfill(2)
        return f"{y}-{mo}-{d}"
    return date.today().isoformat()


def _extract_tags(standard_code: str, title: str) -> list[str]:
    """Derive simple keyword tags from standard code and title."""
    tags: list[str] = []
    code_upper = standard_code.upper()
    if "GB" in code_upper:
        tags.append("国家标准")
    if "/T" in code_upper:
        tags.append("推荐性")
    else:
        tags.append("强制性")
    keywords = ["电动", "安全", "自动驾驶", "充电", "智能网联", "碰撞", "排放", "网络安全"]
    for kw in keywords:
        if kw in title:
            tags.append(kw)
    return tags[:5]
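_parse_date falls back to today's date, so an unparseable cell becomes indistinguishable from a standard published today. If that matters downstream, a variant that reports failure explicitly could look like the sketch below; `parse_date_or_none` is a hypothetical name, and the two patterns are the same ones used above.

```python
from __future__ import annotations

import re

# Same two formats as _parse_date: 2025-11-15 / 2025/11/15 and 2025年11月15日.
_DATE_PATTERNS = (
    re.compile(r"(\d{4})[/-](\d{1,2})[/-](\d{1,2})"),
    re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日?"),
)


def parse_date_or_none(text: str) -> str | None:
    """Return YYYY-MM-DD for a recognised date, or None when nothing matches."""
    cleaned = text.strip()
    for pattern in _DATE_PATTERNS:
        match = pattern.search(cleaned)
        if match:
            year, month, day = match.groups()
            return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    return None
```

For instance, `parse_date_or_none("2025/1/5")` gives `"2025-01-05"`, while `parse_date_or_none("n/a")` gives `None`, leaving the caller to decide whether to skip the row or store a NULL published_at.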
  • Step 5: Create 国标委 crawler
# backend/app/infrastructure/perception/crawlers/guobiao_crawler.py
"""Crawlers for the 国标委 (SAMR) standard information platform."""

from __future__ import annotations

import httpx
from loguru import logger

from app.infrastructure.perception.crawlers.base import BaseCrawler, RawEvent
from app.infrastructure.perception.crawlers.catarc_crawler import _parse_date, _extract_tags

# p.p1=1 → mandatory (强制性); p.p1=2 → recommended (推荐性)
_BASE_URL = "https://openstd.samr.gov.cn/bzgk/std/std_list_type"
_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; RegulatoryBot/1.0)"}


def _fetch_page(std_type: int, page: int, page_size: int) -> list[dict]:
    params = {
        "p.p1": std_type,
        "p.p2": "车",
        "p.p90": "circulation_date",
        "p.p91": "desc",
        "p.p6": page,
        "p.p7": page_size,
    }
    try:
        resp = httpx.get(_BASE_URL, params=params, headers=_HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        return data.get("rows", []) or []
    except Exception as exc:
        logger.warning("国标委 fetch failed type={} page={} err={}", std_type, page, exc)
        return []


def _row_to_raw_event(row: dict, source_label: str) -> RawEvent:
    standard_code = row.get("std_code", "")
    title = row.get("std_name", standard_code)
    published_at = _parse_date(row.get("release_date", ""))
    effective_at_raw = row.get("implement_date", "")
    effective_at = _parse_date(effective_at_raw) if effective_at_raw else None
    status_text = row.get("std_status", "")
    if "征求意见" in status_text:
        status = "consultation"
    elif "报批" in status_text or "草案" in status_text:
        status = "draft"
    else:
        status = "enacted"
    return RawEvent(
        source="国标委",
        source_label=source_label,
        standard_code=standard_code,
        title=title,
        summary=title,
        full_text_url=f"https://openstd.samr.gov.cn/bzgk/std/detail?id={row.get('id', '')}",
        status=status,
        published_at=published_at,
        effective_at=effective_at,
        category=row.get("std_type", "国家标准"),
        tags=_extract_tags(standard_code, title),
        raw_text=f"{standard_code} {title}",
    )


class GuobiaoMandatoryCrawler(BaseCrawler):
    """Fetch mandatory national standards (强制性) related to vehicles."""

    def fetch(self, limit: int = 50) -> list[RawEvent]:
        events: list[RawEvent] = []
        page = 1
        while len(events) < limit:
            rows = _fetch_page(std_type=1, page=page, page_size=20)
            if not rows:
                break
            events.extend(_row_to_raw_event(r, "国标委·强制性") for r in rows)
            page += 1
        return events[:limit]


class GuobiaoRecommendedCrawler(BaseCrawler):
    """Fetch recommended national standards (推荐性) related to vehicles."""

    def fetch(self, limit: int = 50) -> list[RawEvent]:
        events: list[RawEvent] = []
        page = 1
        while len(events) < limit:
            rows = _fetch_page(std_type=2, page=page, page_size=20)
            if not rows:
                break
            events.extend(_row_to_raw_event(r, "国标委·推荐性") for r in rows)
            page += 1
        return events[:limit]
  • Step 6: Run tests
cd backend && PYTHONPATH=. pytest tests/perception/test_crawlers.py -v

Expected: 3 tests PASS
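GuobiaoMandatoryCrawler and GuobiaoRecommendedCrawler repeat the same pagination loop, and neither stops if the API keeps returning the same page. A shared helper along the following lines could cover both cases. It is a sketch only: `collect_pages`, the `max_pages` cap, and keying on `std_code` are assumptions, not part of the plan.

```python
from __future__ import annotations

from typing import Callable


def collect_pages(
    fetch_page: Callable[[int], list[dict]],
    limit: int,
    key: str = "std_code",
    max_pages: int = 20,
) -> list[dict]:
    """Collect up to `limit` unique rows from a 1-based paginated source.

    Stops when a page adds no unseen rows, when `limit` is reached, or after
    `max_pages` requests, whichever comes first.
    """
    seen: set = set()
    rows: list[dict] = []
    for page in range(1, max_pages + 1):
        added = 0
        for row in fetch_page(page):
            identity = row.get(key)
            if identity in seen:
                continue
            seen.add(identity)
            rows.append(row)
            added += 1
        if added == 0 or len(rows) >= limit:
            break
    return rows[:limit]
```

Each crawler would then pass a closure such as `lambda page: _fetch_page(std_type=1, page=page, page_size=20)` and map the returned rows through `_row_to_raw_event`.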


Task 4: EUR-Lex + UN-ECE crawler

Files:

  • Create: backend/app/infrastructure/perception/crawlers/eurlex_crawler.py

(Tests already created in test_crawlers.py — add to existing file)

  • Step 1: Add EUR-Lex test to existing test file

Append to backend/tests/perception/test_crawlers.py:

EURLEX_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>EUR-Lex</title>
    <item>
      <title>Regulation (EU) 2024/1689 — AI Act</title>
      <link>https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689</link>
      <description>The EU Artificial Intelligence Act enters into force.</description>
      <pubDate>Fri, 12 Jul 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def test_eurlex_crawler_parses_rss():
    from app.infrastructure.perception.crawlers.eurlex_crawler import EurlexCrawler

    mock_resp = MagicMock()
    mock_resp.status_code = 200
    mock_resp.text = EURLEX_RSS
    mock_resp.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_resp):
        crawler = EurlexCrawler()
        events = crawler.fetch(limit=5)

    assert isinstance(events, list)
    assert len(events) >= 1
    assert events[0].source == "EUR-Lex"
  • Step 2: Run to verify it fails
cd backend && PYTHONPATH=. pytest tests/perception/test_crawlers.py::test_eurlex_crawler_parses_rss -v

Expected: ImportError

  • Step 3: Implement EUR-Lex + UN-ECE crawler
# backend/app/infrastructure/perception/crawlers/eurlex_crawler.py
"""Crawler for EUR-Lex RSS feeds covering EU AI Act and automotive regulations."""

from __future__ import annotations

import re
from email.utils import parsedate_to_datetime

import httpx
from bs4 import BeautifulSoup
from loguru import logger

from app.infrastructure.perception.crawlers.base import BaseCrawler, RawEvent
from app.infrastructure.perception.crawlers.catarc_crawler import _parse_date

# EUR-Lex predefined RSS: legislation in force (OJ L series)
_EURLEX_RSS_URLS = [
    # EU AI Act + automotive-related OJ publications
    "https://eur-lex.europa.eu/rss-feed/OJ-L.rss",
]

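# UN-ECE automotive regulations via EUR-Lex CELLAR.
# NOTE: placeholder CELEX numbers, not yet read by EurlexCrawler; replace them
# with the real identifiers before wiring UN-ECE lookups.
_UNECE_CELEX = [
    "32024R0001",  # UN R155 cybersecurity (placeholder CELEX; adjust as needed)
    "32024R0002",  # UN R156 software updates (placeholder CELEX; adjust as needed)
]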

_AUTOMOTIVE_KEYWORDS = [
    "vehicle", "automotive", "motor", "tyre", "emission", "ADAS", "autonomous",
    "AI Act", "artificial intelligence", "cybersecurity", "software update",
    "R155", "R156", "汽车", "车辆",
]


def _is_automotive_relevant(title: str, description: str) -> bool:
    combined = (title + " " + description).lower()
    return any(kw.lower() in combined for kw in _AUTOMOTIVE_KEYWORDS)


def _extract_celex(url: str) -> str:
    """Extract CELEX number from EUR-Lex URL, or return empty string."""
    m = re.search(r"CELEX[:/]([0-9A-Z]+)", url)
    return m.group(1) if m else ""


def _parse_rss_date(rfc2822: str) -> str:
    """Parse RFC-2822 date string → YYYY-MM-DD."""
    try:
        dt = parsedate_to_datetime(rfc2822)
        return dt.date().isoformat()
    except Exception:
        return _parse_date(rfc2822)


class EurlexCrawler(BaseCrawler):
    """Fetch automotive-relevant EU regulations from EUR-Lex RSS feeds."""

    def fetch(self, limit: int = 50) -> list[RawEvent]:
        events: list[RawEvent] = []
        for rss_url in _EURLEX_RSS_URLS:
            if len(events) >= limit:
                break
            try:
                resp = httpx.get(rss_url, timeout=30, follow_redirects=True)
                resp.raise_for_status()
            except Exception as exc:
                logger.warning("EUR-Lex RSS fetch failed url={} err={}", rss_url, exc)
                continue

            soup = BeautifulSoup(resp.text, "lxml-xml")
            for item in soup.find_all("item"):
                if len(events) >= limit:
                    break
                # find() returns None for a missing child, so guard before get_text()
                fields = {
                    name: el.get_text(strip=True) if (el := item.find(name)) is not None else ""
                    for name in ("title", "description", "link", "pubDate")
                }
                title, description = fields["title"], fields["description"]
                link, pub_date = fields["link"], fields["pubDate"]

                if not _is_automotive_relevant(title, description):
                    continue

                celex = _extract_celex(link)
                standard_code = celex if celex else title[:60]
                published_at = _parse_rss_date(pub_date) if pub_date else _parse_date("")

                events.append(RawEvent(
                    source="EUR-Lex",
                    source_label="欧盟官方公报",
                    standard_code=standard_code,
                    title=title,
                    summary=description[:500],
                    full_text_url=link,
                    status="enacted",
                    published_at=published_at,
                    effective_at=None,
                    category="EU法规",
                    tags=_extract_eurlex_tags(title, description),
                    raw_text=f"{title}\n{description}",
                ))

        return events[:limit]


def _extract_eurlex_tags(title: str, description: str) -> list[str]:
    combined = title + " " + description
    tag_map = {
        "AI Act": "EU AI Act",
        "artificial intelligence": "EU AI Act",
        "R155": "UN R155",
        "R156": "UN R156",
        "cybersecurity": "网络安全",
        "emission": "排放",
        "autonomous": "自动驾驶",
        "ADAS": "ADAS",
    }
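    tags: list[str] = []
    lowered = combined.lower()
    for kw, tag in tag_map.items():
        # several keywords map to the same tag, so skip duplicates
        if kw.lower() in lowered and tag not in tags:
            tags.append(tag)
    return tags[:5]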
  • Step 4: Run tests
cd backend && PYTHONPATH=. pytest tests/perception/test_crawlers.py -v

Expected: 4 tests PASS
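
The RSS handling in Step 3 comes down to three operations: reading child elements that may be missing, pulling the CELEX number out of the link, and converting the RFC 2822 `pubDate` to an ISO date. The sketch below illustrates them with the standard library only, using `xml.etree.ElementTree` as a stand-in for BeautifulSoup. `item_text`, `parse_items`, and `SAMPLE_RSS` are hypothetical names for illustration.

```python
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

SAMPLE_RSS = """<rss version="2.0"><channel>
  <item>
    <title>Regulation (EU) 2024/1689 - AI Act</title>
    <link>https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689</link>
    <pubDate>Fri, 12 Jul 2024 00:00:00 GMT</pubDate>
  </item>
</channel></rss>"""


def item_text(item: ET.Element, tag: str) -> str:
    """Return the stripped text of a child element, or "" when it is missing."""
    return (item.findtext(tag) or "").strip()


def parse_items(rss_xml: str) -> list[dict]:
    """Parse RSS <item> elements into plain dicts."""
    root = ET.fromstring(rss_xml)
    parsed = []
    for item in root.iter("item"):
        link = item_text(item, "link")
        celex = re.search(r"CELEX[:/]([0-9A-Z]+)", link)
        pub_date = item_text(item, "pubDate")
        parsed.append({
            "title": item_text(item, "title"),
            "description": item_text(item, "description"),
            "celex": celex.group(1) if celex else "",
            "published_at": (
                parsedate_to_datetime(pub_date).date().isoformat() if pub_date else ""
            ),
        })
    return parsed


print(parse_items(SAMPLE_RSS)[0]["celex"])         # 32024R1689
print(parse_items(SAMPLE_RSS)[0]["published_at"])  # 2024-07-12
```

The sample item has no `<description>`, so `item_text` returns an empty string for it instead of raising.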


Task 5: LLM Pipeline (extract + assess + diff)

Files:

  • Create: backend/app/infrastructure/perception/llm_pipeline.py

  • Create: backend/tests/perception/test_llm_pipeline.py

  • Step 1: Write the failing test

# backend/tests/perception/test_llm_pipeline.py
"""Unit tests for LlmPipeline — mock LLM client and embedding provider."""
from __future__ import annotations
from unittest.mock import MagicMock, patch
import json
import pytest


def _make_pipeline():
    with patch("app.infrastructure.perception.llm_pipeline.get_llm_client") as mock_llm_fn, \
         patch("app.infrastructure.perception.llm_pipeline.OpenAICompatibleEmbeddingProvider") as mock_emb_cls:

        mock_client = MagicMock()
        mock_client.chat.return_value = MagicMock(content='{"obligations":[{"text":"test obligation","deontic":"must","subject":"OEM","object":"system","condition":""}],"deadlines":[{"date":"2026-07-01","description":"实施截止"}],"scope":"适用于M1类车辆","penalties":"罚款","impact_level":"high"}')
        mock_llm_fn.return_value = mock_client

        mock_emb = MagicMock()
        mock_emb.embed_texts.return_value = [[0.1] * 1024, [0.9] * 1024]
        mock_emb_cls.return_value = mock_emb

        from app.infrastructure.perception.llm_pipeline import LlmPipeline
        return LlmPipeline(), mock_client, mock_emb


def test_extract_structure_returns_dict():
    pipeline, mock_client, _ = _make_pipeline()
    event = {
        "id": "evt-001",
        "standard_code": "GB 18384-2025",
        "title": "电动汽车安全要求",
        "summary": "新增 IP67 级别防护",
        "source_label": "CATARC",
        "tags": ["电池安全"],
    }
    result = pipeline.extract_structure(event)
    assert isinstance(result, dict)
    assert "obligations" in result
    assert "impact_level" in result


def test_assess_impact_returns_list():
    pipeline, mock_client, _ = _make_pipeline()
    mock_client.chat.return_value = MagicMock(content='[{"doc_id":"d1","doc_name":"Safety Manual","score":0.85,"key_clauses":"§4.2","recommendation":"更新第4章"}]')
    mock_retrieval = MagicMock()
    chunk = MagicMock()
    chunk.doc_id = "d1"
    chunk.doc_title = "Safety Manual"
    chunk.score = 0.85
    chunk.text = "relevant text"
    chunk.section_title = "§4.2"
    mock_retrieval.retrieve.return_value = [chunk]
    event = {
        "standard_code": "GB 18384-2025",
        "title": "电动汽车安全要求",
        "obligations": [{"text": "OEM shall comply"}],
    }
    result = pipeline.assess_impact(event, mock_retrieval)
    assert isinstance(result, list)


def test_compute_diff_no_change():
    pipeline, _, mock_emb = _make_pipeline()
    # identical embeddings → cosine similarity = 1.0 → no changes
    mock_emb.embed_texts.return_value = [[0.5] * 1024, [0.5] * 1024]
    result = pipeline.compute_diff("paragraph one", "paragraph one")
    assert isinstance(result, dict)
    assert result["changed_sections"] == []
    assert "change_summary" in result


def test_compute_diff_detects_change():
    pipeline, mock_client, mock_emb = _make_pipeline()
    # orthogonal embeddings → cosine similarity = 0.0 → change detected
    mock_emb.embed_texts.return_value = [
        [1.0] + [0.0] * 1023,
        [0.0] + [1.0] + [0.0] * 1022,
    ]
    mock_client.chat.return_value = MagicMock(content='{"change_type":"tightened","summary":"Requirement tightened"}')
    result = pipeline.compute_diff("old paragraph text", "new tighter requirement text")
    assert isinstance(result["changed_sections"], list)
  • Step 2: Run to verify it fails
cd backend && PYTHONPATH=. pytest tests/perception/test_llm_pipeline.py -v

Expected: ImportError

  • Step 3: Implement LlmPipeline
# backend/app/infrastructure/perception/llm_pipeline.py
"""LLM-driven pipeline for regulatory event enrichment."""

from __future__ import annotations

import json
import math
from typing import Any

from loguru import logger

from app.config.settings import settings
from app.infrastructure.embedding.openai_compatible_embedding_provider import (
    OpenAICompatibleEmbeddingProvider,
)
from app.services.llm.llm_factory import get_llm_client

_EXTRACT_SYSTEM = (
    "You are a regulatory compliance expert specialising in automotive standards "
    "(GB, UN-ECE, ISO, EU). Extract structured information from regulation text. "
    "Return valid JSON only — no markdown fences, no extra keys."
)

_ASSESS_SYSTEM = (
    "You are an automotive compliance analyst. Given a regulation and related document excerpts, "
    "identify which documents are affected and what actions are required. "
    "Return a JSON array only."
)

_DIFF_SYSTEM = (
    "You are a regulatory change analyst. Given an old and new version of a regulation paragraph, "
    "classify the type of change and summarise it. "
    "Return JSON only: {\"change_type\": \"tightened|relaxed|added|removed\", \"summary\": \"...\"}"
)

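# Falls back to 0.85 until Task 7 adds perception_diff_similarity_threshold to settings.
_SIMILARITY_THRESHOLD = getattr(settings, "perception_diff_similarity_threshold", 0.85)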


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return dot / (norm_a * norm_b)


def _llm_json(client: Any, messages: list[dict]) -> Any:
    """Call LLM and parse JSON response; return None on failure."""
    try:
        resp = client.chat(messages)
        text = (resp.content or "").strip()
        # strip markdown fences if model added them despite instructions
        if text.startswith("```"):
            text = text.split("```")[1]
            if text.startswith("json"):
                text = text[4:]
        return json.loads(text)
    except Exception as exc:
        logger.warning("LLM JSON parse failed: {}", exc)
        return None


class LlmPipeline:
    """Three-step enrichment pipeline for crawled regulatory events."""

    def __init__(self) -> None:
        self._client = get_llm_client(
            provider=settings.llm_provider,
            model=settings.llm_model,
        )
        self._embedder = OpenAICompatibleEmbeddingProvider()

    # ------------------------------------------------------------------
    # Step 1: Structure extraction
    # ------------------------------------------------------------------

    def extract_structure(self, event: dict) -> dict:
        """Extract obligations, deadlines, scope, penalties, impact_level from event text."""
        prompt = f"""Extract structured compliance information from this regulation:

Standard: {event.get('standard_code', '')}
Title: {event.get('title', '')}
Source: {event.get('source_label', '')}
Summary: {event.get('summary', '')}
Tags: {', '.join(event.get('tags', []))}

Return JSON with exactly these keys:
{{
  "obligations": [{{"text": "...", "deontic": "must|shall|may|prohibited", "subject": "...", "object": "...", "condition": ""}}],
  "deadlines": [{{"date": "YYYY-MM-DD or null", "description": "..."}}],
  "scope": "one sentence describing who/what this applies to",
  "penalties": "one sentence on consequences of non-compliance, or null",
  "impact_level": "high|medium|low"
}}"""

        messages = [
            {"role": "system", "content": _EXTRACT_SYSTEM},
            {"role": "user", "content": prompt},
        ]
        result = _llm_json(self._client, messages)
        if not isinstance(result, dict):
            return {
                "obligations": [],
                "deadlines": [],
                "scope": "",
                "penalties": "",
                "impact_level": "medium",
            }
        return result

    # ------------------------------------------------------------------
    # Step 2: Impact assessment
    # ------------------------------------------------------------------

    def assess_impact(self, event: dict, retrieval_service: Any) -> list[dict]:
        """Use RAG to find affected documents and generate recommendations."""
        obligations = event.get("obligations") or []
        obligation_texts = " ".join(o.get("text", "") for o in obligations[:3])
        query = f"{event.get('standard_code', '')} {event.get('title', '')} {obligation_texts}"

        try:
            chunks = retrieval_service.retrieve(query=query, top_k=5)
        except Exception as exc:
            logger.warning("RAG retrieval failed: {}", exc)
            return []

        if not chunks:
            return []

        seen: set[str] = set()
        doc_excerpts: list[dict] = []
        for chunk in chunks:
            if chunk.doc_id not in seen:
                seen.add(chunk.doc_id)
                doc_excerpts.append({
                    "doc_id": chunk.doc_id,
                    "doc_name": chunk.doc_title,
                    "score": round(float(chunk.score), 4),
                    "snippet": (chunk.text or "")[:300],
                    "clause": getattr(chunk, "section_title", "") or "",
                })

        context = "\n".join(
            f"[{d['doc_name']} {d['clause']}] score={d['score']}: {d['snippet']}"
            for d in doc_excerpts
        )
        prompt = f"""Regulation: {event.get('standard_code', '')} {event.get('title', '')}
Obligations: {obligation_texts or event.get('summary', '')}

Affected documents found in knowledge base:
{context}

For each document, assess impact and recommend action. Return JSON array:
[{{"doc_id":"...","doc_name":"...","score":0.0,"key_clauses":"...","recommendation":"one sentence action"}}]"""

        messages = [
            {"role": "system", "content": _ASSESS_SYSTEM},
            {"role": "user", "content": prompt},
        ]
        result = _llm_json(self._client, messages)
        if isinstance(result, list):
            # merge score from retrieval (more reliable than LLM-invented scores)
            score_map = {d["doc_id"]: d["score"] for d in doc_excerpts}
            for item in result:
                if isinstance(item, dict) and item.get("doc_id") in score_map:
                    item["score"] = score_map[item["doc_id"]]
            return result
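        # fallback: reshape retrieval results to the LLM output keys, without a recommendation
        return [
            {
                "doc_id": d["doc_id"],
                "doc_name": d["doc_name"],
                "score": d["score"],
                "key_clauses": d["clause"],
                "recommendation": "",
            }
            for d in doc_excerpts
        ]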

    # ------------------------------------------------------------------
    # Step 3: Semantic diff
    # ------------------------------------------------------------------

    def compute_diff(self, old_text: str, new_text: str) -> dict:
        """Compare old and new regulation text; return changed sections and summary."""
        old_paras = [p.strip() for p in old_text.split("\n") if p.strip()]
        new_paras = [p.strip() for p in new_text.split("\n") if p.strip()]

        if not old_paras or not new_paras:
            return {"changed_sections": [], "change_summary": "No comparable text."}

        all_paras = old_paras + new_paras
        try:
            all_embeddings = self._embedder.embed_texts(all_paras)
        except Exception as exc:
            logger.warning("Embedding for diff failed: {}", exc)
            return {"changed_sections": [], "change_summary": "Diff unavailable (embedding error)."}

        old_embeddings = all_embeddings[: len(old_paras)]
        new_embeddings = all_embeddings[len(old_paras):]

        # Pair paragraphs by position. zip() stops at the shorter list, so paragraphs
        # appended to or dropped from the end of a version are not compared.
        changed_sections: list[dict] = []
        for i, (old_emb, new_emb, old_p, new_p) in enumerate(
            zip(old_embeddings, new_embeddings, old_paras, new_paras)
        ):
            sim = _cosine(old_emb, new_emb)
            if sim < _SIMILARITY_THRESHOLD:
                messages = [
                    {"role": "system", "content": _DIFF_SYSTEM},
                    {"role": "user", "content": f"OLD: {old_p[:500]}\nNEW: {new_p[:500]}"},
                ]
                classification = _llm_json(self._client, messages) or {}
                changed_sections.append({
                    "old_text": old_p[:300],
                    "new_text": new_p[:300],
                    "similarity": round(sim, 3),
                    "change_type": classification.get("change_type", "modified"),
                    "summary": classification.get("summary", ""),
                })

        if not changed_sections:
            change_summary = "No substantive changes detected between versions."
        else:
            # sorted() keeps the summary deterministic across runs
            types = sorted({str(s["change_type"]) for s in changed_sections})
            change_summary = (
                f"{len(changed_sections)} paragraph(s) changed: "
                + ", ".join(types)
                + ". "
                + (changed_sections[0].get("summary") or "")
            )

        return {"changed_sections": changed_sections, "change_summary": change_summary}
  • Step 4: Run tests
cd backend && PYTHONPATH=. pytest tests/perception/test_llm_pipeline.py -v

Expected: 4 tests PASS
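
`compute_diff` pairs paragraphs by position with `zip`, which only walks the overlap of the two versions. As a hedged sketch of one possible extension, `itertools.zip_longest` can also surface paragraphs that exist in only one version. `toy_embed` is a hypothetical bag-of-words stand-in for the real embedding provider, and `diff_paragraphs` is not part of the plan.

```python
import math
from itertools import zip_longest


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str, vocab: list[str]) -> list[float]:
    """Bag-of-words stand-in for a real embedding model (illustration only)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def diff_paragraphs(old: list[str], new: list[str], threshold: float = 0.85) -> list[dict]:
    """Flag modified, added, and removed paragraphs by position."""
    vocab = sorted({w for p in old + new for w in p.lower().split()})
    changes = []
    for index, (old_p, new_p) in enumerate(zip_longest(old, new)):
        if old_p is None:
            changes.append({"index": index, "change_type": "added"})
        elif new_p is None:
            changes.append({"index": index, "change_type": "removed"})
        elif cosine(toy_embed(old_p, vocab), toy_embed(new_p, vocab)) < threshold:
            changes.append({"index": index, "change_type": "modified"})
    return changes


old = ["batteries must pass ip67 testing", "records are kept for five years"]
new = [
    "batteries must pass ip67 testing",
    "suppliers shall report incidents",
    "audits occur annually",
]
print(diff_paragraphs(old, new))
# [{'index': 1, 'change_type': 'modified'}, {'index': 2, 'change_type': 'added'}]
```

Positional pairing still misreads an insertion in the middle of a document as a run of modifications; aligning paragraphs by best similarity would be a separate design step.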


Task 6: CrawlService

Files:

  • Create: backend/app/application/perception/crawl_service.py

  • Create: backend/tests/perception/test_crawl_service.py

  • Step 1: Write the failing test

# backend/tests/perception/test_crawl_service.py
"""Integration tests for CrawlService."""
from __future__ import annotations
from unittest.mock import MagicMock
import hashlib
import pytest

from app.infrastructure.perception.crawlers.base import RawEvent
from app.infrastructure.perception.mock_event_store import MockEventStore


def _make_raw_event(code="TST-001"):
    return RawEvent(
        source="TEST", source_label="Test", standard_code=code,
        title=f"Test {code}", summary="Summary", full_text_url="https://example.com",
        status="enacted", published_at="2026-01-01", effective_at=None,
        category="test", tags=["test"], raw_text="full text",
    )


def _make_service(raw_events):
    from app.application.perception.crawl_service import CrawlService

    mock_crawler = MagicMock()
    mock_crawler.fetch.return_value = raw_events

    mock_pipeline = MagicMock()
    mock_pipeline.extract_structure.return_value = {
        "obligations": [], "deadlines": [], "scope": "test",
        "penalties": None, "impact_level": "low",
    }
    mock_pipeline.assess_impact.return_value = []
    mock_pipeline.compute_diff.return_value = {
        "changed_sections": [], "change_summary": "No changes.",
    }

    mock_retrieval = MagicMock()
    store = MockEventStore()

    return CrawlService(
        crawlers={"TEST": mock_crawler},
        event_store=store,
        llm_pipeline=mock_pipeline,
        retrieval_service=mock_retrieval,
    )


def test_crawl_yields_progress_and_done():
    svc = _make_service([_make_raw_event("TST-001")])
    events = list(svc.run_crawl())
    event_types = [e.get("event") for e in events]
    assert "done" in event_types


def test_crawl_upserts_to_store():
    store = MockEventStore()
    from app.application.perception.crawl_service import CrawlService
    mock_crawler = MagicMock()
    mock_crawler.fetch.return_value = [_make_raw_event("NEW-001")]
    mock_pipeline = MagicMock()
    mock_pipeline.extract_structure.return_value = {
        "obligations": [], "deadlines": [], "scope": "",
        "penalties": None, "impact_level": "medium",
    }
    mock_pipeline.assess_impact.return_value = []
    mock_pipeline.compute_diff.return_value = {
        "changed_sections": [], "change_summary": "",
    }
    svc = CrawlService(
        crawlers={"TEST": mock_crawler},
        event_store=store,
        llm_pipeline=mock_pipeline,
        retrieval_service=MagicMock(),
    )
    list(svc.run_crawl())
    result = store.get_by_standard_code("NEW-001")
    assert result is not None
    assert result["title"] == "Test NEW-001"


def test_crawl_skips_unchanged_events():
    store = MockEventStore()
    raw = _make_raw_event("SKIP-001")
    content_hash = hashlib.sha256(raw.raw_text.encode()).hexdigest()
    # Pre-seed with same hash
    store.upsert({
        # must equal _event_id("TEST", "SKIP-001") so the service finds the seeded row
        "id": hashlib.sha256("TEST-SKIP-001".encode()).hexdigest()[:12],
        "standard_code": "SKIP-001",
        "source": "TEST",
        "source_label": "Test",
        "title": "Test SKIP-001",
        "summary": "",
        "full_text_url": "",
        "status": "enacted",
        "impact_level": "low",
        "published_at": "2026-01-01",
        "effective_at": None,
        "category": "test",
        "tags": [],
        "content_hash": content_hash,
    })
    mock_pipeline = MagicMock()
    from app.application.perception.crawl_service import CrawlService
    mock_crawler = MagicMock()
    mock_crawler.fetch.return_value = [raw]
    svc = CrawlService(
        crawlers={"TEST": mock_crawler},
        event_store=store,
        llm_pipeline=mock_pipeline,
        retrieval_service=MagicMock(),
    )
    list(svc.run_crawl())
    # pipeline should NOT have been called for unchanged event
    mock_pipeline.extract_structure.assert_not_called()
  • Step 2: Run to verify it fails
cd backend && PYTHONPATH=. pytest tests/perception/test_crawl_service.py -v

Expected: ImportError

  • Step 3: Implement CrawlService
# backend/app/application/perception/crawl_service.py
"""Orchestrates regulatory source crawlers and LLM enrichment pipeline."""

from __future__ import annotations

import hashlib
from typing import Any, Generator

from loguru import logger

from app.infrastructure.perception.base_event_store import BaseEventStore
from app.infrastructure.perception.crawlers.base import BaseCrawler, RawEvent
from app.infrastructure.perception.llm_pipeline import LlmPipeline


def _event_id(source: str, standard_code: str) -> str:
    """Deterministic 12-char ID from source + standard_code."""
    return hashlib.sha256(f"{source}-{standard_code}".encode()).hexdigest()[:12]


def _content_hash(raw_text: str) -> str:
    return hashlib.sha256(raw_text.encode()).hexdigest()


def _raw_to_dict(raw: RawEvent, event_id: str, content_hash: str) -> dict:
    return {
        "id": event_id,
        "source": raw.source,
        "source_label": raw.source_label,
        "standard_code": raw.standard_code,
        "title": raw.title,
        "summary": raw.summary,
        "full_text_url": raw.full_text_url,
        "status": raw.status,
        "impact_level": "medium",  # updated by LLM pipeline
        "published_at": raw.published_at,
        "effective_at": raw.effective_at,
        "category": raw.category,
        "tags": raw.tags,
        "content_hash": content_hash,
        "previous_hash": None,
    }


class CrawlService:
    """Orchestrate crawlers, hash-based change detection, and LLM enrichment."""

    def __init__(
        self,
        crawlers: dict[str, BaseCrawler],
        event_store: BaseEventStore,
        llm_pipeline: LlmPipeline,
        retrieval_service: Any,
    ) -> None:
        self._crawlers = crawlers
        self._store = event_store
        self._pipeline = llm_pipeline
        self._retrieval = retrieval_service

    def run_crawl(
        self, sources: list[str] | None = None
    ) -> Generator[dict, None, None]:
        """Run crawl for selected sources. Yields SSE-ready progress dicts."""
        targets = sources or list(self._crawlers.keys())
        total_new = 0
        total_updated = 0

        for source_key in targets:
            crawler = self._crawlers.get(source_key)
            if not crawler:
                yield {"event": "error", "data": {"source": source_key, "message": "Unknown source"}}
                continue

            yield {"event": "progress", "data": {"source": source_key, "stage": "fetching"}}
            try:
                raw_events = crawler.fetch(limit=100)
            except Exception as exc:
                logger.exception("Crawler failed source={}", source_key)
                yield {"event": "error", "data": {"source": source_key, "message": str(exc)}}
                continue

            yield {
                "event": "progress",
                "data": {"source": source_key, "stage": "processing", "fetched": len(raw_events)},
            }

            new_count = 0
            updated_count = 0

            for raw in raw_events:
                eid = _event_id(raw.source, raw.standard_code)
                new_hash = _content_hash(raw.raw_text or raw.title)
                existing = self._store.get(eid)

                if existing and existing.get("content_hash") == new_hash:
                    # Unchanged — skip LLM processing
                    continue

                is_update = existing is not None
                old_text = existing.get("summary", "") if is_update else ""
                previous_hash = existing.get("content_hash") if is_update else None

                event_dict = _raw_to_dict(raw, eid, new_hash)
                event_dict["previous_hash"] = previous_hash

                # Step 1: Structure extraction
                try:
                    structure = self._pipeline.extract_structure(event_dict)
                    event_dict.update(structure)
                except Exception as exc:
                    logger.warning("Structure extraction failed id={} err={}", eid, exc)

                # Step 2: Impact assessment
                try:
                    affected = self._pipeline.assess_impact(event_dict, self._retrieval)
                    event_dict["affected_docs"] = affected
                except Exception as exc:
                    logger.warning("Impact assessment failed id={} err={}", eid, exc)

                # Step 3: Semantic diff (only when updating existing event).
                # The store keeps the summary, not raw_text, so compare summary to summary.
                if is_update and old_text and raw.summary:
                    try:
                        diff = self._pipeline.compute_diff(old_text, raw.summary)
                        event_dict["change_summary"] = diff.get("change_summary")
                        event_dict["changed_sections"] = diff.get("changed_sections")
                    except Exception as exc:
                        logger.warning("Diff failed id={} err={}", eid, exc)

                self._store.upsert(event_dict)

                if is_update:
                    updated_count += 1
                else:
                    new_count += 1

            total_new += new_count
            total_updated += updated_count

            yield {
                "event": "progress",
                "data": {
                    "source": source_key,
                    "stage": "done",
                    "new": new_count,
                    "updated": updated_count,
                },
            }

        yield {
            "event": "done",
            "data": {"total_new": total_new, "total_updated": total_updated},
        }
  • Step 4: Run tests
cd backend && PYTHONPATH=. pytest tests/perception/test_crawl_service.py -v

Expected: 3 tests PASS
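
The skip logic in `run_crawl` rests on two hashes: a deterministic ID derived from source and standard code, and a content hash that changes whenever the text changes. The sketch below isolates that decision using a plain dict as the store. `classify` and the dict-based store are hypothetical simplifications, not the plan's `BaseEventStore` API.

```python
import hashlib


def event_id(source: str, standard_code: str) -> str:
    """Deterministic 12-char ID, same scheme as _event_id above."""
    return hashlib.sha256(f"{source}-{standard_code}".encode()).hexdigest()[:12]


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def classify(store: dict[str, dict], source: str, code: str, text: str) -> str:
    """Return 'new', 'updated' or 'unchanged' and record the latest hash."""
    eid = event_id(source, code)
    new_hash = content_hash(text)
    existing = store.get(eid)
    if existing and existing["content_hash"] == new_hash:
        return "unchanged"  # nothing to re-process
    store[eid] = {
        "content_hash": new_hash,
        "previous_hash": existing["content_hash"] if existing else None,
    }
    return "updated" if existing else "new"


store: dict[str, dict] = {}
print(classify(store, "TEST", "GB 18384-2025", "v1 text"))  # new
print(classify(store, "TEST", "GB 18384-2025", "v1 text"))  # unchanged
print(classify(store, "TEST", "GB 18384-2025", "v2 text"))  # updated
```

Because the ID depends only on source and standard code, re-crawling the same standard always lands on the same record, and only the content hash decides whether the LLM steps run again.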


Task 7: Wire bootstrap + add settings + update PerceptionService type hint

Files:

  • Modify: backend/app/config/settings.py

  • Modify: backend/app/shared/bootstrap.py

  • Modify: backend/app/application/perception/services.py

  • Modify: backend/requirements.txt

  • Modify: backend/.env

  • Modify: backend/.env.example

  • Step 1: Add settings

In backend/app/config/settings.py, after the use_celery_worker field (line ~88), add:

    # ── Perception crawl ──────────────────────────────────────────────────────
    perception_crawl_timeout_seconds: int = Field(
        default=120, description="HTTP timeout for regulatory source crawlers."
    )
    perception_max_events_per_source: int = Field(
        default=100, description="Maximum events fetched per source per crawl run."
    )
    perception_diff_similarity_threshold: float = Field(
        default=0.85,
        description="Cosine similarity below which a paragraph is flagged as changed.",
    )
  • Step 2: Add env vars to .env and .env.example

Add to backend/.env (after USE_CELERY_WORKER=false):

PERCEPTION_CRAWL_TIMEOUT_SECONDS=120
PERCEPTION_MAX_EVENTS_PER_SOURCE=100
PERCEPTION_DIFF_SIMILARITY_THRESHOLD=0.85

Add the same block to backend/.env.example.

  • Step 3: Fix type hint in PerceptionService

In backend/app/application/perception/services.py, change:

from app.infrastructure.perception.mock_event_store import MockEventStore

to:

from app.infrastructure.perception.base_event_store import BaseEventStore

Change constructor type hint from:

    def __init__(
        self,
        event_store: MockEventStore,
        retrieval_service: KnowledgeRetrievalService,
    ) -> None:

to:

    def __init__(
        self,
        event_store: BaseEventStore,
        retrieval_service: KnowledgeRetrievalService,
    ) -> None:
  • Step 4: Wire bootstrap.py

At the top of backend/app/shared/bootstrap.py, after existing imports, add:

from app.application.perception.crawl_service import CrawlService
from app.infrastructure.perception.base_event_store import BaseEventStore
from app.infrastructure.perception.crawlers.catarc_crawler import CatarcCrawler
from app.infrastructure.perception.crawlers.guobiao_crawler import (
    GuobiaoMandatoryCrawler,
    GuobiaoRecommendedCrawler,
)
from app.infrastructure.perception.crawlers.eurlex_crawler import EurlexCrawler
from app.infrastructure.perception.llm_pipeline import LlmPipeline

Replace the existing get_perception_service() function:

@lru_cache
def _get_event_store() -> BaseEventStore:
    """Return event store selected by DOCUMENT_REPOSITORY_BACKEND setting."""
    if settings.document_repository_backend == "postgres":
        from app.infrastructure.perception.postgres_event_store import PostgresEventStore
        return PostgresEventStore()
    return MockEventStore()


@lru_cache
def get_perception_service() -> PerceptionService:
    """Return perception service for regulatory intelligence."""
    return PerceptionService(
        event_store=_get_event_store(),
        retrieval_service=get_retrieval_service(),
    )


@lru_cache
def get_crawl_service() -> CrawlService:
    """Return CrawlService wired with all registered crawlers and LLM pipeline."""
    crawlers = {
        "CATARC": CatarcCrawler(),
        "国标委·强制性": GuobiaoMandatoryCrawler(),
        "国标委·推荐性": GuobiaoRecommendedCrawler(),
        "EUR-Lex": EurlexCrawler(),
    }
    return CrawlService(
        crawlers=crawlers,
        event_store=_get_event_store(),
        llm_pipeline=LlmPipeline(),
        retrieval_service=get_retrieval_service(),
    )
  • Step 5: Add beautifulsoup4 + lxml to requirements.txt

After the httpx>=0.25.0 line in backend/requirements.txt, add:

beautifulsoup4>=4.12.0
lxml>=5.0.0
  • Step 6: Verify imports work
cd backend && PYTHONPATH=. python -c "from app.shared.bootstrap import get_crawl_service; print('ok')"

Expected: ok
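
The wiring in Step 4 depends on `@lru_cache` turning each zero-argument factory into a process-wide singleton, so that `get_perception_service()` and `get_crawl_service()` read and write the same store. A minimal sketch of that behaviour follows; `InMemoryStore`, `Service`, `get_reader`, and `get_writer` are hypothetical stand-ins for the real classes and factories.

```python
from functools import lru_cache


class InMemoryStore:
    """Hypothetical stand-in for MockEventStore / PostgresEventStore."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}


class Service:
    """Hypothetical stand-in for PerceptionService / CrawlService."""

    def __init__(self, store: InMemoryStore) -> None:
        self.store = store


@lru_cache
def get_store() -> InMemoryStore:
    return InMemoryStore()  # built once, then served from the cache


@lru_cache
def get_reader() -> Service:
    return Service(get_store())


@lru_cache
def get_writer() -> Service:
    return Service(get_store())


get_writer().store.events["evt-001"] = {"title": "demo"}
print(get_reader().store.events)          # {'evt-001': {'title': 'demo'}}
print(get_reader().store is get_store())  # True
```

In tests, calling `get_store.cache_clear()` (and the same on dependent factories) gives each test a fresh instance.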


Task 8: New API endpoints (crawl + process + diff)

Files:

  • Modify: backend/app/api/routes/perception.py

  • Step 1: Add three new endpoints

Open backend/app/api/routes/perception.py. After the existing analyze_event endpoint, add:

from fastapi import Depends
from app.api.dependencies.auth import get_current_user
from app.domain.auth.models import UserClaims
from app.shared.bootstrap import get_crawl_service


@router.post("/crawl")
async def run_crawl(
    body: dict | None = None,
    current_user: UserClaims = Depends(get_current_user),
):
    """Trigger manual crawl of regulatory sources. Streams SSE progress.

    Body (optional): {"sources": ["CATARC", "国标委·强制性", "EUR-Lex"]}
    Omit sources to crawl all registered sources.
    """
    sources: list[str] | None = (body or {}).get("sources")
    crawl_svc = get_crawl_service()

    async def crawl_stream():
        async for item in iter_in_thread(crawl_svc.run_crawl(sources=sources)):
            event_name = item.get("event", "message")
            data = item.get("data", "")
            if isinstance(data, (dict, list)):
                data = json.dumps(data, ensure_ascii=False)
            yield f"event: {event_name}\ndata: {data}\n\n"

    return StreamingResponse(
        crawl_stream(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )


@router.post("/events/{event_id}/process")
async def process_event(
    event_id: str,
    current_user: UserClaims = Depends(get_current_user),
):
    """Trigger the LLM pipeline (structure extraction + impact assessment) for a single event."""
    from datetime import UTC, datetime
    from app.infrastructure.perception.llm_pipeline import LlmPipeline
    from app.shared.bootstrap import get_retrieval_service

    event = get_perception_service().get_event(event_id)
    if not event:
        from fastapi import HTTPException
        raise HTTPException(status_code=404, detail=f"Event {event_id} not found")

    store = get_crawl_service()._store  # share the same store instance
    pipeline = LlmPipeline()

    structure = pipeline.extract_structure(event)
    event.update(structure)
    event["affected_docs"] = pipeline.assess_impact(event, get_retrieval_service())
    event["processed_at"] = datetime.now(UTC).isoformat()
    store.upsert(event)

    return {"status": "ok", "event_id": event_id, "processed_at": event["processed_at"]}


@router.get("/events/{event_id}/diff")
async def get_event_diff(
    event_id: str,
    current_user: UserClaims = Depends(get_current_user),
):
    """Return semantic diff detail for an event (available only after a re-crawl detects a content change)."""
    from fastapi import HTTPException

    event = get_perception_service().get_event(event_id)
    if not event:
        raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
    if not event.get("change_summary"):
        raise HTTPException(status_code=404, detail="No diff available for this event")
    return {
        "event_id": event_id,
        "change_summary": event.get("change_summary"),
        "changed_sections": event.get("changed_sections") or [],
        "previous_hash": event.get("previous_hash"),
        "content_hash": event.get("content_hash"),
    }
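The frame built inline in `crawl_stream()` assumes `data` never contains a newline. That holds for `json.dumps` output but not necessarily for plain-string error messages. A sketch of a small helper that covers both cases; `format_sse` is a hypothetical name, not part of the existing codebase:

```python
import json


def format_sse(event: str, data: object) -> str:
    """Serialize one server-sent event frame."""
    if isinstance(data, (dict, list)):
        payload = json.dumps(data, ensure_ascii=False)
    else:
        payload = str(data)
    # In the SSE wire format every line of a multi-line payload
    # needs its own "data:" field.
    data_lines = "\n".join(f"data: {line}" for line in payload.split("\n"))
    return f"event: {event}\n{data_lines}\n\n"


progress = format_sse("progress", {"source": "CATARC", "stage": "fetching"})
multiline = format_sse("error", "line one\nline two")
```

The client parser in Task 9 reads only the first `data:` line of each frame, so single-line JSON payloads remain the safest choice even with a helper like this.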
  • Step 2: Smoke test with curl (backend running)
# With backend running (./dev.sh start api):
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/v1/perception/stats | python -m json.tool

Expected: JSON with total, high_impact, medium_impact, recent_90d.


Task 9: Frontend — Crawl Bar + Detail Tabs

Files:

  • Modify: frontend/src/pages/Perception/PerceptionPage.tsx

  • Step 1: Add CrawlBar state and handler at the top of PerceptionPage

In PerceptionPage.tsx, after the existing abortRef line (~line 107), add:

  const [crawling, setCrawling] = useState(false);
  const [crawlStatus, setCrawlStatus] = useState('');
  const [detailTab, setDetailTab] = useState<'overview'|'obligations'|'assessment'|'diff'>('overview');

  // Extended signal shape from DB (populated after crawl)
  const [selectedFull, setSelectedFull] = useState<Record<string, unknown> | null>(null);

  async function fetchFullEvent(id: string) {
    try {
      const res = await fetch(`/api/v1/perception/events/${id}`, { headers: authHeader() });
      if (res.ok) setSelectedFull(await res.json());
    } catch { /* ignore */ }
  }
  • Step 2: Add runCrawl function

After stopAnalysis(), add:

  async function runCrawl() {
    setCrawling(true);
    setCrawlStatus('正在连接数据源...');
    try {
      const res = await fetch('/api/v1/perception/crawl', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', ...authHeader() },
        body: JSON.stringify({}),
      });
      if (!res.ok) { setCrawlStatus(`错误: HTTP ${res.status}`); setCrawling(false); return; }
      if (!res.body) { setCrawlStatus('No stream'); setCrawling(false); return; }
      const reader = res.body.getReader();
      const dec = new TextDecoder();
      let buf = '';
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true keeps a multi-byte character split across chunks intact
        buf += dec.decode(value, { stream: true });
        const parts = buf.split('\n\n');
        buf = parts.pop() ?? '';
        for (const block of parts) {
          const eventLine = block.split('\n').find(l => l.startsWith('event: '));
          const dataLine = block.split('\n').find(l => l.startsWith('data: '));
          const evtName = eventLine?.slice(7).trim();
          const raw = dataLine?.slice(6).trim();
          if (!raw) continue;
          try {
            const d = JSON.parse(raw);
            if (evtName === 'progress') {
              setCrawlStatus(`${d.source}: ${d.stage === 'fetching' ? '抓取中...' : d.stage === 'processing' ? `处理 ${d.fetched} 条...` : `完成 +${d.new} 条`}`);
            } else if (evtName === 'done') {
              setCrawlStatus(`更新完成 — 新增 ${d.total_new} 条,更新 ${d.total_updated} 条`);
              // refresh event list
              fetch('/api/v1/perception/events?limit=100', { headers: authHeader() })
                .then(r => r.json())
                .then(d2 => { if (Array.isArray(d2?.events)) setSignals(d2.events.map(mapEvent)); });
            } else if (evtName === 'error') {
              setCrawlStatus(`错误: ${typeof d === 'string' ? d : d.message}`);
            }
          } catch { /* ignore */ }
        }
      }
    } catch (e: unknown) {
      setCrawlStatus(`连接失败: ${e instanceof Error ? e.message : String(e)}`);
    }
    setCrawling(false);
  }
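The read loop above has two subtleties: a network chunk can end mid-frame (hence the `buf` carry-over) and mid-character (multi-byte UTF-8 such as the Chinese source names). A sketch of the same buffering logic in Python, which may be handy for exercising the stream from a script. `SseParser` is a hypothetical helper, and unlike the frontend it passes non-JSON payloads through as plain strings:

```python
import codecs
import json


class SseParser:
    """Accumulate raw byte chunks and emit (event_name, payload) pairs."""

    def __init__(self) -> None:
        # An incremental decoder holds back a partial multi-byte character
        # until the remaining bytes arrive in a later chunk.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buf = ""

    def feed(self, chunk: bytes) -> list[tuple[str, object]]:
        self._buf += self._decoder.decode(chunk)
        # Everything before the last blank line is complete; keep the tail.
        *blocks, self._buf = self._buf.split("\n\n")
        events: list[tuple[str, object]] = []
        for block in blocks:
            name, raw = "message", None
            for line in block.split("\n"):
                if line.startswith("event: "):
                    name = line[7:].strip()
                elif line.startswith("data: "):
                    raw = line[6:].strip()
            if not raw:
                continue
            try:
                events.append((name, json.loads(raw)))
            except json.JSONDecodeError:
                events.append((name, raw))
        return events


frame = 'event: progress\ndata: {"source": "国标委·强制性", "stage": "fetching"}\n\n'.encode("utf-8")
cut = frame.index("国".encode("utf-8")) + 1  # split inside a 3-byte character
parser = SseParser()
first = parser.feed(frame[:cut])
second = parser.feed(frame[cut:])
```

The first call returns nothing because neither the character nor the frame is complete; the second call yields the whole event with the source name undamaged.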
  • Step 3: Update selectSignal to also fetch full event

Replace:

  function selectSignal(sig: Signal) {
    setSelected(sig);
    setAiOutput('');
    setStreaming(false);
  }

with:

  function selectSignal(sig: Signal) {
    setSelected(sig);
    setSelectedFull(null);
    setAiOutput('');
    setStreaming(false);
    setDetailTab('overview');
    fetchFullEvent(sig.id);
  }
  • Step 4: Replace Topbar Refresh button with CrawlBar

Replace the existing:

            <button className="btn sm"><RefreshCw size={13} />Refresh</button>

with:

            <button className="btn sm primary" onClick={runCrawl} disabled={crawling}>
              <RefreshCw size={13} className={crawling ? 'spin' : ''} />
              {crawling ? '抓取中...' : '刷新数据源'}
            </button>
            {crawlStatus && <span style={{ fontSize: 12, color: 'var(--text-secondary)', marginLeft: 8 }}>{crawlStatus}</span>}
  • Step 5: Replace right panel with tabbed detail view

Replace the entire right panel section (the <div className="analysis-pane"> block, roughly lines 267-319) with:

        <div className="analysis-pane">
          {!selected ? (
            <div className="analysis-empty">
              <div className="empty-ring" />
              <p>Select a signal to run impact analysis</p>
            </div>
          ) : (
            <>
              {/* ── Detail header card ── */}
              <div className="card detail-card">
                <div className="detail-header">
                  <span className="source-tag">{selected.source}</span>
                  <span className="ev-std">{selected.standard}</span>
                  <span className={`status ${selected.status}`}>
                    {selected.status === 'risk' ? 'Urgent' : selected.status === 'warn' ? 'Draft' : 'Published'}
                  </span>
                  {!!selectedFull?.change_summary && (
                    <span className="status warn" style={{ marginLeft: 'auto' }}>CHANGED</span>
                  )}
                </div>
                <div className="detail-title">{selected.title}</div>
                <p className="detail-summary">{selected.summary}</p>
                <div className="detail-actions">
                  {!streaming
                    ? <button className="btn sm primary" onClick={runAnalysis}><Play size={12} />Run impact analysis</button>
                    : <button className="btn sm" onClick={stopAnalysis}><Square size={12} />Stop</button>
                  }
                  {selected && (
                    <a
                      href={(selectedFull?.full_text_url as string) || '#'}
                      target="_blank"
                      rel="noopener noreferrer"
                      className="btn sm"
                    >
                      <ExternalLink size={12} />Source
                    </a>
                  )}
                </div>
              </div>

              {/* ── Tab bar ── */}
              <div className="detail-tabs">
                {(['overview', 'obligations', 'assessment', 'diff'] as const).map(tab => (
                  <button
                    key={tab}
                    className={`detail-tab${detailTab === tab ? ' active' : ''}${tab === 'diff' && !selectedFull?.change_summary ? ' disabled' : ''}`}
                    onClick={() => tab !== 'diff' || selectedFull?.change_summary ? setDetailTab(tab) : undefined}
                  >
                    {tab === 'overview' ? '概览' : tab === 'obligations' ? '义务条款' : tab === 'assessment' ? '影响评估' : '变更对比'}
                  </button>
                ))}
              </div>

              {/* ── Tab content ── */}
              {detailTab === 'overview' && (
                <div className="card">
                  <div className="card-header">Scope &amp; Summary</div>
                  <p className="detail-summary" style={{ marginTop: 8 }}>
                    {(selectedFull?.scope as string) || selected.summary}
                  </p>
                  {!!selectedFull?.penalties && (
                    <p style={{ fontSize: 13, color: 'var(--danger)', marginTop: 6 }}>
                       {selectedFull.penalties as string}
                    </p>
                  )}
                </div>
              )}

              {detailTab === 'obligations' && (
                <div className="card">
                  <div className="card-header">义务条款</div>
                  {(() => {
                    const obs = (selectedFull?.obligations as Array<Record<string,string>>) || [];
                    const deadlines = (selectedFull?.deadlines as Array<Record<string,string>>) || [];
                    return obs.length === 0 && deadlines.length === 0 ? (
                      <p className="detail-summary" style={{ marginTop: 8 }}>暂无结构化数据。点击右上角"Run impact analysis"触发提取。</p>
                    ) : (
                      <>
                        {obs.length > 0 && (
                          <table style={{ width: '100%', fontSize: 13, borderCollapse: 'collapse', marginTop: 8 }}>
                            <thead>
                              <tr style={{ borderBottom: '1px solid var(--border)' }}>
                                <th style={{ textAlign: 'left', padding: '4px 8px' }}>义务描述</th>
                                <th style={{ textAlign: 'left', padding: '4px 8px', width: 80 }}>主体</th>
                                <th style={{ textAlign: 'left', padding: '4px 8px', width: 60 }}>类型</th>
                              </tr>
                            </thead>
                            <tbody>
                              {obs.map((ob, i) => (
                                <tr key={i} style={{ borderBottom: '1px solid var(--border-faint)' }}>
                                  <td style={{ padding: '6px 8px' }}>{ob.text}</td>
                                  <td style={{ padding: '6px 8px', color: 'var(--text-secondary)' }}>{ob.subject}</td>
                                  <td style={{ padding: '6px 8px' }}>
                                    <span className={`status ${['must', 'shall', 'prohibited'].includes(ob.deontic) ? 'risk' : 'info'}`}>
                                      {ob.deontic}
                                    </span>
                                  </td>
                                </tr>
                              ))}
                            </tbody>
                          </table>
                        )}
                        {deadlines.length > 0 && (
                          <div style={{ marginTop: 12 }}>
                            <div className="card-header">截止日期</div>
                            {deadlines.map((d, i) => (
                              <div key={i} style={{ fontSize: 13, padding: '4px 0', display: 'flex', gap: 12 }}>
                                <span style={{ fontWeight: 600, color: 'var(--danger)' }}>{d.date || '待定'}</span>
                                <span style={{ color: 'var(--text-secondary)' }}>{d.description}</span>
                              </div>
                            ))}
                          </div>
                        )}
                      </>
                    );
                  })()}
                </div>
              )}

              {detailTab === 'assessment' && (
                <div className="card docs-card">
                  <div className="card-header">Affected documents</div>
                  {(() => {
                    const docs = (selectedFull?.affected_docs as Array<Record<string,unknown>>) || MOCK_DOCS.map(d => ({ doc_name: d.name, score: d.score / 100, key_clauses: d.clause, snippet: d.snippet, recommendation: '' }));
                    return docs.length === 0
                      ? <p className="detail-summary" style={{ marginTop: 8 }}>No affected documents found.</p>
                      : docs.map((d, i) => (
                          <div key={i} className="doc-row">
                            <span className="doc-score">{Math.round(Number(d.score ?? 0) * 100)}%</span>
                            <div>
                              <div className="doc-name">
                                {String(d.doc_name || '')}
                                <span className="doc-clause">{String(d.key_clauses || d.clause || '')}</span>
                              </div>
                              {!!d.snippet && <div className="doc-snippet">{String(d.snippet)}</div>}
                              {!!d.recommendation && (
                                <div style={{ fontSize: 12, color: 'var(--accent)', marginTop: 2 }}> {String(d.recommendation)}</div>
                              )}
                            </div>
                          </div>
                        ));
                  })()}
                </div>
              )}

              {detailTab === 'diff' && !!selectedFull?.change_summary && (
                <div className="card">
                  <div className="card-header">变更对比</div>
                  <p style={{ fontSize: 13, color: 'var(--text-secondary)', marginTop: 8 }}>
                    {selectedFull.change_summary as string}
                  </p>
                  {(() => {
                    const sections = (selectedFull.changed_sections as Array<Record<string,unknown>>) || [];
                    return sections.map((s, i) => (
                      <div key={i} style={{ marginTop: 12, borderTop: '1px solid var(--border)', paddingTop: 10 }}>
                        <div style={{ display: 'flex', gap: 8, marginBottom: 6 }}>
                          <span className={`status ${s.change_type === 'tightened' || s.change_type === 'added' ? 'risk' : s.change_type === 'removed' ? 'warn' : 'info'}`}>
                            {String(s.change_type)}
                          </span>
                          <span style={{ fontSize: 12, color: 'var(--text-secondary)' }}>cosine: {String(s.similarity)}</span>
                        </div>
                        <div style={{ display: 'grid', gridTemplateColumns: '1fr 1fr', gap: 8, fontSize: 12 }}>
                          <div style={{ background: 'var(--danger-bg)', padding: 8, borderRadius: 4 }}>
                            <div style={{ fontWeight: 600, marginBottom: 4 }}>旧版</div>
                            {String(s.old_text)}
                          </div>
                          <div style={{ background: 'var(--success-bg)', padding: 8, borderRadius: 4 }}>
                            <div style={{ fontWeight: 600, marginBottom: 4 }}>新版</div>
                            {String(s.new_text)}
                          </div>
                        </div>
                        {!!s.summary && <p style={{ fontSize: 12, marginTop: 6, color: 'var(--text-secondary)' }}>{String(s.summary)}</p>}
                      </div>
                    ));
                  })()}
                </div>
              )}

              {/* ── AI Analysis card (unchanged) ── */}
              {(aiOutput || streaming) && (
                <div className="card ai-card">
                  <div className="card-header">AI Impact Analysis</div>
                  <div className="ai-output">
                    {aiOutput}
                    {streaming && <span className="blink-cursor"></span>}
                  </div>
                </div>
              )}
            </>
          )}
        </div>
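The `cosine:` value rendered in the diff tab is presumably the cosine similarity between the embeddings of the old and new section text. A minimal pure-Python sketch of that measure; the 0.9 threshold and both function names are illustrative assumptions, not values taken from the pipeline:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as completely dissimilar.
    return dot / norm if norm else 0.0


def is_changed(old_vec: list[float], new_vec: list[float], threshold: float = 0.9) -> bool:
    """Flag a section as changed when similarity drops below the assumed threshold."""
    return cosine_similarity(old_vec, new_vec) < threshold


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A value near 1 means the two texts are semantically close, so a low value is what would justify showing the old and new text side by side.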
  • Step 6: Add CSS for tabs and spin animation

In frontend/src/styles/globals.css, append at the end:

/* ── Perception detail tabs ── */
.detail-tabs {
  display: flex;
  gap: 2px;
  margin: 8px 0 0;
  border-bottom: 1px solid var(--border);
  padding-bottom: 0;
}
.detail-tab {
  background: none;
  border: none;
  border-bottom: 2px solid transparent;
  padding: 6px 14px;
  font-size: 13px;
  color: var(--text-secondary);
  cursor: pointer;
  transition: color 0.15s, border-color 0.15s;
}
.detail-tab:hover { color: var(--text); }
.detail-tab.active {
  color: var(--accent);
  border-bottom-color: var(--accent);
  font-weight: 600;
}
.detail-tab.disabled {
  opacity: 0.35;
  cursor: not-allowed;
}

/* ── Spin animation for crawl refresh icon ── */
@keyframes spin { from { transform: rotate(0deg); } to { transform: rotate(360deg); } }
.spin { animation: spin 1s linear infinite; }
  • Step 7: Verify TypeScript compiles
cd frontend && npx tsc --noEmit

Expected: no errors (or only pre-existing errors unrelated to PerceptionPage)


Task 10: Install new Python dependencies

Files:

  • Modify: backend/requirements.txt (already done in Task 7)

  • Step 1: Install on server

# On the server (in project root):
# Quote the specifiers; an unquoted ">" is treated by the shell as output redirection.
.venv/bin/pip install "beautifulsoup4>=4.12.0" "lxml>=5.0.0"
  • Step 2: Verify import
PYTHONPATH=backend .venv/bin/python -c "from bs4 import BeautifulSoup; print('ok')"

Expected: ok

  • Step 3: Run all perception tests
cd backend && PYTHONPATH=. pytest tests/perception/ -v

Expected: all tests PASS


Task 11: End-to-end verification

  • Step 1: Start backend
./dev.sh start api
  • Step 2: Verify stats endpoint still works
TOKEN=$(curl -s -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"Admin@2026!"}' | python -m json.tool | grep access_token | cut -d'"' -f4)

curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/v1/perception/stats | python -m json.tool

Expected: {"total": ..., "high_impact": ..., ...}
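The `grep | cut` token extraction above depends on how `json.tool` happens to lay out its output. A sketch of a parser-based alternative, assuming the login response carries a top-level `access_token` string (which the `grep` implies); the sample payload and the function name are made up for illustration:

```python
import json


def extract_access_token(login_response: str) -> str:
    """Return the access_token field from a login response body."""
    payload = json.loads(login_response)
    token = payload.get("access_token") if isinstance(payload, dict) else None
    if not isinstance(token, str) or not token:
        # Fail loudly instead of sending "Authorization: Bearer " with no token.
        raise ValueError("login response has no access_token")
    return token


sample = '{"access_token": "abc.def.ghi", "token_type": "bearer"}'
token = extract_access_token(sample)
```

Failing early here makes a bad login show up as a clear error rather than as a confusing 401 on the stats call.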

  • Step 3: Trigger manual crawl (with DOCUMENT_REPOSITORY_BACKEND=json, uses MockEventStore)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  http://localhost:8000/api/v1/perception/crawl \
  -d '{"sources":["CATARC"]}' --no-buffer

Expected: SSE stream with event: progress lines followed by event: done
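The expected `progress` then `done` sequence comes from a synchronous generator that the endpoint drains off the event loop. A sketch of that pattern, assuming `run_crawl()` yields dicts with `event` and `data` keys (as the endpoint code reads them). `iter_sync_in_thread` and `fake_crawl` are hypothetical stand-ins, not the project's `iter_in_thread` or `CrawlService`, and the payload values are invented:

```python
import asyncio
from collections.abc import AsyncIterator, Iterator

_DONE = object()  # sentinel returned by next() when the generator is exhausted


async def iter_sync_in_thread(gen: Iterator[dict]) -> AsyncIterator[dict]:
    """Drain a blocking generator without blocking the event loop."""
    while True:
        # Each next() call runs in a worker thread, one at a time.
        item = await asyncio.to_thread(next, gen, _DONE)
        if item is _DONE:
            return
        yield item


def fake_crawl(sources: list[str]) -> Iterator[dict]:
    """Hypothetical stand-in for CrawlService.run_crawl()."""
    for source in sources:
        yield {"event": "progress", "data": {"source": source, "stage": "fetching"}}
        yield {"event": "progress", "data": {"source": source, "stage": "done", "new": 0}}
    yield {"event": "done", "data": {"total_new": 0, "total_updated": 0}}


async def collect(sources: list[str]) -> list[str]:
    return [item["event"] async for item in iter_sync_in_thread(fake_crawl(sources))]


events = asyncio.run(collect(["CATARC"]))
```

If the real stream ends without a `done` event, the crawl most likely raised inside the generator, so the API log is the first place to look.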

  • Step 4: Switch to postgres backend and re-verify (if PostgreSQL available)

In .env, set DOCUMENT_REPOSITORY_BACKEND=postgres, restart the API, then repeat Steps 2 and 3. Verify that events appear in the regulation_events table:

psql -h 6.86.80.8 -U postgresql -d compliance_db -c "SELECT COUNT(*) FROM regulation_events;"
  • Step 5: Build frontend on server
cd frontend && npm install && npm run build

Expected: build succeeds

  • Step 6: Open browser, navigate to Regulatory Signals page

Verify:

  • Stats bar shows real counts
  • "刷新数据源" button is visible in topbar
  • Clicking a signal shows 概览 / 义务条款 / 影响评估 / 变更对比 tabs
  • 变更对比 tab is greyed out until a second crawl detects a change

Self-Review

Spec coverage check:

| Spec requirement | Task |
|---|---|
| Replace MockEventStore → PostgresEventStore | Tasks 1, 2, 7 |
| BaseEventStore ABC as port | Task 1 |
| CATARC crawler | Task 3 |
| 国标委 mandatory + recommended crawlers | Task 3 |
| EUR-Lex RSS crawler | Task 4 |
| LLM structure extraction | Task 5 |
| LLM impact assessment (RAG) | Task 5 |
| Semantic diff via embedding | Task 5 |
| CrawlService with hash-based skip | Task 6 |
| bootstrap.py wiring + settings | Task 7 |
| POST /crawl SSE endpoint | Task 8 |
| POST /events/{id}/process endpoint | Task 8 |
| GET /events/{id}/diff endpoint | Task 8 |
| Frontend crawl bar + progress | Task 9 |
| Frontend detail tabs (4 tabs) | Task 9 |
| Changed badge on signal cards | Task 9 (CHANGED badge in header) |
| Real affected_docs replacing MOCK_DOCS | Task 9 |
| New Python dependencies | Task 10 |
| E2E verification | Task 11 |

All spec requirements covered. No placeholders found.