RFC-0012: Documentation Graph Model

Status

Accepted

Summary

Define a machine-readable intermediate representation (IR) for the remote-store API surface: a typed graph of nodes and edges derived from source code via static analysis, serialized as JSON, and projected to multiple documentation outputs (FEATURES.md capability matrix, per-class API reference stubs, extras index). The graph is the single source of truth for any derived doc artefact; FEATURES.md becomes a projection of it rather than a hand-maintained file. Versioned snapshots replace in-node since: / deprecated_in: tracking.

Motivation

Two documentation surfaces share the same drift problem:

FEATURES.md (ID-159) is a hand-maintained capabilities snapshot. The capability matrix, method list, and extras table are all mechanically derivable from source.

The API reference (docs/api/store/, docs/api/aio/, per-backend pages) has the same issue: method tables, capability gates, and backend-conditional parameter admonitions are written by hand against the source and drift silently as the API evolves.

Both surfaces expose the same cross-cutting relationships — which capability gates which method, which extra enables which backend — but currently each is authored independently with no shared extraction logic.

Without a shared IR:

FEATURES.md, the API reference, and the extras index each duplicate the same traversal logic.
Cross-cutting relationships — cap:WRITE gates Store.write, xtr:s3 enables S3Backend — are not queryable; they live in prose.
Diffing two releases means comparing prose documents, not structured data.

This RFC defines the graph IR. Implementation (the loader, projection scripts, and release-skill integration) is tracked under ID-159.

Goals

One typed graph; all derived doc artefacts are projections of it.
The graph is queryable in plain Python; no SPARQL, no property-graph DB.
Cross-cutting edges (capability → method, extra → backend, sync ↔ async mirror) are first-class, not buried in prose.
Versioned snapshots replace since: tracking; git-diff two snapshots to produce a changelog.
JSON serialization is byte-stable across regenerations of the same source tree (suitable for git diff).
The graph accommodates conditional capabilities (ID-140: per-instance, dialect-conditional capabilities) without a schema change.

Non-goals

Generating narrative prose. Curated sections of FEATURES.md remain hand-maintained (region-tagged).
Replacing the mkdocstrings HTML renderer. The API reference is still rendered by mkdocstrings; the graph produces structured stubs and admonition metadata, not raw HTML.
Storing type-inference results. Types are recorded as opaque TypeRef node labels, not inferred by a type checker.
Querying across multiple versions. The consumer loads two snapshots and diffs them; the IR itself is single-version.

Proposal

The graph: two collections

The IR is two sorted arrays: nodes and edges. Everything is addressable by a stable URI string. No separate adjacency structure — queries iterate the arrays.

{
  "schema_version": "1.4",
  "source_version": "0.24.0",
  "snapshot": "0.24.0",
  "nodes": [ ... ],
  "edges": [ ... ]
}

schema_version and source_version are independent fields. A schema bump (new node kinds, renamed edge kinds, or new properties on an existing node kind) increments schema_version; a new package release increments source_version. snapshot mirrors source_version: both are read from pyproject.toml [project][version].
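As a rough illustration of how a consumer might read this envelope, the sketch below loads the document and exposes two array-scanning queries. It assumes each node carries `id` and `kind` keys and each edge carries `kind`, `src`, and `dst` (the sort keys named under Determinism); the `Graph` class, its method names, and the sample contents are hypothetical and not part of the RFC.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass(frozen=True)
class Graph:
    """Read-only view over the two arrays (hypothetical helper)."""

    schema_version: str
    source_version: str
    nodes: list
    edges: list

    @classmethod
    def from_json(cls, text: str) -> "Graph":
        doc = json.loads(text)
        return cls(doc["schema_version"], doc["source_version"], doc["nodes"], doc["edges"])

    def nodes_of(self, kind: str, **props: Any) -> Iterator[dict]:
        # No adjacency index: every query is a linear scan of the array.
        for node in self.nodes:
            if node["kind"] == kind and all(node.get(k) == v for k, v in props.items()):
                yield node

    def out_edges(self, src: str, kind: str) -> Iterator[dict]:
        for edge in self.edges:
            if edge["kind"] == kind and edge["src"] == src:
                yield edge


# Illustrative document; the node and edge shapes are assumptions.
SAMPLE = """
{
  "schema_version": "1.4",
  "source_version": "0.24.0",
  "snapshot": "0.24.0",
  "nodes": [
    {"id": "cap:WRITE", "kind": "capability"},
    {"id": "cls:remote_store.backends._s3.S3Backend", "kind": "class", "role": "backend"}
  ],
  "edges": [
    {"kind": "declares", "src": "cls:remote_store.backends._s3.S3Backend", "dst": "cap:WRITE"}
  ]
}
"""

graph = Graph.from_json(SAMPLE)
backends = [n["id"] for n in graph.nodes_of("class", role="backend")]
declared = [e["dst"] for e in graph.out_edges(backends[0], "declares")]
print(backends, declared)
```

Linear scans keep the helper in line with the "no separate adjacency structure" rule; a consumer that needs faster lookups could build its own index after loading.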

Promote-to-node rule

If a value is referenced from more than one node, or a natural query is "which X have property Y", it is a node. Closed enumerations that only describe their parent (booleans, sync/async, abc/backend/facade) stay as properties.

Node taxonomy

Kind	URI prefix	Key properties	`role` values
`package`	`pkg:`	`runtime` (`sync`/`async`), `version`	—
`module`	`mod:`	`file`	—
`class`	`cls:`	`role`, `runtime`, `file`, `line`, `summary`	`abc` · `backend` · `facade` · `extension` · `data` · `enum` · `error` · `helper`
`method`	`mtd:`	`summary`, `is_abstract`, `is_async`, `file`, `line`	—
`capability`	`cap:`	`summary`, `semantics`	—
`data_model`	`dm:`	`frozen`, `summary`	—
`field`	`fld:`	`default`, `summary`	—
`error`	`err:`	`when_raised`, `summary`	—
`extra`	`xtr:`	`kind` (`backend`/`extension`/`aggregate`)	—
`package_dep`	`dep:`	`min_version`	—
`parameter`	`prm:`	`has_default`, `summary`	—
`type_ref`	`typ:`	`label` (opaque string)	—
`predicate`	`prd:`	`label` (opaque string, e.g. `"dialect == 'sqlite'"`)	—
`requirement`	`req:`	`mode` (`all`/`any`, default `all`)	—
`role`	`rol:`	`label` (`source`/`dest`/`self`)	—

predicate carries the condition for a conditional declares or raises edge as an opaque label. The IR records it; the semantic interpretation belongs to the code. This covers ID-140 (dialect-conditional capabilities) without a schema change.

requirement is an explicit AND/OR group between a method and its capability gate(s). All current methods use mode: all with a single capability; the node exists so the schema does not change when a method needs a conjunction or disjunction.

Two patterns distinguish multiple of edges on one req: node from multiple req: nodes:

AND of capabilities (one req:, N of edges): the method requires all N capabilities simultaneously on the same code path.
Alternative gates (N req: nodes, each with its own of edge): the method chooses one capability gate at runtime based on a condition. Each req: node represents one branch. URI convention: <method>.gate for the primary gate; <method>.gate_<discriminator> for each alternative (e.g. .gate_depth for a depth-limited code path).
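The two shapes can be sketched as plain edge tuples. Every URI below is hypothetical and only illustrates the structure; the helper names are also assumptions. The sketch reports which req: groups hold for a given set of declared capabilities and deliberately does not decide how a method picks among alternative gates, since the RFC leaves that to runtime code.

```python
# Hypothetical requirement nodes and their `mode` property.
MODES = {
    "req:Example.move.gate": "all",
    "req:Example.list.gate": "all",
    "req:Example.list.gate_depth": "all",
}

# Edges as (kind, src, dst) tuples.
EDGES = [
    # AND of capabilities: one req: node, two `of` edges.
    ("gates", "req:Example.move.gate", "mtd:Example.move"),
    ("of", "req:Example.move.gate", "cap:READ"),
    ("of", "req:Example.move.gate", "cap:WRITE"),
    # Alternative gates: two req: nodes, one `of` edge each.
    ("gates", "req:Example.list.gate", "mtd:Example.list"),
    ("of", "req:Example.list.gate", "cap:LIST"),
    ("gates", "req:Example.list.gate_depth", "mtd:Example.list"),
    ("of", "req:Example.list.gate_depth", "cap:GLOB"),
]


def group_satisfied(req, declared):
    """True if the capabilities grouped under `req` hold for `declared`."""
    caps = [dst for kind, src, dst in EDGES if kind == "of" and src == req]
    combine = any if MODES.get(req, "all") == "any" else all
    return combine(cap in declared for cap in caps)


def satisfied_gates(method, declared):
    """Return the req: URIs gating `method` whose group is satisfied."""
    reqs = [src for kind, src, dst in EDGES if kind == "gates" and dst == method]
    return [req for req in reqs if group_satisfied(req, declared)]


print(satisfied_gates("mtd:Example.move", {"cap:READ", "cap:WRITE"}))
print(satisfied_gates("mtd:Example.list", {"cap:GLOB"}))
```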

Edge taxonomy

Kind	Domain → Range	Attributes	Notes
`contains`	package/module/class → child	—	Containment tree
`inherits`	class → class	—	DAG
`declares`	backend → capability	`condition: prd URI` (omitted when unconditional)	Key present only when conditional; absent == unconditional
`gates`	requirement → method	—	Via `req:` group
`of`	requirement → capability	`index: int`	Members of the group
`enables`	extra → class/extension	—	pip extra → backend
`requires_dep`	extra → package_dep	—	pip dependency
`mirrors`	class → class	`capability_delta: {async_only: [str], sync_only: [str]}`	Canonical direction: async → sync peer (one edge per pair; deduped by generator). Capability lists are sorted; names are anchored to the canonical direction so `async_only` lists capabilities present on `src` (async) but absent on `dst` (sync).
`composes`	extension → class	—	The Store/Backend it wraps
`requires_cap`	extension/role → capability	—	Capability needed by ext/role
`played_by`	extension → role	—	Extension has this role on the edge
`returns`	method → data_model	—	—
`accepts`	method → data_model	`param: str`	—
`has_param`	method → parameter	`position: int`	—
`typed`	parameter/field → type_ref	—	—
`has_field`	data_model → field	—	—
`raises`	method → error	`condition: prd URI` (omitted when unconditional)	Key present only when conditional; absent == unconditional

gates and of together replace a simple "method requires capability" attribute. A method with one unconditional gate has one req: node linked by a single of edge to the capability, and one gates edge from that req: to the method. When a second capability is added, a second of edge joins the same req: node — no schema change.

Diagram

graph LR
  xtr -- enables --> cls_backend
  cls_backend -. mirrors .- cls_async
  cls_backend -- declares --> cap
  req -- of --> cap
  req -- gates --> mtd
  mtd -- returns --> dm
  dm -- has_field --> fld
  fld -- typed --> typ
  mtd -- has_param --> prm
  prm -- typed --> typ
  mtd -- raises --> err
  ext -- composes --> cls_backend
  ext -- played_by --> rol
  rol -- requires_cap --> cap
  xtr -- requires_dep --> dep

Worked example: S3Backend neighborhood

One-hop walk from cls:remote_store.backends._s3.S3Backend:

cls:...S3Backend
  ← enables          xtr:s3
  → inherits         cls:remote_store._backend.Backend
  → declares (×13)   cap:READ, cap:WRITE, cap:DELETE, cap:LIST, cap:GLOB,
                     cap:MOVE, cap:COPY, cap:ATOMIC_WRITE, cap:METADATA,
                     cap:USER_METADATA, cap:SEEKABLE_READ, cap:LAZY_READ,
                     cap:WRITE_RESULT_NATIVE
                     (condition omitted — all unconditional)
  ↔ mirrors          (no async S3 backend exists yet; edge applies once one is added)

mtd:remote_store.Store.write
  ← gates      req:Store.write.gate
  → returns    dm:remote_store.WriteResult

req:Store.write.gate
  → of         cap:WRITE

dm:remote_store.WriteResult
  → has_field  fld:WriteResult.etag → typed → typ:str|None

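The FEATURES.md capability matrix is a short walk: declares from each backend to its capabilities, then of (in reverse) and gates from each capability to the methods it gates: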

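def matrix(g):
    # Result shape: {backend: {capability: [methods gated by that capability]}}.
    # capabilities_declared_by and methods_gated_by are helpers that scan g's edges.
    return {
        b: {c: list(methods_gated_by(g, c))
            for c in capabilities_declared_by(g, b)}
        for b in g.nodes(kind="class", role="backend")
    }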
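A self-contained variant of the same walk, as a sketch: it treats the graph as the plain dict loaded from JSON and assumes node keys `id`, `kind`, `role` and edge keys `kind`, `src`, `dst`. The helper bodies are one possible reading of `capabilities_declared_by` and `methods_gated_by`, not the project's implementation, and the sample data is trimmed from the worked example.

```python
def capabilities_declared_by(g, backend_id):
    """Follow declares edges out of one backend."""
    return sorted(
        e["dst"] for e in g["edges"]
        if e["kind"] == "declares" and e["src"] == backend_id
    )


def methods_gated_by(g, cap_id):
    """Follow of edges in reverse to req: nodes, then gates edges to methods."""
    reqs = {e["src"] for e in g["edges"] if e["kind"] == "of" and e["dst"] == cap_id}
    return sorted(
        e["dst"] for e in g["edges"]
        if e["kind"] == "gates" and e["src"] in reqs
    )


def matrix(g):
    backends = [
        n["id"] for n in g["nodes"]
        if n["kind"] == "class" and n.get("role") == "backend"
    ]
    return {
        b: {c: methods_gated_by(g, c) for c in capabilities_declared_by(g, b)}
        for b in backends
    }


S3 = "cls:remote_store.backends._s3.S3Backend"
g = {
    "nodes": [
        {"id": "cap:READ", "kind": "capability"},
        {"id": "cap:WRITE", "kind": "capability"},
        {"id": S3, "kind": "class", "role": "backend"},
        {"id": "mtd:remote_store.Store.write", "kind": "method"},
        {"id": "req:Store.write.gate", "kind": "requirement"},
    ],
    "edges": [
        {"kind": "declares", "src": S3, "dst": "cap:READ"},
        {"kind": "declares", "src": S3, "dst": "cap:WRITE"},
        {"kind": "gates", "src": "req:Store.write.gate", "dst": "mtd:remote_store.Store.write"},
        {"kind": "of", "src": "req:Store.write.gate", "dst": "cap:WRITE", "index": 0},
    ],
}

print(matrix(g))
```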

Snapshots

One rolling file tracks the current development state. source_version and snapshot are both set to the pyproject.toml [project][version] at generation time; gen_graph.py reads this dynamically so no manual update is needed.

File	`source_version`	`snapshot`
`docs-src/_data/graph/graph.json`	current pyproject version	current pyproject version

Files live in docs-src/_data/graph/ (git-tracked; mkdocs copies the directory verbatim).

Release-time flow (Phase 2): after bump-my-version stamps the new version into pyproject.toml, run hatch run gen-graph to re-stamp graph.json with the release version before committing. The git tag then freezes the file; source_version is self-describing for consumers without git context.

Consumer note: between releases graph.json reflects the pyproject version at the last commit that touched it, not necessarily the in-progress dev state. Use git history or git tag to identify the exact released snapshot.

Determinism: the serializer must produce byte-identical output for the same source tree. Rules:

nodes sorted ascending by id URI.
edges sorted ascending by (kind, src, dst).
Object keys sorted (JSON sort_keys=True).
LF line endings, no trailing whitespace, 2-space indent.
One golden test: generate twice, assert the files are identical.
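The rules above map closely onto the standard `json` module. The following is a minimal sketch, assuming nodes carry `id` and edges carry `kind`, `src`, `dst`; the function name `dump_graph` is hypothetical. It does not add a tiebreak for edges that share the same `(kind, src, dst)` triple, so such edges would keep their input order.

```python
import json


def dump_graph(graph: dict) -> str:
    """Serialize a graph dict following the determinism rules."""
    doc = dict(graph)
    doc["nodes"] = sorted(graph["nodes"], key=lambda n: n["id"])
    doc["edges"] = sorted(graph["edges"], key=lambda e: (e["kind"], e["src"], e["dst"]))
    # sort_keys orders object keys; indent=2 emits LF-separated lines
    # with no trailing whitespace.
    return json.dumps(doc, sort_keys=True, indent=2) + "\n"


first = {
    "schema_version": "1.4",
    "nodes": [{"kind": "class", "id": "cls:b"}, {"id": "cap:a", "kind": "capability"}],
    "edges": [
        {"src": "req:x", "kind": "of", "dst": "cap:a"},
        {"kind": "declares", "dst": "cap:a", "src": "cls:b"},
    ],
}
# Same content, different array and key order.
second = {
    "nodes": list(reversed(first["nodes"])),
    "edges": list(reversed(first["edges"])),
    "schema_version": "1.4",
}

out_first = dump_graph(first)
out_second = dump_graph(second)
print(out_first == out_second)
```

When writing the result to disk, opening the file with `newline="\n"` keeps LF endings on platforms that would otherwise translate them.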

Projection design

Three layers between source and rendered Markdown:

src/  ─ Loader ─► IR (graph.json)  ─ Projection ─► View  ─ Jinja ─► Markdown
                                     (pure Python)

One projection function per output; all query logic lives in Python, not Jinja:

Output	Projection input	Primary walk
`FEATURES.md` capability matrix	All backend nodes	`backend → declares → cap ← of ← req → gates → method`
Per-backend reference page	One backend node	`backend → declares`, `mirrors`, `enables`
Capability × method table	All capability nodes	`cap ← of ← req → gates → method`
Extras index	All extra nodes	`extra → enables → backend`, `extra → requires_dep`

The projection returns plain Python dataclasses (view objects); templates render only. This makes the test surface trivial: snapshot the projection output, not the rendered Markdown.
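A projection in this style might look like the sketch below for the extras index. The view class `ExtraRow`, the function `project_extras`, and the `dep:s3fs` URI are hypothetical; the edge kinds and the graph-as-dict shape follow the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtraRow:
    """View object for one row of the extras index (hypothetical shape)."""

    extra: str
    enables: tuple
    deps: tuple


def _targets(g: dict, src: str, kind: str) -> tuple:
    return tuple(sorted(
        e["dst"] for e in g["edges"] if e["kind"] == kind and e["src"] == src
    ))


def project_extras(g: dict) -> list:
    """Walk extra -> enables and extra -> requires_dep for every extra node."""
    return [
        ExtraRow(
            extra=node["id"],
            enables=_targets(g, node["id"], "enables"),
            deps=_targets(g, node["id"], "requires_dep"),
        )
        for node in g["nodes"]
        if node["kind"] == "extra"
    ]


g = {
    "nodes": [
        {"id": "cls:remote_store.backends._s3.S3Backend", "kind": "class", "role": "backend"},
        {"id": "dep:s3fs", "kind": "package_dep"},
        {"id": "xtr:s3", "kind": "extra"},
    ],
    "edges": [
        {"kind": "enables", "src": "xtr:s3", "dst": "cls:remote_store.backends._s3.S3Backend"},
        {"kind": "requires_dep", "src": "xtr:s3", "dst": "dep:s3fs"},
    ],
}

rows = project_extras(g)
print(rows)
```

Because the rows are frozen dataclasses, a snapshot test can compare them with plain equality, with no Markdown rendering involved.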

Schema evolution

Older snapshots are frozen at their schema_version. The chosen strategy is read-only legacy: loaders support the lowest common subset across known schema versions; projections degrade gracefully when a node kind or edge kind is absent. Forklift-upgrade (re-running the generator against every historical tag) is deferred until a second schema version is actually needed.
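One way to read "degrade gracefully" is that a projection treats a missing edge kind or attribute as empty rather than as an error. The sketch below applies that to mirrors edges, which only carry capability_delta from schema 1.2 onward; the function name and row shape are hypothetical.

```python
def mirror_rows(graph: dict) -> list:
    """Project mirrors edges, tolerating snapshots that predate capability_delta."""
    rows = []
    for edge in graph.get("edges", []):
        if edge["kind"] != "mirrors":
            continue
        # Older snapshots have no capability_delta key: fall back to empty lists.
        delta = edge.get("capability_delta") or {}
        rows.append({
            "async": edge["src"],
            "sync": edge["dst"],
            "async_only": list(delta.get("async_only", [])),
            "sync_only": list(delta.get("sync_only", [])),
        })
    return rows


# Shortened, illustrative URIs.
old_snapshot = {"edges": [{"kind": "mirrors", "src": "cls:AsyncMemoryBackend", "dst": "cls:MemoryBackend"}]}
new_snapshot = {"edges": [{
    "kind": "mirrors",
    "src": "cls:AsyncMemoryBackend",
    "dst": "cls:MemoryBackend",
    "capability_delta": {"async_only": ["LAZY_READ"], "sync_only": []},
}]}

print(mirror_rows(old_snapshot))
print(mirror_rows(new_snapshot))
```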

Tooling appendix

This section is informational; implementation decisions belong to ID-159.

Loader. Griffe is already loaded by mkdocstrings (configured in mkdocs.yml). The graph generator reuses Griffe's parse via a Griffe Extension that hooks on_class_instance to populate declares edges from each backend's capabilities property, and on_module to collect extras from pyproject.toml. Griffe is not the IR; it is the parse layer.

Extras → backend mapping. The mapping is a two-source join:

src/remote_store/backends/__init__.py — each try/except ImportError block names the backend class and, implicitly, the package whose absence causes the failure (e.g. from remote_store.backends._s3 import S3Backend fails when s3fs is absent).
pyproject.toml [project.optional-dependencies] — maps each pip extra name to the packages it installs.

The generator joins the two by package name: "which extra installs the package that backends/__init__.py needs for this class?" This requires no hand-maintained table and stays correct as new backends are added, as long as both sources are consulted.
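The join itself is small once both sources are reduced to dictionaries. The sketch below assumes the first source has already been distilled into a class-to-package mapping and the second into the `[project.optional-dependencies]` table (for example via `tomllib`); the function names are hypothetical, and the `sftp` entry is illustrative rather than taken from the project's pyproject.toml.

```python
import re


def dist_name(requirement: str) -> str:
    """Extract the distribution name from a requirement string and normalize it."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"unparseable requirement: {requirement!r}")
    # Normalize runs of '-', '_' and '.' to a single '-' and lowercase.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def extras_for_backends(needed_by_class: dict, optional_deps: dict) -> dict:
    """Join by package name: which extras install what each backend class needs?"""
    installs = {
        extra: {dist_name(req) for req in reqs}
        for extra, reqs in optional_deps.items()
    }
    return {
        cls: sorted(extra for extra, names in installs.items() if dist_name(pkg) in names)
        for cls, pkg in needed_by_class.items()
    }


# Source 1 (assumed already extracted from backends/__init__.py).
needed_by_class = {"S3Backend": "s3fs"}
# Source 2 (shape of [project.optional-dependencies]; entries illustrative).
optional_deps = {"s3": ["s3fs>=1"], "sftp": ["paramiko"]}

mapping = extras_for_backends(needed_by_class, optional_deps)
print(mapping)
```

An aggregate extra that also lists the same package would show up in the result too, so a generator would likely filter on the extra node's `kind` before emitting enables edges.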

Static capability extraction. Today capabilities are declared as module-level CapabilitySet constants and exposed via a capabilities property. The lightest annotation that makes them statically extractable without running the code is a ClassVar annotation directly on the subclass:

CAPABILITIES: ClassVar[CapabilitySet] = CapabilitySet({...})

This is the recommended precondition for ID-159. It is a small, non-breaking refactor of each backend.
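To show why a ClassVar declaration is statically extractable, the sketch below reads it with the standard `ast` module, used here as a stand-in for the Griffe-based loader. It assumes the set literal contains `Capability.<NAME>` members; that shape, the sample class body, and the function name are assumptions for illustration.

```python
import ast

# Illustrative source; the real backend modules may differ.
SOURCE = '''
class S3Backend(Backend):
    CAPABILITIES: ClassVar[CapabilitySet] = CapabilitySet(
        {Capability.WRITE, Capability.READ}
    )
'''


def extract_capabilities(source: str) -> dict:
    """Map class name -> sorted capability names, without importing the code."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        for stmt in node.body:
            is_decl = (
                isinstance(stmt, ast.AnnAssign)
                and isinstance(stmt.target, ast.Name)
                and stmt.target.id == "CAPABILITIES"
                and stmt.value is not None
            )
            if not is_decl:
                continue
            # Collect every `Capability.<NAME>` attribute inside the value.
            found[node.name] = sorted({
                n.attr for n in ast.walk(stmt.value)
                if isinstance(n, ast.Attribute)
                and isinstance(n.value, ast.Name)
                and n.value.id == "Capability"
            })
    return found


caps = extract_capabilities(SOURCE)
print(caps)
```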

Gating table. Today, capability gating in _store.py is done via inline .require() calls scattered across each method (see References). The recommended precondition for ID-159 is to consolidate these into a central _GATING dict mapping method names to Capability values — making it both the runtime check source and the static extraction target for gates edges. This is the same category of precondition as CAPABILITIES: ClassVar on backends.
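A central table can serve both purposes at once, as in the following sketch. The `Capability` enum here is a stand-in for the real one, `CapabilityNotSupported` and the helper names are hypothetical, the `read` entry is illustrative, and the URI patterns follow the worked example.

```python
from enum import Enum


class Capability(Enum):  # stand-in for the enum in _capabilities.py
    READ = "read"
    WRITE = "write"


class CapabilityNotSupported(Exception):  # hypothetical error type
    pass


# Hypothetical table shape: Store method name -> required capability.
_GATING = {"read": Capability.READ, "write": Capability.WRITE}


def require(method: str, declared: frozenset) -> None:
    """Runtime use: raise if the backend lacks the capability gating `method`."""
    needed = _GATING.get(method)
    if needed is not None and needed not in declared:
        raise CapabilityNotSupported(f"{method} requires {needed.name}")


def gate_edges(gating: dict) -> list:
    """Static use: one req: node per method, with its gates and of edges."""
    edges = []
    for method, cap in sorted(gating.items()):
        req = f"req:Store.{method}.gate"
        edges.append({"kind": "gates", "src": req, "dst": f"mtd:remote_store.Store.{method}"})
        edges.append({"kind": "of", "src": req, "dst": f"cap:{cap.name}", "index": 0})
    return edges


edges = gate_edges(_GATING)
print(edges)
```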

Alternatives Considered

Griffe tree as IR

Griffe's object tree is containment-only. Cross-cutting edges (capability gates method, extra enables backend, sync mirrors async) have no natural home. Griffe's extra dict can carry annotations per node, but it cannot represent edges between nodes. The proposed graph is loaded from Griffe, not as Griffe.

RDF / property-graph database

No SPARQL queries are needed. All queries are simple array iterations in Python. RDF does not survive git diff as cleanly as JSON. Rejected.

`docspec` (Pydoc-Markdown's IR)

A published Python API IR format. Griffe is already in the dependency set (via mkdocstrings); adding docspec would be a second parse layer for no gain. Rejected.

Versioning as a graph axis (`since:` / `deprecated_in:` on edges)

Adding version fields to every edge grows the schema and the serialization cost. Snapshots give the same information for free via git diff: a node present in graph.json at the 0.24.0 tag and absent at the 0.23.0 tag was added in 0.24.0. Rejected.

Single rolling file (no per-version snapshots)

Adopted (ID-163). The original proposal kept per-version archive files (graph-X.Y.Z.json) alongside the rolling file. During implementation the archive step was dropped in favour of simplicity: graph.json is always stamped with the current pyproject.toml version; the git tag is the immutable record of the released state. Diffing two releases means checking out the two tags and comparing graph.json. This trades the convenience of in-tree snapshot artefacts for a smaller file surface and a simpler release sequence.
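Once the two versions of graph.json are in hand (for example via `git show <tag>:docs-src/_data/graph/graph.json`), the comparison reduces to set differences. The sketch below assumes the same node and edge keys as elsewhere in this RFC; the function name and sample node ids are hypothetical.

```python
def diff_graphs(old: dict, new: dict) -> dict:
    """Return added/removed node ids and edges between two loaded snapshots."""
    old_nodes = {n["id"] for n in old["nodes"]}
    new_nodes = {n["id"] for n in new["nodes"]}

    def edge_keys(g):
        return {(e["kind"], e["src"], e["dst"]) for e in g["edges"]}

    old_edges, new_edges = edge_keys(old), edge_keys(new)
    return {
        "nodes_added": sorted(new_nodes - old_nodes),
        "nodes_removed": sorted(old_nodes - new_nodes),
        "edges_added": sorted(new_edges - old_edges),
        "edges_removed": sorted(old_edges - new_edges),
    }


# Illustrative snapshots of two releases.
previous = {
    "source_version": "0.23.0",
    "nodes": [{"id": "cap:READ", "kind": "capability"}],
    "edges": [],
}
current = {
    "source_version": "0.24.0",
    "nodes": [
        {"id": "cap:READ", "kind": "capability"},
        {"id": "cap:WRITE", "kind": "capability"},
    ],
    "edges": [{"kind": "declares", "src": "cls:Example", "dst": "cap:WRITE"}],
}

changes = diff_graphs(previous, current)
print(changes)
```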

Impact

Public API: none. This RFC defines a build-time artefact, not a runtime surface.
Backwards compatibility: not applicable.
Performance: graph generation is a build-time step, not a hot path. The JSON file for the current backend surface is expected to be under 1 MB.
Testing: one golden test (byte-stable regeneration); snapshot tests for each projection function. No runtime tests.
Ripple-check: this RFC is design-only. The implementation PR for ID-159 will touch backends (adding CAPABILITIES: ClassVar), _store.py (adding _GATING table), docs-src/_data/graph/, scripts/, the release skill, and FEATURES.md.

Open Questions

Sync↔async peer discovery for mirrors edges. Resolved by ID-159: each async backend carries a __mirror__: ClassVar[type[T]] annotation pointing to its sync peer. The generator emits one directed edge per pair in the canonical async → sync direction (deduped). Consumers that need to query from the sync side must reverse the edge themselves.

Capability asymmetry between mirror peers. Resolved by ID-162 (schema 1.2): async and sync peers may declare different capability sets (e.g. AsyncMemoryBackend includes LAZY_READ; MemoryBackend does not). Each mirrors edge carries a capability_delta object with sorted async_only and sync_only capability-name lists so consumers can render the asymmetry instead of treating the peers as equivalent. The lists are always present (empty when the peers are symmetric).
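The delta on a mirrors edge is a pair of sorted set differences, anchored to the async → sync direction. A minimal sketch, with hypothetical function names, shortened class URIs, and illustrative capability sets apart from the LAZY_READ asymmetry described above:

```python
def capability_delta(async_caps, sync_caps) -> dict:
    """Sorted capability names present on one peer but not the other."""
    a, s = set(async_caps), set(sync_caps)
    # Both lists are always present, empty when the peers are symmetric.
    return {"async_only": sorted(a - s), "sync_only": sorted(s - a)}


def mirrors_edge(async_id: str, sync_id: str, async_caps, sync_caps) -> dict:
    """Build the single canonical edge: src is the async class, dst its sync peer."""
    return {
        "kind": "mirrors",
        "src": async_id,
        "dst": sync_id,
        "capability_delta": capability_delta(async_caps, sync_caps),
    }


edge = mirrors_edge(
    "cls:AsyncMemoryBackend",  # shortened, illustrative URI
    "cls:MemoryBackend",       # shortened, illustrative URI
    async_caps={"READ", "WRITE", "LAZY_READ"},
    sync_caps={"READ", "WRITE"},
)
print(edge)
```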

References

ID-159 (FEATURES.md hybrid generation): sdd/BACKLOG.md
ID-140 (dialect-conditional capabilities): sdd/BACKLOG.md
Backend capabilities: src/remote_store/backends/_local.py:26, _s3.py:37, _sftp.py:47, _azure.py:47, _sqlalchemy.py:47, _http.py:41
Capability gating in Store: src/remote_store/_store.py:68,87,118,189
Capability enum: src/remote_store/_capabilities.py
mkdocs + mkdocstrings config: mkdocs.yml
Existing gen-files script: scripts/gen_pages.py
Extension architecture: sdd/adrs/0008-extension-architecture.md
Backend adapter contract: sdd/specs/003-backend-adapter-contract.md