SQL Query Backend Specification¶
Overview¶
SQLQueryBackend implements the Backend ABC as a read-only query
materializer. It maps path keys to SQL queries and serializes result sets to
bytes in the format implied by the key's file extension (Parquet, CSV, Arrow
IPC).
Shares _SQLAlchemyBaseBackend with SQLBlobBackend (v1). This spec covers
only SQLQueryBackend (v2).
Primary use cases: exposing SQL query results as portable data products, on-demand materialization of views/reports, bridging SQL databases into the Store abstraction for downstream consumers who don't need to know the source is a database.
Dependencies: sqlalchemy>=2.0, pyarrow>=12.0.0 (optional extra:
sql-query).
Architecture: Two-class design with shared base — see ADR-0018.
Construction¶
SQL-QUERY-001: Constructor¶
Signature:
SQLQueryBackend(
url: str | None = None,
*,
engine: Engine | None = None,
queries: dict[str, str] | None = None,
strict: bool = True,
serializer: ResultSerializer | None = None,
)
Preconditions:
- Exactly one of url or engine must be provided. Both or neither raises
ValueError.
- queries, if provided, must have non-empty string keys and non-empty string
values. Empty keys or values raise ValueError.
- If strict=False, raises NotImplementedError (view/convention discovery
deferred to future version).
Postconditions:
- Engine is created (if url) or stored (if engine).
- If SQLite: PRAGMA journal_mode=WAL and PRAGMA synchronous=NORMAL are set.
- Key registry is built from the queries dict.
- If serializer is None, defaults to ArrowSerializer().
SQL-QUERY-002: Backend Name¶
Invariant: name property returns "sql-query".
SQL-QUERY-003: Capability Declaration¶
Invariant: SQLQueryBackend declares 5 capabilities: READ, LIST,
METADATA, GLOB, SEEKABLE_READ.
SQL-QUERY-004: Repr¶
Invariant: repr(backend) returns:
Credential-bearing URL components are never included.
SQL-QUERY-005: Registration¶
Invariant: The "sql-query" backend type is registered in
_register_builtin_backends() behind a try/except ImportError guard.
The registration guard checks for sqlalchemy (required for import).
pyarrow is checked eagerly at construction time (__init__) with a
clear error message directing users to install remote-store[sql-query].
Key Resolution¶
SQL-QUERY-010: Explicit Query Mapping¶
Invariant: Keys are resolved by exact lookup in the queries dict.
If the key is not found, NotFound is raised. No partial matching, no
normalization beyond standard path validation.
Resolution precedence (v2 implements level 1 only):
1. Explicit mapping (queries dict) — highest priority
2. View-based (deferred)
3. Convention-based (deferred)
SQL-QUERY-011: Strict Mode¶
Invariant: strict=True (default) restricts resolution to the explicit
queries dict only. Setting strict=False raises NotImplementedError in
v2 — view-based and convention-based discovery are deferred to a future
version.
SQL-QUERY-012: Format Detection¶
Invariant: The serialization format is determined by the key's file extension:
| Extension | Format |
|---|---|
.parquet |
Apache Parquet |
.csv |
CSV |
.arrow, .ipc |
Arrow IPC |
Keys without a recognized extension raise InvalidPath with a message listing
supported extensions.
Serialization¶
SQL-QUERY-013: ResultSerializer Protocol¶
Invariant: ResultSerializer is a typing.Protocol:
@runtime_checkable
class ResultSerializer(Protocol):
def serialize(self, rows: Sequence[Any], columns: Sequence[str], format: str) -> bytes: ...
The format parameter is the lowercase extension without the dot (e.g.,
"parquet", "csv", "arrow"). The rows parameter is a sequence of
row tuples. The columns parameter is the column name list.
SQL-QUERY-014: ArrowSerializer¶
Invariant: The built-in ArrowSerializer converts rows + columns to a
pyarrow.Table, then serializes to the requested format:
- "parquet" — pyarrow.parquet.write_table()
- "csv" — pyarrow.csv.write_csv()
- "arrow" / "ipc" — pyarrow.ipc.RecordBatchFileWriter
An empty result set (zero rows) produces a valid file with schema but no data.
Unsupported format strings raise ValueError.
Operations¶
SQL-QUERY-020: read()¶
Invariant: read(path) resolves the key to SQL, executes the query,
serializes the result, and returns a seekable BinaryIO.
Pipeline:
1. Validate path (same rules as SQL-BLOB-060).
2. Resolve key to SQL string via the queries dict.
3. Detect format from file extension.
4. Execute SQL: connection.execute(sa.text(sql)) within _map_errors().
5. Serialize: self._serializer.serialize(rows, columns, format).
6. Return BufferedReader(BytesIO(data)).
An empty result set is valid — not an error.
SQL-QUERY-021: read_bytes()¶
Invariant: Same pipeline as read(), returns raw bytes.
SQL-QUERY-030: list_files()¶
Invariant: Enumerates keys from the key registry (not database tables).
Filters by prefix matching. Returns FileInfo for each matching key with
size=0 (sentinel — size unknown until materialized) and
modified_at=datetime.min(tzinfo=utc).
- Non-recursive: only direct children of the prefix.
- Recursive: all keys under the prefix.
max_depth: limits depth of suffix beyond prefix.
SQL-QUERY-031: list_folders()¶
Invariant: Derives virtual folders from key registry paths. Same
prefix-based deduplication logic as SQLBlobBackend.list_folders().
SQL-QUERY-032: glob()¶
Invariant: Applies pattern_to_regex(pattern) from _glob.py against
the key registry. Returns FileInfo for all matching keys. No SQL-side
narrowing needed since the key set is in-memory.
Metadata¶
SQL-QUERY-040: get_file_info()¶
Invariant: Returns FileInfo with:
- path: the requested key as RemotePath
- name: final path component
- size: 0 (sentinel — unknown until materialized)
- modified_at: datetime.min.replace(tzinfo=timezone.utc)
- extra: {"materialized": False}
Raises NotFound if the key is not in the registry.
SQL-QUERY-041: get_folder_info()¶
Invariant: Computes folder info from the key registry:
- file_count: number of keys under prefix
- total_size: 0 (all files have sentinel size)
- modified_at: None
Raises NotFound if no keys match the prefix and path is non-empty.
Existence Checks¶
SQL-QUERY-042: exists()¶
Invariant: Returns True if path matches a registered key (file) or
any key starts with path/ (virtual folder). Empty path returns True
(root).
SQL-QUERY-043: is_file()¶
Invariant: Returns True if path is an exact key in the registry.
Empty path returns False.
SQL-QUERY-044: is_folder()¶
Invariant: Returns True if any registered key starts with path/.
Empty path returns True (root).
Unsupported Operations¶
SQL-QUERY-050: Write Operations¶
Invariant: The following methods raise CapabilityNotSupported with
backend=self.name:
write()— capability"write"write_atomic()— capability"atomic_write"open_atomic()— capability"atomic_write"delete()— capability"delete"delete_folder()— capability"delete"move()— capability"move"copy()— capability"copy"
Engine & Connection¶
SQL-QUERY-060: check_health()¶
Invariant: Inherited from _SQLAlchemyBaseBackend. Executes SELECT 1.
Raises BackendUnavailable on failure.
SQL-QUERY-061: close()¶
Invariant: Inherited. Disposes owned engines; no-op for borrowed.
SQL-QUERY-062: unwrap()¶
Invariant: Inherited. unwrap(Engine) returns the SQLAlchemy Engine.
SQL-QUERY-063: SQLite PRAGMAs¶
Invariant: Inherited. WAL mode and synchronous=NORMAL for SQLite.
Error Mapping¶
SQL-QUERY-070: Exception Translation¶
| Condition | remote-store error |
|---|---|
| Key not in registry | NotFound |
| Unsupported file extension | InvalidPath |
strict=False |
NotImplementedError |
SQL execution error (ProgrammingError, OperationalError) |
RemoteStoreError / BackendUnavailable |
Other SQLAlchemyError |
RemoteStoreError |
| Unsupported operation | CapabilityNotSupported |
Backend-native exceptions must never leak.
Path Handling¶
SQL-QUERY-080: Path Validation¶
Invariant: Same rules as SQL-BLOB-060:
- No null bytes -> InvalidPath
- No absolute paths (leading /) -> InvalidPath
- No .. segments -> InvalidPath
- Empty path allowed for folder operations (root), rejected for file operations.
Performance¶
SQL-QUERY-090: Query Execution¶
Invariant: Queries execute via SQLAlchemy Core connection.execute().
Full materialization of result sets in memory. Streaming/chunked execution
is deferred to a future version.
SQL-QUERY-091: Serialization Overhead¶
Invariant: The ArrowSerializer converts SQL rows to a PyArrow Table,
then serializes. This is a full copy. Zero-copy paths (ADBC) are deferred
to v3.