Glob - Pattern Matching Specification
Overview
Three-tier pattern matching for remote-store (ADR-0009):
- list_files(pattern=…): fnmatch name filtering on Store.list_files(). Works with every backend (needs LIST only).
- Store.glob(pattern): native backend glob, capability-gated on Capability.GLOB. Like unwrap(), this is opt-in native access.
- ext.glob.glob_files(): portable full glob. Delegates to Store.glob() when available, otherwise list_files() + client-side regex.
Patterns follow Unix glob conventions: * matches any non-separator characters,
** matches zero or more path segments (recursive), ? matches a single
non-separator character, [abc] matches a character class, [!abc] matches
a negated character class.
Module (extension): src/remote_store/ext/glob.py
Dependencies: None (pure Python, always available)
Related: 003-backend-adapter-contract.md
(CAP-001, BE-024), 001-store-api.md (STORE-014, STORE-018),
BK-002, ID-007, ADR-0009.
Tier 1: list_files(pattern=…)
GLOB-001: list_files pattern Parameter
Invariant: Store.list_files(path, *, recursive=False, pattern=None).
When pattern is not None, only files whose name matches the pattern
(via fnmatch.fnmatch) are yielded.
Postconditions: Filtering is applied at the Store level after
rebasing paths. Backend list_files signature is unchanged.
Rationale: Covers the common case ("give me the CSVs") without new
capabilities, new methods, or extensions.
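As a rough illustration of the name filtering described in GLOB-001, the sketch below applies fnmatch.fnmatch to the final path segment only. The helper name filter_by_name and the use of plain strings instead of FileInfo objects are assumptions made for this example, not part of the library.

```python
import fnmatch
import posixpath
from typing import Iterable, Iterator, Optional


def filter_by_name(paths: Iterable[str], pattern: Optional[str] = None) -> Iterator[str]:
    """Yield paths whose file name (last segment) matches an fnmatch pattern.

    Hypothetical helper: the real Store works with FileInfo objects.
    """
    for path in paths:
        name = posixpath.basename(path)  # match on the name, not the full path
        if pattern is None or fnmatch.fnmatch(name, pattern):
            yield path


files = ["data/a.csv", "data/b.txt", "data/sub/c.csv"]
csv_files = list(filter_by_name(files, "*.csv"))
```

Because only the name is matched, a pattern such as `*.csv` selects files at any depth that the listing returns; how deep the listing goes is controlled separately by `recursive`.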
Tier 2: Native Glob Capability and Store API
GLOB-002: Capability.GLOB Enum Member
Invariant: Capability.GLOB is a member of the Capability enum with
value "glob".
Rationale: Backends that implement native pattern matching declare this
capability. Backends without native glob omit it — Tier 1 and Tier 3
provide universal alternatives.
GLOB-003: Backend.glob() Default Method
Invariant: Backend.glob(pattern) is a non-abstract method with a default
implementation that raises CapabilityNotSupported.
Signature: Backend.glob(pattern) -> Iterator[FileInfo] (return type as stated in GLOB-004 and GLOB-006).
pattern is a glob pattern relative to the backend root.
Supports *, **, ?, [abc], and [!abc].
Raises: CapabilityNotSupported if the backend does not declare
Capability.GLOB.
Rationale: Non-abstract so existing backends compile without changes.
Backends that add native glob override this method and add GLOB to their
capability set.
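The default-method pattern in GLOB-003 can be sketched as follows. The classes here are simplified stand-ins written for this example (the real Backend and CapabilityNotSupported carry more detail), and the string "glob" stands in for Capability.GLOB.

```python
from typing import FrozenSet, Iterator, List


class CapabilityNotSupported(Exception):
    """Stand-in for the library exception of the same name."""


class Backend:
    """Hypothetical minimal base class for illustration only."""

    capabilities: FrozenSet[str] = frozenset()

    def glob(self, pattern: str) -> Iterator[str]:
        # Non-abstract default: a plain function (no yield), so the error
        # is raised at call time rather than on first iteration.
        raise CapabilityNotSupported("glob")


class ListOnlyBackend(Backend):
    """Existing backend: inherits the default and needs no changes."""


class NativeGlobBackend(Backend):
    """Backend with native glob: overrides glob() and declares the capability."""

    capabilities = frozenset({"glob"})

    def __init__(self, keys: List[str]) -> None:
        self._keys = keys

    def glob(self, pattern: str) -> Iterator[str]:
        # Placeholder matching: a real backend would apply the pattern natively.
        return iter(self._keys)


try:
    ListOnlyBackend().glob("*.csv")
    default_raised = False
except CapabilityNotSupported:
    default_raised = True
```

Keeping the default as a regular method rather than a generator means callers see the error immediately, which fits the capability-gating behavior described for Store.glob().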
GLOB-004: Backend.glob() Postconditions
Invariant: Returns only files (not folders). Paths in returned FileInfo
objects are backend-relative (same convention as list_files). Results are
yielded lazily via iterator.
GLOB-005: LocalBackend Native Glob
Invariant: LocalBackend overrides glob() using pathlib.Path.glob().
LocalBackend declares Capability.GLOB in its capability set.
Postconditions: Leverages the OS filesystem's native pattern matching.
FileInfo paths are converted via to_key() (same as list_files).
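A minimal sketch of a pathlib-based glob that honors the GLOB-004 contract (files only, root-relative paths, lazy iteration) might look like this. The function local_glob is hypothetical, and relative_to(...).as_posix() is used here as a stand-in for the library's to_key() conversion.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def local_glob(root: Path, pattern: str) -> Iterator[str]:
    """Lazily yield root-relative POSIX paths of files matching the pattern."""
    for match in root.glob(pattern):
        if match.is_file():  # Path.glob also returns directories; skip them
            yield match.relative_to(root).as_posix()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data" / "2024").mkdir(parents=True)
    (root / "data" / "2024" / "a.csv").write_text("x")
    (root / "data" / "b.csv").write_text("y")
    (root / "data" / "notes.txt").write_text("z")
    found = sorted(local_glob(root, "**/*.csv"))
```

The results are sorted only to make the example deterministic; Path.glob() itself does not guarantee an order.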
GLOB-018: S3Backend Native Glob
Invariant: S3Backend overrides glob() using prefix-optimized listing
via s3fs. S3Backend declares Capability.GLOB in its capability set.
Algorithm: Extracts the longest non-wildcard prefix from the pattern,
lists files under that prefix (recursive or non-recursive as determined by
the pattern), and filters client-side with a compiled regex.
Postconditions: Same contract as GLOB-004 (files only, backend-relative
paths, lazy iterator). Error handling is delegated to list_files().
GLOB-019: S3PyArrowBackend Native Glob
Invariant: S3PyArrowBackend overrides glob() using the same
prefix-optimized algorithm as GLOB-018, delegating to its own list_files()
(which uses s3fs for listing). S3PyArrowBackend declares Capability.GLOB.
GLOB-020: AzureBackend Native Glob
Invariant: AzureBackend overrides glob() using prefix-optimized
listing via the Blob SDK. AzureBackend declares Capability.GLOB.
Postconditions: Works with both HNS and non-HNS accounts because it
delegates to self.list_files() which handles both modes. Same contract
as GLOB-004.
GLOB-006: Store.glob() Signature
Invariant: Store.glob(pattern) -> Iterator[FileInfo].
Parameters: pattern is a glob pattern relative to the store root.
Raises: CapabilityNotSupported if the backend lacks Capability.GLOB.
GLOB-007: Store.glob() Path Scoping
Invariant: Store prepends root_path to the pattern before delegating to
Backend.glob(). Returned FileInfo.path values are rebased to store-relative
(same as list_files).
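The scoping and rebasing steps in GLOB-007 can be pictured with two small string helpers. Both function names are hypothetical, and the sketch assumes root_path contains no glob metacharacters (how the library treats such roots is not covered here).

```python
def scope_pattern(root_path: str, pattern: str) -> str:
    """Prepend the store root to a store-relative pattern."""
    root = root_path.strip("/")
    return f"{root}/{pattern}" if root else pattern


def rebase(root_path: str, backend_path: str) -> str:
    """Convert a backend-relative path back to a store-relative one."""
    root = root_path.strip("/")
    if not root:
        return backend_path
    prefix = root + "/"
    if not backend_path.startswith(prefix):
        raise ValueError(f"{backend_path!r} is outside root {root!r}")
    return backend_path[len(prefix):]


backend_pattern = scope_pattern("tenant/a", "**/*.csv")
store_path = rebase("tenant/a", "tenant/a/reports/q1.csv")
```

With these helpers, a caller working under the root `tenant/a` never sees the root in either the pattern it writes or the paths it receives.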
GLOB-008: Store.glob() Capability Gating
Invariant: Store.glob() calls capabilities.require(Capability.GLOB)
before delegating. If the backend lacks GLOB, the caller should use
list_files(pattern=...) or ext.glob.glob_files() instead.
Tier 3: Extension — ext.glob
GLOB-009: glob_files Signature
Invariant: glob_files(store, pattern) -> Iterator[FileInfo].
Parameters: store is a Store instance. pattern is a glob pattern
relative to the store root.
GLOB-010: Native Delegation
Invariant: When store.supports(Capability.GLOB) is True,
glob_files delegates entirely to store.glob(pattern).
GLOB-011: Client-Side Fallback
Invariant: When store.supports(Capability.GLOB) is False,
glob_files extracts the longest non-wildcard directory prefix from the
pattern, calls store.list_files(prefix, recursive=...), and filters
results client-side against the compiled pattern.
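The dispatch between GLOB-010 and GLOB-011 can be sketched with an in-memory stand-in for Store. Everything here is hypothetical scaffolding: FakeStore, the string "glob" in place of Capability.GLOB, and the extra matches parameter (the real glob_files takes only store and pattern and compiles the matcher itself). For brevity the fallback lists from the root recursively instead of narrowing by prefix.

```python
from typing import Callable, Iterator, List

GLOB = "glob"  # stand-in for Capability.GLOB


class FakeStore:
    """Hypothetical store exposing only supports(), glob(), and list_files()."""

    def __init__(self, files: List[str], native: bool) -> None:
        self._files = files
        self._native = native
        self.calls: List[str] = []  # records which public API was used

    def supports(self, capability: str) -> bool:
        return self._native and capability == GLOB

    def glob(self, pattern: str) -> Iterator[str]:
        self.calls.append("glob")
        # Pretend the backend matched natively (faked as a suffix check).
        return (f for f in self._files if f.endswith(".csv"))

    def list_files(self, path: str, *, recursive: bool = False) -> Iterator[str]:
        self.calls.append("list_files")
        return iter(self._files)


def glob_files(store: FakeStore, pattern: str,
               matches: Callable[[str], bool]) -> Iterator[str]:
    if store.supports(GLOB):
        # Native path: delegate entirely to the store.
        yield from store.glob(pattern)
        return
    # Fallback path: list, then filter client-side.
    for path in store.list_files("", recursive=True):
        if matches(path):
            yield path


files = ["a.csv", "b.txt", "sub/c.csv"]
native = FakeStore(files, native=True)
portable = FakeStore(files, native=False)
is_csv = lambda p: p.endswith(".csv")
native_hits = list(glob_files(native, "**/*.csv", is_csv))
portable_hits = list(glob_files(portable, "**/*.csv", is_csv))
```

Both stores return the same files, but the recorded calls differ: the native store is asked to glob, while the portable one is only ever listed.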
GLOB-012: Prefix Extraction
Invariant: The prefix is the longest sequence of leading path segments
that contain no wildcard characters (*, ?, [). For data/2024/*.csv
the prefix is "data/2024". For **/*.csv the prefix is "".
Rationale: Minimizes the listing scope — the backend only returns files
under the prefix directory, reducing network traffic and memory usage.
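One way to implement the prefix rule is to walk the leading segments and stop at the first one containing a wildcard. The helper extract_prefix is a hypothetical name, and the sketch makes one assumption the text does not state: the final segment is treated as a file name and never included in the directory prefix, even when it has no wildcard.

```python
WILDCARD_CHARS = frozenset("*?[")


def extract_prefix(pattern: str) -> str:
    """Return the longest wildcard-free leading directory of a glob pattern."""
    segments = pattern.split("/")
    prefix = []
    # Assumption: the last segment names files, so it is never part of the prefix.
    for segment in segments[:-1]:
        if any(ch in WILDCARD_CHARS for ch in segment):
            break
        prefix.append(segment)
    return "/".join(prefix)


examples = {p: extract_prefix(p) for p in (
    "data/2024/*.csv",
    "**/*.csv",
    "data/*/2024/*.csv",
    "*.csv",
)}
```

For `data/*/2024/*.csv` the prefix stops at `data`, since everything after the first wildcard segment has to be matched client-side.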
GLOB-013: Recursive Detection
Invariant: The fallback uses recursive=True if the pattern contains
** or if any non-final path segment contains wildcards. Otherwise uses
recursive=False.
Rationale: ** explicitly requests recursive descent. Wildcards in
non-final segments (e.g., */sub/*.csv) require traversing multiple
directory levels.
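The recursion rule reduces to a short predicate. The function name needs_recursive is hypothetical; the logic follows the two conditions stated in the invariant.

```python
WILDCARD_CHARS = frozenset("*?[")


def needs_recursive(pattern: str) -> bool:
    """Decide whether the fallback listing must descend into subdirectories."""
    if "**" in pattern:
        return True  # explicit recursive descent
    segments = pattern.split("/")
    # A wildcard in any non-final segment spans several directory levels.
    return any(
        any(ch in WILDCARD_CHARS for ch in segment)
        for segment in segments[:-1]
    )


flat = needs_recursive("data/2024/*.csv")
deep = needs_recursive("**/*.csv")
nested = needs_recursive("*/sub/*.csv")
```

A pattern whose only wildcard sits in the final segment, such as `data/2024/*.csv`, can be served by a single non-recursive listing of `data/2024`.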
GLOB-014: Pattern Matching
Invariant: Client-side filtering converts the glob pattern to a regex:
- * → [^/]* (any characters except separator)
- **/ → (?:.+/)? (zero or more path segments)
- ** (at end) → .* (match everything)
- ? → [^/] (single non-separator character)
- [abc] → [abc] (character class, passed through)
- [!abc] → [^abc] (negated character class)
- All other characters are regex-escaped.
** must be a complete path segment (**/, /**, or the entire pattern).
Patterns like **error where ** is embedded within a segment raise
ValueError.
The regex is anchored (^...$) and matched against the full store-relative
path of each FileInfo.
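The translation table above can be sketched as a single-pass converter. The function glob_to_regex is a hypothetical name, and the handling of degenerate character classes (an unclosed or empty `[`, treated here as a literal) is an assumption of this example rather than documented behavior.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a glob pattern into an anchored regex per the table above."""
    for segment in pattern.split("/"):
        if "**" in segment and segment != "**":
            raise ValueError(f"'**' must be a complete path segment: {pattern!r}")
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if pattern.startswith("**/", i):
            out.append("(?:.+/)?")  # zero or more path segments
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # trailing '**' matches everything
            i += 2
        elif ch == "*":
            out.append("[^/]*")
            i += 1
        elif ch == "?":
            out.append("[^/]")
            i += 1
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end != -1 else ""
            negate = body.startswith("!")
            if negate:
                body = body[1:]
            if not body:
                out.append(re.escape(ch))  # assumption: degenerate class is literal
                i += 1
                continue
            body = body.replace("\\", "\\\\")
            if body.startswith("^"):
                body = "\\" + body  # keep a leading '^' literal
            out.append("[" + ("^" if negate else "") + body + "]")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("^" + "".join(out) + "$")


rx = glob_to_regex("data/**/*.csv")
hits = [p for p in ("data/a.csv", "data/x/y/b.csv", "other/c.csv") if rx.match(p)]
```

Because `**/` compiles to an optional group, `data/**/*.csv` matches files directly under `data` as well as files nested any number of levels below it.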
GLOB-015: No Backend Coupling
Invariant: glob_files operates exclusively through the public Store
API (supports, glob, list_files). It never accesses store._backend
or any backend internals.
GLOB-016: Capability Gating Propagation
Invariant: CapabilityNotSupported raised by Store.glob() or
Store.list_files() propagates immediately. glob_files does not catch
or wrap these errors.
GLOB-017: Empty Results
Invariant: When no files match the pattern, both list_files(pattern=...)
and glob_files yield nothing (empty iterator). This is not an error.