Seekable Read Specification
Overview
Seekable read provides Store.read_seekable(), a method that always
returns a seekable BinaryIO stream, optimized per backend for random access.
It complements Store.read(), which returns the backend's natural stream
and may be non-seekable. This specification follows ADR-0017,
superseding the three-tier extension design in ADR-0016.
SEEK-001: Capability Declaration
Invariant: Capability.SEEKABLE_READ is an enum member. Backends whose
read() always returns a seekable stream declare it in their CapabilitySet.
Postconditions: Local, Memory, S3, S3-PyArrow, SFTP, SQLBlob, and Dafny
declare SEEKABLE_READ. Azure and HTTP do not.
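A minimal sketch of how such a declaration might be modeled. Only Capability.SEEKABLE_READ and the name CapabilitySet come from this specification; the READ member, the frozenset representation, and the helper read_is_seekable() are assumptions made for illustration.

```python
from enum import Enum


class Capability(Enum):
    # Only SEEKABLE_READ is named by the spec; READ is a placeholder member.
    READ = "read"
    SEEKABLE_READ = "seekable_read"


# Assumption: a CapabilitySet behaves like an immutable set of Capability members.
CapabilitySet = frozenset

# A backend whose read() is always seekable declares the capability...
LOCAL_CAPABILITIES = CapabilitySet({Capability.READ, Capability.SEEKABLE_READ})
# ...while a forward-only backend such as Azure leaves it out.
AZURE_CAPABILITIES = CapabilitySet({Capability.READ})


def read_is_seekable(capabilities) -> bool:
    """Return True when the backend promises a seekable stream from read()."""
    return Capability.SEEKABLE_READ in capabilities
```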
SEEK-002: Store.read_seekable() Contract
Invariant: Store.read_seekable(path) always returns a seekable
BinaryIO stream positioned at byte 0.
Postconditions:
- The returned stream satisfies stream.seekable() == True.
- The stream content matches the file at path.
- The caller owns the stream and must close it.
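The postconditions above can be expressed as a small conformance check. This is a sketch, not part of the library: check_seekable_contract() is a hypothetical helper, and an io.BytesIO stands in for the stream a real Store.read_seekable(path) call would return.

```python
import io
from typing import BinaryIO


def check_seekable_contract(stream: BinaryIO, expected: bytes) -> None:
    """Assert the SEEK-002 postconditions, then close the stream."""
    try:
        assert stream.seekable() is True      # stream is seekable
        assert stream.tell() == 0             # positioned at byte 0
        assert stream.read() == expected      # content matches the file
        stream.seek(0)                        # random access works after a full read
        assert stream.read(1) == expected[:1]
    finally:
        stream.close()                        # the caller owns the stream


payload = b"hello seekable world"
stream = io.BytesIO(payload)  # stand-in for store.read_seekable(path)
check_seekable_contract(stream, payload)
```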
SEEK-003: Backend.read_seekable() Default
Invariant: Backend.read_seekable(path) has a concrete default
implementation that calls self.read(path). If the returned stream is
seekable, it returns that stream directly (zero-copy passthrough).
Otherwise, it spools the content into a
SpooledTemporaryFile(max_size=8_388_608) and returns the spool
positioned at byte 0.
Postconditions: All backends support read_seekable() without
overriding. Backends MAY override for optimization.
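The default behavior can be sketched roughly as follows. This is an illustration under assumptions, not the library's code: read_seekable_default() takes the stream that self.read(path) would return, and ForwardOnly is a hypothetical non-seekable stream used only to exercise the spool branch.

```python
import io
import shutil
import tempfile
from typing import BinaryIO

SPOOL_MAX_SIZE = 8_388_608  # 8 MiB, the threshold named in the spec


def read_seekable_default(stream: BinaryIO) -> BinaryIO:
    """Return `stream` itself if it is seekable, else a spooled seekable copy."""
    if stream.seekable():
        return stream  # zero-copy passthrough: same instance
    spool = tempfile.SpooledTemporaryFile(max_size=SPOOL_MAX_SIZE)
    try:
        shutil.copyfileobj(stream, spool)  # copy the whole source into the spool
        spool.seek(0)                      # hand it back positioned at byte 0
    except BaseException:
        spool.close()                      # do not leak the spool on failure
        raise
    finally:
        stream.close()                     # the original stream is no longer needed
    return spool


class ForwardOnly(io.RawIOBase):
    """Hypothetical non-seekable stream, for demonstration only."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._src = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        return self._src.readinto(buffer)


source = ForwardOnly(b"abcdef")
result = read_seekable_default(source)
result.seek(3)
tail = result.read()  # random access now works on the spooled copy
```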
SEEK-004: Passthrough for Seekable Backends
Invariant: When self.read(path) returns a seekable stream,
read_seekable() returns the same stream instance with no copying.
Postconditions: Local, Memory, S3, S3-PyArrow, SFTP, SQLBlob, and Dafny
return the read() stream directly, adding no copy or buffering overhead.
SEEK-005: Spool Fallback for Non-Seekable Backends
Invariant: When self.read(path) returns a non-seekable stream,
the default read_seekable() spools it into a SpooledTemporaryFile.
Content up to 8 MiB (8,388,608 bytes) stays in RAM; anything beyond that
spills to a temporary file on disk.
Postconditions: The returned stream is seekable and positioned at byte 0.
The original stream is closed after spooling.
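The spill behavior can be observed with a deliberately tiny threshold. This sketch uses only the standard library; the 16-byte max_size stands in for the spec's 8_388_608, and _rolled is a private CPython attribute inspected here purely for illustration, so it should not be relied on in production code.

```python
import tempfile

# A tiny threshold stands in for 8_388_608 bytes so the effect is easy to see.
spool = tempfile.SpooledTemporaryFile(max_size=16)

spool.write(b"0123456789")            # 10 bytes: still under the threshold
in_memory_before = not spool._rolled  # private attribute: True once spilled to disk

spool.write(b"0123456789")            # 20 bytes total: crosses the threshold
on_disk_after = spool._rolled

spool.seek(0)                         # content survives the rollover intact
content = spool.read()
spool.close()
```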
SEEK-006: Azure Range Reader Override
Invariant: AzureBackend overrides read_seekable() to return an
_AzureRangeReader — a seekable io.RawIOBase where each readinto()
issues a single HTTP Range request via download_blob(offset=, length=).
Postconditions:
- No data is downloaded until read() is called.
- seek() and tell() update position without I/O.
- Sequential read() calls issue one HTTP request per readinto() call.
- The stream is wrapped in _ErrorMappingStream (no BufferedReader — its
seek-invalidation would defeat range reads by turning each
PythonFile.read_at() into a separate HTTP request).
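The range-reader idea can be sketched without any Azure dependency. This is not the actual _AzureRangeReader: RangeReader and its fetch_range(offset, length) callable are hypothetical stand-ins for the download_blob(offset=, length=) call, and the blob size is assumed to be known up front.

```python
import io
from typing import Callable


class RangeReader(io.RawIOBase):
    """Seekable raw stream that fetches bytes lazily, one range per readinto()."""

    def __init__(self, size: int, fetch_range: Callable[[int, int], bytes]) -> None:
        super().__init__()
        self._size = size
        self._fetch = fetch_range  # hypothetical stand-in for a ranged download
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Pure bookkeeping: no I/O happens here.
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer) -> int:
        length = min(len(buffer), self._size - self._pos)
        if length <= 0:
            return 0  # at or past the end of the blob
        data = self._fetch(self._pos, length)  # exactly one range request
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)


blob = bytes(range(256)) * 4  # 1024 bytes of fake blob content
requests = []


def fetch_range(offset: int, length: int) -> bytes:
    requests.append((offset, length))  # record each simulated HTTP request
    return blob[offset:offset + length]


reader = RangeReader(len(blob), fetch_range)
reader.seek(1000)            # no request is issued by seeking
requests_after_seek = list(requests)
chunk = reader.read(10)      # one request for bytes 1000..1009
```

Because the reader is unbuffered, the caller's read sizes map directly onto request sizes, which is the property the last postcondition is protecting.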
SEEK-007: Azure read() Unchanged
Invariant: AzureBackend.read() continues to return the chunked
forward-only _AzureBinaryIO stream. It is NOT replaced by the range
reader.
Postconditions: Sequential callers get efficient chunked streaming.
read() and read_seekable() serve different I/O patterns.
SEEK-008: Arrow Integration
Invariant: StoreFileSystemHandler.open_input_file() calls
store.read_seekable() instead of store.read() for the Tier 3
streaming path (files larger than materialization_threshold).
Postconditions: PyArrow gets a seekable handle optimized for sparse
random access. Column pruning works on Azure without full-file
materialization.
SEEK-009: ProxyStore Forwarding
Invariant: ProxyStore.read_seekable() delegates to
self._inner.read_seekable(). ObservedStore fires hooks.
CachedStore inherits the default.
Postconditions: All proxy layers forward correctly.
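The forwarding pattern might look like the following sketch. The class names mirror the specification, but the constructor signatures, the on_read_seekable hook, and FakeStore are assumptions introduced for illustration and do not reflect the library's real API.

```python
import io
from typing import BinaryIO, Callable, Dict, List


class FakeStore:
    """Hypothetical in-memory store used as the innermost layer."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def read_seekable(self, path: str) -> BinaryIO:
        return io.BytesIO(self._files[path])


class ProxyStore:
    """Delegates read_seekable() to the wrapped store."""

    def __init__(self, inner) -> None:
        self._inner = inner

    def read_seekable(self, path: str) -> BinaryIO:
        return self._inner.read_seekable(path)


class ObservedStore(ProxyStore):
    """Fires a hook before forwarding; the hook signature is an assumption."""

    def __init__(self, inner, on_read_seekable: Callable[[str], None]) -> None:
        super().__init__(inner)
        self._on_read_seekable = on_read_seekable

    def read_seekable(self, path: str) -> BinaryIO:
        self._on_read_seekable(path)
        return self._inner.read_seekable(path)


events: List[str] = []
store = ObservedStore(ProxyStore(FakeStore({"a.bin": b"123"})), events.append)
data = store.read_seekable("a.bin").read()
```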
SEEK-010: Error Propagation
Invariant: Backend errors (e.g. NotFound) propagate through
read_seekable() as Store errors.
Postconditions: No error remapping beyond the standard
_ErrorMappingStream behavior.
SEEK-011: Stream Closure After Spooling
Invariant: When the default read_seekable() spools, the original
stream from read() is closed after the content is fully copied.
Postconditions: The caller owns only the returned spool.
SEEK-012: fileno() Limitation
Invariant: When read_seekable() returns a SpooledTemporaryFile
(non-seekable backend, no override), fileno() may raise when content
is still in memory (Python < 3.12).
Postconditions: Documented in the method docstring.