ADR-0017: Seekable Read on Store API
Status
Accepted (supersedes ADR-0016)
Context
ADR-0016 placed seekable read handling in ext.seekable.seekable_read(),
following the three-tier pattern (capability + passthrough + extension
fallback). This worked for the initial use case: user code that needs a
seekable stream and doesn't care how it gets one.
However, the Azure PyArrow optimization work (ID-102) revealed a gap.
Azure's read() returns a forward-only chunk iterator — efficient for
sequential reads but unusable for PyArrow's PythonFile.read_at() which
needs seek() + read() for Parquet column pruning. The range-reader
approach (download_blob(offset=, length=)) is ideal for random access
but catastrophic for sequential reads (~1,280 HTTP requests for a 10 MB
file vs ~1-2 with chunked streaming).
This creates a tension that ext.seekable cannot resolve:
- Sequential callers need the current chunked read(): efficient, forward-only, minimal HTTP requests.
- Analytical callers (PyArrow, Dagster IO) need a seekable handle optimized for sparse random access, where each seek + read maps to one HTTP Range request.
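The request counts quoted above can be reproduced with simple arithmetic. This is a rough sketch under stated assumptions: the ADR does not say how large each range read is, but 8 KiB per read reproduces the ~1,280 figure for a 10 MiB file, and the 8 MiB streaming chunk size is purely illustrative.

```python
import math

FILE_SIZE = 10 * 1024 * 1024         # 10 MiB file, as in the example above
RANGE_READ_SIZE = 8 * 1024           # assumption: 8 KiB per range request
STREAM_CHUNK_SIZE = 8 * 1024 * 1024  # assumption: 8 MiB per streamed chunk


def requests_needed(total_bytes: int, bytes_per_request: int) -> int:
    """Number of HTTP requests needed to read total_bytes sequentially."""
    return math.ceil(total_bytes / bytes_per_request)


# Sequential scan through a range reader: one request per small read.
range_requests = requests_needed(FILE_SIZE, RANGE_READ_SIZE)

# Sequential scan through chunked streaming: one request per large chunk.
chunked_requests = requests_needed(FILE_SIZE, STREAM_CHUNK_SIZE)

print(range_requests, chunked_requests)
```

Under these assumptions the same sequential scan costs hundreds of times more requests through the range reader, which is why neither access pattern can serve both kinds of caller.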
ext.seekable.seekable_read() handles non-seekable backends by spooling
the entire file into a SpooledTemporaryFile. This gives seekability
but defeats the main benefit of range reads: downloading only the bytes
you need.
The extension approach also cannot serve consuming abstractions like
pyarrow.fs.FileSystem or Dagster IOManager, which control the read
path and won't call an extension function.
Decision
Add read_seekable() to Backend and Store as a concrete (non-abstract)
method alongside the existing read().
Backend.read_seekable(path) -> BinaryIO
Default implementation: delegates to read(). If the returned stream is
already seekable, returns it directly. Otherwise, spools into a
SpooledTemporaryFile (same logic as the removed ext.seekable).
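The default behaviour described above can be sketched roughly as follows. This is not the project's actual code: to_seekable, ForwardOnlyStream, and the size constants are hypothetical names and values chosen for illustration, and the real method would live on Backend and call self.read(path).

```python
import io
import tempfile
from typing import BinaryIO, Iterable

SPOOL_MAX_BYTES = 8 * 1024 * 1024  # assumption: stay in memory up to 8 MiB


def to_seekable(stream: BinaryIO, chunk_size: int = 64 * 1024) -> BinaryIO:
    """Return stream as-is if it is seekable, otherwise spool it fully."""
    seekable = getattr(stream, "seekable", None)
    if callable(seekable) and seekable():
        return stream  # already seekable: no copy, no overhead
    spool = tempfile.SpooledTemporaryFile(max_size=SPOOL_MAX_BYTES)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller starts at the beginning
    return spool


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a forward-only chunk iterator."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        super().__init__()
        self._chunks = iter(chunks)
        self._pending = b""

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        # Pull the next non-empty chunk when the current one is used up.
        while not self._pending:
            try:
                self._pending = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


spooled = to_seekable(ForwardOnlyStream([b"abc", b"def"]))
spooled.seek(2)
print(spooled.read(2))
```

The fallback pays for the whole download before the first seek, which is the cost that an overriding backend avoids.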
Backends MAY override to provide an optimized implementation:
- AzureBackend: returns _AzureRangeReader, a seekable io.RawIOBase where each readinto() issues a single HTTP Range request via download_blob(offset=, length=).
- HttpBackend: could implement HTTP Range in the future (not in this change).
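A range reader of the kind described for AzureBackend might look like the sketch below. It is not the actual _AzureRangeReader: RangeReader and fetch_range are hypothetical names, and fetch_range stands in for whatever call retrieves a byte range (the ADR names download_blob(offset=, length=) for Azure). An in-memory blob replaces the network so the example runs on its own.

```python
import io
from typing import Callable, List, Tuple


class RangeReader(io.RawIOBase):
    """Seekable raw stream where each readinto() fetches one byte range."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int) -> None:
        super().__init__()
        self._fetch = fetch_range  # hypothetical: (offset, length) -> bytes
        self._size = size
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Seeking only moves the cursor; it never triggers a request.
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return new_pos

    def readinto(self, buffer) -> int:
        length = min(len(buffer), self._size - self._pos)
        if length <= 0:
            return 0  # at or past end of blob
        data = self._fetch(self._pos, length)  # one range request
        n = len(data)
        buffer[:n] = data
        self._pos += n
        return n


BLOB = bytes(range(256)) * 4
calls: List[Tuple[int, int]] = []


def fake_fetch(offset: int, length: int) -> bytes:
    """Stand-in for a remote range request; records each call."""
    calls.append((offset, length))
    return BLOB[offset:offset + length]


reader = RangeReader(fake_fetch, len(BLOB))
reader.seek(100)
chunk = reader.read(16)
reader.seek(-4, io.SEEK_END)
tail = reader.read(4)
print(calls)
```

Wrapping such a reader in io.BufferedReader would coalesce small reads into fewer requests, at the price of fetching bytes the caller may never use; whether that helps depends on the access pattern.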
Store.read_seekable(path) -> BinaryIO
Delegates to backend.read_seekable() with path resolution, capability
checks, and logging — same pattern as Store.read().
Arrow integration
StoreFileSystemHandler.open_input_file() calls store.read_seekable()
instead of store.read() for the Tier 3 path. This gives PyArrow a
seekable handle on all backends. On backends that override
read_seekable(), the handle is optimized for random access and does not
materialize the full file; other non-seekable backends use the spooling
fallback.
Removal of ext.seekable
ext.seekable.seekable_read() is removed (never released — introduced
after v0.19.0 in ID-100). Its functionality is subsumed by
Store.read_seekable(). The SEEKABLE_READ capability shifts meaning:
it now indicates that read_seekable() is zero-overhead (no spooling
needed, the backend natively returns seekable streams).
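A caller that wants to branch on the capability at setup time could do something like the following. This is only a sketch: the Capability flag and plan_read() are hypothetical and do not reflect the library's real capability API.

```python
from enum import Flag, auto


class Capability(Flag):
    """Hypothetical capability flags; only SEEKABLE_READ comes from the ADR."""

    NONE = 0
    SEEKABLE_READ = auto()


def plan_read(capabilities: Capability, needs_random_access: bool) -> str:
    """Pick a read strategy once, before any I/O happens."""
    if not needs_random_access:
        return "read"  # sequential streaming is enough
    if Capability.SEEKABLE_READ in capabilities:
        return "read_seekable: zero-overhead"
    # Without the capability, read_seekable() may fall back to spooling
    # or use a backend-specific override.
    return "read_seekable: may spool or use a backend override"


print(plan_read(Capability.SEEKABLE_READ, needs_random_access=True))
```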
ProxyStore cascade
ProxyStore.read_seekable() delegates to self._inner.read_seekable().
ObservedStore wraps the call with its hooks. CachedStore inherits the default.
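The delegation chain can be illustrated with minimal stand-in classes. Everything here is an assumption made for the sake of a runnable sketch: InMemoryStore, the on_event callback, the event names, and the choice to make ObservedStore a subclass of ProxyStore are not taken from the real code base. CachedStore is left out because the ADR says only that it inherits the default.

```python
import io
from typing import BinaryIO, Callable, Dict, List, Tuple


class InMemoryStore:
    """Hypothetical innermost store backed by a dict of blobs."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    def read_seekable(self, path: str) -> BinaryIO:
        return io.BytesIO(self._blobs[path])


class ProxyStore:
    """Forwards read_seekable() unchanged to the wrapped store."""

    def __init__(self, inner) -> None:
        self._inner = inner

    def read_seekable(self, path: str) -> BinaryIO:
        return self._inner.read_seekable(path)


class ObservedStore(ProxyStore):
    """Emits an event before and after the delegated call."""

    def __init__(self, inner, on_event: Callable[[str, str], None]) -> None:
        super().__init__(inner)
        self._on_event = on_event  # hypothetical hook signature

    def read_seekable(self, path: str) -> BinaryIO:
        self._on_event("read_seekable:start", path)
        try:
            return super().read_seekable(path)
        finally:
            # Runs on success and on failure alike.
            self._on_event("read_seekable:end", path)


events: List[Tuple[str, str]] = []
store = ObservedStore(
    InMemoryStore({"data.bin": b"0123456789"}),
    on_event=lambda name, path: events.append((name, path)),
)
handle = store.read_seekable("data.bin")
handle.seek(5)
print(handle.read(3), events)
```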
Consequences
- Store API grows by one method. This is the main trade-off. The method is justified because "sequential streaming" and "random-access seekable" are fundamentally different I/O patterns that backends serve differently. Compare open_input_file vs open_input_stream in PyArrow's own FileSystem API.
- Enables backend-specific optimization. Azure can return a range reader; HTTP could follow. No backend is forced to implement anything new; the default spooling fallback handles it.
- Consuming abstractions benefit automatically. PyArrow and Dagster get optimal seekable streams without needing to call extension functions or know about backend internals.
- SEEKABLE_READ capability preserved. Meaning shifts from "read() returns seekable" to "read_seekable() is zero-overhead", which is still useful for callers who want to branch at setup time.
- One extension removed (ext.seekable). Net surface area change is roughly neutral: one Store method added, one extension module removed.