PyArrow FileSystem Adapter
The ext.arrow module wraps any Store into a pyarrow.fs.PyFileSystem,
unlocking seamless interop with the PyArrow ecosystem: datasets, Pandas,
Polars, DuckDB, PyIceberg, and Delta Lake all accept pyarrow.fs.FileSystem
objects for I/O.
Installation
This installs pyarrow >= 12.0.0 as an optional dependency. The adapter works
with any backend — no extra configuration needed.
Quick Start
import pyarrow as pa
import pyarrow.parquet as pq
from remote_store import Store
from remote_store.backends import MemoryBackend
from remote_store.ext.arrow import pyarrow_fs
store = Store(backend=MemoryBackend())
fs = pyarrow_fs(store)
# Now use `fs` anywhere PyArrow accepts a filesystem:
table = pa.table({"col": [1, 2, 3]})
pq.write_table(table, "data.parquet", filesystem=fs)
result = pq.read_table("data.parquet", filesystem=fs)
Parquet Round-Trip
import pyarrow as pa
import pyarrow.parquet as pq
from remote_store.ext.arrow import pyarrow_fs
fs = pyarrow_fs(store)
# Write
table = pa.table({"id": [1, 2], "value": ["a", "b"]})
pq.write_table(table, "output.parquet", filesystem=fs)
# Read
result = pq.read_table("output.parquet", filesystem=fs)
Pandas Integration
import pandas as pd
df = pd.DataFrame({"x": [10, 20], "y": ["foo", "bar"]})
df.to_parquet("pandas.parquet", engine="pyarrow", filesystem=fs)
result = pd.read_parquet("pandas.parquet", engine="pyarrow", filesystem=fs)
Dataset Discovery
PyArrow's dataset API discovers partitioned data automatically:
import pyarrow.dataset as ds
dataset = ds.dataset("data/", filesystem=fs, format="parquet")
table = dataset.to_table()
Configuration
The adapter accepts two tuning parameters:
| Parameter | Default | Description |
|---|---|---|
| materialization_threshold | 64 MB | Max file size for full-file read in open_input_file. Files above this threshold stream via PythonFile (if seekable) or fall back to materialization with a warning. |
| write_spill_threshold | 64 MB | Max in-memory buffer size for writes. Exceeding this spills to a temporary file on disk. |
fs = pyarrow_fs(
store,
materialization_threshold=128 * 1024 * 1024, # 128 MB
write_spill_threshold=32 * 1024 * 1024, # 32 MB
)
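The spill behavior described for write_spill_threshold is similar in spirit to the standard library's tempfile.SpooledTemporaryFile. The sketch below is only an analogy under that assumption, not the adapter's implementation: the 1 KiB threshold is a hypothetical value chosen so the rollover is easy to observe, and `_rolled` is a private CPython attribute used here purely for inspection.

```python
import tempfile

# Hypothetical, tiny threshold so the rollover to disk is easy to observe.
SPILL_THRESHOLD = 1024  # bytes

with tempfile.SpooledTemporaryFile(max_size=SPILL_THRESHOLD) as buf:
    buf.write(b"x" * 512)                # below the threshold: held in memory
    in_memory_before = not buf._rolled   # private attribute, inspection only
    buf.write(b"x" * 1024)               # total now exceeds the threshold
    spilled_after = buf._rolled          # data has moved to a temporary file
    buf.seek(0)
    payload = buf.read()                 # reads back the same bytes either way

print(in_memory_before, spilled_after, len(payload))
```

A lower threshold caps peak memory per open writer at the cost of disk I/O for large outputs; a higher one keeps more writes entirely in memory.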
Tiered Read Strategy
open_input_file (used by Parquet readers) selects the best approach:
| Condition | Strategy | Memory | Notes |
|---|---|---|---|
| Backend has native PyArrow FS | Tier 1: native open_input_file | ~range size | Zero overhead, C++ range requests |
| File <= threshold | Tier 2: read_bytes() -> BufferReader | Full file | Zero GIL overhead |
| File > threshold, seekable stream | Tier 3: read() -> PythonFile | Streaming | GIL per read call |
| File > threshold, non-seekable | Tier 2 fallback (with warning) | Full file | S3/Azure HTTP streams |
Tier 1 is automatically enabled for backends that expose a native PyArrow
filesystem via unwrap() (currently S3PyArrowBackend). The handler detects
this at construction time and bypasses Python I/O entirely for reads — the full
C++ ReadAt -> HTTP Range request -> I/O coalescing pipeline runs with zero GIL
overhead. This matters for analytical workloads (Parquet column pruning, dataset
scans) where PyArrow issues many small range reads. For sequential byte
streaming, Tier 1 does not provide a speed advantage — the regular S3 backend
is faster for that use case (see Performance).
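The decision table above can be restated as a small function. This is a minimal sketch of the selection logic as the table describes it, not the adapter's actual code: `ReadTier`, `choose_read_tier`, and its parameters are hypothetical names introduced for illustration.

```python
from enum import Enum


class ReadTier(Enum):
    NATIVE = "tier 1: native open_input_file"
    MATERIALIZE = "tier 2: read_bytes() -> BufferReader"
    STREAM = "tier 3: read() -> PythonFile"


def choose_read_tier(
    has_native_fs: bool, file_size: int, threshold: int, seekable: bool
) -> tuple[ReadTier, bool]:
    """Return (tier, warn), following the rows of the table in order."""
    if has_native_fs:
        return ReadTier.NATIVE, False
    if file_size <= threshold:
        return ReadTier.MATERIALIZE, False
    if seekable:
        return ReadTier.STREAM, False
    # Large file on a non-seekable stream: fall back to a full read and warn.
    return ReadTier.MATERIALIZE, True


MB = 1024 * 1024
tier, warn = choose_read_tier(
    has_native_fs=False, file_size=200 * MB, threshold=64 * MB, seekable=False
)
print(tier.name, warn)
```

Note that the native check comes first: when a backend exposes a native PyArrow filesystem, file size and seekability do not affect the choice.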
Thread Safety
The handler holds no shared mutable state. PyArrow's C++ layer may call handler methods from background threads (with the GIL acquired). Thread safety depends on the backend — all built-in backends (Memory, Local, S3, SFTP, Azure) are safe under concurrent calls. If using a custom backend, ensure its methods are thread-safe.
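If a custom backend keeps unsynchronized state, one coarse way to make it safe is to serialize every call behind a lock. The sketch below assumes a hypothetical `CountingBackend` with a `read_bytes(path)` method (the signature is an assumption) and a hypothetical `LockedBackend` wrapper; neither is part of the library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingBackend:
    """Hypothetical custom backend with unsynchronized shared state."""

    def __init__(self):
        self.reads = 0

    def read_bytes(self, path):
        self.reads += 1  # read-modify-write: not atomic across threads
        return b"payload for " + path.encode()


class LockedBackend:
    """Serializes every method call on the wrapped backend with one lock."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.RLock()

    def __getattr__(self, name):
        # Only reached for names not defined on the wrapper itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def locked(*args, **kwargs):
            with self._lock:
                return attr(*args, **kwargs)

        return locked


backend = LockedBackend(CountingBackend())
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(backend.read_bytes, ["a.parquet"] * 1000))

print(backend.reads, len(results))
```

A single lock trades concurrency for safety. A backend that already uses thread-safe clients usually only needs to protect its own mutable fields.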
Limitations
- No append support. open_append_stream raises NotImplementedError. Most backends (S3, Azure) lack native append semantics.
- Root deletion blocked. delete_dir(""), delete_dir_contents(""), and delete_root_dir_contents() raise NotImplementedError as a safety guard.
- Write buffering. Writes are buffered until close(), so data is not visible in the Store during the write. This is inherent to the Store's single-shot write() API.
- Tier 1 limited to S3-PyArrow. S3PyArrowBackend exposes a native PyArrow filesystem for zero-copy Tier 1 reads. Other backends provide native_path() but use Tier 2/3 for data transfer.
- Process exit on Linux. PyArrow's C++ atexit handlers can deadlock during interpreter shutdown when a PyFileSystem is still alive. If your script hangs after completing, explicitly del the PyArrow filesystem and dataset objects before exit, or call os._exit(0) after cleanup.
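The write-buffering limitation can be illustrated with a toy writer that publishes its contents only on close(). This is a sketch under stated assumptions, not the adapter's code: `BufferedStoreWriter` is a hypothetical class, and a plain dict stands in for a Store so the example stays self-contained.

```python
import io


class BufferedStoreWriter(io.BytesIO):
    """Hypothetical writer: collects bytes and publishes them once, on close()."""

    def __init__(self, store: dict, path: str):
        super().__init__()
        self._store = store
        self._path = path

    def close(self):
        if not self.closed:
            # Single-shot write: the whole payload becomes visible at once.
            self._store[self._path] = self.getvalue()
        super().close()


fake_store = {}  # stand-in for a Store
writer = BufferedStoreWriter(fake_store, "out.bin")
writer.write(b"hello ")
visible_mid_write = "out.bin" in fake_store  # nothing published yet
writer.write(b"world")
writer.close()

print(visible_mid_write, fake_store["out.bin"])
```

Readers that poll for the output file therefore see either nothing or the complete object, never a partial write.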
Error Mapping
Store errors are translated to standard Python exceptions:
| Store error | Python exception |
|---|---|
| NotFound | FileNotFoundError |
| AlreadyExists | FileExistsError |
| PermissionDenied | PermissionError |
| InvalidPath | ValueError |
| CapabilityNotSupported | NotImplementedError |
| Other RemoteStoreError | OSError |
No RemoteStoreError leaks to PyArrow callers: every exception is mapped and raised with Python's "raise ... from" chaining, so the original Store error remains available as __cause__ for debugging.
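The translation pattern can be sketched as follows. The exception classes here are local stand-ins that reuse names from the table, and `translate` and `read_or_raise` are hypothetical helpers; the adapter's real mapping code may differ.

```python
class RemoteStoreError(Exception):
    """Stand-in for the library's base error, defined here for the sketch."""


class NotFound(RemoteStoreError):
    pass


class PermissionDenied(RemoteStoreError):
    pass


# Subset of the table above; anything unlisted falls through to OSError.
_ERROR_MAP = {
    NotFound: FileNotFoundError,
    PermissionDenied: PermissionError,
}


def translate(exc: RemoteStoreError) -> Exception:
    """Build the standard Python exception that corresponds to a Store error."""
    for store_type, py_type in _ERROR_MAP.items():
        if isinstance(exc, store_type):
            return py_type(str(exc))
    return OSError(str(exc))


def read_or_raise(path: str) -> bytes:
    """Simulate a failing read and re-raise it as a standard exception."""
    try:
        raise NotFound(f"no such key: {path}")
    except RemoteStoreError as exc:
        # "from exc" keeps the original error reachable via __cause__.
        raise translate(exc) from exc


try:
    read_or_raise("missing.parquet")
except FileNotFoundError as err:
    cause = err.__cause__

print(type(cause).__name__)
```

Callers can catch the standard exception types as usual and inspect __cause__ when they need the Store-level detail.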