RFC-0003: S3-PyArrow Read Path Optimization¶
Status¶
Implemented
Summary¶
Remove the io.BufferedReader wrapper from S3PyArrowBackend.read() and add a
direct read() method to _PyArrowBinaryIO, eliminating two unnecessary memory
copies per chunk on the streaming read path. This brings streaming read()
performance in line with the legacy S3Store that uses PyArrow C++ directly.
Note: read_bytes() is unaffected — it already uses open_input_stream +
bytes(stream.read()) directly, bypassing BufferedReader entirely.
Motivation¶
Benchmark data (legacy/benchmark_s3.py, 15 iterations, 3 warmup, MinIO on
localhost) shows overhead on the streaming read() path:
| Benchmark | Legacy (ms) | Remote (ms) | Ratio | Peak Mem |
|---|---|---|---|---|
streaming_read |
23.9 | 22.8 | 1.05x | 129 KB / 201 KB |
read_large |
8.9 | 10.9 | 0.82x | 978 KB / 1954 KB |
streaming_read exercises Store.read() with 64 KB chunked reads — the code
path this RFC targets. The 1.05x ratio looks close to parity, but the 56% higher
peak memory (201 KB vs 129 KB) reveals the double-copy overhead. The
read_large row uses read_bytes() which bypasses BufferedReader entirely, so
its 0.82x ratio reflects a different bottleneck (not addressed here).
Current chain (2 extra copies per chunk)¶
BufferedReader.read(n)
-> readinto(8 KB internal buffer)
-> _ErrorMappingStream.readinto(b)
-> _PyArrowBinaryIO.readinto(b)
-> data = self._pa.read(len(b)) # copy 1: C++ -> Python bytes
-> b[:n] = data # copy 2: bytes -> BufferedReader buffer
BufferedReader calls readinto() in 8 KB chunks regardless of the requested
read size. Each chunk goes through two Python-level copies before reaching the
caller.
Target chain (1 copy)¶
_ErrorMappingStream.read(n)
-> _PyArrowBinaryIO.read(n)
-> self._pa.read(n) # returns bytes directly from C++
By adding read() to _PyArrowBinaryIO and removing the BufferedReader
wrapper, reads go straight from PyArrow C++ to the caller. PyArrow's
NativeFile.read() already returns bytes, so no additional copy is needed.
Proposal¶
1. Add read() to _PyArrowBinaryIO¶
# src/remote_store/backends/_s3_pyarrow.py, class _PyArrowBinaryIO
def read(self, size: int = -1) -> bytes:
# NativeFile.read() returns bytes; no wrapper needed.
if size is None or size < 0:
return self._pa.read()
return self._pa.read(size)
Keep readinto() for compatibility — anyone wrapping our stream in their own
BufferedReader still works.
2. Remove BufferedReader from S3PyArrowBackend.read()¶
Before:
def read(self, path: str) -> BinaryIO:
with self._pyarrow_errors(path):
pa_file = self._pa_fs.open_input_file(self._pa_path(path))
raw = _PyArrowBinaryIO(pa_file)
return io.BufferedReader(
cast("io.RawIOBase", _ErrorMappingStream(raw, self._classify_error, path))
)
After:
def read(self, path: str) -> BinaryIO:
with self._pyarrow_errors(path):
pa_file = self._pa_fs.open_input_file(self._pa_path(path))
raw = _PyArrowBinaryIO(pa_file)
return cast("BinaryIO", _ErrorMappingStream(raw, self._classify_error, path))
What's preserved¶
- Error mapping via
_ErrorMappingStream(OSError -> RemoteStoreError) - Seekability (PyArrow RandomAccessFile)
- Context manager /
close()lifecycle readinto()still works for callers that need it_ErrorMappingStream.read()already exists and delegates to inner.read()
3. Add chunked readline() to _PyArrowBinaryIO¶
Without BufferedReader, readline() falls back to RawIOBase default, which
calls readinto(1) in a tight loop — one call per byte until \n. This is
pathologically slow for lines of any length.
Add a readline() that reads in blocks (e.g. 8 KB) and scans for \n:
# src/remote_store/backends/_s3_pyarrow.py, class _PyArrowBinaryIO
_READLINE_CHUNK = 8192
def readline(self, size: int = -1) -> bytes:
buf = bytearray()
while size is None or size < 0 or len(buf) < size:
remaining = size - len(buf) if size is not None and size >= 0 else self._READLINE_CHUNK
chunk = self._pa.read(min(remaining, self._READLINE_CHUNK))
if not chunk:
break
idx = chunk.find(b"\n")
if idx >= 0:
buf.extend(chunk[: idx + 1])
# seek back past the unused portion
self._pa.seek(-(len(chunk) - idx - 1), 1)
break
buf.extend(chunk)
return bytes(buf)
Current callers don't use line-oriented reads on S3 binary streams, but removing
BufferedReader without this would leave a pathological performance trap for
anyone wrapping the stream in io.TextIOWrapper.
What changes¶
- No
BufferedReadermeans callers get exactly the bytes they asked for in a singleread()call, without 8 KB chunking overhead. readline()uses a new chunked implementation instead ofBufferedReader's internal buffer.
Alternatives Considered¶
-
Increase
BufferedReaderbuffer size. Setting a larger buffer (e.g. 64 KB) reduces per-call overhead but still has two copies per chunk. Rejected: doesn't eliminate the fundamental issue. -
Override
readinto()to avoid the intermediate bytes object. PyArrow'sread_buffer()returns apyarrow.Bufferwhich supports the buffer protocol, butNativeFile.read()already returnsbytes. We'd needread_buffer()+ memoryview copy, which is roughly equivalent. Rejected: more complexity for the same result. -
Use
open_input_streaminstead ofopen_input_file. Loses seekability. Rejected:read()contract requires seekable streams (position tracking,seek()/tell()).
Impact¶
- Public API: No changes to
__all__or Store interface. - Backwards compatibility: Non-breaking. The return type is still
BinaryIO. Callers using.read(),.seek(),.tell(),.close()are unaffected. Callers relying onisinstance(stream, io.BufferedReader)would break, but that's not part of the contract (spec SIO-001 only requiresBinaryIO). - Performance: Improvement applies to streaming
read()calls only (thestreaming_readbenchmark path).read_bytes()is unaffected — it already bypassesBufferedReader. Expect reduced per-chunk overhead and lower peak memory for chunked streaming reads. - Testing: Existing tests cover correctness:
tests/test_stream.py—_ErrorMappingStreamread/readinto pathstests/backends/test_conformance.py— chunked reads, position trackingtests/backends/test_s3_pyarrow.py— S3-PyArrow specific teststests/test_transfer.py— streaming transfers
Additionally, add a structural assertion that S3PyArrowBackend.read()
returns a stream that is NOT wrapped in BufferedReader, to prevent
regression.
- Scope: Only S3PyArrowBackend. Other backends (S3, SFTP, Azure, Local,
Memory) are untouched.
Open Questions¶
- Should other backends also drop
BufferedReader? The SFTP backend also wraps inBufferedReader, but the situation is different: SFTP's underlying stream is Paramiko'sSFTPFile, which has its ownprefetch()mechanism and internal buffering.BufferedReadermay serve a useful purpose there by smoothing bursty network reads. Removing it could interact with Paramiko's chunking behavior differently than removing it from PyArrow's in-memoryNativeFile.read(). Out of scope — requires separate profiling and analysis.
References¶
- Benchmark:
legacy/benchmark_s3.py(CPU/memory instrumentation added) - Source:
src/remote_store/backends/_s3_pyarrow.py(lines 36-64, 289-293) - Error mapping stream:
src/remote_store/_stream.py - Related spec:
sdd/specs/011-s3-pyarrow-backend.md - Related backlog: ID-020 (benchmark improvements)