S3-PyArrow Hybrid Backend Specification

Overview

S3PyArrowBackend implements the Backend ABC for S3-compatible object storage using a hybrid approach: PyArrow's C++ S3 filesystem for data-path operations (read, write, copy) and s3fs for control-path operations (listing, metadata, deletion). This combines PyArrow's high-throughput C++ I/O with s3fs's mature listing and metadata APIs.

This is a drop-in alternative to S3Backend with the same constructor signature. Users who need maximum read/write throughput for large files should prefer this backend.

Dependencies: s3fs, pyarrow (optional extra: pip install "remote-store[s3-pyarrow]")

This specification is a delta over spec 008 (S3 Backend). Any invariant not restated here follows the paired S3-NNN ID verbatim, substituting the backend name "s3-pyarrow" for "s3". Only PyArrow-specific deltas (dual-library architecture, credential translation, the PyArrow read path, dual unwrap(), and the dual error-mapping context managers) carry a full body below. Tests reference both IDs per backend via per-parameter pytest.mark.spec(...) marks (see tests/backends/s3/test_shared.py).

Paired IDs (delta map)

| This spec (S3PA-NNN) | Inherited from 008 (S3-NNN) |
| --- | --- |
| S3PA-001 Constructor Parameters | S3-001 (same signature) |
| S3PA-004 Lazy Connection | S3-004 |
| S3PA-005 Construction Validation | S3-005 |
| S3PA-008 Virtual Folder Semantics | S3-006 |
| S3PA-009 Folder Detection | S3-007 |
| S3PA-010 Write Does Not Create Markers | S3-008 |
| S3PA-011 Folder Lifecycle | S3-009 |
| S3PA-013 Write Via PyArrow | S3-010 (atomic write) |
| S3PA-014 Copy Via PyArrow | S3-014 (server-side copy) |
| S3PA-015 Move Via Hybrid | S3-013 (copy + delete) |
| S3PA-016 Delete Via s3fs | S3-011, S3-012 |
| S3PA-017 Listing Via s3fs | (implicit; s3fs control path) |
| S3PA-018 Dual Error Context Managers | S3-015, S3-016, S3-017 |
| S3PA-019 No Native Exception Leakage | S3-018 |
| S3PA-020 close() | S3-019 |
| S3PA-022 Client Options Passthrough | S3-021 |
| S3PA-023 Endpoint URL Normalization | S3-025 |
| S3PA-024 Default Credential Chain | S3-022 |
| S3PA-026 config_kwargs + RetryPolicy | S3-026 |
| S3PA-027 Listings cache defaults off | S3-027 |

Full-body deltas (unique to S3-PyArrow): S3PA-002, S3PA-003, S3PA-006, S3PA-007, S3PA-012, S3PA-021.


Construction

S3PA-001: Constructor Parameters

See S3-001. Same signature; the class name is S3PyArrowBackend. Constructor arguments are translated to each library's conventions internally (S3PA-007). Default credential chain follows S3-022.

S3PA-002: Backend Name

Invariant: name property returns "s3-pyarrow".

S3PA-003: Capability Declaration

Invariant: S3PyArrowBackend declares capabilities: READ, WRITE, DELETE, LIST, MOVE, COPY, ATOMIC_WRITE, METADATA, GLOB, WRITE_RESULT_NATIVE. GLOB is implemented natively via prefix-optimized listing (see 018-glob.md GLOB-019).

Delta vs S3-003: USER_METADATA is NOT declared — PyArrow's open_output_stream() does not support per-object user metadata. WRITE_RESULT_NATIVE IS declared: after upload, write() performs a head_object(ChecksumMode="ENABLED") round-trip via s3fs to populate etag, digest, and last_modified.

Rationale: Same as S3-003 for the declared capabilities. The ATOMIC_MOVE capability is not declared (move = copy+delete, so partial failure is observable — see S3PA-015).
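
As an illustration of the declaration above, the capability set can be sketched with a flag enum. The flag names come from this spec; the `Capability` type, the `S3_PYARROW_CAPABILITIES` constant, and the `supports` helper are hypothetical and are not taken from the library.

```python
from enum import Flag, auto


class Capability(Flag):
    """Hypothetical capability flags; names follow the spec, the type is assumed."""

    READ = auto()
    WRITE = auto()
    DELETE = auto()
    LIST = auto()
    MOVE = auto()
    COPY = auto()
    ATOMIC_WRITE = auto()
    ATOMIC_MOVE = auto()
    METADATA = auto()
    USER_METADATA = auto()
    GLOB = auto()
    WRITE_RESULT_NATIVE = auto()


# Declared set per S3PA-003. USER_METADATA and ATOMIC_MOVE are deliberately absent.
S3_PYARROW_CAPABILITIES = (
    Capability.READ
    | Capability.WRITE
    | Capability.DELETE
    | Capability.LIST
    | Capability.MOVE
    | Capability.COPY
    | Capability.ATOMIC_WRITE
    | Capability.METADATA
    | Capability.GLOB
    | Capability.WRITE_RESULT_NATIVE
)


def supports(declared: Capability, wanted: Capability) -> bool:
    """Return True when every flag in `wanted` is present in `declared`."""
    return (declared & wanted) == wanted
```

A caller that depends on per-object user metadata could check `supports(S3_PYARROW_CAPABILITIES, Capability.USER_METADATA)` and fall back to S3Backend when it is false.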

S3PA-004: Lazy Connection

See S3-004. Applies to both the PyArrow and s3fs filesystem instances: each is created lazily on first use.
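
A minimal sketch of lazy creation for the two filesystem instances, assuming each one is built by a zero-argument factory. The `LazyFilesystems` class and its attribute names are hypothetical; only the idea that each instance is created on first use comes from this spec.

```python
from typing import Any, Callable, Optional


class LazyFilesystems:
    """Hypothetical holder that defers building both filesystem clients."""

    def __init__(
        self,
        make_pyarrow_fs: Callable[[], Any],
        make_s3fs: Callable[[], Any],
    ) -> None:
        # Only the factories are stored; no connection is made here.
        self._make_pyarrow_fs = make_pyarrow_fs
        self._make_s3fs = make_s3fs
        self._pa_fs: Optional[Any] = None
        self._s3_fs: Optional[Any] = None

    @property
    def pa_fs(self) -> Any:
        # Built on first access, then reused.
        if self._pa_fs is None:
            self._pa_fs = self._make_pyarrow_fs()
        return self._pa_fs

    @property
    def s3_fs(self) -> Any:
        # Independent of pa_fs: using one library does not create the other.
        if self._s3_fs is None:
            self._s3_fs = self._make_s3fs()
        return self._s3_fs
```

Because the two properties are independent, a workload that only lists objects would never construct the PyArrow filesystem in this sketch.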

S3PA-005: Construction Validation

See S3-005.


Library Mapping

S3PA-006: Dual-Library Architecture

Invariant: Operations are split between two libraries based on their strengths:

| PyArrow (C++ data path) | s3fs (control path) |
| --- | --- |
| read, read_bytes | exists, is_file, is_folder |
| write, write_atomic | list_files, list_folders |
| copy | get_file_info, get_folder_info |
| | delete, delete_folder |

move spans both libraries: s3fs checks + pyarrow copy + s3fs delete.

Rationale: PyArrow's C++ S3 implementation offers superior throughput for bulk data transfer. s3fs (built on aiobotocore) has more mature and flexible listing, metadata, and deletion APIs.

S3PA-007: Credential Translation

Invariant: Constructor credentials are translated per library:

  • PyArrow: access_key, secret_key, region, endpoint_override, scheme
  • s3fs: key, secret, client_kwargs.region_name, endpoint_url

Postconditions: Both libraries authenticate with the same credentials to the same endpoint.
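
The translation can be sketched as a pure function that builds one keyword dictionary per library. The target keyword names are the ones listed above; the input parameter names and the `translate_credentials` function itself are hypothetical (the real constructor signature is defined by S3-001). The sketch also assumes the endpoint URL already carries a scheme, for example after the normalization in S3PA-023.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlparse


def translate_credentials(
    key: Optional[str] = None,
    secret: Optional[str] = None,
    region_name: Optional[str] = None,
    endpoint_url: Optional[str] = None,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (pyarrow_kwargs, s3fs_kwargs) for the same credentials and endpoint."""
    pyarrow_kwargs: Dict[str, Any] = {}
    s3fs_kwargs: Dict[str, Any] = {}

    if key is not None:
        pyarrow_kwargs["access_key"] = key
        s3fs_kwargs["key"] = key
    if secret is not None:
        pyarrow_kwargs["secret_key"] = secret
        s3fs_kwargs["secret"] = secret
    if region_name is not None:
        pyarrow_kwargs["region"] = region_name
        s3fs_kwargs["client_kwargs"] = {"region_name": region_name}
    if endpoint_url is not None:
        parsed = urlparse(endpoint_url)
        # PyArrow takes host[:port] and the scheme as separate arguments,
        # while s3fs takes the full URL.
        pyarrow_kwargs["endpoint_override"] = parsed.netloc
        pyarrow_kwargs["scheme"] = parsed.scheme
        s3fs_kwargs["endpoint_url"] = endpoint_url

    return pyarrow_kwargs, s3fs_kwargs
```

Omitted arguments produce no keys at all, which leaves each library free to fall back to its own default credential chain.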


S3 Object Model

S3PA-008: Virtual Folder Semantics

See S3-006.

S3PA-009: Folder Detection

See S3-007.

S3PA-010: Write Does Not Create Folder Markers

See S3-008.

S3PA-011: Folder Lifecycle Tied to Contents

See S3-009.


Operations

S3PA-012: Read Via PyArrow

Invariant: read() uses open_input_file() (seekable RandomAccessFile) and returns the stream wrapped in _ErrorMappingStream without BufferedReader. read_bytes() uses open_input_stream() and reads all bytes directly. readline() uses a chunked scan (_READLINE_CHUNK-sized reads) with seek-back for over-read bytes, requiring a seekable stream from open_input_file.

Rationale: PyArrow's C++ I/O path provides higher throughput than s3fs for large files. Removing the BufferedReader eliminates a double-copy per chunk on the streaming read path (RFC-0003). The chunked readline() avoids the pathological byte-at-a-time fallback from RawIOBase.
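
The chunked readline() with seek-back can be sketched as follows. This is an illustration of the technique on any seekable binary stream, not the backend's code: the `chunked_readline` function is hypothetical, and the value given to `_READLINE_CHUNK` is an assumption (the spec names the constant but not its size).

```python
import io

_READLINE_CHUNK = 8192  # assumed value; the spec does not state the size


def chunked_readline(stream, chunk_size: int = _READLINE_CHUNK) -> bytes:
    """Read one line from a seekable binary stream using chunked reads."""
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # EOF: return whatever was collected so far
        newline_at = chunk.find(b"\n")
        if newline_at == -1:
            parts.append(chunk)  # no newline yet, keep scanning
            continue
        line_end = newline_at + 1
        parts.append(chunk[:line_end])
        overread = len(chunk) - line_end
        if overread:
            # Seek back so the bytes after the newline are read again next time.
            stream.seek(-overread, io.SEEK_CUR)
        break
    return b"".join(parts)
```

The seek-back step is why a seekable stream is required: without it, the bytes read past the newline would be lost to the next call.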

Note: Unlike other backends, which return io.BufferedReader, S3-PyArrow returns a raw _ErrorMappingStream (a RawIOBase subclass). Callers that need io.TextIOWrapper(stream) must therefore wrap the stream in io.BufferedReader first. The spec (SIO-001 in 008-streaming-io.md) only requires BinaryIO, so this is valid, but callers should not assume BufferedIOBase.
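
A small sketch of the wrapping a caller would need for text decoding. `RawBytesStream` is a hypothetical stand-in for any raw, unbuffered stream and is not the backend's `_ErrorMappingStream`; `as_text` is likewise a hypothetical helper.

```python
import io


class RawBytesStream(io.RawIOBase):
    """Stand-in for a raw binary stream (hypothetical; not the backend's class)."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Copy as many bytes as fit into the caller's buffer.
        chunk = self._data[self._pos:self._pos + len(buffer)]
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def as_text(raw: io.RawIOBase, encoding: str = "utf-8") -> io.TextIOWrapper:
    """Add the buffering layer that io.TextIOWrapper expects, then decode."""
    return io.TextIOWrapper(io.BufferedReader(raw), encoding=encoding)
```

With this helper, `for line in as_text(stream): ...` iterates decoded lines from a raw stream.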

S3PA-013: Write Via PyArrow

See S3-010. write() and write_atomic() use pyarrow.fs.S3FileSystem.open_output_stream() for data transfer; existence checks go through s3fs.

Atomicity on mid-stream content failure: PyArrow's output stream cannot be aborted once opened (it exposes no discard()/abort, unlike s3fs), so the S3-010 abort strategy does not transfer. Instead write_atomic buffers the content in full before opening the output stream — a content-source failure then occurs before any upload begins, leaving no object and satisfying AW-001. Plain write streams directly and is non-atomic (AW-007): a mid-stream content failure may leave a truncated object, the same best-effort behaviour as the local backend's write. This is the one place write and write_atomic diverge on this backend (elsewhere write_atomic delegates to write).
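
The buffer-before-open strategy can be sketched as below. Both functions are hypothetical illustrations: `open_output_stream` here is any zero-argument callable that returns a writable binary stream, standing in for the PyArrow call, and the accepted content types are an assumption.

```python
import io
from typing import BinaryIO, Callable


def buffer_content(content) -> bytes:
    """Materialize bytes, a file-like object, or an iterable of bytes chunks."""
    if isinstance(content, (bytes, bytearray, memoryview)):
        return bytes(content)
    if hasattr(content, "read"):
        return content.read()
    return b"".join(content)


def write_atomic_sketch(open_output_stream: Callable[[], BinaryIO], content) -> int:
    """Buffer all content first, then open the output stream and upload."""
    # Any failure in the content source is raised here, before the
    # output stream exists, so no partial object can be left behind.
    payload = buffer_content(content)
    with open_output_stream() as sink:
        sink.write(payload)
    return len(payload)
```

The cost of this approach is memory: the whole payload is held in memory before the upload starts, which is the trade-off for not needing an abort operation on the output stream.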

S3PA-014: Copy Via PyArrow

See S3-014. Uses pyarrow.fs.S3FileSystem.copy_file() for the server-side copy; existence checks go through s3fs.

S3PA-015: Move Via Hybrid

See S3-013. The copy step uses PyArrow; existence checks and the delete step go through s3fs. Not atomic — if copy succeeds but delete fails, both objects exist.
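
A sketch of the copy-then-delete sequence and its observable partial failure. The `move_sketch` function and its three callables are hypothetical stand-ins for the s3fs existence check, the PyArrow copy, and the s3fs delete. The exact checks are defined by S3-013 and are not reproduced here; this sketch only checks that the source exists.

```python
from typing import Callable


def move_sketch(
    src: str,
    dst: str,
    exists: Callable[[str], bool],
    copy_file: Callable[[str, str], None],
    delete: Callable[[str], None],
) -> None:
    """Move as copy + delete; not atomic."""
    if not exists(src):  # control path (s3fs in the spec)
        raise FileNotFoundError(src)
    copy_file(src, dst)  # data path (PyArrow in the spec)
    # If this step raises, the copy has already happened:
    # both src and dst exist until the caller cleans up.
    delete(src)  # control path (s3fs in the spec)
```

This is the reason ATOMIC_MOVE is not declared: a failure between the two steps leaves both objects visible.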

S3PA-016: Delete Via s3fs

See S3-011 and S3-012. delete() and delete_folder() use s3fs, identical to S3Backend.

S3PA-017: Listing Via s3fs

Invariant: list_files(), list_folders(), get_file_info(), get_folder_info() use s3fs, identical to S3Backend. get_file_info() returns a FileInfo carrying etag and (when the object has a stored checksum) digest.


Error Mapping

S3PA-018: Dual Error Context Managers

See S3-015, S3-016, and S3-017 for the NotFound / PermissionDenied / BackendUnavailable mappings.

Delta vs S3-015/016/017: Two context managers handle the two libraries:

  • _pyarrow_errors(path): catches OSError / ArrowInvalid from PyArrow operations and maps to remote_store errors.
  • _s3fs_errors(path): catches s3fs/botocore exceptions, same mapping as S3Backend.

Postconditions: backend attribute is set to "s3-pyarrow" on all mapped errors.
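
One way such a context manager could look, sketched for the PyArrow side only. Everything here is hypothetical: the error classes are local stand-ins for the remote_store errors, and classifying by OSError subclass (with all remaining OSErrors treated as BackendUnavailable) is an assumption, not the mapping defined in S3-015 to S3-017. ArrowInvalid handling is omitted so the sketch runs without pyarrow installed.

```python
from contextlib import contextmanager
from typing import Iterator


class RemoteStoreError(Exception):
    """Hypothetical base error carrying the path and backend name."""

    def __init__(self, message: str, *, path: str, backend: str) -> None:
        super().__init__(message)
        self.path = path
        self.backend = backend


class NotFound(RemoteStoreError):
    """Stand-in for the not-found error."""


class PermissionDenied(RemoteStoreError):
    """Stand-in for the permission error."""


class BackendUnavailable(RemoteStoreError):
    """Stand-in for the backend-unavailable error."""


@contextmanager
def pyarrow_errors_sketch(path: str, backend: str = "s3-pyarrow") -> Iterator[None]:
    """Translate OSError subclasses into backend-neutral errors."""
    try:
        yield
    except FileNotFoundError as exc:
        raise NotFound(str(exc), path=path, backend=backend) from exc
    except PermissionError as exc:
        raise PermissionDenied(str(exc), path=path, backend=backend) from exc
    except OSError as exc:
        # Assumption: any other OSError is reported as an unavailable backend.
        raise BackendUnavailable(str(exc), path=path, backend=backend) from exc
```

Usage would wrap each data-path call, for example `with pyarrow_errors_sketch(path): ...`, so that every mapped error carries the same backend name.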

S3PA-019: No Native Exception Leakage

See S3-018. Extended to PyArrow: no PyArrow, s3fs, botocore, or aiobotocore exceptions propagate to callers.


Resource Management

S3PA-020: close()

See S3-019. close() releases both the PyArrow and s3fs filesystem instances. Safe to call multiple times.

S3PA-021: Dual unwrap()

Invariant: unwrap() supports two type hints:

  • unwrap(pyarrow.fs.S3FileSystem) returns the PyArrow filesystem.
  • unwrap(s3fs.S3FileSystem) returns the s3fs filesystem.

Raises: CapabilityNotSupported for any other type hint.

Rationale: Escape hatch for users who need library-specific features. Delta vs S3-020 (which only unwraps to s3fs.S3FileSystem).
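
The type-hint dispatch can be sketched with a lookup keyed by class. The two stand-in classes take the place of pyarrow.fs.S3FileSystem and s3fs.S3FileSystem so the example runs without either library; `DualUnwrapSketch` and the local `CapabilityNotSupported` class are hypothetical.

```python
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


class CapabilityNotSupported(Exception):
    """Hypothetical stand-in for the error named in the spec."""


class PyArrowFsStandIn:
    """Takes the place of pyarrow.fs.S3FileSystem in this sketch."""


class S3fsStandIn:
    """Takes the place of s3fs.S3FileSystem in this sketch."""


class DualUnwrapSketch:
    def __init__(self, pa_fs: Any, s3_fs: Any) -> None:
        # Map each native class to its instance.
        self._native: Dict[type, Any] = {type(pa_fs): pa_fs, type(s3_fs): s3_fs}

    def unwrap(self, type_hint: Type[T]) -> T:
        """Return the native filesystem matching type_hint, or raise."""
        try:
            return self._native[type_hint]
        except KeyError:
            raise CapabilityNotSupported(f"cannot unwrap to {type_hint!r}") from None
```

An exact-class lookup is used here so that a broad hint such as `object` is rejected rather than silently matching one of the two filesystems.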


Configuration

S3PA-022: Client Options Passthrough

See S3-021. Applies to s3fs only — PyArrow configuration is derived from the explicit constructor parameters, not from client_options.

S3PA-023: Endpoint URL Normalization

See S3-025.

S3PA-024: Default Credential Chain

See S3-022.

S3PA-026: config_kwargs is the only Config channel; client_kwargs['config'] is rejected

See S3-026. Applies to the s3fs control path only; the PyArrow data path (_pa_fs) is unaffected.

S3PA-027: Directory-listing cache defaults off

See S3-027. Listing runs through the s3fs control path (S3PA-017), so the off-by-default cache applies identically; the PyArrow data path (_pa_fs) is unaffected.