Getting Started¶

Installation¶

Install from PyPI:

pip install remote-store

Backends that need extra dependencies use extras:

pip install "remote-store[s3]"           # Amazon S3 / MinIO
pip install "remote-store[s3-pyarrow]"   # S3 via PyArrow (analytical workloads)
pip install "remote-store[azure]"        # Azure Blob / ADLS Gen2
pip install "remote-store[graph]"        # Microsoft Graph (OneDrive / SharePoint / Teams), async-only
pip install "remote-store[sftp]"         # SFTP / SSH
pip install "remote-store[sql]"          # SQL Blob (SQLite, PostgreSQL, ...)
pip install "remote-store[sql-query]"    # SQL Query (read-only, SQLAlchemy + PyArrow)

Optional extras for integrations:

pip install "remote-store[requests]"       # HTTP backend with requests (connection pooling)
pip install "remote-store[httpx]"          # HTTP backend with httpx (HTTP/2)
pip install "remote-store[arrow]"          # PyArrow filesystem adapter
pip install "remote-store[otel]"           # OpenTelemetry instrumentation
pip install "remote-store[yaml]"           # YAML config support
pip install "remote-store[pydantic]"       # Pydantic BaseSettings config
pip install "remote-store[toml]"           # TOML config on Python < 3.11

Quick Start¶

The simplest way to use remote-store (examples/getting_started/quickstart.py):

from remote_store import Store
from remote_store.backends import LocalBackend

store = Store(LocalBackend(root="./data"))
store.write_text("hello.txt", "Hello, world!")
print(store.read_text("hello.txt"))  # 'Hello, world!'

For applications that manage multiple backends or switch between environments, use a Registry with declarative config:

from remote_store import Registry, RegistryConfig

config = RegistryConfig.from_dict({
    "backends": {"main": {"type": "local", "options": {"root": "./data"}}},
    "stores": {"data": {"backend": "main", "root_path": ""}},
})

with Registry(config) as registry:
    store = registry.get_store("data")
    store.write_text("hello.txt", "Hello, world!")
    print(store.read_text("hello.txt"))  # 'Hello, world!'

Same code, different environment¶

Switch from local to S3 by changing the config file. The application code stays the same:

Dev (local filesystem):

[backends.main]
type = "local"
options = { root = "./data" }

[stores.reports]
backend = "main"
root_path = "reports"

Production (S3):

[backends.main]
type = "s3"
options = { bucket = "analytics-data" }

[stores.reports]
backend = "main"
root_path = "reports"

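# Identical in both environments:
from remote_store import Registry, RegistryConfig

# report_csv: the CSV text (a str) produced elsewhere in the application
config = RegistryConfig.from_toml("remote-store.toml")
with Registry(config) as registry:
    store = registry.get_store("reports")
    store.write_text("monthly/2026-03.csv", report_csv)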

Configuration supports TOML, YAML, Pydantic BaseSettings, and plain dicts. Credentials are automatically masked in repr()/str() to prevent leakage in logs.
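The masking behavior mentioned above can be pictured with a small stand-alone sketch. It is not remote-store's implementation: the class name `MaskedOptions` and the list of secret-looking key fragments are assumptions made purely for illustration.

```python
from collections.abc import Mapping

# Hypothetical key fragments treated as secrets; the library's own rules may differ.
SENSITIVE_FRAGMENTS = frozenset({"password", "secret", "key", "token"})


class MaskedOptions:
    """Illustrative wrapper that hides secret-looking values in repr()/str()."""

    def __init__(self, options: Mapping[str, object]) -> None:
        self._options = dict(options)

    def __getitem__(self, name: str) -> object:
        # Real values stay available to code that asks for them explicitly.
        return self._options[name]

    def __repr__(self) -> str:
        # str() falls back to __repr__, so both are masked.
        shown = {
            name: "***" if any(f in name.lower() for f in SENSITIVE_FRAGMENTS) else value
            for name, value in self._options.items()
        }
        return f"MaskedOptions({shown!r})"


opts = MaskedOptions({"bucket": "analytics-data", "secret_key": "hunter2"})
print(opts)  # the secret_key value is rendered as '***'
```

The point of masking at the `repr()` level is that accidental logging of a config object cannot leak a credential, while deliberate access still works.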

Who this is for¶

Platform and internal tooling teams: provide one stable storage interface across environments
Data engineering teams: pipelines that run against local storage, S3, Azure, or SFTP depending on the environment
Teams that include citizen developers: analysts and domain experts who write Python shouldn't need to learn cloud SDKs just to read and write files
Anyone tired of writing storage wrappers in every project

What you get¶

One interface, many backends: local filesystem, in-memory, S3, Azure, OneDrive / SharePoint (Microsoft Graph, async), SFTP / SSH, and more
Folder-scoped stores: each Store is rooted at a folder; compose layouts with multiple stores or narrow scope with child()
Swap backends via config: move between environments without changing code
Streaming by default: large files just work without blowing up memory
Atomic writes where supported: safer updates for file-producing workflows
Async support: remote_store.aio provides AsyncStore with coroutine methods; wrap any sync backend with SyncBackendAdapter
Established libraries underneath: s3fs, paramiko, etc. do the real work

Zero runtime dependencies, strict mypy, spec-driven test suite. Optional integrations for PyArrow, OpenTelemetry, and more. See features for the full list.
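To make the "streaming" and "atomic writes" points concrete, here is a minimal sketch of one common way to combine them on a local filesystem: stream into a temporary file, then rename it into place. This is a generic pattern under stated assumptions, not the library's code; `atomic_stream_write` and `CHUNK_SIZE` are hypothetical names.

```python
import io
import os
import tempfile
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # read the source in fixed-size chunks


def atomic_stream_write(dest: str, source: BinaryIO) -> int:
    """Copy `source` to `dest` via a temp file and a rename; return bytes written."""
    directory = os.path.dirname(os.path.abspath(dest))
    # The temp file lives next to the destination so that os.replace() stays
    # on one filesystem, which is what makes the final rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    written = 0
    try:
        with os.fdopen(fd, "wb") as tmp:
            while chunk := source.read(CHUNK_SIZE):
                tmp.write(chunk)
                written += len(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())  # push data to disk before the rename
        os.replace(tmp_path, dest)  # readers see the old file or the new one
    except BaseException:
        # Leave no partial file behind if anything fails before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return written


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "report.bin")
payload = b"x" * 200_000
bytes_written = atomic_stream_write(target, io.BytesIO(payload))
```

Memory use is bounded by `CHUNK_SIZE` regardless of file size, and a crash mid-write leaves the previous version of the destination untouched.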

What it is not¶

Not a query engine (no SQL, no predicate pushdown)
Not a table format (no Delta Lake log, no Iceberg manifests)
Not a filesystem reimplementation (delegates to s3fs, paramiko, pyarrow, etc., the libraries you'd pick anyway)
Not a file-transfer server (no SFTP/FTP/WebDAV service such as SFTPGo)

Supported Backends¶

Backend	Extra	Library	Atomic write	Native glob	`move()` atomic
Local filesystem	(built-in)	stdlib	Yes	Yes	Yes*
Memory (in-process)	(built-in)	—	Yes	—	Yes
Amazon S3 / MinIO	`remote-store[s3]`	`s3fs`	Yes	Yes	— (copy+delete)
S3 (PyArrow)	`remote-store[s3-pyarrow]`	`pyarrow` + `s3fs`	Yes	Yes	— (copy+delete)
Azure Blob / ADLS	`remote-store[azure]`	`azure-storage-file-datalake`	Yes	Yes	HNS: Yes / non-HNS: —
Microsoft Graph (OneDrive / SharePoint / Teams)**	`remote-store[graph]`	`httpx` + `msal`	Yes	—	—***
SFTP / SSH	`remote-store[sftp]`	`paramiko`	Yes	—	—****
HTTP/HTTPS (read-only)	(built-in)	stdlib	—	—	—
SQL Blob (SQLite, PostgreSQL, ...)	`remote-store[sql]`	`sqlalchemy`	Yes	Yes	Yes
SQL Query (read-only)	`remote-store[sql-query]`	`sqlalchemy` + `pyarrow`	—	—	—

* Same-filesystem only; cross-filesystem falls back to copy+delete.
** Async-only: construct via AsyncStore(backend=GraphBackend(...)); there is no sync Store wrapper or config type= string.
*** Native server-side move (PATCH driveItem, identity-preserving); may complete asynchronously, so ATOMIC_MOVE is not declared.
**** Attempts posix_rename (atomic on POSIX-compliant servers) but falls back to copy+delete; atomicity cannot be guaranteed, so ATOMIC_MOVE is not declared.
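The local-filesystem note (atomic rename on one filesystem, copy+delete across filesystems) corresponds to a well-known stdlib pattern. The sketch below illustrates that pattern only; `move_file` is a hypothetical helper, not the backend's actual code.

```python
import errno
import os
import shutil
import tempfile


def move_file(src: str, dst: str) -> bool:
    """Move `src` to `dst`; return True if it was a single atomic rename."""
    try:
        os.replace(src, dst)  # atomic when both paths are on one filesystem
        return True
    except OSError as exc:
        if exc.errno != errno.EXDEV:  # EXDEV: source and target on different devices
            raise
    # Cross-filesystem: copy, then delete. A crash between the two steps can
    # leave both files in place, so this path is not atomic.
    shutil.copy2(src, dst)
    os.unlink(src)
    return False


workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, "old.txt")
new_path = os.path.join(workdir, "new.txt")
with open(old_path, "w") as fh:
    fh.write("hello")
was_atomic = move_file(old_path, new_path)  # same directory, so a plain rename
```

Returning whether the rename was atomic mirrors the idea behind the table's last column: the operation always succeeds or raises, but the atomicity guarantee depends on where the data lives.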

All backends except HTTP and SQL Query support read, write, delete, list, copy, move, and metadata. HTTP is read-only. SQL Query is read-only: it materializes SQL queries to Parquet/CSV/Arrow IPC on read. Glob is natively supported by most backends; for those that lack it, the portable fallback ext.glob.glob_files() works with any LIST-capable backend. Seekable reads are available via Store.read_seekable() on backends that declare SEEKABLE_READ. See features, the capabilities matrix, and the concurrency guide for full details.
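A portable glob fallback of the kind described above can be approximated by listing paths and filtering them client-side. The sketch below shows the idea with plain strings; it does not reproduce the signature or semantics of ext.glob.glob_files(), and the helper names and the exact treatment of `**` are assumptions.

```python
import re
from collections.abc import Iterable, Iterator


def glob_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a glob using *, ? and ** into a regex over '/'-separated paths."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # bare **: anything, including '/'
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # * stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # exactly one non-separator character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def glob_listing(paths: Iterable[str], pattern: str) -> Iterator[str]:
    """Lazily yield the listed paths that match the glob pattern in full."""
    regex = glob_to_regex(pattern)
    return (p for p in paths if regex.fullmatch(p))


# Stand-in for the output of a backend listing call.
listing = ["a.parquet", "raw/b.parquet", "raw/2026/c.parquet", "raw/notes.txt"]
matches = list(glob_listing(listing, "**/*.parquet"))
top_level = list(glob_listing(listing, "*.parquet"))
```

The trade-off against native glob is that every candidate path is listed and transferred before filtering, so the fallback costs a full listing even when few paths match.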

Store API¶

The Store groups its methods into read/write, browsing, management, and utility operations. Highlights:

store.read_text("path/to/file.txt")             # → str
store.write_text("path/to/file.txt", content)   # write string
store.read_bytes("path/to/file.csv")            # → bytes
store.write("path/to/data.bin", binary_stream)  # streaming write

store.list_files("reports/", pattern="*.csv")   # iterate FileInfo
store.glob("**/*.parquet")                      # native glob (capability-gated)
store.exists("path/to/file.txt")                # → bool
store.head("path/to/file.txt")                  # WriteResult snapshot (size, etag, …)

store.move("old.txt", "new.txt")                # move / rename
store.copy("src.txt", "dst.txt")                # copy
store.delete("path/to/file.txt")                # delete

store.child("subfolder")                        # scoped child store
store.supports(Capability.ATOMIC_WRITE)         # runtime capability check (gates a method)
store.supports(Capability.ATOMIC_MOVE)          # quality flag — move() atomicity guarantee
store.resolve("path/to/file.txt")               # resolution plan (introspection)
store.ping()                                    # health check

For the full method list, see the API reference. All write, move, and copy methods accept overwrite=True to replace existing files.

Performance¶

remote-store adds at most a small number of extra protocol round trips per operation over the raw SDK. Where it adds one (an S3 write or delete carries about one extra round trip), the overhead is a few milliseconds on a fast link, but its absolute cost grows with network latency: roughly one RTT per extra round trip. Backends that add none, such as SFTP and Azure, stay near zero at any latency. S3 listing is faster than raw boto3 thanks to s3fs connection caching. The performance guide has the measured numbers in milliseconds, the methodology, and the hatch run bench-* levers for testing the overhead against your own workload. Whether that overhead is acceptable is your call, not something a single number can settle.
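The "one RTT per extra round trip" rule of thumb is simple arithmetic, sketched below. The RTT values are hypothetical and chosen only to show how the same overhead scales with latency; they are not measurements from the performance guide.

```python
def added_latency_ms(extra_round_trips: int, rtt_ms: float) -> float:
    """Rough overhead estimate: one network RTT per extra round trip."""
    return extra_round_trips * rtt_ms


# Hypothetical RTTs for illustration; they are not measured values.
scenarios = {"same-region link": 2.0, "cross-continent link": 120.0}

for name, rtt_ms in scenarios.items():
    one_extra = added_latency_ms(1, rtt_ms)  # e.g. an S3 write or delete
    no_extra = added_latency_ms(0, rtt_ms)   # e.g. SFTP or Azure
    print(f"{name}: +{one_extra:.0f} ms with one extra round trip, "
          f"+{no_extra:.0f} ms with none")
```

For real numbers, measure against your own network and workload rather than relying on an estimate like this.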

Extensions¶

The core library handles storage operations. Extensions add optional capabilities on top: PyArrow integration, observability, caching, or bulk operations. All live in remote_store.ext; import only what you need.

Extension	Extra	What it does
PyArrow adapter	`remote-store[arrow]`	Use any Store as a `pyarrow.fs.FileSystem`; works with Parquet, Pandas, Polars, DuckDB
Parquet datasets	`remote-store[arrow]`	Managed Parquet datasets with manifests, `_SUCCESS` markers, and multi-part layouts
Batch operations	(none)	Bulk delete, copy, and exists with error aggregation
Transfer operations	(none)	Upload, download, and cross-store transfer with progress
Observability hooks	(none)	Callback-based instrumentation for logging, metrics, and tracing
OpenTelemetry bridge	`remote-store[otel]`	Pre-built OTel spans and metrics for Store operations
Caching middleware	(none)	TTL-based read cache with automatic invalidation on mutations
Stream wrappers	(none)	Composable BinaryIO wrappers for progress tracking and checksums
Integrity helpers	(none)	Checksum computation and verification over Store's public API
Write helpers	(none)	Client-side content hashing for write operations, compatible with any backend
Dagster integration	`remote-store[dagster]`	IOManager adapter, config-driven Store resource, and compute log manager for Dagster pipelines

Plus glob helpers, partition helpers, YAML and Pydantic config adapters. See the extensions guide for details.
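As an illustration of the "TTL-based read cache with automatic invalidation on mutations" idea from the table, here is a minimal stand-alone sketch. It is not the caching middleware's API: `TTLReadCache` and its constructor parameters are hypothetical, and a dict stands in for a real store.

```python
import time
from typing import Callable


class TTLReadCache:
    """Illustrative read-through cache: entries expire after `ttl` seconds
    and are dropped as soon as the same path is written."""

    def __init__(
        self,
        read: Callable[[str], bytes],
        write: Callable[[str, bytes], None],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._read = read
        self._write = write
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, bytes]] = {}

    def read_bytes(self, path: str) -> bytes:
        now = self._clock()
        hit = self._entries.get(path)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the backend call
        data = self._read(path)
        self._entries[path] = (now, data)
        return data

    def write_bytes(self, path: str, data: bytes) -> None:
        self._write(path, data)
        self._entries.pop(path, None)  # invalidate on mutation


backing: dict[str, bytes] = {"a.txt": b"v1"}
reads: list[str] = []


def slow_read(path: str) -> bytes:
    reads.append(path)  # record every call that reaches the backend
    return backing[path]


cache = TTLReadCache(slow_read, backing.__setitem__, ttl=60.0)
cache.read_bytes("a.txt")
cache.read_bytes("a.txt")          # served from the cache
cache.write_bytes("a.txt", b"v2")  # drops the cached entry
latest = cache.read_bytes("a.txt")
```

Invalidating on write keeps a single process from reading its own stale data; writes made by other processes are only picked up once the TTL expires.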

Quality & Testing¶

Storage behavior must be predictable and correct. We verify this across multiple dimensions:

Spec-driven development: behavior specifications are the source of truth; tests link directly to them. Prevents feature drift.
Design by Contract: pre/post conditions and invariants catch incorrect usage early. Fails fast on misuse.
Examples and snippets: runnable code in examples/ and notebooks; docs are tested against actual behavior. Keeps examples real.
Extensive unit tests: high coverage across all backends, focused on behavior. Catches integration issues early.
Dependency drift guard: scheduled CI re-resolves extras against latest versions to catch silent transitive upgrades. Surfaces upstream breakage early.
Property-based testing: randomized input generation via Hypothesis surfaces edge cases no hand-written test would find. Finds blind spots.
Formal verification: critical paths are proven correct in Dafny before implementation. Eliminates logic errors.
Mutation testing: gremlins modify the code; if they survive the tests, the tests have gaps. Exposes weak test coverage.
Benchmarks: performance tracked per operation and backend. Provides baseline for optimization.

Learn more¶

To explore remote-store beyond the Quick Start:

Examples: self-contained scripts in examples/ covering core operations (file I/O, streaming, atomic writes, error handling, etc.) and backend-specific setups for S3, Azure, OneDrive, SFTP, HTTP, and SQL.
Notebooks: interactive Jupyter notebooks that walk through common workflows step by step.
Guides: topic-focused walkthroughs in the documentation covering backends, extensions, configuration, and patterns like data lake layouts or health checks.
For AI coding agents: point your agent at llms.txt (index) or llms-full.txt (the full docs). For the public API surface alone — every signature and docstring, backends included — use llms-api.txt.

How it compares¶

There are several excellent Python libraries for file I/O across backends. Here is where remote-store sits:

	fsspec	smart_open	cloudpathlib	obstore	remote-store
API surface	many methods	`open()` only	pathlib-style	~10 methods	full Store API
Backends	many filesystems	S3, GCS, Azure, SFTP	S3, GCS, Azure	S3, GCS, Azure	Local, Memory, S3, Azure, OneDrive, SFTP, HTTP, SQL
SFTP	via sshfs	Yes	—	—	Built-in
Streaming I/O	Yes	Yes	— (downloads)	Bytes-oriented	Yes (BinaryIO)
Atomic writes	—	—	—	—	Yes (capability-gated)
Async	Yes	—	—	Yes (first-class)	Yes (`remote_store.aio`)
Observability	—	—	—	—	`ext.observe` + OTel
Config model	Per-filesystem	URI-based	Per-client	Per-store kwargs	Immutable Registry
Runtime deps	Yes	Minimal	SDK-based	Rust binary	Zero (core)

Feature sets may change as these libraries evolve. Check each project's documentation for the current state.

In short: remote-store is for teams that need more than open() (smart_open) but less than a full filesystem abstraction (fsspec), with streaming, SFTP, atomic writes, observability, and immutable config. Under the hood, it delegates to the same libraries you'd pick anyway (s3fs/boto3, paramiko, Azure SDK, PyArrow).