Choosing a Backend
This guide helps you pick the right remote-store backend for your use case.
Decision tree
- Local filesystem? Use Local. Fast, full capabilities, zero config. Best for development and single-machine workflows.
- In-process testing or caching? Use Memory. No disk I/O, instant setup/teardown, ideal for unit tests and ephemeral caches. Lacks native glob (use the `ext.glob` fallback).
- S3-compatible object store (AWS S3, MinIO, Ceph, etc.)?
  - Need analytical workloads (Parquet column pruning, PyArrow datasets)? Use S3-PyArrow. Native C++ `FileSystem` for PyArrow, GIL-free reads.
  - Otherwise use S3 (fsspec-based). Faster for sequential reads/writes, lighter dependency footprint, same API surface.
- Azure Blob Storage or ADLS Gen2? Use Azure. Supports both flat and HNS (hierarchical namespace) accounts. Connection string, SAS token, or DefaultAzureCredential auth.
- SSH/SFTP server? Use SFTP. Legacy systems, on-prem file servers. Supports password and key-based auth. Lacks native glob (use the `ext.glob` fallback).
- Store blobs in a relational database (SQLite, PostgreSQL, etc.)? Use SQLBlob. Broad capability set: read, write, list, move, copy, glob, and atomic writes. Useful for embedded storage, metadata-heavy workloads, or environments where a database is already available.
- Materialize SQL queries as files (read-only)? Use SQLQuery. Executes a SQL query and exposes the result as Parquet, CSV, or Arrow IPC. Read and metadata only. Useful for ETL pipelines and data exports.
- Read-only HTTP/HTTPS endpoint? Use HTTP. Public data, static file servers, REST APIs. Read and metadata only: no write, list, or delete. Zero required dependencies (stdlib `urllib`); optional `requests` or `httpx` transports for connection pooling.
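The decision tree can also be written down as a small lookup function, which is handy for checking a choice in a script or a test. This is only an illustrative sketch: `Requirements` and `choose_backend` are hypothetical names that are not part of remote-store, and the returned strings are the backend names used in this guide, not necessarily the `type` values accepted in configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical description of a storage use case (not a remote-store type)."""

    target: str                     # "local", "memory", "s3", "azure", "sftp", "sql", "http"
    analytical: bool = False        # Parquet column pruning / PyArrow datasets on S3
    sql_query_export: bool = False  # materialize query results instead of storing blobs


# Targets that map to exactly one backend in the decision tree.
_SIMPLE = {
    "local": "Local",
    "memory": "Memory",
    "azure": "Azure",
    "sftp": "SFTP",
    "http": "HTTP",
}


def choose_backend(req: Requirements) -> str:
    """Return the backend name this guide suggests for the given requirements."""
    if req.target == "s3":
        return "S3-PyArrow" if req.analytical else "S3"
    if req.target == "sql":
        return "SQLQuery" if req.sql_query_export else "SQLBlob"
    try:
        return _SIMPLE[req.target]
    except KeyError:
        raise ValueError(f"no backend covers target {req.target!r}") from None
```

For example, `choose_backend(Requirements("s3", analytical=True))` returns `"S3-PyArrow"`, while an unknown target raises `ValueError`.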
Trade-offs at a glance
| Backend | Dependencies | Glob | Throughput | Best for |
|---|---|---|---|---|
| Local | None | Native | Disk-bound | Dev, single machine |
| Memory | None | Fallback | In-process | Tests, caches |
| S3 | `s3fs` | Native | Network | General S3 workloads |
| S3-PyArrow | `pyarrow` | Native | Network | Parquet, PyArrow datasets |
| SFTP | `paramiko` | Fallback | Network | Legacy, on-prem |
| Azure | `azure-storage-blob` | Native | Network | Azure workloads |
| SQLBlob | `sqlalchemy` | Native | DB-bound | Embedded, metadata-heavy |
| SQLQuery | `sqlalchemy` + `pyarrow` | Native | DB-bound | Read-only ETL exports |
| HTTP | None | N/A | Network | Read-only public data |
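To make the "Fallback" entries in the Glob column concrete, the sketch below shows the general idea of a client-side glob: list the keys, then filter them against the pattern locally. It is a conceptual illustration under stated assumptions, not the library's `ext.glob` implementation; `fallback_glob` is a hypothetical name, and the matching rules of the real fallback may differ.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Iterator


def fallback_glob(keys: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield the keys that match a shell-style pattern, filtering client-side.

    Note: fnmatch-style matching lets ``*`` match ``/`` as well, so this is
    looser than a path-aware glob.
    """
    for key in keys:
        if fnmatchcase(key, pattern):
            yield key


# Keys as a backend listing might return them (made-up sample data).
keys = ["data/a.csv", "data/b.parquet", "logs/x.csv"]
matches = list(fallback_glob(keys, "data/*.csv"))
```

The practical cost is that every candidate key has to be listed before it can be filtered, which is why a fallback glob tends to be slower than a native one on large listings.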
Switching backends at runtime
The whole point of remote-store is that your application code stays the same
regardless of backend. Switch via configuration:
```toml
# dev.toml
[backends.storage]
type = "local"
base_path = "./data"

[stores.default]
backend = "storage"
```

```toml
# prod.toml
[backends.storage]
type = "s3"
bucket = "my-bucket"

[stores.default]
backend = "storage"
```
```python
from remote_store import Registry, RegistryConfig

config = RegistryConfig.from_toml("dev.toml")  # or "prod.toml"
registry = Registry(config)
store = registry.get_store("default")
# Same API regardless of backend
```
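One way to choose between the two files without editing code is to key the config path off an environment variable. The sketch below assumes a variable named `APP_ENV` and a helper called `config_path`; both are hypothetical and not part of remote-store. Only the file names come from the examples above.

```python
import os
from typing import Optional

# Hypothetical mapping from environment name to config file.
CONFIG_FILES = {"dev": "dev.toml", "prod": "prod.toml"}


def config_path(env: Optional[str] = None) -> str:
    """Return the config file for `env`, defaulting to APP_ENV, then to "dev"."""
    name = env or os.environ.get("APP_ENV", "dev")
    try:
        return CONFIG_FILES[name]
    except KeyError:
        raise ValueError(f"unknown environment {name!r}") from None


# Usage with the registry shown above (left commented so the sketch runs
# without remote-store installed):
# config = RegistryConfig.from_toml(config_path())
```

Failing loudly on an unknown environment name avoids silently falling back to the local backend in production.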
See also
- Capabilities Matrix — full backend x capability table
- Backends guide — per-backend configuration details
- Performance guide — benchmark data across backends