Data Lake Patterns¶
How to use remote-store as the storage plumbing layer for a multi-layer Parquet data lake — and where to draw the line.
The idea¶
remote-store provides a unified, backend-agnostic I/O layer that makes data lake patterns portable across storage backends. A team can develop locally against Memory or Local backends, test in CI without cloud credentials, and deploy to S3 or Azure without changing application code.
The value proposition is not competing with Databricks, Spark, or any query/compute engine. It is making the storage layer underneath them swappable and testable. remote-store owns the I/O path; table formats (Delta Lake, Iceberg) and query engines (Spark, DuckDB, Polars) sit on top via the PyArrow adapter.
| Layer | Components |
|---|---|
| Query / Compute | Polars, DuckDB, Pandas, Spark, Databricks |
| Table Format (optional) | Delta Lake, PyIceberg, plain Parquet |
| PyArrow FileSystem | pyarrow_fs(store) |
| remote-store | Store, child(), ext.batch, ext.transfer |
| Backend | Local, Memory, S3, S3-PyArrow, SFTP, Azure, HTTP, SQLBlob, SQLQuery |
What this gets you¶
- Backend portability. Develop against
MemoryBackend, deploy to S3/Azure. Same code, zero cloud dependencies in dev. - Layer isolation.
store.child("bronze"),store.child("silver"),store.child("gold")— each layer gets its own namespace without separate credentials, connections, or config. - Streaming by default. Reads and writes never fully materialize large files in memory. Critical for multi-GB Parquet partitions.
- Atomic writes.
write_atomic()uses rename-based semantics, reducing partial-write risk for individual files. - Batch operations.
batch_delete,batch_copy,batch_existsfor partition-scale workflows with error aggregation. - Observability.
ext.observehooks and OpenTelemetry bridge give visibility into every I/O operation without instrumenting application code.
What this does not get you¶
Being honest about the boundary:
- No table format protocol. remote-store does not implement the Delta Lake
transaction log, Iceberg manifest management, or any table format's commit
protocol. Libraries like
deltalake(delta-rs) orpyiceberghandle that — remote-store is the I/O layer they write through. - No query execution. No SQL, no predicate pushdown, no shuffle. Polars, DuckDB, or Spark handle query planning and execution.
- No catalog integration. Unity Catalog, Hive Metastore, and Glue Catalog are metadata services remote-store does not interact with.
- No Databricks-native credential passthrough. Databricks clusters use
instance profiles, workspace tokens, and Unity Catalog vended credentials.
remote-store's
BackendConfigtakes explicit credentials. On Databricks, you would either pass credentials explicitly or use the native DBFS/ABFSS paths directly and skip the abstraction. - GIL-free fast path for native backends. When the backend exposes a native
PyArrow filesystem (e.g.,
S3PyArrowBackend), the adapter uses Tier 1 fast-path reads that bypass Python I/O entirely — full C++ range requests with I/O coalescing and zero GIL overhead. Backends without native PyArrow support fall back to Tier 2 (full-file materialization) or Tier 3 (PythonFile streaming).
The bridge metaphor works best with a clear boundary: remote-store owns portable, testable, observable storage I/O. Everything above the storage layer — formats, queries, catalogs — belongs to purpose-built tools.
Bronze / Silver / Gold with Store.child()¶
The medallion architecture maps naturally to child stores:
from remote_store import BackendConfig, Registry, RegistryConfig, StoreProfile
config = RegistryConfig(
backends={
"lake": BackendConfig(
type="s3",
options={"bucket": "my-data-lake"},
),
},
stores={"lake": StoreProfile(backend="lake", root_path="v1")},
)
with Registry(config) as registry:
lake = registry.get_store("lake")
bronze = lake.child("bronze") # raw ingestion
silver = lake.child("silver") # cleaned and typed
gold = lake.child("gold") # aggregated / business-ready
Each layer is a full Store — same API, same backend connection, independent
namespace. In dev/test, swap the backend to MemoryBackend or LocalBackend
with zero code changes.
Writing Parquet via PyArrow¶
import pyarrow as pa
import pyarrow.parquet as pq
from remote_store.ext.arrow import pyarrow_fs
# Bronze: ingest raw data as Parquet
bronze_fs = pyarrow_fs(bronze)
raw = pa.table({
"timestamp": ["2026-03-01T10:00:00", "2026-03-01T10:01:00"],
"sensor_id": ["s-001", "s-002"],
"reading": [23.4, 19.8],
})
pq.write_table(raw, "readings/2026-03-01.parquet", filesystem=bronze_fs)
Reading and transforming with Polars¶
import polars as pl
import pyarrow.parquet as pq
from remote_store.ext.arrow import pyarrow_fs
# Read bronze partition via PyArrow, convert to Polars
bronze_fs = pyarrow_fs(bronze)
raw = pq.read_table("readings/2026-03-01.parquet", filesystem=bronze_fs)
df = pl.from_arrow(raw)
# Clean and type (bronze -> silver)
silver_df = (
df.with_columns(
pl.col("timestamp").str.to_datetime(),
pl.col("reading").cast(pl.Float64),
)
.filter(pl.col("reading").is_not_null())
)
# Write to silver layer via PyArrow
silver_fs = pyarrow_fs(silver)
pq.write_table(silver_df.to_arrow(), "readings/2026-03-01.parquet", filesystem=silver_fs)
Partitioned datasets with PyArrow¶
PyArrow's dataset API handles Hive-style partitioned layouts automatically:
import pyarrow.dataset as ds
# Write partitioned by date
part = pa.table({
"date": ["2026-03-01", "2026-03-01", "2026-03-02"],
"sensor": ["s-001", "s-002", "s-001"],
"value": [23.4, 19.8, 21.1],
})
ds.write_dataset(
part,
"sensors",
filesystem=pyarrow_fs(silver),
format="parquet",
partitioning=ds.partitioning(pa.schema([("date", pa.string())])),
)
# Read with partition pruning
dataset = ds.dataset(
"sensors",
filesystem=pyarrow_fs(silver),
format="parquet",
partitioning="hive",
)
march_1 = dataset.to_table(filter=ds.field("date") == "2026-03-01")
DuckDB integration¶
import duckdb
import pyarrow.dataset as ds
from remote_store.ext.arrow import pyarrow_fs
silver_fs = pyarrow_fs(silver)
dataset = ds.dataset("readings", filesystem=silver_fs, format="parquet")
# DuckDB queries PyArrow datasets directly
result = duckdb.sql("""
SELECT sensor_id, AVG(reading) as avg_reading
FROM dataset
GROUP BY sensor_id
""")
print(result.fetchdf())
Delta Lake (via deltalake)¶
The deltalake library accepts PyArrow filesystems for storage:
from deltalake import DeltaTable, write_deltalake
gold_fs = pyarrow_fs(gold)
# Write a Delta table
write_deltalake(
"aggregates",
silver_df.to_arrow(),
filesystem=gold_fs,
)
# Read it back
dt = DeltaTable("aggregates", filesystem=gold_fs)
df = pl.from_arrow(dt.to_pyarrow_table())
Note: deltalake (delta-rs) owns the transaction log and commit protocol.
remote-store provides the I/O — it does not replace Delta's ACID guarantees.
Batch operations across partitions¶
from remote_store import batch_delete, batch_copy
# Archive old bronze partitions within the same store
old_files = [
f.path for f in bronze.list_files("readings", recursive=True)
if f.path < "readings/2026-01-01"
]
batch_copy(bronze, [(f, f"archive/{f}") for f in old_files])
batch_delete(bronze, old_files)
For partition discovery without scanning the full tree, use max_depth to
limit traversal:
# List only top-level partition folders (e.g. date directories)
for folder in bronze.list_folders("readings", max_depth=0):
print(folder.name) # 2026-01-01, 2026-02-01, ...
# Files in readings/ and its immediate partition folders only
for f in bronze.list_files("readings", max_depth=1):
print(f.path)
Cross-backend transfer¶
Move data between backends without intermediate files:
from remote_store import transfer
# Stream a partition from S3 bronze to Azure silver
transfer(
s3_bronze, "readings/2026-03-01.parquet",
azure_silver, "readings/2026-03-01.parquet",
on_progress=lambda n: print(f"{n} bytes transferred"),
)
Testing without cloud credentials¶
The entire pipeline can run against MemoryBackend in unit tests:
import pytest
from remote_store import Store
from remote_store.backends import MemoryBackend
from remote_store.ext.arrow import pyarrow_fs
@pytest.fixture
def lake():
"""In-memory data lake for testing."""
store = Store(backend=MemoryBackend())
return {
"bronze": store.child("bronze"),
"silver": store.child("silver"),
"gold": store.child("gold"),
}
def test_bronze_to_silver(lake):
import pyarrow as pa
import pyarrow.parquet as pq
# Write raw data to bronze
bronze_fs = pyarrow_fs(lake["bronze"])
table = pa.table({"id": [1, 2], "value": [10.0, 20.0]})
pq.write_table(table, "batch.parquet", filesystem=bronze_fs)
# Transform and write to silver
silver_fs = pyarrow_fs(lake["silver"])
raw = pq.read_table("batch.parquet", filesystem=bronze_fs)
# ... apply transformations ...
pq.write_table(raw, "batch.parquet", filesystem=silver_fs)
# Verify silver has the data
result = pq.read_table("batch.parquet", filesystem=silver_fs)
assert result.num_rows == 2
Features that complement this pattern¶
These extensions work well with data lake workflows:
| Feature | What it does |
|---|---|
| Parallel batch operations | Concurrent I/O for partition-scale deletes and copies (concurrent=True) |
| Hive partition helpers | Build and parse year=2026/month=03/ paths with partition_path() and parse_partition() |
| PyArrow Tier 1 fast-path | Zero-GIL C++ range requests for large Parquet workloads (S3-PyArrow) |
| Streaming atomic writes | open_atomic() context manager for multi-GB Parquet exports |
| Caching middleware | Reduces round-trips for metadata-heavy listing workflows |
| Parquet Dataset Store | Managed Parquet datasets with manifests, _SUCCESS markers, and multi-part layouts |
See also¶
- Medallion + Dagster Showcase — end-to-end Bronze/Silver/Gold pipeline demonstrating extensions over live MeteoSwiss data
- Data Lake Medallion notebook — runnable end-to-end Bronze/Silver/Gold pipeline
- Dagster Integration — use any Store as a Dagster IO manager for orchestrated pipelines
- PyArrow FileSystem Adapter — adapter configuration and tiered read strategy
- Transfer Operations — cross-store and local-path transfers
- Batch Operations — bulk delete, copy, exists
- Concurrency — TOCTOU and atomicity considerations