Azure Backend¶
The Azure backend stores files in Azure Blob Storage and Azure Data Lake Storage (ADLS) Gen2 using azure-storage-file-datalake directly. It adapts at runtime to Hierarchical Namespace (HNS) accounts, providing atomic rename and real directories on ADLS Gen2 while remaining fully functional on plain Blob Storage.
Installation¶
This pulls in azure-storage-file-datalake and azure-identity (for DefaultAzureCredential).
Usage¶
from remote_store import BackendConfig, RegistryConfig, Registry, StoreProfile

config = RegistryConfig(
    backends={
        "my-azure": BackendConfig(
            type="azure",
            options={
                "container": "my-container",
                "account_name": "mystorageaccount",
            },
        ),
    },
    stores={"data": StoreProfile(backend="my-azure", root_path="datasets")},
)

with Registry(config) as registry:
    store = registry.get_store("data")
    store.write("report.csv", b"col1,col2\n1,2\n")
    data = store.read_bytes("report.csv")
Direct construction¶
from remote_store.backends import AzureBackend

# Account key
backend = AzureBackend(
    container="my-container",
    account_name="mystorageaccount",
    account_key="...",
)

# SAS token
backend = AzureBackend(
    container="my-container",
    account_name="mystorageaccount",
    sas_token="sv=2023-11-03&...",
)

# Connection string
backend = AzureBackend(
    container="my-container",
    connection_string="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;",
)

# DefaultAzureCredential (auto-resolves env vars, managed identity, CLI login, etc.)
backend = AzureBackend(
    container="my-container",
    account_name="mystorageaccount",
)
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| container | str | (required) | Azure Storage container name |
| account_name | str | None | Storage account name (builds URL automatically) |
| account_url | str | None | Full account URL (e.g. https://myaccount.dfs.core.windows.net) |
| account_key | str | None | Storage account key |
| sas_token | str | None | Shared Access Signature token |
| connection_string | str | None | Azure Storage connection string |
| credential | Any | None | Any credential object (e.g. DefaultAzureCredential()) |
| client_options | dict | None | Extra kwargs passed to service clients (see Upload tuning) |
| max_concurrency | int | 1 | Parallel connections for uploads/downloads (>1 benefits large files) |
At least one of account_name, account_url, or connection_string must be provided.
Authentication¶
The backend resolves credentials in this order:
1. account_key: if provided, used directly
2. sas_token: if provided, used directly
3. credential: any credential object (e.g. DefaultAzureCredential())
4. DefaultAzureCredential: auto-detected from environment (requires azure-identity)
DefaultAzureCredential automatically tries environment variables, managed identity, Azure CLI, and other sources. See the Azure Identity docs for details.
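The resolution order above can be sketched as a small function. This is an illustrative model only, not the backend's actual code: resolve_credential and the string labels it returns are hypothetical names, and the final branch merely marks where the backend would construct a DefaultAzureCredential.

```python
from typing import Any, Optional, Tuple


def resolve_credential(
    account_key: Optional[str] = None,
    sas_token: Optional[str] = None,
    credential: Optional[Any] = None,
) -> Tuple[str, Any]:
    """Hypothetical helper mirroring the documented precedence."""
    if account_key is not None:
        return ("account_key", account_key)
    if sas_token is not None:
        return ("sas_token", sas_token)
    if credential is not None:
        return ("credential", credential)
    # Nothing explicit was passed: the backend would fall back to
    # DefaultAzureCredential (requires azure-identity). Modeled as a label.
    return ("default_azure_credential", None)


# An account key wins even when a SAS token is also supplied.
chosen = resolve_credential(account_key="key", sas_token="sv=...")
fallback = resolve_credential()
```

Under this model, passing several credential options at once is not an error; the earliest one in the list is used.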
HNS vs Non-HNS¶
The backend detects Hierarchical Namespace (HNS) status on first use and adapts its behavior:
| Feature | HNS Enabled (ADLS Gen2) | No HNS (Blob Storage) |
|---|---|---|
| Directories | Real entities | Virtual (prefix-based) |
| write_atomic | Temp file + atomic rename | Direct upload (PUT is atomic) |
| move | Atomic rename_file | Copy + delete |
| delete_folder(recursive=True) | Single recursive delete | Iterate + delete each blob |
If the HNS detection call fails (e.g. insufficient permissions), the backend falls back to non-HNS behavior.
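A minimal sketch of that fallback, assuming the detection is a single call that either returns a boolean or raises. The names detect_hns and probe are hypothetical; this page does not state which SDK call the backend uses, so the probe is passed in as a plain callable.

```python
from typing import Callable


def detect_hns(probe: Callable[[], bool]) -> bool:
    """Return the probe's answer, or False if the probe raises."""
    try:
        return bool(probe())
    except Exception:
        # e.g. insufficient permissions: behave as plain Blob Storage
        return False


def denied() -> bool:
    # Stand-in for a detection call rejected by the service.
    raise PermissionError("simulated authorization failure")


hns_on = detect_hns(lambda: True)
hns_denied = detect_hns(denied)
```

The practical consequence is that a credential without permission to query account properties gets copy + delete moves even on an ADLS Gen2 account.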
Note that non-HNS move() (copy + delete) is not atomic and overwrite=False has a TOCTOU race on all account types. See the Concurrency and Atomicity Guarantees guide for details.
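To make the non-atomic case concrete, here is a small in-memory model of a copy + delete move. FlakyStore is a hypothetical stand-in for a non-HNS container, and the failure is injected deliberately; it is not how the backend is implemented.

```python
class FlakyStore:
    """In-memory stand-in for a non-HNS container (hypothetical)."""

    def __init__(self, fail_on_delete: bool = False) -> None:
        self.blobs = {}  # key -> bytes
        self.fail_on_delete = fail_on_delete

    def copy(self, src: str, dst: str) -> None:
        self.blobs[dst] = self.blobs[src]

    def delete(self, key: str) -> None:
        if self.fail_on_delete:
            raise ConnectionError("simulated failure before the delete completed")
        del self.blobs[key]

    def move(self, src: str, dst: str) -> None:
        # Two separate steps: nothing rolls back the copy if the delete fails.
        self.copy(src, dst)
        self.delete(src)


store = FlakyStore(fail_on_delete=True)
store.blobs["a.csv"] = b"1,2\n"
try:
    store.move("a.csv", "b.csv")
except ConnectionError:
    pass

leftover = sorted(store.blobs)  # both source and destination remain
```

Callers that retry a failed move on a non-HNS account should therefore expect the destination to exist already.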
File Metadata¶
get_file_info() and list_files() return FileInfo objects with the following fields populated by the Azure backend:
| Field | Source | Notes |
|---|---|---|
| etag | BlobProperties.etag | Double-quotes stripped; lowercased. |
| digest | BlobProperties.content_settings.content_md5 | Populated as ContentDigest("md5", <hex>) when the blob has a stored Content-MD5; None otherwise. |
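The two normalizations in the table can be sketched as follows. The helper names are hypothetical, the digest is modeled as a plain tuple rather than the library's ContentDigest type, and the sketch assumes the SDK exposes Content-MD5 as raw bytes.

```python
import hashlib
from typing import Optional, Tuple


def normalize_etag(raw: str) -> str:
    """Strip surrounding double quotes and lowercase the value."""
    return raw.strip('"').lower()


def md5_digest(content_md5: Optional[bytes]) -> Optional[Tuple[str, str]]:
    """Model of ContentDigest("md5", <hex>) as a tuple; None if absent."""
    if not content_md5:
        return None
    return ("md5", bytes(content_md5).hex())


etag = normalize_etag('"0x8DC4A1B2C3D4E5F"')
digest = md5_digest(hashlib.md5(b"col1,col2\n1,2\n").digest())
missing = md5_digest(None)
```

Because the ETag is normalized, comparing it against a raw value taken from the Azure portal or SDK needs the same stripping and lowercasing first.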
Write Results¶
The Azure backend declares WRITE_RESULT_NATIVE and USER_METADATA. Write operations return
a WriteResult with etag and last_modified populated from the upload
response. digest is populated as ContentDigest("md5", <hex>) when Azure echoes back
Content-MD5 in the upload response, and None otherwise. When blob versioning is enabled
on a non-HNS container, version_id is also populated from the upload response.
Pass metadata= to store custom string key-value pairs as Azure blob metadata.
Capabilities¶
Supports all capabilities except SEEKABLE_READ and ATOMIC_MOVE.
See the capabilities matrix for full details.
Streaming¶
read() returns a forward-only streaming handle (not seekable). Data is fetched on demand, not loaded into memory upfront. If you need seekability, use read_bytes() and wrap in BytesIO:
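A minimal sketch of that pattern. The bytes literal stands in for the result of store.read_bytes("report.csv") so the snippet runs without an Azure account.

```python
import io

# Stand-in for: data = store.read_bytes("report.csv")
data = b"col1,col2\n1,2\n"

buf = io.BytesIO(data)  # fully in memory, so it supports seek()
buf.seek(5)
tail = buf.read()       # everything after the first five bytes
buf.seek(0)
header = buf.readline() # rewind and read the first line again
```

This loads the whole object into memory, so it suits small and medium files; for large blobs the forward-only handle from read() keeps memory bounded.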
Upload tuning¶
The library sets conservative upload defaults on the Azure service clients to keep memory usage bounded during streaming transfers:
| Setting | Library default | SDK default |
|---|---|---|
| max_single_put_size | 1 MiB | 64 MiB |
| max_block_size | 1 MiB | 4 MiB |
| min_large_block_upload_threshold | 1 | 4 MiB + 1 |
These defaults cause uploads to use staged-block requests with small blocks.
For large files where upload throughput matters more than memory, override
via client_options:
AzureBackend(
    container="my-container",
    connection_string="...",
    client_options={
        "max_single_put_size": 8 * 1024 * 1024,  # 8 MiB
        "max_block_size": 4 * 1024 * 1024,  # 4 MiB
    },
)
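To see what these settings change, the sketch below merges user options over the library defaults and estimates how many upload requests a payload needs. It is a simplified model under stated assumptions: effective_options and upload_requests are hypothetical helpers, the final block-list commit is ignored, and the single-PUT rule reflects the general SDK behavior rather than anything this backend guarantees.

```python
import math

MIB = 1024 * 1024

# Library defaults from the table above.
LIBRARY_DEFAULTS = {
    "max_single_put_size": 1 * MIB,
    "max_block_size": 1 * MIB,
    "min_large_block_upload_threshold": 1,
}


def effective_options(client_options=None):
    """User-supplied values take precedence over library defaults."""
    return {**LIBRARY_DEFAULTS, **(client_options or {})}


def upload_requests(size: int, options: dict) -> int:
    """One PUT if the payload fits, otherwise one request per block."""
    if size <= options["max_single_put_size"]:
        return 1
    return math.ceil(size / options["max_block_size"])


tuned = effective_options(
    {"max_single_put_size": 8 * MIB, "max_block_size": 4 * MIB}
)
default_blocks = upload_requests(100 * MIB, effective_options())
tuned_blocks = upload_requests(100 * MIB, tuned)
```

In this model a 100 MiB upload drops from 100 block requests to 25 with the overrides shown, at the cost of buffering larger blocks.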
Escape Hatch¶
Access the underlying FileSystemClient when you need Azure-specific features:
from azure.storage.filedatalake import FileSystemClient
fs = backend.unwrap(FileSystemClient)
fs.get_paths(path="my-prefix")
Local Development with Azurite¶
Azurite is the official Azure Storage emulator. Start it with Docker:
Then connect using the well-known Azurite connection string:
backend = AzureBackend(
    container="test",
    connection_string=(
        "DefaultEndpointsProtocol=http;"
        "AccountName=devstoreaccount1;"
        "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq"
        "/K1SZFPTOtr/KBHBeksoGMGw==;"
        "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
    ),
)
Note: Azurite does not support Hierarchical Namespace. HNS-specific features (atomic rename, real directories) are tested with mocked SDK objects. To validate against a live ADLS Gen2 account, see Azure HNS account setup.
API Reference¶
AzureBackend¶
AzureBackend(
    container: str,
    *,
    account_name: str | None = None,
    account_url: str | None = None,
    account_key: str | Secret | None = None,
    sas_token: str | Secret | None = None,
    connection_string: str | Secret | None = None,
    credential: Any | None = None,
    client_options: dict[str, Any] | None = None,
    retry: RetryPolicy | None = None,
    max_concurrency: int = 1,
    reject_write_under_file_ancestor: bool = False,
)
Bases: Backend
Azure Storage backend.
Uses the Blob SDK for non-HNS accounts (plain Blob Storage, Azurite) and the DataLake SDK for HNS accounts (ADLS Gen2) to get atomic rename and real directory support.
move() on non-HNS accounts is implemented as a server-side copy
followed by a blob delete. This is non-atomic: a failure between the
two steps may leave both source and destination present. HNS accounts
use rename_file which is atomic, but since the backend cannot
guarantee HNS at construction time, ATOMIC_MOVE is not declared.
Parameters:
- container (str) – Azure Storage container name (required, non-empty).
- account_name (str | None, default: None) – Storage account name.
- account_url (str | None, default: None) – Full account URL (e.g. https://myaccount.dfs.core.windows.net).
- account_key (str | Secret | None, default: None) – Storage account key.
- sas_token (str | Secret | None, default: None) – Shared Access Signature token.
- connection_string (str | Secret | None, default: None) – Azure Storage connection string.
- credential (Any | None, default: None) – Any credential object (e.g. DefaultAzureCredential()).
- client_options (dict[str, Any] | None, default: None) – Additional options passed to service clients. The library sets max_single_put_size, max_block_size, and min_large_block_upload_threshold defaults for streaming memory discipline; user-supplied values take precedence.
- max_concurrency (int, default: 1) – Maximum number of parallel connections for uploads and downloads (default 1, i.e. sequential).
- reject_write_under_file_ancestor (bool, default: False) – If True, write / write_atomic / open_atomic / move / copy HEAD each slash-aligned ancestor of the target path on non-HNS accounts and raise InvalidPath on the first regular-file hit, matching the cross-backend contract that hierarchical filesystems enforce natively. On HNS accounts the kwarg short-circuits: hdi_isfolder rejects the operation natively, and the backend detects the file ancestor on that rejection and re-raises it as InvalidPath, so HNS delivers the cross-backend contract with or without the kwarg set. Default False: enabling the check adds one HEAD per ancestor per nested-path write; paths without slashes short-circuit.
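As an illustration of what "slash-aligned ancestor" means for reject_write_under_file_ancestor, the sketch below lists the prefixes that would be probed for a given target path. The function name is hypothetical, and the shortest-first order is an assumption; the parameter description does not say in which order the backend checks them.

```python
from typing import List


def slash_aligned_ancestors(path: str) -> List[str]:
    """Return every proper ancestor prefix of path, shortest first."""
    parts = path.strip("/").split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


nested = slash_aligned_ancestors("datasets/2024/report.csv")
flat = slash_aligned_ancestors("report.csv")
```

A path nested two levels deep yields two ancestors, hence two HEAD requests per write when the check is enabled, while a path without slashes yields none.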
resolve¶
resolve(path: str) -> ResolutionPlan
Return a ResolutionPlan with Azure-specific details.
Parameters:
- path (str) – Backend-relative key.
Returns:
- ResolutionPlan – Plan with kind="azure" and details containing container and account_url.
delete_folder¶
Delete a folder.
Parameters:
- path (str) – Backend-relative key.
- recursive (bool, default: False) – If True, delete all contents first.
- missing_ok (bool, default: False) – If True, do not raise when absent.
Raises:
- NotFound – If the folder is missing and missing_ok is False.
- InvalidPath – If path names a file (use delete instead).
- DirectoryNotEmpty – If non-empty and recursive is False.
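The documented contract can be modeled over prefix-based (virtual) folders with a plain dictionary. This is a sketch under assumptions, not the backend's implementation: the exception classes are local stand-ins for the library's, the order of the checks is a guess, and a pure prefix model cannot represent an empty folder, which a real account may.

```python
class NotFound(Exception): ...
class InvalidPath(Exception): ...
class DirectoryNotEmpty(Exception): ...


def delete_folder(blobs, path, recursive=False, missing_ok=False):
    """Model of the documented contract over prefix-based folders."""
    if path in blobs:
        raise InvalidPath(f"{path!r} names a file; use delete instead")
    prefix = path.rstrip("/") + "/"
    children = [key for key in blobs if key.startswith(prefix)]
    if not children:
        if missing_ok:
            return
        raise NotFound(path)
    if not recursive:
        raise DirectoryNotEmpty(path)
    for key in children:  # non-HNS style: delete each blob individually
        del blobs[key]


blobs = {"data/a.csv": b"1", "data/sub/b.csv": b"2", "other.csv": b"3"}
raised = None
try:
    delete_folder(blobs, "data")  # non-empty and not recursive
except DirectoryNotEmpty as exc:
    raised = type(exc).__name__

delete_folder(blobs, "data", recursive=True)
delete_folder(blobs, "data", missing_ok=True)  # already gone: no error
remaining = sorted(blobs)
```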
get_file_info¶
get_file_info(path: str) -> FileInfo
Return file metadata for path.
Parameters:
- path (str) – Backend-relative key.
Raises:
- NotFound – If the file does not exist.
- InvalidPath – If path names a directory (HNS: hdi_isfolder=true).