Research: Backend Health Check (store.ping() / backend.check_health())¶
Item ID: ID-054
Date: 2026-03-09
Context: Lightweight, non-destructive health verification for backends; startup gates and liveness probes.
1. Overview and Motivation¶
A health check method verifies that a backend is reachable and credentials are valid without performing any data operations. This enables:
- Startup gates: Fail fast if credentials are invalid before accepting traffic
- Liveness probes: Kubernetes, container orchestrators, monitoring systems
- Connection validation: Application initialization / bootstrap logic
- Operational hygiene: Verify before critical operations
The operation must be non-destructive (no side effects), lightweight (minimal I/O), and portable across all backends.
2. Design Constraints and Principles¶
2.1 Non-Destructive¶
- Must not create, modify, or delete any data
- Must not have observable side effects on the backend
- Read-only or pure metadata operations only
2.2 Lightweight¶
- Should complete in under 1–2 seconds on a healthy backend
- Minimal network round-trips (preferably a single call)
- No listing, streaming, or enumeration
2.3 Portable Semantics¶
- Uniform API across all backends (Store + all 6 backends)
- Failure modes standardized: success (healthy), exception (unhealthy)
- Credential validation implicit (failures = bad credentials or connectivity)
2.4 Clear Failure Semantics¶
- Success = backend is reachable and credentials work
- Raises exception on any connectivity or credential issue
- Does not return a bool or status enum; follows the remote-store convention of raising exceptions for error conditions
- Note: LocalBackend.__init__ calls self._root.mkdir(parents=True, exist_ok=True), so the root always exists after construction. A NotFound for Local would only occur if the directory is deleted between construction and ping(), an unusual but valid edge case to handle
3. Backend-Specific Strategies¶
3.1 Local Backend¶
Method: os.access(root_path, os.R_OK)
- Verifies the root path exists and is readable
- Non-destructive, instant
- Raises PermissionDenied if the path is not readable, NotFound if the path is missing
Implementation Notes:
- Works for any root path (file or folder)
- Natural fit for how Local backend validates paths
3.2 S3 Backend¶
Method: head_bucket() via s3fs
- Lightweight metadata call, no data transfer
- Validates bucket exists and credentials have permission
- s3fs wraps aiobotocore (async botocore), not boto3; the synchronous route to HeadBucket is self._fs.call_s3("head_bucket", Bucket=self._bucket)
- Or use self._fs.info(self._bucket) for stat-like metadata
Implementation Notes:
- S3Backend uses self._fs (s3fs.S3FileSystem), not a raw boto3 client
- In s3fs releases built on aiobotocore, self._fs.s3 is an async client, so calling self._fs.s3.head_bucket(...) directly returns a coroutine instead of running the request; go through self._fs.call_s3("head_bucket", Bucket=self._bucket)
- Or lightweight info call: self._fs.info(self._bucket) returns metadata; verify that it is not served from s3fs's listing cache if the check must be live
- Preferred over checking for root path existence (which would require listing)
Error Mapping:
- 403 / AccessDenied → PermissionDenied
- 404 / NoSuchBucket → NotFound
- Timeout / connection → BackendUnavailable
3.3 S3-PyArrow Backend¶
Method: PyArrow S3FileSystem.get_file_info() on bucket root
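- Uses PyArrow's native S3FileSystem, not boto3
- self._pa_fs.get_file_info(self._bucket) returns lightweight metadata
- get_file_info() does not raise for a missing path; it returns a FileInfo whose type is FileType.NotFound, which must be mapped to NotFound explicitly
- Validates bucket exists and credentials work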
Implementation Notes:
- S3-PyArrow wraps PyArrow's S3FileSystem, not boto3
- Lightweight metadata call via get_file_info() on bucket name or root path
- Error handling consistent with S3 Backend (map to same exception types)
3.4 SFTP Backend¶
Method: stat(root_path) or equivalent
- Checks if root path exists via an os.stat()-like call
- Paramiko provides sftp.stat(path), which returns stat info
- Also validates SSH connectivity and credentials
Implementation Notes:
- sftp.stat(root_path) already used in existence checks
- Non-destructive: just metadata lookup
- Error mapping: EIO / connection → BackendUnavailable, EACCES → PermissionDenied, ENOENT → NotFound
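The errno mapping above can be sketched as follows. This is a minimal illustration, not the backend's actual code: the three exception classes are stand-ins for the remote-store error types, and `map_sftp_error` and `check_sftp_root` are hypothetical helper names. It relies on Paramiko reporting missing paths and permission failures as IOError with errno set to ENOENT and EACCES.

```python
import errno


# Stand-ins for the remote-store error types named in this document.
class PermissionDenied(Exception): ...
class NotFound(Exception): ...
class BackendUnavailable(Exception): ...


def map_sftp_error(exc: Exception, path: str) -> Exception:
    """Translate an error raised by sftp.stat() into a health-check error."""
    code = getattr(exc, "errno", None)
    if code == errno.ENOENT:
        return NotFound(f"Root path does not exist: {path}")
    if code == errno.EACCES:
        return PermissionDenied(f"No access to root path: {path}")
    # EIO, dropped connections, and anything unmapped fall back here.
    return BackendUnavailable(f"SFTP backend unreachable: {exc}")


def check_sftp_root(sftp, root_path: str) -> None:
    """One stat() round-trip; `sftp` is any object with a stat(path) method."""
    try:
        sftp.stat(root_path)
    except Exception as exc:  # broad on purpose: SSH and socket errors vary
        raise map_sftp_error(exc, root_path) from exc
```

Keeping the mapping in a pure function makes it testable without an SSH server: pass in a fake client whose stat() raises the error of interest.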
3.5 Azure Blob Storage / Data Lake Storage¶
Method: HNS-aware container/filesystem properties lookup
Non-HNS (standard Blob):
- get_container_properties() or equivalent (ContainerClient.exists())
- Non-destructive metadata call
- Validates container exists and credentials work
HNS (Data Lake):
- DataLakeFileSystemClient.get_file_system_properties() or equivalent (exists())
- Similar lightweight metadata call
- Validates filesystem exists and credentials work
Implementation Notes:
- AzureBackend detects HNS mode via self._hns flag and branches accordingly
- Non-HNS: ContainerClient.get_container_properties() is lightweight
- HNS: DataLakeFileSystemClient.get_file_system_properties() (reuses existing folder stat logic)
- Error mapping (both modes): 403 → PermissionDenied, 404 → NotFound, connection → BackendUnavailable
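A rough sketch of the HNS branch and status-code mapping described above. The exception classes are stand-ins, `check_azure` and `map_azure_error` are hypothetical names, and the clients are duck-typed so the sketch runs without the Azure SDK. It assumes SDK HTTP errors expose a `status_code` attribute and that transport failures do not; treating 401 like 403 is an assumption beyond the mapping listed above.

```python
# Stand-ins for the remote-store error types named in this document.
class PermissionDenied(Exception): ...
class NotFound(Exception): ...
class BackendUnavailable(Exception): ...


def map_azure_error(exc: Exception) -> Exception:
    """Map an Azure SDK error to a health-check error by HTTP status."""
    status = getattr(exc, "status_code", None)
    if status in (401, 403):  # 401 handling is an assumption
        return PermissionDenied(f"Azure rejected the credentials: {exc}")
    if status == 404:
        return NotFound(f"Container or filesystem does not exist: {exc}")
    # No status code (transport failure) or an unmapped status.
    return BackendUnavailable(f"Azure backend unreachable: {exc}")


def check_azure(container_client, filesystem_client, hns: bool) -> None:
    """Issue one properties call, choosing the client by HNS mode."""
    try:
        if hns:
            filesystem_client.get_file_system_properties()
        else:
            container_client.get_container_properties()
    except Exception as exc:
        raise map_azure_error(exc) from exc
```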
3.6 Memory Backend¶
Method: Always succeeds
- In-memory backend is always "healthy" by definition
- Return immediately without any checks
- Could optionally verify that root path exists in the tree, but not required
Implementation Notes:
- Safe to always return success; there are no external resources to validate
4. API Design¶
Option A: store.ping() (Recommended)¶
    # Store level
    def ping(self) -> None:
        """
        Verify the backend is reachable and credentials are valid.

        Non-destructive: performs no data operations, creates no side effects.
        Lightweight: single metadata call per backend.

        Raises
        ------
        PermissionDenied
            If credentials are invalid or insufficient.
        NotFound
            If the backend root path / container / filesystem does not exist.
        BackendUnavailable
            If the backend is unreachable (network, timeout, etc.).

        Examples
        --------
        >>> store = Store(s3_backend, root_path="data")
        >>> store.ping()  # Raises on error, succeeds silently
        >>> try:
        ...     store.ping()
        ... except (PermissionDenied, NotFound, BackendUnavailable) as e:
        ...     print(f"Backend unhealthy: {e}")
        """
Rationale:
- Familiar terminology from web/API health checks
- Short, memorable method name
- Consistent with Unix/network convention
Option B: backend.check_health() (Alternative)¶
    # Backend ABC level
    def check_health(self) -> None:
        """
        Verify the backend is healthy (reachable, credentials valid).
        """

Tradeoff:
- More explicit about intent
- Longer name
- Less standard in Python ecosystem
Design Decision: Both¶
- Expose at Store level as ping() (user-facing)
- Implement at Backend ABC level as check_health() (backend contract)
- Store delegates to backend
5. Error Mapping¶
Health checks should raise existing error types from the error model (005-error-model.md):
| Condition | Exception Type | Details |
|---|---|---|
| Credentials invalid | PermissionDenied | Backend explicitly rejects credentials |
| Root path missing | NotFound | Bucket, container, FS, or directory does not exist |
| Network unreachable | BackendUnavailable | Timeout, connection refused, DNS failure |
| Other backend error | BackendUnavailable | Generic fallback for unmapped errors |
No new error types needed — existing error model covers all scenarios.
6. Integration Points¶
6.1 Store Lifecycle¶
- Optional in __init__() or shortly after for fail-fast bootstrap
- Separate call, not automatic (user controls when to verify)
- No dependency on Registry: Store can call ping() independently
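A startup gate built on ping() might look like the sketch below. The exception classes are stand-ins for the remote-store error types, and `startup_gate` with its `attempts`, `delay`, and `sleep` parameters is a hypothetical helper. Retrying on BackendUnavailable is an assumption added here for illustration; credential and missing-root errors are not retried, so they fail fast.

```python
import time


# Stand-ins for the remote-store error types named in this document.
class PermissionDenied(Exception): ...
class NotFound(Exception): ...
class BackendUnavailable(Exception): ...


def startup_gate(store, attempts: int = 3, delay: float = 1.0, sleep=time.sleep) -> None:
    """Block startup until store.ping() succeeds.

    BackendUnavailable is retried up to `attempts` times because it may be
    transient. PermissionDenied and NotFound propagate on the first call,
    since waiting will not fix bad credentials or a missing root.
    """
    for attempt in range(1, attempts + 1):
        try:
            store.ping()
            return
        except BackendUnavailable:
            if attempt == attempts:
                raise
            sleep(delay)
```

An application would call this once during bootstrap, before it starts accepting traffic, and let any exception abort the process.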
6.2 Registry Health¶
- Could add Registry.ping() to verify all registered backends
- Iterates all backends, calls store.ping() on each
- Raises the first error or succeeds silently
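The registry-level check could be as small as the sketch below. The Registry internals are not described in this document, so the sketch takes a plain name-to-store mapping; `ping_all`, `ping_report`, and the `Pingable` protocol are hypothetical names. `ping_report` is an optional variant that goes beyond the behavior listed above by collecting every failure.

```python
from typing import Dict, Mapping, Optional, Protocol


class Pingable(Protocol):
    def ping(self) -> None: ...


def ping_all(stores: Mapping[str, Pingable]) -> None:
    """Ping every store in order; the first failure propagates to the caller."""
    for store in stores.values():
        store.ping()


def ping_report(stores: Mapping[str, Pingable]) -> Dict[str, Optional[Exception]]:
    """Ping every store and record failures instead of stopping at the first."""
    report: Dict[str, Optional[Exception]] = {}
    for name, store in stores.items():
        try:
            store.ping()
        except Exception as exc:
            report[name] = exc
        else:
            report[name] = None
    return report
```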
6.3 Observability Integration¶
- ext.observe hooks can wrap health checks for monitoring
- OpenTelemetry span for ping() call
- Metrics: health check latency, failure rate
- Logging: debug-level entry/exit, info-level on failure
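The logging and latency points above can be illustrated with the standard library alone. This sketch does not use the ext.observe API, which is not specified in this document; `observed_ping` and the logger name are hypothetical.

```python
import logging
import time

logger = logging.getLogger("remote_store.health")  # hypothetical logger name


def observed_ping(store) -> float:
    """Call store.ping(), log entry/exit at debug and failure at info.

    Returns the elapsed time in seconds so a caller can feed a latency metric.
    Any exception from ping() is logged and re-raised unchanged.
    """
    logger.debug("ping start")
    start = time.perf_counter()
    try:
        store.ping()
    except Exception as exc:
        elapsed = time.perf_counter() - start
        logger.info("ping failed after %.3fs: %s", elapsed, exc)
        raise
    elapsed = time.perf_counter() - start
    logger.debug("ping ok in %.3fs", elapsed)
    return elapsed
```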
6.4 No Tight Coupling¶
- Health check is independent of existing Store/Backend methods
- Does not affect caching (ext.cache)
- Does not interact with batch operations
- Does not require capabilities (not capability-gated)
7. Testing Strategy¶
7.1 Unit Tests per Backend¶
For each backend, test:
- Success case: Healthy backend → no exception
- PermissionDenied: Invalid credentials → raises PermissionDenied
- NotFound: Missing bucket/path → raises NotFound
- BackendUnavailable: Mock network failure → raises BackendUnavailable
7.2 Conformance Tests¶
In test_conformance.py, add:
- test_check_health_success() — all backends pass when healthy
- test_check_health_missing_root() — NotFound when root missing (skip Memory)
- Error injection tests per backend via mock
7.3 Integration Tests¶
- DockerBackend fixtures: real MinIO, Azurite, SFTP
- Verify actual latency / performance (should stay within the 1–2 second budget from section 2.2, ideally under 1 second)
7.4 Examples¶
- Simple script: examples/health_check.py or section in existing example
- Shows Store + health check pattern for startup validation
8. Specification Outline (for ADR / Spec)¶
Spec Structure¶
Spec: sdd/specs/025-health-check.md (tentative number)
Sections:
- Overview: use cases and design constraints
- Store API: store.ping() signature and semantics
- Backend ABC: Backend.check_health() contract
- Per-backend implementation: strategies and error mapping
- Error Handling: detailed error types and conditions
- Integration: observability, Registry, lifecycle
- Non-requirements: what health checks do NOT do
- Testing: unit, conformance, integration, examples
Traceability Markers:
- PING-001 through PING-NNN for spec requirements
- Store method: STORE-016 (verify at spec authoring; historical note: STORE-015 was once duplicated in spec 001-store-api.md across native_path() and glob() — resolved under BK-250, glob() is now STORE-018)
- Backend method: BE-026 (next available; BE-025 is native_path())
9. Implementation Roadmap¶
Phase 1: Core¶
- Add Backend.check_health() ABC method (all 6 backends implement)
- Implement per-backend logic (Local, S3, S3-PyArrow, SFTP, Azure, Memory)
- Add Store.ping() delegation
- Unit tests per backend + conformance suite
- Spec + ADR
Phase 2: Observability¶
- ext.observe wiring (hooks for health checks)
- OpenTelemetry span support
- Example: examples/health_check.py
Phase 3: Registry Integration¶
- Registry.ping_all() or similar (optional convenience method)
- Docs integration
10. Known Considerations and Open Questions¶
10.1 Should health checks be capability-gated?¶
Decision: No.
Rationale:
- Health checks are orthogonal to data operations
- All backends can provide some form of health verification
- Failure should never be due to missing capability, only backend unavailability
10.2 Should health checks cache results?¶
Decision: No, always make a live call.
Rationale:
- Health checks are meant to detect transient failures
- Caching would defeat the purpose
- Caller can implement their own caching if needed
- ext.cache applies to data operations, not health checks
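For callers who do want to rate-limit live checks, a caller-side wrapper is enough. The sketch below is one possible shape, not part of the proposed API: `CachedPing`, `ttl`, and `clock` are hypothetical names. Only successes are remembered, so a failing backend is always re-checked.

```python
import time


class CachedPing:
    """Caller-side wrapper that skips ping() if one succeeded within `ttl` seconds."""

    def __init__(self, store, ttl: float = 5.0, clock=time.monotonic):
        self._store = store
        self._ttl = ttl
        self._clock = clock
        self._last_ok = None  # monotonic timestamp of the last successful ping

    def ping(self) -> None:
        now = self._clock()
        if self._last_ok is not None and now - self._last_ok < self._ttl:
            return  # recent success: skip the live call
        self._store.ping()  # raises on failure, so failures are never cached
        self._last_ok = now
```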
10.3 What about timeout configuration?¶
Decision: Deferred; use backend's default timeout.
Rationale:
- Adding timeout parameters complicates the API
- Backends already have timeout configuration
- Can revisit if customers need tunable timeouts
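Until timeouts are configurable, a caller that needs a hard deadline (for example a liveness probe handler) can impose one from the outside. The sketch below is an assumption-laden illustration: `ping_with_deadline` is a hypothetical helper, BackendUnavailable is a stand-in class, and the worker thread is abandoned rather than cancelled, so the underlying call still runs until the backend's own timeout fires.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


# Stand-in for the remote-store BackendUnavailable error type.
class BackendUnavailable(Exception): ...


def ping_with_deadline(store, seconds: float) -> None:
    """Run store.ping() in a worker thread and give up after `seconds`.

    Errors raised by ping() itself propagate unchanged. Exceeding the
    deadline is reported as BackendUnavailable.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(store.ping)
    try:
        future.result(timeout=seconds)
    except FutureTimeout as exc:
        raise BackendUnavailable(f"ping exceeded {seconds}s deadline") from exc
    finally:
        # Do not wait for an overdue ping; the thread finishes on its own.
        pool.shutdown(wait=False)
```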
10.4 Should health checks verify read or write capability?¶
Decision: Read only (minimal verification).
Rationale:
- Write verification would require creating/deleting test files (side effects)
- Read (existence, permissions) is sufficient for startup gates and liveness probes
- Users can test write capability separately if needed
10.5 Return type: void (None) vs. bool vs. object?¶
Decision: None (void).
Rationale:
- Consistent with remote-store error convention: success = return silently, failure = raise
- Boolean return would require catch-all exception for unhealthy state
- None is idiomatic Python for "operation succeeded"
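Callers that do need a boolean, such as an HTTP liveness endpoint, can derive one in a few lines. The sketch below uses stand-in exception classes and a hypothetical `is_healthy` helper. It catches only the three documented error types, so unexpected exceptions (programming errors, for instance) still surface.

```python
# Stand-ins for the remote-store error types named in this document.
class PermissionDenied(Exception): ...
class NotFound(Exception): ...
class BackendUnavailable(Exception): ...


def is_healthy(store) -> bool:
    """Collapse ping()'s raise-on-failure contract into True/False."""
    try:
        store.ping()
    except (PermissionDenied, NotFound, BackendUnavailable):
        return False
    return True
```

The cost of this convenience is that the reason for the failure is discarded, which is why the raising form remains the primary API.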
11. Comparison with Other Libraries¶
AWS SDK (boto3)¶
- s3_client.head_bucket(): validates bucket access
- Returns response metadata; no exception = success
- remote-store aligns with this pattern
Google Cloud (google-cloud-storage)¶
- bucket.exists(): checks bucket existence
- Similar lightweight metadata call
Azure SDK¶
- ContainerClient.exists() / get_container_properties()
- Both lightweight, non-destructive
fsspec (via s3fs, adlfs, paramiko)¶
- No standard health check interface
- Each provider has its own validation mechanism
- remote-store standardizing across backends is an improvement
12. Summary and Recommendations¶
- Method names: Store.ping() (user-facing), Backend.check_health() (implementation)
- Error types: Use existing PermissionDenied, NotFound, BackendUnavailable
- Implementation: Per-backend lightweight checks (HeadBucket, stat, exists calls)
- No new capabilities: Health checks are always available
- Non-destructive: Verify without side effects
- Testing: Unit + conformance + integration against real backends
- Spec: Separate spec document (025-health-check.md)
- Integration: Optional observability hooks via ext.observe
- No caching, no timeout parameters, no return value: keep API minimal and idiomatic
Appendix: Pseudocode¶
Store.ping()¶
    def ping(self) -> None:
        """Verify backend is reachable and credentials work."""
        self._backend.check_health()
Backend.check_health() (ABC)¶
    @abstractmethod
    def check_health(self) -> None:
        """
        Verify the backend is healthy (reachable, credentials valid).

        Raises
        ------
        PermissionDenied
            Credentials invalid or insufficient.
        NotFound
            Root path / container / filesystem does not exist.
        BackendUnavailable
            Backend unreachable (network, timeout, etc.).
        """
LocalBackend.check_health()¶
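    import os

    def check_health(self) -> None:
        """Verify root path exists and is readable."""
        try:
            exists = self._root.exists()
            readable = exists and os.access(self._root, os.R_OK)
        except PermissionError as e:
            # For example, a parent directory is not traversable
            raise PermissionDenied(f"No read access to {self._root}") from e
        except OSError as e:
            # Any other OS-level failure: report the backend as unavailable
            raise BackendUnavailable(f"Cannot check {self._root}: {e}") from e
        if not exists:
            raise NotFound(f"Root path does not exist: {self._root}")
        if not readable:
            raise PermissionDenied(f"No read access to {self._root}")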
S3Backend.check_health()¶
    def check_health(self) -> None:
        """Verify bucket exists and credentials work via HeadBucket."""
        with self._errors():
            # Option A: HeadBucket through s3fs's synchronous wrapper
            # (self._fs.s3 is an async aiobotocore client in current s3fs)
            self._fs.call_s3("head_bucket", Bucket=self._bucket)
            # Option B: s3fs info call
            # self._fs.info(self._bucket)
S3PyArrowBackend.check_health()¶
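    from pyarrow.fs import FileType

    def check_health(self) -> None:
        """Verify bucket exists and credentials work via PyArrow metadata."""
        with self._errors():
            info = self._pa_fs.get_file_info(self._bucket)
        if info.type == FileType.NotFound:  # missing paths are reported, not raised
            raise NotFound(f"Bucket does not exist: {self._bucket}")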
Author notes: This research provides a complete blueprint for implementing health checks. The design is minimal, portable, and consistent with the existing architecture. No blocking issues identified.