Research: Read-Only HTTP Backend¶
Item ID: ID-082 Date: 2026-03-15 Status: Research complete — ready for spec consideration
1. Problem Statement¶
Users already use remote-store for storage (local, S3, SFTP, Azure). Sometimes another kind of "remote stored thing" enters the picture: files hosted at an HTTP URL — government open data portals, dataset registries, static file servers, CDN-hosted assets, package archives, etc.
A ReadOnlyHttpBackend would treat an HTTP endpoint as just another backend.
Files behind a URL become accessible through the same Store interface, with
the same composability (ext.cache, ext.transfer, ext.observe,
ext.batch) that users already rely on for other backends.
Why a backend, not an extension?¶
An extension cannot provide Store.read() — it would need to reimplement the
entire Store interface. A backend slots into the existing architecture
naturally: capability gating, error mapping, registry lifecycle, and all
extensions work out of the box.
Design constraints¶
- Core package has zero runtime dependencies (dependencies = []).
- HTTP library must be optional (urllib from stdlib as baseline, requests or httpx as optional extras).
- The backend is read-only: write, delete, move, and copy operations raise CapabilityNotSupported.
- Must handle real-world HTTP concerns: redirects, content-type, timeouts, auth headers.
2. Capability Profile¶
| Capability | Supported | Notes |
|---|---|---|
| READ | Yes | Core value: GET request, return body as stream |
| WRITE | — | Raises CapabilityNotSupported |
| DELETE | — | Raises CapabilityNotSupported |
| LIST | — | No reliable server-side mechanism (see §5) |
| MOVE | — | Raises CapabilityNotSupported |
| COPY | — | Raises CapabilityNotSupported |
| ATOMIC_WRITE | — | Raises CapabilityNotSupported |
| METADATA | Yes | HEAD request -> size, content-type, last-modified, ETag |
| GLOB | — | No server-side pattern matching |
Capability set: {READ, METADATA}
This would be the first backend with only 2 capabilities. The capability
system already handles this — Store gates every operation and raises
CapabilityNotSupported with clear context.
3. Path Semantics¶
3.1 Base URL + relative path¶
The backend takes a base_url at construction. Paths are appended:
backend = ReadOnlyHttpBackend(base_url="https://data.example.com/datasets/")
# store.read("population/2024.csv")
# -> GET https://data.example.com/datasets/population/2024.csv
3.2 The urljoin trailing-slash footgun¶
urllib.parse.urljoin has surprising behavior with trailing slashes:
```python
from urllib.parse import urljoin

urljoin("https://example.com/data", "file.csv")
# -> "https://example.com/file.csv" (WRONG -- replaces last segment)
urljoin("https://example.com/data/", "file.csv")
# -> "https://example.com/data/file.csv" (correct)
```
Mitigation: The constructor normalizes base_url to always end with /.
Path construction uses simple string concatenation (base_url + quote(path))
rather than urljoin, avoiding the footgun entirely. urljoin is only needed
if we ever support relative ../ paths, which we don't (path validation
rejects ..).
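The concatenation approach can be sketched as follows. This is a minimal illustration, not the backend's actual code: the helper names `normalize_base_url` and `build_url` are hypothetical, and path validation is assumed to have happened already.

```python
from urllib.parse import quote


def normalize_base_url(base_url: str) -> str:
    """Return the base URL with exactly one trailing slash."""
    return base_url.rstrip("/") + "/"


def build_url(base_url: str, path: str) -> str:
    """Build a request URL by concatenation instead of urljoin.

    The path is percent-encoded, keeping "/" as the segment separator.
    """
    return normalize_base_url(base_url) + quote(path, safe="/")
```

Because the base URL is normalized first, a missing trailing slash no longer changes the result.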
3.3 Path validation¶
- Standard remote-store path rules apply (no .., no null bytes, no absolute paths).
- The backend URL-encodes paths internally via urllib.parse.quote(path, safe="/") when constructing request URLs. User-visible paths remain unencoded.
3.4 native_path() and to_key()¶
- native_path(path) -> full URL string (e.g., "https://data.example.com/datasets/population/2024.csv")
- to_key(native_path) -> strips base_url prefix, returns relative key
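A minimal sketch of the round trip between the two methods, written as free functions for brevity. Two points are assumptions rather than settled design: that to_key() percent-decodes the remainder (so keys stay unencoded, matching §3.3), and that a URL outside the base raises ValueError. `BASE_URL` is a placeholder for the normalized constructor argument.

```python
from urllib.parse import quote, unquote

# Placeholder for the normalized base_url held by the backend.
BASE_URL = "https://data.example.com/datasets/"


def native_path(path: str) -> str:
    """Map a store-relative path to the full request URL."""
    return BASE_URL + quote(path, safe="/")


def to_key(url: str) -> str:
    """Map a full URL back to the store-relative key."""
    if not url.startswith(BASE_URL):
        # Assumption: URLs outside the base are rejected.
        raise ValueError(f"{url!r} is not under {BASE_URL!r}")
    # Assumption: decode so that to_key(native_path(p)) == p.
    return unquote(url[len(BASE_URL):])
```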
4. HTTP Library Strategy¶
4.1 Tiered approach¶
| Tier | Library | Dependency | Pros | Cons |
|---|---|---|---|---|
| Baseline | urllib.request | stdlib | Zero deps, always available | No connection pooling, clunky API, no async |
| Standard | requests | optional extra | Industry standard, sessions, auth adapters | Sync only, heavy dep tree |
| Advanced | httpx | optional extra | Sync + async, HTTP/2, modern API | Newer, smaller ecosystem |
4.2 Recommendation¶
Single backend, pluggable transport. Rather than separate backends per HTTP
library (the S3 vs S3-PyArrow model), use a single ReadOnlyHttpBackend that
auto-detects the best available library at init:
- If httpx is installed -> use it (best feature set)
- Else if requests is installed -> use it (most common)
- Else -> fall back to urllib.request (always available)
User can override: ReadOnlyHttpBackend(base_url=..., http_client="urllib").
Rationale: Unlike S3 vs S3-PyArrow (which have fundamentally different I/O models and performance profiles), the HTTP libraries are functionally interchangeable for our needs. One backend with swappable transport is simpler than three backends.
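One way the detection order could be implemented is with importlib.util.find_spec, which checks for a package without importing it. The function name `pick_http_client` and the error behavior for a forced but missing library are assumptions made for this sketch.

```python
from __future__ import annotations

from importlib.util import find_spec

# Preference order from the recommendation: httpx, then requests.
_OPTIONAL_CLIENTS = ("httpx", "requests")
_KNOWN_CLIENTS = ("urllib",) + _OPTIONAL_CLIENTS


def pick_http_client(forced: str | None = None) -> str:
    """Return the name of the transport to use."""
    if forced is not None:
        if forced not in _KNOWN_CLIENTS:
            raise ValueError(f"unknown http_client: {forced!r}")
        if forced != "urllib" and find_spec(forced) is None:
            # Assumption: forcing a missing library is an error.
            raise ImportError(f"{forced} is not installed")
        return forced
    for name in _OPTIONAL_CLIENTS:
        if find_spec(name) is not None:
            return name
    return "urllib"  # stdlib fallback, always available
```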
4.3 Transport abstraction¶
Internal protocol (not user-facing):
```python
from dataclasses import dataclass
from typing import BinaryIO, Protocol


@dataclass
class HttpResponse:
    status: int
    headers: dict[str, str]
    body: BinaryIO  # streaming body (only meaningful for GET)


class HttpTransport(Protocol):
    def get(self, url: str, headers: dict[str, str], timeout: float) -> HttpResponse: ...
    def head(self, url: str, headers: dict[str, str], timeout: float) -> HttpResponse: ...
    def close(self) -> None: ...
```
Three implementations: UrllibTransport, RequestsTransport, HttpxTransport.
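As an illustration of how small the baseline implementation could be, here is a rough sketch of a urllib-based transport. It is not the planned implementation: error mapping, retries, redirect limits, and TLS options are left out, and the `build_request` helper is a name chosen for this example. `HttpResponse` is repeated so the block is self-contained.

```python
from __future__ import annotations

import urllib.request
from dataclasses import dataclass
from typing import BinaryIO, cast


@dataclass
class HttpResponse:
    status: int
    headers: dict[str, str]
    body: BinaryIO


class UrllibTransport:
    """Stdlib-only transport: one new connection per request."""

    @staticmethod
    def build_request(method: str, url: str, headers: dict[str, str]) -> urllib.request.Request:
        return urllib.request.Request(url, headers=headers, method=method)

    def _send(self, method: str, url: str, headers: dict[str, str], timeout: float) -> HttpResponse:
        request = self.build_request(method, url, headers)
        # urlopen raises urllib.error.HTTPError for 4xx/5xx responses;
        # a real backend would catch that and map it (see the error table).
        raw = urllib.request.urlopen(request, timeout=timeout)
        return HttpResponse(
            status=raw.status,
            headers=dict(raw.headers.items()),
            body=cast(BinaryIO, raw),  # HTTPResponse is a readable stream
        )

    def get(self, url: str, headers: dict[str, str], timeout: float) -> HttpResponse:
        return self._send("GET", url, headers, timeout)

    def head(self, url: str, headers: dict[str, str], timeout: float) -> HttpResponse:
        return self._send("HEAD", url, headers, timeout)

    def close(self) -> None:
        # Nothing to release: urllib keeps no connection pool.
        return None
```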
4.4 urllib limitations (verified)¶
- No connection pooling. Each request opens a new TCP connection. Fine for occasional reads; poor for batch operations. requests/httpx sessions solve this.
- No async. Acceptable: all existing backends are sync.
- SSL works. urllib.request.urlopen validates TLS certificates by default via ssl.create_default_context(). verify_ssl=False would use ssl._create_unverified_context().
- Redirects handled. urllib follows redirects automatically (up to a built-in limit). Custom max_redirects requires a subclassed handler.
- Streaming works. Response object supports chunked read(size). See §11 for details.
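The TLS and redirect points above could be combined into a single opener, sketched below. Note one deliberate substitution: instead of the private ssl._create_unverified_context(), this sketch disables verification on a default context using public attributes. The names `LimitedRedirectHandler`, `build_ssl_context`, and `build_opener` are hypothetical.

```python
import ssl
import urllib.request


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler with a configurable redirect limit."""

    def __init__(self, max_redirects: int) -> None:
        super().__init__()
        # Overrides the class-level default used by HTTPRedirectHandler.
        self.max_redirections = max_redirects


def build_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    context = ssl.create_default_context()
    if not verify_ssl:
        # check_hostname must be disabled before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def build_opener(verify_ssl: bool = True, max_redirects: int = 5) -> urllib.request.OpenerDirector:
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=build_ssl_context(verify_ssl)),
        LimitedRedirectHandler(max_redirects),
    )
```

Passing the subclass instance to build_opener replaces the default redirect handler rather than adding a second one.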
5. LIST Capability — Why Not¶
HTTP has no native directory listing. The options considered:
| Approach | Verdict | Reason |
|---|---|---|
| No LIST | Chosen | Clean, honest, no hacks |
| Manifest-based LIST | Deferred | Requires user to maintain sidecar file; could add later via manifest_path param |
| HTML index parsing | Rejected | Fragile (HTML varies by server), security risk (arbitrary HTML) |
| API-specific listing | Out of scope | Belongs in focused extensions (e.g., ext.ckan) |
Users who need listing use an external catalog (API, manifest, database) to
discover paths, then store.read() each one. If demand justifies it, a
manifest_path constructor argument could upgrade the capability set to
{READ, METADATA, LIST} in a future phase.
6. Complete Method Mapping¶
Every Backend ABC method and its HTTP implementation:
| Method | Implementation | Notes |
|---|---|---|
| name | "http" | See §18 Q1 for naming rationale |
| capabilities | {READ, METADATA} | Fixed set |
| exists(path) | HEAD -> 200=True, 404=False | |
| is_file(path) | HEAD -> 200=True, 404=False | HTTP resources are always "files" |
| is_folder(path) | Always False | No folder concept without LIST |
| read(path) | GET -> _ErrorMappingStream(response) | Non-seekable stream, see §11 |
| read_bytes(path) | GET -> response.read() | Fully buffered |
| write(...) | Raise CapabilityNotSupported | |
| write_atomic(...) | Raise CapabilityNotSupported | |
| open_atomic(...) | Raise CapabilityNotSupported | |
| delete(...) | Raise CapabilityNotSupported | |
| delete_folder(...) | Raise CapabilityNotSupported | |
| list_files(...) | Raise CapabilityNotSupported | |
| list_folders(...) | Raise CapabilityNotSupported | |
| iter_children(...) | Raise CapabilityNotSupported | Default impl calls list_files+list_folders; override to raise directly |
| get_file_info(path) | HEAD -> FileInfo(...) | See §12 for field mapping |
| get_folder_info(path) | Raise NotFound | No folder concept; consistent with is_folder() returning False |
| move(...) | Raise CapabilityNotSupported | |
| copy(...) | Raise CapabilityNotSupported | |
| glob(...) | Raise CapabilityNotSupported | Default impl already does this |
| check_health() | HEAD base_url -> raise BackendUnavailable on failure | |
| native_path(path) | Return full URL string | |
| to_key(url) | Strip base_url prefix | |
| close() | Close transport (connection pool if applicable) | No-op for urllib |
| unwrap(type_hint) | Return underlying transport if type matches | e.g., unwrap(httpx.Client) |
7. Error Mapping¶
| HTTP Status | remote-store Error | Notes |
|---|---|---|
| 200, 204 | Success | |
| 301, 302, 307, 308 | Follow redirect (up to limit) | Map final status |
| 401, 403 | PermissionDenied | |
| 404 | NotFound | |
| 408, 429, 500, 502, 503, 504 | BackendUnavailable | Transient |
| Other 4xx | RemoteStoreError | Generic |
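The table translates into a small classification function. In this sketch the exception classes are local stand-ins for the real remote-store hierarchy, and `classify_status` is a hypothetical name. Treating every status not listed in the table (not only "other 4xx") as the generic error is an assumption.

```python
from __future__ import annotations


class RemoteStoreError(Exception):
    """Stand-in for the library's base error."""


class NotFound(RemoteStoreError):
    pass


class PermissionDenied(RemoteStoreError):
    pass


class BackendUnavailable(RemoteStoreError):
    pass


_TRANSIENT = frozenset({408, 429, 500, 502, 503, 504})


def classify_status(status: int) -> type[RemoteStoreError] | None:
    """Return the error class for a final HTTP status, or None on success."""
    if status in (200, 204):
        return None
    if status in (401, 403):
        return PermissionDenied
    if status == 404:
        return NotFound
    if status in _TRANSIENT:
        return BackendUnavailable
    # Assumption: anything else falls back to the generic error.
    return RemoteStoreError
```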
8. Configuration & Auth¶
8.1 Constructor signature (sketch)¶
```python
ReadOnlyHttpBackend(
    base_url: str,
    *,
    headers: dict[str, str] | None = None,  # custom headers (API keys, auth tokens)
    timeout: float = 30.0,                  # request timeout in seconds
    retry: RetryPolicy | None = None,       # retry config (same as S3/SFTP/Azure)
    http_client: str | None = None,         # force "urllib", "requests", or "httpx"
    verify_ssl: bool = True,                # TLS verification
    max_redirects: int = 5,                 # redirect follow limit
)
```
8.2 Auth patterns¶
- API key in header: headers={"Authorization": "Bearer <token>"} or headers={"X-API-Key": "<key>"}
- No auth: Most open data portals need nothing
- Advanced auth (OAuth, mutual TLS): Out of scope for v1. Users can pre-configure an httpx.Client and pass it via a future client parameter.
8.3 Registry integration¶
```yaml
# store config
stores:
  opendata:
    backend: http
    base_url: "https://data.example.com/datasets/"
    options:
      timeout: 60
      headers:
        X-API-Key: "${OPENDATA_API_KEY}"
```
9. Composability with Existing Extensions¶
This is the primary value of making it a backend vs. standalone code:
| Extension | Benefit |
|---|---|
| ext.cache | TTL-based caching of read() results; critical for HTTP because it avoids repeated downloads |
| ext.transfer | download(store, "dataset.csv", local_path) works out of the box |
| ext.observe | Instrument HTTP reads with callbacks (timing, logging) |
| ext.batch | batch_exists(store, paths) checks multiple resources |
| ext.arrow | read_table(store, "data.parquet") reads remote Parquet/CSV via PyArrow |
The ext.cache composability alone justifies the backend approach over ad-hoc
HTTP code.
Note: ext.glob requires LIST capability, so it won't work with this backend.
10. Conformance Suite Impact¶
10.1 Current state of capability-gating¶
The conformance suite (tests/backends/test_conformance.py) has 19 test
classes with 69 test methods. Only two capabilities are currently gated:
| Capability | Gated? | Tests |
|---|---|---|
| ATOMIC_WRITE | Yes | 7 tests skip cleanly |
| GLOB | Yes | 2 tests skip cleanly |
| WRITE, DELETE, LIST, MOVE, COPY, METADATA | No | ~60 tests have no capability checks |
10.2 What breaks for a {READ, METADATA} backend¶
Most test classes set up test data by calling backend.write() before
asserting read behavior. This means even read/metadata tests will fail — not
because the backend can't read, but because the test can't set up fixtures.
Tests that would need changes:
| Test Class | Issue | Fix |
|---|---|---|
| TestBackendExists | Calls write() in setup | Gate on WRITE or use pre-seeded fixture |
| TestBackendFileFolder | Calls write() in setup | Gate on WRITE |
| TestBackendRead | Calls write() in setup | Gate on WRITE or pre-seed |
| TestBackendWrite | Tests write operations | Gate on WRITE |
| TestBackendDelete | Tests delete operations | Gate on DELETE |
| TestBackendListing | Tests list operations | Gate on LIST |
| TestBackendIterChildren | Tests list operations | Gate on LIST |
| TestBackendMetadata | Calls write() in setup | Gate on WRITE or pre-seed |
| TestBackendMove | Tests move operations | Gate on MOVE |
| TestBackendCopy | Tests copy operations | Gate on COPY |
| TestStreamingConformance | 4 read tests (SIO-001) call write() for setup; 2 write tests (SIO-003) test write behavior | Read tests: gate on WRITE or pre-seed; write tests: gate on WRITE |
| TestBackendToKey | 3 of 4 methods pass; test_to_key_round_trip_with_listing calls write() + list_files() | Gate 1 method on WRITE+LIST |
Tests that pass as-is:
| Test Class | Why |
|---|---|
| TestBackendIdentity | Only checks name, capabilities, repr |
| TestBackendWriteAtomic | Already gated on ATOMIC_WRITE |
| TestBackendOpenAtomic | Already gated on ATOMIC_WRITE |
| TestBackendLifecycle | Only checks that close() is callable |
| TestBackendGlob | Already gated on GLOB |
| TestBackendUnwrap | Only checks unwrap raises or returns |
| TestBackendNativePath | Only checks path round-trip |
10.3 Proposed conformance changes¶
Two-pronged approach:
- Add capability gates to test classes that test write/delete/move/copy operations. Pattern: if not backend.capabilities.supports(Capability.X): pytest.skip(...). This is the same pattern already used for ATOMIC_WRITE and GLOB. Straightforward, ~15 lines of changes.
- Pre-seeded fixture for read-only backends. Tests that verify read behavior (TestBackendRead, TestBackendMetadata, TestBackendExists) need test data. For writable backends, they create it inline. For read-only backends, provide a conftest.py fixture that pre-seeds the HTTP mock server with test files. The conformance test checks backend.capabilities.supports(Capability.WRITE): if true, write inline; if false, assume the fixture pre-seeded the data.
Estimated effort: Small. The capability-gating pattern is established.
The pre-seeded fixture is the only new concept, and it's just a
pytest-httpserver fixture that serves a few static files.
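The gating pattern could be factored into a small helper so each test class needs only one line. This is a sketch under assumptions: `Capability` here is a local stand-in enum, capabilities are modeled as a plain set rather than the real object with supports(), and the helper names are hypothetical. pytest is imported lazily so the pure check can be used without it.

```python
from __future__ import annotations

from enum import Enum


class Capability(Enum):
    """Stand-in for the library's capability enum."""

    READ = "read"
    WRITE = "write"
    DELETE = "delete"
    LIST = "list"
    METADATA = "metadata"


def missing_capabilities(supported: set[Capability], *required: Capability) -> list[str]:
    """Names of required capabilities the backend does not support."""
    return sorted(cap.name for cap in required if cap not in supported)


def skip_unless(supported: set[Capability], *required: Capability) -> None:
    """Skip the current test when a required capability is missing."""
    missing = missing_capabilities(supported, *required)
    if missing:
        import pytest  # only needed when a skip is actually issued

        pytest.skip(f"backend lacks: {', '.join(missing)}")
```

A test that writes fixtures and lists them would then start with skip_unless(caps, Capability.WRITE, Capability.LIST).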
11. Stream Lifecycle (read() return value)¶
11.1 What the spec requires (SIO-001)¶
From spec 006-streaming-io.md: "The returned stream is not guaranteed to be
seekable. Seekability is a backend-level property (e.g. local files are
seekable, HTTP-based streams typically are not), not a Store API contract."
The streaming conformance tests (TestStreamingConformance) verify:
- Stream is not a BytesIO wrapper (must be a real stream)
- Chunked read(size) works
- Stream supports context manager protocol
- Seekability is NOT tested as a requirement
11.2 urllib.request response as BinaryIO¶
urllib.request.urlopen() returns http.client.HTTPResponse, which:
- Inherits from io.BufferedIOBase (not RawIOBase)
- Supports: read(size), readline(size), readinto(b), close()
- Reports seekable() -> False
- Has __enter__/__exit__ (context manager)
11.3 Wrapping with _ErrorMappingStream¶
_ErrorMappingStream delegates all I/O to the inner stream and maps
exceptions to remote-store errors. It handles imperfect streams gracefully:
- seek() returning None (paramiko quirk) -> falls back to tell()
- seekable() missing -> returns False
- tell() returning None -> returns 0
- close() exceptions -> suppressed
Verdict: _ErrorMappingStream(http_response, ...) works directly.
Do NOT wrap in io.BufferedReader — unlike S3/SFTP backends, the HTTP
response is already buffered (BufferedIOBase). Double-buffering would be
wasteful and could cause issues.
Return pattern:
```python
def read(self, path: str) -> BinaryIO:
    response = self._transport.get(self._url(path), ...)
    return cast("BinaryIO", _ErrorMappingStream(response.body, self._classify_error, path))
```
11.4 Connection lifecycle¶
The HTTP connection stays open while the stream is open. This is the same
pattern as S3 (s3fs holds the connection) and SFTP (paramiko holds the
channel). The stream's close() releases the connection.
For urllib, this means one TCP connection per open stream. For requests/httpx with session pooling, the connection returns to the pool on close.
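From the caller's side, the lifecycle above means a stream should be consumed in chunks and closed promptly. A minimal sketch, assuming only that the stream supports read(size) and the context manager protocol (both covered by SIO-001); `copy_stream` is a hypothetical helper, not part of the library.

```python
import io
from typing import BinaryIO


def copy_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Copy src to dst in chunks and return the number of bytes copied.

    No seek() is used, so this works for non-seekable HTTP streams.
    """
    total = 0
    with src:  # closing the stream releases the underlying connection
        while chunk := src.read(chunk_size):
            dst.write(chunk)
            total += len(chunk)
    return total
```

With a store, usage would look like copy_stream(store.read("population/2024.csv"), open("2024.csv", "wb")).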
12. FileInfo Field Mapping from HTTP Headers¶
| FileInfo field | HTTP header | Handling when missing |
|---|---|---|
| path | From request path | Always available |
| name | From path | Always available |
| size | Content-Length | 0 if missing (chunked transfer, dynamic content) |
| modified_at | Last-Modified | datetime.min.replace(tzinfo=timezone.utc) if missing |
| checksum | ETag | None (optional field) |
| content_type | Content-Type | None (optional field) |
| extra | All response headers | {"headers": dict(response.headers)} |
Notes:
- Content-Length is absent for chunked responses and some CDNs. Using 0 as
a fallback is imperfect — code checking file_info.size == 0 (e.g.,
skip-empty-file logic, progress bars, ext.transfer pre-allocation) would
misinterpret "unknown" as "zero bytes". Since FileInfo.size is int (not
Optional[int]), there is no clean sentinel today. The spec should note this
as a known limitation that may warrant making size Optional in a future
FileInfo revision.
- Last-Modified is absent on many static file hosts and CDNs. The sentinel
must be timezone-aware (datetime.min.replace(tzinfo=timezone.utc)) because
S3 and Azure backends return timezone-aware modified_at values — mixing
naive and aware datetimes raises TypeError in user code (sorting, filtering,
ext.transfer comparisons). Sorting by modified_at would still place HTTP
files at the beginning of any list. datetime(1970, 1, 1, tzinfo=UTC) (Unix
epoch) is a more conventional sentinel but carries the same ambiguity. Like
size, modified_at may warrant Optional treatment in a future FileInfo
revision.
- ETag maps naturally to checksum — both are opaque identifiers for
content versioning. Useful for ext.cache integration.
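The header handling described in the table and notes can be sketched with the stdlib date parser. The function returns a plain dict rather than a real FileInfo, and the names `parse_head_headers` and `UNKNOWN_MTIME` are hypothetical. Treating an unparseable Last-Modified the same as a missing one is an assumption.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Timezone-aware sentinel for a missing Last-Modified header.
UNKNOWN_MTIME = datetime.min.replace(tzinfo=timezone.utc)


def parse_head_headers(headers: dict[str, str]) -> dict[str, object]:
    """Map HEAD response headers to FileInfo-style fields."""
    lowered = {key.lower(): value for key, value in headers.items()}

    raw_size = lowered.get("content-length", "")
    size = int(raw_size) if raw_size.isdigit() else 0  # 0 means unknown

    try:
        modified_at = parsedate_to_datetime(lowered["last-modified"])
    except (KeyError, TypeError, ValueError):
        modified_at = UNKNOWN_MTIME
    if modified_at.tzinfo is None:
        # Keep the value timezone-aware, as other backends do.
        modified_at = modified_at.replace(tzinfo=timezone.utc)

    return {
        "size": size,
        "modified_at": modified_at,
        "checksum": lowered.get("etag"),
        "content_type": lowered.get("content-type"),
    }
```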
13. Prior Art — Build vs. Reuse¶
The key question is: can we use an existing library instead of writing our own HTTP backend? The answer is no — but the implementation is small enough that this is fine.
13.1 Why not wrap fsspec HTTPFileSystem?¶
fsspec's HTTP support is the closest match. It provides read-only HTTP access, streaming, and even directory listing via HTML parsing.
Why it doesn't fit:
- fsspec is a heavyweight dependency (pulls in aiohttp for HTTP).
remote-store's core has zero runtime deps.
- fsspec's HTTPFileSystem exposes an AbstractFileSystem interface, not
our Backend interface. Wrapping it would mean adapting every method —
the wrapper would be roughly the same size as a direct implementation.
- Its HTML-based directory listing is fragile and not something we'd want.
- Its Range-based seeking adds complexity we don't need.
fsspec validates that the concept works, but there's nothing to reuse.
13.2 Why not wrap smart_open?¶
smart_open.open("https://...") gives a streaming reader. But:
- It's a single open() function, not a filesystem abstraction. No
exists(), get_file_info(), check_health(), or any metadata support.
- Wrapping it would provide only read() — we'd still implement everything
else ourselves.
- It requires requests as a dependency.
13.3 Why not use requests/httpx directly as the backend?¶
We do: that is exactly the transport layer (§4). The "build" here is the thin Backend adapter (~150 lines) that maps HTTP semantics to remote-store's interface. The actual HTTP work is delegated to urllib/requests/httpx.
13.4 What we learn from prior art¶
| Project | Lesson for us |
|---|---|
| fsspec HTTPFileSystem | Concept is proven. Skip HTML listing and Range seeking. |
| smart_open | Simple HTTP read adapter has demand. We add metadata + composability on top. |
| Hugging Face Hub | Domain-specific HTTP access belongs in extensions, not the base backend. |
| DVC | ETag -> checksum mapping works well. Validates ext.cache integration pattern. |
13.5 Implementation size estimate¶
The HTTP backend is a thin adapter over standard HTTP libraries. Estimated:
- Backend class: ~150 lines (method mapping, error mapping, path handling)
- Transport protocol + urllib impl: ~80 lines
- requests/httpx transports: ~50 lines each (optional)
This is much smaller than MemoryBackend (~505 lines) and S3Backend
(~440 lines). Not a wheel worth importing — simpler to build.
14. Real-World HTTP Endpoint Behavior¶
Tested against representative public endpoints to validate assumptions:
| Endpoint | Content-Length | Last-Modified | ETag | HEAD | Redirects |
|---|---|---|---|---|---|
| GitHub raw (raw.githubusercontent.com) | Yes | Yes | Yes (weak) | Yes | Yes (1 redirect from github.com) |
| PyPI simple index (pypi.org) | Yes | — | Yes | Yes | Yes (http->https) |
| PyPI package files (files.pythonhosted.org) | Yes | Yes | Yes | Yes | — |
| opendata.swiss (lindas API) | Varies | — | — | Yes | — |
| CDN-hosted static files (typical) | Yes | Yes | Yes | Yes | — |
Findings:
- HEAD was supported by every endpoint tested, so exists() and get_file_info() can rely on it.
- Content-Length is present for static files, sometimes missing for API responses.
- Last-Modified is often missing on API endpoints and CDNs.
- ETag is common on static file servers, rare on dynamic APIs.
- Redirects are common (http->https, domain aliases). Following redirects is mandatory.
15. Risk Assessment¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| urllib can't produce conformant streams | Low | High | Verified: _ErrorMappingStream wraps HTTPResponse correctly (§11) |
| Conformance suite changes break other backends | Low | Medium | Changes are additive (capability gates); existing backends unaffected |
| Scope creep toward WebDAV/write support | Medium | Medium | Hard boundary: backend name is http, not webdav; no write methods |
| Content-Length missing breaks FileInfo | Low | Medium | Use size=0 fallback; document as known limitation (see §12) |
| Connection leak from unclosed streams | Medium | Medium | Same risk as S3/SFTP; _ErrorMappingStream.close() handles cleanup |
| urllib SSL issues on older Python | Low | Low | ssl.create_default_context() works on Python 3.10+ |
No showstoppers identified. The urllib streaming concern (P1.4 in the original gap analysis) is resolved — it works.
16. Testing Strategy¶
16.1 Backend-specific tests (tests/backends/test_http.py)¶
| ID | Test | Spec |
|---|---|---|
| HTTP-001 | read() returns streaming BinaryIO, chunked read works | SIO-001 |
| HTTP-002 | read_bytes() returns full content | BE-007 |
| HTTP-003 | exists() returns True for 200, False for 404 | BE-004 |
| HTTP-004 | get_file_info() maps headers to FileInfo fields | BE-016 |
| HTTP-005 | get_file_info() handles missing Content-Length/Last-Modified | BE-016 |
| HTTP-006 | Error mapping: 401->PermissionDenied, 404->NotFound, 500->BackendUnavailable | ERR-* |
| HTTP-007 | native_path() returns full URL | NPR-003 |
| HTTP-008 | to_key() strips base_url prefix | NPR-003 |
| HTTP-009 | Path with special characters is URL-encoded | — |
| HTTP-010 | Custom headers are sent with every request | — |
| HTTP-011 | Redirects are followed (up to limit) | — |
| HTTP-012 | Timeout raises BackendUnavailable | — |
| HTTP-013 | check_health() sends HEAD to base_url | BE-020 |
| HTTP-014 | Write/delete/move/copy raise CapabilityNotSupported | — |
| HTTP-015 | close() is callable, releases transport | BE-020 |
| HTTP-016 | Transport auto-detection (urllib/requests/httpx) | — |
| HTTP-017 | is_folder() always returns False | BE-005 |
16.2 Conformance suite participation¶
After adding capability gates (§10.3), the HTTP backend runs through the shared conformance suite. Expected results:
- ~12 tests pass (identity 5, lifecycle 1, to_key 3, unwrap 1, native_path 2)
- ~48 tests need capability gates (60 ungated minus 12 pass)
- 9 tests already gated and skip (ATOMIC_WRITE 7 + GLOB 2)
- Total: 12 + 48 + 9 = 69 ✓
- 0 tests fail
16.3 Test infrastructure¶
Use pytest-httpserver (lightweight, no external deps) to create a local HTTP
server in fixtures. Pre-seed with test files for read/metadata tests.
No real network calls in CI.
17. Implementation Checklist (SDD Pipeline)¶
| Step | Item | Notes |
|---|---|---|
| 1 | Write spec sdd/specs/032-http-backend.md | Capability profile, method mapping, error mapping |
| 2 | Add capability gates to conformance suite | ~15 lines, prerequisite for step 7 |
| 3 | Implement ReadOnlyHttpBackend | src/remote_store/backends/_http.py |
| 4 | Implement UrllibTransport | Same file or _http_transport.py |
| 5 | Register in backend registry | backends/__init__.py, from_dict() support |
| 6 | Write backend-specific tests | tests/backends/test_http.py |
| 7 | Run conformance suite with HTTP backend | Verify skip/pass/fail counts |
| 8 | Add RequestsTransport, HttpxTransport | Optional extras |
| 9 | Add optional extras to pyproject.toml | [requests] and [httpx] groups (see §18 Q5) |
| 10 | Add docs: guide, API ref, examples | docs-src/guides/, docs-src/api/ |
| 11 | Update CHANGELOG, BACKLOG | Per repo conventions |
| 11 | Update CHANGELOG, BACKLOG | Per repo conventions |
18. Resolved Questions¶
Questions from the original draft, now resolved with reasoning:
Q1. Backend name: "http" or "http-readonly" or "web"?
Use "http". Reasons:
- Consistent with other backend names ("local", "s3", "sftp", "azure")
— none encode capabilities in the name.
- A future WebDAV backend would use "webdav", not "http" — different
protocol, different backend.
- The capability system communicates what the backend can do; the name
identifies the protocol.
Q2. Should is_folder() always return False?
Yes. HTTP has no folder concept. Without LIST, there are no known prefixes to
check against. If manifest-based LIST is added later, is_folder() can check
whether a path is a known prefix in the manifest.
Q3. Streaming vs. buffered reads?
Streaming. Consistent with all other backends. urllib's HTTPResponse supports
chunked read(size) and is already buffered (BufferedIOBase). Wrap in
_ErrorMappingStream directly, no BufferedReader needed. Non-seekable per
SIO-001 spec allowance.
Q4. Retry policy?
Accept the existing RetryPolicy in the constructor (like S3, SFTP, Azure do).
Map its fields to urllib/requests/httpx retry mechanisms:
- urllib: Implement a simple retry loop around urlopen(), respecting
max_attempts, backoff_base, backoff_max, jitter, and timeout.
Retry on transient HTTP statuses (429, 500, 502, 503, 504) and connection
errors. Honour Retry-After header when present.
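- requests/httpx: Delegate to urllib3.Retry (requests) or the httpx transport
retries option, mapping RetryPolicy fields to native parameters where they
exist. httpx's built-in retries cover connection failures only, so
status-based retries there would need the same loop as urllib.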
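The urllib retry loop could look roughly like this. The `RetryPolicy` dataclass is a stand-in: its field names come from the text above, but the defaults, the type of jitter (a bool here), and the omission of timeout are assumptions. Capping Retry-After at backoff_max and handling only its delta-seconds form are also choices made for this sketch. The `send` callable stands for one request attempt returning a status and headers.

```python
from __future__ import annotations

import random
import time
from dataclasses import dataclass
from typing import Callable

TRANSIENT = frozenset({429, 500, 502, 503, 504})


@dataclass
class RetryPolicy:
    """Stand-in; defaults are assumptions."""

    max_attempts: int = 3
    backoff_base: float = 0.5
    backoff_max: float = 30.0
    jitter: bool = True


def backoff_delay(policy: RetryPolicy, attempt: int, retry_after: str | None = None) -> float:
    """Delay before the next attempt (attempt is 1-based)."""
    if retry_after is not None and retry_after.isdigit():
        return min(float(retry_after), policy.backoff_max)
    delay = min(policy.backoff_base * 2 ** (attempt - 1), policy.backoff_max)
    return random.uniform(0, delay) if policy.jitter else delay


def call_with_retry(
    send: Callable[[], tuple[int, dict[str, str]]],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it yields a non-transient status or attempts run out."""
    for attempt in range(1, policy.max_attempts + 1):
        last = attempt == policy.max_attempts
        try:
            status, headers = send()
        except OSError:  # URLError and socket timeouts derive from OSError
            if last:
                raise
            sleep(backoff_delay(policy, attempt))
            continue
        if status not in TRANSIENT or last:
            return status
        sleep(backoff_delay(policy, attempt, headers.get("Retry-After")))
    raise ValueError("max_attempts must be >= 1")
```

Injecting sleep keeps the loop testable without real waiting.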
Q5. Extra dependency group name?
pip install remote-store[httpx] for httpx, pip install remote-store[requests]
for requests. No [http] group — the baseline (urllib) needs no extra deps.
This mirrors how [arrow] means "install PyArrow" and [otel] means "install
OpenTelemetry".
Q6. Conformance suite changes?
Needed but small. Add capability gates to ~10 test classes (same pattern as
existing ATOMIC_WRITE/GLOB gates). Pre-seed read-only test data via
pytest-httpserver fixture. See §10.3 for details.
19. Recommendation¶
Proceed with a spec. No showstoppers found:
- urllib streaming works with _ErrorMappingStream (verified).
- Conformance suite changes are small and additive.
- The capability system handles read-only gracefully.
- Real-world HTTP endpoints behave as expected (HEAD, Content-Length, redirects).
- Prior art (fsspec, smart_open, DVC) validates the concept.
- Composability with ext.cache and ext.transfer delivers clear value.
Next steps:
1. Write spec 032-http-backend.md
2. Add conformance suite capability gates (prerequisite, benefits all future
partial-capability backends)
3. Implement Phase 1 (urllib-only, {READ, METADATA})
20. Implementation Plan¶
Formerly sdd/plans/plan-readonly-http-backend.md (consolidated).
Branch: claude/plan-readonly-http-backend-YhBCx
Phases¶
| Phase | Scope | Notes |
|---|---|---|
| 0 | Conformance suite capability gates | Prerequisite — make suite safe for partial-capability backends |
| 1 | Spec 032-http-backend.md | Formalize design as testable specification |
| 2 | ReadOnlyHttpBackend + UrllibTransport + registration | Zero new runtime deps |
| 3 | Backend-specific tests + conformance fixture | 17 HTTP-specific scenarios via pytest-httpserver |
| 4 | Optional transports (requests, httpx) + extras | Higher-quality HTTP for users with these installed |
| 5 | Docs: guide, example, README, nav, CHANGELOG, BACKLOG | User-facing documentation |
Commit strategy¶
Each commit passes hatch run all independently. One commit per phase.
Dependencies & test infrastructure¶
| Dependency | Type | Purpose |
|---|---|---|
| pytest-httpserver | test-only | Mock HTTP server for backend tests + conformance fixture |
| requests>=2.25.0 | optional extra | RequestsTransport |
| httpx>=0.24.0 | optional extra | HttpxTransport |
Risks & mitigations¶
| Risk | Mitigation |
|---|---|
| Conformance gate changes affect other backends | Additive only (skip where unsupported); run full suite |
| urllib streaming edge cases | Research verified _ErrorMappingStream wraps HTTPResponse correctly |
| size=0 when Content-Length missing | Document as known limitation in spec and guide |