Research: Testing Best Practices for Long-Term Package Quality¶
Date: 2026-03-28 Context: Recurring test quality issues keep resurfacing in remote-store. PR #301 is the latest fix round; audits and backlog items show a pattern of the same anti-patterns returning after being fixed. This document compiles lessons learned, industry best practices, and proposes enforceable guardrails.
1. Our History: What We Fixed and What Keeps Coming Back¶
1.1 Audit 001 (Adversarial Review, v0.5.0) Findings¶
| ID | Problem | Anti-Pattern | Status |
|---|---|---|---|
| M-12 | Azure HNS tested via unconstrained MagicMock() |
Mock accepts any call/signature | Open |
| M-13 | test_coverage_gaps.py inflated coverage with assert X is not None |
Pure-import assertions | Fixed (BK-014) |
| M-14 | Zero concurrency tests despite thread-safety spec claims | Missing behavioral tests | Open |
| M-16 | No tests for PermissionDenied/BackendUnavailable in S3/SFTP |
Untested error paths | Fixed (AF-013) |
| M-17 | SFTP retry logic (tenacity) never tested | Untested retry paths | Open |
| L-18 | test_list_files_round_trip checks len(data) == 3, not content |
Weak assertions | Open |
| L-19 | S3/S3-PyArrow ~500 lines near-identical copy-paste | Test duplication | Fixed (BK-011) |
| L-20 | No tests for Unicode paths, special chars, empty/large files | Missing edge cases | Open |
1.2 BK-014: Test Suite Deduplication (v0.19.0)¶
Reduced ~17,800 to ~16,300 lines (-8.6%) across 30 of 40 test files while preserving 1,866 passed tests. Key techniques:
- Parametrized similar tests (error mapping, validation, operation variants)
- Extracted shared fixtures
- Merged single-method test classes into parent classes
- Consolidated repeated assertion patterns
- Reviewed
test_coverage_gaps.pypure-import assertions (M-13)
1.3 PR #301: Fix Coverage and ResourceWarning (v0.20.0-dev)¶
- SQLBlob test fixtures now yield+close; engines disposed on teardown
- ProxyStore delegation coverage 68% -> 100% (new
test_proxy.py, 33 tests) - SQLAlchemy backend coverage 90% -> 99%
/prskill now gates onhatch run test-cov(95% threshold)
1.4 Pattern: What Keeps Recurring¶
Despite fixes, the same categories of bad tests reappear:
- Tests with no assertions — just call a method, assume "no crash = pass"
isinstanceas the only assertion — verifies type, not behavior- Private attribute access (
._backends,._ttl,._owns_backend) - Unconstrained mocks (
MagicMock()withoutspec=) - Factory helpers duplicating constructor logic
These recur because they are not mechanically prevented. A code reviewer (human or AI) must catch them every time, and that is not sustainable.
2. Current Anti-Patterns in Our Codebase¶
Detailed audit of recurring anti-patterns with file-path evidence. Each category maps to a proposed rule in section 4.1.
- No assertions (~35 instances): Tests that call a method but never
assert on the outcome. Examples:
test_ping.py,test_proxy.py:205,test_config.py:82. Mutation testing would flag all of these. → Rule 1. isinstanceas primary assertion (~100+ instances): Verifies type, not behavior. AFileInfowith wrong path/size/timestamp passesisinstance. Examples:test_store.py:136,test_sftp.py:130. → Rule 2.- Private attribute access (~100+ instances):
assert obj._field == xcouples tests to internals. Examples:test_registry.py:61(_backends),test_cache.py:151(_ttl),test_arrow.py:75(_store). → Rule 3. - Unconstrained mocks (~67 instances):
MagicMock()withoutspec=accepts any call (68 total, 1 withspec=). Examples:test_azure.py(46),test_ping.py(11),test_depth_listing.py(4). → Rule 4.
Summary Statistics¶
| Anti-Pattern | Count | Severity |
|---|---|---|
| No assertions (void call only) | ~35 | High |
isinstance as primary assertion |
~100+ | Medium |
| Private attribute access in assertions | ~100+ | Medium |
Unconstrained MagicMock() |
~67 | Medium |
| Factory helpers duplicating constructors | ~6 | Low |
3. Industry Best Practices¶
3.1 The Core Principle: Test Behavior, Not Implementation¶
Vladimir Khorikov (Unit Testing: Principles, Practices, and Patterns, Manning) defines the key insight:
Tests should verify units of behavior — something meaningful for the problem domain — not units of code. The number of classes it takes to implement that behavior is irrelevant.
His four pillars of a good test: 1. Protection against regressions — catches real bugs 2. Resistance to refactoring — doesn't break on harmless changes 3. Fast feedback — runs quickly 4. Maintainability — easy to read and change
Most of our anti-patterns violate pillar 2 (private attribute access, implementation-coupled mocks) or pillar 1 (no assertions, isinstance-only).
Important nuance: Khorikov's "the number of classes is irrelevant" applies to what you test (a unit of behavior may span classes). It does not mean the volume of test code is irrelevant. Pillar 4 (maintainability) directly addresses this — and our own history proves it.
3.2 Test Code Economy: Less Code, Same Coverage, Better Signal¶
Why bloated test suites are dangerous:
- Distraction. Meaningful tests buried among trivial ones; reviewers miss real issues.
- False confidence. High test count masks low distinct behavioral coverage.
- Maintenance drag. At ~1.3:1 test-to-production line ratio, refactoring costs double.
- Anti-pattern camouflage. No-assertion tests hide in large classes but stand out in tight parametrized suites.
Jay Fields (Working Effectively with Unit Tests):
It's acceptable to delete tests that don't provide value. Tests are an investment; if the return is negative, cut your losses.
BK-014 validated this: -8.6% test code, zero coverage loss (see section 1.2).
3.3 How Major Python Projects Test¶
| Project | Key Strategy | Mocking Approach |
|---|---|---|
| requests | Real local HTTP server (pytest-httpbin) | Adapter pattern as natural mock point |
| SQLAlchemy | In-memory SQLite for fast real DB tests | Almost never mocks own internals |
| Django | Transaction-wrapped tests with auto-rollback | --shuffle flag to detect isolation bugs |
| FastAPI | TestClient + dependency overrides |
DI as the testability mechanism |
| Pydantic | Extensive parametrization, property-based tests | Minimal mocking |
Common theme: Prefer real dependencies over mocks. Use architecture (dependency injection, adapter pattern, functional core) to make code testable without mocks.
3.4 Hynek Schlawack: "Don't Mock What You Don't Own"¶
Key principles from hynek.me:
- Never mock third-party code directly — wrap it in your own function, mock that wrapper
- Keep test code paths close to production — an in-process test server is closer to reality than magically replacing a client with a fake
- Use verified fakes over unconstrained mocks
- Heavy mocking signals architectural problems, not testing problems
His talk "Design Pressure" (PyCon US 2025) extends this: if you need many mocks, your code has a design problem. Push I/O to the edges ("Functional Core, Imperative Shell") and the core becomes trivially testable.
3.5 Ned Batchelder: Coverage Is Necessary But Not Sufficient¶
From his blog and talks:
100% statement coverage doesn't mean much. There are dozens of ways your code or tests could still be broken. Statement coverage has taken you to the end of its road, and the bad news is, you aren't at your destination.
Implication for us: Our 95% coverage threshold is good as a floor, but it doesn't prevent tests with no assertions from inflating the number. We need quality metrics alongside quantity metrics.
3.6 Property-Based Testing (Hypothesis)¶
Hypothesis generates random inputs and checks invariant properties. Hillel Wayne's insight: Contracts + Property-Based Testing = Integration Tests — define invariants as contracts, let Hypothesis find violations.
Candidate invariants for remote-store: write/read round-trip, path normalization idempotency, config serialization round-trip, batch operation completeness. Implementation details belong in the PR that adds these tests.
3.7 Mutation Testing (pytest-gremlins)¶
Mutation testing answers: "If a bug were introduced, would my tests catch it?" A mutation testing tool systematically mutates production code and re-runs tests. Surviving mutants = potential undetected bugs. Our "no assertion" tests would be flagged immediately — every mutant survives because nothing checks the output. Too slow for per-PR CI; run on a weekly schedule.
Tool choice: pytest-gremlins (v1.5+) selected over mutmut after evaluation. Key advantages:
| Factor | pytest-gremlins | mutmut |
|---|---|---|
| Integration | Native pytest plugin (pytest --gremlins) |
Separate CLI |
| Test selection | Coverage-guided (only runs tests covering mutated line) | Runs all tests per mutant |
| Caching | Incremental (content hash, skips unchanged code) | File-based |
| Speed | 3.7-13.8x faster in benchmarks | Baseline |
| Regression guard | Pardon system (--gremlin-max-pardons-pct) |
Manual |
| Parallelism | Built-in multi-core | Requires --runners flag |
mutmut would work but adds friction: separate CLI, no coverage-guided selection, slower I/O-bound mutations. pytest-gremlins fits our all-pytest workflow and keeps weekly CI budgets reasonable for ~12K source lines.
3.8 Testing Retry and Concurrency Behavior¶
Two known gaps (M-14 concurrency, M-17 retry) require specific patterns:
- Retry testing: Controlled failure injection — a deterministic fake that fails N times then succeeds. Assert attempt count + final outcome, never wall-clock time.
- Concurrency testing: Invariant assertions under contention --
ThreadPoolExecutorwith deterministic checks for no lost writes, no partial reads, no duplicate entries, no deadlocks (timeout on the test).
Full code examples belong in the PR that implements these tests.
4. Proposed Testing Rules for remote-store¶
4.1 The Eight Rules¶
These rules should be added to sdd/DESIGN.md section 11 (Test Style) and
enforced in code review:
Rule 1: Every test must have at least one meaningful assertion. [CI-enforced]
"No crash" is not a test. If the method returns void and the point is that
it doesn't raise, add a post-condition assertion proving the system is in
the expected state. Every public API method should also have at least one
failure-path test (pytest.raises with match=).
Rule 2: Assert behavior, not types. [review-enforced]
Replace assert isinstance(x, FileInfo) with assertions on x.path,
x.size, or x.name. The type is already guaranteed by the type checker.
Rule 3: Never assert on private attributes. [review-enforced]
If you need to verify internal state, either (a) expose it through a public
API, or (b) verify it through observable behavior. assert obj._field == x
is banned in new tests. Exception: if no public observable exists, annotate
with # internal: no public observable and justify in the PR.
Rule 4: Always use spec= (or spec_set=) with MagicMock. [CI-enforced]
MagicMock() without spec is banned. Use MagicMock(spec=RealClass) or
create_autospec(RealClass) to constrain mocks to real interfaces.
Rule 5: Don't mock what you don't own. [review-enforced]
Never patch("boto3.client") or patch("paramiko.SFTPClient"). Instead,
mock at our own boundary (the Backend ABC or a wrapper function).
Rule 6: Prefer real dependencies over mocks. [review-enforced]
Use MemoryBackend or in-memory SQLite over mocked backends. Use
pytest-httpserver over mocked HTTP clients. Reserve mocks for external
services that can't be run locally.
Rule 7: Maximize behavioral coverage per line of test code. [review-enforced] Parametrize similar tests instead of writing separate methods. Merge single-method test classes. Delete tests subsumed by others. Three parametrized cases in 10 lines beat three methods in 30 lines — same coverage, one-third the maintenance surface. (See BK-014: -8.6% test code, zero coverage loss.) When removing a test, verify via coverage that the deleted path is still exercised by remaining tests.
Rule 8: Tests must survive refactoring. [review-enforced] If renaming a private method or changing internal data structures would break a test without changing behavior, the test is coupled to implementation. Fix the test.
4.2 Ruff PT Rules (Mechanical Enforcement)¶
Add to pyproject.toml:
[tool.ruff.lint]
extend-select = ["PT"] # flake8-pytest-style
[tool.ruff.lint.flake8-pytest-style]
raises-require-match-for = [
"ValueError", "TypeError", "KeyError", "RuntimeError",
"NotFound", "AlreadyExists", "CapabilityNotSupported",
"BackendUnavailable", "PermissionDenied",
]
Key PT rules that catch our recurring issues:
- PT011: pytest.raises() too broad (no match=) — prevents lazy
exception checks
- PT018: Composite assertions should use multiple assert statements
- PT006/PT007: Consistent parametrize style
4.3 CI-Enforced Checks¶
| Check | Rule | Phase | Effort |
|---|---|---|---|
AST check: test has no assert and no pytest.raises |
Rule 1 | 1 | Low (script) |
Grep: MagicMock( without spec= or autospec= |
Rule 4 | 1 | Low (grep) |
--cov-branch |
Coverage quality | 1 | Low (flag) |
diff-cover --fail-under=90 |
New code without tests | 2 | Low (pip) |
pytest-gremlins baseline run (diagnostic, not blocking) |
Assertion quality | 2 | Medium |
pytest-smell --ci |
17 types of test smells | 3 | Low (pip) |
pytest-gremlins weekly with threshold |
Assertion quality gate | 3 | Medium |
4.4 PR Review Checklist¶
Add to the PR template as a reviewer aid:
### Test review
- [ ] Every new test has at least one `assert` or `pytest.raises`
- [ ] Assertions are behavioral (not `isinstance`-only, not type-only)
- [ ] No new `._private` attribute access in assertions (or justified)
- [ ] Mocks use `spec=` or `create_autospec`
- [ ] Failure path covered for new/changed public API methods
4.5 CLAUDE.md Additions¶
Add a ## Testing rules section to CLAUDE.md summarizing the 8 rules above
with CI/review enforcement tags. See section 4.1 for the full text. The
actual CLAUDE.md edit will be reviewed in its own implementation PR.
5. Phased Adoption Plan¶
Phase 1: Prevent New Anti-Patterns (Immediate)¶
- [ ] Add the 8 testing rules to
sdd/DESIGN.mdsection 11 - [ ] Add testing rules summary (with CI/review tags) to
CLAUDE.md - [ ] Enable Ruff
PTrules inpyproject.toml - [ ] Switch coverage to branch mode (
--cov-branch) - [ ] Add CI check: fail if test function has no
assertand nopytest.raises - [ ] Add CI grep check: fail if
MagicMock(appears withoutspec= - [ ] Add PR review checklist to PR template (section 4.4)
Phase 2: Fix Existing Anti-Patterns + Diagnostics (Next Sprint)¶
- [ ] Fix all ~35 "no assertion" tests (add meaningful post-conditions)
- [ ] Replace top ~20
isinstance-only assertions with behavioral checks - [ ] Add
spec=to all existingMagicMock()calls - [ ] Replace private attribute assertions in
test_registry.py,test_cache.py,test_proxy.pywith public API checks - [ ] Add
diff-coverto PR CI (fail-under=90 for new code) - [ ] Run
pytest-gremlinsbaseline (diagnostic only, establish starting score) - [ ] Add concurrency test suite for thread-safety claims (M-14)
Phase 3: Advanced Quality Measures (Future)¶
- [ ] Add
pytest-gremlinsto weekly CI schedule with threshold gate - [ ] Add
pytest-smell --cito lint step - [ ] Add retry testing with controlled failure injection fakes (M-17)
- [ ] Explore Hypothesis for round-trip and path normalization properties (provide 2-3 canonical examples in-repo)
- [ ] Create a
TestingBackend(controlled fake) to replace mock injection in tests liketest_registry.py:119
6. References¶
Books¶
- Vladimir Khorikov, Unit Testing: Principles, Practices, and Patterns (Manning) — cited in §3.1
- Jay Fields, Working Effectively with Unit Tests (Leanpub) — cited in §3.2
Articles & Talks¶
- Hynek Schlawack, "Don't Mock What You Don't Own" in 5 Minutes — cited in §3.4
- Hynek Schlawack, "Design Pressure" (PyCon US 2025) — cited in §3.4
- Ned Batchelder, "Flaws in Coverage Measurement" — cited in §3.5
- Hillel Wayne, "Property Tests + Contracts = Integration Tests" — cited in §3.6
- Google Testing Blog, "Test Behavior, Not Implementation" — cited in §3.1
Tools¶
| Tool | Purpose | Link |
|---|---|---|
| Hypothesis | Property-based testing | github.com/HypothesisWorks/hypothesis |
| pytest-gremlins | Mutation testing (selected) | pypi.org/project/pytest-gremlins |
| mutmut | Mutation testing (considered) | github.com/boxed/mutmut |
| pytest-smell | Test smell detection | pypi.org/project/pytest-smell |
| diff-cover | Delta coverage on PRs | pypi.org/project/diff-cover |
| Ruff PT rules | Test style linting | docs.astral.sh/ruff/rules/#flake8-pytest-style-pt |
| pytest-httpserver | Real HTTP in tests | pypi.org/project/pytest-httpserver |