Turning “Can’t Reproduce” into “Fixed”: Why Logging Wins Bug Triage
When bug triage stalls on “Can’t reproduce,” “Works as designed,” or “Not a bug, it’s a feature,” the problem isn’t teamwork; it’s observability. The fastest way to move from debate to diagnosis is excellent logging: precise, structured, and searchable.
The problem
Without context, faults are hard to recreate. Differences in environment, feature flags, inputs, timing, and dependency state make issues look “random”. Closing tickets on opinion rather than evidence wastes time and erodes trust.
Why logs matter
- Reproduction on demand: With correlation IDs and request context, you can retrace a user’s exact path and replay it in staging.
- Accuracy over guesswork: Exceptions with full stack traces and inner causes point straight to the failing layer and line.
- Evidence for design changes: If logs show recurring user-visible errors or timeouts, the behaviour may be “as designed” yet still harmful. Now you have data to change it.
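To make the "full stack traces and inner causes" point concrete, here is a minimal sketch using only Python's standard library. The helper name `describe_exception` and the sample functions `load_config` and `start_service` are hypothetical, chosen for illustration. It walks the exception chain and reports the frame where the root cause was raised.

```python
import traceback


def describe_exception(exc):
    """Summarise an exception, its chain of causes, and the frame where the root cause was raised."""
    chain = []
    seen = set()
    current = exc
    root = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append({"type": type(current).__name__, "message": str(current)})
        root = current
        # "raise ... from err" sets __cause__; implicit chaining sets __context__.
        nxt = current.__cause__
        if nxt is None and not current.__suppress_context__:
            nxt = current.__context__
        current = nxt

    frames = traceback.extract_tb(root.__traceback__)
    origin = frames[-1] if frames else None
    return {
        "chain": chain,
        "stack": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "origin": (
            {"file": origin.filename, "line": origin.lineno, "function": origin.name}
            if origin
            else None
        ),
    }


# Hypothetical failure: a low-level error wrapped by a higher-level one.
def load_config(path):
    raise FileNotFoundError(path)


def start_service():
    try:
        load_config("/etc/app.yaml")
    except FileNotFoundError as err:
        raise RuntimeError("service failed to start") from err


try:
    start_service()
except RuntimeError as err:
    report = describe_exception(err)
    print([link["type"] for link in report["chain"]])
    print(report["origin"]["function"])
```

Logging the chain and the origin frame as separate fields, rather than only as one text blob, is what makes them searchable later.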
Minimum logging standard (you can ship this now)
- Structured logs (JSON or key–value), not free text. Include: timestamp, level, service, environment, version/commit, traceId, spanId, userId/sessionId, route/path, inputs (redacted), featureFlags, region/tenant.
- Exception capture: type, message, full stack trace, and first failing frame.
- Correlation across services: propagate traceId/spanId end-to-end.
- PII safety: redact or hash sensitive data at the edge.
- Level policy: DEBUG off in prod by default; INFO for state changes; WARN for recoverable anomalies; ERROR for failed user actions; FATAL for system-down events.
- Retention & searchability: centralised log aggregation with indexes on IDs and exception fields.
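One way to meet the structured-logging and exception-capture items above is a small JSON formatter on top of Python's standard `logging` module. This is a sketch under stated assumptions, not a prescribed implementation: the `JsonFormatter` class, the `STATIC_FIELDS` values, and the sample trace IDs and route are all hypothetical.

```python
import io
import json
import logging
import traceback
from datetime import datetime, timezone

# Hypothetical static context; a real service would read these from its build and environment.
STATIC_FIELDS = {"service": "checkout", "environment": "staging", "version": "abc1234"}
# Per-request fields the formatter copies from the log record when present.
REQUEST_FIELDS = ("traceId", "spanId", "sessionId", "route", "featureFlags", "region")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            **STATIC_FIELDS,
        }
        # Values passed through logging's `extra=` become attributes on the record.
        for key in REQUEST_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info:
            exc_type, exc, tb = record.exc_info
            entry["exception"] = {
                "type": exc_type.__name__,
                "message": str(exc),
                "stack": "".join(traceback.format_exception(exc_type, exc, tb)),
            }
        return json.dumps(entry)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)  # DEBUG stays off by default, per the level policy
logger.addHandler(handler)
logger.propagate = False

context = {"traceId": "t-123", "spanId": "s-1", "route": "/cart/pay",
           "featureFlags": {"newPricing": True}}
logger.info("payment authorised", extra=context)
try:
    1 / 0
except ZeroDivisionError:
    logger.exception("payment failed", extra=context)

print(stream.getvalue())
```

Because every entry is a single JSON object with stable field names, a log aggregator can index `traceId` and `exception.type` directly instead of parsing free text.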
Make “Can’t Reproduce” rare
- Show a correlation ID in user-facing error messages and pipe it into support tickets.
- Auto-attach recent log excerpts and request/response samples to bug reports.
- Log the build/commit SHA and a config snapshot so you can run the exact binary.
- Record runtime/OS/container image and feature-flag states; heisenbugs love flags.
- Track latency, retries, queue depth, memory/CPU to surface timing and race conditions.
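The first item in the list above, surfacing a correlation ID to the user, can be sketched as a thin wrapper around a request handler. The names `handle_request` and `flaky_handler`, the response shape, and the ID `req-42` are assumptions made for illustration; a real web framework would provide its own hooks for this.

```python
import logging
import uuid

logger = logging.getLogger("support-demo")


def handle_request(handler, payload, correlation_id=None):
    """Run a handler; on failure, log with the correlation ID and show the same ID to the user."""
    # Reuse an incoming ID if an upstream service supplied one, otherwise mint a new one.
    correlation_id = correlation_id or uuid.uuid4().hex
    try:
        return {"status": 200, "body": handler(payload), "correlationId": correlation_id}
    except Exception:
        # The stack trace goes to the logs, keyed by the ID the user will quote.
        logger.exception("request failed", extra={"correlationId": correlation_id})
        return {
            "status": 500,
            "body": f"Something went wrong. Quote reference {correlation_id} when contacting support.",
            "correlationId": correlation_id,
        }


# Hypothetical handler that fails on an unexpected input.
def flaky_handler(payload):
    raise ValueError("unsupported currency: " + payload["currency"])


response = handle_request(flaky_handler, {"currency": "XXX"}, correlation_id="req-42")
print(response["body"])
```

The user never sees the stack trace, but the reference they paste into a ticket leads straight to it.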
What “Works as Designed” should mean
If the data shows frequent retries, timeouts, or compensating actions, the design isn’t working for users, even if the spec says otherwise. Logs move the discussion from opinions to measurable impact.
Implementation checklist
- Adopt a structured logging library in every service
- Standardise common fields and naming
- Propagate correlation IDs across HTTP/gRPC/queues
- Enforce level rules in CI/CD and runtime configs
- Centralise logs; define saved searches and dashboards
- Add redaction tests for sensitive fields
- Write an incident runbook: how to go from ticket → correlation ID → replay
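As an illustration of the redaction-test item in the checklist, the sketch below hashes sensitive fields before an event is logged and asserts that no raw value survives serialisation. The field names in `SENSITIVE_KEYS`, the salt, and the sample event are hypothetical. Salted hashing of low-entropy values such as short passwords is only weak protection, so dropping those fields entirely may be the better policy.

```python
import hashlib
import json

# Hypothetical list of field names that must never reach the logs in clear text.
SENSITIVE_KEYS = {"password", "cardNumber", "email"}


def _mask(value, salt):
    """Replace a value with a short salted hash so equal inputs still correlate."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "sha256:" + digest[:12]


def redact(value, salt="demo-salt"):
    """Recursively mask sensitive keys in nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: _mask(item, salt) if key in SENSITIVE_KEYS else redact(item, salt)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, salt) for item in value]
    return value


def test_redaction_hides_sensitive_fields():
    event = {
        "route": "/signup",
        "inputs": {"email": "ada@example.com", "password": "hunter2", "plan": "pro"},
        "items": [{"cardNumber": "4111111111111111"}],
    }
    serialised = json.dumps(redact(event))
    for secret in ("ada@example.com", "hunter2", "4111111111111111"):
        assert secret not in serialised
    # Non-sensitive fields survive, and hashing is stable across calls.
    assert redact(event)["inputs"]["plan"] == "pro"
    assert redact(event) == redact(event)


test_redaction_hides_sensitive_fields()
```

Running a test like this in CI catches the common regression where a new field is added to a payload and logged before anyone marks it sensitive.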
Conclusion
Great logging is the shortest path from “Can’t reproduce” to “Fixed”. Invest before the next incident, not after.