Concurrency Bugs

Something I often hear when developers are trying to reason about the behavior of highly concurrent code is, “I don’t think it could happen” - and this kills me.

I think the best strategy for avoiding the expensive, time-consuming process of identifying and correcting multithreading bugs starts with a proper architecture. If you can’t say with certainty that the behavior is correct, then you’re doing it wrong and pain is headed your way! As a starting point, use something like an actor model combined with adequate sequence diagrams covering all of the messaging behavior.

The actor-based programming model will help prevent your developers from shooting themselves in the foot, and the sequence diagrams will help them grok the moving pieces in your system. With just these tools in place, it should be possible to think about the messages and events involved in a piece of behavior and know, for sure, that nothing will go awry. Using an actor-based model insulates the developer from the low-level primitives and hazards of concurrent programming (locks, mutexes, visibility issues, check-then-act problems) while still allowing the parallel execution needed for scale.
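
To make that concrete, here is a minimal sketch of the idea in Python, not a production actor framework. The names `CounterActor` and `run_demo` and the message strings "increment" and "get" are hypothetical, made up for illustration. The actor owns its state, and a single thread drains its mailbox one message at a time, so callers never need a lock and there is no check-then-act window to worry about.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: owns its state and handles one message at a time."""

    _STOP = object()  # sentinel message that shuts the actor down

    def __init__(self):
        self._mailbox = queue.Queue()  # thread-safe FIFO of incoming messages
        self._count = 0                # private state, touched only by the actor thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Enqueue a message; callers never touch the counter directly."""
        self._mailbox.put((message, reply_to))

    def stop(self):
        """Ask the actor to finish and wait for its thread to exit."""
        self._mailbox.put((self._STOP, None))
        self._thread.join()

    def _run(self):
        # Messages are processed strictly in arrival order by this one thread.
        while True:
            message, reply_to = self._mailbox.get()
            if message is self._STOP:
                break
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)  # reply through the caller's queue


def run_demo(senders=8, per_sender=1000):
    """Hammer one actor from several threads and return the final count."""
    actor = CounterActor()

    def worker():
        for _ in range(per_sender):
            actor.send("increment")

    threads = [threading.Thread(target=worker) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # every increment is in the mailbox before we ask for the total

    reply = queue.Queue()
    actor.send("get", reply)
    total = reply.get()
    actor.stop()
    return total
```

Calling `run_demo(8, 1000)` should return exactly 8000 every time, because the "get" message is queued behind every increment and only the actor thread ever modifies the count. That is the kind of argument you want to be able to make about each message in your system.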

If your codebase is too complex, too distributed, or too poorly documented to really understand what happens when a message is received, you’re just asking to spend hours tracking down once-in-a-million concurrency bugs.

Leads and project managers, make sure you enforce best practices! Time spent up front on documentation and architecture will pay off in spades in debugging time saved and faster development.
