- application is single threaded
- throughput and latency are unacceptable
- make application multi-threaded with coarse-grained synchronization
- throughput and latency still unacceptable
- introduce finer-grained synchronization
- fix deadlock
Even with locking alone, where the way to avoid deadlocks sounds fairly simple (just order the locks), implementing lock ordering turns out to be pretty hard to get right, and there are many pitfalls.
So I've compiled a "checklist" that will hopefully reduce some frustration when writing multi-threaded applications. These are more "what worked for me" than rules to follow. Also, I'm not an expert, so keep that in mind!
Here goes, the "Multi-threaded-development-checklist-that-works-for-Um" checklist.
Are you holding a lock during an arbitrarily long event?
Are you writing to disk, sending network packets, or invoking user-supplied callbacks while holding a lock? If so, whoever is waiting for that lock might be waiting for an arbitrarily long time as well.
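One way to keep slow work out of the critical section is to take what you need under the lock and do the slow part after releasing it. Here's a minimal sketch of that idea. LogBuffer, Add, Flush and the sink parameter are names I made up for illustration, and the sink stands in for a disk write, a network send, or a user callback.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class: buffers lines and hands them to a possibly slow sink.
class LogBuffer {
public:
    void Add(std::string line) {
        std::lock_guard<std::mutex> guard(mutex_);
        lines_.push_back(std::move(line));
    }

    // `sink` is any callable taking const std::string&. It may be
    // arbitrarily slow, so it must never run while mutex_ is held.
    template <typename Sink>
    void Flush(Sink&& sink) {
        std::vector<std::string> pending;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            pending.swap(lines_);  // cheap: take the data, leave an empty buffer
        }
        // The lock is released here, so callers of Add() are not blocked
        // while the sink runs, and the sink may even call Add() itself.
        for (const std::string& line : pending) {
            sink(line);
        }
    }

private:
    std::mutex mutex_;
    std::vector<std::string> lines_;
};
```

The trade-off is that the sink sees the data as it was when Flush took it; anything added while the sink is running waits for the next Flush.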
Is your application layered?
If you're using fine-grained locks, layer your application so the locks have levels as well. It's easier to reason about lock ordering if your classes are well-layered. For example, if class Parent holds a collection of class Child, the lock order Parent -> Child is intuitive and easy to remember.
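To make the Parent -> Child order concrete, here's a rough sketch where every path that needs both locks takes the Parent lock first and the Child lock second. The method names (AddChild, IncrementChild, Total) and the integer ids are assumptions for the example, not from any real code base.

```cpp
#include <map>
#include <memory>
#include <mutex>

// Lower layer: only ever locks itself and never calls up into Parent.
class Child {
public:
    void Increment() {
        std::lock_guard<std::mutex> guard(mutex_);
        ++value_;
    }
    int Value() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Upper layer: locks itself first, then (optionally) a Child.
class Parent {
public:
    void AddChild(int id) {
        std::lock_guard<std::mutex> guard(mutex_);
        children_[id] = std::make_unique<Child>();
    }

    bool IncrementChild(int id) {
        std::lock_guard<std::mutex> guard(mutex_);  // level 1: Parent
        auto it = children_.find(id);
        if (it == children_.end()) return false;
        it->second->Increment();                    // level 2: Child
        return true;
    }

    int Total() const {
        std::lock_guard<std::mutex> guard(mutex_);  // level 1: Parent
        int total = 0;
        for (const auto& entry : children_) {
            total += entry.second->Value();         // level 2: Child
        }
        return total;
    }

private:
    mutable std::mutex mutex_;
    std::map<int, std::unique_ptr<Child>> children_;
};
```

Since Child never reaches back up into Parent, no thread can end up holding a Child lock while waiting for the Parent lock.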
Are layers in the same level interacting?
In the example above, Child objects should not interact with each other. If they need to share some information, do it in the Parent class. Otherwise, two locks at the same level have no obvious order, and finding the right lock order would be hard.
Do lower layers release their locks before calling upper layers?
If Child holds its own lock while calling a method in Parent that locks the parent, the locks are taken in the order Child -> Parent, which is basically a violation of the lock ordering.
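Here's a small sketch of a Child that updates its own state under its lock, releases the lock, and only then calls up into Parent. The callback wiring (on_done, OnChildDone, Finish, Run) is invented for the example.

```cpp
#include <functional>
#include <mutex>
#include <utility>

class Child {
public:
    // `on_done` is the upcall into the upper layer (hypothetical name).
    explicit Child(std::function<void(int)> on_done)
        : on_done_(std::move(on_done)) {}

    void Finish(int result) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            result_ = result;
            done_ = true;
        }  // Child lock released here
        // No Child lock is held during the upcall, so Parent is free
        // to take its own lock without breaking Parent -> Child.
        on_done_(result);
    }

    bool Done() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return done_;
    }

private:
    mutable std::mutex mutex_;
    bool done_ = false;
    int result_ = 0;
    const std::function<void(int)> on_done_;
};

class Parent {
public:
    Parent() : child_([this](int result) { OnChildDone(result); }) {}

    // Deliberately does not hold the Parent lock: the upcall below
    // locks the Parent, and std::mutex is not recursive.
    void Run(int value) { child_.Finish(value); }

    int Sum() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return sum_;
    }

    bool ChildDone() const {
        std::lock_guard<std::mutex> guard(mutex_);  // Parent first
        return child_.Done();                       // then Child
    }

private:
    void OnChildDone(int result) {
        std::lock_guard<std::mutex> guard(mutex_);
        sum_ += result;
    }

    mutable std::mutex mutex_;
    int sum_ = 0;
    Child child_;
};
```

If Finish called on_done_ inside the inner scope instead, one thread in ChildDone (Parent, then Child) and another in Finish (Child, then Parent) could deadlock each other.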
Are you aware of any locking inside your third-party libraries?
If you're using a third-party library that says it's thread-safe, then that library is probably using synchronization primitives as well. In our example, the layering could be like this: Parent -> Child -> third-party object. What if the third-party library can invoke some method Child::Callback()? Then there might be a lock order violation, e.g. the third-party library holds one of its own locks while invoking Child::Callback(), which locks the child.
Is anything blocking forever?
If you have condition variables, do all code paths notify them at some point? If you're blocking to wait for an event (for example, via epoll), do you have a way to preempt it or have a timeout on it?
Is your application modifying a snapshot when it's supposed to modify shared data (or vice-versa)?
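For the condition variable case, a bounded wait is one way to make sure a missed notification can't block a thread forever. This is only a sketch; the Event class and its Set and WaitFor methods are hypothetical names wrapped around the standard std::condition_variable::wait_for.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical one-shot event built on a condition variable.
class Event {
public:
    void Set() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            set_ = true;
        }
        cv_.notify_all();
    }

    // Returns true if the event was set, false if the timeout expired.
    // The predicate guards against spurious wakeups and against a
    // Set() that happened before the wait started.
    bool WaitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return set_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool set_ = false;
};
```

On a false return the caller can log, retry, or shut down, instead of hanging on a code path that forgot to notify.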
This bit us pretty hard recently at work, and we spent a couple of hours trying to find the cause. Suppose Parent contains a collection of Child. If, for example, you wanted to remove a Child from the collection but only removed it from a copy of the collection, then the real shared data won't be modified at all, and all sorts of weirdness follows. So make sure the data you're acting on is actually the data you want to act on.
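Here's a stripped-down sketch of how this kind of bug can sneak in. It's not our actual code: the Parent below stores plain integer ids instead of Child objects, and Children and RemoveChild are names made up for the example. The getter returns a copy, so erasing from its result changes nothing in the shared collection.

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

class Parent {
public:
    void AddChild(int id) {
        std::lock_guard<std::mutex> guard(mutex_);
        children_.push_back(id);
    }

    // Returns a snapshot (a copy). Safe to read without the lock,
    // but changes to it never reach the shared collection.
    std::vector<int> Children() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return children_;
    }

    // Modifies the real shared collection under the lock.
    bool RemoveChild(int id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = std::find(children_.begin(), children_.end(), id);
        if (it == children_.end()) return false;
        children_.erase(it);
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> children_;
};
```

Calling erase on the vector returned by Children() compiles and runs without complaint, which is what makes the bug hard to spot; only RemoveChild touches the data other threads see.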
That's all I have for now.