How to write fault-tolerant message consumers

This page covers the most important aspects of writing fault-tolerant message consumers. Not every situation described here will occur (or seem to occur) in your system, but it is a good idea to follow these guidelines anyway.

It is often difficult to rule out the problems listed here, because distributed systems are hard to reason about. Your system may also change over time, and its behavior with it. So it is best to design your consumers up front so that they do not fail when you need to scale up, run into high-traffic situations or receive invalid messages.

1. Ensure old messages don't overwrite new data

The message broker does not guarantee the order of messages. Therefore you have to deal with old messages that arrive after newer ones.

To get around this issue, make sure each message carries a property that lets you detect outdated messages, for instance a timestamp or a sequence number. Persist the value of the last message you applied and ignore (but still ack) any message that is older than the current state.
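The check described above can be sketched roughly as follows. This is a minimal illustration in Python, not code from the message broker module: VersionedStore, handle and the message fields key, value and sequence are hypothetical names, and the in-memory dictionaries stand in for whatever database your consumer really uses.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Hypothetical in-memory store; a real consumer would persist this."""
    values: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)

    def apply(self, key, value, sequence):
        """Apply the update only if it is newer than the stored state.

        Returns True if the update was applied, False if it was stale.
        """
        current = self.versions.get(key)
        # An equal sequence number means a duplicate, which is skipped too.
        if current is not None and sequence <= current:
            return False
        self.values[key] = value
        self.versions[key] = sequence
        return True


def handle(store, message):
    """Process one message (assumed to be a dict) and return the action."""
    store.apply(message["key"], message["value"], message["sequence"])
    # Stale messages are acked as well, otherwise they would be redelivered.
    return "ack"
```

In a real consumer the comparison and the write should happen atomically (for example in one database transaction), otherwise two consumers working on the same key can still race each other.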

2. Implement "idempotent" consumers

Messages can and will arrive multiple times when you use a message broker. A typical scenario is a consumer that crashes right before it acks a message.
The broker will resend the unacked message later, even though it was already processed.

Make sure your consumers are implemented in an "idempotent" way. This means that it must be safe to process the same message twice. For instance, a message stating "delete entry x" should not fail even if entry x was already deleted before.
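As a small illustration of the "delete entry x" case, here is a sketch in Python. The function delete_entry and the dictionary used as storage are hypothetical; the point is only that the second delivery of the same message is a harmless no-op.

```python
def delete_entry(entries, entry_id):
    """Delete entry_id from the dict; do nothing if it is already gone."""
    # dict.pop with a default never raises KeyError for a missing key.
    entries.pop(entry_id, None)


entries = {"x": "some data"}
delete_entry(entries, "x")  # first delivery removes the entry
delete_entry(entries, "x")  # redelivery: nothing left to do, no error
```

The same idea applies to SQL: a DELETE that matches zero rows is not an error, so the consumer should not treat "nothing deleted" as a failure.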

3. Restart consumers on failure

Let your consumers shut down gracefully if something unexpected (or something known and critical) happens. Especially for long-running processes, this behaviour is much safer and more reliable than trying to recover from an unexpected exception or other error state.

The message broker module provides a special exception for this case: the CriticalErrorException, which denotes that the consumer cannot continue. Throw this exception, for example, when your MySQL connection dies or other systems you depend on are no longer available.

The message broker module will log the exception to watchdog (which may of course fail) and exit with code 1. We recommend adding a small wrapper (for example a bash script) that restarts the process if anything bad happens.
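Such a wrapper could look roughly like the following sketch. It is written in Python rather than bash, and it is only one possible approach: the function supervise, its parameters and the restart limit are assumptions made for this example, and the command you pass in depends on how you start your consumer.

```python
import subprocess
import time


def supervise(command, max_restarts=5, delay_seconds=1.0):
    """Run command and restart it while it exits with a non-zero code.

    Returns the number of times the command was started.
    """
    starts = 0
    while True:
        starts += 1
        exit_code = subprocess.call(command)
        if exit_code == 0:
            return starts  # clean shutdown, no restart needed
        if starts > max_restarts:
            return starts  # give up instead of restarting forever
        # Short pause so a permanently broken consumer does not spin.
        time.sleep(delay_seconds)


# Hypothetical usage; replace the command with your own consumer start-up:
# supervise(["/path/to/start-consumer.sh"], max_restarts=10, delay_seconds=5)
```

The restart limit and the delay are there so that a consumer which fails immediately on every start (for instance because the database stays down) does not cause a tight restart loop. A process supervisor such as systemd can do the same job if one is available.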

4. Don't let invalid messages flood your queues

Invalid messages are a problem for your whole system: because processing them fails, they are redelivered to your consumers over and over. Your queues slowly fill up and your consumers spend more and more time processing messages that can't be acked.
This degrades overall system performance and can become an even bigger problem if you don't keep an eye on it.

The solution is to acknowledge invalid messages even though you can't process them. The message broker module supports this scenario by providing the special-purpose InvalidMessageException. Throw it if you discover something invalid in the message body.

The module will catch the exception, acknowledge the message and log a short info message about it to watchdog. You can also register an invalidMessageHandler for your consumer, which allows you to do more advanced things in this situation.
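The general pattern (validate, signal invalid input with a dedicated exception, ack anyway) can be sketched as follows. This Python example does not use the module's API: InvalidMessageError, parse_order, consume, the on_invalid callback and the order_id field are all hypothetical stand-ins for the behaviour described above.

```python
import json
import logging

logger = logging.getLogger("consumer")


class InvalidMessageError(Exception):
    """Stand-in for the idea behind InvalidMessageException."""


def parse_order(body):
    """Decode the body and check the one field this example requires."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise InvalidMessageError(f"body is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "order_id" not in data:
        raise InvalidMessageError("missing required field 'order_id'")
    return data


def consume(body, process, on_invalid=None):
    """Return True when the message should be acked.

    Any exception other than InvalidMessageError propagates, so the
    message stays unacked and can be redelivered.
    """
    try:
        process(parse_order(body))
    except InvalidMessageError as exc:
        logger.info("Dropping invalid message: %s", exc)
        if on_invalid is not None:
            on_invalid(body, exc)  # e.g. store the body for later inspection
    return True
```

A handler like on_invalid is a natural place to copy the rejected body somewhere it can be inspected later, so that acking the message does not mean losing the evidence.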