MassTransit consumer fault handling when the destination system is down for a long time

I have read the MassTransit documentation on error handling and faults, added code to publish the fault, and written a fault consumer that listens for the fault message once a number of retries with Polly have failed.

I have a queue consumer that gets messages from RabbitMQ using MassTransit and sends them to a cloud system through an HTTP API. I have handled all possible exceptions and also wrapped the HTTP calls in a Polly retry for transient network errors. The problem with this approach is that the message is effectively abandoned once the retries are exhausted.

Assume the destination system is down for 10 hours (we don't know about the outage in advance; otherwise I would plan to stop the consumer service). What is the best strategy with MassTransit to stop pulling messages from the queue into the consumer? Is there a way to stop receiving messages based on the number of failures, etc.?

Thanks

1 answer

  • answered 2020-09-14 07:39 Alexey Zimarev

    You need a circuit breaker; it's a well-known pattern in distributed systems. The circuit breaker activates when the remote system is struggling under load and sending it more requests could overwhelm it. It also lets you stop sending messages to the remote system when it is down.

    The circuit breaker is available in MassTransit out of the box.

    I would also not recommend implementing retries using Polly in the consumer. MassTransit has a comprehensive set of retry policies, and using them lets MassTransit see how many failures occur in the consumer, which it cannot see when you use Polly. For example, the circuit breaker middleware won't know about failures in a Polly-wrapped call and therefore won't react properly.

    If the remote system is down for a long time (hours, as you described), any retry policy with a limited number of attempts will eventually fail. The circuit breaker will open, but it will reset from time to time and try calling the consumer again; otherwise, it would never know when the remote system has recovered. So you would either need to recover messages from the error queue or add the redelivery middleware.

    You can therefore configure your receive pipeline this way:

    redelivery -> circuit breaker -> retry -> consumer
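
    As a rough sketch of that ordering, the endpoint configuration might look like the following. The message type, consumer, queue name, host URI, and all intervals and thresholds are hypothetical placeholders. `UseDelayedExchangeMessageScheduler` is the scheduler registration from the MassTransit versions current around the time of this answer and needs the RabbitMQ delayed message exchange plugin; other versions may register the scheduler differently, so check the documentation for the version you use.

    ```csharp
    using System;
    using System.Threading.Tasks;
    using MassTransit;

    // Hypothetical message contract, for illustration only.
    public class OrderSubmitted
    {
        public Guid OrderId { get; set; }
    }

    // Hypothetical consumer. The HTTP call is not wrapped in Polly, so
    // exceptions propagate to the MassTransit retry and circuit breaker filters.
    public class OrderSubmittedConsumer : IConsumer<OrderSubmitted>
    {
        public Task Consume(ConsumeContext<OrderSubmitted> context)
        {
            // Call the remote HTTP API here and let failures throw.
            return Task.CompletedTask;
        }
    }

    public static class BusSetup
    {
        public static IBusControl Create()
        {
            return Bus.Factory.CreateUsingRabbitMq(cfg =>
            {
                // Assumed local broker with default credentials.
                cfg.Host(new Uri("rabbitmq://localhost/"), h =>
                {
                    h.Username("guest");
                    h.Password("guest");
                });

                // Scheduled redelivery needs a message scheduler (version-dependent call).
                cfg.UseDelayedExchangeMessageScheduler();

                cfg.ReceiveEndpoint("order-submitted", e =>
                {
                    // Filters run in the order they are configured, outermost first.

                    // 1. Redelivery: reschedule the message after retries are exhausted.
                    e.UseScheduledRedelivery(r => r.Intervals(
                        TimeSpan.FromMinutes(5),
                        TimeSpan.FromMinutes(30),
                        TimeSpan.FromHours(2)));

                    // 2. Circuit breaker: stop calling the consumer while failures are frequent.
                    e.UseCircuitBreaker(cb =>
                    {
                        cb.TrackingPeriod = TimeSpan.FromMinutes(1);
                        cb.TripThreshold = 15;   // percentage of failed messages that opens the circuit
                        cb.ActiveThreshold = 10; // minimum messages before the breaker can trip
                        cb.ResetInterval = TimeSpan.FromMinutes(5);
                    });

                    // 3. Retry: short in-process retries for transient errors.
                    e.UseMessageRetry(r => r.Interval(3, TimeSpan.FromSeconds(2)));

                    // 4. Consumer.
                    e.Consumer<OrderSubmittedConsumer>();
                });
            });
        }
    }
    ```

    With a setup like this, a message that still fails after the last redelivery interval ends up in the error queue, so the redelivery intervals should be sized against the longest outage you expect.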