monitoring(1)
Stop Monitoring for Failure
monitoring(1)

Synopsis

Many teams address operational readiness by identifying how their system can fail and defining monitors for those failures.

This approach is fundamentally flawed, but there is a better way.

A Bad Day

A few years back a legacy system I was working on failed in the worst way. It failed completely. It failed silently. It failed until the customer reached out to ask what the hell was going on.

In this case the component that failed had the following contract:

  1. Wait for an upstream notification
  2. On receiving that notification, load a file
  3. Translate the file into alternate artifacts
  4. Deliver the new artifacts to another downstream service

The complete failure meant that absolutely no notifications were being processed.

We had failure detection. Every one of those steps checked for failure and generated metrics, and those metrics were monitored by alarm rules.

So if all that was set up, why didn't we get an alarm?

The Failure of Monitoring for Failure

It turned out a code change had accidentally caused the component to ignore all notifications. It would receive the notification, incorrectly classify it as irrelevant, and return without actually doing anything.

The component's transactions per second were unchanged, the latency was AMAZING, and there were no processing errors, so no alarms fired.
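
To make that failure mode concrete, here is a minimal Python sketch. It is not the original component: the names metrics, is_relevant, handle, and failure_alarm_firing are all hypothetical, and the in-memory counter stands in for whatever metrics client you actually use.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a real metrics client.
metrics = Counter()


def is_relevant(notification: dict) -> bool:
    # Stand-in for the regression: every notification is classified
    # as irrelevant, so the handler never reaches the real work.
    return False


def handle(notification: dict) -> None:
    metrics["requests"] += 1
    try:
        if not is_relevant(notification):
            return  # silent early exit: no error, no exception
        # The load, translate, and deliver steps would run here,
        # each one raising on failure.
        metrics["delivered"] += 1
    except Exception:
        metrics["errors"] += 1
        raise


def failure_alarm_firing() -> bool:
    # A failure-oriented alarm rule: fire only when errors are recorded.
    return metrics["errors"] > 0


for i in range(100):
    handle({"file": f"upload-{i}"})
```

After the loop the request counter reads 100, the error counter reads 0, nothing has been delivered, and failure_alarm_firing() returns False. Every number a failure-oriented dashboard watches looks healthy.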

Sure, we could (and did) add another metric to catch this, but how can we possibly anticipate every way a system might fail?

Monitoring for Success

The relatively simple solution is to monitor for success.

Ask "what does my system need to do as a whole to be healthy?" and then confirm those conditions are true.

In this case, our system had nothing to do with notifications and files per se: our system as a whole was responsible for registering new devices attached to the network.

From this perspective, the failure had been extremely visible to our customer. As they added new devices, none of them showed up in our system.

Our focus on monitoring component errors had left us completely blind.

If we'd simply built monitoring around the customer's expectation, we'd have seen the issue immediately.
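
As a rough illustration of what such a check could look like, here is a Python sketch under some loud assumptions: that you have an independent source of truth for which devices joined the network and when, that you can list which devices your system has registered, and that a 15-minute grace period is acceptable. The function names and the grace period are hypothetical, not what we actually deployed.

```python
from datetime import datetime, timedelta


def unregistered_devices(attached, registered, now, grace=timedelta(minutes=15)):
    """Return IDs of devices attached longer ago than `grace` but not registered.

    attached:   dict of device ID -> time the device joined the network
                (assumed to come from a source independent of our system)
    registered: set of device IDs our system knows about
    """
    return sorted(
        device_id
        for device_id, attached_at in attached.items()
        if now - attached_at > grace and device_id not in registered
    )


def success_alarm_firing(attached, registered, now, grace=timedelta(minutes=15)):
    # Fire when the customer expectation is not being met, whatever the cause.
    return bool(unregistered_devices(attached, registered, now, grace))


now = datetime(2024, 1, 1, 12, 0)
attached = {
    "dev-a": now - timedelta(hours=2),
    "dev-b": now - timedelta(minutes=40),
    "dev-c": now - timedelta(minutes=5),  # still inside the grace period
}
registered = {"dev-a"}

missing = unregistered_devices(attached, registered, now)
```

Here missing is ["dev-b"], so the alarm fires. The check neither knows nor cares which component broke or why. It only confirms that devices customers attach actually show up.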

How Can You Do It Better?

If you take one thing away from this article, let it be this: define what success is early and build monitoring for it into your system. If you don't know how to define measurable success metrics, don't believe they're possible to collect, or don't know how to compute them, be curious.

Author

Written by Michael Smit

Copyright

©2024 Michael Smit