Alert when an ArgoCD app is unhealthy for X minutes using Prometheus and Grafana

I am trying to set up a Grafana panel that shows how long an ArgoCD app has been unhealthy, and to alert if it stays unhealthy for 15 minutes. My PromQL query so far is:

sum(count_over_time(argocd_app_info{health_status!="Healthy"}[20m])) by (name)


This gets me pretty close. The line increments by one for every minute the app is unhealthy, up to a maximum of 20, so I can alert when it reaches 15.

The problem is that it only decrements by one for every minute the app is healthy. This means the alert can fire when the app has been in a Progressing state for 15 of the past 20 minutes in total, even if it finished progressing and went back to Healthy several times in that period.
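
As a rough illustration of this behaviour, here is a minimal Python sketch of the rolling count. It is not Prometheus itself: it assumes exactly one sample per minute and a single series per app, and the helper name `rolling_unhealthy_count` and the flapping timeline are made up for the example.

```python
def rolling_unhealthy_count(statuses, window=20):
    """For each minute t, count non-Healthy samples among the last `window` samples.

    Rough stand-in for count_over_time(...{health_status!="Healthy"}[20m]),
    assuming one scrape per minute (an assumption, not taken from the question).
    """
    counts = []
    for t in range(len(statuses)):
        start = max(0, t - window + 1)
        counts.append(sum(1 for s in statuses[start:t + 1] if s != "Healthy"))
    return counts


# Hypothetical flapping app: 3 minutes Progressing, 1 minute Healthy, repeated 5 times.
flapping = (["Progressing"] * 3 + ["Healthy"]) * 5  # 20 minutes in total

counts = rolling_unhealthy_count(flapping)
print(counts[-1])  # 15, although the app was never unhealthy for more than 3 minutes in a row
```

In this model the count reaches the 15 threshold after 20 minutes even though the longest unbroken unhealthy stretch is 3 minutes.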

Instead of decrementing every minute the app is healthy, I want the line to drop to zero as soon as the app becomes healthy. How do I change the PromQL query to do that?

1 answer

  • answered 2022-04-26 12:32 williamcodes

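    I figured it out. It seems you need to multiply by a vector that only exists while the app is unhealthy. Since argocd_app_info has the value 1, the count is unchanged while the app is unhealthy, and the result disappears (rather than slowly decrementing) as soon as the app becomes healthy. Here's the query: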

    sum(count_over_time(argocd_app_info{health_status!="Healthy"}[20m]) * argocd_app_info{health_status!="Healthy"}) by (name)
    


    The query is a bit long and confusing, but it works.
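
The effect of the multiplication can be sketched in Python as a mask over the rolling count. This is only a model under stated assumptions: one sample per minute, a single non-Healthy series per app, and `None` standing in for "no sample". The helper name `masked_unhealthy_count` and the timeline are hypothetical.

```python
def masked_unhealthy_count(statuses, window=20):
    """Rolling non-Healthy count, kept only at minutes where the app is currently non-Healthy.

    None stands for "no sample": a PromQL binary operation drops elements
    that have no match on the other side, so the result is absent (not 0)
    while the app is Healthy.
    """
    result = []
    for t, current in enumerate(statuses):
        if current == "Healthy":
            result.append(None)  # the non-Healthy series does not exist right now
            continue
        start = max(0, t - window + 1)
        result.append(sum(1 for s in statuses[start:t + 1] if s != "Healthy"))
    return result


# Hypothetical timeline: 4 minutes Progressing, 2 minutes Healthy, 2 minutes Progressing.
timeline = ["Progressing"] * 4 + ["Healthy"] * 2 + ["Progressing"] * 2
masked = masked_unhealthy_count(timeline)
print(masked)  # [1, 2, 3, 4, None, None, 5, 6]
```

In this model the line disappears while the app is Healthy, but when the app turns unhealthy again the count picks up at 5 rather than 1, because the earlier unhealthy samples are still inside the 20-minute window.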
