Dataproc local disk usage metrics

I'm trying to monitor local disk usage (as a percentage) on Dataproc 2.0 using cloud metrics. This would be useful for catching situations where Spark temporary files fill up the disk.

By default, Dataproc seems to send only local disk performance metrics, CPU and similar metrics, and cluster-level HDFS metrics, but not local disk usage.

There seems to be a Stackdriver agent installed on the Dataproc image, but it is not running, so apparently Dataproc collects metrics some other way. I checked that the df plugin is enabled in /etc/stackdriver/collectd.conf. However, starting the agent fails:

Jul 16 03:01:57 metrics-test-m systemd[1]: Starting LSB: start and stop Stackdriver Agent...
Jul 16 03:01:57 metrics-test-m stackdriver-agent[3829]: Starting Stackdriver metrics collection agent: stackdriver-agentThe instance has neither the application default credentials file nor the correct monitoring scopes; Exiting. ... failed!
Jul 16 03:01:57 metrics-test-m stackdriver-agent[3829]: not starting, configuration/credentials error. ... failed!
Jul 16 03:01:57 metrics-test-m stackdriver-agent[3829]:  (warning).
Jul 16 03:01:57 metrics-test-m systemd[1]: Started LSB: start and stop Stackdriver Agent.

Is it possible to somehow monitor local disk usage in Dataproc and push the metrics to Google Cloud Metrics?

1 answer

  • answered 2021-07-16 03:59 Dagang

    The Google Cloud Monitoring agent is installed on Dataproc cluster VMs, but it is disabled by default.

    Adding --properties dataproc:dataproc.monitoring.stackdriver.enable=true when creating the cluster enables it. The agent collects guest OS metrics, including memory and disk usage, so you can then view them in Cloud Metrics. See the property in this doc.
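    As a minimal sketch of how that flag fits into cluster creation: the function name create_monitored_cluster and the example cluster name and region are placeholders I made up, and the image version is taken from the question. Only the --properties value comes from the answer itself.

```shell
# Sketch: create a Dataproc cluster with the Monitoring agent enabled.
# $1: cluster name, $2: region (both are values you supply).
create_monitored_cluster() {
  gcloud dataproc clusters create "$1" \
    --region="$2" \
    --image-version=2.0 \
    --properties=dataproc:dataproc.monitoring.stackdriver.enable=true
}
```

    It would be called as, for example, create_monitored_cluster metrics-test us-central1. The property is passed at creation time, so an existing cluster would presumably need to be recreated to pick it up.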

    BTW, CPU usage is collected by default and doesn't depend on the agent because GCE collects it from the VM host. The VM host has no knowledge of memory or local disk usage, though, so those have to be collected from inside the guest OS, which is why they depend on the agent. Once you enable the agent, there will be two CPU usage metrics of different types: one (compute) from the VM host's perspective, the other (agent) from the guest OS's perspective.
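    Once the agent is running, the disk usage percentage should be queryable like any other agent metric. The sketch below assumes the agent reports it as agent.googleapis.com/disk/percent_used with a state label, and that the VM is addressed by its numeric instance ID; the function names are made up for illustration, and the 10-minute window relies on GNU date.

```shell
# Build a Cloud Monitoring filter for the agent's disk usage percentage.
# $1: numeric GCE instance ID of a cluster VM (a value you supply).
disk_used_filter() {
  printf 'metric.type="agent.googleapis.com/disk/percent_used" AND metric.labels.state="used" AND resource.labels.instance_id="%s"' "$1"
}

# List the last 10 minutes of samples through the Monitoring API.
# $1: project ID, $2: instance ID. Assumes curl, gcloud and GNU date.
list_disk_used() {
  curl -sS -G "https://monitoring.googleapis.com/v3/projects/$1/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=$(disk_used_filter "$2")" \
    --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

    The same filter string can be pasted into Metrics Explorer or an alerting policy; swapping the metric type for compute.googleapis.com/instance/cpu/utilization or agent.googleapis.com/cpu/utilization would show the two CPU views mentioned above.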
