The standard GPU utilization metric reported by nvidia-smi, nvtop, Weights & Biases, Amazon CloudWatch, Google Cloud Monitoring, and Azure Monitor is highly misleading. It reports the fraction of time that any kernel is running on the GPU, which means a GPU can report 100% utilization even if only a small portion of its compute capacity is actually being used. In practice, we've seen workloads with ~1–10% real compute throughput while dashboards show 100%.
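For context, here's a minimal sketch of how that coarse metric is typically polled through NVML's Python bindings (the same counter nvidia-smi-style dashboards surface). This is an illustration of the metric we're criticizing, not how Utilyze works; the package name and sampling loop are just assumptions for the sketch:

```python
# Sketch: polling NVML's "GPU utilization" counter, the metric most
# dashboards report. It is the percentage of the sample window during
# which *any* kernel was executing, not how much of the chip's compute
# capacity that kernel actually used.
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    for _ in range(10):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu: % of time a kernel was resident on the GPU
        # util.memory: % of time the memory controller was busy
        print(f"kernel-active: {util.gpu}%  mem-controller-active: {util.memory}%")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```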
This becomes a problem when teams rely on that metric for capacity planning or optimization decisions: it can make badly underutilized systems look saturated.
We're releasing an open-source (Apache 2.0) tool, Utilyze, to measure GPU utilization differently. It samples hardware performance counters and reports compute and memory throughput relative to the hardware's theoretical limits. It also estimates an attainable utilization ceiling for a given workload.
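To illustrate the kind of number we mean by "throughput relative to the hardware's theoretical limits", here's a back-of-the-envelope model-FLOPs-utilization calculation. This is not Utilyze's counter-based method, and every workload figure below is a hypothetical placeholder:

```python
# Sketch (not Utilyze's implementation): estimate achieved FLOP/s for a
# training step and compare it against the device's theoretical peak.
PEAK_TFLOPS = 312.0        # e.g. A100 BF16 dense peak, per NVIDIA's spec sheet
params = 1.3e9             # model parameters (hypothetical)
tokens_per_step = 16_384   # tokens processed per step on this GPU (hypothetical)
step_time_s = 0.9          # measured wall-clock time per step (hypothetical)

# Roughly 6 FLOPs per parameter per token for a forward + backward pass.
achieved_flops = 6 * params * tokens_per_step
achieved_tflops = achieved_flops / step_time_s / 1e12

mfu = achieved_tflops / PEAK_TFLOPS
print(f"achieved: {achieved_tflops:.1f} TFLOP/s  "
      f"model FLOPs utilization: {mfu:.1%}")
```

A workload like this can report ~45% real utilization while the kernel-activity metric above sits at 100% the entire time.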
GitHub link: https://github.com/systalyze/utilyze
We'd love to hear your thoughts!

Discussion (8 Comments)
But if you really care about this, you should actually profile your application. Nsight Systems makes this pretty simple to do. Dunno how many actually care about having a TUI.
On nsys, agreed it's great, but we wanted something that could run continuously instead of an offline analysis tool. We think there's room for both to be useful.
At the moment (v0.1.3) it is more helpful for compute visualization, but it doesn't keep track of memory usage/processes/temperature/fan speed/etc., which prevents this from becoming a full-on drop-in replacement for `nvidia-smi` for me.