nvitop
nvitop, an interactive NVIDIA-GPU process viewer, the one-stop solution for GPU process management. (screenshots)
This project is inspired by nvidia-htop and nvtop for monitoring, and gpustat for application integration.
nvidia-htop is a tool for enriching the output of `nvidia-smi`. It uses regular expressions to read the output of `nvidia-smi` from a subprocess, which is inefficient. Meanwhile, there is a powerful interactive GPU monitoring tool called nvtop, but nvtop is written in C, which limits its portability, and, more inconveniently, you have to compile it yourself during installation. Therefore, I made this repo. I got a lot of help from reading the source code of ranger, the console file manager. Some files in this repo are copied and modified from ranger under the GPLv3 License.
So far, `nvitop` is in the beta phase, and most features have been tested on Linux. If you are using Windows with NVIDIA GPUs, please submit feedback on the issue page, thank you very much!
If this repo is useful to you, please give it a star.
Compared with `nvidia-smi`:
Features
- Informative and fancy output: show more information than `nvidia-smi` with colorized fancy box drawing.
- Monitor mode: can run as a resource monitor, rather than print the results only once. (vs. nvidia-htop, limited support with command `watch -c`)
- Interactive: responsive to user input in monitor mode. (vs. gpustat & py3nvml)
- Efficient:
  - query device status using NVML Python bindings directly, instead of parsing the output of `nvidia-smi` (vs. nvidia-htop); see the sketch after this list.
  - cache results with `ttl_cache` from cachetools. (vs. gpustat)
  - display information using the `curses` library rather than `print` with ANSI escape codes. (vs. py3nvml)
  - asynchronously gather information using multithreading and respond to user input much faster. (vs. nvtop)
- Portable: work on both Linux and Windows.
  - get host process information using the cross-platform library psutil instead of calling `ps -p` in a subprocess. (vs. nvidia-htop & py3nvml)
  - written in pure Python, easy to install with `pip`. (vs. nvtop)
- Integrable: easy to integrate into other applications, more than monitoring. (vs. nvidia-htop & nvtop)
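For a sense of what "querying NVML directly" means, here is a minimal sketch using the `pynvml` module from the nvidia-ml-py package; this illustrates the approach, not `nvitop`'s actual implementation:

import pynvml  # the official NVML Python bindings shipped in nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)               # first GPU
memory = pynvml.nvmlDeviceGetMemoryInfo(handle)             # total/free/used, in bytes
utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)  # SM/memory utilization, in %
print('GPU 0: {}% SM, {}MiB used'.format(utilization.gpu, memory.used // (1 << 20)))
pynvml.nvmlShutdown()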
Requirements
- Python 3.5+
- NVIDIA Management Library (NVML)
- nvidia-ml-py
- psutil
- cachetools
- curses
- termcolor
NOTE: The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing various states of NVIDIA GPU devices. The runtime version of the NVML library ships with the NVIDIA display driver (available at Download Drivers | NVIDIA), or can be downloaded as part of the NVIDIA CUDA Toolkit (available at CUDA Toolkit | NVIDIA Developer). The lists of OS platforms and NVIDIA GPUs supported by the NVML library can be found in the NVML API Reference.
Installation
pip3 install --upgrade nvitop
Install the latest version from GitHub:
pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop
Or, clone this repo and install manually:
git clone --depth=1 https://github.com/XuehaiPan/nvitop.git
cd nvitop
pip3 install .
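After installation, you can run a quick sanity check against the driver using the public API described under "More than Monitoring" below (an illustrative snippet, assuming at least one NVIDIA device is visible):

from nvitop import Device

print('Driver:', Device.driver_version())
print('Devices:', Device.count())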
IMPORTANT: `pip` will install `nvidia-ml-py==11.450.51` as a dependency for `nvitop`. Please verify whether the `nvidia-ml-py` package is compatible with your NVIDIA driver version. Since `nvidia-ml-py==11.450.129`, the definition of `nvmlProcessInfo_t` has introduced two new fields, `gpuInstanceId` and `computeInstanceId` (`GI ID` and `CI ID` in newer `nvidia-smi`), which are incompatible with some older NVIDIA drivers. `nvitop` may not display processes correctly due to this incompatibility. You can check the release history of `nvidia-ml-py` at nvidia-ml-py's Release History, and install a compatible version manually.
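For example, the following hedged snippet queries the driver version that NVML reports, which you can compare against the nvidia-ml-py release notes before pinning a version (`pynvml` is the module provided by nvidia-ml-py; the return type varies between str and bytes across binding versions):

import pynvml

pynvml.nvmlInit()
driver_version = pynvml.nvmlSystemGetDriverVersion()
if isinstance(driver_version, bytes):  # older bindings return bytes
    driver_version = driver_version.decode()
print('NVIDIA driver:', driver_version)
pynvml.nvmlShutdown()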
Usage
Device and Process Status
Query the device and process status. The output is similar to `nvidia-smi`, but has been enriched and colorized.
# Query status of all devices
$ nvitop
# Specify query devices
$ nvitop -o 0 1 # only show devices 0 and 1
# Only show devices in `CUDA_VISIBLE_DEVICES`
$ nvitop -ov
NOTE: `nvitop` uses only one character to indicate the type of processes. `C` stands for compute processes, `G` for graphics processes, and `X` for processes with both contexts (i.e. mi(x)ed, shown as `C+G` in `nvidia-smi`).
Resource Monitor
Run as a resource monitor:
# Automatically configure the display mode according to the terminal size
$ nvitop -m
# Forcibly display as `full` mode
$ nvitop -m full
# Forcibly display as `compact` mode
$ nvitop -m compact
# Specify query devices
$ nvitop -m -o 0 1 # only show devices 0 and 1
# Only show devices in `CUDA_VISIBLE_DEVICES`
$ nvitop -m -ov
Press `q` to return to the terminal.
For Docker Users
Build and run the Docker image using nvidia-docker:
docker build --tag nvitop:latest .
docker run --interactive --tty --rm --runtime=nvidia --gpus all --pid=host nvitop:latest -m
NOTE: Don't forget to add the `--pid=host` option when running the container.
For SSH Users
Run `nvitop` directly on the SSH session instead of a login shell:
ssh user@host -t nvitop -m # installed by `sudo pip3 install ...`
ssh user@host -t '~/.local/bin/nvitop' -m # installed by `pip3 install --user ...`
NOTE: Users need to add the `-t` option to allocate a pseudo-terminal over the SSH session for monitor mode.
Type `nvitop --help` for more information:
usage: nvitop [--help] [--version] [--monitor [{auto,full,compact}]]
[--only idx [idx ...]] [--only-visible]
[--gpu-util-thresh th1 th2] [--mem-util-thresh th1 th2]
[--ascii]
An interactive NVIDIA-GPU process viewer.
optional arguments:
--help, -h show this help message and exit
--version show program's version number and exit
--monitor [{auto,full,compact}], -m [{auto,full,compact}]
Run as a resource monitor. Continuously report query data,
rather than the default of just once.
If no argument is given, the default mode `auto` is used.
--only idx [idx ...], -o idx [idx ...]
Only show the specified devices, suppress option `--only-visible`.
--only-visible, -ov Only show devices in environment variable `CUDA_VISIBLE_DEVICES`.
--gpu-util-thresh th1 th2
Thresholds of GPU utilization to distinguish load intensity.
Coloring rules: light < th1 % <= moderate < th2 % <= heavy.
( 1 <= th1 < th2 <= 99, defaults: 10 75 )
--mem-util-thresh th1 th2
Thresholds of GPU memory utilization to distinguish load intensity.
Coloring rules: light < th1 % <= moderate < th2 % <= heavy.
( 1 <= th1 < th2 <= 99, defaults: 10 80 )
--ascii Use ASCII characters only, which is useful for terminals without Unicode support.
Keybindings for Monitor Mode
Key | Binding |
---|---|
`q` | Quit and return to the terminal. |
`h` | Go to the help screen. |
`a` / `f` / `c` | Change the display mode to auto / full / compact. |
`<Left>` / `<Right>` / `[` / `]` | Scroll the host information of processes. |
`^` | Scroll left to the beginning of the process entry (i.e. beginning of line). |
`$` | Scroll right to the end of the process entry (i.e. end of line). |
`<Up>` / `<Down>` / `<Tab>` / `<S-Tab>` | Select and highlight a process. |
`<Home>` | Select the first process. |
`<End>` | Select the last process. |
`<Esc>` | Clear process selection. |
`I` | Send `signal.SIGINT` to the selected process (interrupt). |
`T` | Send `signal.SIGTERM` to the selected process (terminate). |
`K` | Send `signal.SIGKILL` to the selected process (kill). |
`,` / `.` | Select the sort column. |
`/` | Reverse the sort order. |
`on` (`oN`) | Sort processes in the natural order, i.e., in ascending (descending) order of GPU index. |
`ou` (`oU`) | Sort processes by USER in ascending (descending) order. |
`op` (`oP`) | Sort processes by PID in descending (ascending) order. |
`og` (`oG`) | Sort processes by GPU-MEM in descending (ascending) order. |
`os` (`oS`) | Sort processes by %SM in descending (ascending) order. |
`oc` (`oC`) | Sort processes by %CPU in descending (ascending) order. |
`om` (`oM`) | Sort processes by %MEM in descending (ascending) order. |
`ot` (`oT`) | Sort processes by TIME in descending (ascending) order. |
NOTE: Press the `CTRL` key to multiply the mouse wheel events by `5`.
More than Monitoring
`nvitop` can be easily integrated into other applications.
Device
In [1]: from nvitop import host, Device, HostProcess, GpuProcess, NA
In [2]: Device.driver_version()
Out[2]: '430.64'
In [3]: Device.cuda_version()
Out[3]: '10.1'
In [4]: Device.count()
Out[4]: 10
In [5]: all_devices = Device.all()
...: all_devices
Out[5]: [
Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=2, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=3, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=4, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=5, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=6, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=7, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=8, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
Device(index=9, name="GeForce RTX 2080 Ti", total_memory=11019MiB)
]
In [6]: nvidia0 = Device(0) # from device index
...: nvidia0
Out[6]: Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB)
In [7]: nvidia0.memory_used() # in bytes
Out[7]: 9293398016
In [8]: nvidia0.memory_used_human()
Out[8]: '8862MiB'
In [9]: nvidia0.gpu_utilization() # in percentage
Out[9]: 5
In [10]: nvidia0.processes()
Out[10]: {
52059: GpuProcess(pid=52059, gpu_memory=7885MiB, type=C, device=Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=52059, name='ipython3', status='sleeping', started='14:31:22')),
53002: GpuProcess(pid=53002, gpu_memory=967MiB, type=C, device=Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=53002, name='python', status='running', started='14:31:59'))
}
In [11]: nvidia1 = Device(bus_id='00000000:05:00.0') # from PCI bus ID
...: nvidia1
Out[11]: Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB)
In [12]: nvidia1_snapshot = nvidia1.as_snapshot()
...: nvidia1_snapshot
Out[12]: DeviceSnapshot(
real=Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
bus_id='00000000:05:00.0',
compute_mode='Default',
display_active='Off',
ecc_errors='N/A',
fan_speed=22, # in percentage
fan_speed_string='22%', # in percentage
gpu_utilization=17, # in percentage
gpu_utilization_string='17%', # in percentage
index=1,
memory_free=10462232576, # in bytes
memory_free_human='9977MiB',
memory_total=11554717696, # in bytes
memory_total_human='11019MiB',
memory_usage='1041MiB / 11019MiB',
memory_used=1092485120, # in bytes
memory_used_human='1041MiB',
memory_utilization=9.5, # in percentage
memory_utilization_string='9.5%', # in percentage
name='GeForce RTX 2080 Ti',
performance_state='P2',
persistence_mode='Off',
power_limit=250000, # in milliwatts (mW)
power_status='66W / 250W', # in watts (W)
power_usage=66051, # in milliwatts (mW)
temperature=39, # in Celsius
temperature_string='39C' # in Celsius
)
In [13]: nvidia1_snapshot.memory_utilization_string # snapshot uses properties instead of function calls
Out[13]: '9%'
In [14]: nvidia1_snapshot.encoder_utilization # the snapshot automatically retrieves missing attributes from `real`
Out[14]: [0, 1000000]
In [15]: nvidia1_snapshot
Out[15]: DeviceSnapshot(
real=Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
bus_id='00000000:05:00.0',
compute_mode='Default',
display_active='Off',
ecc_errors='N/A',
encoder_utilization=[0, 1000000], ##### <-- new entry #####
fan_speed=22, # in percentage
fan_speed_string='22%', # in percentage
gpu_utilization=17, # in percentage
gpu_utilization_string='17%', # in percentage
index=1,
memory_free=10462232576, # in bytes
memory_free_human='9977MiB',
memory_total=11554717696, # in bytes
memory_total_human='11019MiB',
memory_usage='1041MiB / 11019MiB',
memory_used=1092485120, # in bytes
memory_used_human='1041MiB',
memory_utilization=9.5, # in percentage
memory_utilization_string='9.5%', # in percentage
name='GeForce RTX 2080 Ti',
performance_state='P2',
persistence_mode='Off',
power_limit=250000, # in milliwatts (mW)
power_status='66W / 250W', # in watts (W)
power_usage=66051, # in milliwatts (mW)
temperature=39, # in Celsius
temperature_string='39C' # in Celsius
)
NOTE: The entry values may be 'N/A' (type: NaType) when the corresponding resources are not applicable. You can use checks like `if entry != 'N/A'` to avoid exceptions. It is safe to use `float(entry)` for numbers, since 'N/A' will be converted to math.nan. For example:
memory_used: Union[int, NaType] = device.memory_used() # memory usage in bytes or `N/A`
memory_used_in_mib: float = float(memory_used) / (1 << 20) # memory usage in Mebibytes (MiB) or `math.nan`
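Putting the Device API together, here is a hedged sketch of a simple polling loop built from the calls shown above (`memory_total_human()` is assumed to mirror `memory_used_human()`, matching the snapshot field of the same name):

import time

from nvitop import Device

devices = Device.all()
for _ in range(3):  # a few polls for illustration; use `while True` in a daemon
    for device in devices:
        print('GPU {}: {}% util, {} / {}'.format(device.index,
                                                 device.gpu_utilization(),
                                                 device.memory_used_human(),
                                                 device.memory_total_human()))
    time.sleep(5)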
Process
In [16]: processes = nvidia1.processes() # type: Dict[int, GpuProcess]
...: processes
Out[16]: {
23266: GpuProcess(pid=23266, gpu_memory=1031MiB, type=C, device=Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=23266, name='python3', status='running', started='2021-05-10 21:02:40'))
}
In [17]: process = processes[23266]
...: process
Out[17]: GpuProcess(pid=23266, gpu_memory=1031MiB, type=C, device=Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=23266, name='python3', status='running', started='2021-05-10 21:02:40'))
In [18]: process.status()
Out[18]: 'running'
In [19]: process.cmdline() # type: List[str]
Out[19]: ['python3', 'rllib_train.py']
In [20]: process.command() # type: str
Out[20]: 'python3 rllib_train.py'
In [21]: process.cwd()
Out[21]: '/home/xxxxxx/Projects/xxxxxx'
In [22]: process.gpu_memory_human()
Out[22]: '1031MiB'
In [23]: process.as_snapshot()
Out[23]: GpuProcessSnapshot(
real=GpuProcess(pid=23266, gpu_memory=1031MiB, type=C, device=Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=23266, name='python3', status='running', started='2021-05-10 21:02:40')),
cmdline=['python3', 'rllib_train.py'],
command='python3 rllib_train.py',
cpu_percent=98.5, # in percentage
cpu_percent_string='98.5%', # in percentage
device=Device(index=1, name="GeForce RTX 2080 Ti", total_memory=11019MiB),
gpu_encoder_utilization=0, # in percentage
gpu_encoder_utilization_string='0%', # in percentage
gpu_decoder_utilization=0, # in percentage
gpu_decoder_utilization_string='0%', # in percentage
gpu_memory=1081081856, # in bytes
gpu_memory_human='1031MiB',
gpu_memory_utilization=9.4, # in percentage
gpu_memory_utilization_string='9.4%', # in percentage
gpu_sm_utilization=0, # in percentage
gpu_sm_utilization_string='0%', # in percentage
identity=(23266, 1620651760.15, 1),
is_running=True,
memory_percent=1.6849018430285683, # in percentage
memory_percent_string='1.7%', # in percentage
name='python3',
pid=23266,
running_time=datetime.timedelta(days=1, seconds=80013, microseconds=470024),
running_time_human='46:13:33',
type='C', # 'C' for Compute / 'G' for Graphics / 'C+G' for Both
username='panxuehai'
)
In [24]: process.kill()
In [25]: list(map(Device.processes, all_devices)) # all processes
Out[25]: [
{
52059: GpuProcess(pid=52059, gpu_memory=7885MiB, type=C, device=Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=52059, name='ipython3', status='sleeping', started='14:31:22')),
53002: GpuProcess(pid=53002, gpu_memory=967MiB, type=C, device=Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=53002, name='python', status='running', started='14:31:59'))
},
{},
{},
{},
{},
{},
{},
{},
{
84748: GpuProcess(pid=84748, gpu_memory=8975MiB, type=C, device=Device(index=8, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=84748, name='python', status='running', started='11:13:38'))
},
{
84748: GpuProcess(pid=84748, gpu_memory=8341MiB, type=C, device=Device(index=9, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=84748, name='python', status='running', started='11:13:38'))
}
]
In [26]: import os
...: this = HostProcess(os.getpid())
...: this
Out[26]: HostProcess(pid=35783, name='python', status='running', started='19:19:00')
In [27]: this.cmdline() # type: List[str]
Out[27]: ['python', '-c', 'import IPython; IPython.terminal.ipapp.launch_new_instance()']
In [28]: this.command() # not simply `' '.join(cmdline)`; quotes are added
Out[28]: 'python -c "import IPython; IPython.terminal.ipapp.launch_new_instance()"'
In [29]: this.memory_info()
Out[29]: pmem(rss=83988480, vms=343543808, shared=12079104, text=8192, lib=0, data=297435136, dirty=0)
In [30]: import cupy as cp
...: x = cp.zeros((10000, 1000))
...: this = GpuProcess(os.getpid(), nvidia0) # construct from `GpuProcess(pid, device)` explicitly rather than calling `device.processes()`
...: this
Out[30]: GpuProcess(pid=35783, gpu_memory=N/A, type=N/A, device=Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=35783, name='python', status='running', started='19:19:00'))
In [31]: this.update_gpu_status() # update used GPU memory from new driver queries
Out[31]: 267386880
In [32]: this
Out[32]: GpuProcess(pid=35783, gpu_memory=255MiB, type=C, device=Device(index=0, name="GeForce RTX 2080 Ti", total_memory=11019MiB), host=HostProcess(pid=35783, name='python', status='running', started='19:19:00'))
In [33]: id(this) == id(GpuProcess(os.getpid(), nvidia0)) # IMPORTANT: the instance will be reused while the process is running
Out[33]: True
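As a sketch of how the process API composes, the following collects a snapshot of every GPU process on every device, using only the calls demonstrated above:

from nvitop import Device

snapshots = []
for device in Device.all():
    for pid, process in device.processes().items():
        snapshots.append(process.as_snapshot())

for snapshot in snapshots:
    print(snapshot.pid, snapshot.username, snapshot.gpu_memory_human, snapshot.command)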
Host (inherited from psutil)
In [34]: host.cpu_count()
Out[34]: 88
In [35]: host.cpu_percent()
Out[35]: 18.5
In [36]: host.cpu_times()
Out[36]: scputimes(user=2346377.62, nice=53321.44, system=579177.52, idle=10323719.85, iowait=28750.22, irq=0.0, softirq=11566.87, steal=0.0, guest=0.0, guest_nice=0.0)
In [37]: host.load_average()
Out[37]: (14.88, 17.8, 19.91)
In [38]: host.virtual_memory()
Out[38]: svmem(total=270352478208, available=192275968000, percent=28.9, used=53350518784, free=88924037120, active=125081112576, inactive=44803993600, buffers=37006450688, cached=91071471616, shared=23820632064, slab=8200687616)
In [39]: host.swap_memory()
Out[39]: sswap(total=65534947328, used=475136, free=65534472192, percent=0.0, sin=2404139008, sout=4259434496)
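Since these host queries mirror psutil's interface, they combine naturally into a one-line system summary; an illustrative example:

from nvitop import host

memory = host.virtual_memory()
swap = host.swap_memory()
print('CPU: {}%  MEM: {}%  SWAP: {}%  LOAD: {} {} {}'.format(
    host.cpu_percent(), memory.percent, swap.percent, *host.load_average()))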
Screenshots
Example output of `nvitop`:

Example output of `nvitop -m`:

Full | Compact |
---|---|
License
`nvitop` is released under the GNU General Public License, version 3 (GPLv3).

NOTE: Please feel free to use `nvitop` as a package or dependency for your own projects. However, if you want to add or modify some features of `nvitop`, or copy some source code of `nvitop` into your own code, the source code should also be released under the GPLv3 License (as `nvitop` contains some modified source code from ranger under the GPLv3 License).