Nvidia systems

This is a short guide on how to use Nvidia GPUs.

Build a container

FROM docker.io/tensorflow/tensorflow:2.13.0-gpu

USER root

RUN apt-get update && apt-get install -y python3 python3-pip wget vim git fish libgl1-mesa-glx libglib2.0-0 libc6
RUN python3 -m pip install --upgrade pip
RUN pip3 install --target=/usr/local/lib/python3.8/dist-packages pycolonies opencv-python tqdm Pillow scikit-learn keras matplotlib numpy

Note the --target flag passed to pip3. Without it, pip3 installs modules to ~/.local/lib/python3.8/dist-packages. However, when using Singularity, the home directory is normally mounted from the host system, which means that the installed Python libraries would not be available in the container as intended.
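To double-check where packages resolve from inside the container, Python's standard library can report the import paths. This is a generic sanity check, not part of the ColonyOS tooling; numpy is just one of the packages installed above:

```python
import importlib.util
import sysconfig

# Directory where system-wide packages resolve from; with the --target
# flag above this should be under /usr/local/lib, not a path inside
# the (host-mounted) home directory.
print(sysconfig.get_paths()["purelib"])

# Check where a specific package would be imported from:
spec = importlib.util.find_spec("numpy")
print(spec.origin if spec else "numpy is not installed")
```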

Build and push the container image:

docker build -t johan/tensorflow .
docker push johan/tensorflow

Simple Python example

pollinator new -n icekube

Update the project.yaml file:

projectname: gputest
conditions:
  executorNames:
  - icekube
  nodes: 1
  processesPerNode: 1
  cpu: 5000m
  mem: 10000Mi
  walltime: 600
  gpu:
    count: 1
    name: "nvidia-gtx-2080ti"
environment:
  docker: johan/tensorflow
  rebuildImage: false
  cmd: python3
  source: main.py

Replace main.py with this code:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Now, run the code:

pollinator run --follow
Uploading main.py 100% [===============] (590 kB/s)
Uploading hello.txt 100% [===============] (66 kB/s)
INFO[0000] Process submitted                             ProcessID=591c063721f408419e571a65d451187488ceb2624cffa2262e7ad44d80744b37
INFO[0000] Follow process at https://dashboard.colonyos.io/process?processid=591c063721f408419e571a65d451187488ceb2624cffa2262e7ad44d80744b37
Num GPUs Available:  1
INFO[0012] Process finished successfully                 ProcessID=591c063721f408419e571a65d451187488ceb2624cffa2262e7ad44d80744b37

Running nvidia-smi

Create a file named nvidia-smi.json with the following function specification:

{
    "conditions": {
        "executortype": "container-executor",
        "executornames": [
            "dev-docker"
        ],
        "nodes": 2,
        "processespernode": 2,
        "mem": "2000Mi",
        "cpu": "500m",
        "gpu": {
            "name": "nvidia-gtx-2080ti",
            "count": 2
        },
        "walltime": 600
    },
    "funcname": "execute",
    "kwargs": {
        "cmd": "nvidia-smi",
        "args": [],
        "docker-image": "tensorflow/tensorflow:2.14.0rc1-gpu",
        "rebuild-image": false
    },
    "maxexectime": 600,
    "maxretries": 3
}
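The function specification is plain JSON, so it can also be generated programmatically, for example to vary the GPU count across submissions. A minimal sketch with Python's standard library, mirroring the fields in the spec above:

```python
import json

# Same spec as above, built as a Python dict so individual fields
# (e.g. conditions.gpu.count) can be tweaked before writing it out.
spec = {
    "conditions": {
        "executortype": "container-executor",
        "executornames": ["dev-docker"],
        "nodes": 2,
        "processespernode": 2,
        "mem": "2000Mi",
        "cpu": "500m",
        "gpu": {"name": "nvidia-gtx-2080ti", "count": 2},
        "walltime": 600,
    },
    "funcname": "execute",
    "kwargs": {
        "cmd": "nvidia-smi",
        "args": [],
        "docker-image": "tensorflow/tensorflow:2.14.0rc1-gpu",
        "rebuild-image": False,
    },
    "maxexectime": 600,
    "maxretries": 3,
}

with open("nvidia-smi.json", "w") as f:
    json.dump(spec, f, indent=4)
```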

Submit the function specification:

colonies function submit --spec nvidia-smi.json
INFO[0000] Process submitted                             ProcessId=1b93b14c1eb83c4b91bbe33c7f0b1bf35845ac20e1fe371aae0c9bedf3b638df
INFO[0000] Printing logs from process                    ProcessId=1b93b14c1eb83c4b91bbe33c7f0b1bf35845ac20e1fe371aae0c9bedf3b638df
Sun Dec 17 14:34:23 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:24:00.0 Off |                  N/A |
| 32%   32C    P0              35W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
INFO[0010] Process finished successfully                 ProcessId=1b93b14c1eb83c4b91bbe33c7f0b1bf35845ac20e1fe371aae0c9bedf3b638df

AMD/ROCm systems

This is a short guide on how to use AMD/ROCm GPUs.

Build a container

FROM docker.io/rocm/tensorflow:rocm5.2.0-tf2.9-dev

USER root

RUN apt-get update && DEBIAN_FRONTEND="noninteractive" TZ="Europe/Stockholm" apt-get install -y python3 python3-pip wget git fish libgl1-mesa-glx libglib2.0-0
RUN python3 -m pip install --upgrade pip
RUN pip3 install --target=/usr/local/lib/python3.9/dist-packages opencv-python tqdm Pillow scikit-learn keras matplotlib numpy google wrapt typing_extensions packaging opt_einsum gast astunparse termcolor flatbuffers
RUN pip3 install --target=/usr/local/lib/python3.9/dist-packages protobuf==3.20.0

Build and push the container image:

docker build -t johan/rocmtensorflow .
docker push johan/rocmtensorflow

Note that the Docker image becomes almost 9 GB!

Simple Python example

pollinator new -e lumi-standard-gpu-hpcexecutor

Update the project.yaml file:

projectname: gputest
conditions:
  executorType: lumi-standard-gpu-hpcexecutor
  nodes: 1
  processesPerNode: 1
  cpu: 5000m
  mem: 10000Mi
  walltime: 600
  gpu:
    count: 1
    name: ""
environment:
  docker: johan/rocmtensorflow
  rebuildImage: false
  cmd: python3
  source: main.py

Replace main.py with this code:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Now, run the code:

pollinator run --follow
INFO[0000] Process submitted                  ProcessID=192f0c29a89a3f7bc3c620a2306e2cab92709d5af5693736b5cb774536a07070
INFO[0000] Follow process at https://dashboard.colonyos.io/process?processid=192f0c29a89a3f7bc3c620a2306e2cab92709d5af5693736b5cb774536a07070
Num GPUs Available:  1
INFO[0036] Process finished successfully      ProcessID=192f0c29a89a3f7bc3c620a2306e2cab92709d5af5693736b5cb774536a07070

Running rocm-smi

Create a file named rocm-smi.json with the following function specification:

{
    "conditions": {
        "executortype": "container-executor",
        "executornames": [
            "dev-docker"
        ],
        "nodes": 1,
        "processespernode": 1,
        "mem": "10Gi",
        "cpu": "1000m",
        "gpu": {
            "count": 8
        },
        "walltime": 60
    },
    "funcname": "execute",
    "kwargs": {
        "cmd": "rocm-smi",
        "args": [
            ""
        ],
        "docker-image": "johan/rocmtensorflow"
    },
    "maxexectime": 55,
    "maxretries": 3
}

Submit the function specification:

colonies function submit --spec rocm-smi.json
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf    PwrCap  VRAM%  GPU%
0    44.0c  92.0W   800Mhz  1600Mhz  0%   manual  500.0W    0%   0%
1    49.0c  N/A     800Mhz  1600Mhz  0%   manual  0.0W      0%   0%
2    43.0c  88.0W   800Mhz  1600Mhz  0%   manual  500.0W    0%   0%
3    45.0c  N/A     800Mhz  1600Mhz  0%   manual  0.0W      0%   0%
4    48.0c  87.0W   800Mhz  1600Mhz  0%   manual  500.0W    0%   0%
5    48.0c  N/A     800Mhz  1600Mhz  0%   manual  0.0W      0%   0%
6    40.0c  92.0W   800Mhz  1600Mhz  0%   manual  500.0W    0%   0%
7    44.0c  N/A     800Mhz  1600Mhz  0%   manual  0.0W      0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================
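If the concise table needs to be post-processed (for example, to alert on high temperatures), a simple split-based parser is enough, since the columns are whitespace-separated. This sketch is an illustration, not part of the ColonyOS tooling:

```python
def parse_rocm_smi(text):
    """Parse the 'Concise Info' table from rocm-smi into dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("GPU "):
            # Header row: column names like GPU, Temp, AvgPwr, ...
            header = line.split()
            continue
        if header and line and line[0].isdigit():
            # Data row: one entry per GPU, keyed by the header columns
            rows.append(dict(zip(header, line.split())))
    return rows

sample = """\
GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf    PwrCap  VRAM%  GPU%
0    44.0c  92.0W   800Mhz  1600Mhz  0%   manual  500.0W    0%   0%
1    49.0c  N/A     800Mhz  1600Mhz  0%   manual  0.0W      0%   0%
"""

gpus = parse_rocm_smi(sample)
print(gpus[0]["Temp"])  # → 44.0c
```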