Introduction
In this tutorial, we will train a machine learning model to identify water in Sentinel-2 satellite images. We will use code from this GitHub repo together with this dataset.
Upload dataset
We need a dataset to train the ML model. We will use the Colonies CFS to distribute the dataset to different executors.
Method 1 - Colonies CLI
Download the dataset and unzip the archive.zip file into a directory named water_body_dataset.
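If you prefer to script the extraction step, the following is a minimal sketch using only the Python standard library. It assumes the downloaded archive is named archive.zip and sits in the current directory; the function name extract_dataset is hypothetical and not part of Colonies.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive: str, target: str) -> int:
    """Extract every member of `archive` into `target`; return the number of files."""
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        # Directory entries are listed too, so count only regular files.
        return sum(1 for info in zf.infolist() if not info.is_dir())
```

Calling extract_dataset("archive.zip", "water_body_dataset") creates the directory used in the next step.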
Then upload the dataset files to Colonies CFS.
colonies fs sync -l /water -d ./water_body_dataset
| water_body_1114.jpg | 6 KiB | /water |
| water_body_1209.jpg | 10 KiB | /water |
| water_body_1273.jpg | 9 KiB | /water |
| water_body_1552.jpg | 61 KiB | /water |
| water_body_7797.jpg | 4 KiB | /water |
| water_body_8847.jpg | 9 KiB | /water |
| water_body_1313.jpg | 8 KiB | /water |
| water_body_1615.jpg | 18 KiB | /water |
| water_body_1724.jpg | 1 KiB | /water |
| water_body_1801.jpg | 8 KiB | /water |
| water_body_1833.jpg | 11 KiB | /water |
+---------------------+----------+--------+
No files will be downloaded
/water:
=======
No files will be uploaded
No files will be downloaded
Are you sure you want to continue? (yes,no):
After the upload has finished, we can list the labels of the dataset.
colonies fs label ls
+---------------------------------------------------------+-----------------+
| LABEL | NUMBER OF FILES |
+---------------------------------------------------------+-----------------+
| /water/Images | 2841 |
| /water/Masks | 2841 |
+---------------------------------------------------------+-----------------+
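The listing shows the same number of files under /water/Images and /water/Masks. Before uploading, it can be worth checking locally that every image really has a mask. The sketch below assumes that an image and its mask share the same file name in the Images and Masks directories; the function name unpaired_files is hypothetical.

```python
from pathlib import Path


def unpaired_files(images_dir: str, masks_dir: str) -> tuple[set[str], set[str]]:
    """Return (images without a mask, masks without an image), compared by file name."""
    images = {p.name for p in Path(images_dir).iterdir() if p.is_file()}
    masks = {p.name for p in Path(masks_dir).iterdir() if p.is_file()}
    return images - masks, masks - images
```

Two empty sets mean the local dataset is consistent, for example after calling unpaired_files("water_body_dataset/Images", "water_body_dataset/Masks").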
To access the dataset from an executor, the executor first needs to synchronize the data. This can be achieved in several ways; one method is to specify the /water label in the fs section of the function specification. The executor will then synchronize the dataset files to its local file system.
Alternatively, you can submit a function to an executor, requesting it to synchronize a specific label to its local file system without launching a container. The function specification below downloads the dataset to the Leonardo HPC system.
{
    "conditions": {
        "executortype": "container-executor",
        "executornames": [
            "leonardo-booster"
        ],
        "nodes": 1,
        "processespernode": 1,
        "cpu": "1000m",
        "mem": "30Gi",
        "gpu": {
            "count": 0
        },
        "walltime": 60000
    },
    "funcname": "sync",
    "fs": {
        "mount": "/cfs",
        "dirs": [
            {
                "label": "/water",
                "dir": "/water",
                "keepfiles": true,
                "onconflicts": {
                    "onstart": {
                        "keeplocal": false
                    },
                    "onclose": {
                        "keeplocal": true
                    }
                }
            }
        ]
    },
    "maxwaittime": -1,
    "maxexectime": 60000,
    "maxretries": 3
}
The dataset will then be available in /cfs/water/Images/ and /cfs/water/Masks/ in the container running on Leonardo.
colonies function submit --spec sync.json
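When the same sync function has to be submitted for several labels or executors, the specification can be generated instead of edited by hand. The following is a sketch that mirrors the fields of the specification above; the helper names make_sync_spec and write_spec are hypothetical, and the executor type container-executor is an assumption taken from that specification.

```python
import json


def make_sync_spec(label: str, executor_name: str, walltime: int = 60000) -> dict:
    """Build a function specification that only synchronizes `label` to the executor."""
    return {
        "conditions": {
            "executortype": "container-executor",  # assumption, see lead-in
            "executornames": [executor_name],
            "nodes": 1,
            "processespernode": 1,
            "cpu": "1000m",
            "mem": "30Gi",
            "gpu": {"count": 0},
            "walltime": walltime,
        },
        "funcname": "sync",
        "fs": {
            "mount": "/cfs",
            "dirs": [
                {
                    "label": label,
                    "dir": label,
                    "keepfiles": True,
                    "onconflicts": {
                        "onstart": {"keeplocal": False},
                        "onclose": {"keeplocal": True},
                    },
                }
            ],
        },
        "maxwaittime": -1,
        "maxexectime": walltime,
        "maxretries": 3,
    }


def write_spec(spec: dict, path: str) -> None:
    """Write the specification as indented JSON, ready for `colonies function submit`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(spec, f, indent=4)
```

For example, write_spec(make_sync_spec("/water", "leonardo-booster"), "sync.json") produces a file that can be passed to the submit command shown above.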
Method 2 - Pollinator
First, find a target executor.
╭──────────────────┬────────────────────┬────────────────┬─────────────────────╮
│ NAME │ TYPE │ LOCATION │ LAST HEARD FROM │
├──────────────────┼────────────────────┼────────────────┼─────────────────────┤
│ icekube │ container-executor │ RISE, Sweden │ 2024-02-23 12:35:39 │
│ dev │ container-executor │ Rutvik, Sweden │ 2024-02-23 12:35:39 │
│ lumi-standard │ container-executor │ CSC, Finland │ 2024-02-23 12:35:41 │
│ leonardo-booster │ container-executor │ Cineca, Italy │ 2024-02-23 12:34:44 │
╰──────────────────┴────────────────────┴────────────────┴─────────────────────╯
Generate an empty working directory, targeting the LUMI HPC system. Note that the target executor type can be changed later.
mkdir waterml
cd waterml
pollinator new -e lumi-small-hpcexecutor
INFO[0000] Creating directory Dir=./cfs/src
INFO[0000] Creating directory Dir=./cfs/data
INFO[0000] Creating directory Dir=./cfs/result
INFO[0000] Generating Filename=./project.yaml
INFO[0000] Generating Filename=./cfs/data/hello.txt
INFO[0000] Generating Filename=./cfs/src/main.py
Copy the water_body_dataset directory to the ./cfs/data directory.
cp -r ~/water_body_dataset ./cfs/data
The dataset will be uploaded the next time the project runs.
pollinator run --follow
Uploading main.py 100% [===============] (4.3 MB/s)
Downloading water_body_8239.jpg 100% [===============] (248 kB/s)
Downloading water_body_701.jpg 100% [===============] (484 kB/s)
Downloading water_body_8159.jpg 100% [===============] (148 kB/s)
Downloading water_body_683.jpg 100% [===============] (145 kB/s)
Downloading water_body_967.jpg 100% [===============] (350 kB/s)
Downloading water_body_784.jpg 100% [===============] (906 kB/s)
Downloading water_body_922.jpg 100% [===============] (161 kB/s)
Downloading water_body_233.jpg 100% [===============] (251 kB/s)
Downloading water_body_1206.jpg 100% [===============] (720 kB/s)
Downloading water_body_1708.jpg 100% [===============] (1.3 MB/s)
Downloading water_body_2461.jpg 100% [===============] (560 kB/s)
...
The data set will then be available here in the running container:
import os

# PROJECT_DIR is set in the running container and points to the project root
projdir = os.environ.get("PROJECT_DIR")
image_path = projdir + '/data/water/Images/'
mask_path = projdir + '/data/water/Masks/'
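If PROJECT_DIR is not set, os.environ.get returns None and the string concatenation above fails with a TypeError that says little about the cause. A slightly more defensive variant is sketched below; the function name dataset_dirs and its subdir parameter are hypothetical, and the data/water layout is taken from the snippet above.

```python
import os
from pathlib import Path


def dataset_dirs(subdir: str = "data/water") -> tuple[Path, Path]:
    """Return the Images and Masks directories below $PROJECT_DIR, failing early if missing."""
    projdir = os.environ.get("PROJECT_DIR")
    if projdir is None:
        raise RuntimeError("PROJECT_DIR is not set")
    base = Path(projdir) / subdir
    image_path, mask_path = base / "Images", base / "Masks"
    for path in (image_path, mask_path):
        if not path.is_dir():
            raise FileNotFoundError(f"expected dataset directory {path}")
    return image_path, mask_path
```

Failing before training starts makes a missing or not yet synchronized dataset easy to spot in the process log.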
Docker container
We are going to use the Container Executor, which comes in three variants.
Kube Executor runs containers as Kubernetes batch jobs.
Docker Executor runs containers as Docker containers on bare-metal servers or VMs.
HPC Executor runs containers as Singularity containers on HPC systems, managing them as Slurm jobs.
The function specification is identical for all three variants, so we can easily switch between them. To run containers, we first need to create a Dockerfile with the following content:
FROM docker.io/tensorflow/tensorflow:2.13.0-gpu
RUN apt-get update && apt-get install -y python3 python3-pip wget vim git fish libgl1-mesa-glx libglib2.0-0
RUN python3 -m pip install --upgrade pip
RUN pip3 install pycolonies opencv-python tqdm Pillow scikit-learn keras matplotlib numpy
Build the Docker image and publish it to a public Docker registry.
docker build -t johan/hackaton .
docker push johan/hackaton
The johan/hackaton Docker image has already been published on Docker Hub.
Training the model
Now that we have uploaded the dataset and created a Docker container, it’s time to proceed with training the model.
Set up a Pollinator project
Create a new Pollinator project (or use the one you already created when uploading the dataset).
In this example, we assume that the water dataset is available in Colonies CFS under the label /water.
mkdir waterml
cd waterml
pollinator new -n leonardo-booster
Edit the project.yaml file. Change the Docker image to johan/hackaton, increase the required memory to 30000Mi, and use 4 CPU cores (4000m).
The walltime defines the maximum time the process may run. In this case, it has to finish within 2000 seconds.
projectname: 559ac0c3a834594b337d10ebedf3134ea0ca3142cceab26b1aa5c17ba141999d
conditions:
  executorType: leonardo-booster-hpcexecutor
  nodes: 1
  processesPerNode: 1
  cpu: 4000m
  mem: 30000Mi
  walltime: 2000
  gpu:
    count: 1
    name: ""
environment:
  docker: johan/hackaton
  rebuildImage: false
  cmd: python3
  source: main.py
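To make the resource values above concrete, the short sketch below converts them to plain numbers. It assumes the Kubernetes-style notation, where the m suffix means millicores and Ki, Mi, and Gi are binary (1024-based) units; the helper names cpu_cores and mem_bytes are hypothetical.

```python
def cpu_cores(cpu: str) -> float:
    """Convert a CPU request such as "4000m" (millicores) or "4" to cores."""
    return int(cpu[:-1]) / 1000 if cpu.endswith("m") else float(cpu)


_BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def mem_bytes(mem: str) -> int:
    """Convert a memory request such as "30000Mi" or "30Gi" to bytes."""
    for suffix, factor in _BINARY_UNITS.items():
        if mem.endswith(suffix):
            return int(mem[: -len(suffix)]) * factor
    return int(mem)  # no suffix: treat the value as bytes


print(cpu_cores("4000m"))                        # 4.0 cores
print(round(mem_bytes("30000Mi") / 2**30, 1))    # 29.3 GiB
print(divmod(2000, 60))                          # (33, 20): 33 min 20 s of walltime
```

Under these assumptions, 30000Mi is slightly less than 30Gi (which equals 30720Mi), and a walltime of 2000 seconds gives the training run a little over half an hour.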
Replace main.py
Download the source code from this GitHub repo.
cd cfs/src
wget -O main.py https://raw.githubusercontent.com/johankristianss/colonyoshackaton/main/src/main.py
At line 132, change epochs to, for example, 30.
epochs = 30
Note that the Python code saves the training result and a random prediction example in the result directory, which is automatically synchronized back to the client after process completion.
plt.savefig(projdir + '/result/res_' + processid + '.png')
plt.savefig(projdir + '/result/samples_' + processid + '.png')
ls cfs/result
.rw-r--r-- 55k johan 12 Dec 21:40 res_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png
.rw-r--r-- 266k johan 12 Dec 21:40 samples_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png
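Because each file name carries an ID, results from several runs can sit side by side in cfs/result. The sketch below groups the PNG files by that ID on the client side. It assumes the res_<id>.png and samples_<id>.png naming shown above; the function name results_by_process is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def results_by_process(result_dir: str) -> dict[str, dict[str, Path]]:
    """Map process ID -> {"res": path, "samples": path} for PNG files in `result_dir`."""
    grouped: dict[str, dict[str, Path]] = defaultdict(dict)
    for path in sorted(Path(result_dir).glob("*.png")):
        # "res_<id>" -> ("res", "_", "<id>"); files without "_" are skipped.
        kind, sep, processid = path.stem.partition("_")
        if sep and kind in ("res", "samples"):
            grouped[processid][kind] = path
    return dict(grouped)
```

Calling results_by_process("cfs/result") returns one entry per run, which makes it easy to open the training plot and the sample predictions that belong together.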
Train the model
Pollinator will automatically synchronize the cfs/src, cfs/data, and cfs/result directories to Colonies CFS, generate and submit a function specification, follow the process execution, and, upon completion, synchronize the project files back to your local computer.
pollinator run --follow
67/67 [==============================] - 1s 18ms/step - loss: 0.3434 - accuracy: 0.7024 - val_loss: 0.3263 - val_accuracy: 0.7038
Epoch 25/30
67/67 [==============================] - 1s 17ms/step - loss: 0.3307 - accuracy: 0.7092 - val_loss: 0.3146 - val_accuracy: 0.7121
Epoch 26/30
67/67 [==============================] - 1s 18ms/step - loss: 0.3139 - accuracy: 0.7140 - val_loss: 0.2947 - val_accuracy: 0.7249
Epoch 27/30
67/67 [==============================] - 1s 17ms/step - loss: 0.3226 - accuracy: 0.7110 - val_loss: 0.3027 - val_accuracy: 0.7244
Epoch 28/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2994 - accuracy: 0.7208 - val_loss: 0.2910 - val_accuracy: 0.7259
Epoch 29/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2910 - accuracy: 0.7239 - val_loss: 0.2781 - val_accuracy: 0.7261
Epoch 30/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2856 - accuracy: 0.7258 - val_loss: 0.2733 - val_accuracy: 0.7313
23/23 [==============================] - 0s 4ms/step
INFO[0141] Process finished successfully ProcessID=61e597845ed3df4456c5be7d358e35141b8dc4c1f76a89d7caad0f31f792106c
Downloading samples_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png 100% [===============] (5.0 MB/s)
Downloading res_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png 100% [===============] (1.7 MB/s)
We can now open the sample and training plot pictures.
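The epoch lines in the training output above can also be summarized programmatically, for example to find the epoch with the best validation accuracy. The following is a sketch that assumes the output has been captured as plain text in the format shown above; the function name parse_history is hypothetical.

```python
import re

EPOCH_RE = re.compile(r"^Epoch (\d+)/(\d+)$")
METRIC_RE = re.compile(r"(\w+): (\d+\.\d+)")


def parse_history(log: str) -> dict[int, dict[str, float]]:
    """Return {epoch: {metric: value}} from Keras-style progress output."""
    history: dict[int, dict[str, float]] = {}
    epoch = None
    for line in log.splitlines():
        line = line.strip()
        header = EPOCH_RE.match(line)
        if header:
            epoch = int(header.group(1))
        elif epoch is not None and "val_loss" in line:
            # The summary line that follows an "Epoch N/M" header carries the metrics.
            history[epoch] = {k: float(v) for k, v in METRIC_RE.findall(line)}
            epoch = None
    return history


log = """Epoch 29/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2910 - accuracy: 0.7239 - val_loss: 0.2781 - val_accuracy: 0.7261
Epoch 30/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2856 - accuracy: 0.7258 - val_loss: 0.2733 - val_accuracy: 0.7313
"""
history = parse_history(log)
best = max(history, key=lambda e: history[e]["val_accuracy"])
print(best, history[best]["val_accuracy"])  # 30 0.7313
```

Metric lines that are not preceded by an epoch header, such as the first line of the truncated output above, are skipped rather than guessed.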