Image classification
Description of the task
The goal of this workflow is to assign a category (or class) to every input image.
In the figure below a few examples of this workflow's input are depicted:
Each of these examples belongs to a different class and was obtained from MedMNIST v2 ([12]), specifically from the DermaMNIST dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
Inputs and outputs
The image classification workflows in BiaPy expect a series of folders as input:
Training Raw Images: A folder that contains the unprocessed (single-channel or multi-channel) images that will be used to train the model. As explained later, all images of the same category are expected to be in the same sub-folder.
- [Optional] Test Raw Images: A folder that contains the images used to evaluate the model's performance. Optionally, if the category of each test image is known, all images of the same category are expected to be in the same sub-folder.
Upon successful execution, a directory will be generated with the results of the classification. Therefore, you will need to define:
Output Folder: A designated path to save the classification outcomes.
BiaPy input and output folders for image classification. Notice the test folder
Data structure
To ensure the proper operation of the workflow, the directory tree should be something like this:
dataset/
├── train
│   ├── class_0
│   │   ├── train0_0.png
│   │   ├── train1013_0.png
│   │   ├── . . .
│   │   └── train932_0.png
│   ├── class_1
│   │   ├── train104_1.png
│   │   ├── train1049_1.png
│   │   ├── . . .
│   │   └── train964_1.png
│   . . .
│   └── class_6
│       ├── train1105_6.png
│       ├── train1148_6.png
│       ├── . . .
│       └── train98_6.png
└── test
    ├── class_0
    │   ├── test1008_0.png
    │   ├── test1084_0.png
    │   ├── . . .
    │   └── test914_0.png
    ├── class_1
    │   ├── test10_1.png
    │   ├── test1034_1.png
    │   ├── . . .
    │   └── test984_1.png
    . . .
    └── class_6
        ├── test1021_6.png
        ├── test1069_6.png
        ├── . . .
        └── test806_6.png
Each image category is obtained from the name of the sub-folder in which that image resides. That is why it is so important to follow the directory tree described above. If you have a .csv file with the category of each image, as provided by MedMNIST v2, you can use our script from_class_csv_to_folders.py to create such a directory tree.
The sub-folder names can be any number or string; they will be used as the class names. Regarding the test set, if you have no classes, it doesn't matter whether the images are split across several folders or are all in one folder.
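Before launching a workflow, it can help to confirm that a folder follows the layout above. The following is a minimal sketch, not part of BiaPy: the function names and the list of image extensions are assumptions made for illustration. It reports how many images each class sub-folder holds and flags images left outside any class sub-folder.

```python
from pathlib import Path

# Assumption: extensions considered to be images; adjust to your data.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def summarize_class_folders(split_dir):
    """Return {class_name: number_of_images} for one split folder (e.g. dataset/train)."""
    split_dir = Path(split_dir)
    counts = {}
    # Every sub-folder is treated as one class, named after the folder.
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


def find_stray_images(split_dir):
    """List images placed directly in the split folder, outside any class sub-folder."""
    return sorted(
        f.name
        for f in Path(split_dir).iterdir()
        if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
    )
```

For a training folder, `len(summarize_class_folders("dataset/train"))` gives the number of class sub-folders, and a non-empty result from `find_stray_images` points to images that have no category assigned by the folder structure.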
Example datasets
Below is a list of publicly available datasets that are ready to be used in BiaPy for image classification:
| Example dataset | Image dimensions | Link to data |
|---|---|---|
| | 2D | |
| | 3D | |
| | 2D | |
Minimal configuration
Apart from the input and output folders, there are a few basic parameters that always need to be specified in order to run an image classification workflow in BiaPy. Depending on the parameter, they can be defined through the GUI Wizard, in the code-free notebooks, or by editing the YAML configuration file.
Experiment name
Also known as "model name" or "job name", this is the name of the experiment you want to run, so it can be differentiated from other past and future experiments.
Note
Use only my_model-style names, not my-model (use "_", not "-"). Do not use spaces in the name. Avoid using the name of an existing experiment/model/job (saved in the same result folder), as it will be overwritten.
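The naming rules in this note can be checked automatically before a run. This is a small sketch under stated assumptions: the helper names are hypothetical, and the accepted character set (letters, digits and underscores) is an interpretation of the note rather than a rule enforced by BiaPy.

```python
import re
from pathlib import Path

# Assumption: letters, digits and underscores form the "safe" character set.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_safe_experiment_name(name):
    """True if the name has no spaces, hyphens or other special characters."""
    return _SAFE_NAME.fullmatch(name) is not None


def would_overwrite(result_dir, name):
    """True if a folder for this experiment name already exists in result_dir."""
    return (Path(result_dir) / name).exists()
```

For example, `is_safe_experiment_name("my_model")` is accepted, while `"my-model"` and `"my model"` are rejected.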
Data management
Validation Set
With the goal of monitoring the training process, it is common to use a third dataset called the "Validation Set". This is a subset of the whole dataset that is used to evaluate the model's performance and optimize training parameters. This subset is not directly used for training the model, and thus, when applying the model to these images, we can see whether the model is learning the training set's patterns too specifically or is generalizing properly.
Graphical description of data partitions in BiaPy.
To define such a set, there are two options:
Validation proportion/percentage: Select a proportion (or percentage) of your training dataset to be used to validate the network during the training. Usual values are 0.1 (10%) or 0.2 (20%), and the samples of that set will be selected at random.
Validation path: Similar to the training set, you can select a folder that contains the unprocessed (single-channel or multi-channel) raw images that will be used to validate the current model during training. As with the training images, all images of the same category are expected to be in the same sub-folder.
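To make the first option concrete, here is a minimal sketch of a random hold-out split by proportion. It only illustrates the idea: the function name and the fixed seed are assumptions, and BiaPy's own sample selection may differ in its details.

```python
import random


def split_train_validation(samples, val_proportion=0.1, seed=0):
    """Randomly hold out a proportion of the training samples for validation."""
    if not 0.0 < val_proportion < 1.0:
        raise ValueError("val_proportion must be between 0 and 1 (exclusive)")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_proportion))
    # The first n_val shuffled samples become the validation set.
    return shuffled[n_val:], shuffled[:n_val]
```

With 1000 training images and a proportion of 0.2, this yields 800 images for training and 200 for validation, with no image in both sets.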
Test ground-truth
Do you have labels (classes) for the test set? This is a key question, as it tells BiaPy whether your test set will be used to evaluate the model on new data (unseen during training) or simply to produce predictions on that new data. All supervised workflows contain a parameter to specify this aspect.
Basic training parameters
At the core of each BiaPy workflow there is a deep learning model. Although we try to keep the number of parameters to tune small, these are the basic parameters that need to be defined for training an image classification workflow:
Number of classes: The number of classes present in the problem. It must be equal to the number of subfolders in the training folder.
Number of input channels: The number of channels of your raw images (grayscale = 1, RGB = 3). Notice the dimensionality of your images (2D/3D) is set by default depending on the workflow template you select.
Number of epochs: This number indicates how many rounds the network will be trained. On each round, the network usually sees the full training set. The value of this parameter depends on the size and complexity of each dataset. You can start with something like 100 epochs and tune it depending on how fast the loss (error) is reduced.
Patience: This number indicates how many epochs to wait without the model improving its results on the validation set before stopping training. Again, this value depends on the data you're working on, but you can start with something like 20.
To improve performance, other advanced parameters can be optimized, for example, the model's architecture. The default architecture is the ViT, as it is effective in image classification tasks. This architecture provides a strong baseline, but further exploration could potentially lead to better results.
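The interaction between the number of epochs and the patience can be sketched as a simple early-stopping rule. This is an illustration only: the function name is hypothetical, and it assumes the monitored quantity is a validation loss where lower is better, which may not match the exact criterion BiaPy uses.

```python
def epoch_of_stop(val_losses, patience=20):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss; otherwise it runs through every epoch given.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best result: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)
```

With the losses `[1.0, 0.8, 0.9, 0.85, 0.95, 0.7]` and a patience of 3, training would stop at epoch 5 and never reach the better result of epoch 6, which is why a patience that is too small can end training prematurely.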
Note
Once the parameters are correctly assigned, the training phase can be executed. Note that training large models effectively requires a GPU (Graphics Processing Unit). This hardware accelerator performs many computations in parallel and has dedicated high-bandwidth memory, which enables much faster training than a CPU.
How to run
BiaPy offers different options to run workflows depending on your degree of computer expertise. Select whichever is more appropriate for you:
In the BiaPy GUI, click on the Wizard, then follow the next instructions to select the image classification workflow:
Note
BiaPy's GUI requires that all data and configuration files reside on the same machine where the GUI is being executed.
Tip
If you need additional help, watch BiaPy's GUI walkthrough video.
BiaPy offers two code-free notebooks in Google Colab to perform image classification:
Tip
If you need additional help, watch BiaPy's Notebook walkthrough video.
If you installed BiaPy via Docker, open a terminal as described in Installation. Then, you can use the 2d_classification.yaml template file (or your own file), and run the workflow as follows:
# Configuration file
job_cfg_file=/home/user/2d_classification.yaml
# Path to the data directory
data_dir=/home/user/data
# Where the experiment output directory should be created
result_dir=/home/user/exp_results
# Just a name for the job
job_name=classification
# Number that should be increased when one needs to run the same job multiple times (reproducibility)
job_counter=1
# Number of the GPU to run the job in (according to 'nvidia-smi' command)
gpu_number=0
docker run --rm \
--gpus "device=$gpu_number" \
--mount type=bind,source=$job_cfg_file,target=$job_cfg_file \
--mount type=bind,source=$result_dir,target=$result_dir \
--mount type=bind,source=$data_dir,target=$data_dir \
biapyx/biapy:latest-11.8 \
biapy \
--config $job_cfg_file \
--result_dir $result_dir \
--name $job_name \
--run_id $job_counter \
--gpu "$gpu_number"
Note
Note that data_dir must contain every path set in DATA.*.PATH so the container can find the data. For instance, if you only want to train in this example, DATA.TRAIN.PATH could be /home/user/data/train/.
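A quick way to check this requirement before starting the container is to test whether each configured path lies inside the mounted data directory. The sketch below is only illustrative and uses a hypothetical helper name. It compares the paths textually (it does not resolve symbolic links or ".." components) and assumes POSIX-style paths as in the example above.

```python
from pathlib import PurePosixPath


def is_inside(path, parent):
    """True if `path` is `parent` itself or is located somewhere below it."""
    path = PurePosixPath(path)
    parent = PurePosixPath(parent)
    return path == parent or parent in path.parents


# Paths taken from the example above.
data_dir = "/home/user/data"
train_path = "/home/user/data/train/"
print(is_inside(train_path, data_dir))
```

A path such as /home/user/other/train would fail this check, which means the container could not see it with the mounts shown above.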
For container versions prior to 3.6.8, the biapy prefix is not required. You can execute the command directly as follows:
docker run --rm \
--gpus "device=$gpu_number" \
--mount type=bind,source=$job_cfg_file,target=$job_cfg_file \
--mount type=bind,source=$result_dir,target=$result_dir \
--mount type=bind,source=$data_dir,target=$data_dir \
biapyx/biapy:3.6.7-11.8 \
--config $job_cfg_file \
--result_dir $result_dir \
--name $job_name \
--run_id $job_counter \
--gpu "$gpu_number"
From a terminal, you can use the 2d_classification.yaml template file (or your own file), and run the workflow as follows:
# Configuration file
job_cfg_file=/home/user/2d_classification.yaml
# Where the experiment output directory should be created
result_dir=/home/user/exp_results
# Just a name for the job
job_name=my_2d_classification
# Number that should be increased when one needs to run the same job multiple times (reproducibility)
job_counter=1
# Number of the GPU to run the job in (according to 'nvidia-smi' command)
gpu_number=0
# Load the environment
conda activate BiaPy_env
biapy \
--config $job_cfg_file \
--result_dir $result_dir \
--name $job_name \
--run_id $job_counter \
--gpu "$gpu_number"
For multi-GPU training you can call BiaPy as follows:
# First check where your biapy command is located (you need it in the command below)
# $ which biapy
# > /home/user/anaconda3/envs/BiaPy_env/bin/biapy
gpu_number="0, 1, 2"
python -u -m torch.distributed.run \
--nproc_per_node=3 \
/home/user/anaconda3/envs/BiaPy_env/bin/biapy \
--config $job_cfg_file \
--result_dir $result_dir \
--name $job_name \
--run_id $job_counter \
--gpu "$gpu_number"
nproc_per_node needs to be equal to the number of GPUs you are using (e.g. gpu_number length).
REM Configuration file
set job_cfg_file=C:\home\user\2d_classification.yaml
REM Where the experiment output directory should be created
set result_dir=C:\home\user\exp_results
REM Just a name for the job
set job_name=my_2d_classification
REM Number that should be increased when one needs to run the same job multiple times (reproducibility)
set job_counter=1
REM Number of the GPU to run the job in (according to 'nvidia-smi' command)
set gpu_number=0
REM Load the environment
call conda activate BiaPy_env
biapy ^
--config %job_cfg_file% ^
--result_dir %result_dir% ^
--name %job_name% ^
--run_id %job_counter% ^
--gpu "%gpu_number%"
For multi-GPU training you can call BiaPy as follows:
REM First check where your biapy command is located (you need it in the command below)
REM $ where biapy
REM > C:\home\user\anaconda3\envs\BiaPy_env\bin\biapy
set gpu_number=0, 1, 2
python -u -m torch.distributed.run ^
--nproc_per_node=3 ^
C:\home\user\anaconda3\envs\BiaPy_env\bin\biapy ^
--config %job_cfg_file% ^
--result_dir %result_dir% ^
--name %job_name% ^
--run_id %job_counter% ^
--gpu "%gpu_number%"
nproc_per_node needs to be equal to the number of GPUs you are using (e.g. gpu_number length).
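To avoid a mismatch between nproc_per_node and the GPU list, the multi-GPU command can be assembled programmatically. This is a sketch under stated assumptions: the helper name is hypothetical, and it only reuses the arguments shown in the commands above, deriving nproc_per_node from the number of GPU identifiers.

```python
def build_multi_gpu_command(gpu_ids, biapy_path, config, result_dir, name, run_id):
    """Build the torch.distributed.run call with nproc_per_node tied to the GPU list."""
    gpu_ids = [int(g) for g in gpu_ids]
    if not gpu_ids:
        raise ValueError("At least one GPU id is required")
    return [
        "python", "-u", "-m", "torch.distributed.run",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per GPU
        biapy_path,
        "--config", config,
        "--result_dir", result_dir,
        "--name", name,
        "--run_id", str(run_id),
        "--gpu", ", ".join(str(g) for g in gpu_ids),  # same format as the examples above
    ]


command = build_multi_gpu_command(
    [0, 1, 2],
    "/home/user/anaconda3/envs/BiaPy_env/bin/biapy",
    "/home/user/2d_classification.yaml",
    "/home/user/exp_results",
    "my_2d_classification",
    1,
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command)` from an activated environment. Because each argument is a separate list element, no shell quoting is needed.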
Templates
In the templates/classification folder of BiaPy, you can find a few YAML configuration templates for this workflow.
[Advanced] Special workflow configuration
Note
This section is recommended for experienced users only to improve the performance of their workflows. When in doubt, do not hesitate to check our FAQ & Troubleshooting or open a question in the image.sc discussion forum.
Advanced Parameters
Many of the parameters of our workflows are set by default to values that generally work well. However, you may need to tune them to improve the results of the workflow. For instance, you may modify the following parameters:
Model architecture: Select the architecture of the deep neural network used as the backbone of the pipeline. Options: ViT, EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3, EfficientNetB4, EfficientNetB5, EfficientNetB6, EfficientNetB7 and simple CNN. Default value: ViT.
Batch size: This parameter defines the number of patches seen in each training step. Reducing or increasing the batch size may slow or speed up your training, respectively, and can influence network performance. Common values are 4, 8, 16, etc.
Patch size: Input the size of the patches used to train your model (length in pixels in X and Y). The value should be smaller than or equal to the dimensions of the image. The default value is 256 in 2D, i.e. 256x256 pixels.
Optimizer: Select the optimizer used to train your model. Options: ADAM, ADAMW, Stochastic Gradient Descent (SGD). ADAM usually converges faster, while ADAMW provides a balance between fast convergence and better handling of weight decay regularization. SGD is known for better generalization. Default value: ADAMW.
Initial learning rate: Input the initial value to be used as learning rate. If you select ADAM as optimizer, this value should be around 10e-4.
Learning rate scheduler: Select to adjust the learning rate between epochs. The current options are "Reduce on plateau", "One cycle", "Warm-up cosine decay" or no scheduler.
Test time augmentation (TTA): Select to apply augmentation (flips and rotations) at test time. It usually provides more robust results but takes more time to produce each result. By default, no TTA is performed.
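The idea behind test time augmentation can be sketched in a few lines: predict on several flipped and rotated copies of the image and average the class probabilities. This is a generic illustration in plain Python, not BiaPy's implementation. The exact set of transformations and the way predictions are merged are assumptions, and `predict` stands for any hypothetical function that maps a 2D image to a list of class probabilities.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [list(row[::-1]) for row in image]


def tta_variants(image):
    """Yield 8 variants: the 4 rotations, each with and without a flip."""
    current = [list(row) for row in image]
    for _ in range(4):
        yield current
        yield flip_horizontal(current)
        current = rotate90(current)


def predict_with_tta(predict, image):
    """Average the class probabilities returned by `predict` over all variants."""
    outputs = [predict(variant) for variant in tta_variants(image)]
    n_classes = len(outputs[0])
    return [sum(o[c] for o in outputs) / len(outputs) for c in range(n_classes)]
```

The cost is visible in the sketch: the model is called eight times per image instead of once, which is why TTA is slower than a plain prediction.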
Metrics
During the inference phase, performance on the test data is measured using different metrics if test labels (i.e. ground truth) were provided and, consequently, DATA.TEST.LOAD_GT is True. For classification, the accuracy, precision, recall, and F1 score are calculated. In addition, the confusion matrix is printed.
Results
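For reference, the metrics listed above can be derived from the confusion matrix as in the following sketch. It is written in plain Python for illustration and is not BiaPy's code: the function names are hypothetical, and averaging precision, recall and F1 over classes with equal weight (macro averaging) is an assumption, since the averaging mode is not stated here.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def classification_metrics(y_true, y_pred, classes):
    """Return accuracy and macro-averaged precision, recall and F1."""
    cm = confusion_matrix(y_true, y_pred, classes)
    n = len(classes)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": sum(cm[i][i] for i in range(n)) / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Reading the confusion matrix row by row shows which true classes are most often mistaken for which predicted classes, information that a single accuracy value hides.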
The main output of this workflow will be a file named predictions.csv that will contain the predicted image class:
Classification workflow output
All files are placed in a folder named after --name inside the --result_dir directory. Following the example, you should see that the directory /home/user/exp_results/my_2d_classification has been created. If the same experiment is run 5 times, varying only the --run_id argument, you should find the following directory tree:
- config_files: directory where the .yaml file used in the experiment is stored.
  - 2d_classification.yaml: YAML configuration file used (it will be overwritten every time the code is run).
- checkpoints, optional: directory where the model's weights are stored. Only created when TRAIN.ENABLE is True and the model is trained for at least one epoch.
  - model_weights_my_2d_classification_1.h5, optional: checkpoint file (best in validation) where the model's weights are stored among other information. Only created when the model is trained for at least one epoch.
  - normalization_mean_value.npy, optional: normalization mean value. It is saved so that it does not need to be recalculated every time and can be reused in inference. Only created if DATA.NORMALIZATION.TYPE is custom.
  - normalization_std_value.npy, optional: normalization std value. It is saved so that it does not need to be recalculated every time and can be reused in inference. Only created if DATA.NORMALIZATION.TYPE is custom.
- results: directory where all the generated checks and results will be stored. It contains one folder per run.
  - my_2d_classification_1: run 1 experiment folder. Can contain:
    - predictions.csv: list of assigned class per test image.
    - aug, optional: image augmentation samples. Only created if AUGMENTOR.AUG_SAMPLES is True.
    - charts, optional: only created when TRAIN.ENABLE is True and the number of epochs trained is greater than or equal to LOG.CHART_CREATION_FREQ.
      - my_2d_classification_1_*.png: plot of each metric used during training.
      - my_2d_classification_1_loss.png: loss over epochs plot.
- train_logs: each row represents a summary of each epoch's stats. Only available if training was done.
- tensorboard: tensorboard logs.
- test_results_metrics.csv: a CSV file containing all the evaluation metrics obtained on each file of the test set if ground truth was provided.
Note
Here, for visualization purposes, only my_2d_classification_1 has been described but my_2d_classification_2, my_2d_classification_3, my_2d_classification_4 and my_2d_classification_5 directories will follow the same structure.
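When an experiment has been run several times, the per-run prediction files can be gathered with a short script. This sketch assumes the folder layout described above (result directory, then job name, then results, then one folder per run named after the job and the run id). The function name is hypothetical and not part of BiaPy.

```python
from pathlib import Path


def find_prediction_files(result_dir, job_name):
    """Map each run id to its predictions.csv, following the layout described above."""
    results_dir = Path(result_dir) / job_name / "results"
    found = {}
    for run_dir in sorted(results_dir.glob(f"{job_name}_*")):
        # The run id is whatever follows "<job_name>_" in the folder name.
        run_id = run_dir.name[len(job_name) + 1:]
        csv_path = run_dir / "predictions.csv"
        if run_dir.is_dir() and run_id.isdigit() and csv_path.is_file():
            found[int(run_id)] = csv_path
    return found
```

For instance, `find_prediction_files("/home/user/exp_results", "my_2d_classification")` would return one entry per completed run, which makes it easy to compare the predictions of runs 1 to 5.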