Self-supervision

Description of the task

The idea of this workflow is to pretrain a deep learning model by solving a so-called pretext task (denoising, inpainting, etc.) without labels. This way, the model learns a representation that can be later transferred to solve a downstream task (segmentation, detection, etc.) in a labeled (and usually smaller) dataset.

Note

In the current BiaPy GUI, self-supervised workflows are considered advanced and are not available through the Wizard. To run this workflow from the GUI, edit a YAML configuration file (including PROBLEM.TYPE: SELF_SUPERVISED) and load it from Run Workflow.

An example of a pretext task is depicted below, with an original image and its corresponding degradated version. The model will be then train using the clean original image as target and the degrated image as input:

Electron microscopy image to be used as target for pretraining

Original electron microscopy image.

Corresponding degrated version of the same image

Corresponding degradated image.

Inputs and outputs

The self-supervision workflows in BiaPy expect a folder as input:

  • Training Raw Images: A folder that contains the unprocessed (single-channel or multi-channel) images that will be used to (pre-)train the model.

    Expand to see how to configure

    In the current BiaPy GUI, configure this by editing DATA.TRAIN.PATH in your YAML file, then click Run Workflow and load that YAML file.

Upon successful execution, a directory will be generated with the results of the pre-training. Therefore, you will need to define:

  • Output Folder: A designated path to save the self-supervision outcomes.

    Expand to see how to configure

    Under Run Workflow, click on the Browse button of Output folder to save the results:

    ../_images/GUI-run-workflow5.png
Graphical description of minimal inputs and outputs in BiaPy for self-supervised learning.

BiaPy input and output folders for self-supervised learning. Since this workflow
is self-supervised, no labels are needed in neither train nor test.


Data structure

To ensure the proper operation of the workflow, the data directory tree should be something like this:

dataset/
β”œβ”€β”€ pre-train
β”‚   β”œβ”€β”€ training-0001.tif
β”‚   β”œβ”€β”€ training-0002.tif
β”‚   β”œβ”€β”€ . . .
β”‚   └── training-9999.tif
└── test
    β”œβ”€β”€ testing-0001.tif
    β”œβ”€β”€ testing-0002.tif
    β”œβ”€β”€ . . .
    └── testing-9999.tif

In this example, the (pre-)training images are under dataset/pre-train/, while the test images are under dataset/test/. This is just an example, you can name your folders as you wish as long as you set the paths correctly later.

Example datasets

Below is a list of publicly available datasets that are ready to be used in BiaPy for self-supervised learning:

Example dataset

Image dimensions

Link to data

Electron Microscopy Dataset (EPFL - CVLAB)

2D

fibsem_epfl.zip

Electron Microscopy Dataset (EPFL - CVLAB)

3D

lucchi3D.zip

Minimal configuration

Apart from the input and output folders, there are a few basic parameters that always need to be specified in order to run a self-supervised learning workflow in BiaPy. In the current GUI, this workflow is configured through a YAML file loaded from Run Workflow (not through the Wizard).

Workflow type

Set the workflow type explicitly in your YAML configuration:

PROBLEM:
  TYPE: SELF_SUPERVISED

Experiment name

Also known as β€œmodel name” or β€œjob name”, this will be the name of the current experiment you want to run, so it can be differenciated from other past and future experiments.

Expand to see how to configure

Under Run Workflow, type the name you want for the job in the Job name field:

../_images/GUI-run-workflow5.png

Note

Use only my_model -style, not my-model (Use β€œ_” not β€œ-β€œ). Do not use spaces in the name. Avoid using the name of an existing experiment/model/job (saved in the same result folder) as it will be overwritten.

Data management

Validation Set

With the goal to monitor the training process, it is common to use a third dataset called the β€œValidation Set”. This is a subset of the whole dataset that is used to evaluate the model’s performance and optimize training parameters. This subset will not be directly used for training the model, and thus, when applying the model to these images, we can see if the model is learning the training set’s patterns too specifically or if it is generalizing properly.

Graphical description of data partitions in BiaPy for SSL

Graphical description of data partitions in BiaPy when using self-generated labels.

To define such set, there are two options:

  • Validation proportion/percentage: Select a proportion (or percentage) of your training dataset to be used to validate the network during the training. Usual values are 0.1 (10%) or 0.2 (20%), and the samples of that set will be selected at random.

    Expand to see how to configure

    In the current BiaPy GUI, this option is configured by editing the DATA.VAL.SPLIT_TRAIN in your YAML file before clicking Run Workflow and loading that YAML file.

  • Validation path: Similar to the training and test sets, you can select a folder that contains the unprocessed (single-channel or multi-channel) raw images that will be used to validate the current model during training.

    Expand to see how to configure

    In the current BiaPy GUI, this option is configured by editing the DATA.VAL.PATH in your YAML file before clicking Run Workflow and loading that YAML file.

Basic training parameters

At the core of each BiaPy workflow there is a deep learning model. Although we try to simplify the number of parameters to tune, these are the basic parameters that need to be defined for training a self-supervised learning workflow:

  • Pretext task: The task to use to pretrain the model. Options: β€˜crappify’ to recover a worstened version of the input image (as in [3]), and β€˜masking’, where random patches of the input image are masked and the network needs to reconstruct the missing pixels (as in [5]). Default value: β€˜masking’.

    Expand to see how to configure

    In the current BiaPy GUI, this option is configured by editing the PROBLEM.SELF_SUPERVISED.PRETEXT_TASK in your YAML file before clicking Run Workflow and loading that YAML file.

  • Number of input channels: The number of channels of your raw images (grayscale = 1, RGB = 3). Notice the dimensionality of your images (2D/3D) is set by default depending on the workflow template you select.

    Expand to see how to configure

    In the current BiaPy GUI, this option is configured by editing the DATA.PATCH_SIZE in your YAML file before clicking Run Workflow and loading that YAML file.

  • Number of epochs: This number indicates how many rounds the network will be trained. On each round, the network usually sees the full training set. The value of this parameter depends on the size and complexity of each dataset. You can start with something like 100 epochs and tune it depending on how fast the loss (error) is reduced.

    Expand to see how to configure

    In the current BiaPy GUI, this option is configured by editing the TRAIN.EPOCHS in your YAML file before clicking Run Workflow and loading that YAML file.

  • Patience: This is a number that indicates how many epochs you want to wait without the model improving its results in the validation set to stop training. Again, this value depends on the data you’re working on, but you can start with something like 20.

    Expand to see how to configure

    In the current BiaPy GUI, this option is configured by editing the TRAIN.PATIENCE in your YAML file before clicking Run Workflow and loading that YAML file.

For improving performance, other advanced parameters can be optimized, for example, the model’s architecture. The architecture assigned as default is usually the MAE, as it is a standard in self-supervision tasks. This architecture allows a strong baseline, but further exploration could potentially lead to better results.

Note

Once the parameters are correctly assigned, the training phase can be executed. Note that to train large models effectively the use of a GPU (Graphics Processing Unit) is essential. This hardware accelerator performs parallel computations and has larger RAM memory compared to the CPUs, which enables faster training times.

How to run

BiaPy offers different options to run workflows depending on your degree of computer expertise. Select whichever is more approppriate for you:

In the BiaPy GUI, navigate to Workflow, then select Self-supervised learning and follow the on-screen instructions:

../_images/biapy_gui_ssl.png

Note

BiaPy’s GUI requires that all data and configuration files reside on the same machine where the GUI is being executed.

Tip

If you need additional help, watch BiaPy’s GUI walkthrough video.

Templates

In the templates/self-supervised folder of BiaPy, you can find a few YAML configuration templates for this workflow.

[Advanced] Special workflow configuration

Note

This section is recommended for experienced users only to improve the performance of their workflows. When in doubt, do not hesitate to check our FAQ & Troubleshooting or open a question in the image.sc discussion forum.

Advanced Parameters

Many of the parameters of our workflows are set by default to values that work commonly well. However, it may be needed to tune them to improve the results of the workflow. For instance, you may modify the following parameters:

  • Model architecture: Select the architecture of the DNN used as backbone of the pipeline. Options: MAE, EDSR, RCAN, WDSR, DFCAN, U-Net, Residual U-Net, Attention U-Net, SEUNet, MultiResUNet, ResUNet++, UNETR-Mini, UNETR-Small and UNETR-Base. Common option: MAE.

  • Batch size: This parameter defines the number of patches seen in each training step. Reducing or increasing the batch size may slow or speed up your training, respectively, and can influence network performance. Common values are 4, 8, 16, etc.

  • Patch size: Input the size of the patches use to train your model (length in pixels in X and Y). The value should be smaller or equal to the dimensions of the image. The default value is 64 in 2D, i.e. 64x64 pixels.

  • Optimizer: Select the optimizer used to train your model. Options: ADAM, ADAMW, Stochastic Gradient Descent (SGD). ADAM usually converges faster, while ADAMW provides a balance between fast convergence and better handling of weight decay regularization. SGD is known for better generalization. Default value: ADAMW.

  • Initial learning rate: Input the initial value to be used as learning rate. If you select ADAM as optimizer, this value should be around 10e-4.

Problem resolution

In BiaPy we adopt two pretext tasks that you will need to choose with pretext_task variable below (controlled with PROBLEM.SELF_SUPERVISED.PRETEXT_TASK):

  • crappify: Firstly, a pre-processing step is done where the input images are worstened by adding Gaussian noise and downsampling and upsampling them so the resolution gets worsen. This way, the images are stored in DATA.TRAIN.SSL_SOURCE_DIR, DATA.VAL.SSL_SOURCE_DIR and DATA.TEST.SSL_SOURCE_DIR for train, validation and test data respectively. This way, the model will be input with the worstened version of images and will be trained to map it to its good version (as in [3]).

  • masking: The model undergoes training by acquiring the skill to restore a concealed input image. This occurs in real-time during training, where random portions of the images are automatically obscured ([5]).

After this training, the model should have learned some features of the images, which will be a good starting point in another training process. This way, if you re-train the model loading those learned model’s weigths, which can be done enabling MODEL.LOAD_CHECKPOINT if you call BiaPy with the same --name option or setting PATHS.CHECKPOINT_FILE variable to point the file directly otherwise, the training process will be easier and faster than training from scratch.

Metrics

During the inference phase the performance of the test data is measured using different metrics if test masks were provided (i.e. ground truth) and, consequently, DATA.TEST.LOAD_GT is True. In the case of super-resolution the Peak signal-to-noise ratio (PSNR) metrics is calculated when the worstened image is reconstructed from individual patches.

Results

The results are placed in results folder under --result_dir directory with the --name given. An example of this workflow is depicted below:

../_images/pred_ssl.png

Predicted image.

../_images/lucchi_train_0.png

Original image.

Following the example, you should see that the directory /home/user/exp_results/my_2d_self-supervised has been created. If the same experiment is run 5 times, varying --run_id argument only, you should find the following directory tree:

Expand directory tree
my_2d_self-supervised/
β”œβ”€β”€ config_files
β”‚   └── my_2d_self-supervised.yaml
β”œβ”€β”€ checkpoints
β”‚   └── my_2d_self-supervised_1-checkpoint-best.pth
└── results
    β”œβ”€β”€ my_2d_self-supervised_1
    β”œβ”€β”€ . . .
    └── my_2d_self-supervised_5
        β”œβ”€β”€ aug
        β”‚   └── .tif files
        β”œβ”€β”€ charts
        β”‚   β”œβ”€β”€ my_2d_self-supervised_1_*.png
        β”‚   └── my_2d_self-supervised_1_loss.png
        β”œβ”€β”€ MAE_checks
        β”‚   └── .tif files
        β”œβ”€β”€ per_image
        β”‚   β”œβ”€β”€ .tif files
        β”‚   └── .zarr files (or.h5)
        β”œβ”€β”€ tensorboard
        └── train_logs

  • config_files: directory where the .yaml filed used in the experiment is stored.

    • my_2d_self-supervised.yaml: YAML configuration file used (it will be overwrited every time the code is run).

  • checkpoints, optional: directory where model’s weights are stored. Only created when TRAIN.ENABLE is True and the model is trained for at least one epoch. Can contain:

    • my_2d_self-supervised_1-checkpoint-best.pth, optional: checkpoint file (best in validation) where the model’s weights are stored among other information. Only created when the model is trained for at least one epoch.

    • normalization_mean_value.npy, optional: normalization mean value. Is saved to not calculate it everytime and to use it in inference. Only created if DATA.NORMALIZATION.TYPE is custom.

    • normalization_std_value.npy, optional: normalization std value. Is saved to not calculate it everytime and to use it in inference. Only created if DATA.NORMALIZATION.TYPE is custom.

  • results: directory where all the generated checks and results will be stored. There, one folder per each run are going to be placed.

    • my_2d_self-supervised_1: run 1 experiment folder. Can contain:

      • aug, optional: image augmentation samples. Only created if AUGMENTOR.AUG_SAMPLES is True.

      • charts, optional: only created when TRAIN.ENABLE is True and epochs trained are more or equal LOG.CHART_CREATION_FREQ. Can contain:

        • my_2d_self-supervised_1_*.png: Plot of each metric used during training.

        • my_2d_self-supervised_1_loss.png: Loss over epochs plot.

      • MAE_checks, optional: MAE predictions. Only created if PROBLEM.SELF_SUPERVISED.PRETEXT_TASK is masking.

        • *_original.tif: Original image.

        • *_masked.tif: Masked image inputed to the model.

        • *_reconstruction.tif: Reconstructed image.

        • *_reconstruction_and_visible.tif: Reconstructed image with the visible parts copied.

      • per_image:

        • .tif files: reconstructed images from patches.

        • .zarr files (or.h5), optional: reconstructed images from patches. Created when TEST.BY_CHUNKS.ENABLE is True.

      • tensorboard: Tensorboard logs.

      • train_logs: each row represents a summary of each epoch stats. Only avaialable if training was done.

Note

Here, for visualization purposes, only my_2d_self-supervised_1 has been described but my_2d_self-supervised_2, my_2d_self-supervised_3, my_2d_self-supervised_4 and my_2d_self-supervised_5 will follow the same structure.