{"id":14529191,"url":"https://github.com/ranandalon/mtl","last_synced_at":"2025-09-01T23:32:56.854Z","repository":{"id":37735680,"uuid":"165829628","full_name":"ranandalon/mtl","owner":"ranandalon","description":"Unofficial implementation of: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics","archived":false,"fork":false,"pushed_at":"2021-11-01T07:21:23.000Z","size":4884,"stargazers_count":533,"open_issues_count":16,"forks_count":78,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-09-05T00:01:52.441Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ranandalon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-15T10:14:56.000Z","updated_at":"2024-08-20T14:10:01.000Z","dependencies_parsed_at":"2022-07-30T21:08:11.430Z","dependency_job_id":null,"html_url":"https://github.com/ranandalon/mtl","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ranandalon%2Fmtl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ranandalon%2Fmtl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ranandalon%2Fmtl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ranandalon%2Fmtl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ranandalon","download_url":"https://codeload.github.com/ranandalon/mtl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":231727114,"owners_count":18417390,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-05T00:00:57.269Z","updated_at":"2024-12-29T10:30:24.564Z","avatar_url":"https://github.com/ranandalon.png","language":"Python","readme":"# Multi-Task Learning project\nUnofficial implementation of:\u003cbr\u003e\nKendall, Alex, Yarin Gal, and Roberto Cipolla. \"Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.\" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.\n[[arXiv](https://arxiv.org/abs/1705.07115)].\n\n\n## Abstract\nNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task’s loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. 
Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.\n\n## Multi-Task Learning with Homoscedastic Uncertainty\nThe naive approach to combining multi-objective losses would be to simply perform a weighted linear sum of the losses for each individual task:\u003cbr\u003e\n\u003cimg src='images/naive_loss.PNG'\u003e\u003cbr\u003e\n\nThe paper suggests that homoscedastic uncertainty can be used as a basis for weighting losses in a multi-task learning problem, and produces superior results compared to the naive approach.\n\n### Mathematical Formulation\nFirst, the paper defines multi-task likelihoods:\u003cbr\u003e\n- For regression tasks, the likelihood is defined as a Gaussian with mean given by the model output and an observation noise scalar σ:\u003cbr\u003e\n\u003cimg src='images/reg_likelihood.PNG'\u003e\u003cbr\u003e\n- For classification, the likelihood is defined as:\u003cbr\u003e\n\u003cimg src='images/class_likelihood_1.PNG'\u003e\u003cbr\u003e\nwhere:\u003cbr\u003e\n\u003cimg src='images/class_likelihood_0.PNG'\u003e\u003cbr\u003e\n\nIn maximum likelihood inference, we maximise the log likelihood of the model. In regression, for example:\u003cbr\u003e\n\u003cimg src='images/reg_loglikelihood.PNG'\u003e\u003cbr\u003e\nσ is the model’s observation noise parameter, capturing how much noise we have in the outputs. We then maximise the log likelihood with respect to the model parameters W and the observation noise parameter σ.\u003cbr\u003e\n\nAssuming two tasks that follow Gaussian distributions:\u003cbr\u003e\n\u003cimg src='images/two_task.PNG'\u003e\u003cbr\u003e\nThe loss will be:\u003cbr\u003e\n\u003cimg src='images/total_loss_h.PNG'\u003e\u003cbr\u003e\n\u003cimg src='images/loss7.PNG'\u003e\u003cbr\u003e\nThis means that W and σ are the learned parameters of the network. 
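As a minimal NumPy sketch of the combined loss above (not the repository's code; `uncertainty_weighted_loss` and its arguments are hypothetical names, and the regression-style weighting 1/(2σ²) is assumed for every task):\n

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses via learned homoscedastic uncertainty.

    Each task loss L_i is weighted by 1 / (2 * sigma_i^2) and regularised
    by log(sigma_i); learning log(sigma) keeps sigma implicitly positive.
    """
    total = 0.0
    for loss, log_sigma in zip(task_losses, log_sigmas):
        precision = np.exp(-2.0 * log_sigma)  # 1 / sigma^2
        total += 0.5 * precision * loss + log_sigma
    return total

# With sigma = 1 (log sigma = 0) each task contributes half its raw loss.
```

In the actual network the σ values would be trainable variables updated by the optimizer alongside W.\n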
W are the weights of the network, while σ are used to calculate the weight of each task loss and also to regularize that task loss weight.\n\n\n## Architecture\n### Overview\nThe network consists of an encoder, which produces a shared representation, followed by three task-specific decoders:\n1. Semantic segmentation Decoder.\n2. Instance segmentation Decoder.\n3. Depth estimation Decoder.\n\n\u003cimg src='images/arc_top.png'\u003e\n\n### Encoder\nThe encoder consists of a fine-tuned, pre-trained ResNet-101 v1 with the following changes:\n1. Dropped the final fully connected layer.\n2. The last layer is resized to 128X256.\n3. Used a dilated convolution approach (atrous convolution).\n\n\u003cimg src='images/resnet.png'\u003e\n\n### Atrous convolution\nGiven an image, we assume that we first have a downsampling operation that reduces the resolution by a factor of 2, and then perform a convolution with a kernel (in the example beneath: the vertical Gaussian derivative). If one implants the resulting feature map in the original image coordinates, we realize that we have obtained responses at only 1/4 of the image positions. Instead, we can compute responses at all image positions if we convolve the full resolution image with a filter ‘with holes’, in which we upsample the original filter by a factor of 2, and introduce zeros in between filter values. Although the effective filter size increases, we only need to take into account the non-zero filter values, hence both the number of filter parameters and the number of operations per position stay constant. The resulting scheme allows us to easily and explicitly control the spatial resolution of neural network feature responses.\n\n\n\u003cimg src='images/atrous_convolution.png'\u003e\n\n### Decoders\nEach decoder consists of three convolution layers:\n1. 3X3 Conv + ReLU (512 kernels).\n2. 1X1 Conv + ReLU (512 kernels).\n3. 
1X1 Conv + ReLU (as many kernels as needed for the task).\n\n**Semantic segmentation Decoder:** last layer 34 channels.\u003cbr\u003e\n\u003cimg src='images/semantic_segmantation.png' height=\"100px\"\u003e\n\n**Instance segmentation Decoder:** last layer 2 channels.\u003cbr\u003e\n\u003cimg src='images/instance_segmantation.png' height=\"100px\"\u003e\n\n**Depth estimation Decoder:** last layer 1 channel.\u003cbr\u003e\n\u003cimg src='images/depth_estimation.png' height=\"100px\"\u003e\n\n## Losses\n### Specific losses\n1. Semantic segmentation loss (\u003cimg src='images/l_label.PNG' height=\"20px\"\u003e): Cross-entropy on softmax per pixel (only on valid pixels).\n2. Instance segmentation loss (\u003cimg src='images/l_instance.PNG' height=\"20px\"\u003e): Centroid regression using masked L1. For each instance in the GT we calculate a mask of valid pixels, and for each pixel in the mask the distance (in pixels) from the mask center (for x and for y); this is used as the instance segmentation GT. Then, for all valid pixels, we calculate the L1 distance between the network output and the instance segmentation GT.\n3. Depth estimation loss (\u003cimg src='images/l_disp.PNG' height=\"20px\"\u003e): L1 (only on valid pixels).\n\n### Multi loss\n\u003cimg src='images/multi_loss.PNG'\u003e\n\nNotice that: \u003cimg src='images/sigmas.PNG' height=\"20px\"\u003e are learnable.\n\n## Instance segmentation explained\nThe instance segmentation decoder produces two channels, so that each pixel is a vector pointing to the instance center. Using the semantic segmentation result we calculate a mask of the valid instance segmentation pixels. Then we combine the mask with the vectors calculated by the instance segmentation decoder and, using the OPTICS clustering algorithm, cluster the vectors into different instances. OPTICS is an efficient density-based clustering algorithm. 
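As an illustrative sketch of this clustering step (using scikit-learn's `OPTICS` on synthetic centre votes; the data and parameters here are stand-ins, not the repository's code):\n

```python
import numpy as np
from sklearn.cluster import OPTICS

# Synthetic stand-in for per-pixel centre votes: each valid pixel's
# coordinate plus its predicted offset vector points at an instance
# centre, so votes from the same instance form a dense blob.
rng = np.random.default_rng(0)
votes = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2)),    # instance A
    rng.normal(loc=(10.0, 10.0), scale=0.1, size=(20, 2)),  # instance B
])

# Cluster the votes; OPTICS needs no preset number of clusters.
labels = OPTICS(min_samples=5).fit_predict(votes)
```

Noise points receive the label `-1`; the remaining labels identify the recovered instances.\n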
It is able to identify an unknown number of multi-scale clusters with varying density from a given set of samples. OPTICS is used for two reasons. First, it does not assume knowledge of the number of clusters, unlike algorithms such as k-means. Secondly, it does not assume a canonical instance size or density, unlike discretised binning approaches.\n\n\u003cimg src='images/instance_pipline_legand2.png'\u003e\n\n## Results\n### Examples\n|        Input        | Label \u003cbr\u003esegmentation  |Instance \u003cbr\u003esegmentation|       Depth         |\n|:-------------------:|:-------------------:|:-------------------:|:-------------------:|\n|\u003cimg width=\"200px\" src='inputs/Pedestrian_crossing_0.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_Pedestrian_crossing_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_Pedestrian_crossing_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_Pedestrian_crossing_0.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/Pedestrian_crossing_1.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_Pedestrian_crossing_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_Pedestrian_crossing_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_Pedestrian_crossing_1.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/bicycle_0.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_bicycle_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_bicycle_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_bicycle_0.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/bicycle_1.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_bicycle_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_bicycle_1.png' 
width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_bicycle_1.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/bus_0.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_bus_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_bus_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_bus_0.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/bus_1.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_bus_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_bus_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_bus_1.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/parking_0.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_parking_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_parking_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_parking_0.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/parking_1.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_parking_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_parking_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_parking_1.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/truck_0.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_truck_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/instance_truck_0.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_truck_0.png' width=\"200px\"\u003e|\n|\u003cimg width=\"200px\" src='inputs/truck_1.png'\u003e|\u003cimg src='results/resNet_label_instance_disp/label_truck_1.png' width=\"200px\"\u003e|\u003cimg 
src='results/resNet_label_instance_disp/instance_truck_1.png' width=\"200px\"\u003e|\u003cimg src='results/resNet_label_instance_disp/disp_truck_1.png' width=\"200px\"\u003e|\n\n### Single vs. Dual vs. All\n**Task quantitative results per epoch**\u003cbr\u003e\n\n|Label segmentation   |Instance segmentation|       Depth         |\n|:-------------------:|:-------------------:|:-------------------:|\n|\u003cimg src='images/graphs/label.png' width=\"280px\"\u003e|\u003cimg src='images/graphs/instance.png' width=\"280px\"\u003e|\u003cimg src='images/graphs/disp.png' width=\"280px\"\u003e|\n\n**Comparison to paper quantitative results** \u003cbr\u003e\n\n\n|                            |     |        |      |Label segmentation   |Instance segmentation|       Depth         |\n|----------------------------|:---:|:------:|:----:|:-------------------:|:-------------------:|:-------------------:|\n|loss                        |Label|Instance|Depth |IoU [%] (ours/*paper's*)|RMS error (ours/*paper's*)|RMS error (ours/*paper's*)|\n|Label only                  |✓    |✗       |✗     |43.45/*43.1*        |✗                   |✗                    |\n|Instance only               |✗    |✓       |✗     |✗                  |3.4128/*4.61*        |✗                    |\n|Depth only                  |✗    |✗       |✓     |✗                  |✗                    |0.7005/*0.783*       |\n|Unweighted sum of losses    |0.333|0.333    |0.333 |*43.6*              |*3.92*               |*0.786*              |\n|Approx. 
optimal weights     |0.8  |0.05     |0.15  |*46.3*              |*3.92*               |*0.799*              |\n|2 task uncertainty weighting|✓    |✓       |✗     |42.98/*46.5*        |3.3185/*3.73*        |✗                    |\n|2 task uncertainty weighting|✓    |✗       |✓     |43.27/*46.2*        |✗                   |0.7118/*0.714*        |\n|2 task uncertainty weighting|✗    |✓       |✓     |✗                  |3.2853/*3.54*        |0.7166/*0.744*        |\n|3 task uncertainty weighting|✓    |✓       |✓     |42.87/*46.6*        |3.3183/*3.91*        |0.7102/*0.702*        |\n\n\nExplanation of the table rows and columns:\n- Rows 3 - 11 show results for different networks trained with different combinations of losses and tasks.\n- Column 1 shows the loss used for the network.\n- Columns 2 - 4 show the tasks that were used during training (✓ means the task was used, ✗ means it was not, and a number gives the task weighting when a constant weighting parameter was used).\n- Columns 5 - 7 show results for every task (our results are on the left side of the `/`, *paper results are in italics on the right side*; in rows 7 and 8 we show only the paper's results).\n\nThe table shows the results of 9 networks:\n- 1st - 3rd networks were trained with a single task: Label only (semantic segmentation), Instance only (instance segmentation), Depth only (depth estimation).\n- 4th - 5th networks were trained with three tasks but with constant weights for every task loss (columns 2 - 4 show the weighting).\n- 6th - 8th networks were trained with two tasks using uncertainty weighting.\n- 9th network was trained with three tasks using uncertainty weighting.\n\n\n\n## Setup\n\n**Inference:**\u003cbr\u003e\n1. Download the pre-trained network folder ([resNet_label_instance_disp](https://drive.google.com/drive/folders/1gjhkFlxH0OEsOVD1YFaxrM_fWfpH1eEv?ogsrc=32)) and save it in `trained_nets`.\n2. 
Download the pre-trained resNet-101 folder ([res_net_101_ckpt](https://drive.google.com/drive/folders/1gjhkFlxH0OEsOVD1YFaxrM_fWfpH1eEv?ogsrc=32)) and save it in the main project folder.\n3. Put your input images in `inputs`.\n4. Run `inference.py`.\n5. Your results will be saved in `results`.\n\n\n\n\n","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Franandalon%2Fmtl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Franandalon%2Fmtl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Franandalon%2Fmtl/lists"}