https://github.com/kleinyuan/train-crfasrnn
Detailed guide to help you understand how to train CRF as RNN
https://github.com/kleinyuan/train-crfasrnn
computer-vision crf crf-as-rnn crfasrnn deep-learning deeplab machine-learning rnn segmentation
Last synced: 5 months ago
JSON representation
Detailed guide to help you understand how to train CRF as RNN
- Host: GitHub
- URL: https://github.com/kleinyuan/train-crfasrnn
- Owner: KleinYuan
- Created: 2017-05-19T15:17:35.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-05-26T19:21:22.000Z (about 9 years ago)
- Last Synced: 2025-01-25T13:26:14.378Z (over 1 year ago)
- Topics: computer-vision, crf, crf-as-rnn, crfasrnn, deep-learning, deeplab, machine-learning, rnn, segmentation
- Language: Makefile
- Size: 7.81 KB
- Stars: 10
- Watchers: 4
- Forks: 3
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Slack
*Join slack and enter CRF-as-RNN channel to discuss*
https://deep-learning-geeks-slack.herokuapp.com/
# Intro and Clarifications
This repo is just to
- [X] show you how to train [CRF as RNN](https://github.com/torrvision/crfasrnn) with PASCAL VOC datasets (20 classes + background)
- [X] be a well maintained place to communicate with this methods
- [X] try to rewrite CRF as RNN with Caffe2 (join the slack team and let's discuss together)
# Step by Step
## Pick Up a Machine
1. Single GPU
AWS `p2.xlarge` instance (spot instance ~0.2$/hour, with Tesla K80 GPU, 12G Memory) will be enough for the training purpose.
Equivalent setup may work.
2. Multipile GPUs
You need to make some changes to achieve that and I haven't succeeded doing this for once (tried with 3GPUs with 18G memory intotal but failed) but I will update if I see any.
## Prepare Environment
Please refer to this [repo](https://github.com/KleinYuan/easy-yolo#b-environment-gpu) for all the details and commands. And it may take you around half -> one hour.
- [X] OpenCV (You may do not need OpenCV)
- [X] NVIDIA Driver
- [X] CUDA
- [X] CUDNN
## Build CRF as RNN Caffe
1. Get CRF as RNN and navigate to the correct branch
```
>$ git clone https://github.com/torrvision/caffe.git
>$ git checkout crfrnn
```
You will have a repo called `caffe` lying on whatever you run the command above.
2. Change some source code to optimize memory consumption (!IMPORTANT)
If you just begin to build, you probably will meet this [issue](https://github.com/torrvision/crfasrnn/issues/79), which sucks.
Therefore, you need to change some Caffe source code with this [PR](https://github.com/BVLC/caffe/pull/2016).
Details:
- [X] Change `caffe/src/caffe/layers/base_conv_layer.cpp`:
```
line 12: Blob BaseConvolutionLayer::col_buffer_;
line 13: template
```
- [X] Change `caffe/include/caffe/layers/base_conv_layer.hpp`:
```
line168: static Blob col_buffer_;
```
This can reduce the GPU memory consumption via sharing memory but with a known but ignorable [bug](https://github.com/BVLC/caffe/pull/2016#issuecomment-77509575).
3. Configure Make
In the root folder of `caffe`, there's a file called `Makefile.config.example`.
Copy and Paste and then rename it to `Makefile.config` or just run `cp Makefile.config.example Makefile.config`
- [X] If you install OpenCV separately, uncomment [`USE_PKG_CONFIG := 1`](https://github.com/torrvision/caffe/blob/crfrnn/Makefile.config.example#L107)
- [X] If you want to train with Multiple GPU, you need to uncomment [`USE_NCCL := 1`](https://github.com/torrvision/caffe/blob/crfrnn/Makefile.config.example#L103) and install [NCCL](https://github.com/KleinYuan/train-crfasrnn#trial-on-train-with-multiple-gpu)
- [X] If you want to use `OpenBlas` which works more efficiently with multiple CPUs instead of `ATLAS`, you probably wanna change [`BLAS := atlas`](https://github.com/torrvision/caffe/blob/crfrnn/Makefile.config.example#L50) to `BLAS := open`, which I don't think is necessary
- [X] Comment out all the `60, 61` [arch options](https://github.com/torrvision/caffe/blob/crfrnn/Makefile.config.example#L42) since the machine you are using is probably not going to support them unless you have a machine can
Then in root folder of `caffe` just run :
```
make all
```
and in case you fucked up, you can run
```
make clean
```
to clean everything and re-make
The process of make may take a while, like 10~min.
## Prepare DataSets
The whole idea is that, we need to
- [X] Download PASCAL VOC dataset (which is very large)
- [X] Label them
- [X] Create LMDB for Caffe to easy access
So, I found this [repo](https://github.com/remz1337/train-CRF-RNN) is doing good job on this step.
Therefore, you need to:
1. Clone this [repo](https://github.com/remz1337/train-CRF-RNN)
2. Prepare
Follow the step from [Prepare dataset for training](https://github.com/remz1337/train-CRF-RNN#prepare-dataset-for-training) to [Create LMDB database](https://github.com/remz1337/train-CRF-RNN#create-lmdb-database) and stop here.
You will have trouble on executing last [step](https://github.com/remz1337/train-CRF-RNN#create-lmdb-database) because this script needs a very small [functionality](https://github.com/remz1337/train-CRF-RNN/blob/master/data2lmdb.py#L14) to [dump](https://github.com/remz1337/train-CRF-RNN/blob/master/data2lmdb.py#L156) img into datum.
Therefore, you have two options:
a. In the root folder of this repo, also clone a Caffe (can be any version) and just build it like above except that you don't do any changes, basically:
- [X] Clone Caffe
- [X] run:
```
cp Makefile.config.example Makefile.config
make all
make pycaffe #!IMPORTANT and we didn't do that above because we didn't need this while you need this here
```
b. Go to the caffe we built above and continue build pycaffe and change the file path:
- [X] run command: make pycaffe
- [X] add caffe root like [this example](https://github.com/remz1337/train-CRF-RNN/blob/master/crfasrnn.py#L6) after this[line](https://github.com/remz1337/train-CRF-RNN/blob/master/data2lmdb.py#L6) with the actual caffe relative/absolute(recommended) path
After you've done those two above, you should be able to finish all the labeling step and thereby, in the folder of `train-CRF-RNN`, you will be able to see those folders:
- [X] train_images_20_lmdb
- [X] train_labels_20_lmdb
- [X] test_images_20_lmdb
- [X] test_labels_20_lmdb
## Prepare Training Proto buffer files
1. Clone this repo in a different place
```
git clone https://github.com/KleinYuan/train-crfasrnn.git
```
2. Edit`trainKit/CRFRNN_train.prototxt`
Replace `${PATH}` in line7/19/31/41 with actual *absolute* path of those folders I listed above.
3. Edit Make file of this repo
Replace `${CAFFE_PATH}` with the root path of the caffe we built above and replace `${TRAIN_CRF_RNN_PATH}` with root path of this repo.
## Download Pre-trained Models
If you check the Makefile, you will see I offer you four choices:
- [X] Train-single-gpu-from-0
- [X] Train-multiple-gpus-from-0
- [X] Train-single-gpu-fine-tuning
- [X] Train-multiple-gpus-fine-tuning
If you wanna train a model from sratch, you need to download the FCN-8s model and put it in the folder of Makefile by just run:
```
wget http://dl.caffe.berkeleyvision.org/fcn-8s-pascal.caffemodel
```
If you wanna train a model based on the pre-trained model, you need to download the `TVG_CRFRNN_COCO_VOC.caffemodel` (be aware of the LICENCE of this model, it's not free for commercial usage):
```
wget http://goo.gl/j7PrPZ -O TVG_CRFRNN_COCO_VOC.caffemodel
```
## Train!
Finally, you can train the model based on your purpose with the Makefile by running one of following command:
```
make Train-single-gpu-from-0
```
or
```
make Train-multiple-gpus-from-0
```
or
```
Train-single-gpu-fine-tuning
```
or
```
Train-multiple-gpus-fine-tuning
```
Also, if you wanan do multiple GPUs and have many GPUs, just keep adding `2, 3, 4...` on `0, 1` after the `-gpu` flag.
## Trial on Train with Multiple GPU
So, for Multiple GPUs training with Caffe, it's very picky for environment.
Those dependencies/changes are necessary to achieve this:
- [X] [NCCL](https://github.com/NVIDIA/nccl), which is subjected to CUDA version and be aware of the branch
- [X] Caffe Makefile uncomment [`USE_NCCL := 1`](https://github.com/torrvision/caffe/blob/crfrnn/Makefile.config.example#L103)
## Debug and Validation
Potential problems you will meet are:
1. `Check failed: error == cudaSuccess (2 vs. 0) out of memory` ----> it means what it says, your memory is not enough
2. `error==cudaSuccess (77 vs. 0) an illegal memory access was encountered` -----> it means that either the shape is not correct or your cuda version is not correct, check [here](https://github.com/BVLC/caffe/issues/4169)
3. `iteration 0` stuck for a long time ----> it's normal, just chill and drink Coffee