An open API service indexing awesome lists of open source software.

https://github.com/miladfa7/3d-bbox-predictor

3D bounding box prediction using Point Cloud and RGB image | Models: Transformer, Multimodal(PointNet++, ResNet), PointNet++
https://github.com/miladfa7/3d-bbox-predictor

3d-bounding-boxes multimodal point-cloud pointnet2 transformer

Last synced: 9 months ago
JSON representation

3D bounding box prediction using Point Cloud and RGB image | Models: Transformer, Multimodal(PointNet++, ResNet), PointNet++

Awesome Lists containing this project

README

          

# Sereact 3D bounding Box Predictor

Sereact Company

## Code Challenge Implementation

### Key Components:
- Data Analysis
- Data Preprocessing
- Model Architectures
- Performance Evaluation
- Suggestions
- Problems

### Required Packages:
```
torch
timm
opencv
open3d
nvidia-cu12.8
matplotlib
wandb
numpy
```

## 1. Data Analysis

### Point Cloud:

Based on my understanding, the point cloud data is organized in a structure similar to an image. Its shape is (3, height, width), where each pixel corresponds to a 3D point with X, Y, and Z coordinates.

It appears these point cloud and other data are generated by **simulation software** specifically for training model in test environment.

**Channel Descriptions:**.

- X-axis: Measure **how far** a point from camera's center in the **horizontal direction** (left: negative, right: positive, center: zero)

- Y-axis: Measure **how far** a point from camera's center in the **vertical direction**. (above: negative, below: positive, center: zero)

- Z-axis: Depth distance from the camera, changing smoothly from top to bottom. it provides **per-pixel depth**. Objects **closer** to the camera have **smaller Z values**. Objects **farther away** have **larger Z values**.

**Visualization Command:**
```python
python visualizer/point_viz.py --file_path dataset/points/00001.npy
```



point cloud viz


**3D boudning Boxes(Ground Truth) Visualization**

3D bounding box represnts 8 corners, each having three values x, y, z. Its shape is (num_objects, 8, 3).

```python
python visualizer/open3d_viz.py --sample 859074c5-9915-11ee-9103-bbb8eae05561 --draw_3d_box
```

3D bbox Ground Truth

**Mask and generate 2D Bounding Boxes**.
```python
python visualizer/mask_2dbox_viz.py --image_path ./dataset/images/00026.jpg --mask_path ./dataset/masks/00026.npy
```
mask and 2d bounding box

### 2. Data Preparation and Preprocessing

#### Dataset Restructuring

Restructuring raw data organizes it into a standardized format, making it easier to process and access.

**Command:**
```python
python prepare_dataset.py --input-path ./raw_data --output-path dataset
```
**DATASET Structure**

```
dataset/
│── points/
│ ├── 00001.npy
│ ├── 00002.npy
│ ├── ...
│── masks/
│ ├── 00001.npy
│ ├── 00002.npy
│ ├── ...
│── bboxes_3d/
│ ├── 00001.npy
│ ├── 00002.npy
│ ├── ...
│── images/
│ ├── 00001.jpg
│ ├── 00002.jpg
│ ├── ...

#### Preprocessing
* **Image (ImageMaskTransforms)**
- Resizing to (224, 224)
- Normalization (range [-1, 1])
- To Tensor
- Brightness augmentation
* **Point Cloud (PointCloudTransforms)**
- Reshape: Converts the points cloud from shape (3, H, W) to (N, 3 )for compatibility with the PointNet model or LiDAR-based model
- Normalization
- Voxelization
* **3D Boudning Box (BBox3DTransforms)**
- Reshape: Convert 3D corners representation to centroid, size, orientation(reasoin for model part)
- Boudning box parameterization: center_x, center_y, center_z, width, height, depth, yaw. its shape is (num_objects, 7)
- To Tensor
- Normalization

#### Data Loader
- **Data Loading:** There are two dataloader such as the **SereactDataLoader**, **PillarsDataLoader** that loads images, point clouds, 3D bounding boxes with support for transformations on each data type.
- **Data Splitting**: The **DataSpliter** class handles file loading, shuffling, and splitting the all data into training and testing sets.

## Deep Learning Models
### 1. Point Cloud Based Model (PointPillars)

This model is implemented with the following features:

### Model Pipeline:
- **Total Parameters: 4,830,140**
- **Input: Point Cloud**

- Transform the organized point cloud into an unorganized format with shape (N, 3) to feed into the PointNet-based model
- **Voxelization**

Converting 3D point cloud data into a grid of voxels to represent spatial information. The voxel values were specifically tuned for the Sereact dataset point ranges, which were achieved using this script.
```
python utils/get_point_ranges.py
```
- Voxel Size: **[0.01, 0.01, 0.01]**
- Point Cloud Range: **[-1.60, -1.35, 0.0, 1.60, 1.35, 3.0]**
- Max Number Points: **32**
- Max Voxels: **(30000, 50000)**


**Note**: the Voxels parameters adjusted by myself.
- **Pillar Feature Encoding:**

- Extract high-level features from each voxel/pillar using a PointNet based model

- Convert pillars features into densce pseude-images
- **2D CNN Backbone**
- Process the pseudo-image using 2D CNN layers
- **SSD Detection Head**
- Predict oriented 3D bounding boxes with the output structure (x,y,z,w,l,h,θ)
- Utilizes anchor-based regression
- Anchors size: [
[0.096, 0.096, 0.10], # Small anchor

[0.153, 0.154, 0.15], # Medium anchor

[0.21, 0.213, 0.21], # Large anchor
]
- Anchors ranges: [[-1.60, -1.35, 0.0, 1.60, 1.35, 3.0]]
```
python utils/get_bboxes_ranges.py
```
- **Output: 3D Bounding Box**
- Multiple 3D bounding boxes for the point cloud

#### Challenges: The dataset lacks essential configuration parameters needed for voxelization and anchor generation. I manually adjusted these parameters, but they may likely cause errors in prediction and training.

#### Training Configurations:
```
Path: configs/pillar_config.yaml
```
- Loss Function: **Smooth L1**
- Optimizer: **AdamW**
- Learning Rate: **0.001**
- Batch Size: **8**
- Train and Test Sets: **80%, 20% respectively**
- Epochs:

PointPillars

#### Run the Model Training

```
python train_pillars.py --config configs/pillars_config.yaml
```

### 2. Multi-Modal Model (CNN/ViT & PointNet++)

This model is implemented with the following features:

### Model Pipeline:
- **Total Parameters:**
- ResNet50 + PointNet: **28,476,833**
- ViT + PointNet: **103,352,417**
- **Input: Point Cloud and RGB Image**

- Points are retained the raw points without voxelization.normalized to the range [-1, +1], and 100,000 points are sampled from each point cloud.
- RGB images are normalized to the range [-1, +1], resized to [224, 224], and undergo brightness augmentation.
- **Image Feature Extraction**
- CNN-based -> ResNet50
- Transformer-Based -> Visual Transformer

- **Point Cloud Features Extraction (PointNet++)**
- Processes raw point clouds to extract spatial features.
- **Fusion Network**
- Combining the extracted features from both the RGB image (CNN output) and the point cloud (PointNet++ output)
- **MLP Layer and Regression Head**
- Processes the fused feature vector to predict 3D bounding boxes.
- **Output: 3D Bounding Box**
- Multiple 3D bounding boxes using the point cloud and rgb image

Multimodal

#### Training Configurations:
```
Path: configs/multimodal_config.yaml
```
- Loss Function: **Smooth L1**
- Optimizer: **AdamW**
- Learning Rate: **0.0001**
- Batch Size: **8**
- Train and Test Sets: **80%, 20% respectively**
- Image Backbone: **Resnet50 or ViT**
- Point Backbone: **PointNet++**
- Epochs: **80**

#### Run the Model Training

```python
python3 train_multimodal.py --config configs/multimodal_config.yaml
```

### 3. Transformer Based Model (3DETR)
This model is implemented with the following features:

3DETR

#### Data preparation for 3DETR
I am following the VoteNet codebase to preprocess data for training 3DETR, using instructions for datasets like SUN RGB-D, and I have customized it accordingly.

```python
python3 data/detr3d_data.py --data_root dataset/sereact_3detr --num_points 100000
```

### Model Pipeline:
- **Total Parameters: 7,306,976**
- **Input: Point Cloud**

- it Works directly with **raw point clouds** without requiring voxelization.
- **Backbone**: 3DETR utilizes a Transformer-based architecture:
- **Encoder**: Consists of 3 layers, each employing multi-headed attention with four heads and a two-layer MLP with hidden dimensions.
- **Decoder**: Comprises 8 layers, mirroring the encoder but with MLP hidden dimensions set to 128.
- **Output: 3D Bounding Box**:
- The model predicts 3D bounding boxes corners, means output format is (M, 8, 3)

- **Inference Time**: PyTorch model running time was 0.10 second or 100 milliseconds

#### Training Configurations:

- Loss Function: **Hungarian Loss**
- Optimizer: **AdamW**
- Learning Rate: **0.0004**
- Batch Size: **8**
- Train and Test Sets: **85%, 15% respectively**
- Point Backbone: **Transformer**
- Epochs: **1080**

#### Run the Model Training

```python
python3 train_3detr.py -dataset_name sereact --dataset_root_dir sereact_trainval --max_epoch 1080 --nqueries 256 --base_lr 1e-4 --matcher_giou_cost 3 --matcher_cls_cost 1 --matcher_center_cost 5 --matcher_objectness_cost 5 --loss_giou_weight 0 --loss_no_object_weight 0.1 --save_separate_checkpoint_every_epoch -1 --checkpoint_dir outputs/sereact_ep1080
```

## Performance Evaluation with Smooth L1 Loss:

### 1. PointNet-based model performances

#### Logged by WandB
Multimodal

❗❗❗

***Based on the evaluation, the model is not performing as expected, primarily because the voxelization parameters and the anchor sizes, are not correctly configured. I derived these values from the data distribution using the scripts mentioned above.***

❗❗❗

### 2. MultiModal model performances
#### Logged by WandB

Multimodal

#### Logged by terminal
```
Epoch 1
========== Train Loss: 1.7952
========== Eval Loss: 0.3506
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 6.66it/s]
Epoch 2
========== Train Loss: 1.0478
========== Eval Loss: 0.1832
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 6.70it/s]
Epoch 3
..
..
..
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 6.44it/s]
Epoch 75

Epoch 78
========== Train Loss: 0.3594
========== Eval Loss: 0.1036
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 6.49it/s]
Epoch 79
========== Train Loss: 0.3628
========== Eval Loss: 0.1040
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 6.50it/s]
Epoch 80
========== Train Loss: 0.3761
========== Eval Loss: 0.1031
```

### 3. 3DETR(Transformer) model performances

#### Training results
testing_3dter

- **Train Loss:**, it decreased smoothly which means good training performance but it migth be overfit on training features
- **mAP(mean average precision) and AR(average recall) metrics**:
- mAP at 0.25 and AR at 0.25 are around 70, which indicates the model is doing well in detecting objects during the training.

#### Testing results
training_3dter

- **Test Loss:**, it didn't decrease smoothly, and increase after 10k steps. It might be indicate overfitting.
- **AP(average precision) and AR(average recall) metrics**:
- The mAP at 0.25 fluctuates between **38-45**. it doens't improve significantly after 10k steps.
- AR at 0.25 fluctuates between 40-47.

## Suggestions
- Use **Point Transformer v2/v3** as point clouds processing backbone to estimate 3D bounding box
- Fuse Point Transformer with visual backbone (Like CNN Models) to estimate 3D bounding box robustly.
- Use Transformer-Based 3D object detection like DETR3D, provided by facebookresearch [Github](https://github.com/facebookresearch/3detr)
- Use efficeint and accurate CNN backbone for image feature extraction
- Adopt Point Transform for point cloud processing instead of PointNet
- Integrate Camera intrinsic parameters for enhance training data prepration
- Use attention mechanisms for multi-modal feature fusion
- Using segmentation mask for better 3D object localication

## Problems
- The provided dataset lacked camera intrinsic parameters.
- Initially, I attempted to train 3D object detection model using MMDetection, but I encountered issues with CUDA installation and package inconsistencies. After resolving those, I faced another problem with the data loader carshing. Ultimately, I decided to develop my own pipeline.