{"id":30240239,"url":"https://github.com/miladfa7/3d-bbox-predictor","last_synced_at":"2025-10-06T00:48:10.886Z","repository":{"id":278056991,"uuid":"934371366","full_name":"miladfa7/3D-BBox-Predictor","owner":"miladfa7","description":"3D bounding box prediction using Point Cloud and RGB image |  Models: Transformer, Multimodal(PointNet++, ResNet), PointNet++","archived":false,"fork":false,"pushed_at":"2025-03-12T10:20:24.000Z","size":75895,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-15T04:23:18.334Z","etag":null,"topics":["3d-bounding-boxes","multimodal","point-cloud","pointnet2","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miladfa7.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-17T18:05:51.000Z","updated_at":"2025-05-31T11:52:03.000Z","dependencies_parsed_at":"2025-08-15T04:28:33.362Z","dependency_job_id":null,"html_url":"https://github.com/miladfa7/3D-BBox-Predictor","commit_stats":null,"previous_names":["miladfa7/3d-bbox-predictor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/miladfa7/3D-BBox-Predictor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miladfa7%2F3D-BBox-Predictor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miladfa7%2F3D-BBox-Predictor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miladfa7%2F3D-BBox-Predictor/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miladfa7%2F3D-BBox-Predictor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miladfa7","download_url":"https://codeload.github.com/miladfa7/3D-BBox-Predictor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miladfa7%2F3D-BBox-Predictor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278542683,"owners_count":26004061,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-bounding-boxes","multimodal","point-cloud","pointnet2","transformer"],"created_at":"2025-08-15T04:13:51.511Z","updated_at":"2025-10-06T00:48:10.879Z","avatar_url":"https://github.com/miladfa7.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"  \n\n# Sereact 3D bounding Box Predictor\n\n  \u003cimg src=\"./images/sereact.jpg\" alt=\"Sereact Company\" width=\"800\"\u003e \u003cbr\u003e\n\n\n## Code Challenge Implementation\n\n### Key Components:\n- Data Analysis\n- Data Preprocessing\n- Model Architectures\n- Performance Evaluation\n- Suggestions\n- Problems\n  \n### Required Packages: \n```\ntorch\ntimm\nopencv\nopen3d\nnvidia-cu12.8\nmatplotlib\nwandb\nnumpy\n```\n\n## 1. 
Data Analysis\n\n### Point Cloud: \u003cbr\u003e\nBased on my understanding, the point cloud data is organized in a structure similar to an image. Its shape is (3, height, width), where each pixel corresponds to a 3D point with X, Y, and Z coordinates.\n\nIt appears that the point clouds and the other data were generated by **simulation software** specifically for training models in a test environment. \n\n**Channel Descriptions:**\n\n- X-axis: Measures **how far** a point is from the camera's center in the **horizontal direction** (left: negative, right: positive, center: zero).\n\n- Y-axis: Measures **how far** a point is from the camera's center in the **vertical direction** (above: negative, below: positive, center: zero).\n\n- Z-axis: Depth distance from the camera, changing smoothly from top to bottom. It provides **per-pixel depth**. Objects **closer** to the camera have **smaller Z values**. Objects **farther away** have **larger Z values**. \u003cbr\u003e\n\n**Visualization Command:**\n```python\npython visualizer/point_viz.py --file_path dataset/points/00001.npy\n```\n\n\u003cbr\u003e\n\u003cimg src=\"./images/point_cloud_viz.png\" alt=\"point cloud viz\" width=\"700\"\u003e \u003cbr\u003e\n\n\u003cbr\u003e\n\n**3D Bounding Boxes (Ground Truth) Visualization**\u003cbr\u003e\nEach 3D bounding box is represented by 8 corners, each with three values (x, y, z). The array shape is (num_objects, 8, 3).\n\n```python\npython visualizer/open3d_viz.py --sample 859074c5-9915-11ee-9103-bbb8eae05561 --draw_3d_box\n```\n\n\u003cimg src=\"./images/3dviz.png\" alt=\"3D bbox Ground Truth\" width=\"800\"\u003e \u003cbr\u003e\n\n**Masks and Generated 2D Bounding Boxes**\n```python\npython visualizer/mask_2dbox_viz.py --image_path ./dataset/images/00026.jpg  --mask_path ./dataset/masks/00026.npy\n```\n\u003cimg src=\"./images/mask_2dbox.png\" alt=\"mask and 2d bounding box\" width=\"800\"\u003e \u003cbr\u003e\n\n\n### 2. 
Data Preparation and Preprocessing\n\n#### Dataset Restructuring\n\nRestructuring the raw data organizes it into a standardized format, making it easier to process and access.\n\n**Command:**\n```python\npython prepare_dataset.py --input-path ./raw_data --output-path dataset \n```\n**Dataset Structure**\n\n```\ndataset/\n│── points/\n│   ├── 00001.npy\n│   ├── 00002.npy\n│   ├── ...\n│── masks/\n│   ├── 00001.npy\n│   ├── 00002.npy\n│   ├── ...\n│── bboxes_3d/\n│   ├── 00001.npy\n│   ├── 00002.npy\n│   ├── ...\n│── images/\n│   ├── 00001.jpg\n│   ├── 00002.jpg\n│   ├── ...\n```\n\n#### Preprocessing\n* **Image (\u003cem\u003eImageMaskTransforms\u003c/em\u003e)**\n    - Resizing to (224, 224)\n    - Normalization (range [-1, 1])\n    - To Tensor\n    - Brightness augmentation\n* **Point Cloud (\u003cem\u003ePointCloudTransforms\u003c/em\u003e)**\n    - Reshape: Converts the point cloud from shape (3, H, W) to (N, 3) for compatibility with the PointNet model or a LiDAR-based model\n    - Normalization\n    - Voxelization\n* **3D Bounding Box (\u003cem\u003eBBox3DTransforms\u003c/em\u003e)**\n    - Reshape: Converts the 3D corner representation to centroid, size, and orientation (the reason is given in the model section)\n    - Bounding box parameterization: center_x, center_y, center_z, width, height, depth, yaw. Its shape is (num_objects, 7)\n    - To Tensor\n    - Normalization\n\n#### Data Loader\n - **Data Loading:** There are two data loaders, \u003cem\u003e**SereactDataLoader**\u003c/em\u003e and \u003cem\u003e**PillarsDataLoader**\u003c/em\u003e, which load images, point clouds, and 3D bounding boxes, with support for transformations on each data type.\n - **Data Splitting**: The \u003cem\u003e**DataSpliter**\u003c/em\u003e class handles file loading, shuffling, and splitting all data into training and testing sets. \n\n\n\n## Deep Learning Models\n### 1. 
Point Cloud Based Model (PointPillars)\n\nThis model is implemented with the following features:\n\n### Model Pipeline:\n - **Total Parameters: 4,830,140** \n - **Input: Point Cloud** \u003cbr\u003e \n    - Transform the organized point cloud into an unorganized format with shape (N, 3) to feed into the PointNet-based model\n - **Voxelization**\u003cbr\u003e\n  Converts the 3D point cloud data into a grid of voxels to represent spatial information. The voxel values were tuned specifically for the point ranges of the Sereact dataset, which were obtained with this script:\n    ```\n    python utils/get_point_ranges.py\n    ```\n   - Voxel Size: **[0.01, 0.01, 0.01]** \n   - Point Cloud Range: **[-1.60, -1.35, 0.0, 1.60, 1.35, 3.0]**\n   - Max Number Points: **32**\n   - Max Voxels: **(30000, 50000)** \u003cbr\u003e\n    \n    **Note**: I adjusted the voxel parameters myself. \n - **Pillar Feature Encoding:** \u003cbr\u003e\n    - Extract high-level features from each voxel/pillar using a PointNet-based model\u003cbr\u003e\n    - Convert pillar features into dense pseudo-images\n - **2D CNN Backbone**\n    - Process the pseudo-image using 2D CNN layers\n - **SSD Detection Head**\n    - Predict oriented 3D bounding boxes with the output structure (x, y, z, w, l, h, θ)\n    - Utilizes anchor-based regression\n    - Anchor sizes: [\n                [0.096, 0.096, 0.10], #  Small anchor \u003cbr\u003e\n                [0.153, 0.154, 0.15],  # Medium anchor\u003cbr\u003e\n                [0.21, 0.213, 0.21], # Large anchor\u003cbr\u003e]\n    - Anchor ranges: [[-1.60, -1.35, 0.0, 1.60, 1.35, 3.0]]\n    ```\n    python utils/get_bboxes_ranges.py\n    ```\n - **Output: 3D Bounding Box**\n    - Multiple 3D bounding boxes for the point cloud\n\n#### Challenges: The dataset lacks essential configuration parameters needed for voxelization and anchor generation. 
I adjusted these parameters manually, but they may cause errors in prediction and training.\n\n#### Training Configurations:\n```\n    Path: configs/pillar_config.yaml\n```\n - Loss Function: **Smooth L1**\n - Optimizer: **AdamW**\n - Learning Rate: **0.001**\n - Batch Size: **8**\n - Train and Test Sets: **80%, 20% respectively**\n - Epochs: \n\n\u003cimg src=\"./images/pointnet.jpg\" alt=\"PointPillars\" width=\"800\" height=\"300\"\u003e \u003cbr\u003e\n\n#### Run the Model Training \n\n\n```\n    python train_pillars.py --config configs/pillars_config.yaml\n```\n\n\n### 2. Multi-Modal Model (CNN/ViT \u0026 PointNet++) \n\nThis model is implemented with the following features:\n\n### Model Pipeline:\n - **Total Parameters:** \n    - ResNet50 + PointNet: **28,476,833**\n    - ViT + PointNet: **103,352,417** \n - **Input: Point Cloud and RGB Image** \u003cbr\u003e \n    - Raw points are retained without voxelization, normalized to the range [-1, +1], and 100,000 points are sampled from each point cloud.\n    - RGB images are normalized to the range [-1, +1], resized to [224, 224], and undergo brightness augmentation.\n - **Image Feature Extraction**\n    - CNN-based -\u003e ResNet50\n    - Transformer-based -\u003e Vision Transformer (ViT)\n\n - **Point Cloud Feature Extraction (PointNet++)**\n    - Processes raw point clouds to extract spatial features.\n - **Fusion Network**\n    - Combines the extracted features from both the RGB image (CNN output) and the point cloud (PointNet++ output)\n - **MLP Layer and Regression Head**\n    - Processes the fused feature vector to predict 3D bounding boxes.\n - **Output: 3D Bounding Box**\n    - Multiple 3D bounding boxes predicted from the point cloud and RGB image \n\n\u003cimg src=\"./images/Multimodal.jpg\" alt=\"Multimodal\" width=\"800\" height=\"370\"\u003e \u003cbr\u003e\n\n\n#### Training Configurations:\n```\n    Path: configs/multimodal_config.yaml\n```\n - Loss Function: **Smooth L1**\n - Optimizer: 
**AdamW**\n - Learning Rate: **0.0001**\n - Batch Size: **8**\n - Train and Test Sets: **80%, 20% respectively**\n - Image Backbone: **ResNet50 or ViT**\n - Point Backbone: **PointNet++**\n - Epochs: **80**\n\n\n#### Run the Model Training\n\n```python\npython3 train_multimodal.py --config configs/multimodal_config.yaml\n```\n\n\n### 3. Transformer Based Model (3DETR)\nThis model is implemented with the following features:\n\n\u003cimg src=\"./images/3DETR.png\" alt=\"3DETR\" width=\"800\" height=\"180\"\u003e \u003cbr\u003e\n\n#### Data preparation for 3DETR\nI followed the VoteNet codebase to preprocess the data for training 3DETR, using its instructions for datasets such as SUN RGB-D, and customized it for this dataset.\n\n```python\npython3 data/detr3d_data.py --data_root dataset/sereact_3detr --num_points 100000\n```\n\n### Model Pipeline:\n - **Total Parameters: 7,306,976** \n - **Input: Point Cloud** \u003cbr\u003e \n    - Works directly with **raw point clouds** without requiring voxelization.\n - **Backbone**: 3DETR utilizes a Transformer-based architecture:\n    - **Encoder**: Consists of 3 layers, each employing multi-headed attention with four heads and a two-layer MLP with hidden dimensions.\n    - **Decoder**: Comprises 8 layers, mirroring the encoder but with MLP hidden dimensions set to 128.\n - **Output: 3D Bounding Box**:\n    - The model predicts 3D bounding box corners, so the output format is (M, 8, 3)\n\n - **Inference Time**: The PyTorch model's running time was 0.10 seconds (100 milliseconds) \n\n#### Training Configurations:\n\n - Loss Function: **Hungarian Loss**\n - Optimizer: **AdamW**\n - Learning Rate: **0.0004**\n - Batch Size: **8**\n - Train and Test Sets: **85%, 15% respectively**\n - Point Backbone: **Transformer**\n - Epochs: **1080**\n\n#### Run the Model Training\n\n```python\npython3 train_3detr.py --dataset_name sereact --dataset_root_dir sereact_trainval --max_epoch 1080 --nqueries 256 --base_lr 1e-4 --matcher_giou_cost 3 
--matcher_cls_cost 1 --matcher_center_cost 5 --matcher_objectness_cost 5 --loss_giou_weight 0 --loss_no_object_weight 0.1 --save_separate_checkpoint_every_epoch -1 --checkpoint_dir outputs/sereact_ep1080\n```\n\n## Performance Evaluation with Smooth L1 Loss:\n\n### 1. PointNet-based model performances\n\n#### Logged by WandB\n\u003cimg src=\"./images/pointnet_loss.png\" alt=\"Multimodal\" width=\"800\" height=\"370\"\u003e \u003cbr\u003e\n\n\n❗❗❗\n\n***Based on the evaluation, the model is not performing as expected, primarily because the voxelization parameters and the anchor sizes, are not correctly configured. I derived these values from the data distribution using the scripts mentioned above.***\n\n❗❗❗\n\n### 2. MultiModal model performances\n#### Logged by WandB\n\n\u003cimg src=\"./images/loss_multimodal.png\" alt=\"Multimodal\" width=\"800\" height=\"370\"\u003e \u003cbr\u003e\n\n\n#### Logged by terminal\n```\nEpoch 1 \n========== Train Loss: 1.7952 \n========== Eval Loss: 0.3506\nEvaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00\u003c00:00,  6.66it/s]\nEpoch 2 \n========== Train Loss: 1.0478 \n========== Eval Loss: 0.1832\nEvaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00\u003c00:00,  6.70it/s]\nEpoch 3 \n..\n..\n..\nEvaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00\u003c00:00,  6.44it/s]\nEpoch 75 \n\nEpoch 78 \n========== Train Loss: 0.3594 \n========== Eval Loss: 0.1036\nEvaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00\u003c00:00,  6.49it/s]\nEpoch 79 
\n========== Train Loss: 0.3628 \n========== Eval Loss: 0.1040\nEvaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00\u003c00:00,  6.50it/s]\nEpoch 80 \n========== Train Loss: 0.3761 \n========== Eval Loss: 0.1031\n```\n\n### 3. 3DETR (Transformer) model performances\n\n#### Training results \n \u003cimg src=\"./images/testing_3dter.png\" alt=\"testing_3dter\" width=\"800\" height=\"370\"\u003e \u003cbr\u003e\n  - **Train Loss:** It decreased smoothly, which indicates good training performance, but the model might be overfitting to the training features.\n  - **mAP (mean average precision) and AR (average recall) metrics**:\n    - mAP at 0.25 and AR at 0.25 are around 70, which indicates that the model detects objects well during training. \n\n#### Testing results \n \u003cimg src=\"./images/training_3dter.png\" alt=\"training_3dter\" width=\"800\" height=\"370\"\u003e \u003cbr\u003e\n\n  - **Test Loss:** It did not decrease smoothly and increased after 10k steps, which might indicate overfitting.\n  - **AP (average precision) and AR (average recall) metrics**:\n    - The mAP at 0.25 fluctuates between **38-45**. 
It doesn't improve significantly after 10k steps.\n    - AR at 0.25 fluctuates between 40-47.\n\n\n\n\n## Suggestions\n - Use **Point Transformer v2/v3** as the point cloud processing backbone to estimate 3D bounding boxes\n - Fuse Point Transformer with a visual backbone (such as a CNN model) to estimate 3D bounding boxes robustly\n - Use Transformer-based 3D object detection such as 3DETR, provided by facebookresearch ([GitHub](https://github.com/facebookresearch/3detr))\n - Use an efficient and accurate CNN backbone for image feature extraction\n - Adopt Point Transformer for point cloud processing instead of PointNet\n - Integrate camera intrinsic parameters to enhance training data preparation\n - Use attention mechanisms for multi-modal feature fusion\n - Use segmentation masks for better 3D object localization \n\n## Problems\n - The provided dataset lacked camera intrinsic parameters.\n - Initially, I attempted to train a 3D object detection model using MMDetection, but I encountered issues with the CUDA installation and package inconsistencies. After resolving those, I faced another problem with the data loader crashing. Ultimately, I decided to develop my own pipeline.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiladfa7%2F3d-bbox-predictor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiladfa7%2F3d-bbox-predictor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiladfa7%2F3d-bbox-predictor/lists"}