{"id":18158328,"url":"https://github.com/sam0x17/dseg","last_synced_at":"2025-05-07T04:11:22.768Z","repository":{"id":82773044,"uuid":"58104964","full_name":"sam0x17/DSeg","owner":"sam0x17","description":"Invariant Superpixel Features for Object Detection","archived":false,"fork":false,"pushed_at":"2018-08-28T22:09:43.000Z","size":541,"stargazers_count":19,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-31T06:11:21.354Z","etag":null,"topics":["deep-learning","deep-neural-networks","feature-extraction","feature-vector","machine-learning","segmentation","superpixels"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sam0x17.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-05-05T04:47:25.000Z","updated_at":"2023-10-25T17:30:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"3d2d8eb0-9517-4db7-a71d-8b42030065d7","html_url":"https://github.com/sam0x17/DSeg","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sam0x17%2FDSeg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sam0x17%2FDSeg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sam0x17%2FDSeg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sam0x17%2FDSeg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sam0x17","download_url":"https://codeload.github.com/sam0x17/DSeg/tar.gz/refs/heads/
master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252810273,"owners_count":21807759,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deep-neural-networks","feature-extraction","feature-vector","machine-learning","segmentation","superpixels"],"created_at":"2024-11-02T07:06:19.551Z","updated_at":"2025-05-07T04:11:22.730Z","avatar_url":"https://github.com/sam0x17.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Invariant Superpixel Features for Object Detection \u0026 Localization\n*by Sam Johnson*\n\n## Introducing DSeg\n\nDSeg is a novel superpixel-based feature designed to be useful in semantic\nsegmentation, object detection, and pose-estimation/localization tasks. DSeg\nextracts the shape and color of object-positive and object-negative\nsuperpixels, and encodes this information in a compact feature vector.\n\nWhile this project is only in its infancy, DSeg has already shown promising\nresults on superpixel-wise object detection (is superpixel A part of an instance\nof object class B?). Below we cover the implementation details of DSeg, present\npreliminary experimental results, and discuss future work and additional\nenhancements.\n\n## Motivation\n\nMany who work or dabble in computer vision have noticed the intrinsic, deep coupling\nbetween image *segmentation* and higher-level tasks, such as detection, classification,\nlocalization, or more generally, whole-image *understanding*. 
The interdependence\nbetween these crucial computer vision tasks and segmentation is undeniable, and often\nfrustrating. A chicken-and-egg scenario quickly arises:\n\n\u003e How can you properly find the outline of an object without knowing what it is?\n\n\u003e On the other hand, how can you begin to understand the context and identity of\nan object, before determining its extents?\n\n### Superpixel Segmentation\n\nOne \"neat trick\" that partially alleviates this issue is the superpixel segmentation\nalgorithm. Superpixels arise from a clever, *over*-segmentation of the original image.\nOnce the superpixel algorithm has been run, virtually all major and most minor edges\nand boundaries in the image will fall on some boundary or another between two\nsuperpixels. This fact exposes an intriguing trade-off: after running the superpixel\nalgorithm, we can rest assured that virtually all edges and boundaries of interest\nhave been found *somewhere* encoded in our superpixels. That said, we unfortunately\n*also* have to deal with a large number of minor or imagined edges that are\nincluded in the superpixel segmentation.\n\n### Properties of Superpixels\n\nSuperpixels are extremely powerful because they often fall on important boundaries\nwithin the image, and tend to take on abnormal or unique shapes when they contain\nsalient object features. For example, in the superpixel segmentation of the below\naircraft carrier, notice that the most salient object features (the radar array,\nanchor, and railing of the carrier) are almost perfectly segmented by highly\nunique superpixels.\n\n![](https://storage.googleapis.com/durosoft-shared/delvr/contours_235.png)\n\n*Fig. 1: A superpixel segmented aircraft carrier. 
Salient features are revealed by the superpixels.*\n\nFrom this example, several intuitions arise:\n* In general, superpixels tend to hug salient features and edges in the image\n* \"Weird\" superpixels that differ greatly from the standard hexagonal (or rectangular)\n  shape are more likely to segment points of interest\n* \"Boring\" superpixels are part of something larger than themselves, and can often\n  be combined with adjacent superpixels\n\n### Advantages of Synthetic Data\n\nWhen writing object detectors, it helps to have images that fully describe the broad\nrange of possible poses of the object in question. With real-world images, this is\nseldom possible, however the advent of readily available, high quality 3D models,\nand the maturity of modern 3D rendering pipelines has created ample opportunity to\nuse synthetic renderings of 3D models to generate training data that:\n\n* Covers the full range of poses of the object in question, with the ability to render\n  arbitrary poses on the fly\n* Can be tuned to vary scale, rotation, translation, lighting settings, and even occlusion\n* Is *already segmented* (at every possible pose) without the need for human labeling\n* Is easy to generate, no humans required once a 3D model is made\n\n![Hamina Missile Boat](http://i.imgur.com/xryMdTz.jpg?1)\n\n*Fig. 2: 3D rendering of a Hamina-class missile boat, one of the models used to train DSeg*\n\n## DSeg Feature Extraction Pipeline\n\nIn this section we outline the details of the DSeg feature extraction pipeline and comment\non the major design decisions and possible future work and/or alterations.\n\n### Image Data Sources\n\nDSeg requires a synthetic data source (i.e. a 3D model of the object that is to be learned)\nto generate *class-positive* training samples, and a \"real-world\" image source, preferably\nscene images, to generate *class-negative* training samples. 
That is, we use synthetic data\nto generate *positive* examples of our object class's superpixels, and we use real-world data\nto generate *negative* examples of our object class's superpixels.\n\nFor this project, we used the high quality ships 3D model dataset from [1], and SUN2012 [2]\nfor real-world scene images. In particular, we randomly sampled 2000 images from [2],\nand for each 3D model in [1], we rendered 27,000 auto-scaled, auto-cropped poses uniformly\ndistributed across the space of possible poses at 128 x 128, at a fixed camera distance,\nand with slightly variable lighting. The rendering of these images is covered in greater\ndepth in [1].\n\n### DSeg Step 1: Segmentation\n\nSLIC [3] is the state-of-the-art method for performing superpixel segmentation, whereas\ngSLICr [4] is a CUDA implementation of SLIC that can perform superpixel segmentation at\nover 250 Hz. Ideally we would use gSLICr in our pipeline; however, significant modifications\nwould need to be made to gSLICr to make it possible to extract the pixel boundaries of the\ngenerated superpixels, so we settled on the slower VLFeat [5] implementation of SLIC.\n\nWe run SLIC on each image in our training set with a constant regularization parameter of\n4000 (so as to avoid over-sensitivity to illumination). We do this over a range of 25\ndifferent grid sizes to ensure we capture all of the possible superpixel shapes that\ncould occur within our object (or couldn't occur, in the case of negative examples). For\npositive examples, it is necessary to ignore superpixels that are not part of the object;\nhowever, this is easy since we rendered translucent PNGs and merely need to ignore\nsuperpixels with transparent pixels.\n\n![Hamina Superpixel Segmentation Multiscale](https://storage.googleapis.com/durosoft-shared/delvr/hamina_scales.png)\n\n*Fig. 
3: Superpixel segmentation is performed at multiple scales*\n\nFor each valid superpixel found in the current image, we will eventually create a\nfeature vector. Before passing these superpixel features on to the next pipeline\nstage, we calculate the *closest-to-average* RGB color for each superpixel. That is,\nwe calculate for each superpixel the RGB color from its color palette that is closest\nto the mean RGB color across the entire superpixel. We measure color distance simply using\nthe 3D Euclidean distance formula. Future work might explore LAB or other color spaces.\n\nIn practice, stage 2 yields approximately 700 features per positive input image, depending\non the model and viewing angle.\n\n### DSeg Step 2: Scale Invariance via ResizeContour\n\nTo obtain scale invariance for our features, we must normalize all of our superpixel\nfeatures to the same square grid size. For this project, we used a 16 x 16 pixel grid.\n\nA naive approach to this normalization phase would be to use an off-the-shelf image\nre-sizing routine, such as bicubic or bilinear interpolation. In practice, we found\nthat these algorithms, along with nearest neighbor and other alternatives, all result\nin either excessive blurring or pixelation when up-scaling very small features (which\nare quite common in DSeg, as most features are approximately 6 x 6 pixels in size).\nThis blurring or pixelation means that a neural network will be able to tell that\nthe feature in question was up-scaled. The purpose of scale invariance is to hide\nall evidence that any sort of up-scaling or down-scaling has taken place, so this\nis undesirable.\n\n![ResizeContour Upscaling Demo](https://storage.googleapis.com/durosoft-shared/delvr/resizecontour%20demo.png)\n\n*Fig. 
4: Comparison of image re-sizing algorithms up-scaling from 16x16*\n\n### ResizeContour\n\nResizeContour is a simple, novel algorithm for up-scaling single-color threshold images\nwith neither the pixelation introduced by nearest neighbor, nor the blurring introduced\nby bilinear interpolation. ResizeContour is how we up-scale superpixels (especially very\nsmall ones) in DSeg. The algorithm itself is relatively simple:\n\n1. Project each pixel from the original image to its new location in the larger image\n2. Find all the black (filled in) pixels among the 8 pixels immediately surrounding our\n   pixel in the original image\n3. Project the coordinates of these dark pixels to the new larger image\n4. Perform a polygon fill operation using this list of projected points\n\n\n![Superpixel resizing via ResizeContour algorithm](https://storage.googleapis.com/durosoft-shared/delvr/segmentation_breakout.png)\n\n*Fig. 5: ResizeContour in action. The + signs represent pixels from the original superpixel.*\n\n\nBecause we fill in overlapping polygons over each pixel in the image, this approach will\nonly work if the entire image is of one uniform color. A side effect of this algorithm\nis that one-pixel lines are usually erased from the resulting image; however, this is desirable,\nas it makes our features more block-like and more resistant to artifacts in the\npre-segmented image.\n\nOnce the superpixel is re-sized to our chosen canonical form, our feature vector is ready.\nWe store the thresholded values of the pixel grid, concatenated with three floats\nrepresenting the RGB value of the *closest-to-average* color. 
For a 16 x 16 standard\nsuperpixel size, this results in a 16 * 16 + 3 = 259-dimensional vector.\n\n## Evaluation\n\nTo evaluate the efficacy of DSeg as a general purpose feature, we constructed a simple\nexperiment that trains (using RPROP) a standard feed-forward Multi-Layer Perceptron to\ntake a single DSeg feature as input and output a single value indicating either that\nthe feature belongs to an instance of our object, or that it is background noise and/or\npart of an unknown object. We conducted this same experiment on each of the 7 models used\nin RAPTOR ([here](https://github.com/samkelly/raptor/tree/master/data/models)) and observed\nslight variations in performance.\n\n### Network Topology\n\nFor our neural network, we used OpenCV's Multi-Layer Perceptron implementation. We used\nthe recommended settings for Resilient Backpropagation (RPROP), which is known to converge\nmore quickly than standard backpropagation, and has far fewer free parameters that need\nto be configured.\n\n![basic MLP setup](http://docs.opencv.org/2.4/_images/mlp.png)\n\n#### Input Layer\n\nFor input data, we used the formulation discussed earlier, generating features from our\n27,000 positive input examples, and generating features from 2,000 negative input examples\nrandomly selected from SUN2012. When a feature is sent to our neural network, it is\nprocessed as a 259-dimensional vector of floats, where the input grid is represented as\n0.0's and 1.0's, and the RGB color is represented as three floats. Because of the inordinate\nnumber of features generated by this procedure, we randomly selected 500,000 features from\nthe input set, and trained on these. We used a 50/50 negative/positive example split.\nWe also utilized a separate validation set roughly 1/2 the size of our training set.\n\n#### Hidden Layer\n\nWe found that a hidden layer consisting of 48 neurons was sufficient. 
While we experimented\nwith networks with many more neurons (and hidden layers), none seemed to have any advantage\nover our 48 hidden neuron setup. For an activation function throughout the network, we used\nthe classic hyperbolic tangent function.\n\n#### Output Layer\n\nWe used a single output neuron, where a value less than 0.5 indicates the input was not of\nour object class, and a value of 0.5 or greater indicates the input was of our object class.\n\n### Results\n\nThe table below displays the accuracy results for the best network we were able to train for\neach of the 7 RAPTOR models, including the model name, number of training epochs, and accuracy\n(based off of the validation set).\n\nModel | Training Epochs | Accuracy\n--- | --- | ---\nHamina | 140 | 72.24%\nHalifax | 189 | 69.98%\nKirov | 137 | 71.31%\nKuznet | 162 | 70.12%\nSovDDG | 145 | 65.7%\nUdaloy | 80 | 70.17%\nM4A1 | 153 | 71.5%\n\nMost models were able to yield 70% accuracy or nearly 70% accuracy. It is no surprise that the\nHamina missile boat performed the best, as of all the models it has the highest number of visually\nrecognizable features, and is not as oblong as the other models. The M4A1 is the only model in the\nset that is not a ship (it is, in fact, an assault rifle). It was our second best performing model,\nmost likely due to its easily recognizable shape and correspondingly unique superpixel shapes.\n\nGiven that we used a simple Multi-Layer Perceptron, it is striking that we were able to achieve\neven 70% accuracy for a task as difficult as per-superpixel object classification.\n\n## Future Work\n\nFollow-up studies must be conducted to continue to verify the efficacy of DSeg as a multi-purpose\nfeature vector. 
In particular, it will be important to benchmark DSeg as a feature in deep\nnetworks that perform object detection and/or localization.\n\nA very valuable study that could be conducted would be to generate DSeg features for objects in\nMS COCO, and supplement this data with synthetic data rendered from 3D models of common objects\nin COCO. This would allow us to benchmark DSeg against a commonly used dataset.\n\nPerforming classification based on a single superpixel feature is extremely difficult. Perhaps\nan easier (but more difficult to design) experiment would be to create a network that takes\nas input a DSeg superpixel feature, augmented by features for the top 5 nearest superpixels.\nThe network would be trained to determine the object classification of the middle superpixel,\nand this would be significantly easier because it would have context information in the form\nof the surrounding superpixels.\n\n## Installation\n\nDSeg has only been tested on Ubuntu 14.04. To install, do the following:\n\n1. Make sure you have CUDA installed and available (visit\n   [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads) for more info).\n2. Clone the DSeg repo: `git clone git@github.com:samkelly/DSeg.git`\n3. `cd` into the local repo\n4. `./install_prerequisites.sh`\n\nYou can now run the \"delvr\" command-line utility via `./build_run.sh`. You can customize\nbuild_run.sh to carry out various DSeg related tasks (see commented out lines for examples).\n\n## Related Work\n\nOf particular importance to this project are the original SLIC superpixel segmentation\nroutine [3], gSLICr [4], the state-of-the-art CUDA implementation of SLIC, as well as\nVLFeat [5], OpenCV [6], and the Boost C++ library [7].\n\nThe authors of [8] derive a *fully convolutional* model capable of semantic segmentation.\nAs with DSeg, their network performs superpixel-wise classification, resulting in fully\nsegmented object proposals. 
Unlike DSeg, their network can analyze an entire image rather\nthan one superpixel at a time, but as a result, their approach is more of an end-to-end\ndetection and segmentation solution than a feature extraction pipeline that might be of\nsome help to high level computer vision tasks.\n\nIn [9], the authors leverage a similar synthetic data scheme whereby commodity 3D models\nare used to augment and/or replace real-world images in the absence of sufficient training data.\nThis idea first appeared in the 2014 RAPTOR technical report [1], from which this project\noriginally evolved.\n\n## References\n\n1. Sam Kelly, Jeff Byers, David Aha, \"RAPTOR Technical Report\", *NCARAI Technical Note, Navy\n   Center for Applied Research in Artificial Intelligence*. September 2014.\n\n2. J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. \"SUN Database: Large-scale\n   Scene Recognition from Abbey to Zoo\", *IEEE Conference on Computer Vision and Pattern Recognition*, 2010.\n\n3. Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine\n   Süsstrunk, \"SLIC Superpixels\", *EPFL Technical Report 149300*. June 2010.\n\n4. Carl Yuheng Ren, Victor Adrian Prisacariu, and Ian D Reid, \"gSLICr: SLIC superpixels\n   at over 250Hz\", *ArXiv e-prints, 1509.04232*. September 2015.\n\n5. \"VLFeat\", http://ivrl.epfl.ch/research/superpixels\n\n6. \"OpenCV\", http://opencv.org/\n\n7. \"Boost C++ Libraries\", http://boost.org\n\n8. Jonathan Long, Evan Shelhamer, and Trevor Darrell. \"Fully convolutional networks for\n   semantic segmentation.\" *Proceedings of the IEEE Conference on Computer Vision and\n   Pattern Recognition*. 2015.\n\n9. Xingchao Peng, Baochen Sun, Karim Ali, Kate Saenko, \"Learning Deep Object Detectors from\n   3D Models\", *ArXiv e-prints, 1412.7122v4*. 
October 2015.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsam0x17%2Fdseg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsam0x17%2Fdseg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsam0x17%2Fdseg/lists"}