{"id":15553575,"url":"https://github.com/bes-dev/pytorch_clip_bbox","last_synced_at":"2025-04-14T09:41:18.446Z","repository":{"id":153308698,"uuid":"441535699","full_name":"bes-dev/pytorch_clip_bbox","owner":"bes-dev","description":"Pytorch based library to rank predicted bounding boxes using text/image user's prompts.","archived":false,"fork":false,"pushed_at":"2021-12-25T15:16:16.000Z","size":39448,"stargazers_count":51,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-11T14:11:03.115Z","etag":null,"topics":["computer-vision","deep-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bes-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-24T18:50:58.000Z","updated_at":"2024-04-04T15:51:38.000Z","dependencies_parsed_at":null,"dependency_job_id":"ca1e1523-3b7f-421e-b6c0-6940679935f1","html_url":"https://github.com/bes-dev/pytorch_clip_bbox","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fpytorch_clip_bbox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fpytorch_clip_bbox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fpytorch_clip_bbox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fpytorch_clip_bbox/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bes-dev","download_url":"https://codeload.github.com/bes-dev/pytorch_clip_bbox/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248855755,"owners_count":21172639,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","pytorch"],"created_at":"2024-10-02T14:38:52.093Z","updated_at":"2025-04-14T09:41:18.435Z","avatar_url":"https://github.com/bes-dev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pytorch_clip_bbox: Implementation of the CLIP guided bbox ranking for Object Detection.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"resources/preds.jpg\"/\u003e\n\u003c/p\u003e\n\nPyTorch-based library to rank predicted bounding boxes using the user's text or image prompts.\n\nUsually, object detection models are trained to detect common classes of objects such as \"car\", \"person\", \"cup\", \"bottle\".\nBut sometimes we need to detect more complex classes such as \"lady in the red dress\", \"bottle of whiskey\", or \"where is my red cup\" instead of \"person\", \"bottle\", \"cup\" respectively.\nOne way to solve this problem is to train more complex detectors that can detect more complex classes,\nbut we propose to use text-driven object detection, which allows detecting any complex class that can be described in natural language.\nThis library is written to rank predicted bounding boxes using text/image descriptions of complex classes.\n\n## Install package\n\n```bash\npip install 
pytorch_clip_bbox\n```\n\n## Install the latest version\n\n```bash\npip install --upgrade git+https://github.com/bes-dev/pytorch_clip_bbox.git\n```\n\n## Features\n- The library supports multiple prompts (images or texts) as targets for filtering.\n- The library automatically detects the language of the input text and translates it via Google Translate.\n- The library supports the original CLIP model by OpenAI and the ruCLIP model by SberAI.\n- Simple integration with different object detection models.\n\n## Usage\n\nWe provide examples of integrating our library with popular object detectors such as [YOLOv5](examples/yolov5.py) and [MaskRCNN](examples/maskrcnn.py).\nSee the [examples](examples/) directory for more.\n\n### Simple example to integrate pytorch_clip_bbox with MaskRCNN model\n\n```bash\n$ pip install wheel cython opencv-python numpy torch torchvision pytorch_clip_bbox\n```\n\n```python\nimport argparse\nimport random\nimport cv2\nimport numpy as np\nimport torch\nimport torchvision.transforms as T\nimport torchvision\nfrom pytorch_clip_bbox import ClipBBOX\n\ndef get_coloured_mask(mask):\n    colours = [[0, 255, 0],[0, 0, 255],[255, 0, 0],[0, 255, 255],[255, 255, 0],[255, 0, 255],[80, 70, 180],[250, 80, 190],[245, 145, 50],[70, 150, 250],[50, 190, 190]]\n    r = np.zeros_like(mask).astype(np.uint8)\n    g = np.zeros_like(mask).astype(np.uint8)\n    b = np.zeros_like(mask).astype(np.uint8)\n    # pick any of the available colours at random\n    c = colours[random.randrange(len(colours))]\n    r[mask == 1], g[mask == 1], b[mask == 1] = c\n    coloured_mask = np.stack([r, g, b], axis=2)\n    return coloured_mask, c\n\ndef main(args):\n    # build detector\n    detector = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval().to(args.device)\n    clip_bbox = ClipBBOX(clip_type=args.clip_type).to(args.device)\n    # add prompts\n    if args.text_prompt is not None:\n        for prompt in args.text_prompt.split(\",\"):\n            clip_bbox.add_prompt(text=prompt)\n   
 if args.image_prompt is not None:\n        image = cv2.cvtColor(cv2.imread(args.image_prompt), cv2.COLOR_BGR2RGB)\n        image = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0)\n        image = image / 255.0\n        clip_bbox.add_prompt(image=image)\n    image = cv2.imread(args.image)\n    pred = detector([\n        T.ToTensor()(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)).to(args.device)\n    ])\n    pred_score = list(pred[0]['scores'].detach().cpu().numpy())\n    pred_threshold = [pred_score.index(x) for x in pred_score if x \u003e args.confidence][-1]\n    boxes = [[int(b) for b in box] for box in list(pred[0]['boxes'].detach().cpu().numpy())][:pred_threshold + 1]\n    # squeeze only the channel dimension so a single detection keeps its leading axis\n    masks = (pred[0]['masks'] \u003e 0.5).squeeze(1).detach().cpu().numpy()[:pred_threshold + 1]\n    ranking = clip_bbox(image, boxes, top_k=args.top_k)\n    for key in ranking.keys():\n        if key == \"loss\":\n            continue\n        for box in ranking[key][\"ranking\"]:\n            mask, color = get_coloured_mask(masks[box[\"idx\"]])\n            image = cv2.addWeighted(image, 1, mask, 0.5, 0)\n            x1, y1, x2, y2 = box[\"rect\"]\n            cv2.rectangle(image, (x1, y1), (x2, y2), color, 6)\n            cv2.rectangle(image, (x1, y1), (x2, y1-100), color, -1)\n            cv2.putText(image, ranking[key][\"src\"], (x1 + 5, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 4, (0, 0, 0), thickness=5)\n    if args.output_image is None:\n        cv2.imshow(\"image\", image)\n        cv2.waitKey()\n    else:\n        cv2.imwrite(args.output_image, image)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-i\", \"--image\", type=str, help=\"Input image.\")\n    parser.add_argument(\"--device\", type=str, default=\"cuda:0\", help=\"inference device.\")\n    parser.add_argument(\"--confidence\", type=float, default=0.7, help=\"confidence threshold [MaskRCNN].\")\n    parser.add_argument(\"--text-prompt\", type=str, default=None, help=\"Text 
prompt.\")\n    parser.add_argument(\"--image-prompt\", type=str, default=None, help=\"Image prompt.\")\n    parser.add_argument(\"--clip-type\", type=str, default=\"clip_vit_b32\", help=\"Type of CLIP model [ruclip, clip_vit_b32, clip_vit_b16].\")\n    parser.add_argument(\"--top-k\", type=int, default=1, help=\"top_k predictions will be returned.\")\n    parser.add_argument(\"--output-image\", type=str, default=None, help=\"Output image name.\")\n    args = parser.parse_args()\n    main(args)\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbes-dev%2Fpytorch_clip_bbox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbes-dev%2Fpytorch_clip_bbox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbes-dev%2Fpytorch_clip_bbox/lists"}