{"id":20550576,"url":"https://github.com/manmolecular/http-response-clustering","last_synced_at":"2026-04-29T13:31:01.623Z","repository":{"id":48523099,"uuid":"249135792","full_name":"manmolecular/http-response-clustering","owner":"manmolecular","description":":chart_with_downwards_trend: Clustering of HTTP responses using k-means++ and the elbow method ","archived":false,"fork":false,"pushed_at":"2021-07-21T18:14:23.000Z","size":982,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-06T05:44:33.934Z","etag":null,"topics":["data-analysis","elbow-method","elbow-plot","jupyter","k-means-plus-plus","python3"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manmolecular.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-22T07:41:09.000Z","updated_at":"2020-11-19T03:34:29.000Z","dependencies_parsed_at":"2022-08-31T17:21:02.704Z","dependency_job_id":null,"html_url":"https://github.com/manmolecular/http-response-clustering","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/manmolecular/http-response-clustering","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manmolecular%2Fhttp-response-clustering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manmolecular%2Fhttp-response-clustering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manmolecular%2Fhttp-response-clustering/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/manmolecular%2Fhttp-response-clustering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manmolecular","download_url":"https://codeload.github.com/manmolecular/http-response-clustering/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manmolecular%2Fhttp-response-clustering/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260288461,"owners_count":22986665,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","elbow-method","elbow-plot","jupyter","k-means-plus-plus","python3"],"created_at":"2024-11-16T02:26:02.939Z","updated_at":"2026-04-29T13:30:56.588Z","avatar_url":"https://github.com/manmolecular.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# http-response-clustering\n[![Required OS](https://img.shields.io/badge/OS-Linux%20based-blue)](https://en.wikipedia.org/wiki/Linux)\n[![Python3 Version](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org/downloads/)\n[![Issues](https://img.shields.io/github/issues/manmolecular/http-response-clustering)](https://github.com/manmolecular/http-response-clustering/issues)\n[![Pull Requests](https://img.shields.io/github/issues-pr/manmolecular/http-response-clustering)](https://github.com/manmolecular/http-response-clustering/pulls)\n[![Last Commits](https://img.shields.io/github/last-commit/manmolecular/http-response-clustering)](https://github.com/manmolecular/http-response-clustering/commits/master)  
\n  \n:chart_with_downwards_trend: Clustering of HTTP responses using k-means++ and the elbow method (PoC)\n\n## Screenshot\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/manmolecular/http-response-clustering/master/assets/screenshot-1.png\"\u003e\n  \u003cp align=\"center\"\u003e\u003ci\u003eA basic example of dividing HTTP responses into clusters\u003c/i\u003e\u003c/p\u003e\n\u003c/div\u003e \n\n## Contents\n1. [Disclaimer](#disclaimer)\n1. [Description](#description)\n1. [Examples](#examples)\n1. [Requirements](#requirements)\n1. [Installation](#installation)\n1. [Determine the number of clusters](#determine-the-number-of-clusters)\n1. [Provide your data into clusters](#provide-your-data-into-clusters)\n1. [Format of the HTTP responses to analyze](#format-of-the-http-responses-to-analyze)\n\n\n## Disclaimer\n:construction: This work is an early proof of concept and still a work in progress, so feel free to open issues, feature requests, and bug reports. Thanks.\n\n## Description\nThis module, based on the SciPy (SciKit) implementation of the K-means (to be more precise, K-means++) clustering method, allows you to sort HTTP responses from various HTTP hosts into clusters that are distinguished by several unique features. \n  \nThis can be useful when, for example, you cannot process a large amount of data gathered from various network scanners (for example, Nmap Network Scanner, Shodan database, Censys database, etc.) or collected yourself, because the volume is too large and the responses look almost the same. 
\n  \nIn this case, you can try to divide your data into small subgroups, called clusters, to find the unique features of each group.\n\n## Examples\nSee the [interactive_cluster.ipynb](/interactive_cluster.ipynb) notebook.\n\n## Requirements\n### Run in docker\n- Docker\n- docker-compose\n### Run on host directly\n- Python 3.7+\n- pip3\n\n## Installation\n### Run in docker\n```\ndocker-compose up\n```\nThe Jupyter notebook will be running at `http://localhost:8888/\u003cyour-token-etc\u003e`; follow the instructions in your terminal logs.\n  \n**Note:** this repository contains 2 Dockerfiles: one is based on the default Debian Python image and the other on the Alpine Python image. You can choose the variant you prefer by modifying the `docker-compose.yml` file:\nreplace the `docker/debian/Dockerfile` string in the `docker-compose.yml` file with `docker/alpine/Dockerfile` if you want to build the image on the Alpine base. \n  \nBy default, the Debian Python image will be built.\n### Run on host\nInstall the requirements and you are ready to go:\n```\npip3 install -r requirements.txt\n```\nAfter that, you can either open the `interactive_cluster.ipynb` notebook in Jupyter Notebook:\n```\njupyter notebook\n```\nor run the module directly with `clustering.py`:\n```\npython3 clustering.py\n```\n  \n## Determine the number of clusters\nOne of the problems here is deciding how many clusters we can get from our results: one, three, five, or maybe even ten? When we know the number in advance (for example, we have 1,000 hosts, half of them Apache servers and the other half Nginx servers, and we only need to divide these hosts into 2 clusters), it is easy: we just set the number of clusters to 2. But when we don't know the number exactly, we need an additional method to estimate an effective number of clusters. 
To do this, we use the \"Elbow Method\", a heuristic method of data validation and analysis that helps to find the right number of clusters.\n  \nTo see this method in action, you can play around with the `interactive_cluster.ipynb` notebook.\n\n## Provide your data into clusters\nTo try these methods with your own data, add your collected HTTP responses in the format used in the \"data\" directory. \n\nIn short, the format must be the following:\n```json\n{\n    \"my-class-1\":[\n        \"response-1\",\n        \"response-2\",\n        \"response-3\"\n    ],\n    \"my-class-2\":[\n        \"response-1\",\n        \"response-2\"\n    ]\n}\n```\nThe `\"my-class-1\"` key can be anything you want if you don't have any predictions:\n```json\n{\n    \"any\":[\n        \"response-1\",\n        \"response-2\",\n        \"response-3\",\n        \"response-4\"\n    ]\n}\n```\nCreate a JSON file with your data in the format shown above, give it a name (for example, `\"hosts_example.json\"`), and pass the file name to the files handler:\n```python3\nraw_results = FilesHandler().open_results(filename=\"data/hosts_example.json\")\n```\n\n## Format of the HTTP responses to analyze\nEach HTTP response must be a raw string with all the special characters preserved, for example as returned by cURL:\n```\nHTTP/1.1 200 OK\\r\\nkbn-name: kibana\\r\\nkbn-version: 6.3.2\\r\\nkbn-xpack-sig: 242104651e8721ad01d1a77b0e87738e\\r\\ncache-control: no-cache\\r\\ncontent-type: text/html; charset=utf-8\\r\\ncontent-length: 217\\r\\naccept-ranges: bytes\\r\\nvary: accept-encoding\\r\\nDate: Thu, 19 Mar 2020 15:30:33 GMT\\r\\nConnection: 
keep-alive\\r\\n\\r\\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanmolecular%2Fhttp-response-clustering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanmolecular%2Fhttp-response-clustering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanmolecular%2Fhttp-response-clustering/lists"}