{"id":16119389,"url":"https://github.com/codingchili/tls-privacy","last_synced_at":"2025-12-30T22:06:55.564Z","repository":{"id":87434593,"uuid":"475006483","full_name":"codingchili/tls-privacy","owner":"codingchili","description":"ML with sklearn/pandas in Python to identify page loads for specific websites when TLS is used. Uses NodeJS/Puppeteer for traffic generation.","archived":false,"fork":false,"pushed_at":"2022-07-13T19:41:31.000Z","size":5798,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-22T21:29:59.792Z","etag":null,"topics":["machine-learning","nodejs","pandas","puppeteer","python","scapy","security","security-testing"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codingchili.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-28T13:08:15.000Z","updated_at":"2022-07-13T19:43:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"8443f59b-17d8-43c7-a45e-2803d4702b35","html_url":"https://github.com/codingchili/tls-privacy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/codingchili/tls-privacy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingchili%2Ftls-privacy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingchili%2Ftls-privacy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingchili%2Ftls-privacy/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingchili%2Ftls-privacy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codingchili","download_url":"https://codeload.github.com/codingchili/tls-privacy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingchili%2Ftls-privacy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28132996,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-30T02:00:05.476Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","nodejs","pandas","puppeteer","python","scapy","security","security-testing"],"created_at":"2024-10-09T20:54:01.367Z","updated_at":"2025-12-30T22:06:55.534Z","avatar_url":"https://github.com/codingchili.png","language":"JavaScript","readme":"# TLS Privacy vs. Payload Length\n\nThe project has two parts: the analyzer and the generator. The generator generates network traffic\nby loading web pages, while the analyzer captures the network traffic, labels datasets and trains the model. The analyzer is implemented in Python (3.8+) and\nthe generator in Node.js (16+); both are required to run the project. 
See the documentation for each part\non how to install application dependencies.\n\n\u003cp align=\"center\"\u003e\n  \n  \u003cimg src=\"https://thumbs.gfycat.com/ThickDeterminedFunnelweaverspider-size_restricted.gif\" width=412\u003e\n\u003c/p\u003e\n\nThe goal of the project is to build machine learning models usable on TLS-encrypted websites. The model\nshould be able to distinguish pages within a site to break privacy and infer information about the user. As an example,\na user browses a health information site and reads about rare disease X - the model should then be able\nto classify the flow and match it to the webpage describing disease X. \n\nDistinguishing websites from each other is not a primary interest, because this is much easier, already leaks\nin multiple ways and doesn't break privacy to the same degree. As a result of the analysis, certain key characteristics\nthat make sites easy to analyze are identified.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./data/plots/sample/plot.png\" width=512\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  Early sample of a dataset for one site with ten pages, plotting page load size against time.\n\u003c/p\u003e\n  \n### Limitations\n\nThis is the author's first machine learning project, so some information might be incorrect, the approach inefficient, etc. The \nproject is nevertheless published here. 
To give the project a chance of success, the scope is limited as follows:\n\n* Analyzes a single ciphersuite per site.\n* Evaluates a limited number of features.\n* Uses limited/ineffective/simple learning algorithms.\n* Latency/jitter is not considered; LAN analysis only.\n* Flows are generated by a single client (the generator).\n* The model is only trained for a single browser/version, Chromium.\n\n### Ethical \u0026 Legal considerations\n\n* Ensure that user consent has been explicitly given when capturing live data.\n* Prefer to mirror websites and serve them locally on a LAN for training.\n* Ensure a proper delay between site loads when testing against online sites.\n* Limit the number of requests that are performed against online sites.\n\n## Traffic analyzer\n\nImplements the following features:\n\n* collects network traffic based on a given filter.\n* communicates with the generator to label traffic.\n* saves/loads traffic datasets as JSON.\n* generates dataset visualisations.\n* trains the machine learning model. 
\n* performs classification of traffic as it is analyzed.\n* forwards information about identified pages in realtime.\n\n### Requirements\nTo install requirements, run `pip install -r requirements.txt`.\n\nThe analyzer uses `asyncio`, `pandas` and `scapy`.\n\n```\n$ ./analyze -h\nusage: -c [-h] {list,sniff,plot,learn,monitor} ...\n\noptional arguments:\n  -h, --help            show this help message and exit\n\nTraffic analyzer:\n  Available commands\n\n  {list,sniff,plot,learn,monitor}\n    list                lists the available interfaces\n    sniff               capture network data to create data sets.\n    plot                create plots of the given data set.\n    learn               train a new model using the given data set.\n    monitor             monitor traffic using the given model.\n```\n\nrun example,\n\n```\n./analyze sniff --ip 192.168.0.149 --ports 443 --dump dataset_1 eno1\n```\n\noutput sample,\n\n```\n2022-05-04 11:35:29,357 started capture on 'eno1'\n2022-05-04 11:35:29,357 using filter 'ip and host 192.168.0.149 and port (443)'..\n2022-05-04 11:35:29,359 listening on '127.0.0.1:9555' and publishing on '224.0.0.14:9555'.\n2022-03-29 09:48:52,050 capture in progress [packets = 0]\n2022-03-29 09:42:53,815 sniffer collecting by label 'https://192.168.0.114/login' ..\n2022-03-29 09:42:53,999 capture in progress [packets = 167]\n2022-03-29 09:42:54,512 capture in progress [packets = 192]\n```\n\n## Traffic generator\nImplements the following features:\n\n* generates network traffic by loading webpages using puppeteer/chrome.\n* notifies the analyzer which site is being loaded for labeling.\n* creates clones of websites for testing locally.\n* includes an mDNS beacon for hostname simulation.\n* supports realtime monitoring through a live Chromium browser.\n\n### Requirements\nTo install dependencies, run `npm install` in `./generator`.\n\n```\n$ ./generate -h\nusage: index.js [-v] {forge,serve,sites,site,beacon,monitor} ...\n\nDataset 
generator and real-time monitoring of analyzer.\n\noptional arguments:\n  -v, --version\n\nAvailable commands:\n  {forge,serve,sites,site,beacon,monitor}\n    forge               Create a static copy of a remote website for the webserver\n    serve               Serve static websites that the generator can target\n    sites               list available sites.\n    site                Generate web traffic for the analyzers sniffer module\n    beacon              Multicast DNS beacon for hostname simulation\n    monitor             monitor replay, requires a running analyzer.\n```\n\nrun example,\n\n```\n./generate site testsite -d 0 -n 128 -c\n```\n\noutput sample,\n\n```\n2022-05-04 11:37:47 [INFO   ] generating data for 1 site(s).\n2022-05-04 11:37:47 [INFO   ] starting generator with 128 load(s) per page and delay 0s.\n2022-05-04 11:37:47 [INFO   ] notifier listening on '0.0.0.0:53330'.\n2022-05-04 11:37:47 [INFO   ] cache is enabled.\n[###########################] initializing generator.. [100%]\n[########                   ] requests 24%, [360/1500] testsite (/256k)\n```\n\n## Tutorial\nStep-by-step tutorial to get started.\n\n1. choose a target website and clone it. The `-f` flag specifies the number of links to follow. It's also possible to run the forge command\nmultiple times with different urls for the same site to manually specify urls to clone. The `--missing` flag attempts to generate a 404\npage which is then used by the webserver. `--favicon` attempts to explicitly request favicon.ico, as this isn't done when running\nChromium headless; this results in one extra request, though. \n\n```bash\n./generator forge https://example.com/ -f 5 -o example.com --missing --favicon\n```\n\n2. create a self-signed certificate; from `generator/server/keys/`, run the following. 
\n\n```bash\n./mkcert\n```\n\nCertificate options can be configured in `generator/server/keys/request.ext`; for more information on using certificates,\nplease see generator/browser/server/keys/certificates.md.\n\nFor the DNS names set in `request.ext`, ensure the hostname matches the target. This can be done through `/etc/hosts` or with the mDNS beacon in step 4.\n\n3. start the local webserver.\n\n```bash\n./generator serve -t -h2 -c br -p 443 example.com\n```\n\n4. start the mDNS beacon for hostname verification (alternatively, edit `/etc/hosts`).\n\n```bash\n./generator beacon \u003cserver-ip\u003e example\n```\n\n5. start the analyzer in sniff mode and dump to dataset 'example'.\n\n```bash\n./analyze list # list the available interfaces\n./analyze sniff --ip \u003cserver-ip\u003e --ports 443 --dump example \u003cinterface\u003e\n```\n\n6. run the traffic generator to generate browser traffic and label it.\n\na) create the site template in `generator/browser/sites/example.js`, which lists the urls to navigate.\n\n```javascript\nimport {Site} from '../site.js';\n\nexport default class ExampleSite extends Site {\n\n    constructor(browser) {\n        super(browser, 'https://example/');\n    }\n\n    static pages() {\n        return [\n            '/1k-file.html',\n            '/2k-file.html',\n            '/4k-file.html',\n            '/8k-file.html',\n            '/16k-file.html'\n        ];\n    }\n}\n```\n\nb) check that the created template is listed,\n\n```bash\n./generator list\n# $ [..., example]\n```\n\nc) reference the created template file when running the site command.\n\n```bash\n./generator site -d 0 -n 128 -c example\n```\n\nThis will generate traffic for all listed urls; each url will be loaded n times.\nBefore loading each page the analyzer will be notified so that requests can be labeled. When complete, the analyzer\ndumps the collected network traffic into a JSON dataset.\n\n7. 
plot charts from the generated dataset.\n\n```bash\n./analyze plot example\n```\n\nThis will create plots in `data/plots/example`.\n\n8. use the generated dataset for machine learning.\n\n```bash\n./analyze learn example\n```\n\nUse `./analyze learn -h` to find more options; this will create a model in `data/models/example.bin`.\n\n9. run the analyzer in monitor mode.\n\n```bash\n./analyze monitor --ip \u003cserver-ip\u003e --ports 443 \u003cinterface\u003e example\n```\n\nThis will sniff traffic and attempt to classify it using the given model. Again, try `-h` for more options.\n\n10. manually generate website traffic to be classified using the generator's browser.\n\n```bash\n./generator browser\n```\n\nThis starts the Chromium browser using the same environment that was used to generate traffic.\n\nIf successful, the following should appear in the log,\n\n```\n2022-05-04 12:04:45,198 started capture on '\u003cinterface\u003e'\n2022-05-04 12:04:45,198 using filter 'ip and host \u003cserver-ip\u003e and port (443)'..\n2022-05-04 12:04:45,199 loaded model from 'data/models/example.bin'.\n...\n2022-05-04 12:04:49,630 analyzing x1 page loads..\ntime                n/a\nin                 1878\nout                  36\npackets              27\nlabel      monitor-mode\nName: 0, dtype: object\n2022-05-04 12:04:49,638 match example/4k-file.html with accuracy 100.00% in 1.03ms.\n...\n```\n\n11. view the identified pages in a realtime browser mirror.\n\n```bash\n./generate browser --live\n```\n\nNote that the traffic generated by the mirrored browser should not be captured by the analyzer; this would\nresult in an infinite loop. 
Use a proxy, a filter, or a local server.\n\nFor tuning, list the available options using the `-h` flag or see the source code.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodingchili%2Ftls-privacy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodingchili%2Ftls-privacy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodingchili%2Ftls-privacy/lists"}