{"id":13540306,"url":"https://github.com/essandess/isp-data-pollution","last_synced_at":"2025-12-29T23:37:58.689Z","repository":{"id":19070014,"uuid":"86130996","full_name":"essandess/isp-data-pollution","owner":"essandess","description":"ISP Data Pollution to Protect Private Browsing History with Obfuscation","archived":false,"fork":false,"pushed_at":"2023-03-20T20:11:47.000Z","size":1249,"stargazers_count":591,"open_issues_count":6,"forks_count":53,"subscribers_count":40,"default_branch":"master","last_synced_at":"2024-11-03T05:32:41.011Z","etag":null,"topics":["crawling","data","data-analytics","obfuscation","privacy","privacy-enhancing-technologies","web"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/essandess.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-25T03:57:18.000Z","updated_at":"2024-10-25T00:04:36.000Z","dependencies_parsed_at":"2024-01-19T05:52:37.651Z","dependency_job_id":"287f66b2-a1dd-4fbd-8eb5-4b91799d2b92","html_url":"https://github.com/essandess/isp-data-pollution","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/essandess%2Fisp-data-pollution","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/essandess%2Fisp-data-pollution/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/essandess%2Fisp-data-pollution/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/essandess%2Fisp-data-po
llution/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/essandess","download_url":"https://codeload.github.com/essandess/isp-data-pollution/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246774248,"owners_count":20831497,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","data","data-analytics","obfuscation","privacy","privacy-enhancing-technologies","web"],"created_at":"2024-08-01T09:01:46.177Z","updated_at":"2025-12-29T23:37:58.649Z","avatar_url":"https://github.com/essandess.png","language":"Python","funding_links":[],"categories":["Python","Python (1887)","\u003ca id=\"783f861b9f822127dba99acb55687cbb\"\u003e\u003c/a\u003e工具"],"sub_categories":["\u003ca id=\"85bb0c28850ffa2b4fd44f70816db306\"\u003e\u003c/a\u003e混淆器\u0026\u0026Obfuscate"],"readme":"# ISP Data Pollution\n\nCongress's party-line vote will allow ISP's to exploit your family's private data without your consent. See \"**[Senate Puts ISP Profits Over Your Privacy](https://www.eff.org/deeplinks/2017/03/senate-puts-isp-profits-over-your-privacy)**\".\n\nThis script is designed to defeat this violation by generating large amounts of realistic, random web browsing to pollute ISP data and render it effectively useless by obfuscating actual browsing data.\n\nI pay my ISP a lot for data usage every month. I typically don't use all the bandwidth that I pay for. If my ISP is going to sell private browsing habits, then I'm going to pollute browsing with noise and use all the bandwidth that I pay for. 
This method accomplishes this.\n\nIf everyone uses all the data they've paid for to pollute their browsing history, then perhaps ISPs will reconsider the business model of selling customers' private browsing history.\n\nThe [alternative](https://arstechnica.com/information-technology/2017/03/how-isps-can-sell-your-web-history-and-how-to-stop-them/) of using a VPN or Tor merely pushes the issue onto the choice of VPN provider, complicates networking, and adds the real issue of navigating captchas when appearing as a Tor exit node. Also, merely encrypted traffic has too much [exploitable side-channel information](https://www.theatlantic.com/technology/archive/2017/03/encryption-wont-stop-your-internet-provider-from-spying-on-you/521208/), and could still be used to determine when specific family members are at home, and the activities in which they're engaged.\n\nThis crawler uses [chromedriver](http://chromedriver.chromium.org) with the Python selenium library, uses blacklists for undesirable websites (see the code for details), does not download images, and respects robots.txt, which all provide good security.\n\n# Command Line\n\n```\npython3 isp_data_pollution.py\npython3 isp_data_pollution.py --help\npython3 isp_data_pollution.py -bw 1024  # 1 TB per month\npython3 isp_data_pollution.py -g        # print debugging statements\n```\n\n# Motivation for Efficacy\n\nThe approach used in this script is susceptible to both statistical attack and traffic anomalies. Jon Brodkin's [article](https://arstechnica.com/information-technology/2017/04/after-vote-to-kill-privacy-rules-users-try-to-pollute-their-web-history/) on privacy through noise injection covers several valid critiques: the approach is not guaranteed to obfuscate sensitive private information, and even if it does work initially, it may not scale. 
Known flaws and suggestions for improvements are welcomed in the [Issues](../../Issues) pages.\n\nHowever, there are good information theoretic and probabilistic reasons to suggest an approach like this could work in many practical situations. Privacy through obfuscation has been used in many contexts. In the data sciences, Rubin proposed a statistically sound method to preserve subject confidentiality by masking private data with synthetic data (\"[Statistical Disclosure Limitation](http://www.jos.nu/Articles/abstract.asp?article=92461)\", *JOS* **9**(2):461–468, 1993). In a nice paper relevant to this repo, Ye et al. describe a client-side privacy model that uses noise injection (\"[Noise Injection for Search Privacy Protection](http://web.cs.ucdavis.edu/~hchen/paper/passat2009.pdf)\", *Proc. 2009 Intl. Conf. CSE*).\n\nHere are two back-of-the-envelope arguments for the efficacy of this approach in the case of ISP privacy intrusion. These are not proofs, but simple models that suggest some optimism is warranted. Actual efficacy must be determined by testing these models in the real world.\n\n## Information Theoretic Argument\n\nYe et al.'s approach attempts to minimize the mutual information between user data and user data with injected noise presented to a server. Mutual information is the overlap between the entropy of the user data, and the entropy of the user data with injected noise (purple area below). The amount and distribution of injected noise is selected to make this mutual information as small as possible, thus making it difficult to exploit user data on the server side.\n\n![Mutual Information](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/256px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png)\n\nThe example in Ye et al.'s paper is specific search queries. The analogy in this repo is specific domains. 
Domain information is the primary data leaked to ISPs if encrypted HTTPS is used, and is therefore relevant. The case of unencrypted traffic with explicit query terms and content is discussed in the next section on maximum likelihood.\n\nYe et al. show that the mutual information vanishes if:\n\n\u003e Number of noise calls ≥ (Number of user calls - 1) × Number of possible calls\n\nFor this application, the number of possible calls is the number of domains that a user might visit (per day), and the number of calls is the number of visits made. Nielsen [reported](http://www.nielsen.com/us/en/insights/news/2010/nielsen-provides-topline-u-s-web-data-for-march-2010.html) in 2010 that the average person visits 89 domains per month. To be extremely conservative in (over)estimating the number of noise calls necessary to obscure this browsing data, assume that the average user visits *O*(100) domains per day, with *O*(200) user requests per day, or about one every five minutes over a long day. \n\nThe equation above asserts that (200-1)×100 or about twenty thousand (20,000) noise calls are required to achieve zero mutual information between user data and the user plus noise data.\n\nThis amounts to one noise call about every five seconds, which is very easy to achieve in practice, and easily falls within a nominal bandwidth limit of 50 GB per month.\n\nIf Ye et al.'s client-side information theoretic model is valid in practice, then it is reasonable to expect that the parameters chosen in this script would be able to greatly reduce or eliminate the mutual information between actual user domain data and the domain data presented to the ISP.\n\nFurthermore, fewer noise calls may be used if a dependency model is introduced between the user and noise distributions.\n\n## Maximum Likelihood Argument\n\nUnencrypted HTTP calls leak highly specific user data to the ISP. 
Targeted advertising methods use this captured data to classify the user and serve tailored advertising based upon the user's category. Probabilistically, this approach inherently depends upon finding specific \"peaks\" in a user's query distribution, then using these peaks to find the most likely consumer categories for the user. Injecting a large number of uncorrelated (or better, anti-correlated) calls may hinder the maximum-likelihood approach used to classify the user because it adds many more peaks throughout the measured distribution of user interests.\n\nFurthermore, the advertiser's transmission bandwidth is highly constrained—only so many ads will fit on a web page. Adding uncorrelated noise calls complicates the problem of selecting the appropriate ad.\n\n# Known Issues\n\n## Critiques of Data Pollution\n\nBoth Kaveh Waddell's and Jon Brodkin's excellent articles on ISP privacy in *[The Atlantic](https://www.theatlantic.com/technology/archive/2017/04/hiding-the-signal-in-the-noise/522564/)* and *[Ars Technica](https://arstechnica.com/information-technology/2017/04/after-vote-to-kill-privacy-rules-users-try-to-pollute-their-web-history/)* address important critiques of this approach. These are summarized here along with a response both so that users are aware of these issues, and to prompt suggestions to address them.\n\n- **“Masking a person’s browsing history by layering in copies of other people’s browsing patterns might be more useful. … ‘It would be a Tor-like system where anonymity comes through shared usage.’”** [[Bruce Schneier](https://www.schneier.com)]\n  - Comment 1: It is possible to mask privacy with statistical methods (Rubin, op. cit.; Ye et al., op. cit.)\n  - Comment 2: A Tor- or I2P-like routing system would be preferable if a good solution to the Tor exit-node problem is found. 
A sample crawl illustrates that creating self-generated pollution is much, much safer than running a Tor (or Tor-like) exit node that allows anyone to send open requests from a personal IP address.\n\n- **“[Do not underestimate] internet providers’ ability … to see through data-obfuscation tactics.”** [[Bruce Schneier](https://www.schneier.com)]\n  - Comment: The bandwidth parameters in this repo are chosen with a specific information theoretic model in mind that, if correct, eliminates the mutual information between user domain data and polluted data presented to the ISP. No mutual information means no big data exploitation opportunity. This is an area where more research is required because flaws/imperfections in the obfuscation method will leak information. Sufficient quantities of correctly chosen noise make big data approaches significantly more challenging. This is a hypothesis that remains to be tested in this context.\n\n- **“Random Google searches could send the program down a dark rabbit hole, without the user’s knowledge.”** [[Kenn White](https://twitter.com/kennwhite)]\n  - Comment 1: This is a possibility. It is mitigated by (1) using Google safe searches; (2) an in-memory blacklist; (3) no image downloads. Based on this critique, the explicit parameter `safe=active` is added to search queries.\n  - Comment 2: Tor exit-node traffic almost certainly contains such traffic, which is an important issue for exit-node operators. In contrast, self-generated noise is likely to be—and in practice appears to be—much safer.\n  - Comment 3: This potential problem has not yet been observed or reported. Reports of such problems or suggestions to further mitigate them are welcomed in the repo's [Issues](../../Issues).\n  \n- **“Some information is sensitive even if it's surrounded by noise. … Imagine if hackers targeted your ISP, your browsing history was leaked, and it showed you visiting specific controversial websites. 
… Even if that was surrounded by noise, it would be very hard to get the sort of noise that would give you plausible deniability.”** [[Jeremy Gillula](https://www.eff.org/about/staff/jeremy-gillula)]\n  - Comment 1: This is correct. Obfuscation is a statistical approach that cannot conceal highly specific, personal, sensitive data, and would not offer plausible deniability.\n  - Comment 2: This is also a potential issue for VPN users.\n\n## Known Limitations of Data Pollution\n\nAnalysis of other data obfuscation approaches shows susceptibility to off-the-shelf machine learning classifier attacks: Peddinti and Saxena demonstrated meaningful user classification with the TrackMeNot browser plugin intended to defeat an adversarial search engine (\"[On the Privacy of Web Search Based on Query Obfuscation: A Case Study of TrackMeNot](http://link.springer.com/chapter/10.1007/978-3-642-14527-8_2)\", in *Proc. PETS2010*, 2010). The adversarial model and training methods used in this analysis are not directly applicable to the case of ISP intermediaries. Key features of Peddinti and Saxena's attack are:\n\n- \"We set out to investigate whether it is still possible (and to what extent) for an adversarial search engine–equipped with users’ search histories—to filter out TMN queries using off-the-shelf machine learning classifiers and thus undermine the privacy guarantees provided by TMN.\"\n- \"The problem considered in this paper is different from the problem of identifying queries from an anonymized search log. First, an adversary in our application is the search engine itself and not a third party attempting to de-anonymize a search log. Second, unlike a third party, the search engine is already in possession of users’ search history using which it can effectively train a classifier.\"\n- \"In our *adversarial model*, we assumed that the search engine is adversarial and its goal is to distinguish between TMN and user queries for profiling and aggregation purposes. 
We also assumed that the engine would have access to user’s search histories for a certain duration until the point the user starts using the TMN software.\"\n- \"**Classification Algorithms.** Since clustering with default parameters performed poorly, we decided to work with supervised/classification algorithms which are trained on prior labeled data.\"\n- Amount of noise data comparable to amount of user data because of search engine API limits.\n\nNone of these attack features are necessarily applicable to an ISP adversarial model. It is possible that an ISP could use historical unpolluted user data to train a classifier; however, this presumes that users' interests, numbers, and identities at an account IP address do not change from month to month, an unlikely event for most users and households. Without uncorrupted user data to train on, this paper illustrates the difficulties of third-party de-anonymization even with limited quantities of noise. It would be useful to quantify classification performance with and without the ability to train with uncorrupted user data. Knowing the answer for both cases would point to potential improvements in the obfuscation approach.\n\n# Privatizing Proxy Filter with VPN Access\n\nData pollution is one component of privatizing your personal data. Install the [EFF](../../../../EFForg)'s [HTTPS Everywhere](https://www.eff.org/https-everywhere) and [Privacy Badger](https://www.eff.org/privacybadger) on **all** browsers. Also see the repos [osxfortress](../../../osxfortress) and [osx-openvpn-server](../../../osx-openvpn-server) to block advertising, trackers, and malware across devices.\n\nUsing a [privatizing proxy](../../../osxfortress) to pool your own personal traffic with the data pollution traffic adds another layer of obfuscation with header traffic control. 
HTTP headers from the polluted traffic appear as:\n\n```\nGET /products/mens-suits.jsp HTTP/1.1\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko\nAccept-Encoding: gzip, deflate\nAccept-Language: en-US,*\nHost: www.bananarepublic.com\nConnection: keep-alive\n```\n\n# Example crawl\n\nAfter a while, the output of random crawling looks like this:\n\n```\nThis is ISP Data Pollution 🐙💨, Version 1.1\nDownloading the blacklist… done.\nDisplay format:\nDownloading: website.com; NNNNN links [in library], H(domain)= B bits [entropy]\nDownloaded:  website.com: +LLL/NNNNN links [added], H(domain)= B bits [entropy]\n\nhttp://eponymousflower.blogspot.com/2017/02/lu…: +6/32349 links, H(domain)=6.8 b\n```\n\nThe screenshot of a randomly crawled web page looks like this. Note that there are no downloaded images.\n\n`driver.get_screenshot_as_file('his_all_time_greatest_hits.png')`:\n\n![His All Time Greatest Hits](his_all_time_greatest_hits.png)\n\n# Installation\n\nDepending upon your Python (v. 3) installation, the module dependencies are `numpy`, `requests`, `selenium`, and `fake_useragent`, as well as `chromedriver`. How you install these depends upon your OS.\n\nThis involves choosing a Python (v. 3) package manager, typically `pip` or `Anaconda`.\n\nI like `pip`, so on my machines I would say:\n\n```\nsudo pip-3.7 install numpy requests selenium fake_useragent pyopenssl\n```\n\n## ChromeDriver\n\nIt is recommended that the `chromedriver` binary be installed directly from [chromedriver.chromium.org](http://chromedriver.chromium.org/downloads). 
Be sure to verify the [Etag](https://chromedriver.storage.googleapis.com/index.html?path=2.42/) of the downloaded installation.\n\n## macOS\n\nThe [MacPorts](https://www.macports.org) install command is:\n\n```\nsudo port install chromedriver py37-numpy py37-requests py37-psutil py37-openssl psutil\n```\n\nThis is what was also necessary on macOS:\n\n```\nsudo port install chromedriver\nsudo -H pip-3.7 install selenium fake_useragent\n\n# if chromedriver fails to install because of an Xcode configuration error: test with\n/usr/bin/xcrun -find xcrun\n# then do this:\ncd /Applications/Xcode.app/Contents/Developer/usr/bin/\nsudo ln -s xcodebuild xcrun\n```\n\n[Homebrew](../../../../Homebrew/brew) is another good option.\n\n## Linux\n\n### CentOS\n\n```\nsudo yum -y install https://centos7.iuscommunity.org/ius-release.rpm\nsudo yum -y groupinstall development\nsudo yum -y install python34 python34-pip python34-devel python34-pyflakes openssl-devel\nsudo pip3 install --upgrade pip\nsudo pip3 install numpy psutil requests selenium fake_useragent pyopenssl\n```\n\n### Ubuntu16\n\n```\nsudo apt-get install git\ngit clone https://github.com/essandess/isp-data-pollution.git\ncd isp-data-pollution/\nsudo apt install python3-pip\npip3 install --upgrade pip \npip3 install numpy\npip3 install psutil\nsudo -H pip3 install psutil --upgrade\nsudo -H pip3 install --upgrade pip\nsudo -H pip3 install selenium\nsudo -H pip3 install fake_useragent\nsudo -H pip3 install pyopenssl\nsudo apt-get install fontconfig\nsudo apt-get install libfontconfig\nsudo apt-get install build-essential chrpath libssl-dev libxft-dev\nsudo apt-get install libfreetype6 libfreetype6-dev\nsudo apt-get install libfontconfig1 libfontconfig1-dev\n\n#! 
Please update these commands for chromedriver\n# export PHANTOM_JS=\"phantomjs-2.1.1-linux-x86_64\"\n# sudo mv $PHANTOM_JS /usr/local/share\nls /usr/local/share\n# sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin\n# phantomjs --version\npython3 isp_data_pollution.py\n```\n\nIf you are behind a firewall, use `sudo -EH` to inherit `http_proxy` environment settings.\n\n## Headless\n\n`chromedriver` requires some graphical software, virtual or otherwise, so on a headless computer, you'll need the following system package and local package.\n\nIf you're not using virtualenv (below) then run pip as sudo.\n\n```\nsudo apt-get install xvfb\npip install pyvirtualdisplay\n```\n\n### Installation through virtualenv\n\nIn order to isolate pip library files, virtualenv is convenient. If you prefer this method, you can follow the steps below:\n```\npushd ~/.virtualenvs/ \u0026\u0026 virtualenv -p python3 isp-pollute \u0026\u0026 popd\nworkon isp-pollute\npip install numpy requests selenium fake_useragent psutil\nsudo apt-get install chromedriver\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fessandess%2Fisp-data-pollution","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fessandess%2Fisp-data-pollution","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fessandess%2Fisp-data-pollution/lists"}