https://github.com/remusao/privacy_bot
Privacy bot crawls the privacy policies of popular domains, persists them, and analyzes them.
- Host: GitHub
- URL: https://github.com/remusao/privacy_bot
- Owner: remusao
- Created: 2017-05-18T09:35:44.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2023-05-22T21:34:13.000Z (almost 2 years ago)
- Last Synced: 2025-02-28T23:09:59.684Z (3 months ago)
- Language: Python
- Homepage:
- Size: 14.9 MB
- Stars: 7
- Watchers: 5
- Forks: 6
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# privacy_bot
Privacy Bot aims to provide tools to collect, store, and analyze the privacy
policies of the most popular domains on the Internet.

You can find the current privacy policies in the `privacy_policies` folder.
## Ideas
* Some websites require JavaScript to access the privacy policy (headless
browser?).
* After we find the URL of the privacy policy for a given domain, we could
manually validate it and forbid privacy_bot from using any other URL for this
domain. Privacy bot could notify us if the validated URL is no longer valid.
* Some domains seem to have several pages related to privacy, we could collect
all of them.
* Some domains have URLs with randomly generated parts inside, which makes
the policy look like it was updated. We could strip these random parts before
saving the policy.
* Add even more domains.
* Make use of proxies to have an IP in the country we want.
* Improve parallelism (currently, requests for a given domain are sequential).
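
The idea of stripping randomly generated URL parts before saving a policy could be sketched as follows. This is only an illustration, not code from privacy_bot: the names `URL_PATTERN`, `strip_volatile_parts` and `normalize_policy` are hypothetical, and the sketch assumes that the random parts live in the query string or fragment of URLs embedded in the policy text. Random tokens inside the path would need a different heuristic.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical pattern: an http(s) URL running up to whitespace, a quote,
# an angle bracket, or a closing parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def strip_volatile_parts(url: str) -> str:
    """Drop the query string and fragment of a URL.

    Assumption: session IDs and other random tokens appear there.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def normalize_policy(text: str) -> str:
    """Rewrite every URL found in the policy text to its stable form."""
    return URL_PATTERN.sub(lambda m: strip_volatile_parts(m.group(0)), text)
```

With this normalization, two fetches of the same policy that differ only in a token such as `?sid=...` would be saved as identical text, so the policy would no longer appear to have changed.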