https://github.com/xinzhang-chen/tos-crawl
ToS-Crawl is a stealth-enabled crawler that extracts and recursively collects Terms of Service (ToS), Privacy Policies, and other legal agreements from major websites. It is designed for academic research, NLP dataset creation, and comparative policy analysis.
open-source-tooling terms-of-service web-scraping
- Host: GitHub
- URL: https://github.com/xinzhang-chen/tos-crawl
- Owner: Xinzhang-Chen
- License: agpl-3.0
- Created: 2025-03-28T10:02:10.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-30T20:27:31.000Z (about 2 months ago)
- Last Synced: 2025-05-05T09:10:02.412Z (24 days ago)
- Topics: open-source-tooling, terms-of-service, web-scraping
- Language: JavaScript
- Homepage:
- Size: 754 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ToS-Crawl: Terms of Service Crawler
**ToS-Crawl** is a stealth-enabled crawler that extracts and recursively collects **Terms of Service (ToS)**, **Privacy Policies**, and other legal agreements from major websites. It is designed for **academic research**, **NLP dataset creation**, and **comparative policy analysis**.
ToS-Crawl is part of the following research paper:
> **"ToSense: We Read, You Click"**\
> *Xinzhang Chen, Hassan Ali, Arash Shaghaghi, Salil S. Kanhere, Sanjay Jha*\
> *Under review at IEEE/IFIP DSN 2025*

---
## Features
- Stealth-mode browser automation via `puppeteer-extra`
- Automatic scrolling and expansion of dynamic content
- Clean extraction using `@mozilla/readability`
- Converts HTML to structured Markdown with `turndown`
- Recursively follows TOS-related links
- Filters non-HTML and duplicate fragment URLs
- CLI support for custom URL and output path
- Crawl summary with coverage stats and result tagging

---
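
The URL filtering listed above can be pictured with a small sketch. This is not the code in `tos-crawl.js`; the helper names (`normalizeUrl`, `isNonHtml`, `selectLinks`) and the extension list are assumptions made for illustration. It drops the `#fragment` so that two links to the same page count once, and skips paths that clearly point at non-HTML files.

```javascript
// Hypothetical helpers: names and the extension list are illustrative,
// not taken from tos-crawl.js.
const NON_HTML_EXTENSIONS = new Set([
  ".pdf", ".doc", ".docx", ".zip", ".png", ".jpg", ".jpeg", ".gif",
]);

// Resolve a link against the page URL and drop its #fragment,
// so /terms#section-2 and /terms are treated as the same page.
function normalizeUrl(rawUrl, baseUrl) {
  const url = new URL(rawUrl, baseUrl);
  url.hash = "";
  return url.href;
}

// True when the path ends in an extension that is clearly not an HTML page.
function isNonHtml(urlString) {
  const { pathname } = new URL(urlString);
  const dot = pathname.lastIndexOf(".");
  if (dot === -1) return false;
  return NON_HTML_EXTENSIONS.has(pathname.slice(dot).toLowerCase());
}

// Return the links worth visiting and record them in `visited`.
function selectLinks(links, baseUrl, visited) {
  const selected = [];
  for (const link of links) {
    let normalized;
    try {
      normalized = normalizeUrl(link, baseUrl);
    } catch {
      continue; // not a parseable URL
    }
    if (visited.has(normalized) || isNonHtml(normalized)) continue;
    visited.add(normalized);
    selected.push(normalized);
  }
  return selected;
}

const visited = new Set();
console.log(
  selectLinks(
    ["/terms", "/terms#section-2", "/files/policy.pdf", "/privacy"],
    "https://example.com/",
    visited
  )
);
```

A real crawler would likely also check the response `Content-Type`, since a URL without a file extension can still serve a PDF.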
## Installation
```bash
git clone https://github.com/Xinzhang-Chen/tos-crawl.git
cd tos-crawl
npm install
```

> Requires Node.js ≥ 18. Puppeteer will auto-install Chromium. No system Chrome is needed.
---
## Usage
```bash
node tos-crawl.js --url <url> --output <output-file>
```

### Example
```bash
node tos-crawl.js --url https://www.linkedin.com/legal/l/service-terms --output ./output/Linkedin.md
```

> If no parameters are provided, the script crawls LinkedIn's Terms of Service by default and saves the result to `./output/Linkedin.md`.
---
## Parameters
| Argument | Description | Default |
|--------------|-----------------------------------------------|--------------------------------|
| `--url` | Starting URL to crawl | LinkedIn Service Terms |
| `--output` | Output `.md` file to store the extracted TOS | `./output/Linkedin.md` |

---
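
As a rough illustration of how the two arguments and their defaults might be resolved, here is a minimal sketch using Node's built-in `util.parseArgs` (available from Node 18.3). It is an assumption, not the parser used in `tos-crawl.js`; the function name `parseCliArgs` is hypothetical, and the default URL is inferred from the example command rather than confirmed from the source.

```javascript
const { parseArgs } = require("node:util");

// Defaults as described in the parameters table. The exact default URL is
// an assumption based on the example command in this README.
const DEFAULTS = {
  url: "https://www.linkedin.com/legal/l/service-terms",
  output: "./output/Linkedin.md",
};

// Hypothetical helper: parse --url and --output, falling back to DEFAULTS.
function parseCliArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      url: { type: "string" },
      output: { type: "string" },
    },
  });
  return {
    url: values.url ?? DEFAULTS.url,
    output: values.output ?? DEFAULTS.output,
  };
}

// In a script this would typically be called as parseCliArgs(process.argv.slice(2)).
console.log(parseCliArgs(["--url", "https://policies.google.com/terms"]));
```

With the default strict mode, `parseArgs` throws on unknown flags, which surfaces typos such as `--ouput` early.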
## Test URLs
Use the following well-known platform links to test the crawler:
| Platform | Terms of Service URL |
|-------------|----------------------|
| Facebook | https://www.facebook.com/terms.php |
| YouTube | https://www.youtube.com/t/terms |
| TikTok | https://www.tiktok.com/legal/page/row/terms-of-service/en |
| LinkedIn | https://www.linkedin.com/legal/l/service-terms |
| Google | https://policies.google.com/terms |

> You can copy any of the above into `--url` to test the crawler on that site.
---
## Sample Crawl Summary
```
Starting TOS extraction from: https://www.linkedin.com/legal/l/service-terms
Visiting: https://www.linkedin.com/legal/l/service-terms
Visiting: https://www.linkedin.com/help/recruiter/answer/50181/recruiter-inmail-policy?lang=en
Visiting: https://www.linkedin.com/help/recruiter/answer/a413279/recruiter-inmail-policy?lang=en
Visiting: https://www.linkedin.com/legal/professional-community-policies
Visiting: https://www.linkedin.com/legal/cookie-policy
Visiting: https://www.linkedin.com/legal/copyright-policy
Visiting: https://www.linkedin.com/legal/user-agreement
Visiting: https://www.linkedin.com/legal/privacy-policy
Visiting: https://www.linkedin.com/legal/l/jobs-policies
Visiting: https://www.linkedin.com/legal/user-agreement?trk=content_footer-user-agreement
Visiting: https://linkedin.com/legal/user-agreement
Visiting: https://linkedin.com/legal/user-agreement-summary
Visiting: https://linkedin.com/legal/privacy-policy
Visiting: https://linkedin.com/legal/professional-community-policies
Visiting: https://linkedin.com/legal/cookie-policy
Visiting: https://linkedin.com/legal/copyright-policy
Visiting: https://linkedin.com/legal/privacy/eu
Visiting: https://linkedin.com/legal/california-privacy-disclosure
Visiting: https://www.linkedin.com/legal/privacy/usa
Visiting: https://www.linkedin.com/help/linkedin/answer/a1341216/updates-to-user-agreement-and-privacy-policy
Visiting: https://www.linkedin.com/help/linkedin/answer/63?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/help/linkedin/answer/89880?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/legal/pop/terms-for-paid-services
Visiting: https://www.linkedin.com/help/linkedin/answer/50?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/help/linkedin/answer/5704?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/help/linkedin/answer/67?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/help/linkedin/answer/86529?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/help/linkedin/answer/50021?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/services
Visiting: https://www.linkedin.com/help/linkedin?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/help/linkedin/answer/79728?trk=microsites-frontend_legal_user-agreement&lang=en
Visiting: https://www.linkedin.com/legal/privacy-policy?trk=content_footer-privacy-policy
Visiting: https://www.linkedin.com/legal/cookie-policy?trk=content_footer-cookie-policy
Visiting: https://www.linkedin.com/legal/copyright-policy?trk=content_footer-copyright-policy
Visiting: https://brand.linkedin.com/policies?trk=content_footer-brand-policy
Skipping non-HTML file: https://business.linkedin.com/content/dam/business/sales-solutions/global/en_US/site/pdf/ti/services.pdf
Visiting: https://www.linkedin.com/legal/l/lmsprogramterms
Visiting: https://www.linkedin.com/legal/l/sponsorship-program-terms
Visiting: https://www.linkedin.com/legal/ads-policy
Visiting: https://legal.linkedin.com/dpa
Visiting: https://www.linkedin.com/legal/l/dpa
Visiting: https://legal.linkedin.com/customer-subprocessors
Visiting: https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linkedin.com%2Flegal%2Fcontracting-entity-terms&data=02%7C01%7Crvolpineto%40linkedin.com%7Cfd020363a3da48a2455308d7f690fd09%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637248974895609773&sdata=oS%2B38oz5KnsRLYvzMD6i6iDjAmROwKXnJNwlbKu3kfo%3D&reserved=0
Visiting: https://www.linkedin.com/legal/contracting-entity-terms
Terms of Service content saved to: ./output/Linkedin.md
Crawl Summary:
Total Pages Visited: 43
Successfully Extracted: 43
Skipped: 1
Failed: 0
```

---
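
The counts at the end of the sample output could be derived from per-URL results along the following lines. This is a sketch under assumptions: the result shape (`{ url, status }`), the status names, and the functions `summarize` and `formatSummary` are hypothetical, and the rule that skipped URLs do not count as visited is inferred from the sample (43 visited, 1 skipped) rather than taken from the source.

```javascript
// Hypothetical result records: { url, status }, where status is one of
// "extracted", "failed", or "skipped". Not the data model of tos-crawl.js.
function summarize(results) {
  const summary = { visited: 0, extracted: 0, skipped: 0, failed: 0 };
  for (const { status } of results) {
    if (status === "skipped") {
      summary.skipped += 1; // assumed: skipped URLs are never visited
      continue;
    }
    summary.visited += 1;
    if (status === "extracted") summary.extracted += 1;
    else if (status === "failed") summary.failed += 1;
  }
  return summary;
}

// Render the summary in the same layout as the sample output.
function formatSummary(s) {
  return [
    "Crawl Summary:",
    `Total Pages Visited: ${s.visited}`,
    `Successfully Extracted: ${s.extracted}`,
    `Skipped: ${s.skipped}`,
    `Failed: ${s.failed}`,
  ].join("\n");
}

const results = [
  { url: "https://example.com/terms", status: "extracted" },
  { url: "https://example.com/privacy", status: "extracted" },
  { url: "https://example.com/files/policy.pdf", status: "skipped" },
];
console.log(formatSummary(summarize(results)));
```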
## Disclaimer
> This tool is intended solely for **non-commercial**, **academic research**, and **educational** purposes. It is the user's responsibility to ensure compliance with applicable laws, website terms of service, and ethical research guidelines. This repository does **not** promote or encourage violating any platform's policies.
Additionally, the maintainers **do not guarantee the accuracy, completeness, or continued availability** of the extracted data. Websites may change their structure or access policies at any time. The responsibility for verifying the correctness and relevance of the collected content lies solely with the user.
---
## License
This project is licensed under the **GNU Affero General Public License v3.0 (AGPL-3.0)**.
See the [LICENSE](./LICENSE) file for more information.

---
## Maintainers
- Xinzhang Chen
- Hassan Ali
- Dr Arash Shaghaghi

This project is under **active development**.
New features and improvements are being added continuously. **Stay tuned!**