Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mikemeliz/torcrawl.py
Crawl and extract (regular or onion) webpages through TOR network
crawler extractor onion osint python tor
Last synced: 3 months ago
Crawl and extract (regular or onion) webpages through TOR network
- Host: GitHub
- URL: https://github.com/mikemeliz/torcrawl.py
- Owner: MikeMeliz
- License: gpl-3.0
- Created: 2016-12-05T11:38:00.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-01-22T17:47:36.000Z (10 months ago)
- Last Synced: 2024-08-02T16:02:55.270Z (3 months ago)
- Topics: crawler, extractor, onion, osint, python, tor
- Language: Python
- Homepage:
- Size: 293 KB
- Stars: 271
- Watchers: 6
- Forks: 56
- Open Issues: 4
- Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-network-stuff - **32** stars
README
### TorCrawl.py is a Python script designed for anonymous web scraping via the Tor network.
It combines ease of use with the robust privacy features of Tor, allowing for secure and untraceable data collection. Ideal for both novice and experienced programmers, this tool is essential for responsible data gathering in the digital age.
[![Release][release-version-shield]][releases-link]
[![Last Commit][last-commit-shield]][commit-link]
![Python][python-version-shield]
[![license][license-shield]][license-link]
### What makes it simple and easy to use?
If you are a terminal maniac, you know that things have to be simple and clear. Passing the output into other tools is necessary, and accuracy is the key.
With a single argument you can read a regular or .onion webpage through the TOR network, and by using pipes you can pass the output to any other tool you prefer.
```shell
$ torcrawl -u http://www.github.com/ | grep 'google-analytics'
```

If you want to crawl the links of a webpage, use `-c` and **BAM** you get all of the internal links in a file. You can even use `-d` to set how deep the crawler travels, and `-p` to wait some seconds between each request.
```shell
$ torcrawl -v -u http://www.github.com/ -c -d 2 -p 2
# TOR is ready!
# URL: http://www.github.com/
# Your IP: XXX.XXX.XXX.XXX
# Crawler started from http://www.github.com/ with 2 depth crawl and 2 second(s) delay:
# Step 1 completed with: 11 results
# Step 2 completed with: 112 results
# File created on /path/to/project/links.txt
```

> [!TIP]
> Crawling is not illegal, but violating copyright *is*. It's always best to double-check a website's T&C before you start crawling it. Some websites set up a `robots.txt` to tell crawlers not to visit certain pages.
>
> This crawler *will* allow you to go around that, but we always *recommend* respecting robots.txt.
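For example, a quick way to review a site's crawl policy before pointing the crawler at it is to read its `robots.txt` directly (a plain `curl` sketch; the URL is just a placeholder):

```shell
# Fetch the site's robots.txt and list the paths it asks crawlers to skip
$ curl -s http://www.github.com/robots.txt | grep -i 'disallow'
```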
## Installation
### Easy Installation with pip:
*Coming soon...*

### Manual Installation:
1. **Clone this repository**:
`git clone https://github.com/MikeMeliz/TorCrawl.py.git`
2. **Install dependencies**:
`pip install -r requirements.txt`
3. **Install and Start TOR Service**:
    1. **Debian/Ubuntu**:
       `apt-get install tor`
       `service tor start`
    2. **Windows**: Download [`tor.exe`][tor-download], and:
       `tor.exe --service install`
       `tor.exe --service start`
    3. **MacOS**:
       `brew install tor`
       `brew services start tor`
    4. For other distros, visit the
       [TOR Setup Documentation][tor-docs]
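Once the service is running, you can sanity-check the connection before crawling. A minimal sketch, assuming Tor is listening on its default SOCKS port 9050:

```shell
# Ask the Tor Project's check service whether this request really went through TOR
# The JSON response should report "IsTor": true when routed correctly
$ curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```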

## Arguments

**arg** | **Long** | **Description**
----|------|------------
**General**: | |
-h |--help| Help message
-v |--verbose| Show more information about the progress
-u |--url *.onion| URL of webpage to crawl or extract
-w |--without| Without using the TOR network
-f |--folder| The directory which will contain the generated files
**Extract**: | |
-e |--extract| Extract page's code to terminal or file (Default: Terminal)
-i |--input filename| Input file with URL(s) (separated by line)
-o |--output [filename]| Output page(s) to file(s) (for one page)
-y |--yara| Perform yara keyword search: h = search entire HTML object, t = search only text
**Crawl**: | |
-c |--crawl| Crawl website (Default output on website/links.txt)
-d |--cdepth| Set depth of crawler's travel (Default: 1)
-p |--pause| Seconds of pause between requests (Default: 0)
-l |--log| Log file with visited URLs and their response code
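For instance, several of the extract flags can be combined in a single run. A hedged sketch based on the table above (`links.txt` and `output/` are placeholder names):

```shell
# Read URLs line by line from links.txt, extract each page through TOR,
# and store the generated files under the output/ directory
$ python torcrawl.py -v -i links.txt -f output/ -e
```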

## Usage & Examples

### As Extractor:
To just extract a single webpage to terminal:

```shell
$ python torcrawl.py -u http://www.github.com...