Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mikemeliz/torcrawl.py
Crawl and extract (regular or onion) webpages through TOR network
crawler extractor onion osint python tor
Last synced: 3 months ago
Crawl and extract (regular or onion) webpages through TOR network
- Host: GitHub
- URL: https://github.com/mikemeliz/torcrawl.py
- Owner: MikeMeliz
- License: gpl-3.0
- Created: 2016-12-05T11:38:00.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-01-22T17:47:36.000Z (10 months ago)
- Last Synced: 2024-08-02T16:02:55.270Z (3 months ago)
- Topics: crawler, extractor, onion, osint, python, tor
- Language: Python
- Homepage:
- Size: 293 KB
- Stars: 271
- Watchers: 6
- Forks: 56
- Open Issues: 4
- Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-network-stuff - **32** stars
README
### TorCrawl.py is a Python script designed for anonymous web scraping via the Tor network.
It combines ease of use with the robust privacy features of Tor, allowing for secure and untraceable data collection. Ideal for both novice and experienced programmers, this tool is essential for responsible data gathering in the digital age.
[![Release][release-version-shield]][releases-link]
[![Last Commit][last-commit-shield]][commit-link]
![Python][python-version-shield]
[![license][license-shield]][license-link]
### What makes it simple and easy to use?
If you are a terminal maniac, you know that things have to be simple and clear. Passing the output into other tools is necessary, and accuracy is the key.
With a single argument you can read a regular or .onion webpage through the TOR network, and by using pipes you can pass the output to any other tool you prefer.
```shell
$ torcrawl -u http://www.github.com/ | grep 'google-analytics'
```

If you want to crawl the links of a webpage, use `-c` and **BAM** you get all of the internal links in a file. You can even use `-d` to set how deep the crawler travels, and `-p` to wait some seconds between each request.
```shell
$ torcrawl -v -u http://www.github.com/ -c -d 2 -p 2
# TOR is ready!
# URL: http://www.github.com/
# Your IP: XXX.XXX.XXX.XXX
# Crawler started from http://www.github.com/ with 2 depth crawl and 2 second(s) delay:
# Step 1 completed with: 11 results
# Step 2 completed with: 112 results
# File created on /path/to/project/links.txt
```

> [!TIP]
> Crawling is not illegal, but violating copyright *is*. It's always best to double-check a website's T&C before you start crawling it. Some websites set up a `robots.txt` to tell crawlers not to visit certain pages.
>
> This crawler *will* allow you to go around that, but we always *recommend* respecting robots.txt.
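For example, a quick way to review a site's crawl policy before pointing the crawler at it is to read its `robots.txt` directly (a plain `curl` sketch; the URL is just a placeholder):

```shell
# Fetch the site's robots.txt and list the paths it asks crawlers to skip
$ curl -s http://www.github.com/robots.txt | grep -i 'disallow'
```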
## Installation
### Easy Installation with pip:
*Coming soon...*

### Manual Installation:
1. **Clone this repository**:
`git clone https://github.com/MikeMeliz/TorCrawl.py.git`
2. **Install dependencies**:
`pip install -r requirements.txt`
3. **Install and Start TOR Service**:
    1. **Debian/Ubuntu**:
       `apt-get install tor`
       `service tor start`
    2. **Windows**: Download [`tor.exe`][tor-download], and:
       `tor.exe --service install`
       `tor.exe --service start`
    3. **MacOS**:
       `brew install tor`
       `brew services start tor`
    4. For other distros, visit the
       [TOR Setup Documentation][tor-docs]
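Once the service is running, you can sanity-check the connection before crawling. A minimal sketch, assuming Tor is listening on its default SOCKS port 9050:

```shell
# Ask the Tor Project's check service whether this request really went through TOR
# The JSON response should report "IsTor": true when routed correctly
$ curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```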

## Arguments

**arg** | **Long** | **Description**
----|------|------------
**General**: | |
-h |--help| Help message
-v |--verbose| Show more information about the progress
-u |--url *.onion| URL of webpage to crawl or extract
-w |--without| Without using the TOR network
-f |--folder| The directory which will contain the generated files
**Extract**: | |
-e |--extract| Extract page's code to terminal or file (Default: Terminal)
-i |--input filename| Input file with URL(s) (separated by line)
-o |--output [filename]| Output page(s) to file(s) (for one page)
-y |--yara| Perform yara keyword search: h = search entire HTML object, t = search only text
**Crawl**: | |
-c |--crawl| Crawl website (Default output on website/links.txt)
-d |--cdepth| Set depth of crawler's travel (Default: 1)
-p |--pause| Seconds of pause between requests (Default: 0)
-l |--log| Log file with visited URLs and their response code
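For instance, several of the extract flags can be combined in a single run. A hedged sketch based on the table above (`links.txt` and `output/` are placeholder names):

```shell
# Read URLs line by line from links.txt, extract each page through TOR,
# and store the generated files under the output/ directory
$ python torcrawl.py -v -i links.txt -f output/ -e
```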

## Usage & Examples

### As Extractor:
To just extract a single webpage to terminal:

```shell
$ python torcrawl.py -u http://www.github.com...