Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gill-singh-a/crawler
A Program that crawls on web starting from a given web page and looking for keywords through other internal links that are found
- Host: GitHub
- URL: https://github.com/gill-singh-a/crawler
- Owner: Gill-Singh-A
- Created: 2023-04-20T00:44:22.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-19T17:57:12.000Z (7 months ago)
- Last Synced: 2024-05-12T05:47:30.670Z (6 months ago)
- Topics: crawler, multithreading, osint, python, python3, requests, scraper
- Language: Python
- Homepage:
- Size: 17.6 KB
- Stars: 2
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Metadata Files:
- Readme: README.md
README
# Crawler
A program that crawls the web starting from a given page, searching for keywords through the internal links it finds.

## Requirements
Language used: Python3
Modules/Packages used:
* requests
* pickle
* bs4
* datetime
* optparse
* colorama
* time

Install the dependencies:
```bash
pip install -r requirements.txt
```
## Input
* '-u', "--url" : URL to start Crawling from
* '-t', "--in-text" : Words to find in text (separated by ',')
* '-s', "--session-id" : Session ID (Cookie) for the Request Header (Optional)
* '-w', "--write" : Name of the file for the data to be dumped (default=current date and time)
* '-e', "--external" : Crawl on External URLs (True/False, default=False)
* '-T', "--timeout" : Request Timeout
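Since `optparse` appears in the module list above, the options might be declared along these lines. This is an illustrative sketch based only on the README's descriptions; the option names mirror the list above, but the repository's actual parser setup may differ:

```python
from datetime import datetime
from optparse import OptionParser

# Declare the CLI options described in the README (a sketch, not the repo's code).
parser = OptionParser()
parser.add_option("-u", "--url", dest="url",
                  help="URL to start crawling from")
parser.add_option("-t", "--in-text", dest="in_text",
                  help="Words to find in text, separated by ','")
parser.add_option("-s", "--session-id", dest="session_id",
                  help="Session ID (cookie) for the request header (optional)")
parser.add_option("-w", "--write", dest="write",
                  default=datetime.now().strftime("%Y-%m-%d_%H-%M-%S"),
                  help="File name for the dumped data (default: current date and time)")
parser.add_option("-e", "--external", dest="external", default="False",
                  help="Crawl external URLs (True/False, default=False)")
parser.add_option("-T", "--timeout", dest="timeout", type="float",
                  help="Request timeout in seconds")

# Example: parse a sample argument vector.
options, args = parser.parse_args(["-u", "https://example.com", "-t", "admin,login"])
```

`options.url` and `options.in_text` then hold the start URL and the comma-separated keyword string for the crawl.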
## Output
It will stop when it has crawled all the internal links of the given URL or if the user presses CTRL+C.
It then displays information about the total URLs extracted, the internal URLs extracted, and the external URLs extracted.
Finally, it gives a list of URLs in which the keywords we're interested in were found.
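The core crawl-and-search behavior described above can be sketched with the standard library alone. The actual program uses `requests` and `bs4`; this outline is an assumption-laden illustration of the technique (extract same-host links, scan page text for keywords), not the repository's code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_internal_links(base_url, html):
    """Resolve hrefs against base_url and keep only same-host (internal) links."""
    parser = LinkExtractor()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    internal = set()
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            internal.add(absolute)
    return internal


def find_keywords(text, keywords):
    """Return the subset of keywords that occur in the page text."""
    lowered = text.lower()
    return {kw for kw in keywords if kw.lower() in lowered}
```

A full crawler would fetch each internal link (e.g. with `requests.get`), run the response body through these helpers, and keep a visited set so every URL is fetched only once, stopping when no unvisited internal links remain.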