Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gill-singh-a/crawler
A Program that crawls on web starting from a given web page and looking for keywords through other internal links that are found
- Host: GitHub
- URL: https://github.com/gill-singh-a/crawler
- Owner: Gill-Singh-A
- Created: 2023-04-20T00:44:22.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-19T17:57:12.000Z (7 months ago)
- Last Synced: 2024-05-12T05:47:30.670Z (6 months ago)
- Topics: crawler, multithreading, osint, python, python3, requests, scraper
- Language: Python
- Homepage:
- Size: 17.6 KB
- Stars: 2
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Metadata Files:
- Readme: README.md
README
# Crawler
A program that crawls the web starting from a given page, searching for keywords through the internal links it finds.

## Requirements
Language used: Python3
Modules/Packages used:
* requests
* pickle
* bs4
* datetime
* optparse
* colorama
* time

Install the dependencies:
```bash
pip install -r requirements.txt
```
## Input
* '-u', "--url" : URL to start Crawling from
* '-t', "--in-text" : Words to find in text (separated by ',')
* '-s', "--session-id" : Session ID (Cookie) for the Request Header (Optional)
* '-w', "--write" : Name of the file for the data to be dumped (default=current date and time)
* '-e', "--external" : Crawl on External URLs (True/False, default=False)
* '-T', "--timeout" : Request Timeout
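Since `optparse` appears in the module list above, the options might be declared along these lines. This is an illustrative sketch based only on the README's descriptions; the option names mirror the list above, but the repository's actual parser setup may differ:

```python
from datetime import datetime
from optparse import OptionParser

# Declare the CLI options described in the README (a sketch, not the repo's code).
parser = OptionParser()
parser.add_option("-u", "--url", dest="url",
                  help="URL to start crawling from")
parser.add_option("-t", "--in-text", dest="in_text",
                  help="Words to find in text, separated by ','")
parser.add_option("-s", "--session-id", dest="session_id",
                  help="Session ID (cookie) for the request header (optional)")
parser.add_option("-w", "--write", dest="write",
                  default=datetime.now().strftime("%Y-%m-%d_%H-%M-%S"),
                  help="File name for the dumped data (default: current date and time)")
parser.add_option("-e", "--external", dest="external", default="False",
                  help="Crawl external URLs (True/False, default=False)")
parser.add_option("-T", "--timeout", dest="timeout", type="float",
                  help="Request timeout in seconds")

# Example: parse a sample argument vector.
options, args = parser.parse_args(["-u", "https://example.com", "-t", "admin,login"])
```

`options.url` and `options.in_text` then hold the start URL and the comma-separated keyword string for the crawl.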
## Output
It will stop when it has crawled all the internal links of the given URL or if the user presses CTRL+C.
It then displays information about the total URLs extracted, the internal URLs extracted, and the external URLs extracted.
Finally, it gives a list of URLs in which the keywords we're interested in were found.
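The core crawl-and-search behavior described above can be sketched with the standard library alone. The actual program uses `requests` and `bs4`; this outline is an assumption-laden illustration of the technique (extract same-host links, scan page text for keywords), not the repository's code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_internal_links(base_url, html):
    """Resolve hrefs against base_url and keep only same-host (internal) links."""
    parser = LinkExtractor()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    internal = set()
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            internal.add(absolute)
    return internal


def find_keywords(text, keywords):
    """Return the subset of keywords that occur in the page text."""
    lowered = text.lower()
    return {kw for kw in keywords if kw.lower() in lowered}
```

A full crawler would fetch each internal link (e.g. with `requests.get`), run the response body through these helpers, and keep a visited set so every URL is fetched only once, stopping when no unvisited internal links remain.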