Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/okwilkins/web-crawler
This program will crawl through entire domains, exporting every link it can find into a txt file.
- Host: GitHub
- URL: https://github.com/okwilkins/web-crawler
- Owner: okwilkins
- Created: 2018-03-19T20:01:36.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-03-20T17:51:36.000Z (almost 7 years ago)
- Last Synced: 2024-12-06T22:36:30.672Z (2 months ago)
- Topics: crawler, crawling, files, html, htmlparser, python, python3, reader, scraper, threading, threads, web, writer
- Language: Python
- Homepage:
- Size: 232 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Python Web Crawler
## Created by Oliver Wilkins
### 19/03/2018

This program will crawl through entire domains, exporting every link it can find into a txt file.
## Installing/Running the Project
You will not need to download any libraries; the project is plug and play:
* Downloading or cloning the repository
* Running the main.py file
* Links which the program saves are found in the *queued.txt* and *crawled.txt* files in the [projects folder](https://github.com/HomelessSandwich/Web-Crawler/tree/master/projects) - the folder contains example projects with their own *queued.txt* and *crawled.txt* files

## Important
* This program works by reading a webpage and appending every link it extracts to the *queued.txt* file; when a queued link's turn comes, the program reads it back from *queued.txt*, crawls it in the same way, and moves the now-completed (crawled) webpage to the *crawled.txt* file (see the sketch below)
* You can try to crawl massive domains with many links, but be aware that this will take a *VERY* long time
* Also note that you may need to change the NUMBER_OF_THREADS variable in the [main.py](https://github.com/HomelessSandwich/Web-Crawler/blob/master/main.py) file (line 12) - the best value depends on your operating system
```python
NUMBER_OF_THREADS = 8
```
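
The README does not show the crawler's internals, so below is only a minimal sketch of how the *queued.txt*/*crawled.txt* hand-off and a pool of NUMBER_OF_THREADS worker threads might fit together. Only the two file names and the NUMBER_OF_THREADS constant come from the project; the seed URL, the `LinkParser` class and the `crawl_page`/`worker` functions are hypothetical stand-ins, not the repository's actual code.

```python
# Minimal sketch (not the repository's actual code): a fixed pool of worker
# threads pulls URLs off a queue, fetches each page, extracts its <a href>
# links with html.parser, and mirrors progress into queued.txt / crawled.txt.
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

NUMBER_OF_THREADS = 8                      # from the README; tune per machine
QUEUE_FILE, CRAWLED_FILE = "queued.txt", "crawled.txt"
START_URL = "https://example.com/"         # hypothetical seed URL
DOMAIN = urlparse(START_URL).netloc        # stay on the starting domain


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


work = queue.Queue()
seen, crawled = {START_URL}, set()
lock = threading.Lock()


def crawl_page(url):
    """Fetch one page and return the absolute URLs it links to."""
    parser = LinkParser()
    with urlopen(url, timeout=10) as response:
        parser.feed(response.read().decode("utf-8", errors="replace"))
    return {urljoin(url, link) for link in parser.links}


def worker():
    while True:
        url = work.get()
        try:
            for link in crawl_page(url):
                with lock:
                    if urlparse(link).netloc == DOMAIN and link not in seen:
                        seen.add(link)
                        work.put(link)
        except Exception:
            pass  # the real project tracks failures like these (see issue #1)
        finally:
            with lock:
                crawled.add(url)
                # Dump the current state to the two text files.
                with open(QUEUE_FILE, "w") as q, open(CRAWLED_FILE, "w") as c:
                    q.write("\n".join(sorted(seen - crawled)))
                    c.write("\n".join(sorted(crawled)))
            work.task_done()


if __name__ == "__main__":
    work.put(START_URL)
    for _ in range(NUMBER_OF_THREADS):
        threading.Thread(target=worker, daemon=True).start()
    work.join()
    print(f"Crawled {len(crawled)} pages from {DOMAIN}")
```

In this sketch an in-memory `queue.Queue` drives the workers and the two text files are just snapshots of its state; the actual program re-reads *queued.txt* itself, as described above, so treat this only as an illustration of the general shape.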
## Updates for the Future
* Add a tree view for all the links found
* Reduce the number of decoding errors
* Fix some URLs completely shutting down threads and ultimately the whole program. This issue is described in detail [here](https://github.com/HomelessSandwich/Web-Crawler/issues/1)
* Create nicer console output and add a GUI