Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/okwilkins/web-crawler
This program will crawl through entire domains, exporting every link it can find into a txt file.
- Host: GitHub
- URL: https://github.com/okwilkins/web-crawler
- Owner: okwilkins
- Created: 2018-03-19T20:01:36.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-03-20T17:51:36.000Z (almost 7 years ago)
- Last Synced: 2024-12-06T22:36:30.672Z (2 months ago)
- Topics: crawler, crawling, files, html, htmlparser, python, python3, reader, scraper, threading, threads, web, writer
- Language: Python
- Homepage:
- Size: 232 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Python Web Crawler
## Created by Oliver Wilkins
### 19/03/2018

This program will crawl through entire domains, exporting every link it can find into a txt file.
## Installing/Running the Project
You will not need to download any libraries; the project is plug and play:
* Downloading or cloning the repository
* Running the main.py file
* Links which the program saves are found in the *queued.txt* and *crawled.txt* files in the [projects folder](https://github.com/HomelessSandwich/Web-Crawler/tree/master/projects) - the folder contains example projects with their own *queued.txt* and *crawled.txt* files

## Important
* This program works by reading a webpage and appending every link it extracts to the *queued.txt* file; when a queued link's turn comes, the program reads it back from *queued.txt*, crawls it in the same way, and moves the now-completed (crawled) webpage to the *crawled.txt* file (see the sketch below)
* You can try to crawl massive domains with many links, but be aware that this will take a *VERY* long time
* Also note that you may need to change the NUMBER_OF_THREADS variable in the [main.py](https://github.com/HomelessSandwich/Web-Crawler/blob/master/main.py) file (line 12) - the best value depends on your operating system
```python
NUMBER_OF_THREADS = 8
```
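
The README does not show the crawler's internals, so below is only a minimal sketch of how the *queued.txt*/*crawled.txt* hand-off and a pool of NUMBER_OF_THREADS worker threads might fit together. Only the two file names and the NUMBER_OF_THREADS constant come from the project; the seed URL, the `LinkParser` class and the `crawl_page`/`worker` functions are hypothetical stand-ins, not the repository's actual code.

```python
# Minimal sketch (not the repository's actual code): a fixed pool of worker
# threads pulls URLs off a queue, fetches each page, extracts its <a href>
# links with html.parser, and mirrors progress into queued.txt / crawled.txt.
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

NUMBER_OF_THREADS = 8                      # from the README; tune per machine
QUEUE_FILE, CRAWLED_FILE = "queued.txt", "crawled.txt"
START_URL = "https://example.com/"         # hypothetical seed URL
DOMAIN = urlparse(START_URL).netloc        # stay on the starting domain


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


work = queue.Queue()
seen, crawled = {START_URL}, set()
lock = threading.Lock()


def crawl_page(url):
    """Fetch one page and return the absolute URLs it links to."""
    parser = LinkParser()
    with urlopen(url, timeout=10) as response:
        parser.feed(response.read().decode("utf-8", errors="replace"))
    return {urljoin(url, link) for link in parser.links}


def worker():
    while True:
        url = work.get()
        try:
            for link in crawl_page(url):
                with lock:
                    if urlparse(link).netloc == DOMAIN and link not in seen:
                        seen.add(link)
                        work.put(link)
        except Exception:
            pass  # the real project tracks failures like these (see issue #1)
        finally:
            with lock:
                crawled.add(url)
                # Dump the current state to the two text files.
                with open(QUEUE_FILE, "w") as q, open(CRAWLED_FILE, "w") as c:
                    q.write("\n".join(sorted(seen - crawled)))
                    c.write("\n".join(sorted(crawled)))
            work.task_done()


if __name__ == "__main__":
    work.put(START_URL)
    for _ in range(NUMBER_OF_THREADS):
        threading.Thread(target=worker, daemon=True).start()
    work.join()
    print(f"Crawled {len(crawled)} pages from {DOMAIN}")
```

In this sketch an in-memory `queue.Queue` drives the workers and the two text files are just snapshots of its state; the actual program re-reads *queued.txt* itself, as described above, so treat this only as an illustration of the general shape.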
## Updates for the Future
* Add a tree view for all the links found
* Reduce the number of decoding errors
* Fix some URLs completely shutting down threads and ultimately the whole program. This issue is described in detail [here](https://github.com/HomelessSandwich/Web-Crawler/issues/1)
* Create nicer console output and add a GUI