https://github.com/eivindarvesen/naive-spider
A minimal web crawler
- Host: GitHub
- URL: https://github.com/eivindarvesen/naive-spider
- Owner: EivindArvesen
- Created: 2016-04-06T21:06:42.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2016-04-09T11:38:38.000Z (about 10 years ago)
- Last Synced: 2025-12-17T04:29:59.447Z (7 months ago)
- Topics: crawler, python, spider
- Language: Python
- Size: 125 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# A Minimal Web Crawler
A naive implementation of a spider that depends only on the Python standard library.
It is able to resume operations if shut down (via ctrl+c).
Run the script with the command
```bash
python main.py
```
The application takes an arbitrary number of start URLs as arguments.
It saves these URLs to a queue and to a database.
Then, if there is a document at a URL, it downloads the document to the database, parses it to find any links, and adds those links to the queue.
This work is performed on several threads concurrently.
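
The steps above can be sketched roughly as follows. This is not the code in `main.py`, only a minimal illustration under stated assumptions: it uses Python 3 standard-library modules (the project itself targets Python 2.x), and the names `LinkParser`, `fetch`, `crawl`, `max_pages`, and the `pages` table are hypothetical.

```python
import queue
import sqlite3
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def fetch(url):
    """Download a document and decode it as UTF-8 (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def crawl(seeds, fetch_fn=fetch, db_path=":memory:", workers=4, max_pages=50):
    """Crawl from the seed URLs and return the sorted list of stored URLs."""
    db = sqlite3.connect(db_path, check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    lock = threading.Lock()      # guards both the seen set and the database
    frontier = queue.Queue()     # URLs waiting to be downloaded
    seen = set()

    def enqueue(url):
        with lock:
            if url not in seen and len(seen) < max_pages:
                seen.add(url)
                frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:      # sentinel: no more work
                frontier.task_done()
                return
            try:
                body = fetch_fn(url)
                with lock:
                    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))
                    db.commit()
                parser = LinkParser(url)
                parser.feed(body)
                for link in parser.links:
                    enqueue(link)
            except Exception:
                pass             # naive: skip documents that cannot be handled
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for seed in seeds:
        enqueue(seed)
    frontier.join()              # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)
    for thread in threads:
        thread.join()
    stored = [row[0] for row in db.execute("SELECT url FROM pages ORDER BY url")]
    db.close()
    return stored
```

Passing `fetch_fn` as a parameter is a design choice made for this sketch so that the loop can be exercised without network access, for example with a dictionary lookup in place of a real download.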
## Assumptions
This crawler assumes that Python 2.x is available and that there is a functional SQLite installation.
## Limitations
The program can only deal with well-formed documents, i.e. valid markup with UTF-8 encoding.
Were this a serious project, I would probably base my work on robust frameworks
like Scrapy and Beautiful Soup, to better handle scraping strategies and parsing
of broken markup, respectively.
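
Short of adopting those frameworks, the UTF-8 assumption could be relaxed a little with the standard library alone. The helper below is a hypothetical sketch in Python 3, not part of the project: `decode_body` and its `declared_charset` parameter are assumed names, and the charset is assumed to come from the response headers (in Python 3, for instance, via `response.headers.get_content_charset()`).

```python
def decode_body(raw, declared_charset=None):
    """Decode downloaded bytes, preferring the declared charset, then UTF-8."""
    for encoding in (declared_charset, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

This only addresses encoding; malformed markup would still need a more forgiving parser.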
## Possible future improvements
- Error handling on queries to SQLite.
- Do not treat the current URL as fully parsed, with all links extracted, if the crawler is interrupted while parsing it.
- Serialization/saving should be optimized.
- Separate downloading and saving a document from parsing it for links, and run the two tasks in different processes.
- Add rule support (blacklisting, etc.).
- Better model abstraction (i.e. objects/classes for interactions with frontier and storage).
- Reorganization of some code, e.g. handling command-line arguments elsewhere.
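
As an illustration of how a few of these items might fit together (interruption safety, rule support, and a frontier abstraction), here is a minimal single-threaded sketch. It is not the project's code: it assumes Python 3, and the `Frontier` class, its method names, the `status` column, and the substring-based blacklist are all hypothetical.

```python
import sqlite3


class Frontier:
    """Hypothetical SQLite-backed URL frontier with a per-URL status."""

    def __init__(self, db_path=":memory:", blacklist=()):
        self.db = sqlite3.connect(db_path)
        self.blacklist = tuple(blacklist)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frontier ("
            "url TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending')"
        )
        # URLs left 'active' by an interrupted run are retried.
        self.db.execute("UPDATE frontier SET status = 'pending' WHERE status = 'active'")
        self.db.commit()

    def _allowed(self, url):
        """Naive rule: reject any URL containing a blacklisted substring."""
        return not any(part in url for part in self.blacklist)

    def add(self, url):
        """Queue a URL unless it is blacklisted or already known."""
        if self._allowed(url):
            self.db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
            self.db.commit()

    def claim(self):
        """Return the next pending URL and mark it active, or None if empty."""
        row = self.db.execute(
            "SELECT url FROM frontier WHERE status = 'pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE frontier SET status = 'active' WHERE url = ?", row)
        self.db.commit()
        return row[0]

    def finish(self, url, links):
        """Store extracted links and mark the URL done in one transaction."""
        with self.db:  # commits on success, rolls back on an exception
            for link in links:
                if self._allowed(link):
                    self.db.execute(
                        "INSERT OR IGNORE INTO frontier (url) VALUES (?)", (link,)
                    )
            self.db.execute("UPDATE frontier SET status = 'done' WHERE url = ?", (url,))
```

Because a URL only becomes `done` in the same transaction that stores its links, a run interrupted between `claim` and `finish` leaves the URL `active`, and the next start resets it to `pending` so it is parsed again.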