https://github.com/eivindarvesen/naive-spider

A minimal web crawler

Host: GitHub
URL: https://github.com/eivindarvesen/naive-spider
Owner: EivindArvesen
Created: 2016-04-06T21:06:42.000Z
Default Branch: master
Last Pushed: 2016-04-09T11:38:38.000Z
Last Synced: 2025-12-17T04:29:59.447Z
Topics: crawler, python, spider
Language: Python
Size: 125 KB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# A Minimal Web Crawler
A naive implementation of a spider that depends only on the Python standard library.
It can resume where it left off after being shut down (via Ctrl+C).

Run the script with the command:
```bash
python main.py
```

The application takes an arbitrary number of start URLs as arguments.
It saves these URLs to a queue and to a database.
Then, if there is a document at a URL, it downloads the document to the database, parses it to find any links, and adds these links to the queue.
This work is performed on several threads concurrently.
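
As a rough illustration of the flow described above (not the code in `main.py`), here is a minimal sketch using only the standard library. It is written for Python 3, whereas the project targets Python 2. The names `LinkParser`, `extract_links`, `crawl`, and the `fetch` callback are hypothetical, and an in-memory dict stands in for the SQLite database.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def crawl(seeds, fetch, num_workers=4):
    """Crawl from `seeds`. `fetch(url)` (hypothetical) returns HTML text or None.

    Returns a dict mapping each downloaded URL to its HTML.
    """
    frontier = queue.Queue()
    seen = set()
    pages = {}  # stands in for the database in this sketch
    lock = threading.Lock()

    def enqueue(url):
        with lock:
            if url in seen:
                return
            seen.add(url)
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:  # sentinel: no more work
                frontier.task_done()
                break
            try:
                html = fetch(url)
                if html is not None:
                    with lock:
                        pages[url] = html
                    for link in extract_links(url, html):
                        enqueue(link)
            except Exception:
                pass  # a failed download or parse just skips this URL
            finally:
                frontier.task_done()

    for url in seeds:
        enqueue(url)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()  # returns once every queued URL has been processed
    for _ in threads:
        frontier.put(None)
    for t in threads:
        t.join()
    return pages
```

Passing the download step in as `fetch` keeps the sketch testable without network access; in a real run it could wrap `urllib.request.urlopen`.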

## Assumptions
This crawler assumes that Python (2.x.x) is available, and that there is a functional SQLite installation.

## Limitations
The program can only handle well-formed documents, i.e. valid markup encoded as UTF-8.

Were this a serious project, I would probably build on robust frameworks
like Scrapy and Beautiful Soup, to better handle scraping strategies and parsing
of broken markup, respectively.
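
To make the UTF-8 limitation concrete, here is a small sketch (Python 3, standard library only) of one way a downloaded body could be decoded without failing on invalid bytes. The helper name `decode_body` is hypothetical and not part of this project; it only addresses encoding, not broken markup.

```python
def decode_body(raw):
    """Decode bytes as UTF-8.

    Returns (text, clean), where `clean` is False if invalid bytes
    had to be replaced with U+FFFD.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Fall back to a lossy decode instead of dropping the document.
        return raw.decode("utf-8", errors="replace"), False
```

The `clean` flag lets the caller decide whether to store, log, or skip a document that was not valid UTF-8.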

## Possible future improvements
- Error handling on queries to SQLite.
- Do not treat the current URL as fully parsed (with all links extracted) if the crawler is interrupted while parsing it.
- Optimize serialization/saving.
- Split the work into separate processes: one for downloading and saving documents, another for parsing documents to find links.
- Add rule support (blacklisting, etc.)
- Better model abstraction (i.e. objects/classes for interactions with frontier and storage).
- Reorganization of some code, e.g. dealing with args etc. elsewhere.
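
As one possible shape for the model abstraction mentioned in the list above, here is a sketch of a frontier class backed by SQLite (Python 3, standard library only). The class name `Frontier`, its methods, and the table layout are all assumptions, not the schema this project uses. A URL is marked done only after it has been fully processed, so a URL that was interrupted mid-parse is handed out again on the next run.

```python
import sqlite3


class Frontier:
    """A URL queue persisted in SQLite so a crawl can resume after a restart."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frontier ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT UNIQUE NOT NULL,"
            " done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def add(self, url):
        """Queue a URL unless it is already known. Returns True if it was added."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,)
        )
        self.db.commit()
        return cur.rowcount == 1

    def next(self):
        """Return the oldest URL not yet marked done, or None if there is none."""
        row = self.db.execute(
            "SELECT url FROM frontier WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        return row[0] if row else None

    def mark_done(self, url):
        """Record that a URL has been downloaded and fully parsed."""
        self.db.execute("UPDATE frontier SET done = 1 WHERE url = ?", (url,))
        self.db.commit()
```

This sketch is single-threaded: a `sqlite3` connection is by default tied to the thread that created it, so a threaded crawler would need one connection per thread or a single thread that owns the database.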
