https://github.com/hugodf/spider
- Host: GitHub
- URL: https://github.com/hugodf/spider
- Owner: HugoDF
- Created: 2017-03-07T10:38:10.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-04-20T15:47:55.000Z (about 8 years ago)
- Last Synced: 2025-02-14T01:39:08.948Z (3 months ago)
- Language: Python
- Size: 40 KB
- Stars: 0
- Watchers: 5
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# COMPM052 Spider Scrapy
# Setup
Requires sqlite3, Python 2.7 and pip.
Run `sqlite3 urls.db`, then run these statements to create the `urls` and `incomingLinks` tables and their indexes:
```sql
CREATE TABLE urls (url STRING, content TEXT, links STRING, pageRank REAL, amount INTEGER);
CREATE INDEX contentIndex ON urls (content);
CREATE TABLE incomingLinks (urlId INTEGER, incomingLinks STRING);
CREATE INDEX incomingLinksUrlId on incomingLinks (urlId);
```
Run `pip install scrapy`.
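
If you would rather not type the statements into the `sqlite3` shell, the same schema can be created from Python with the standard-library `sqlite3` module. This is only a sketch and not part of the repository: the helper names `create_schema` and `list_tables` are hypothetical, and `IF NOT EXISTS` has been added so the script can be re-run safely.

```python
import sqlite3

# Same tables and indexes as the SQL in Setup; IF NOT EXISTS is an addition
# (an assumption of this sketch) so that running it twice does not fail.
SCHEMA = """
CREATE TABLE IF NOT EXISTS urls (url STRING, content TEXT, links STRING, pageRank REAL, amount INTEGER);
CREATE INDEX IF NOT EXISTS contentIndex ON urls (content);
CREATE TABLE IF NOT EXISTS incomingLinks (urlId INTEGER, incomingLinks STRING);
CREATE INDEX IF NOT EXISTS incomingLinksUrlId ON incomingLinks (urlId);
"""


def create_schema(db_path):
    """Open (or create) the database at db_path and apply the schema."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `create_schema("urls.db")` creates the same file the spider expects. Passing `":memory:"` instead gives a throwaway database for experimenting.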
To start the spider:
`scrapy runspider scrape.py`
To enable PageRank:
Generate incoming links for the database: `python makeIncomingLinks.py`
Run the algorithm: `python pageRank.py`
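
To illustrate what the two PageRank steps do conceptually, here is a minimal sketch that works on an in-memory dictionary of outgoing links. It is not the code in `makeIncomingLinks.py` or `pageRank.py`: the function names `invert_links` and `page_rank`, the damping factor of 0.85, the fixed iteration count, and the handling of self-links and dangling pages are all assumptions, and it does not read from or write to `urls.db`.

```python
def invert_links(outgoing):
    """Map each known page to the sorted list of pages that link to it.

    Links to pages that are not keys of `outgoing`, and self-links,
    are ignored (an assumption of this sketch).
    """
    incoming = {page: [] for page in outgoing}
    for source, targets in outgoing.items():
        for target in set(targets):
            if target in incoming and target != source:
                incoming[target].append(source)
    return {page: sorted(sources) for page, sources in incoming.items()}


def page_rank(outgoing, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over a dict of outgoing links."""
    pages = list(outgoing)
    n = len(pages)
    if n == 0:
        return {}

    # Keep only links to known pages, drop self-links and duplicates.
    links = {
        page: set(t for t in outgoing[page] if t in outgoing and t != page)
        for page in pages
    }
    incoming = invert_links(links)

    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links[p])
        new_ranks = {}
        for page in pages:
            inbound = sum(ranks[q] / len(links[q]) for q in incoming[page])
            new_ranks[page] = (1.0 - damping) / n + damping * (inbound + dangling / n)
        ranks = new_ranks
    return ranks
```

For example, with `graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, `invert_links(graph)` shows that `c` is linked from both `a` and `b`, and `page_rank(graph)` gives `c` a higher score than `b`. The scores always sum to 1 because the rank of dangling pages is redistributed rather than lost.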