https://github.com/horizon-blue/wiki-crawler
Crawl movie & actor info from Wikipedia
- Host: GitHub
- URL: https://github.com/horizon-blue/wiki-crawler
- Owner: horizon-blue
- License: gpl-3.0
- Created: 2018-03-11T04:54:17.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-03-11T05:09:59.000Z (over 7 years ago)
- Last Synced: 2025-02-16T22:19:18.684Z (4 months ago)
- Language: Python
- Size: 86.9 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# Crawler
A Wikipedia crawler that extracts information from actor and movie pages and stores it in a database.
## Install packages
It is recommended that you install all requirements in a virtual environment.
To create one, cd into the project root directory and run
```bash
virtualenv venv
```

(The project is written in Python 3 and may not be compatible with Python 2. Make sure you create the virtualenv using Python 3.)
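If you would rather not install the virtualenv package, Python 3's built-in venv module (available since Python 3.3) does the same job. This is a sketch assuming a `python3` interpreter is on your PATH:

```shell
# Alternative: create the virtual environment with the stdlib venv module
# (assumes a python3 interpreter is available on PATH)
python3 -m venv venv
```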
Then you can activate the virtual environment with
```bash
source venv/bin/activate
```

Install packages via pip:
```bash
pip install -r requirements.txt
```

Then you can begin exploring the project. When you are done, deactivate the environment with
```bash
deactivate
```

## Initialize database
You need to initialize the database the first time you run the crawler. To do so, start a Python console and run the following code:
```python
from database import init_db
init_db()
```

Then you can start the spider by running
```bash
python spider.py
```
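The README does not show what `database.py` contains. Purely as an illustration of the pattern, an `init_db` built on the stdlib `sqlite3` module might look like the following; the table names and columns here are guesses, not the project's actual schema:

```python
# Hypothetical sketch of database.py's init_db(); the real module is not
# shown in the README and likely differs (schema and storage are assumptions).
import sqlite3

def init_db(path="crawler.db"):
    """Create the actor and movie tables if they do not already exist."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS actors (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            age   INTEGER
        );
        CREATE TABLE IF NOT EXISTS movies (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            year  INTEGER
        );
        """
    )
    conn.commit()
    return conn
```

Calling `init_db()` a second time is harmless because of the `IF NOT EXISTS` guards, which matches the "run once before first use" instruction above.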