https://github.com/venkatamutyala/wordpress-plugins-crawler-scrapy
Scrapy scripts to crawl all WordPress.org plugins
scrapy scrapy-crawler scrapy-spider webscraper wordpress wordpress-plugin-crawler
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/venkatamutyala/wordpress-plugins-crawler-scrapy
- Owner: venkatamutyala
- Created: 2018-01-28T21:24:26.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-06-05T23:53:05.000Z (about 2 years ago)
- Last Synced: 2025-10-08T14:52:21.715Z (9 months ago)
- Topics: scrapy, scrapy-crawler, scrapy-spider, webscraper, wordpress, wordpress-plugin-crawler
- Language: Python
- Homepage:
- Size: 14.6 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
# NOTE: THIS PROJECT IS NO LONGER IN ACTIVE DEVELOPMENT. Please ensure that you update all the libraries prior to execution. One or more of the libraries in this project may have security vulnerabilities.
# WordPress Plugin crawler using Scrapy
### Development Environment setup
Developed with Python 3.6.3
```
$ virtualenv venv -p python3
$ source venv/bin/activate
$ pip install -r requirements.txt
```
Notes:
The main.py file was added to make interactive debugging in PyCharm easier.
The default output format is newline-delimited JSON.
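For reference, a debugging entry point like the main.py mentioned above could look roughly like the sketch below. This is an assumption about its shape, not a copy of the file in this repo; `build_argv` is a hypothetical helper, and `scrapy.cmdline.execute` is Scrapy's own command-line entry point.

```python
# main.py (hypothetical sketch; the file in this repo may differ)
import sys

SPIDER_NAME = "WordPressPlugins"  # spider name taken from the run command


def build_argv(spider_name=SPIDER_NAME):
    """Return the argv list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


if __name__ == "__main__":
    # Imported here so the module can be inspected without Scrapy installed.
    from scrapy.cmdline import execute

    # Runs the crawl in this process, so IDE breakpoints in the spider are hit.
    execute(build_argv(*sys.argv[1:2]))
```

Running this file under the PyCharm debugger should behave like running the scrapy command from a terminal.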
To run:
```
$ scrapy crawl WordPressPlugins
```
By default, output is stored in "YYYY-MM-DD.ndjson".
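Since the output is newline-delimited JSON, each line can be parsed on its own. The sketch below is one way to read a crawl file back in. It assumes the file sits in the current directory and makes no assumption about item fields; `default_output_name` and `read_ndjson` are illustrative names, not part of this project.

```python
import json
from datetime import date


def default_output_name(day=None):
    """Return the YYYY-MM-DD.ndjson file name for the given day (today by default)."""
    return f"{(day or date.today()).isoformat()}.ndjson"


def read_ndjson(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Print every item from today's crawl output.
    for item in read_ndjson(default_output_name()):
        print(item)
```

Reading line by line keeps memory use flat even when the crawl output is large.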
### Export the variables below to save to AWS S3:
```
$ export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXX
$ export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXX
$ export AWS_DEFAULT_REGION=XXXXXXXX
$ export SCRAPY_WORDPRESS_FEED_URI="s3://el-gato-public/scrapy/wordpress-plugins/$(date +%F).ndjson"
```
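How the project reads `SCRAPY_WORDPRESS_FEED_URI` is not shown here. As a rough sketch only, a settings module could fall back to the dated local file when the variable is unset. `resolve_feed_uri` is a hypothetical helper, and the `FEEDS` setting assumes Scrapy 2.1 or later (older releases used `FEED_URI` and `FEED_FORMAT` instead).

```python
import os
from datetime import date


def resolve_feed_uri(environ=os.environ, today=None):
    """Return the export target: the env override if set, else a dated local file."""
    default = f"{(today or date.today()).isoformat()}.ndjson"
    return environ.get("SCRAPY_WORDPRESS_FEED_URI", default)


# In a Scrapy settings.py this value could drive the feed export.
# "jsonlines" is Scrapy's built-in newline-delimited JSON format.
FEEDS = {resolve_feed_uri(): {"format": "jsonlines"}}
```

With the variables above exported, the resolved URI would be the S3 path; without them it would be the local "YYYY-MM-DD.ndjson" file.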
Other:
You are also welcome to read from my bucket directly at: s3://el-gato-public/scrapy/wordpress-plugins/*
**Please be aware that Requester Pays is enabled on the bucket, so requests are billed to your AWS account.**
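Because the requester is billed, each request has to acknowledge the charge explicitly. Below is a minimal sketch using boto3, assuming it is installed and AWS credentials are configured; `list_keys` is an illustrative helper, not part of this project.

```python
BUCKET = "el-gato-public"
PREFIX = "scrapy/wordpress-plugins/"


def list_keys(client, bucket=BUCKET, prefix=PREFIX):
    """Return object keys under the prefix, acknowledging Requester Pays charges."""
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, RequestPayer="requester")
    for page in pages:
        # Pages without matching objects carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


if __name__ == "__main__":
    import boto3  # third-party; credentials come from the usual AWS environment

    s3 = boto3.client("s3")
    for key in list_keys(s3):
        print(key)
    # Downloads need the same acknowledgement, for example:
    # s3.download_file(BUCKET, key, "local.ndjson", ExtraArgs={"RequestPayer": "requester"})
```

Requests that omit the `RequestPayer` argument are expected to be rejected by S3 with an access-denied error.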
If you have any questions, feel free to reach out.