Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
A Python 3 Crawler for Mindfactory.de
https://github.com/robmch/mindfactory_crawling
- Host: GitHub
- URL: https://github.com/robmch/mindfactory_crawling
- Owner: RobMcH
- License: MIT
- Created: 2018-12-10T20:33:56.000Z
- Default Branch: master
- Last Pushed: 2022-10-03T17:56:23.000Z
- Last Synced: 2024-03-11T10:21:10.952Z
- Topics: crawler, crawling, data, webcrawler, webcrawling
- Language: Python
- Size: 38.1 KB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Mindfactory.de Crawler
This repository contains a crawler for [Mindfactory](https://www.mindfactory.de), a German e-commerce shop for computer hardware.
The crawler extracts the data from every product page and stores the scraped products and reviews in a SQLite database consisting of two tables; a sketch of such a schema follows the two property lists below. Each product has the following properties:
* ID (SQLite identifier)
* URL
* Product name
* Brand name
* Category (e.g. CPU)
* EAN
* SKU
* Items sold (count)
* People watching (count)
* RMA rate (return rate, in percent)
* Average rating (from 1.0 to 5.0)
* Shipping (information on availability)
* Price (in euros)

Additionally, all reviews for every product are collected and stored in a separate SQLite table. An entry in this table has the following properties:
* Product ID (reference to the corresponding ID in the product table)
* Stars (rating, from 1 to 5)
* Text (not tokenized or pre-processed in any way)
* Author
* Date (YYYY-MM-DD)
* Verified (whether the customer actually bought the product at Mindfactory)
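As referenced above, the following is a minimal sketch of what the two-table layout could look like. The database file name and all table and column names are assumptions derived from the property lists, not taken from the repository's code.

```python
# Hypothetical sketch of the two-table SQLite layout described above.
# All names (mindfactory.db, products, reviews, column names) are assumptions;
# the repository's actual schema may differ.
import sqlite3

conn = sqlite3.connect("mindfactory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS products (
    id          INTEGER PRIMARY KEY,   -- SQLite identifier
    url         TEXT,
    name        TEXT,
    brand       TEXT,
    category    TEXT,                  -- e.g. CPU
    ean         TEXT,
    sku         TEXT,
    sold        INTEGER,               -- items sold (count)
    watching    INTEGER,               -- people watching (count)
    rma_rate    REAL,                  -- in percent
    avg_rating  REAL,                  -- 1.0 to 5.0
    shipping    TEXT,                  -- availability information
    price       REAL                   -- in euros
);
CREATE TABLE IF NOT EXISTS reviews (
    product_id  INTEGER REFERENCES products(id),
    stars       INTEGER,               -- 1 to 5
    text        TEXT,                  -- raw review text
    author      TEXT,
    date        TEXT,                  -- YYYY-MM-DD
    verified    INTEGER                -- 1 if a verified purchase
);
""")
conn.commit()
conn.close()
```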
# Prerequisites
* Python 3 (>= 3.5)
* scrapy (>= 1.6.0)
* SQLite3

# Run the scraper
    scrapy crawl mindfactory_products
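Once a crawl has finished, the database can be inspected with any SQLite client. A small, hypothetical example, reusing the assumed file, table, and column names from the schema sketch above:

```python
# Inspect the scraped data: average price per category.
# "mindfactory.db" and the products table are the assumed names from the
# schema sketch above, not confirmed by the repository.
import sqlite3

conn = sqlite3.connect("mindfactory.db")
for category, avg_price in conn.execute(
    "SELECT category, AVG(price) FROM products GROUP BY category ORDER BY category"
):
    print(f"{category}: {avg_price}")
conn.close()
```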
# Deploy the scraper
The scraper can be deployed using scrapyd. To do so, run [scrapyd-deploy](https://github.com/scrapy/scrapyd-client#scrapyd-deploy)
with the address of the server running scrapyd. Afterwards, the scraper can be run via scrapyd:

    python scrapyd-deploy
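Once deployed, a crawl can also be scheduled over scrapyd's HTTP JSON API (the schedule.json endpoint is part of scrapyd itself). A minimal sketch using only the standard library; the project name "mindfactory" is an assumption, as the real name comes from the project's scrapy.cfg, and scrapyd is assumed to listen on its default port 6800:

```python
# Schedule a crawl through scrapyd's schedule.json endpoint.
# Assumptions: scrapyd listens on localhost:6800 (its default) and the
# project was deployed under the hypothetical name "mindfactory".
from urllib.parse import urlencode
from urllib.request import urlopen

data = urlencode({
    "project": "mindfactory",           # hypothetical; taken from scrapy.cfg in practice
    "spider": "mindfactory_products",   # spider name from "Run the scraper" above
}).encode()

with urlopen("http://localhost:6800/schedule.json", data=data) as response:
    print(response.read().decode())     # e.g. {"status": "ok", "jobid": "..."}
```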