https://github.com/robmch/mindfactory_crawling

A Python 3 Crawler for Mindfactory.de

Host: GitHub
URL: https://github.com/robmch/mindfactory_crawling
Owner: RobMcH
License: mit
Created: 2018-12-10T20:33:56.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-10-03T17:56:23.000Z (over 2 years ago)
Last Synced: 2025-03-31T12:01:55.183Z (about 2 months ago)
Topics: crawler, crawling, data, webcrawler, webcrawling
Language: Python
Homepage:
Size: 38.1 KB
Stars: 4
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Mindfactory.de Crawler

This repository contains a crawler for [Mindfactory](https://www.mindfactory.de), a German e-commerce shop for computer hardware.

The crawler extracts the data from every product page and stores the scraped products and reviews in a SQLite database with two tables.

Each product has the following properties:  

* ID (SQLite identifier)

* URL

* Product name

* Brand name

* Category (e.g. CPU)

* EAN

* SKU

* Items sold (count)

* People watching (count)

* RMA rate (in percent)

* Average rating (from 1.0 to 5.0)

* Shipping (information on availability)

* Price (in Euro)  
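
As an illustration of how these properties could map onto a SQLite table, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `products`, all column names, and the helper functions are assumptions made for this example; the schema the crawler actually creates may differ.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only and
# are not taken from the crawler's source code.
PRODUCTS_SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    id              INTEGER PRIMARY KEY,  -- SQLite identifier
    url             TEXT NOT NULL,
    name            TEXT,
    brand           TEXT,
    category        TEXT,                 -- e.g. 'CPU'
    ean             TEXT,                 -- text keeps leading zeros
    sku             TEXT,
    items_sold      INTEGER,
    people_watching INTEGER,
    rma_percent     REAL,
    avg_rating      REAL,                 -- 1.0 to 5.0
    shipping        TEXT,                 -- availability information
    price_eur       REAL
)
"""


def create_products_table(conn):
    """Create the (hypothetical) products table if it does not exist."""
    conn.execute(PRODUCTS_SCHEMA)
    conn.commit()


def insert_product(conn, product):
    """Insert one product given as a dict and return its row ID.

    The dict keys are used as column names, so they must come from
    trusted code, never from scraped input.
    """
    columns = ", ".join(product)
    placeholders = ", ".join(":" + key for key in product)
    cur = conn.execute(
        "INSERT INTO products ({}) VALUES ({})".format(columns, placeholders),
        product,
    )
    conn.commit()
    return cur.lastrowid
```

Columns that are not supplied (for example a missing EAN) are simply stored as `NULL` in this sketch.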

Additionally, for every product all reviews are collected and stored in a separate SQLite table. An entry in this table has the following properties:

* Product ID (reference to the corresponding ID in the product table)

* Stars (rating, from 1 to 5)

* Text (not tokenized or pre-processed in any way)

* Author

* Date (YYYY-MM-DD)

* Verified (whether the customer actually bought the product at Mindfactory)
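
The link between the two tables can be sketched as follows. This is only an illustration under assumed names: `products`, `reviews`, the column names, and `verified_review_stats` are hypothetical and not taken from the crawler itself. The query joins reviews to their product through the product ID and aggregates the verified ones.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE products (
    id   INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE reviews (
    product_id  INTEGER NOT NULL REFERENCES products(id),
    stars       INTEGER CHECK (stars BETWEEN 1 AND 5),
    review_text TEXT,     -- raw text, not pre-processed
    author      TEXT,
    review_date TEXT,     -- ISO format, YYYY-MM-DD
    verified    INTEGER   -- 1 if the purchase was verified, else 0
);
"""


def verified_review_stats(conn):
    """Return (product name, review count, mean stars) per product,
    counting verified reviews only."""
    query = """
        SELECT p.name, COUNT(*), AVG(r.stars)
        FROM reviews AS r
        JOIN products AS p ON p.id = r.product_id
        WHERE r.verified = 1
        GROUP BY p.id
        ORDER BY p.id
    """
    return conn.execute(query).fetchall()
```

Storing the date as `YYYY-MM-DD` text means plain string comparison sorts reviews chronologically, which is one reason this format is convenient in SQLite.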

# Prerequisites  

* Python 3 (>= 3.5)

* scrapy (>= 1.6.0)

* SQLite3

# Run the scraper  

    scrapy crawl mindfactory_products
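
After a crawl, a quick sanity check is to count the rows in each table of the resulting database. The sketch below reads the table names from SQLite's own catalog, so it does not depend on how the crawler names them. The function name `summarize_database` is made up for this example, and the database path is left to the caller because it is not specified here.

```python
import sqlite3


def summarize_database(db_path):
    """Map each user table in the SQLite file to its row count."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for table in tables:
            # Quote the identifier in case a table name needs it.
            sql = 'SELECT COUNT(*) FROM "{}"'.format(table)
            counts[table] = conn.execute(sql).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `summarize_database` with the path of the crawler's database file should list two tables, one for products and one for reviews.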

    

# Deploy the scraper

The scraper can be deployed using scrapyd. To do so, run [scrapyd-deploy](https://github.com/scrapy/scrapyd-client#scrapyd-deploy) with the address of the server running scrapyd. Afterwards, the scraper can be run through scrapyd.

    python scrapyd-deploy
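
Once the project is deployed, a crawl can be started through scrapyd's `schedule.json` HTTP endpoint. The following is a minimal sketch using only the standard library. The server address and the project name passed in are assumptions (the project name depends on how the project was deployed), and the helper names are made up for this example; `mindfactory_products` is the spider name used above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(server, project, spider):
    """Build the POST request for scrapyd's schedule.json endpoint."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    url = server.rstrip("/") + "/schedule.json"
    return Request(url, data=data, method="POST")


def schedule_crawl(server, project, spider):
    """Send the request and return scrapyd's JSON reply as a dict."""
    request = build_schedule_request(server, project, spider)
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (hypothetical address and project name):
# schedule_crawl("http://localhost:6800", "mindfactory", "mindfactory_products")
```

Building the request separately from sending it makes it possible to inspect the URL and form data without a running scrapyd server.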
