https://github.com/rggh/scrapy19

bs4 postgresql python scrapy webscraping

## Scrapy To Postgresql for ML / NLP

### Notes:
- Run the spider with `scrapy crawl`, not via `CrawlerProcess`, as the pipelines don't work with `CrawlerProcess`
- Make sure the `varchar()` columns in Postgres are wide enough for long URLs
- Written with Python 3.9.5
- If you are coming from MySQL, change the port to 5432 (the PostgreSQL default)
- Keep the search small: the site uses a 'load more' button (not AJAX/JSON), which will not work from Scrapy

- Clone Scrapy19, run it in a virtualenv, and schedule `cj.sh` (the cron job shell script) with cron to run it daily
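
The notes above can be sketched as a minimal item pipeline. This is an illustration, not the code in this repo: the class name, the `title` and `url` columns, the 2048-character URL limit and the credentials are all assumptions. Only the `listings` table, the `posted` column and the `jobs` database appear elsewhere in this README.

```python
# Hypothetical pipeline sketch; every name except `listings`, `posted`
# and the `jobs` database is an assumption made for illustration.
INSERT_SQL = (
    "INSERT INTO listings (title, url, posted) "
    "VALUES (%(title)s, %(url)s, %(posted)s)"
)

URL_MAX_LEN = 2048  # assumed width of the varchar() column that holds URLs


def row_from_item(item):
    """Map a scraped item (dict-like) to the parameters used by INSERT_SQL."""
    url = item["url"]
    if len(url) > URL_MAX_LEN:
        # Fail early instead of letting Postgres reject the row.
        raise ValueError(f"url is {len(url)} chars, column holds {URL_MAX_LEN}")
    return {"title": item["title"], "url": url, "posted": item["posted"]}


class PostgresPipeline:
    """Item pipeline that writes each item to the `listings` table."""

    def open_spider(self, spider):
        import psycopg2  # imported lazily so the helper above needs no driver

        self.conn = psycopg2.connect(
            host="localhost",
            port=5432,          # PostgreSQL default; MySQL's default is 3306
            dbname="jobs",
            user="scrapy",      # placeholder credentials
            password="secret",
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(INSERT_SQL, row_from_item(item))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

A pipeline like this only runs once it is listed under `ITEM_PIPELINES` in the project's `settings.py`.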

#### Code to age out old records:

    jobs=# DELETE FROM listings
    WHERE posted < NOW() - interval '7 days';
    DELETE 0
    jobs=# DELETE FROM listings
    WHERE posted < NOW() - interval '5 days';
    DELETE 0
    jobs=# DELETE FROM listings
    WHERE posted < NOW() - interval '4 days';
    DELETE 10
    jobs=#
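
The same clean-up could be run from Python, for example alongside the daily cron job. This is a rough sketch: `age_out` is a hypothetical helper that is not part of this repo, and it expects any DB-API cursor that uses `%s` placeholders (as psycopg2 does).

```python
# Hypothetical helper; table and column names come from the psql session above.
AGE_OUT_SQL = "DELETE FROM listings WHERE posted < NOW() - %s * INTERVAL '1 day'"


def age_out(cursor, days=7):
    """Delete listings older than `days` days and return how many rows went."""
    if days < 1:
        raise ValueError("days must be at least 1")
    # The day count is passed as a query parameter, not formatted into the SQL.
    cursor.execute(AGE_OUT_SQL, (days,))
    return cursor.rowcount


# Usage with psycopg2 (assumed driver and placeholder connection details):
#
#   import psycopg2
#   conn = psycopg2.connect(dbname="jobs", port=5432)
#   with conn, conn.cursor() as cur:   # commits on success
#       print("deleted", age_out(cur, days=4))
```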

#### Check the size of a column:

    jobs=# select
    sum(pg_column_size(posted)) as total_size,
    avg(pg_column_size(posted)) as average_size,
    sum(pg_column_size(posted)) * 100.0 / pg_relation_size('listings') as percentage
    from listings;
     total_size |    average_size    |       percentage
    ------------+--------------------+------------------------
            136 | 4.0000000000000000 | 0.23716517857142857143
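
As a sanity check, the figures in that output can be worked backwards. This short sketch uses only the three numbers shown above; the 8192-byte page size is PostgreSQL's default block size and is an assumption about this particular install.

```python
from fractions import Fraction

total_size = 136                                 # bytes, from the output above
average_size = 4                                 # bytes per value
percentage = Fraction("0.23716517857142857143")  # share of the relation, in percent

rows = total_size // average_size                # number of non-null `posted` values
relation_bytes = round(total_size * 100 / percentage)
pages = relation_bytes / 8192                    # assumes the default 8 kB block size

print(rows, relation_bytes, pages)               # 34 57344 7.0
```

So the `posted` column holds 34 values of 4 bytes each, and the whole `listings` relation takes 57344 bytes, which is 7 pages at the default block size.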