https://github.com/rggh/scrapy19
Scrapy To Postgresql for ML / NLP
bs4 postgresql python scrapy webscraping
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/rggh/scrapy19
- Owner: RGGH
- Created: 2021-05-28T11:00:30.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2021-06-08T19:19:51.000Z (about 5 years ago)
- Last Synced: 2025-03-28T01:55:27.835Z (about 1 year ago)
- Topics: bs4, postgresql, python, scrapy, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 1.45 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Scrapy To Postgresql for ML / NLP
### Notes:
- Run the spider with `scrapy crawl`, not through `CrawlerProcess`: in this project the item pipelines do not run when the spider is started through `CrawlerProcess`.
- Make sure the `varchar()` columns in Postgres are large enough to hold long URLs.
- Written with Python 3.9.5.
- If you are coming from MySQL, change the port to 5432 (the PostgreSQL default; MySQL uses 3306).
- Keep the search small: the site loads further results through a 'load more' control that is not backed by an AJAX/JSON endpoint, so Scrapy cannot fetch them.
- Clone Scrapy19, run it in a virtualenv, and use cj.sh (the cron job shell script) from cron to run it daily.
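
The notes on pipelines and the Postgres port can be illustrated with a minimal sketch of an item pipeline. This is not the repository's actual pipeline: the class name, the `PG_*` setting names, and the `title` and `url` columns are assumptions made for illustration. Only the `listings` table, the `posted` column, the `jobs` database name, and port 5432 come from this README. The connect function is injected so the class can be exercised without a running database.

```python
class PostgresPipeline:
    """Hypothetical Scrapy item pipeline: one INSERT per scraped item."""

    # `title` and `url` are assumed column names; `listings` and `posted`
    # appear in the README's SQL examples.
    INSERT_SQL = "INSERT INTO listings (title, url, posted) VALUES (%s, %s, %s)"

    def __init__(self, connect, **conn_kwargs):
        # `connect` is any DB-API connect function, e.g. psycopg2.connect.
        self.connect = connect
        self.conn_kwargs = conn_kwargs

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the class can be imported without psycopg2.
        import psycopg2

        s = crawler.settings  # PG_* names below are hypothetical settings
        return cls(
            psycopg2.connect,
            host=s.get("PG_HOST", "localhost"),
            port=s.getint("PG_PORT", 5432),  # PostgreSQL default, not MySQL's 3306
            dbname=s.get("PG_DBNAME", "jobs"),
            user=s.get("PG_USER"),
            password=s.get("PG_PASSWORD"),
        )

    def open_spider(self, spider):
        self.conn = self.connect(**self.conn_kwargs)
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized query: the driver handles quoting of long URLs.
        self.cur.execute(
            self.INSERT_SQL,
            (item.get("title"), item.get("url"), item.get("posted")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

A pipeline like this only runs if it is listed in the project's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.PostgresPipeline": 300}` (a hypothetical module path). `scrapy crawl` loads the project settings, and with them that entry, automatically.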
#### Code to age out old records:

    jobs=# DELETE FROM listings
           WHERE posted < NOW() - interval '7 days';
    DELETE 0
    jobs=# DELETE FROM listings
           WHERE posted < NOW() - interval '5 days';
    DELETE 0
    jobs=# DELETE FROM listings
           WHERE posted < NOW() - interval '4 days';
    DELETE 10
    jobs=#
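
The same clean-up could be run from Python, for example from the daily cron job. The sketch below is an assumption, not code from the repository: `age_out` is a hypothetical helper, and it expects a psycopg2-style DB-API connection. psycopg2 adapts `datetime.timedelta` to a PostgreSQL `interval`, so the statement matches the psql session above with the number of days passed as a parameter.

```python
from datetime import timedelta


def age_out(conn, days=7):
    """Delete listings older than `days` days and return the number removed.

    `conn` is assumed to be a psycopg2-style DB-API connection.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    cur = conn.cursor()
    try:
        # The timedelta parameter is sent as an interval, giving
        # "posted < NOW() - interval 'N days'" on the server.
        cur.execute(
            "DELETE FROM listings WHERE posted < NOW() - %s",
            (timedelta(days=days),),
        )
        deleted = cur.rowcount  # corresponds to the "DELETE n" line in psql
    finally:
        cur.close()
    conn.commit()
    return deleted
```

Calling `age_out(conn, days=4)` against the table state shown above would correspond to the last statement in the session, the one that reports `DELETE 10`.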
#### Check size of a column:

    jobs=# select
             sum(pg_column_size(posted)) as total_size,
             avg(pg_column_size(posted)) as average_size,
             sum(pg_column_size(posted)) * 100.0 / pg_relation_size('listings') as percentage
           from listings;
     total_size |    average_size    |       percentage
    ------------+--------------------+------------------------
            136 | 4.0000000000000000 | 0.23716517857142857143
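
The three figures in this output can be cross-checked with a little arithmetic. The snippet below only reworks the numbers shown above; the 8192-byte page size is PostgreSQL's default block size and is assumed to apply to this database. An average of 4 bytes per value is consistent with a 4-byte type such as `date`, although the README does not state the column type.

```python
# Values copied from the query output above.
total_size = 136                      # bytes used by all `posted` values
average_size = 4                      # bytes per value
percentage = 0.23716517857142857143   # share of the table's on-disk size

# Number of rows: total bytes divided by bytes per value.
rows = total_size // average_size
print(rows)  # 34

# Invert the percentage formula to recover pg_relation_size('listings').
relation_bytes = round(total_size * 100.0 / percentage)
print(relation_bytes)  # 57344

# Assuming the default 8192-byte block size, the table occupies whole pages.
pages, remainder = divmod(relation_bytes, 8192)
print(pages, remainder)  # 7 0
```

So the `posted` column accounts for 136 of the table's 57344 bytes, which is where the figure of roughly 0.24% comes from.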