https://github.com/rggh/scrapy19

Scrapy To Postgresql for ML / NLP

bs4 postgresql python scrapy webscraping

Host: GitHub
URL: https://github.com/rggh/scrapy19
Owner: RGGH
Created: 2021-05-28T11:00:30.000Z (about 5 years ago)
Default Branch: main
Last Pushed: 2021-06-08T19:19:51.000Z (about 5 years ago)
Last Synced: 2025-03-28T01:55:27.835Z (over 1 year ago)
Topics: bs4, postgresql, python, scrapy, webscraping
Language: Jupyter Notebook
Homepage:
Size: 1.45 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

## Scrapy To Postgresql for ML / NLP

### Notes:

  - Run the spider with 'scrapy crawl', not via CrawlerProcess, as the pipelines don't work with CrawlerProcess

  - Make sure the columns in Postgres use a varchar() length large enough for long URLs

  - Written with Python 3.9.5

  - If you are coming from MySQL, change the port to 5432!

  - Keep the search small: the site uses 'load more' (not AJAX/JSON), so loading further results will not work from Scrapy

  - Clone Scrapy19, run it in a virtualenv, and use cj.sh (Cron Job Shell Script) via cron to run it daily
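
The notes on pipelines, varchar() width and port 5432 can be sketched as a minimal item pipeline. This is an illustration under stated assumptions, not the repository's actual pipeline: the class name, the `title` and `url` columns, the 2048-character limit and the injectable `connect` argument are all hypothetical. Only the `listings` table, the `posted` column and the `jobs` database name come from the psql sessions in this README, and psycopg2 is assumed as the driver.

```python
"""Sketch of a Scrapy item pipeline that writes listings to PostgreSQL."""

MAX_URL_LEN = 2048  # assumption: the url column is declared as varchar(2048)


class PostgresListingsPipeline:
    """Plain-class pipeline; column names other than `posted` are assumptions."""

    INSERT_SQL = "INSERT INTO listings (title, url, posted) VALUES (%s, %s, %s)"

    def __init__(self, connect=None, **conn_kwargs):
        # `connect` is injectable so the class can be exercised without a database.
        self._connect = connect
        self._conn_kwargs = {
            "host": "localhost",
            "port": 5432,  # PostgreSQL default port (MySQL's default is 3306)
            "dbname": "jobs",
            **conn_kwargs,  # e.g. user=..., password=...
        }
        self.conn = None

    def open_spider(self, spider):
        if self._connect is None:
            import psycopg2  # imported lazily; requires psycopg2 to be installed

            self._connect = psycopg2.connect
        self.conn = self._connect(**self._conn_kwargs)

    def close_spider(self, spider):
        if self.conn is not None:
            self.conn.close()

    def process_item(self, item, spider):
        url = item["url"]
        if len(url) > MAX_URL_LEN:
            # In a real project, scrapy.exceptions.DropItem would be the usual choice.
            raise ValueError(f"url is {len(url)} chars, column holds {MAX_URL_LEN}")
        cur = self.conn.cursor()
        try:
            cur.execute(self.INSERT_SQL, (item.get("title"), url, item.get("posted")))
        finally:
            cur.close()
        self.conn.commit()
        return item
```

A pipeline like this only runs once it is listed under `ITEM_PIPELINES` in the project's settings.py, with the module path matching wherever the class actually lives.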

#### Code to age out old records:

    jobs=# DELETE FROM listings
    WHERE posted < NOW() - interval '7 days';
    DELETE 0
    jobs=# DELETE FROM listings
    WHERE posted < NOW() - interval '5 days';
    DELETE 0
    jobs=# DELETE FROM listings
    WHERE posted < NOW() - interval '4 days';
    DELETE 10
    jobs=#
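
The same age-out could also be run from Python, for instance at the end of the daily cron job. The following is a minimal sketch, assuming a DB-API connection such as one returned by psycopg2; the function name `age_out` and the default of 7 days are illustrative choices, not part of the repository.

```python
def age_out(conn, days=7):
    """Delete listings older than `days` days and return how many rows went."""
    if days < 0:
        raise ValueError("days must be non-negative")
    # The day count is sent as a bound parameter (psycopg2 "%s" style) and
    # multiplied by a one-day interval on the server side.
    sql = "DELETE FROM listings WHERE posted < NOW() - (%s * INTERVAL '1 day')"
    cur = conn.cursor()
    try:
        cur.execute(sql, (days,))
        deleted = cur.rowcount  # the same count psql prints as "DELETE n"
    finally:
        cur.close()
    conn.commit()
    return deleted
```

Passing the number of days as a parameter avoids building the interval string by hand.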

    

#### Check size of a column:

    jobs=# select
    sum(pg_column_size(posted)) as total_size,
    avg(pg_column_size(posted)) as average_size,
    sum(pg_column_size(posted)) * 100.0 / pg_relation_size('listings') as percentage
    from listings;
    total_size |    average_size    |       percentage
    -----------+--------------------+------------------------
           136 | 4.0000000000000000 | 0.23716517857142857143
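
The three figures can be sanity-checked with a little arithmetic. The sketch below only reworks the numbers shown in the output; the 8192-byte page size is PostgreSQL's default block size and is an assumption about this particular server.

```python
total_size = 136                      # sum(pg_column_size(posted)), in bytes
average_size = 4.0                    # avg(pg_column_size(posted)), in bytes
percentage = 0.23716517857142857143   # share of pg_relation_size('listings')

# Rows counted by the query: total bytes divided by bytes per value.
rows = round(total_size / average_size)

# Invert the percentage formula to recover pg_relation_size('listings').
relation_bytes = round(total_size * 100.0 / percentage)

# Number of table pages, assuming the default 8 kB block size.
pages = relation_bytes / 8192

print(rows, relation_bytes, pages)  # 34 57344 7.0
```

So the query covered 34 rows in a table of 57344 bytes, and the `posted` column accounts for well under 1% of it. A 4-byte average is consistent with `posted` being a `date` column, although that is an inference from the output rather than something the README states.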
