https://github.com/gilzoide/pparker

Spiders that fetch news

Host: GitHub
URL: https://github.com/gilzoide/pparker
Owner: gilzoide
Created: 2017-03-16T05:12:54.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-06-11T23:24:10.000Z (almost 8 years ago)
Last Synced: 2025-02-13T22:19:09.086Z (3 months ago)
Topics: scraping, scrapy-crawler, scrapy-spider, web-crawling, web-scraping
Language: Python
Homepage:
Size: 18.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.rst

README

PParker
=======

Spiders that fetch news using scrapy_. Articles are taken from the websites of
the magazines `Galileu`_, `Super Interessante`_ and `Mundo Educação`_.

.. _scrapy: https://scrapy.org/

.. _python 3: https://www.python.org/

.. _Galileu: http://revistagalileu.globo.com/

.. _Super Interessante: http://super.abril.com.br/

.. _Mundo Educação: http://mundoeducacao.bol.uol.com.br/

Dependencies
============

- `python 3`_

- scrapy_

How to run
==========

There is one spider for each magazine. To run them all, use the following
commands::

    $ scrapy crawl galileu
    $ scrapy crawl super
    $ scrapy crawl mundoeducacao

Note that, for now, PParker fetches only 20 articles, to make testing easier.
To download every available article (which takes a while), use the following
commands, which set Scrapy's ``DEPTH_LIMIT`` setting to 0 (no limit)::

    $ scrapy crawl -s DEPTH_LIMIT=0 galileu
    $ scrapy crawl -s DEPTH_LIMIT=0 super
    $ scrapy crawl -s DEPTH_LIMIT=0 mundoeducacao

To change the destination folder for the articles, use the ``DIRETORIO_SAIDA``
setting::

    $ scrapy crawl -s DIRETORIO_SAIDA=caminho_das_noticias galileu
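
The commands above can also be assembled from a script. The following is a
minimal sketch, not part of PParker itself: ``build_crawl_command``,
``run_all`` and the ``SPIDERS`` tuple are hypothetical names, and it assumes
that the ``scrapy`` executable is on the ``PATH`` and that the script is run
from the project root.

.. code-block:: python

    import subprocess

    # Spider names taken from the commands above.
    SPIDERS = ("galileu", "super", "mundoeducacao")


    def build_crawl_command(spider, **settings):
        """Return the argument list for ``scrapy crawl`` with ``-s`` overrides."""
        command = ["scrapy", "crawl"]
        for name, value in settings.items():
            # Each override is passed as "-s NAME=value".
            command += ["-s", f"{name}={value}"]
        command.append(spider)
        return command


    def run_all(**settings):
        """Run every spider in sequence, stopping at the first failure."""
        for spider in SPIDERS:
            subprocess.run(build_crawl_command(spider, **settings), check=True)

For example, ``run_all(DEPTH_LIMIT=0, DIRETORIO_SAIDA="caminho_das_noticias")``
would run a full crawl of all three sites into a custom folder. Launching each
spider as a separate process keeps the sketch independent of Scrapy's Python
API.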

Output
======

The collected articles are stored in the "noticias" folder, in subfolders
specific to each magazine and its sections. Each file is an individual
article.
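
To inspect the result of a crawl, the output tree can be walked with the
standard library. This is only a sketch under an assumption: it supposes the
layout is ``<output folder>/<magazine>/<section>/<file>``, which the
description above suggests but does not spell out. ``count_articles`` is a
hypothetical helper, not part of PParker.

.. code-block:: python

    from collections import Counter
    from pathlib import Path


    def count_articles(output_dir="noticias"):
        """Count article files per (magazine, section) pair.

        Assumes the layout <output_dir>/<magazine>/<section>/<file>.
        """
        counts = Counter()
        root = Path(output_dir)
        # Match entries exactly three levels below the root:
        # magazine, section, file.
        for path in root.glob("*/*/*"):
            if path.is_file():
                magazine, section = path.relative_to(root).parts[:2]
                counts[(magazine, section)] += 1
        return counts

Calling ``count_articles()`` after a crawl returns a ``Counter`` keyed by
``(magazine, section)``. If the folder does not exist yet, the result is
simply empty.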

Trivia
======

**Why PParker?**

It's a spider that fetches news. Who does that remind you of? =P
