{"id":29360024,"url":"https://github.com/labic/ze-the-scraper","last_synced_at":"2025-07-09T07:09:56.052Z","repository":{"id":18805744,"uuid":"85353770","full_name":"labic/ze-the-scraper","owner":"labic","description":null,"archived":false,"fork":false,"pushed_at":"2022-12-27T15:00:26.000Z","size":11495,"stargazers_count":5,"open_issues_count":13,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-06-02T05:24:01.516Z","etag":null,"topics":["brazil","crawler","mongodb","news","newspaper","portals","scraper"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/labic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-17T20:52:46.000Z","updated_at":"2023-07-24T13:02:51.000Z","dependencies_parsed_at":"2023-01-13T20:01:15.154Z","dependency_job_id":null,"html_url":"https://github.com/labic/ze-the-scraper","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/labic/ze-the-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labic%2Fze-the-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labic%2Fze-the-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labic%2Fze-the-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labic%2Fze-the-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/labic","download_url":"https://codeload.github.com/labic/ze-the-scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/labic%2Fze-the-scraper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263641317,"owners_count":23493406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["brazil","crawler","mongodb","news","newspaper","portals","scraper"],"created_at":"2025-07-09T07:09:02.529Z","updated_at":"2025-07-09T07:09:56.041Z","avatar_url":"https://github.com/labic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Zé The Scraper\n[![Build Status](https://travis-ci.org/labic/ze-the-scraper.svg?branch=develop)](https://travis-ci.org/labic/ze-the-scraper)\n\n## Install\n\n- [Install Berkeley DB](http://www.linuxfromscratch.org/blfs/view/7.9/server/db.html)\n\n## Limitations\n\n - Articles (`article`) are listed in order of collection date (`dateCreated`). However, an article can be treated as updated and collected again, which causes the collection date and the publication date (`datePublished`) to diverge.\n\n## Usage\n\n### Crawling using a single spider and a single URL\n```shell\nscrapy crawl \u003cspider_name\u003e -a url='http(s)://someurl.com?query1=a\u0026query2=b'\n```\n\n### Crawling using a single spider with URLs extracted from Google\n```shell\nscrapy crawl \u003cspider_name\u003e -a search='{\n  \"query\": \"Enem OR \\\"Exame Nacional * Ensino Médio\\\"\",\n  \"regex\": \"(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio\",\n  \"engine\": \"google\",\n  \"dateRestrict\": \"d1\",\n  \"results_per_page\": 50,\n  \"pages\": 2\n}' 
\n```\n\n### Crawling using all spiders with URLs extracted from Google\n```shell\nscrapy crawl all -a search='{\n  \"query\": \"Enem OR \\\"Exame Nacional * Ensino Médio\\\"\",\n  \"regex\": \"(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio\",\n  \"engine\": \"google\",\n  \"dateRestrict\": \"d1\",\n  \"results_per_page\": 50,\n  \"pages\": 2\n}'\n\nscrapy crawl all \\\n-a search=google \\\n-a query=\"Enem OR \\\"Exame Nacional * Ensino Médio\\\"\" \\\n-a regex=\"(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio\" \\\n-a dateRestrict=d1\n```\n\n## References\n\n - http://xpo6.com/list-of-english-stop-words/\n - [Scrapy - Docs | Jobs: pausing and resuming crawls](https://doc.scrapy.org/en/latest/topics/jobs.html?highlight=scheduler)\n - [scrapy.extensions.memusage](https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/memusage.py)\n   This is good code to extend: override the `_send_report` method to send reports to services other than email.\n\n\n## TODO:\n\n- [ ] Implement DeltaFetch middleware\n- [ ] Decompose class `.n--noticia__newsletter` in the estadao spider\n- [ ] Use https://github.com/codelucas/newspaper\n\n## Ideas\n\n### Relation DB Schema\n\nhttps://cloud.google.com/bigtable/docs/schema-design\n\n| Row key | Column data |\n| --- | --- |\n| INEP | NEWS:EDUCACAO (V1 03/01/15):558.40 |\n\nUse this:\n- TinyDB CodernityDB\n- https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition/\n- https://helpdesk.scrapinghub.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon\n- https://helpdesk.scrapinghub.com/support/solutions/articles/22000200418-magic-fields-addon\n- https://helpdesk.scrapinghub.com/support/solutions/articles/22000200411-delta-fetch-addon\n\n### lambda\n\n```python\nfrom enum import Enum\n\n# Older Scrapy versions expose this as scrapy.loader.processors.TakeFirst.\nfrom itemloaders.processors import TakeFirst\n\n\nclass AVRO_FIELD_TYPE(Enum):\n    str = 'STRING'\n    list = 'RECORD'\n    int = 'INTEGER'\n    bool = 'BOOLEAN'\n\nf_avro = lambda ft, md='NULLABLE', fd=[]: { 'avro': { \n    # 'field_type': ft.upper() if ft else 
AVRO_FIELD_TYPE[type(ft)], \n    'field_type': ft.upper(), \n    'mode': md, \n    'fields': fd } }\n\n@property\ndef identifier(self):\n    # Default to TakeFirst() when no output processor is set.\n    self['output_processor'] = self.get('output_processor') or TakeFirst()\n    if not hasattr(self, 'schemas'):\n        self['schemas'] = self.f_avro('STRING', 'NULLABLE', [])\n\n    return self\n\n@identifier.setter\ndef identifier(self, value):\n    # Keep the current output processor, or fall back to TakeFirst().\n    self['output_processor'] = self.get('output_processor') or TakeFirst()\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabic%2Fze-the-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flabic%2Fze-the-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabic%2Fze-the-scraper/lists"}