{"id":13419568,"url":"https://github.com/elliotgao2/gain","last_synced_at":"2025-04-08T13:02:26.317Z","repository":{"id":49355217,"uuid":"92926018","full_name":"elliotgao2/gain","owner":"elliotgao2","description":"Web crawling framework  based on asyncio.","archived":false,"fork":false,"pushed_at":"2019-06-01T22:54:09.000Z","size":219,"stargazers_count":2039,"open_issues_count":8,"forks_count":207,"subscribers_count":74,"default_branch":"master","last_synced_at":"2025-04-06T11:05:22.205Z","etag":null,"topics":["aiohttp","asyncio","crawler","python","spider","uvloop"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elliotgao2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-31T08:56:04.000Z","updated_at":"2025-03-25T03:57:54.000Z","dependencies_parsed_at":"2022-09-26T20:20:40.489Z","dependency_job_id":null,"html_url":"https://github.com/elliotgao2/gain","commit_stats":null,"previous_names":["elliotgao2/gain","gaojiuli/gain"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elliotgao2%2Fgain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elliotgao2%2Fgain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elliotgao2%2Fgain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elliotgao2%2Fgain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elliotgao2","download_url":"https://codeload.github.com/elliotgao2/gain/tar.gz/refs/heads/master","host":{"name":"GitHu
b","url":"https://github.com","kind":"github","repositories_count":247847598,"owners_count":21006098,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aiohttp","asyncio","crawler","python","spider","uvloop"],"created_at":"2024-07-30T22:01:17.797Z","updated_at":"2025-04-08T13:02:26.278Z","avatar_url":"https://github.com/elliotgao2.png","language":"Python","funding_links":[],"categories":["Python","🤖 AI \u0026 Machine Learning"],"sub_categories":[],"readme":"# \u003cimg width=\"200\" height=\"200\" src=\"img/logo.png\"/\u003e\n\n[![Build](https://travis-ci.org/gaojiuli/gain.svg?branch=master)](https://travis-ci.org/gaojiuli/gain)\n[![Python](https://img.shields.io/pypi/pyversions/gain.svg)](https://pypi.python.org/pypi/gain/)\n[![Version](https://img.shields.io/pypi/v/gain.svg)](https://pypi.python.org/pypi/gain/)\n[![License](https://img.shields.io/pypi/l/gain.svg)](https://pypi.python.org/pypi/gain/)\n\nWeb crawling framework for everyone. Written with `asyncio`, `uvloop` and `aiohttp`.\n\n![](img/architecture.png)\n\n## Requirements\n\n- Python3.5+\n\n## Installation\n\n`pip install gain`\n\n`pip install uvloop` (Only linux)\n\n## Usage\n\n1. 
Write `spider.py`:\n\n```python\nfrom gain import Css, Item, Parser, Spider\nimport aiofiles\n\n\nclass Post(Item):\n    title = Css('.entry-title')\n    content = Css('.entry-content')\n\n    async def save(self):\n        # Append each post title to a local file.\n        async with aiofiles.open('scrapinghub.txt', 'a+') as f:\n            await f.write(self.results['title'])\n\n\nclass MySpider(Spider):\n    concurrency = 5\n    headers = {'User-Agent': 'Google Spider'}\n    start_url = 'https://blog.scrapinghub.com/'\n    # Raw strings keep the regex escapes intact.\n    parsers = [Parser(r'https://blog.scrapinghub.com/page/\\d+/'),\n               Parser(r'https://blog.scrapinghub.com/\\d{4}/\\d{2}/\\d{2}/[a-z0-9\\-]+/', Post)]\n\n\nMySpider.run()\n```\n\nOr use `XPathParser`:\n\n```python\nfrom gain import Css, Item, XPathParser, Spider\n\n\nclass Post(Item):\n    title = Css('.breadcrumb_last')\n\n    async def save(self):\n        print(self.title)\n\n\nclass MySpider(Spider):\n    start_url = 'https://mydramatime.com/europe-and-us-drama/'\n    concurrency = 5\n    headers = {'User-Agent': 'Google Spider'}\n    parsers = [\n        XPathParser('//span[@class=\"category-name\"]/a/@href'),\n        XPathParser('//div[contains(@class, \"pagination\")]/ul/li/a[contains(@href, \"page\")]/@href'),\n        XPathParser('//div[@class=\"mini-left\"]//div[contains(@class, \"mini-title\")]/a/@href', Post),\n    ]\n    proxy = 'https://localhost:1234'\n\n\nMySpider.run()\n```\n\nYou can add a proxy setting to the spider, as shown above.\n\n2. Run `python spider.py`\n\n3. Result:\n\n![](img/sample.png)\n\n## Example\n\nThe examples are in the `/example/` directory.\n\n## Contribution\n\n- Pull request.\n- Open issue.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felliotgao2%2Fgain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felliotgao2%2Fgain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felliotgao2%2Fgain/lists"}