{"id":18076566,"url":"https://github.com/strongbugman/ant_nest","last_synced_at":"2025-06-30T06:36:03.855Z","repository":{"id":55417446,"uuid":"112139805","full_name":"strongbugman/ant_nest","owner":"strongbugman","description":"Simple, clear and fast Web Crawler framework build on python3.6+, powered by asyncio.","archived":false,"fork":false,"pushed_at":"2022-07-17T05:33:59.000Z","size":247,"stargazers_count":95,"open_issues_count":0,"forks_count":19,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-06-11T06:57:43.725Z","etag":null,"topics":["asyncio","early-development","framework","python36","spider"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/strongbugman.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-27T02:55:53.000Z","updated_at":"2024-07-18T07:11:57.000Z","dependencies_parsed_at":"2022-08-14T23:50:50.611Z","dependency_job_id":null,"html_url":"https://github.com/strongbugman/ant_nest","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"purl":"pkg:github/strongbugman/ant_nest","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strongbugman%2Fant_nest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strongbugman%2Fant_nest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strongbugman%2Fant_nest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strongbugman%2Fant_nest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/strongbugman","
download_url":"https://codeload.github.com/strongbugman/ant_nest/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strongbugman%2Fant_nest/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262724883,"owners_count":23354358,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","early-development","framework","python36","spider"],"created_at":"2024-10-31T11:10:22.417Z","updated_at":"2025-06-30T06:36:03.830Z","avatar_url":"https://github.com/strongbugman.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"========\nAntNest\n========\n\n.. image:: https://img.shields.io/pypi/v/ant_nest.svg\n   :target: https://pypi.python.org/pypi/ant_nest\n\n.. image:: https://img.shields.io/travis/strongbugman/ant_nest/master.svg\n   :target: https://travis-ci.org/strongbugman/ant_nest\n\n.. 
image:: https://codecov.io/gh/strongbugman/ant_nest/branch/master/graph/badge.svg\n  :target: https://codecov.io/gh/strongbugman/ant_nest\n\nOverview\n========\n\nAntNest is a simple, clear and fast web crawler framework built on Python 3.6+ and powered by asyncio.\nIts core is only 600+ lines of code, thanks to powerful libraries such as aiohttp and lxml.\n\nFeatures\n========\n\n* Useful HTTP client out of the box\n* Easy pipelines, async or not\n* Easy item extractor: define data details (by xpath, jpath or regex) and extract from HTML, JSON or strings\n* Easy async workflow with a built-in async task pool\n\nInstall\n=======\n::\n\n    pip install ant_nest\n\nUsage\n=====\n\nCreate one demo project::\n\n    \u003e\u003e\u003e ant_nest -c examples\n\nThen we get a project::\n\n    drwxr-xr-x   5 bruce  staff  160 Jun 30 18:24 ants\n    -rw-r--r--   1 bruce  staff  208 Jun 26 22:59 settings.py\n\nSuppose we want to get hot repos from GitHub. Let's create \"examples/ants/example2.py\"::\n\n    from yarl import URL\n    from ant_nest.ant import Ant\n    from ant_nest.pipelines import ItemFieldReplacePipeline\n    from ant_nest.items import Extractor\n\n\n    class GithubAnt(Ant):\n        \"\"\"Crawl trending repositories from GitHub\"\"\"\n\n        item_pipelines = [\n            ItemFieldReplacePipeline(\n                (\"meta_content\", \"star\", \"fork\"), excess_chars=(\"\\r\", \"\\n\", \"\\t\", \"  \")\n            )\n        ]\n\n        def __init__(self):\n            super().__init__()\n            self.item_extractor = Extractor(dict)\n            self.item_extractor.add_extractor(\n                \"title\", lambda x: x.html_element.xpath(\"//h1/strong/a/text()\")[0]\n            )\n            self.item_extractor.add_extractor(\n                \"author\", lambda x: x.html_element.xpath(\"//h1/span/a/text()\")[0]\n            )\n            self.item_extractor.add_extractor(\n                \"meta_content\",\n                lambda x: \"\".join(\n            
        x.html_element.xpath(\n                        '//div[@class=\"repository-content \"]/div[2]//text()'\n                    )\n                ),\n            )\n            self.item_extractor.add_extractor(\n                \"star\",\n                lambda x: x.html_element.xpath(\n                    '//a[@class=\"social-count js-social-count\"]/text()'\n                )[0],\n            )\n            self.item_extractor.add_extractor(\n                \"fork\",\n                lambda x: x.html_element.xpath('//a[@class=\"social-count\"]/text()')[0],\n            )\n            self.item_extractor.add_extractor(\"origin_url\", lambda x: str(x.url))\n\n        async def crawl_repo(self, url):\n            \"\"\"Crawl information from one repo\"\"\"\n            response = await self.request(url)\n            # extract the item from the response\n            item = self.item_extractor.extract(response)\n            item[\"origin_url\"] = response.url\n\n            await self.collect(item)  # send the item through the pipelines (to be cleaned)\n            self.logger.info(\"*\" * 70 + \"I got one hot repo!\\n\" + str(item))\n\n        async def run(self):\n            \"\"\"App entry point, our playground\"\"\"\n            response = await self.request(\"https://github.com/explore\")\n            for url in response.html_element.xpath(\n                \"/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/\"\n                \"@href\"\n            ):\n                # crawl many repos with our coroutine pool\n                self.schedule_task(self.crawl_repo(response.url.join(URL(url))))\n            self.logger.info(\"Waiting...\")\n\n\nThen we can list all the ants we defined (in \"examples\") ::\n\n    \u003e\u003e\u003e ant_nest -l\n    ants.example2.GithubAnt\n\nRun it! 
(without debug log)::\n\n    \u003e\u003e\u003e ant_nest -a ants.example2.GithubAnt\n    INFO:GithubAnt:Opening\n    INFO:GithubAnt:Waiting...\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'app-ideas', 'author': 'florinpop17', 'meta_content': 'A Collection of application ideas which can be used to improve your coding skills.', 'star': '11.7k', 'fork': '500', 'origin_url': URL('https://github.com/florinpop17/app-ideas')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'Carbon', 'author': 'briannesbitt', 'meta_content': 'A simple PHP API extension for DateTime.https://carbon.nesbot.com/', 'star': '14k', 'fork': '249', 'origin_url': URL('https://github.com/briannesbitt/Carbon')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'org-roam', 'author': 'jethrokuan', 'meta_content': 'Rudimentary Roam replica with Org-modehttps://org-roam.readthedocs.io/en/la…', 'star': '261', 'fork': '27', 'origin_url': URL('https://github.com/jethrokuan/org-roam')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'joplin', 'author': 'laurent22', 'meta_content': 'Joplin - an open source note taking and to-do application with synchronization capabilities for Windows, macOS, Linux, Android and iOS. 
Forum: https://discourse.joplinapp.org/https://joplinapp.org', 'star': '13k', 'fork': '335', 'origin_url': URL('https://github.com/laurent22/joplin')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'snoop', 'author': 'snooppr', 'meta_content': 'Snoop — инструмент разведки на основе открытых данных', 'star': '281', 'fork': '9', 'origin_url': URL('https://github.com/snooppr/snoop')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': '1on1-questions', 'author': 'VGraupera', 'meta_content': 'Mega list of 1 on 1 meeting questions compiled from a variety to sources', 'star': '4k', 'fork': '93', 'origin_url': URL('https://github.com/VGraupera/1on1-questions')}\n    INFO:GithubAnt:Get 8 Request in total with 8/60s rate\n    INFO:GithubAnt:Get 7 Response in total with 7/60s rate\n    INFO:GithubAnt:Get 6 dict in total with 6/60s rate\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'python-small-examples', 'author': 'jackzhenguo', 'meta_content': 'Python有趣的小例子一网打尽。Python基础、Python坑点、Python字符串和正则、Python绘图、Python日期和文件、Web开发、数据科学、机器学习、深度2.4k', 'fork': '102', 'origin_url': URL('https://github.com/jackzhenguo/python-small-examples')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'system-design-primer', 'author': 'donnemartin', 'meta_content': 'Learn how to design large-scale systems. Prep for the system design interview. 
Includes Anki flashcards.', 'star': '83.2k', 'fork': '4.4k', 'origin_url': URL('https://github.com/donnemartin/system-design-primer')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'awesome-scalability', 'author': 'binhnguyennus', 'meta_content': 'The Patterns of Scalable, Reliable, and Performant Large-Scale Systemshttp://awesome-scalability.com/', 'star': '24.5k', 'fork': '1.4k', 'origin_url': URL('https://github.com/binhnguyennus/awesome-scalability')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'gdb-frontend', 'author': 'rohanrhu', 'meta_content': '☕ GDBFrontend is an easy, flexible and extensionable gui debugger.https://oguzhaneroglu.com/projects/gd…', 'star': '716', 'fork': '14', 'origin_url': URL('https://github.com/rohanrhu/gdb-frontend')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'Complete-Python-3-Bootcamp', 'author': 'Pierian-Data', 'meta_content': 'Course Files for Complete Python 3 Bootcamp Course on Udemy', 'star': '8.1k', 'fork': '1.8k', 'origin_url': URL('https://github.com/Pierian-Data/Complete-Python-3-Bootcamp')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'leon', 'author': 'leon-ai', 'meta_content': '\\U0001f9e0 Leon is your open-source personal assistant.https://getleon.ai', 'star': '6.3k', 'fork': '147', 'origin_url': URL('https://github.com/leon-ai/leon')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'esbuild', 'author': 'evanw', 'meta_content': 'An extremely fast JavaScript bundler and minifier', 'star': '2.3k', 'fork': '38', 'origin_url': URL('https://github.com/evanw/esbuild')}\n    
INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'wearable-microphone-jamming', 'author': 'y-x-c', 'meta_content': 'Repository for our paper Wearable Microphone Jamminghttp://sandlab.cs.uchicago.edu/jammer/', 'star': '138', 'fork': '10', 'origin_url': URL('https://github.com/y-x-c/wearable-microphone-jamming')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'efcore', 'author': 'dotnet', 'meta_content': 'EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.https://docs.microsoft.com/ef/core/', 'star': '8.7k', 'fork': '965', 'origin_url': URL('https://github.com/dotnet/efcore')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'playwright', 'author': 'microsoft', 'meta_content': 'Node library to automate Chromium, Firefox and WebKit with a single APIhttps://www.npmjs.com/package/playwright', 'star': '9k', 'fork': '92', 'origin_url': URL('https://github.com/microsoft/playwright')}\n    INFO:GithubAnt:Get 18 Request in total with 10/60s rate\n    INFO:GithubAnt:Get 17 Response in total with 10/60s rate\n    INFO:GithubAnt:Get 16 dict in total with 10/60s rate\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'degoogle', 'author': 'tycrek', 'meta_content': 'A huge list of alternatives to Google products. 
Privacy tips, tricks, and links.https://degoogle.jmoore.dev', 'star': '2k', 'fork': '50', 'origin_url': URL('https://github.com/tycrek/degoogle')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'sherlock', 'author': 'sherlock-project', 'meta_content': '🔎 Hunt down social media accounts by username across social networkshttp://sherlock-project.github.io', 'star': '10.4k', 'fork': '207', 'origin_url': URL('https://github.com/sherlock-project/sherlock')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'the-art-of-command-line', 'author': 'jlevy', 'meta_content': 'Master the command line, in one page', 'star': '68.9k', 'fork': '2.2k', 'origin_url': URL('https://github.com/jlevy/the-art-of-command-line')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'freespeech', 'author': 'Merkie', 'meta_content': 'A free program designed to help the non-verbal.', 'star': '168', 'fork': '20', 'origin_url': URL('https://github.com/Merkie/freespeech')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'awesome-pentest', 'author': 'enaqx', 'meta_content': 'A collection of awesome penetration testing resources, tools and other shiny things', 'star': '11.4k', 'fork': '1k', 'origin_url': URL('https://github.com/enaqx/awesome-pentest')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'trax', 'author': 'google', 'meta_content': 'Trax — your path to advanced deep learning', 'star': '2.7k', 'fork': '90', 'origin_url': URL('https://github.com/google/trax')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'introtodeeplearning', 'author': 
'aamini', 'meta_content': 'Lab Materials for MIT 6.S191: Introduction to Deep Learning', 'star': '1.6k', 'fork': '116', 'origin_url': URL('https://github.com/aamini/introtodeeplearning')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': 'CleanArchitecture', 'author': 'ardalis', 'meta_content': 'A starting point for Clean Architecture with ASP.NET Core', 'star': '3.8k', 'fork': '300', 'origin_url': URL('https://github.com/ardalis/CleanArchitecture')}\n    INFO:GithubAnt:**********************************************************************I got one hot repo!\n    {'title': '3y', 'author': 'ZhongFuCheng3y', 'meta_content': '📓从Java基础、JavaWeb基础到常用的框架再到面试题都有完整的教程，几乎涵盖了Java后端必备的知识点', 'star': '5.1k', 'fork': '285', 'origin_url': URL('https://github.com/ZhongFuCheng3y/3y')}\n    INFO:GithubAnt:Closed\n    INFO:GithubAnt:Get 26 Request in total\n    INFO:GithubAnt:Get 26 Response in total\n    INFO:GithubAnt:Get 25 dict in total\n    INFO:GithubAnt:Run GithubAnt in 180.234251 seconds\n\n\nAbout Item\n==========\n\nThe examples store each item in a dict, but many item types are supported:\ndict, plain class, attrs class, data class and ORM class. Choose whichever fits your needs.\n\nExamples\n========\n\nYou can find some examples in \"./examples\".\n\nDefects\n=======\n\n* Complex exception handling\n\nAn exception in one coroutine will break the await chain, especially in a loop, unless we handle it by hand. 
e.g.::\n\n    for cor in self.as_completed((self.crawl(url) for url in self.urls)):\n        try:\n            await cor\n        except Exception:  # an await chain may raise many kinds of exception\n            pass\n\nBut we can use \"self.as_completed_with_async\" now, e.g.::\n\n    async for result in self.as_completed_with_async(\n        (self.crawl(url) for url in self.urls), raise_exception=False\n    ):\n        # exceptions in \"self.crawl(url)\" will be swallowed and logged automatically\n        self.handle(result)\n\n* High memory usage\n\nIt's a \"feature\" that asyncio uses a lot of memory, especially with highly concurrent IO. We can simply set a\nconcurrency limit (\"connection_limit\" or \"concurrent_limit\"), but it's hard to find the balance between performance and the limit.\n\n\nCoding style\n============\n\nFollow \"Flake8\", format with \"Black\", type check with \"MyPy\"; see the Makefile for more detail.\n\n\nTodo\n====\n\n[*] Log system\n[*] Nested item extractor\n[ ] Docs\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstrongbugman%2Fant_nest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstrongbugman%2Fant_nest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstrongbugman%2Fant_nest/lists"}