{"id":17563974,"url":"https://github.com/howie6879/talospider","last_synced_at":"2025-10-25T05:48:52.524Z","repository":{"id":57473277,"uuid":"93219740","full_name":"howie6879/talospider","owner":"howie6879","description":"talospider - A simple,lightweight scraping micro-framework","archived":false,"fork":false,"pushed_at":"2019-02-22T06:55:48.000Z","size":178,"stargazers_count":55,"open_issues_count":0,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-10-05T00:33:32.484Z","etag":null,"topics":["crawler","crawling","python","spider","web-spider"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/howie6879.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-03T03:04:28.000Z","updated_at":"2024-11-23T20:17:56.000Z","dependencies_parsed_at":"2022-09-26T22:11:34.187Z","dependency_job_id":null,"html_url":"https://github.com/howie6879/talospider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/howie6879/talospider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Ftalospider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Ftalospider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Ftalospider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Ftalospider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/howie6879","download_url":"https://codeload.github.com/howie6879/talospider/tar.g
z/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Ftalospider/sbom","scorecard":{"id":469919,"data":{"date":"2025-08-11","repo":{"name":"github.com/howie6879/talospider","commit":"da4f0bdc6f6046c306be5c36d9016b74794823b0"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.3,"checks":[{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer 
integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"11 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-55x5-fj6c-h6m8","Warn: Project is vulnerable to: PYSEC-2014-9 / GHSA-57qw-cc2g-pv5p","Warn: Project is vulnerable to: PYSEC-2021-19 / GHSA-jq4v-f5q6-mjqq","Warn: Project is vulnerable to: GHSA-pgww-xf46-h92r","Warn: Project is vulnerable to: PYSEC-2022-230 / GHSA-wrxv-2j5q-m38w","Warn: Project is vulnerable to: PYSEC-2018-12 / GHSA-xp26-p53h-6h2p","Warn: Project is vulnerable to: PYSEC-2014-14 / GHSA-652x-xj99-gmcc","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2014-13 / GHSA-cfj3-7x9c-4p3h","Warn: Project is 
vulnerable to: PYSEC-2018-28 / GHSA-x84v-xcm2-53pg"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-19T13:36:43.778Z","repository_id":57473277,"created_at":"2025-08-19T13:36:43.778Z","updated_at":"2025-08-19T13:36:43.778Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280911394,"owners_count":26412209,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-25T02:00:06.499Z","response_time":81,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","python","spider","web-spider"],"created_at":"2024-10-21T13:10:37.881Z","updated_at":"2025-10-25T05:48:52.474Z","avatar_url":"https://github.com/howie6879.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## talospider\n\n![travis](https://travis-ci.org/howie6879/talospider.svg?branch=master) [![PyPI](https://img.shields.io/pypi/v/talospider.svg)](https://pypi.python.org/pypi/talospider/)\n\n### 1. Why write this?\n\n\u003e Some simple pages do not need a large framework to crawl, yet writing everything by hand is tedious. talospider is suited to writing single-page crawlers.\n\u003e\n\u003e A micro crawling framework: small, convenient, and good for practice and learning.\n\nSo `talospider` was written to meet this need:\n\n- 1. Item extraction for a single page - see the details [here](./docs/item.md)\n- 2. The spider module - 
see the details [here](./docs/spider.md)\n\n**Note: this project is deprecated. If you need this functionality, please switch to [ruia](https://github.com/howie6879/ruia), my newly written asynchronous framework.**\n\n### 2. Introduction \u0026\u0026 usage\n\n![process](./docs/process.png)\n\n#### Installation\n\n```shell\npip install talospider\n```\n\n#### 2.1. item\n\nThis module can be used on its own. For sites whose requests are simple (for example, sites that only need a `get` request), this module alone is enough to quickly write the crawler you want. For example (the code below uses Python 3; for Python 2, see the examples directory):\n\n##### 2.1.1. Single page, single target\n\nFor example, to get the book information, cover, and other details from http://book.qidian.com/info/1004608738, you can write:\n\n```python\nfrom pprint import pprint\nfrom talospider import Item, TextField, AttrField\n\nclass QidianSpider(Item):\n    title = TextField(css_select='.book-info\u003eh1\u003eem')\n    author = TextField(css_select='a.writer')\n    cover = AttrField(css_select='a#bookImg\u003eimg', attr='src')\n\n    def tal_title(self, title):\n        return title\n\n    def tal_cover(self, cover):\n        return 'http:' + cover\n\nif __name__ == '__main__':\n    item_data = QidianSpider.get_item(url='http://book.qidian.com/info/1004608738')\n    pprint(item_data)\n```\n\nSee [qidian_details_by_item.py](./examples/qidian_details_by_item.py) for the full example.\n\n##### 2.1.2. Single page, multiple targets\n\nFor example, to get the 25 movies shown on the first page of the [Douban Top 250 movies](https://movie.douban.com/top250) list (one page with 25 targets), you can write:\n\n```python\nfrom pprint import pprint\nfrom talospider import Item, TextField, AttrField\n\nclass DoubanSpider(Item):\n    # Define an Item class that inherits from Item\n    target_item = TextField(css_select='div.item')\n    title = TextField(css_select='span.title')\n    cover = AttrField(css_select='div.pic\u003ea\u003eimg', attr='src')\n    abstract = TextField(css_select='span.inq')\n\n    def tal_title(self, title):\n        if isinstance(title, str):\n            return title\n        else:\n            return ''.join([i.text.strip().replace('\\xa0', '') for i in title])\n\nif __name__ == '__main__':\n    items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')\n    result = []\n    for item in items_data:\n        result.append({\n            'title': item.title,\n    
        'cover': item.cover,\n            'abstract': item.abstract,\n        })\n    pprint(result)\n```\n\nSee [douban_page_by_item.py](./examples/douban_page_by_item.py) for the full example.\n\n#### 2.2. spider\n\nWhen you need to crawl pages that have a hierarchy, such as all of the Douban Top 250 movies, the `spider` module comes in handy:\n\n```python\n#!/usr/bin/env python\nfrom talospider import AttrField, Request, Spider, Item, TextField\nfrom talospider.utils import get_random_user_agent\n\n\nclass DoubanItem(Item):\n    # Define an Item class that inherits from Item\n    target_item = TextField(css_select='div.item')\n    title = TextField(css_select='span.title')\n    cover = AttrField(css_select='div.pic\u003ea\u003eimg', attr='src')\n    abstract = TextField(css_select='span.inq')\n\n    def tal_title(self, title):\n        if isinstance(title, str):\n            return title\n        else:\n            return ''.join([i.text.strip().replace('\\xa0', '') for i in title])\n\n\nclass DoubanSpider(Spider):\n    # Start URLs (required)\n    start_urls = ['https://movie.douban.com/top250']\n    # Request configuration\n    request_config = {\n        'RETRIES': 3,\n        'DELAY': 0,\n        'TIMEOUT': 20\n    }\n\n    def parse(self, res):\n        # Parse function (required)\n        # Convert the HTML into an etree\n        etree = self.e_html(res.html)\n        # Extract the target values to build new URLs\n        pages = [i.get('href') for i in etree.cssselect('.paginator\u003ea')]\n        pages.insert(0, '?start=0\u0026filter=')\n        headers = {\n            \"User-Agent\": get_random_user_agent()\n        }\n        for page in pages:\n            url = self.start_urls[0] + page\n            yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)\n\n    def parse_item(self, res):\n        items_data = DoubanItem.get_items(html=res.html)\n        # result = []\n        for item in items_data:\n            # result.append({\n            #     'title': item.title,\n            #     'cover': item.cover,\n            #     'abstract': item.abstract,\n            # })\n            # Save the title\n            with 
open('douban250.txt', 'a+') as f:\n                f.write(item.title + '\\n')\n\n\nif __name__ == '__main__':\n    DoubanSpider.start()\n```\n\nConsole output:\n\n```shell\n2018-01-02 09:33:34 - [talospider ]: talospider started\n2018-01-02 09:33:35 - [downloading]: GET: https://movie.douban.com/top250\n2018-01-02 09:33:35 - [downloading]: GET: https://movie.douban.com/top250?start=0\u0026filter=\n2018-01-02 09:33:35 - [downloading]: GET: https://movie.douban.com/top250?start=25\u0026filter=\n2018-01-02 09:33:36 - [downloading]: GET: https://movie.douban.com/top250?start=50\u0026filter=\n2018-01-02 09:33:36 - [downloading]: GET: https://movie.douban.com/top250?start=75\u0026filter=\n2018-01-02 09:33:36 - [downloading]: GET: https://movie.douban.com/top250?start=100\u0026filter=\n2018-01-02 09:33:37 - [downloading]: GET: https://movie.douban.com/top250?start=125\u0026filter=\n2018-01-02 09:33:37 - [downloading]: GET: https://movie.douban.com/top250?start=150\u0026filter=\n2018-01-02 09:33:37 - [downloading]: GET: https://movie.douban.com/top250?start=175\u0026filter=\n2018-01-02 09:33:37 - [downloading]: GET: https://movie.douban.com/top250?start=200\u0026filter=\n2018-01-02 09:33:38 - [downloading]: GET: https://movie.douban.com/top250?start=225\u0026filter=\n2018-01-02 09:33:38 - [talospider ]: Time usage：0:00:03.367604\n```\n\nA `douban250.txt` file is then generated in the current directory. See [douban_page_by_spider.py](./examples/douban_page_by_spider.py) for the full example.\n\n### 3. Notes\n\nThis is a learning project, and much of it still needs improvement.\n\nExamples written with `talospider`:\n\n- [Baidu image crawler](https://github.com/howie6879/spider/blob/master/baidu_img/bd_img.py)\n- [Baidu Tieba image crawler](https://github.com/howie6879/spider/blob/master/baidu_img/tieba_img.py)\n- [Qidian novel information and rankings crawler](https://github.com/howie6879/spider/tree/master/qidian)\n- 
[Douban Top 250](https://github.com/howie6879/spider/tree/master/douban250)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhowie6879%2Ftalospider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhowie6879%2Ftalospider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhowie6879%2Ftalospider/lists"}