{"id":40224667,"url":"https://github.com/lewoudar/scalpel","last_synced_at":"2026-01-19T22:30:52.900Z","repository":{"id":38196011,"uuid":"242496238","full_name":"lewoudar/scalpel","owner":"lewoudar","description":"A fast and powerful web scraping library","archived":false,"fork":false,"pushed_at":"2024-07-05T22:33:50.000Z","size":633,"stargazers_count":43,"open_issues_count":12,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-23T11:46:44.933Z","etag":null,"topics":["anyio","asyncio","crawler","gevent","python","scalpel","trio","webscraping"],"latest_commit_sha":null,"homepage":"https://scalpel.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lewoudar.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-23T10:21:50.000Z","updated_at":"2025-07-05T05:45:43.000Z","dependencies_parsed_at":"2024-01-13T20:49:22.152Z","dependency_job_id":"c72ae612-3825-4bdc-b5c7-3e00d5e7d27d","html_url":"https://github.com/lewoudar/scalpel","commit_stats":{"total_commits":142,"total_committers":4,"mean_commits":35.5,"dds":"0.43661971830985913","last_synced_commit":"722a51ae0a77120a138bacaac04195492c35935b"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/lewoudar/scalpel","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewoudar%2Fscalpel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewoudar%2Fscalpel/tags","releases_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewoudar%2Fscalpel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewoudar%2Fscalpel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lewoudar","download_url":"https://codeload.github.com/lewoudar/scalpel/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewoudar%2Fscalpel/sbom","scorecard":{"id":586478,"data":{"date":"2025-08-11","repo":{"name":"github.com/lewoudar/scalpel","commit":"f9d3c266313b0edcd52256b8873f78969e12ea02"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.5,"checks":[{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 0/26 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily 
download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/ci.yml:1","Warn: no topLevel permission defined: .github/workflows/publish.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:26: update your workflow using https://app.stepsecurity.io/secureworkflow/lewoudar/scalpel/ci.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:28: update your workflow using https://app.stepsecurity.io/secureworkflow/lewoudar/scalpel/ci.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/ci.yml:42: update your workflow using https://app.stepsecurity.io/secureworkflow/lewoudar/scalpel/ci.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish.yml:14: update your workflow using https://app.stepsecurity.io/secureworkflow/lewoudar/scalpel/publish.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by 
hash: .github/workflows/publish.yml:16: update your workflow using https://app.stepsecurity.io/secureworkflow/lewoudar/scalpel/publish.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/ci.yml:33","Warn: pipCommand not pinned by hash: .github/workflows/publish.yml:21","Info:   0 out of   4 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   2 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release 
artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 8 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":0,"reason":"29 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2023-177 / GHSA-x7m3-jprg-wc5g","Warn: Project is vulnerable to: PYSEC-2024-60 / GHSA-jjg7-2v4v-x38h","Warn: Project is vulnerable to: PYSEC-2022-43167","Warn: Project is vulnerable to: PYSEC-2023-206","Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6","Warn: Project is vulnerable to: PYSEC-2024-230 / GHSA-248v-346w-9cwc","Warn: Project is vulnerable to: PYSEC-2022-42986 / GHSA-43fp-rhv2-5gv8","Warn: Project is vulnerable to: PYSEC-2023-135 / GHSA-xqr8-7jwr-rhp7","Warn: Project is vulnerable to: GHSA-vqfr-h8mv-ghfj","Warn: Project is 
vulnerable to: GHSA-cpwx-vrp4-4pq7","Warn: Project is vulnerable to: GHSA-gmj6-6f8f-6699","Warn: Project is vulnerable to: GHSA-h5c8-rqwp-cp95","Warn: Project is vulnerable to: GHSA-h75v-3vvj-5mfj","Warn: Project is vulnerable to: GHSA-q2x7-8rv6-6q7h","Warn: Project is vulnerable to: PYSEC-2022-42969","Warn: Project is vulnerable to: PYSEC-2023-117 / GHSA-mrwq-x4v8-fh7p","Warn: Project is vulnerable to: GHSA-jh85-wwv9-24hv","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: PYSEC-2025-49 / GHSA-5rjg-fvgr-3xxf","Warn: Project is vulnerable to: GHSA-cx63-2mw6-8hw5","Warn: Project is vulnerable to: PYSEC-2022-43012 / GHSA-r9hx-vwmv-q579","Warn: Project is vulnerable to: GHSA-34jh-p97f-mpxf","Warn: Project is vulnerable to: PYSEC-2023-212 / GHSA-g4mx-q9vg-27p4","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v","Warn: Project is vulnerable to: PYSEC-2023-192 / GHSA-v845-jxx5-vc9f","Warn: Project is vulnerable to: PYSEC-2024-187 / GHSA-rqc4-2hc7-8c8v","Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-20T20:38:43.679Z","repository_id":38196011,"created_at":"2025-08-20T20:38:43.679Z","updated_at":"2025-08-20T20:38:43.679Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28587238,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T20:45:59.482Z","status":"ssl_error","status_checked_at":"2026-01-19T20:45:41.500Z","response_time":67,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anyio","asyncio","crawler","gevent","python","scalpel","trio","webscraping"],"created_at":"2026-01-19T22:30:49.481Z","updated_at":"2026-01-19T22:30:52.888Z","avatar_url":"https://github.com/lewoudar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pyscalpel\n\n[![Pypi version](https://img.shields.io/pypi/v/pyscalpel.svg)](https://pypi.org/project/pyscalpel/)\n![](https://github.com/lewoudar/scalpel/workflows/CI/badge.svg)\n[![Coverage Status](https://codecov.io/gh/lewoudar/scalpel/branch/master/graphs/badge.svg?branch=master)](https://codecov.io/gh/lewoudar/scalpel)\n[![Documentation Status](https://readthedocs.org/projects/scalpel/badge/?version=latest)](https://scalpel.readthedocs.io/en/latest/?badge=latest)\n[![Code Style](https://img.shields.io/badge/code%20style-black-black)](https://github.com/lewoudar/scalpel)\n[![License Apache 2](https://img.shields.io/hexpm/l/plug.svg)](http://www.apache.org/licenses/LICENSE-2.0)\n\nYour easy-to-use, fast and powerful web scraping library.\n\n## Why?\n\nI already knew [scrapy](https://docs.scrapy.org/en/latest/), which is the reference for web scraping in Python, but\ntwo things bothered me.\n- I feel like scrapy cannot integrate into an existing project; you need to treat your web scraping code as a project\non its own.\n- It uses [Twisted](https://twistedmatrix.com/trac/), a veteran of asynchronous programming, but I think\n that there are better asynchronous frameworks today. 
Note that this second point is no longer true as I write\n this document, since scrapy has added support for [asyncio](https://docs.scrapy.org/en/latest/topics/asyncio.html).\n\n After making this observation, I decided to create pyscalpel. And let's be honest, I also wanted to have my own web\n scraping library, and it is fun to write one ;)\n\n\n## Installation\n\n```bash\npip install pyscalpel  # to only use the asyncio backend\npip install pyscalpel[gevent] # to install the gevent backend\npip install pyscalpel[trio] # to install the trio backend\npip install pyscalpel[full] # to install all the backends\n```\n\nIf you know [poetry](https://python-poetry.org/), you can use it instead of pip.\n\n```bash\npoetry add pyscalpel  # to only use the asyncio backend\npoetry add pyscalpel[gevent] # to install the gevent backend\npoetry add pyscalpel[trio] # to install the trio backend\npoetry add pyscalpel[full] # to install all the backends\n```\n\npyscalpel works with **python 3.7** and later. It relies on robust packages:\n- [configuror](https://configuror.readthedocs.io/en/latest/): A configuration toolkit.\n- [httpx](https://www.python-httpx.org/): A modern HTTP client.\n- [selenium](https://pypi.org/project/selenium/): A library for controlling a browser.\n- [gevent](http://www.gevent.org/): An asynchronous framework with a synchronous-style API. (optional)\n- [trio](https://trio.readthedocs.io/en/stable/): A modern asynchronous framework using `async/await` syntax. 
(optional)\n- [anyio](https://anyio.readthedocs.io/): An asynchronous networking and concurrency library that works on top of\neither asyncio or trio.\n- [parsel](https://parsel.readthedocs.io/): A library to extract elements from HTML/XML documents.\n- [attrs](https://www.attrs.org/en/stable/): A library helping to write classes without pain.\n- [fake-useragent](https://pypi.org/project/fake-useragent/): A simple library to fake a user agent.\n- [rfc3986](https://rfc3986.readthedocs.io/en/latest/): A library for URL parsing and validation.\n- [msgpack](https://pypi.org/project/msgpack/): A library allowing for fast serialization/deserialization of data\nstructures.\n\n## Documentation\n\nThe documentation is available at https://scalpel.readthedocs.io/en/latest/.\n\n\n## Usage\n\nTo give you an overview of what can be done, here is a simple example of quote scraping. Don't hesitate to browse the\nexamples folder for more snippets.\n\nWith gevent\n\n```python\nfrom pathlib import Path\n\nfrom scalpel import Configuration\nfrom scalpel.green import StaticSpider, StaticResponse, read_mp\n\ndef parse(spider: StaticSpider, response: StaticResponse) -\u003e None:\n    for quote in response.xpath('//div[@class=\"quote\"]'):\n        data = {\n            'message': quote.xpath('./span[@class=\"text\"]/text()').get(),\n            'author': quote.xpath('./span/small/text()').get(),\n            'tags': quote.xpath('./div/a/text()').getall()\n        }\n        spider.save_item(data)\n\n    next_link = response.xpath('//nav/ul/li[@class=\"next\"]/a').xpath('@href').get()\n    if next_link is not None:\n        response.follow(next_link)\n\nif __name__ == '__main__':\n    backup = Path(__file__).parent / 'backup.mp'\n    config = Configuration(backup_filename=f'{backup}')\n    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)\n    spider.run()\n    print(spider.statistics())\n    # you can do whatever you want with the results\n    for 
quote_data in read_mp(filename=backup, decoder=spider.config.msgpack_decoder):\n        print(quote_data)\n```\n\nWith anyio\n\n```python\nfrom pathlib import Path\n\nimport anyio\nfrom scalpel import Configuration\nfrom scalpel.any_io import StaticResponse, StaticSpider, read_mp\n\n\nasync def parse(spider: StaticSpider, response: StaticResponse) -\u003e None:\n    for quote in response.xpath('//div[@class=\"quote\"]'):\n        data = {\n            'message': quote.xpath('./span[@class=\"text\"]/text()').get(),\n            'author': quote.xpath('./span/small/text()').get(),\n            'tags': quote.xpath('./div/a/text()').getall()\n        }\n        await spider.save_item(data)\n\n    next_link = response.xpath('//nav/ul/li[@class=\"next\"]/a').xpath('@href').get()\n    if next_link is not None:\n        await response.follow(next_link)\n\nasync def main():\n    backup = Path(__file__).parent / 'backup.mp'\n    config = Configuration(backup_filename=f'{backup}')\n    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)\n    await spider.run()\n    print(spider.statistics())\n    # you can do whatever you want with the results\n    async for item in read_mp(backup, decoder=spider.config.msgpack_decoder):\n        print(item)\n\nif __name__ == '__main__':\n    # by default, this will run the asyncio backend, if you want the trio backend, you must first install the trio\n    # package and replace the following line with: anyio.run(main, backend='trio').\n    anyio.run(main)\n```\n\n## Known limitations\n\npyscalpel aims to handle SPAs (single page applications) through selenium. However, due to the synchronous nature\nof selenium, it is hard to leverage the asynchronous features of anyio and gevent. You will notice that the *selenium spider* is\nslower than the *static spider*. 
For more information, see the documentation.\n\n## Warning\n\npyscalpel is a young project, so expect breaking changes in the API that do not follow the\n[semver](https://semver.org/) principle. For now, it is recommended to pin the version you are using.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flewoudar%2Fscalpel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flewoudar%2Fscalpel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flewoudar%2Fscalpel/lists"}