https://github.com/spider-rs/spider-py
Spider ported to Python
crawler headless-chrome python scraper spider web-crawler
- Host: GitHub
- URL: https://github.com/spider-rs/spider-py
- Owner: spider-rs
- License: mit
- Created: 2023-12-08T12:22:24.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-24T10:20:00.000Z (3 months ago)
- Last Synced: 2024-12-16T12:40:30.525Z (10 days ago)
- Topics: crawler, headless-chrome, python, scraper, spider, web-crawler
- Language: Rust
- Homepage: https://spider-rs.github.io/spider-py/
- Size: 1.33 MB
- Stars: 53
- Watchers: 1
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# spider-py
The [spider](https://github.com/spider-rs/spider) project ported to Python.
## Getting Started
1. `pip install spider_rs`
```python
import asyncio

from spider_rs import Website


async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())


asyncio.run(main())
```

View the [examples](./examples/) to learn more.
## Development
Install Python and maturin (`pipx install maturin`), then build the package locally:
1. `maturin develop`
## Benchmarks
View the [benchmarks](./bench/README.md) for a breakdown by library and platform.
Test url: `https://espn.com`
| `libraries` | `pages` | `speed` |
| :--------------------------- | :-------- | :------ |
| **`spider(rust): crawl`** | `150,387` | `1m` |
| **`spider(nodejs): crawl`** | `150,387` | `153s` |
| **`spider(python): crawl`** | `150,387` | `186s` |
| **`scrapy(python): crawl`** | `49,598` | `1h` |
| **`crawlee(nodejs): crawl`** | `18,779` | `30m` |

The benchmarks above were run on a Mac M1; spider on Linux ARM machines performs about 2-10x faster.
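
The `speed` column mixes seconds, minutes, and hours, so the rows are hard to compare at a glance. As a rough sketch, the snippet below converts each row to pages per second using only the figures from the table. The helper names `to_seconds` and `pages_per_second` are illustrative and not part of `spider_rs`, and values such as `1m` and `1h` are taken at face value even though they are probably rounded, so treat the results as approximate.

```python
import re

# Rows copied from the benchmark table: (library, pages crawled, reported time).
RESULTS = [
    ("spider(rust)", 150_387, "1m"),
    ("spider(nodejs)", 150_387, "153s"),
    ("spider(python)", 150_387, "186s"),
    ("scrapy(python)", 49_598, "1h"),
    ("crawlee(nodejs)", 18_779, "30m"),
]

# Seconds per unit suffix used in the table.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a string such as '153s', '30m', or '1h' to whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * UNIT_SECONDS[unit]


def pages_per_second(pages: int, duration: str) -> float:
    """Approximate throughput implied by one benchmark row."""
    return pages / to_seconds(duration)


for library, pages, duration in RESULTS:
    rate = pages_per_second(pages, duration)
    print(f"{library:<16} {rate:10.1f} pages/s")
```

Because the crawlers did not all visit the same number of pages, pages per second is a fairer comparison than elapsed time alone.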
## Issues
Please submit a GitHub issue for any problems you find.