https://github.com/jamesturk/spatula
A modern Python library for writing maintainable web scrapers.
- Host: GitHub
- URL: https://github.com/jamesturk/spatula
- Owner: jamesturk
- License: MIT
- Created: 2017-02-21T04:49:00.000Z (over 9 years ago)
- Default Branch: main
- Last Pushed: 2024-07-10T07:18:10.000Z (almost 2 years ago)
- Last Synced: 2025-03-29T00:05:07.788Z (about 1 year ago)
- Topics: hacktoberfest, python3, scraping
- Language: Python
- Homepage: https://jamesturk.github.io/spatula/
- Size: 1.26 MB
- Stars: 247
- Watchers: 6
- Forks: 11
- Open Issues: 9
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: docs/code_of_conduct.md
README
# Overview
*spatula* is a modern Python library for writing maintainable web scrapers.
Source: [https://github.com/jamesturk/spatula](https://github.com/jamesturk/spatula)
Documentation: [https://jamesturk.github.io/spatula/](https://jamesturk.github.io/spatula/)
Issues: [https://github.com/jamesturk/spatula/issues](https://github.com/jamesturk/spatula/issues)
[PyPI version](https://badge.fury.io/py/spatula)
[Test & Lint workflow](https://github.com/jamesturk/spatula/actions?query=workflow%3A%22Test+%26+Lint%22)
## Features
- **Page-oriented design**: Encourages writing understandable & maintainable scrapers.
- **Not Just HTML**: Provides built-in [handlers for common data formats](https://jamesturk.github.io/spatula/reference/#pages) including CSV, JSON, XML, PDF, and Excel. Or write your own.
- **Fast HTML parsing**: Uses `lxml.html` for fast, consistent, and reliable parsing of HTML.
- **Flexible Data Model Support**: Compatible with `dataclasses`, `attrs`, `pydantic`, or bring your own data model classes for storing & validating your scraped data.
- **CLI Tools**: Offers several [CLI utilities](https://jamesturk.github.io/spatula/cli/) that can help streamline the development & testing cycle.
- **Fully Typed**: Makes full use of Python 3 type annotations.
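
To make the page-oriented idea and the data model support more concrete, here is a minimal sketch using only the Python standard library. It is not *spatula*'s API: `Employee`, `EmployeeListPage`, `process_page`, `_CellCollector`, and `SAMPLE_HTML` are hypothetical names chosen for illustration, and the standard library's `html.parser` stands in for `lxml.html` so the example runs without extra dependencies. The pattern it shows is one class per kind of page, each responsible for turning that page's content into typed records.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Iterator


@dataclass
class Employee:
    """Hypothetical data model for one scraped record."""
    first: str
    last: str


class _CellCollector(HTMLParser):
    """Collects the text of every <td>, grouped by enclosing <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


class EmployeeListPage:
    """Hypothetical page class: knows how to read one kind of page."""

    def __init__(self, html: str) -> None:
        self.html = html

    def process_page(self) -> Iterator[Employee]:
        parser = _CellCollector()
        parser.feed(self.html)
        for cells in parser.rows:
            # Header rows contain <th> cells only, so they yield no <td> text.
            if len(cells) == 2:
                yield Employee(first=cells[0], last=cells[1])


# Stand-in for a fetched page; a real scraper would download this.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>First</th><th>Last</th></tr>"
    "<tr><td>Ada</td><td>Lovelace</td></tr>"
    "<tr><td>Alan</td><td>Turing</td></tr>"
    "</table>"
)

employees = list(EmployeeListPage(SAMPLE_HTML).process_page())
```

Keeping the parsing logic for each page type in its own class means a layout change on one page touches only that class, which is the maintainability benefit the feature list points to. For the library's actual page classes and selectors, see the documentation linked above.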