https://github.com/openzim/python-scraperlib
Collection of Python code to re-use across Python-based scrapers
- Host: GitHub
- URL: https://github.com/openzim/python-scraperlib
- Owner: openzim
- License: gpl-3.0
- Created: 2020-02-03T09:05:32.000Z
- Default Branch: main
- Last Pushed: 2024-10-24T15:03:56.000Z
- Last Synced: 2024-10-25T03:25:00.405Z
- Topics: library, python, webscraping, zim
- Language: Python
- Homepage:
- Size: 5.63 MB
- Stars: 19
- Watchers: 8
- Forks: 16
- Open Issues: 24
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# zimscraperlib
[Build Status](https://github.com/openzim/python-scraperlib/actions?query=branch%3Amain)
[CodeFactor](https://www.codefactor.io/repository/github/openzim/python-scraperlib)
[License: GPL v3](https://www.gnu.org/licenses/gpl-3.0)
[PyPI version](https://pypi.org/project/zimscraperlib/)
[PyPI Python versions](https://pypi.org/project/zimscraperlib)
[Codecov](https://codecov.io/gh/openzim/python-scraperlib)
[Documentation](https://python-scraperlib.readthedocs.io/)
Collection of Python code to re-use across Python-based scrapers
# Usage
- This library is meant to be installed via PyPI ([`zimscraperlib`](https://pypi.org/project/zimscraperlib/)).
- Make sure to reference it with a version constraint, as the API is subject to frequent changes.
- The API should only be expected to remain the same within the same _minor_ version.
Example usage:
```pip
zimscraperlib>=1.1,<1.2
```
See documentation at [Read the Docs](https://python-scraperlib.readthedocs.io/) for details.
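As a rough illustration of the minor-version pinning rule described above, the sketch below builds such a constraint from a known-good version string. The helper name `minor_pin` is hypothetical and not part of `zimscraperlib`; it only assumes versions start with `major.minor`.

```python
import re


def minor_pin(package: str, version: str) -> str:
    """Return a requirement string limited to the minor series of `version`.

    Hypothetical helper for illustration only; not part of zimscraperlib.
    """
    # Only the leading "major.minor" part matters for the constraint.
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse major.minor from {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Lower bound: this minor series. Upper bound: the next minor series.
    return f"{package}>={major}.{minor},<{major}.{minor + 1}"


print(minor_pin("zimscraperlib", "1.1.3"))  # prints zimscraperlib>=1.1,<1.2
```

The resulting string can be placed in a requirements file or in a project's dependency list.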
# Dependencies
- libmagic
- wget
- libzim (auto-installed, not available on Windows)
- Pillow
- FFmpeg
- gifsicle (>=1.92)
- libcairo (only needed for image manipulation, where it is used for SVG conversion)
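Before running a scraper, it can help to confirm that the command-line tools from this list are on the `PATH`. The following is a minimal sketch using only the standard library. The function name `find_missing_tools` is hypothetical, and the executable names (`wget`, `ffmpeg`, `gifsicle`) are assumptions based on each tool's usual command name. Shared libraries such as libmagic, libzim and libcairo are not executables and are not covered by this check.

```python
import shutil

# Assumed executable names for the command-line dependencies listed above.
REQUIRED_TOOLS = ("wget", "ffmpeg", "gifsicle")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools: " + ", ".join(missing))
else:
    print("All command-line tools found")
```

This check does not validate versions, so the gifsicle minimum (>=1.92) still needs to be verified separately.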
## macOS
```sh
brew install libmagic wget libtiff libjpeg webp little-cms2 ffmpeg gifsicle
```
## Linux (Debian/Ubuntu)
```sh
sudo apt install libmagic1 wget ffmpeg \
libtiff5-dev libjpeg8-dev libopenjp2-7-dev zlib1g-dev \
libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python3-tk \
libharfbuzz-dev libfribidi-dev libxcb1-dev gifsicle
```
## Alpine
```sh
apk add ffmpeg gifsicle libmagic wget libjpeg
```
# Contribution
This project adheres to openZIM's [Contribution Guidelines](https://github.com/openzim/overview/wiki/Contributing).
This project has implemented openZIM's [Python bootstrap, conventions and policies](https://github.com/openzim/_python-bootstrap/docs/Policy.md) **v1.0.2**.
```shell
pip install hatch
pip install ".[dev]"
pre-commit install
# For tests
invoke coverage
```
# Users
Non-exhaustive list of scrapers using it (check status when updating API):
- [openzim/freecodecamp](https://github.com/openzim/freecodecamp)
- [openzim/gutenberg](https://github.com/openzim/gutenberg)
- [openzim/ifixit](https://github.com/openzim/ifixit)
- [openzim/kolibri](https://github.com/openzim/kolibri)
- [openzim/nautilus](https://github.com/openzim/nautilus)
- [openzim/openedx](https://github.com/openzim/openedx)
- [openzim/sotoki](https://github.com/openzim/sotoki)
- [openzim/ted](https://github.com/openzim/ted)
- [openzim/warc2zim](https://github.com/openzim/warc2zim)
- [openzim/wikihow](https://github.com/openzim/wikihow)
- [openzim/youtube](https://github.com/openzim/youtube)