https://github.com/undp-data/dsc-sdg-scraper
A collection of web scrapers to harvest SDG-labelled publications.
https://github.com/undp-data/dsc-sdg-scraper
Last synced: 11 months ago
JSON representation
A collection of web scrapers to harvest SDG-labelled publications.
- Host: GitHub
- URL: https://github.com/undp-data/dsc-sdg-scraper
- Owner: UNDP-Data
- License: bsd-3-clause
- Created: 2024-09-26T10:51:58.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-21T18:36:12.000Z (over 1 year ago)
- Last Synced: 2025-03-05T16:39:47.814Z (over 1 year ago)
- Language: Python
- Size: 63.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# dsc-sdg-scraper
[](https://www.python.org/downloads/release/python-3110/)
[](https://github.com/undp-data/dsc-sdg-scraper/blob/main/LICENSE)
[](https://github.com/psf/black)
[](https://pycqa.github.io/isort/)
[](https://conventionalcommits.org)
A collection of web scrapers to harvest SDG-labelled publications. The project is written
in Python using an async interface of [`httpx`](https://www.python-httpx.org) and exposed
to users via a simple command line interface (CLI) built in [`click`](https://click.palletsprojects.com/en/8.1.x/).
## Table of Contents
- [Getting Started](#getting-started)
- [Usage](#usage)
- [License](#license)
- [Contributing](#contributing)
## Getting Started
These instructions will help you set up the project locally. The project has been developed and tested with Python `3.11`.
To set up a local environment:
1. Clone the repository:
```shell
git clone https://github.com/undp-data/dsc-sdg-scraper.git
```
2. Navigate to the project directory:
```shell
cd dsc-sdg-scraper
```
3. Create a virtual environment:
```shell
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
4. Install dependencies:
```shell
pip install -r requirements.txt
```
5. Explore the CLI:
```shell
python -m sdg_scraper
```
## Usage
The CLI enables you to run scrapers for any of the supported sources.
To list available sources, run:
```shell
python -m sdg_scraper list
```
To scrape a specific source, run:
```shell
python -m sdg_scraper run
```
By default, the programme will scrape resources from the first two pages of the source (pages 0-1) and save the
files and metadata to the current directory. To customise this behaviour, use the command line options:
```shell
python -m sdg_scraper run --pages 1 10 -f data
# or
python -m sdg_scraper run -p 1 10 -f data
```
Use CLI help for more details:
```shell
python -m sdg_scraper --help
```
## License
This project's codebase is licensed under the BSD 3-Clause License. Data collected by the scraper may
be licensed under a different clause or even copyrighted. It is your responsibility to ensure that
any processing of data with the help of the scraper is responsible, ethical and legal.
## Contributing
All contributions must follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/).
The codebase is formatted with `black` and `isort`. Use the provided [Makefile](./Makefile) for these
routine operations.
1. Clone or fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes. Include tests for new features.
4. Run tests (`make test`).
5. Ensure your code is properly formatted (`make format`).
6. Commit your changes (`git commit -m 'Feat: add some feature'`).
7. Push to the branch (`git push origin feature-branch`).
8. Open a pull request.