Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sdsc-ordes/license-collector
Gather licenses about data science repositories.
https://github.com/sdsc-ordes/license-collector
workflow
Last synced: about 15 hours ago
JSON representation
Gather licenses about data science repositories.
- Host: GitHub
- URL: https://github.com/sdsc-ordes/license-collector
- Owner: sdsc-ordes
- License: mit
- Created: 2023-08-08T07:24:38.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-09T17:19:15.000Z (12 months ago)
- Last Synced: 2023-12-14T09:49:20.631Z (11 months ago)
- Topics: workflow
- Language: Python
- Homepage:
- Size: 188 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# License collector
Automatically collect license and other metadata from open-source data science repositories.
## Workflow
The repository contains a Prefect workflow which downloads a list of open source repositories from paperswithcode (see [links between papers and code](https://paperswithcode.com/about)) and runs [gimie](https://github.com/SDSC-ORD/gimie) on each git repository. The extracted metadata from all repositories are then combined into a single RDF graph. A table is also extracted from this graph to provide specific attributes.
The input dataset contains links to ~200k github repositories. It is provided by paperswithcode under the [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) license ([download link](https://production-media.paperswithcode.com/about/links-between-papers-and-code.json.gz)).
The repository is composed of 3 workflows:
* [`retrieve.py`](src/retrieve.py): Download paperswithcode dataset and filter papers with github/gitlab repositories.
* [`extract.py`](src/extract.py): Run gimie on each repository and extract RDF metadata using the GitLab / GitHub API.
* [`enhance.py`](src/enhance.py): Combine original paperswithcode dataset with repository metadata for visualization.> ⚠️ You will need to provide your own GitLab / GitHub API tokens as environment variables `GITHUB_TOKEN` and `GITLAB_TOKEN`.
```mermaid
graph TD;
paperswithcode -->|retrieve.py| filtered_papers.json;
filtered_papers.json -->|extract.py| repo_metadata.ttl;
repo_metadata.ttl -->|enhance.py| combined.csv;
filtered_papers.json -->|enhance.py| combined.csv;
```
For convenience, [`main.py`](src/main.py) script composes the above steps into a higher level workflow, which can be run with a single command:Workflow configuration is defined in [`config.py`](src/config.py).
```bash
make pipeline
```## Quick Start
### Set up the environment
1. Install [Poetry](https://python-poetry.org/docs/#installation)
2. Set up the environment:
```bash
make setup
make activate
```
### Install new packages
To install new PyPI packages, run:
```bash
poetry add
```### Run Python scripts
To run the Python scripts type the following:
```bash
make pipeline
```## Contributing
Issues and pull requests are welcome. The code formatting standard we use is [black](https://github.com/psf/black), with `--line-length=79` to follow [PEP8](https://peps.python.org/pep-0008/) recommendations. This project uses [pyproject.toml](https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/) to define package information, requirements and tooling configuration.