{"id":21659883,"url":"https://github.com/slub/docsa","last_synced_at":"2025-04-11T22:40:46.432Z","repository":{"id":63311054,"uuid":"444430340","full_name":"slub/docsa","owner":"slub","description":"SLUB Document Classification and Similarity Analysis","archived":false,"fork":false,"pushed_at":"2023-08-31T14:02:06.000Z","size":1090,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-25T18:41:05.359Z","etag":null,"topics":["bibliographic-data","classification","library","machine-learning","similarity-analysis"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-04T13:31:32.000Z","updated_at":"2024-08-02T02:23:13.000Z","dependencies_parsed_at":"2024-11-25T09:42:58.288Z","dependency_job_id":null,"html_url":"https://github.com/slub/docsa","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fdocsa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fdocsa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fdocsa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fdocsa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slub","download_url":"https://codeload.github.com/slub/docsa/ta
r.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248493011,"owners_count":21113159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bibliographic-data","classification","library","machine-learning","similarity-analysis"],"created_at":"2024-11-25T09:31:50.366Z","updated_at":"2025-04-11T22:40:46.407Z","avatar_url":"https://github.com/slub.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SLUB Document Classification and Similarity Analysis\n\nThis project provides a library for bibliographic document classification and similarity analysis.\n\nIt contains a selection of methods that support:\n\n- pre-processing of bibliographic metadata and full-text documents,\n- training of multi-label multi-class classification models,\n- integrating and using hierarchical subject classifications (pruning methods, performance scores),\n- similarity analysis and clustering.\n\nA detailed description including tutorials and examples can be found in the API documentation, which must be \ngenerated as described below.\n\n## Installation\n\nThis project requires [Python](https://www.python.org/) v3.8 or above and uses [pip](https://pypi.org/project/pip/) \nfor dependency management. In addition, this package uses [PyTorch](https://pytorch.org/) to train \n[Artificial Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network) on GPUs. 
\nMake sure to install the latest Nvidia graphics drivers and check \n[further requirements](https://pytorch.org/get-started/locally/#linux-prerequisites).\n\n### Via Python Package Installer (not available yet)\n\nOnce published to PyPI (*not available yet*), install via:\n\n- `python3 -m pip install slub_docsa`\n\n### From Source\n\nDownload the source code by checking out the repository:\n\n - `git clone https://git.slub-dresden.de/lod/maschinelle-klassifizierung/docsa.git`\n\nUse *make* to install Python dependencies by executing the following commands:\n\n- `make install` or `make install-test`  \n  (installs *slub_docsa* package and downloads all required runtime / test dependencies via *pip*)\n- `make test`  \n  (runs tests to verify correct installation, requires test dependencies)\n- `make docs`  \n  (generates API documentation, requires test dependencies)\n\n### From Source using Ubuntu 20.04\n\nInstall essentials like *python3*, *pip* and *make*:\n\n- `apt-get update`  \n   (update the Ubuntu package installer index)\n- `apt-get install -y make python3 python3-pip`  \n   (install python3, pip and make)\n\nOptionally, set up a Python [virtual environment](https://docs.python.org/3/tutorial/venv.html):\n\n- `apt-get install -y python3-venv`\n- `python3 -m venv /path/to/venv`\n- `source /path/to/venv/bin/activate`\n\nRun *make* commands as provided above:\n\n- `make install-test` \n- `make test` \n\n## Documentation\n\nFurther documentation of this project can be found at the following locations:\n\n- [API documentation](http://slub.github.io/docsa/) needs to be generated via `make docs` and is then available \n  at `docs/python/slub_docsa.html`.\n- Tutorials and examples are included in the [API documentation](http://slub.github.io/docsa/).\n- Developer meeting notes can be found in a separate \n  [GitLab Wiki](https://git.slub-dresden.de/lod/maschinelle-klassifizierung/docs/-/wikis/home/Protokolle).\n- Results of various experiments related to 
the [Qucosa](https://www.qucosa.de/) and \n  [k10plus](https://wiki.k10plus.de/pages/viewpage.action?pageId=358711298) datasets can be found in a separate \n  [GitLab repository](https://git.slub-dresden.de/lod/maschinelle-klassifizierung/docs/-/tree/main/experiments).\n\n## Development\n\n### Python Virtual Environment\n\nDownload all developer dependencies and install the *slub_docsa* package via pip in development mode:\n\n- `make install-dev` \n\nThis will link your local project such that changes to source files are immediately reflected, see \n[pip install -e](https://pip.pypa.io/en/stable/cli/pip_install/#install-editable).\n\n### Container Environment\n\nThis project also provides container images for development. You can use [docker](https://docs.docker.com), but also \nother container runtimes, e.g., [podman](https://podman.io/).\n\nInstall a Container Runtime\n\n- Either, install `docker` and `docker-compose`:\n  - Install docker, see https://docs.docker.com/get-docker/\n  - Install `docker-compose`, see https://docs.docker.com/compose/install/\n\n- Or, set up `podman` on Fedora 34 including the Nvidia container runtime:\n  - Install the Nvidia graphics drivers and check that they are working by running `nvidia-smi`\n  - Install `podman` and `podman-compose` from repositories via `dnf install podman podman-compose`\n  - Install the [nvidia container runtime](https://github.com/NVIDIA/nvidia-container-runtime) using the `centos8`\n  repositories via `dnf install nvidia-container-runtime`, see\n  [installation instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)\n  - Set `no-cgroups = true` in `/etc/nvidia-container-runtime/config.toml`, which is required since Nvidia does not yet \n    support cgroups v2\n  - Check your CUDA version with `nvidia-smi`, e.g., `11.4`\n  - Identify the matching CUDA docker image, e.g., `nvidia/cuda:11.4.1-base-centos8`\n  - Verify GPU support in podman via\n    `podman run 
--security-opt=label=disable --rm nvidia/cuda:11.4.1-base-centos8 nvidia-smi`\n\nSet up the Development Environment\n- Docker images for development can be found in the `code/docker/devel` directory.\n- Run `build.sh gpu` to build these docker images with GPU support.\n- Run `up.sh gpu` and `down.sh gpu` to start and shut down the development container.\n- Run `shell_python.sh gpu` to enter the Python container with GPU support.\n- Run `shell_annif.sh` to enter the Annif container.\n\nSet up [Visual Studio Code](https://code.visualstudio.com/), which supports many useful features during development:\n- [Python Integration](https://code.visualstudio.com/docs/languages/python), including auto complete, linting, debugging\n- [Remote Container](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers), which \n  allows using the Python environment provided by a docker container\n\nContinuous Integration\n- The CI pipeline can be triggered by running `make coverage` and `make lint`. Together, these commands run automated tests using \n[pytest](https://pytest.org/), enforce code guidelines using [pylint](https://www.pylint.org/) and \n[flake8](https://flake8.pycqa.org/), and check for common security issues using \n[bandit](https://github.com/PyCQA/bandit).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslub%2Fdocsa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fslub%2Fdocsa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslub%2Fdocsa/lists"}