{"id":19608136,"url":"https://github.com/jfilter/german-lemmatizer-docker","last_synced_at":"2025-08-12T12:33:45.686Z","repository":{"id":68654686,"uuid":"157131795","full_name":"jfilter/german-lemmatizer-docker","owner":"jfilter","description":"✂️ Combining the power of several tools for lemmatization of German text","archived":false,"fork":false,"pushed_at":"2022-09-30T19:44:13.000Z","size":65,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-12T10:55:09.440Z","etag":null,"topics":["docker-image","german","lemmas","lemmatization","lemmatizer","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jfilter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-11-11T23:30:53.000Z","updated_at":"2020-10-17T20:53:53.000Z","dependencies_parsed_at":"2023-03-11T04:05:38.014Z","dependency_job_id":null,"html_url":"https://github.com/jfilter/german-lemmatizer-docker","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jfilter/german-lemmatizer-docker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fgerman-lemmatizer-docker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fgerman-lemmatizer-docker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fgerman-lemmatizer-docker/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fgerman-lemmatizer-docker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jfilter","download_url":"https://codeload.github.com/jfilter/german-lemmatizer-docker/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fgerman-lemmatizer-docker/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270061277,"owners_count":24520274,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker-image","german","lemmas","lemmatization","lemmatizer","python"],"created_at":"2024-11-11T10:14:20.601Z","updated_at":"2025-08-12T12:33:45.675Z","avatar_url":"https://github.com/jfilter.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"matt-artz-353291-unsplash.jpg\" alt=\"Scissors\"\u003e\n\u003c/div\u003e\n\n# German Lemmatizer Docker Image\n\nA Docker image to [lemmatize](https://en.wikipedia.org/wiki/Lemmatisation) German texts.\n\nBuilt upon:\n\n-   [IWNLP](https://github.com/Liebeck/spacy-iwnlp): Uses the crowd-generated token tables on [de.wiktionary](https://de.wiktionary.org/).\n-   [GermaLemma](https://github.com/WZBSocialScienceCenter/germalemma): Looks up lemmas 
in the [TIGER Corpus](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/) and uses [Pattern](https://www.clips.uantwerpen.be/pattern) as a fallback for some rule-based lemmatizations.\n\nIt works as follows. First, [spaCy](https://spacy.io/) tags each token with its part of speech (POS). Then `German Lemmatizer` looks up the lemma in IWNLP and GermaLemma. If the two disagree, the lemma from IWNLP is chosen. If they agree, or only one tool finds a lemma, that lemma is taken. The casing of the original token is preserved where possible.\n\nYou may want to use the Python wrapper: [German Lemmatizer](https://github.com/jfilter/german-lemmatizer)\n\n## Installation\n\n1. Install [Docker](https://docs.docker.com/).\n\n## Usage\n\n1. Read and accept the [license terms of the TIGER Corpus](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/htmlicense.html) (free to use for non-commercial purposes).\n2. Start Docker.\n3. To execute, you have two options:\n\n    1. To lemmatize a string from the terminal, run:\n\n    ```bash\n    docker run -it filter/german-lemmatizer:0.5.0 \"Was ist das für ein Leben?\" [--remove_stop]\n    ```\n\n    2. To lemmatize a collection of texts, mount two local folders into the Docker container (NB: you have to give absolute paths):\n\n    ```bash\n    docker run -it -v $(pwd)/some_input_folder:/input -v $(pwd)/some_output_folder:/output filter/german-lemmatizer:0.5.0 [--line] [--escape] [--remove_stop]\n    ```\n\n    With `--line`, each line is treated as a single document instead of the whole file.\n\n    With `--escape`, the newlines are escaped ('\\n' -\u003e '\\\\\\n') for each document (per line), so the text in the input file has to be processed like this.\n\n    `--remove_stop` removes stop words as defined by spaCy.\n\n## The Case for Reproducibility\n\nEverything – all the code and all the data – is packaged in the Docker image. This means that every lemmatization is reproducible. 
In the future, I may update the code and/or data, but each image is tagged with a specific version.\n\n## Dev Remarks\n\n-   Tried to base it on a [Docker Alpine image](https://hub.docker.com/_/alpine/), but there were too many installation hassles.\n-   Tried to parallelise with [joblib](https://github.com/joblib/joblib), but it created too much overhead.\n-   To build an image, run `docker build -t lemma .` in this folder.\n-   For debugging purposes, you may want to enter the container and override the entry point: `docker run -it --entrypoint /bin/bash lemma`\n-   To publish a tagged image: `docker build -t filter/german-lemmatizer:0.5.0 .` and `docker push filter/german-lemmatizer:0.5.0`\n\n## License\n\nMIT.\n\n## Sponsoring\n\nThis work was created as part of a [project](https://github.com/jfilter/ptf) that was funded by the German [Federal Ministry of Education and Research](https://www.bmbf.de/en/index.html).\n\n\u003cimg src=\"./bmbf_funded.svg\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fgerman-lemmatizer-docker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjfilter%2Fgerman-lemmatizer-docker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fgerman-lemmatizer-docker/lists"}