https://github.com/damesek/hustem
Create a better Snowball Hungarian stemmer .sbl config file via TDD
clojure error-rate hunspell snowball stem
- Host: GitHub
- URL: https://github.com/damesek/hustem
- Owner: damesek
- License: other
- Created: 2023-10-28T15:59:17.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-28T19:38:45.000Z (almost 2 years ago)
- Last Synced: 2024-07-30T20:43:08.698Z (about 1 year ago)
- Topics: clojure, error-rate, hunspell, snowball, stem
- Language: C
- Homepage:
- Size: 13.4 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# HuStem: the Hungarian stemmer
WIP
HuStem was created to work with Snowball's .sbl files.
The goal is to improve the quality of the Hungarian stemmer config file.
To achieve this, I first identify the issues and then work through them with tests, following a TDD (Test-Driven Development) workflow.

## Error rate
```clojure
{:hunspell 32.91, :snowball 47.41, :hunspell-mdb 32.91} ; error rates in percent
```
In terms of accuracy (67.09% vs. 52.59%), this means Hunspell is roughly 27.6% more accurate than Snowball in relative terms.
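The relative figure can be checked with a little arithmetic. The sketch below assumes that accuracy is simply 100 minus the error rate and that the comparison is relative (a ratio of accuracies), which the README does not state explicitly. The function name `relative_gain` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Relative accuracy gain of stemmer A over stemmer B, given error rates in percent.
# Assumption: accuracy = 100 - error rate; gain = (acc_a / acc_b - 1) * 100.
relative_gain() {
  # LC_ALL=C keeps "." as the decimal separator regardless of the system locale.
  LC_ALL=C awk -v a="$1" -v b="$2" \
    'BEGIN { printf "%.1f\n", ((100 - a) / (100 - b) - 1) * 100 }'
}

# Hunspell (32.91% errors) vs. Snowball (47.41% errors)
relative_gain 32.91 47.41
```

Note that the error rates are written with a decimal point here; the Hungarian-style decimal comma (`32,91`) would not be parsed as a number by `awk` in the C locale.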
I tested two different dic/aff sources, but there was no difference in the error rate.

## Snowball CLI basics
Test the new Hungarian Snowball SBL file from the Snowball root folder
(`src/resources/snowball-master`):

```bash
make && echo "baglyokat" | ./stemwords -l hungarian
```

The Hungarian word list can be used to check the results:
```bash
grep "teremt" ../magyar-szavak.txt
```

## Compile the Java sources

TODO: add this as a prep task.
```bash
clj -T:build clean
clj -T:build compile-java
```

## License
Copyright © 2023 FIXME
This program and the accompanying materials are made available under the
terms of the Eclipse Public License 2.0 which is available at
http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary
Licenses when the conditions for such availability set forth in the Eclipse
Public License, v. 2.0 are satisfied: GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or (at your
option) any later version, with the GNU Classpath Exception which is available
at https://www.gnu.org/software/classpath/license.html.