https://github.com/timedelta/submodularity-for-data-selection
Some old code I wrote around 2014 based on "Submodularity for Data Selection in Statistical Machine Translation"
- Host: GitHub
- URL: https://github.com/timedelta/submodularity-for-data-selection
- Owner: TimeDelta
- Created: 2023-09-28T15:12:33.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-09-28T15:54:27.000Z (over 1 year ago)
- Last Synced: 2024-10-16T13:20:59.229Z (3 months ago)
- Language: Roff
- Size: 32.7 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This is some old corpus augmentation code I wrote around 2014. The main algorithm is in [Filter.cpp](scripts/Filter.cpp) and is based on the paper [Submodularity for Data Selection in Statistical Machine Translation](https://aclanthology.org/D14-1014.pdf). The repository also contains supporting scripts for data preparation, a tool for transforming the output to a WFST ([arpa2fst](scripts/arpa2fst)), and the main mining script ([mine_google.py](scripts/mine_google.py)).
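
For readers unfamiliar with the technique, here is a minimal Python sketch of greedy data selection with a feature-based submodular objective, the general idea the paper builds on. It is not the code in Filter.cpp. The function names `ngram_counts` and `greedy_select`, the square-root concave function, the uniform feature weights, and the sentence-count budget are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(sentence, max_n=2):
    """Count the word n-grams (n = 1..max_n) in one whitespace-tokenized sentence."""
    tokens = sentence.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def greedy_select(sentences, budget, max_n=2):
    """Greedily pick up to `budget` sentences that maximize
    f(X) = sum over n-grams u of sqrt(m_u(X)),
    where m_u(X) is the total count of u in the selected set X.
    Uniform weights and sqrt are illustrative assumptions.
    Returns the selected indices in the order they were picked."""
    features = [ngram_counts(s, max_n) for s in sentences]
    totals = Counter()  # m_u(X) for the current selection X
    remaining = set(range(len(sentences)))
    selected = []

    while remaining and len(selected) < budget:
        best_idx, best_gain = None, 0.0
        for idx in sorted(remaining):  # sorted for deterministic tie-breaking
            # Marginal gain of adding sentence idx to the current selection.
            gain = sum(
                math.sqrt(totals[u] + c) - math.sqrt(totals[u])
                for u, c in features[idx].items()
            )
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx is None:  # no remaining sentence adds any value
            break
        selected.append(best_idx)
        remaining.remove(best_idx)
        totals.update(features[best_idx])
    return selected
```

Because the square root is concave, a sentence whose n-grams are already well covered by the selection contributes a smaller gain than one that brings new n-grams, so near-duplicates are picked late or not at all. A call such as `greedy_select(corpus, budget=1000)` returns indices into `corpus`.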