Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thammegowda/realigner
Re-aligner tool for aligning parallel sentences from comparable documents
- Host: GitHub
- URL: https://github.com/thammegowda/realigner
- Owner: thammegowda
- Created: 2018-07-05T07:28:08.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-07-13T05:14:17.000Z (over 6 years ago)
- Last Synced: 2024-04-18T02:58:24.228Z (7 months ago)
- Language: Python
- Size: 95.7 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Re-Aligner
This code was written during the DARPA LORELEI y3 evaluation week, after we found that the LDC data packs contained misaligned sentences.
This project uses a set of heuristics and scoring functions to re-align sentences within a document. The heuristics are based on the appearance of the same numbers and URLs on both sides of the bitext.
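The number and URL heuristic can be illustrated with a small sketch. This is not the project's code: the regular expressions, the function names `anchors` and `anchor_overlap`, and the choice of Jaccard overlap as the measure are all assumptions made for illustration.

```python
import re

# Assumed patterns; the project's own matching rules may differ.
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def anchors(sentence):
    """Return the set of URLs and numbers that appear in a sentence."""
    urls = set(URL_RE.findall(sentence))
    # Strip URLs first so digits inside a URL are not counted again as numbers.
    without_urls = URL_RE.sub(" ", sentence)
    nums = set(NUM_RE.findall(without_urls))
    return urls | nums


def anchor_overlap(src, tgt):
    """Jaccard overlap of the anchors on both sides (hypothetical measure).

    Returns 0.0 when neither sentence contains a number or URL,
    because the heuristic then carries no evidence either way.
    """
    a, b = anchors(src), anchors(tgt)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A sentence pair that shares the same year and link scores high, while a pair whose numbers disagree scores low, which is the kind of signal that can flag a misaligned pair before any scoring function is applied.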
Scoring functions:
1. MCSS, a multilingual common semantic space based approach, which maps words from both the source and target languages into the same vector space. Alignments are then made based on the similarity of words in this MCSS space.
2. T-table measure, which uses the GIZA++ aligner's translation table entries to compute a simple score. The score can be interpreted as the probability of the source sentence generating the target sentence and of the target sentence generating the source sentence, as per the given translation table.

## How to use:
The `scripts` directory has the actual scripts I used to run the tool.
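For readers who want a feel for the two scoring functions described earlier, here is a minimal sketch of each. It is an illustration under stated assumptions, not the code in this repository: the dictionary representations of the word vectors and translation tables, the `FLOOR` smoothing value, the length normalisation, and all function names are hypothetical.

```python
import math

FLOOR = 1e-6  # assumed smoothing value for word pairs missing from the table


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mcss_score(src, tgt, vectors):
    """Mean, over source words, of the best cosine similarity to any target word.

    `vectors` maps words of both languages to vectors in one shared space.
    Words without a vector are skipped (an assumption of this sketch).
    """
    sims = []
    for s in src:
        if s not in vectors:
            continue
        candidates = [cosine(vectors[s], vectors[t]) for t in tgt if t in vectors]
        if candidates:
            sims.append(max(candidates))
    return sum(sims) / len(sims) if sims else 0.0


def direction_score(given, generated, ttable):
    """Length-normalised log-probability of `generated` words given `given` words.

    `ttable` maps (given_word, generated_word) -> translation probability.
    """
    if not given or not generated:
        return math.log(FLOOR)
    total = 0.0
    for g in generated:
        # Average the translation probability over all words on the other side.
        p = sum(ttable.get((w, g), 0.0) for w in given) / len(given)
        total += math.log(max(p, FLOOR))
    return total / len(generated)


def ttable_score(src, tgt, src2tgt, tgt2src):
    """Average of both directional scores; values closer to 0 mean a better match."""
    forward = direction_score(src, tgt, src2tgt)
    backward = direction_score(tgt, src, tgt2src)
    return 0.5 * (forward + backward)
```

Both functions take tokenised sentences as lists of words. Scoring in both directions, as the T-table description suggests, penalises pairs where one side explains the other but not the reverse, for example when one sentence contains extra material.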