Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stoufa/neural-machine-translation_transliteration
An Intelligent Approach for Translation / Transliteration using Neural Networks
- Host: GitHub
- URL: https://github.com/stoufa/neural-machine-translation_transliteration
- Owner: stoufa
- Created: 2018-04-23T08:01:33.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-12-24T13:17:00.000Z (about 6 years ago)
- Last Synced: 2024-11-06T06:14:56.912Z (3 months ago)
- Topics: attention-mechanism, character-embeddings, fasttext, neural-machine-translation, nlp, pytorch, rnn-encoder-decoder, translation, transliteration, word-embeddings
- Language: Jupyter Notebook
- Homepage:
- Size: 293 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Neural-Machine-Translation_Transliteration
An Intelligent Approach for Translation / Transliteration using Neural Networks

This translation approach is based on Recurrent Neural Networks (RNNs), the type of neural network to use when dealing with sequences of input such as videos, sound, or, as in our case, text.
![RNNs](https://camo.githubusercontent.com/c847d37b28afbb2cf3c73bb428354308e16f5efc/68747470733a2f2f63646e2d696d616765732d312e6d656469756d2e636f6d2f6d61782f3830302f312a445537373653477231726859655537696c494b5839772e706e67)
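To make the recurrent encoder idea concrete, here is a minimal sketch of a GRU-based character encoder in PyTorch. The class name, dimensions, and layer choices are illustrative assumptions, not code taken from this repository.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Minimal RNN encoder sketch: embeds a character sequence and runs a GRU.
    All names and sizes here are illustrative, not from the repository."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer tensor of character indices
        embedded = self.embedding(char_ids)   # (batch, seq_len, embed_dim)
        outputs, hidden = self.gru(embedded)  # outputs could feed an attention layer
        return outputs, hidden

encoder = CharEncoder(vocab_size=100)
outputs, hidden = encoder(torch.randint(0, 100, (2, 15)))
print(outputs.shape)  # torch.Size([2, 15, 128])
```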
For the data, I used the [bible-corpus](http://christos-c.com/bible/). Download the corresponding raw XML files and place them in `data/bible-corpus/raw/`, then extract the text from these files; the Jupyter Notebook (`word-character embedding/XMLparser.ipynb`) can help you with this task (a sketch of the idea follows below). Save the results in `data/bible-corpus/pre-processed/`, and finally run the script `createEmbeddings.sh` to generate the embeddings in `data/bible-corpus/processed/`.
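As a rough sketch of the extraction step, the snippet below pulls verse text out of the raw XML files with the standard library. It assumes verses live in `<seg>` elements, which may need adjusting to the actual bible-corpus schema; the directory paths follow the layout described above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

RAW_DIR = Path("data/bible-corpus/raw")
OUT_DIR = Path("data/bible-corpus/pre-processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)

for xml_file in RAW_DIR.glob("*.xml"):
    tree = ET.parse(xml_file)
    # Assumption: verse text is stored in <seg> elements; adjust the tag
    # to match the actual schema of the bible-corpus files.
    verses = [seg.text.strip() for seg in tree.iter("seg") if seg.text]
    out_path = OUT_DIR / (xml_file.stem + ".txt")
    out_path.write_text("\n".join(verses), encoding="utf-8")
    print(f"{xml_file.name}: {len(verses)} verses -> {out_path}")
```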
By the way, I used [fastText](https://fasttext.cc/) for the embeddings.

The script (`word-character embedding/getEmbedding.py`) reads a word or a character from the user and checks whether its embedding is already stored in the SQLite database (`word-character embedding/embeddingDB.db`). If it is not, the script computes it with fastText; this works even for tokens absent from the training corpus, since fastText derives the closest embedding from the word's character n-grams.
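Here is a hedged sketch of that cache-then-compute pattern. The table schema and the `model.bin` path are assumptions made for illustration; only the database location comes from the repository, and `fasttext.load_model` / `get_word_vector` are the standard fastText Python API.

```python
import sqlite3
import numpy as np
import fasttext

# "model.bin" is a placeholder path; the database path is the one the README names.
model = fasttext.load_model("model.bin")
db = sqlite3.connect("word-character embedding/embeddingDB.db")
db.execute("CREATE TABLE IF NOT EXISTS embeddings (token TEXT PRIMARY KEY, vector BLOB)")

def get_embedding(token: str) -> np.ndarray:
    """Return the cached embedding, computing and storing it on a cache miss."""
    row = db.execute("SELECT vector FROM embeddings WHERE token = ?", (token,)).fetchone()
    if row is not None:
        return np.frombuffer(row[0], dtype=np.float32)
    # fastText builds vectors from character n-grams, so even
    # out-of-vocabulary tokens get a meaningful embedding.
    vec = model.get_word_vector(token).astype(np.float32)
    db.execute("INSERT INTO embeddings VALUES (?, ?)", (token, vec.tobytes()))
    db.commit()
    return vec

print(get_embedding("hello")[:5])
```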
The Jupyter Notebook `translate_dev.ipynb` explains the whole pipeline: reading in the training data, tokenization, embedding, and then building and training the model.
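As a rough outline of that pipeline, the sketch below shows one training step for a bare-bones encoder-decoder in PyTorch. It omits the attention mechanism for brevity, and every name, dimension, and the random stand-in data are assumptions, not the notebook's actual code.

```python
import torch
import torch.nn as nn

# Toy dimensions; real values would come from the notebook's data.
VOCAB, EMB, HID = 100, 64, 128

class Seq2Seq(nn.Module):
    """Bare-bones encoder-decoder without attention, for illustration only."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, EMB)
        self.tgt_emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src, tgt):
        _, hidden = self.encoder(self.src_emb(src))           # encode source sequence
        dec_out, _ = self.decoder(self.tgt_emb(tgt), hidden)  # decode with teacher forcing
        return self.out(dec_out)                              # logits per target step

model = Seq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on random stand-in data.
src = torch.randint(0, VOCAB, (8, 20))
tgt = torch.randint(0, VOCAB, (8, 15))
logits = model(src, tgt[:, :-1])  # predict each next target token
loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```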