https://github.com/cahya-wirawan/indonesian-language-models

Indonesian Language Models and its Usage

Host: GitHub
URL: https://github.com/cahya-wirawan/indonesian-language-models
Owner: cahya-wirawan
License: mit
Created: 2018-08-19T06:43:45.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2023-05-22T11:21:46.000Z (over 2 years ago)
Last Synced: 2024-04-24T05:11:09.837Z (over 1 year ago)
Topics: deep-learning, fastai, huggingface-transformers, language-model, machine-learning, nlp, pytorch, transformer
Language: Jupyter Notebook
Homepage: https://cahya-wirawan.github.io/indonesian-language-models
Size: 61 MB
Stars: 149
Watchers: 13
Forks: 28
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff

README

# Indonesian Language Models

A language model is a probability distribution over word sequences, used to predict the next word given the words
that precede it. This ability makes the language model a core component of modern natural language processing. We use
it for many different tasks, such as speech recognition, conversational AI, information retrieval, sentiment analysis,
and text summarization.
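
As a minimal sketch of the idea, the following toy bigram model estimates the probability of the next word from
counts. It conditions on only one previous word, which is a strong simplification of the definition above, and the
three-sentence corpus and the function names are made up for illustration; they are not taken from this repository.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" marks the start of a sentence so the first word has a context too.
        tokens = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Return the estimated distribution P(next word | prev) as a dict."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}


# Toy corpus, invented for this example.
corpus = [
    "saya suka makan nasi",
    "saya suka minum teh",
    "saya makan nasi goreng",
]
model = train_bigram(corpus)
probs = next_word_probs(model, "saya")
best = max(probs, key=probs.get)
print(best, probs[best])
```

In this toy corpus "saya" is followed by "suka" twice and by "makan" once, so the model prefers "suka". Neural
language models such as ULMFiT replace the count table with a learned network, but they produce the same kind of
next-word distribution.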

For this reason, many big companies are competing to build larger and larger language models, such as Google BERT,
Facebook RoBERTa, or OpenAI GPT-3 with its 175 billion parameters. Most of the time, they build language models
only for English and some other European languages. Countries with low-resource languages face big
challenges in catching up in this technology race.

Therefore, the author is trying to build language models for Indonesian, starting with ULMFiT in 2018. The first
language model was trained only on Indonesian Wikipedia, which is very small compared to the datasets used
to train English language models.

## Universal Language Model Fine-tuning (ULMFiT)
Jeremy Howard and Sebastian Ruder proposed [ULMFiT](https://arxiv.org/abs/1801.06146) in early 2018 as a novel method for
fine-tuning language models for inductive transfer learning. The language model [ULMFiT for Indonesian](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/ULMFiT)
was trained as part of the author's project while learning [FastAI](https://www.fast.ai). It achieved a perplexity
of **27.67** on Indonesian Wikipedia.
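
For readers unfamiliar with the metric, perplexity is conventionally the exponential of the average negative
log-likelihood per token, so lower is better. The sketch below shows that standard definition only; how this
repository tokenized the text and computed its reported figure is not shown here, and the probabilities in the
example are invented.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each true next token."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that always spreads its probability evenly over 4 candidates
# assigns 0.25 to every true token, which gives a perplexity of about 4.
uniform_guess = [0.25] * 10
print(perplexity(uniform_guess))
```

Read this way, a perplexity of 27.67 means the model is, on average, about as uncertain as if it were choosing
uniformly among roughly 28 tokens at each step.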

## Transformers
Ashish Vaswani et al. proposed the Transformer in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762).
It is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies
with ease.

At the time of writing (March 2021), there are already more than 50 different types of transformer-based language
models (according to the model list at Hugging Face), such as BERT, GPT-2, Longformer, or mT5, built by companies and
individual contributors. The author has also built several
[Indonesian transformer-based language models](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers)
using the [Hugging Face Transformers library](https://github.com/huggingface/transformers) and hosts them on the
[Hugging Face model hub](https://huggingface.co/cahya).
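
A minimal sketch of how one of the hosted models might be loaded with the `pipeline` API of the Transformers
library. The model id `cahya/bert-base-indonesian-522M` is an assumption that is not named in this README, so check
the hub page for the ids that are actually available. The helper name `load_fill_mask` is hypothetical. The
sketch needs `transformers` and a backend such as PyTorch installed, and it downloads the model weights on first
use.

```python
# Assumed model id; replace it with an id listed on https://huggingface.co/cahya.
DEFAULT_MODEL_ID = "cahya/bert-base-indonesian-522M"


def load_fill_mask(model_id: str = DEFAULT_MODEL_ID):
    """Return a fill-mask pipeline for a model on the hub (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import pipeline

    # Downloads the tokenizer and weights from the hub if they are not cached yet.
    return pipeline("fill-mask", model=model_id)
```

The returned pipeline takes a sentence containing the tokenizer's mask token (`[MASK]` for BERT-style models) and
returns a list of candidate completions, each with a `score` and a `token_str`.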
