Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
small language model
- Host: GitHub
- URL: https://github.com/oscarsaharoy/slm
- Owner: OscarSaharoy
- Created: 2024-05-12T22:35:36.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-05-22T22:31:04.000Z (8 months ago)
- Last Synced: 2024-05-23T09:34:53.614Z (8 months ago)
- Topics: ai, deep-learning, nlp
- Language: Python
- Homepage:
- Size: 41 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# small language model
this is an nlp experiment to make a tiny chatbot :)
it uses a really simple four-layer architecture which is made recurrent by concatenating the previous timestep's 2nd hidden layer with the current timestep's input layer. the 1st hidden layer in the network is like an embedding of the current token, and the 2nd hidden layer is like an embedding of the context, which is a function of the current token embedding and the previous context embedding.
the model is able to generate reasonable sentences over a very small vocabulary with just a few hundred parameters :) the training data makes it into a chatbot named boris who loves frogs.
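to make the shape of this concrete, here is a minimal numpy sketch of a single recurrent step. the layer sizes, activation function and weight names here are made up, so the real `train.py` almost certainly differs:
```
import numpy as np

# hypothetical sizes - the real ones in train.py may differ
vocab_size, n_hidden1, n_hidden2 = 32, 16, 16

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_hidden1, vocab_size + n_hidden2))
W2 = rng.normal(scale=0.1, size=(n_hidden2, n_hidden1))
W3 = rng.normal(scale=0.1, size=(vocab_size, n_hidden2))

def step(token_id, prev_context):
    # input layer: one-hot current token concatenated with the previous context embedding
    x = np.concatenate([np.eye(vocab_size)[token_id], prev_context])
    h1 = np.tanh(W1 @ x)            # 1st hidden layer
    h2 = np.tanh(W2 @ h1)           # 2nd hidden layer, used as the context embedding
    logits = W3 @ h2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), h2  # next-token distribution + new context
```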
## training
make sure numpy is installed:
```
python3 -m pip install numpy
```
run the training process:
```
python3 train.py
```
after this the weights will be saved to `weights.npz` and some sample output from the network will be printed.

## inference
you can run the network to complete prompts with the infer script like this:
```
$ ./infer.py "who are you?"
i am boris.
$ ./infer.py "what do you love?"
frogs.
```
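under the hood the infer script presumably just feeds the model's own output back into itself. here is a rough sketch of such a loop, reusing `step` and the sizes from the sketch above, and assuming `wordmap.json` maps words to integer ids (the real script and file format may differ):
```
import json
import numpy as np

# assumes wordmap.json maps words to integer ids - the real format may differ
wordmap = json.load(open("wordmap.json"))
id_to_word = {i: w for w, i in wordmap.items()}
unk = wordmap.get("<unk>", 0)  # hypothetical name for the unknown word token

def complete(prompt, max_words=20):
    context = np.zeros(n_hidden2)
    probs = np.ones(vocab_size) / vocab_size
    for word in prompt.lower().split():            # punctuation handling omitted
        probs, context = step(wordmap.get(word, unk), context)
    words = []
    for _ in range(max_words):
        token = int(np.argmax(probs))              # greedy decoding
        words.append(id_to_word.get(token, "?"))
        probs, context = step(token, context)
    return " ".join(words)
```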
it only knows the few words listed in `wordmap.json` though - other words are just converted to an unknown word token.

## embeddings
the hidden layers have semantic interpretations as embeddings of a word or whole sentence. to investigate this, the `searchembeddings.py` script looks for the words closest to an input word in embedding space, by comparing the second hidden layer of the network when each word is fed through, and prints the 5 closest ones. you can see positive adjectives are close together, and so are question words.
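here is a rough sketch of how that search might work, again reusing `step`, `wordmap` and `unk` from the sketches above (the real script may use a different distance or starting context):
```
import numpy as np

def word_embedding(word):
    # feed a single word through the network and take the 2nd hidden layer
    _, h2 = step(wordmap.get(word, unk), np.zeros(n_hidden2))
    return h2

def closest_words(word, k=5):
    target = word_embedding(word)
    dists = [(w, float(np.linalg.norm(word_embedding(w) - target))) for w in wordmap]
    return sorted(dists, key=lambda pair: pair[1])[:k]
```
for example: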
```
$ ./searchembeddings.py "good"
[('good', 0.0), ('nice', 0.46020754596450925), ('.', 0.5950859535028892), ('is', 1.5625579068145068), ('thing', 1.681436310537393)]
$ ./searchembeddings.py "who"
[('who', 0.0), ('?', 1.3721200404159846), ('how', 1.6996846881430183), ('what', 2.1081211336680465), ('are', 2.6312298617361227)]
```

i also created a `compareembeddings.py` script that takes two sentences and calculates the distance between them in embedding space, and you can see the embeddings for similar sentences are closer together than those for opposing ones.
```
$ ./compareembeddings.py "i am good" "i am nice"
1.3235389881128539
$ ./compareembeddings.py "i am good" "i am bad"
1.9960487879227995
```
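the sentence comparison can be sketched the same way: run each sentence through the network, keep the final context embedding, and measure the distance between the two embeddings (again reusing the pieces from the sketches above - the real `compareembeddings.py` may differ):
```
import numpy as np

def sentence_embedding(sentence):
    # run the whole sentence through the network and keep the final context embedding
    context = np.zeros(n_hidden2)
    for word in sentence.lower().split():
        _, context = step(wordmap.get(word, unk), context)
    return context

def sentence_distance(a, b):
    return float(np.linalg.norm(sentence_embedding(a) - sentence_embedding(b)))
```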