Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shubham0204/glove.c
Simple, cross-platform port of GloVe embeddings, written in C
https://github.com/shubham0204/glove.c
c-programming embeddings-model embeddings-similarity information-retrieval language-understanding natural-language-processing
Last synced: about 1 month ago
JSON representation
Simple, cross-platform port of GloVe embeddings, written in C
- Host: GitHub
- URL: https://github.com/shubham0204/glove.c
- Owner: shubham0204
- License: apache-2.0
- Created: 2023-12-14T06:42:31.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-12T01:51:09.000Z (9 months ago)
- Last Synced: 2024-04-12T09:13:52.655Z (9 months ago)
- Topics: c-programming, embeddings-model, embeddings-similarity, information-retrieval, language-understanding, natural-language-processing
- Language: C
- Homepage: https://shubham0204.github.io/glove.c/
- Size: 275 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
![glove-c-banner](https://github.com/shubham0204/glove.c/assets/41076823/c0bf7ad9-3187-4f62-9f56-73852d1c9286)
# `glove.c` Port of GloVe Embeddings in C
> A simple, cross-platform library which accesses GloVe word embeddings from a given text file using open-chaining hashmap
**Contents**
1. [Setup](#setup)
2. [Usage](#usage)
1. [Usage in C](#usage-in-c)
2. [Usage in Java](#usage-in-java)
3. [Usage in Python](#usage-in-python)
3. [Working](#working)
4. [Using Glove Pretrained On A Custom Corpus](#using-glove-pretrained-on-a-custom-corpus)
5. [Contributing](#contributing)
6. [Useful External Resources](#useful-external-resources)## Setup
### Build Instructions
The build-script (`CMakeLists.txt`) generates two targets, `example` and `libglove` where
- `example` is a sample C-program demonstrating the use of the library, as in [Usage in C](#usage-in-c)
- `libglove` is a dynamic library that can be linked in other programs```
$> git clone https://github.com/shubham0204/glove.c
$> cd glove.c
$glove.c> mkdir build && cd build
$glove.c/build> cmake ..
$glove.c/build> make
```### `valgrind` Report
```
$glove.c> gcc -Wall -Wextra -ggdb3 src/glove.c src/main.c -o main -lm
$glove.c> valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./main
```The build artifacts are generated in the `build` directory after executing the above commmands.
## Usage
The program below shows the usage of the `glove.c` with pretrained embeddings taken from [StanfordNLP/GloVe](https://github.com/stanfordnlp/GloVe?tab=readme-ov-file#download-pre-trained-word-vectors) repository. The embeddings are derived from the Wikipedia 2014 + Gigaword 5 datasets consisting of 6B tokens and 400K vocab with 50 dimensions.
### Usage in C
> See [src/main.c](https://github.com/shubham0204/glove.c/blob/main/src/main.c)
```c
#include "glove.h"int main( int argc , char** argv ) {
glove* instance = glove_create(
"glove.6B.50d.txt" , /* vectors */
400000 , /* vocab size */
50 /* vector size */
) ;
float* embedding = glove_get_embedding( instance , argv[1] ) ;
if( embedding ) {
for( int i = 0 ; i < instance -> embedding_dims ; i++ ) {
printf( "%f " , embedding[i] ) ;
}
printf( "\n" ) ;
}
else {
printf( "embedding not found" ) ;
}
glove_release( instance ) ;
return 0;
}
```Compile the program with `glove.c` and `libmath`,
```
$> gcc main.c glove.c -o main -lm
$> ./main hello
```### Usage in Java
> See [examples/java](https://github.com/shubham0204/glove.c/tree/main/examples/java)
```java
import java.util.Arrays;public class Main {
public static void main(String[] args) {
String word = "hello" ;
GloVe glove = new GloVe(
"glove.6B.50d.txt" ,
400000 ,
50
) ;
float[] embedding = glove.getEmbedding( word ) ;
System.out.println( "Embedding: " + Arrays.toString( embedding ) ) ;
}
}
```### Usage in Python
> See [examples/python](https://github.com/shubham0204/glove.c/tree/main/examples/python)
```python
from glove import GloVe
import timeglove = GloVe( "glove.6B.50d.txt" , 400000 , 50 )
vec = glove.get_embedding( "hello" )
print( vec )
```## Working
`glove.c` uses a hashtable with open-chaining to get near-constant access times for all embeddings, at the expense of extra storage overhead.
## Using GloVe Trained On A Custom Corpus
The steps for training a GloVe model on a custom corpus is provided on the [official GitHub repository](https://github.com/stanfordnlp/GloVe/tree/master/src#readme). Once the training is started with the by executing the `demo.sh` script, we see the following in output written on the console,
```
...
TRAINING MODEL
Read 60666468 lines.
Initializing parameters...Using random seed 1702550061
done.
vector size: 50
vocab size: 71290
x_max: 10.000000
alpha: 0.750000
12/14/23 - 10:35.36AM, iter: 001, cost: 0.071237
...
12/14/23 - 10:52.36AM, iter: 014, cost: 0.036445
12/14/23 - 10:53.54AM, iter: 015, cost: 0.036244
$ python eval/python/evaluate.py
...
```After training is complete, the `vectors.txt` file can be found in the root directory of the project. Along with `vectors.txt`, we also need `vector size` and `vocab size` from console output, as given above. These three parameters would go into the `glove_create` function which returns an instance of `glove` and allows us to get embeddings for words.
## Contributing
- A Javascript wrapper with WebAssembly
- Improving `hashmap.c` for better performance (memory and access time)
- Reducing the overall memory consumption - currently, all contents of the vectors text-file are loaded in memory## Useful External Resources
- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
- [Comment on my post on r/C_Programming by `skeeto`](https://www.reddit.com/r/C_Programming/comments/19b3br8/comment/kipm82f/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
- [How do I use valgrind to find memory leaks?](https://stackoverflow.com/a/44989219/13546426)
- [Still Reachable Leak detected by Valgrind](https://stackoverflow.com/questions/3840582/still-reachable-leak-detected-by-valgrind)
- [How can I install clang-format in Ubuntu?](https://stackoverflow.com/questions/20756924/how-can-i-install-clang-format-in-ubuntu)
- [How to call clang-format over a c/cpp project folder?](https://stackoverflow.com/questions/28896909/how-to-call-clang-format-over-a-cpp-project-folder)
- [How to auto indent a C++ class with 4 spaces using clang-format?](https://stackoverflow.com/questions/42799183/how-to-auto-indent-a-c-class-with-4-spaces-using-clang-format)
- [Reddit discussion on `glove.c`](https://www.reddit.com/r/C_Programming/comments/19b3br8/introducing_glovec_a_port_of_glove_word/)