https://github.com/maxoodf/tgnews

Telegram Data Clustering Contest (Bossy Gnu's submission)

cpp document-clustering document-embedding document-similarity nlp nlp-machine-learning telegram word2vec

Host: GitHub
URL: https://github.com/maxoodf/tgnews
Owner: maxoodf
License: apache-2.0
Created: 2020-08-01T12:27:09.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2021-02-08T16:30:37.000Z (about 5 years ago)
Last Synced: 2024-11-04T12:39:57.071Z (over 1 year ago)
Topics: cpp, document-clustering, document-embedding, document-similarity, nlp, nlp-machine-learning, telegram, word2vec
Language: C++
Homepage: https://contest.com/data-clustering-2
Size: 41 KB
Stars: 4
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: newsCluster/CMakeLists.txt
- License: LICENSE


# **TGNEWS: Data Clustering Contest**
A solution to the Data Clustering Contest task described at https://contest.com/docs/data_clustering2

## Description
The `tgnews` data flow and the algorithms used are as follows:
- language detection: Google's CLD3 is used to detect languages (https://github.com/google/cld3.git);
- normalization: the de facto standard Unicode library (ICU) converts documents to lowercase and removes non-alphabetic characters; the tokenizer comes from https://github.com/maxoodf/word2vec;
- document vectorization: every word of a document is embedded into a vector space (https://github.com/maxoodf/word2vec), and the document itself is embedded into a vector of the same size as the word2vec vectors;
- news detection: a very simple DNN that takes the document vector as input and consists of a fully connected layer and a binary loss layer (https://github.com/davisking/dlib);
- category detection: a very simple DNN that takes the document vector as input and consists of a fully connected layer and a multi-class loss layer (https://github.com/davisking/dlib);
- clustering: a DBSCAN-based algorithm with a dynamic similarity threshold and improved neighbor detection logic;
- cluster ordering: a very simple DNN that takes the document vector as input and consists of a fully connected layer and a mean squared loss layer (https://github.com/davisking/dlib).
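
The document vectorization step can be illustrated with a minimal sketch. It assumes that a document vector is the plain average of its word vectors, which is only one common choice; the README does not say how word2vec++ combines them. The names `WordVectors`, `embedDocument` and `cosineSimilarity` are hypothetical and are not part of the word2vec++ API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Vec = std::vector<float>;
// Hypothetical stand-in for a trained word2vec model: token -> vector.
using WordVectors = std::unordered_map<std::string, Vec>;

// Embed a tokenized document as the mean of its known word vectors
// (averaging is an assumption of this sketch). Tokens missing from the
// model are skipped; a document with no known tokens gives the zero vector.
Vec embedDocument(const std::vector<std::string>& tokens,
                  const WordVectors& model, std::size_t dim) {
    Vec doc(dim, 0.0f);
    std::size_t used = 0;
    for (const auto& token : tokens) {
        const auto it = model.find(token);
        if (it == model.end()) continue;
        assert(it->second.size() == dim);
        for (std::size_t i = 0; i < dim; ++i) doc[i] += it->second[i];
        ++used;
    }
    if (used > 0) {
        for (auto& x : doc) x /= static_cast<float>(used);
    }
    return doc;
}

// Cosine similarity of two vectors; returns 0 if either has zero length.
float cosineSimilarity(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0.0f || nb == 0.0f) return 0.0f;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Because the document vector has the same size as the word vectors, the same similarity function works for word-to-word and document-to-document comparisons.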

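For orientation, here is a minimal sketch of textbook DBSCAN over document vectors, using cosine similarity with a fixed threshold. It is not the `tgnews` implementation: the dynamic similarity threshold and the improved neighbor detection mentioned in the description are not reproduced, and all names (`dbscan`, `neighbours`, `minSim`, `minPts`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

using Vec = std::vector<float>;

constexpr int kNoise = -1;      // label for documents outside any cluster
constexpr int kUnvisited = -2;  // internal marker, never returned

// Cosine similarity; returns 0 if either vector has zero length.
float cosine(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0.0f || nb == 0.0f) return 0.0f;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Indices of all documents at least minSim similar to docs[i]
// (a brute-force O(n) scan; includes i itself for non-zero vectors).
std::vector<std::size_t> neighbours(const std::vector<Vec>& docs,
                                    std::size_t i, float minSim) {
    std::vector<std::size_t> out;
    for (std::size_t j = 0; j < docs.size(); ++j) {
        if (cosine(docs[i], docs[j]) >= minSim) out.push_back(j);
    }
    return out;
}

// Classic DBSCAN with a fixed threshold. Returns one label per document:
// a cluster id starting at 0, or kNoise.
std::vector<int> dbscan(const std::vector<Vec>& docs, float minSim,
                        std::size_t minPts) {
    std::vector<int> label(docs.size(), kUnvisited);
    int cluster = 0;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        if (label[i] != kUnvisited) continue;
        const auto seeds = neighbours(docs, i, minSim);
        if (seeds.size() < minPts) { label[i] = kNoise; continue; }
        label[i] = cluster;
        std::deque<std::size_t> queue(seeds.begin(), seeds.end());
        while (!queue.empty()) {
            const std::size_t j = queue.front();
            queue.pop_front();
            if (label[j] == kNoise) label[j] = cluster;  // border point
            if (label[j] != kUnvisited) continue;        // already handled
            label[j] = cluster;
            const auto next = neighbours(docs, j, minSim);
            if (next.size() >= minPts) {                 // core point: expand
                queue.insert(queue.end(), next.begin(), next.end());
            }
        }
        ++cluster;
    }
    return label;
}
```

With `minPts = 2`, two near-duplicate news items already form a cluster, while an item with no sufficiently similar neighbor is labeled as noise.
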
## Building
Base packages installation:
```bash
sudo apt update
sudo apt upgrade
sudo apt install -y libtool g++ git cmake pkg-config libprotobuf-dev libprotoc-dev protobuf-compiler libblas-dev liblapack-dev libicu-dev libssl-dev
sudo /usr/sbin/ldconfig
```
Google's gumbo-parser library (HTML5 parser):
```bash
git clone https://github.com/google/gumbo-parser.git
cd ./gumbo-parser
./autogen.sh
./configure
make -j 8
sudo make install
cd ../
```
Google's CLD3 library (language detection):
```bash
git clone https://github.com/google/cld3.git
cd ./cld3
# Drop the old-ABI define so CLD3 is built with the compiler's default libstdc++ ABI
sed -i 's/add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0)//g' ./CMakeLists.txt
mkdir build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo cp ./libcld3.a /usr/local/lib
sudo mkdir -p /usr/local/include/google/cld_3
sudo cp -r ./cld_3/protos /usr/local/include/google/cld_3
sudo cp -r ../src/script_span /usr/local/include/google/cld_3
sudo cp ../src/*.h /usr/local/include/google/cld_3
cd ../../
```
DLib library (DNN, clustering, etc):
```bash
git clone https://github.com/davisking/dlib.git
cd ./dlib
git checkout tags/v19.19
mkdir ./build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
```
Libevent library (HTTP/HTTPS server):
```bash
git clone https://github.com/libevent/libevent.git
cd libevent
git checkout tags/release-2.1.11-stable
mkdir build-release
cd build-release
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS=-fPIC -DEVENT__LIBRARY_TYPE=STATIC -DEVENT__DISABLE_BENCHMARK:BOOL=ON -DEVENT__DISABLE_DEBUG_MODE:BOOL=ON -DEVENT__DISABLE_SAMPLES:BOOL=ON -DEVENT__DISABLE_TESTS:BOOL=ON ../
make -j 8
sudo make install
cd ../../
```
Rapidjson library (JSON parsing/writing):
```bash
git clone https://github.com/Tencent/rapidjson.git
cd rapidjson
mkdir build-release
cd build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
```
Sqlite3 (document attributes storage):
```bash
wget https://www.sqlite.org/2020/sqlite-autoconf-3310100.tar.gz
tar xfz ./sqlite-autoconf-3310100.tar.gz
cd ./sqlite-autoconf-3310100
./configure --enable-shared=no
make -j 8
sudo make install
cd ../
```
Word2vec++ library (words and documents embedding into vector space):
```bash
git clone https://github.com/maxoodf/word2vec.git
cd ./word2vec
mkdir ./build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
```
TGNEWS utility:
```bash
sudo /usr/sbin/ldconfig
cd ./submission/src/tgnews/
mkdir ./build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
cd ../bin
```
MODELS:
- download the model files [archive](https://drive.google.com/file/d/1CoN_59XyNdgy_Cia_bqv9LrEjMFaYhbB/view?usp=sharing) (1.2 GB)
- extract the files to the `./model` folder
- go to the `./bin` folder and run `./tgnews` for more information

#dataclustering
Bossy Gnu's source code is available here: https://github.com/maxoodf/tgnews