{"id":13560700,"url":"https://github.com/maxoodf/tgnews","last_synced_at":"2025-04-03T16:31:00.176Z","repository":{"id":217045462,"uuid":"284255434","full_name":"maxoodf/tgnews","owner":"maxoodf","description":"Telegram Data Clustering Contest (Bossy Gnu's submission )","archived":false,"fork":false,"pushed_at":"2021-02-08T16:30:37.000Z","size":42,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-04T12:39:57.071Z","etag":null,"topics":["cpp","document-clustering","document-embedding","document-similarity","nlp","nlp-machine-learning","telegram","word2vec"],"latest_commit_sha":null,"homepage":"https://contest.com/data-clustering-2","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maxoodf.png","metadata":{"files":{"readme":"README.md","changelog":"newsCluster/CMakeLists.txt","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-01T12:27:09.000Z","updated_at":"2023-01-14T18:07:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"87515727-0c99-4ccf-ac24-e35521c581ae","html_url":"https://github.com/maxoodf/tgnews","commit_stats":null,"previous_names":["maxoodf/tgnews"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxoodf%2Ftgnews","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxoodf%2Ftgnews/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxoodf%2Ftgnews/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
maxoodf%2Ftgnews/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maxoodf","download_url":"https://codeload.github.com/maxoodf/tgnews/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247036925,"owners_count":20873049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","document-clustering","document-embedding","document-similarity","nlp","nlp-machine-learning","telegram","word2vec"],"created_at":"2024-08-01T13:00:48.827Z","updated_at":"2025-04-03T16:30:59.868Z","avatar_url":"https://github.com/maxoodf.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# **TGNEWS: Data Clustering Contest**\nA solution to the Data Clustering Contest task described at https://contest.com/docs/data_clustering2\n\n## Description\nThe `tgnews` utility's data flow and algorithms are as follows:\n  - language detection: Google's CLD3 is used to detect languages (https://github.com/google/cld3.git);\n  - normalization: the de facto standard Unicode library (ICU) is used to convert documents to lowercase and remove non-alphabetic characters; the tokenizer comes from https://github.com/maxoodf/word2vec;\n  - documents vectorisation: all words of a document are embedded into a vector space (https://github.com/maxoodf/word2vec), and each document itself is embedded into a vector of the same size as the word2vec vectors;\n  - news detection: based on a very simple DNN with the loss binary layer, fully connected layer and document's vector (https://github.com/davisking/dlib);\n  - category detection: based on a 
very simple DNN with the loss multi-class layer, fully connected layer and document's vector (https://github.com/davisking/dlib);\n  - clustering: DBSCAN-based algorithm with a dynamic similarity threshold and improved neighbors detection logic;\n  - clusters ordering: based on a very simple DNN with the mean squared loss layer, fully connected layer and document's vector (https://github.com/davisking/dlib).\n   \n## Building\nBase packages installation:    \n```bash\nsudo apt update\nsudo apt upgrade\nsudo apt install -y libtool g++ git cmake pkg-config libprotobuf-dev libprotoc-dev protobuf-compiler libblas-dev liblapack-dev libicu-dev libssl-dev\nsudo /usr/sbin/ldconfig\n```\nGoogle's gumbo-parser library (HTML5 parser):    \n```bash\ngit clone https://github.com/google/gumbo-parser.git\ncd ./gumbo-parser\n./autogen.sh\n./configure\nmake -j 8\nsudo make install\ncd ../\n```\nGoogle's CLD3 library (language detection):    \n```bash\ngit clone https://github.com/google/cld3.git\ncd ./cld3\nsed -i 's/add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0)//g' ./CMakeLists.txt\nmkdir build-release\ncd ./build-release\ncmake -DCMAKE_BUILD_TYPE=Release ../\nmake -j 8\nsudo cp ./libcld3.a /usr/local/lib\nsudo mkdir /usr/local/include/google\nsudo mkdir /usr/local/include/google/cld_3\nsudo cp -r ./cld_3/protos /usr/local/include/google/cld_3\nsudo cp -r ../src/script_span /usr/local/include/google/cld_3\nsudo cp ../src/*.h /usr/local/include/google/cld_3\ncd ../../\n```\nDLib library (DNN, clustering, etc):    \n```bash\ngit clone https://github.com/davisking/dlib.git\ncd ./dlib\ngit checkout tags/v19.19\nmkdir ./build-release\ncd ./build-release\ncmake -DCMAKE_BUILD_TYPE=Release ../\nmake -j 8\nsudo make install\ncd ../../\n```\nLibevent library (HTTP/HTTPS server):\n```bash\ngit clone https://github.com/libevent/libevent.git\ncd libevent\ngit checkout tags/release-2.1.11-stable\nmkdir build-release\ncd build-release\ncmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS=-fPIC 
-DEVENT__LIBRARY_TYPE=STATIC -DEVENT__DISABLE_BENCHMARK:BOOL=ON -DEVENT__DISABLE_DEBUG_MODE:BOOL=ON -DEVENT__DISABLE_SAMPLES:BOOL=ON -DEVENT__DISABLE_TESTS:BOOL=ON ../\nmake -j 8\nsudo make install\ncd ../../\n```\nRapidjson library (JSON parsing/writing):\n```bash\ngit clone https://github.com/Tencent/rapidjson.git\ncd rapidjson\nmkdir build-release\ncd build-release\ncmake -DCMAKE_BUILD_TYPE=Release ../\nmake -j 8\nsudo make install\ncd ../../\n```\nSqlite3 (document attributes storage)\n```bash\nwget https://www.sqlite.org/2020/sqlite-autoconf-3310100.tar.gz\ntar xfz ./sqlite-autoconf-3310100.tar.gz\ncd ./sqlite-autoconf-3310100\n./configure --enable-shared=no\nmake -j 8\nsudo make install\ncd ../\n```\nWord2vec++ library (words and documents embedding into vector space)\n```bash\ngit clone https://github.com/maxoodf/word2vec.git\ncd ./word2vec\nmkdir ./build-release\ncd ./build-release\ncmake -DCMAKE_BUILD_TYPE=Release ../\nmake -j 8\nsudo make install\ncd ../../ \n```\nTGNEWS utility:    \n```bash\nsudo /usr/sbin/ldconfig\ncd ./submission/src/tgnews/\nmkdir ./build-release\ncd ./build-release\ncmake -DCMAKE_BUILD_TYPE=Release ../\nmake -j 8\ncd ../bin\n```\nMODELS:\n- download model files [archive](https://drive.google.com/file/d/1CoN_59XyNdgy_Cia_bqv9LrEjMFaYhbB/view?usp=sharing) (1.2GB)\n- extract files to `./model` folder\n- go to `./bin` folder and run `./tgnews` for more information\n\n#dataclustering \nBossy Gnu's source code is available here: https://github.com/maxoodf/tgnews","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxoodf%2Ftgnews","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxoodf%2Ftgnews","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxoodf%2Ftgnews/lists"}