{"id":13560971,"url":"https://github.com/IlyaGusev/tgcontest","last_synced_at":"2025-04-03T16:31:42.738Z","repository":{"id":53341397,"uuid":"222744125","full_name":"IlyaGusev/tgcontest","owner":"IlyaGusev","description":"Telegram Data Clustering contest solution by Mindful Squirrel","archived":false,"fork":false,"pushed_at":"2022-06-12T15:07:55.000Z","size":14772,"stargazers_count":94,"open_issues_count":0,"forks_count":25,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-11-04T12:40:10.247Z","etag":null,"topics":["classification","clustering","cpp","data-science","document-similarity","fasttext","machine-learning","nlp"],"latest_commit_sha":null,"homepage":"https://contest.com/docs/data_clustering2","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IlyaGusev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-11-19T16:48:08.000Z","updated_at":"2024-08-09T16:23:41.000Z","dependencies_parsed_at":"2022-09-03T04:41:09.890Z","dependency_job_id":null,"html_url":"https://github.com/IlyaGusev/tgcontest","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Ftgcontest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Ftgcontest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Ftgcontest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Ftgcontest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IlyaGusev","d
ownload_url":"https://codeload.github.com/IlyaGusev/tgcontest/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247037031,"owners_count":20873082,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","clustering","cpp","data-science","document-similarity","fasttext","machine-learning","nlp"],"created_at":"2024-08-01T13:00:51.289Z","updated_at":"2025-04-03T16:31:37.728Z","avatar_url":"https://github.com/IlyaGusev.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"# TGNews\n\n[![Build Status](https://travis-ci.com/IlyaGusev/tgcontest.svg?token=9pgxYSDpb2YAVSfz53Nq\u0026branch=master)](https://travis-ci.com/IlyaGusev/tgcontest)\n\n## Links\n* Description in English: https://medium.com/@phoenixilya/news-aggregator-in-2-weeks-5b38783b95e3\n* Description in Russian: https://habr.com/ru/post/487324/\n\n## Demo\n* Russian: [https://ilyagusev.github.io/tgcontest/ru/main.html](https://ilyagusev.github.io/tgcontest/ru/main.html)\n* English: [https://ilyagusev.github.io/tgcontest/en/main.html](https://ilyagusev.github.io/tgcontest/en/main.html)\n\n## Install\nPrerequisites: CMake, Boost\n```\n$ sudo apt-get install cmake libboost-all-dev build-essential libjsoncpp-dev uuid-dev protobuf-compiler libprotobuf-dev\n```\n\nFor macOS:\n```\n$ brew install boost jsoncpp ossp-uuid protobuf\n```\n\n\nIf you got the code as a zip archive, skip ahead to building the binary.\n\nTo download the code and models:\n```\n$ git clone https://github.com/IlyaGusev/tgcontest\n$ cd tgcontest\n$ git submodule update 
--init --recursive\n$ bash download_models.sh\n$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip\n$ unzip libtorch-cxx11-abi-shared-with-deps-1.5.0+cpu.zip\n```\n\nFor MacOS use https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip\n\nTo build binary (in \"tgcontest\" dir):\n```\n$ mkdir build \u0026\u0026 cd build \u0026\u0026 Torch_DIR=\"../libtorch\" cmake -DCMAKE_BUILD_TYPE=Release .. \u0026\u0026 make -j4\n```\n\nTo download datasets:\n```\n$ bash download_data.sh\n```\n\nRun on sample:\n```\n./build/tgnews top data --ndocs 10000\n```\n\n## Training\n\n* Russian FastText vectors training:\n[VectorsRu.ipynb](https://github.com/IlyaGusev/tgcontest/blob/master/scripts/VectorsRu.ipynb)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QeyhqsHy5MO3yzvsn446LsqK_PqOjIVb)\n* Russian fasttext category classifier training:\n[CatTrainRu.ipynb](https://github.com/IlyaGusev/tgcontest/blob/master/scripts/CatTrainRu.ipynb)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1U7Wxm5eDnrBRWE_logCSJIq6DzTFV0Zo)\n* Russian text embedder with **triplet loss** training (v3):\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1vp_qbWKtqtqgNLW5Upp4Gh2UL36zdTvT)\n* English FastText vectors training:\n[VectorsEn.ipynb](https://github.com/IlyaGusev/tgcontest/blob/master/scripts/VectorsEn.ipynb)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lbmgJ_iGBdwKdkU_1l1-WZuO7XbYZlWQ)\n* English fasttext category classifier training:\n[CatTrainEn.ipynb](https://github.com/IlyaGusev/tgcontest/blob/master/scripts/CatTrainEn.ipynb)\n[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ayg5dtA_KdhzVehN4-_EiyIcwRhBVSob)\n* English text embedder with **triplet loss** training (v3):\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1o1B50aktpHJmEzHCQ8lPV3yZOtKSTXa3)\n* PageRank rating calculation:\n[PageRankRating.ipynb](https://github.com/IlyaGusev/tgcontest/blob/master/scripts/PageRankRating.ipynb)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bd35S0rl_Uysiuz_7fmkYRArzNcP-wZB)\n* Russian **ELMo-based** sentence embedder training (not used):\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Q0S5OvramxxqQZnaSIH8xWfmOsWeKhIz)\n* XLM-RoBERTa pseudo-labeling for categorization: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fEmNPi41mnxLrc4hqamqi72xDCCH0Ima)\n\n## Models\n* Language detection model (2 round): [lang_detect_v10.ftz](https://www.dropbox.com/s/hoapmnvqlknmu6v/lang_detect_v10.ftz)\n* Russian FastText vectors (2 round): [ru_vectors_v3.bin](https://www.dropbox.com/s/vttjivmmxw7leea/ru_vectors_v3.bin)\n* Russian categories detection model (2 round): [ru_cat_v5.ftz](https://www.dropbox.com/s/23x35wuet280eh6/ru_cat_v5.ftz)\n* English FastText vectors (2 round): [en_vectors_v3.bin](https://www.dropbox.com/s/6aaucelizfx7xl6/en_vectors_v3.bin)\n* English categories detection model (2 round): [en_cat_v5.ftz](https://www.dropbox.com/s/luh60dd0uw8p9ar/en_cat_v5.ftz)\n* PageRank-based agency rating: [pagerank_rating.txt](https://www.dropbox.com/s/0o9xr2pwuqeh17k/pagerank_rating.txt)\n* Alexa agency rating: [alexa_rating_4_fixed.txt](https://www.dropbox.com/s/fry1gsd1mans9jm/alexa_rating_4_fixed.txt)\n* XLM-RoBERTa for categorization (pytorch-lightning checkpoint): 
[xlmr_en_ru_cat_v1.tar.gz](https://www.dropbox.com/s/y6leppzf2l43eqo/xlmr_en_ru_cat_v1.tar.gz)\n\n## Data\n* Russian news from 11.01.2019 to 10.05.2020 with gaps: [ru_tg_1101_0510.jsonl.tar.gz](https://www.dropbox.com/s/r8iqi6h6x1w0pzv/ru_tg_1101_0510.jsonl.tar.gz)\n* Russian news from 11.05.2020 to 17.05.2020: [ru_tg_0511_0517.jsonl.tar.gz](https://www.dropbox.com/s/zvv1qvm1yidvc2p/ru_tg_0511_0517.jsonl.tar.gz)\n* English news from 11.01.2019 to 10.05.2020 with gaps: [en_tg_1101_0510.jsonl.tar.gz](https://www.dropbox.com/s/9f11mdgv4qsjrvb/en_tg_1101_0510.jsonl.tar.gz)\n* English news from 11.05.2020 to 17.05.2020: [en_tg_0511_0517.jsonl.tar.gz](https://www.dropbox.com/s/qj7s8ek91usmcxp/en_tg_0511_0517.jsonl.tar.gz)\n\n## Markup\n* Russian categories raw train markup: [ru_cat_v4_train_raw_markup.tsv](https://www.dropbox.com/s/24rsyxxp00kxjzr/ru_cat_v4_train_raw_markup.tsv)\n* Russian categories aggregated train markup: [ru_cat_v4_train_annot.json](https://www.dropbox.com/s/2rpsabep7tstmkq/ru_cat_v4_train_annot.json)\n* Russian categories aggregated train markup in fastText format: [ft_ru_cat_v4_train.txt](https://www.dropbox.com/s/tdz4k44o0jmrpi5/ft_ru_cat_v4_train.txt)\n* Russian categories manual train markup: [ru_cat_v4_train_manual_annot.json](https://www.dropbox.com/s/fibw7remhk2bodl/ru_cat_v4_train_manual_annot.json)\n* Russian categories manual train markup in fastText format: [ft_ru_cat_v4_train_manual.txt](https://www.dropbox.com/s/y9jg50rck1pg1w1/ft_ru_cat_v4_train_manual.txt)\n* Russian categories raw test markup: [ru_cat_v4_test_raw_markup.tsv](https://www.dropbox.com/s/9cbubupcht00kqn/ru_cat_v4_test_raw_markup.tsv)\n* Russian categories aggregated test markup: [ru_cat_v4_test_annot.json](https://www.dropbox.com/s/ur7jhiyi22tmzxd/ru_cat_v4_test_annot.json)\n* Russian categories aggregated test markup in fastText format: [ft_ru_cat_v4_test.txt](https://www.dropbox.com/s/89opmh9alx7tfy3/ft_ru_cat_v4_test.txt)\n* English categories aggregated train markup: 
[en_cat_v4_train_annot.json](https://www.dropbox.com/s/fysoyx1mz8rf6rs/en_cat_v4_train_annot.json)\n* English categories aggregated train markup in fastText format: [ft_en_cat_v4_train.txt](https://www.dropbox.com/s/7a2k2tmkf61nsks/ft_en_cat_v4_train.txt)\n* English categories aggregated test markup: [en_cat_v4_test_annot.json](https://www.dropbox.com/s/ucwzhucwgtuy8k1/en_cat_v4_test_annot.json)\n* English categories aggregated test markup in fastText format: [ft_en_cat_v4_test.txt](https://www.dropbox.com/s/yga8i06hqv0pvqc/ft_en_cat_v4_test.txt)\n* Russian clustering pairs: [ru_pairs_raw_markup.tsv](https://www.dropbox.com/s/jugcl80vfd4wg0h/ru_pairs_raw_markup.tsv)\n* English clustering pairs: [en_pairs_raw_markup.tsv](https://www.dropbox.com/s/1zs05c3frm8cygq/en_pairs_raw_markup.tsv)\n* Russian clustering pairs for one day (0517): [ru_clustering_0517.tsv](https://www.dropbox.com/s/rrkxdnml6ukql8j/ru_clustering_0517.tsv)\n\n## Misc\n* Flamegraph: https://ilyagusev.github.io/tgcontest/flamegraph.svg\n\n## Other contestants\n* Round 2\n  * II place\n    * Daring Frog: https://github.com/a-l-e-x-k/data_clustering_contest, article: https://medium.com/@alexkuznetsov/2nd-place-solution-for-telegram-data-clustering-contest-f28d55b98d30\n    * Swift Skunk: https://github.com/sorrge/tg_news_cluster\n  * III place\n    * Mindful Kitten: https://danlark.org/2020/07/31/news-aggregator-from-scratch-in-2-weeks/\n  * IV place\n    * Bossy Gnu: https://github.com/maxoodf/tgnews\n  * Other:\n    * Large Crab: https://github.com/ilya-ustinov/tgcontest\n* Round 1\n  * III place\n    * Kooky Dragon: https://github.com/nick-baliesnyi/tgnews\n  * IV place\n    * Sharp Sloth: https://github.com/thehemen/telegram-data-clustering\n  * Other\n    * Desert Python: https://github.com/crazyleg/telegram_data_clustering_2019\n    * Funky Peacock: https://github.com/Stepka/telegram_clustering_contest\n    * Unknown animal: https://github.com/roman-rybalko/telegram-data-clustering-contest\n    * 
Unknown animal: https://github.com/MarcoBuster/data-clustering-contest\n    * Unknown animal: https://github.com/sudevschiz/tgnews\n    * Unknown animal: https://github.com/crazyleg/telegram_data_clustering_2019\n    * Unknown animal: https://github.com/77ph/tgnews\n    * Unknown animal: https://github.com/akash-joshi/telegram-cluster\n    * Unknown animal: https://github.com/dremovd/telegram-clustering\n\n## Contacts\n* Telegram: [@YallenGusev](https://t.me/YallenGusev)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIlyaGusev%2Ftgcontest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIlyaGusev%2Ftgcontest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIlyaGusev%2Ftgcontest/lists"}