{"id":18567117,"url":"https://github.com/diovisgood/agrocode","last_synced_at":"2026-04-29T10:32:38.097Z","repository":{"id":75449155,"uuid":"429068650","full_name":"diovisgood/agrocode","owner":"diovisgood","description":"Cow's diseases prediction from textual symptoms. My solution to the Agro Code competition (2021).","archived":false,"fork":false,"pushed_at":"2021-11-17T20:19:11.000Z","size":313,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-07T14:05:07.600Z","etag":null,"topics":["catboost","multilabel-classification","natural-language-processing","nlp","python","sentiment-analysis","transformer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/diovisgood.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-17T14:02:56.000Z","updated_at":"2021-11-17T20:19:14.000Z","dependencies_parsed_at":"2023-03-09T15:15:21.793Z","dependency_job_id":null,"html_url":"https://github.com/diovisgood/agrocode","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/diovisgood/agrocode","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diovisgood%2Fagrocode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diovisgood%2Fagrocode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diovisgood%2Fagrocode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/diovisgood%2Fagrocode/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/diovisgood","download_url":"https://codeload.github.com/diovisgood/agrocode/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diovisgood%2Fagrocode/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32421638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["catboost","multilabel-classification","natural-language-processing","nlp","python","sentiment-analysis","transformer"],"created_at":"2024-11-06T22:25:30.033Z","updated_at":"2026-04-29T10:32:38.080Z","avatar_url":"https://github.com/diovisgood.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# My Solution to Agro Code Contest\n\nIn November 2021 there was a **Machine Learning contest** in Russia.\nYou can read about it here (in russian, of course):\n\n[https://contest.ds.agro-code.ru/competition](https://contest.ds.agro-code.ru/competition)\n\nThe task is to predict the disease(s) of cows by a given textual\ndescription of it in natural language (in russian 
language).\nIt is a **Multi-Label Classification** problem, since\nthere can be several diseases for some texts in the training set.\n\nI registered to participate, but somehow forgot about the _deadline_,\nwhich was November 17 :)\n\nThat is why the solutions I posted were marked as **\"out of competition\"**.\n\nThere were about 50 participants, only 36 of whom were *\"in competition\"*;\nthe others, like me, missed the deadline.\n\n## Baseline\n\nThe organizers provided a baseline as a Python notebook: [Baseline.ipynb](Baseline.ipynb).\nThe baseline solution is based on the\n[**CatBoostClassifier**](https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier).\n\nIt can work with texts, treating words as **tokens**,\nthough without any analysis of word forms or word meanings.\n\nThe baseline also specified a scoring function based on **log_loss**.\nThe score of the baseline model is **0.38** (the lower, the better). \n\n## My Approach\n\nIn my opinion, treating words as tokens is a very **bad idea**.\n\nFirst, the Russian language has a lot of [**grammatical cases**](https://en.wikipedia.org/wiki/Grammatical_case):\nthe form of a word depends on its context.\n\n\u003e \"корова\", \"коровы\", \"корове\", \"коровье\", ...\n\nSecond, Russian speakers often use\n[**diminutives**](https://en.wikipedia.org/wiki/Diminutive)\nand\n[**augmentatives**](https://en.wikipedia.org/wiki/Augmentative).\nThese add only a *tiny* shade of meaning to the base meaning of a word,\nbut *significantly* increase the complexity for Machine Learning.\n\n\u003e \"коровушка\", \"коровёнка\", \"коровка\" \u003cbr/\u003e\n\u003e \"придоил\", \"додоил\", \"передоил\"\n\nInstead of using tokens I decided to utilize [**word embeddings**](https://en.wikipedia.org/wiki/Word_embedding).\nI.e. 
when each word is represented as a point in N-dimensional space.\nTypically, N is 100...300.\n\nWhen you have proper word embeddings, built on a large language corpus,\nyou can do amazing things with them.\n\nFor example, you can use **vector operations** in this space to search for appropriate words.\n\n``v \u003c- (Woman - Man);  King + v ~\u003e Queen``\n\n``v \u003c- (Truck - Car);  Puppy + v ~\u003e Dog``\n\nAnother important thing is that words which are often used together\nwill have their embeddings located **nearby** in this multidimensional space. \n\n## Dictionaries of Embeddings for Russian Language\n\nWe need some pre-trained dictionaries of embeddings for Russian words,\nand they need a license that is free for commercial usage.\n\nThere are some dictionaries available at\n[**RusVectores**](https://rusvectores.org/ru/).\nBut the dictionaries there contain 150...300 thousand words,\nwhich is rather small.\nBesides, I could not find any license conditions on their site.\n\nThere is the [**\"Natasha\"**](https://github.com/natasha/navec) project.\nIt contains 250...500k words. The license is **MIT**, free for commercial usage.\n\nThere is also another project,\n[**DeepPavlov**](https://docs.deeppavlov.ai/en/0.0.7/intro/pretrained_vectors.html),\nwhich contains about 1500k words.\nThe license is **Apache 2.0**, free for commercial usage.\n\nI decided to use the last option, the **DeepPavlov** dictionary.\nWe need to download the dictionary file, 4.14 GB, and load it into memory.\n\nI wrote a special class for this task: `GloveModel`.\nYou may find it in [agrocode_embeddings_catboost.ipynb](agrocode_embeddings_catboost.ipynb).\n\n## First Attempt - Transformers\n\nThe task of disease labelling is very similar to\n[**Sentiment Analysis**](https://en.wikipedia.org/wiki/Sentiment_analysis).\nI.e. 
when you recognize and assign some labels to the input text, like:\n`Positive`, `Negative` or `Neutral`.\n\nIn this case there can be multiple labels for a text,\nwhich makes it a Multi-Label classification.\n\n### Description\n\nI decided that the [**Transformer**](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))\narchitecture is suited for this task,\nparticularly its first half, the **TransformerEncoder** part.\n\nThe basic idea is the following:\n\n1. The model takes a sequence of word embeddings as its input.\n   \n2. To each embedding we add information about its position in the text,\n   using the `PositionalEncoding` class.\n\n3. Next comes `TransformerEncoder` with several built-in layers of `MultiHeadAttention`\n   followed by `Linear`, `LayerNorm` and `Dropout` layers.\n   It processes the whole sequence, highlighting the important parts and suppressing the useless ones.\n\n4. Then comes the `MultiHeadAttention` layer, which compares each embedding in the sequence with\n   some **Target Embeddings**. It effectively sums up the whole sequence into **several final embeddings**,\n   one for each target.\n\n5. Finally, there comes the `Linear` layer, which receives these final embeddings\n   and outputs **probabilities**, one for each target. 
\n\nThe source code for this class is in [agrocode_embeddings_transformer.py](agrocode_embeddings_transformer.py).\nHere is the source code of the main `forward` method of the neural net:\n\n```python\n    def forward(self, texts: List[str]):\n        # Convert batch of texts into tensor of embeddings\n        x, padding_mask, batch_offsets = self.texts2batch(texts)\n        # x has shape: (sequence_length, batch_size, d_model)\n        # padding_mask has shape: (batch_size, sequence_length)\n        # batch_offsets is the list of length of batch_size, which contains a list of offsets for each tag\n        \n        # Add positional information into x\n        x = self.position_encoder.forward(x, mask=padding_mask)\n        \n        # Initialize self-attention mask, so that words could attend only prior words.\n        attn_mask = None\n        if self.causal_mask:\n            attn_mask = th.full((len(x), len(x)), -math.inf, device=x.device, dtype=x.dtype)\n            attn_mask = th.triu(attn_mask, diagonal=1)\n\n        x = self.transformer_encoder.forward(x, mask=attn_mask, src_key_padding_mask=padding_mask)\n        # x still has shape (sequence_length, batch_size, d_model)\n        \n        # Combine source embeddings into one embedding, one for each target\n        attn_output, attn_weights = self.collect.forward(\n            query=self.targets.expand((self.num_targets, x.size(1), self.d_model)),\n            key=x,\n            value=x,\n            key_padding_mask=padding_mask,\n            need_weights=True\n        )\n        # attn_output has the shape: (num_targets, batch_size, d_model)\n        # attn_weights has the shape: (batch_size, num_targets, sequence_length)\n        \n        attn_output = attn_output.permute((1, 0, 2)).reshape(x.size(1), -1)\n        # attn_output now has the shape: (batch_size, num_targets * d_model)\n\n        output = th.sigmoid(self.output.forward(attn_output))\n        # output has the shape: (batch_size, 
num_targets)\n        \n```\n\nThe training of this model is typical for any PyTorch project: a simple loop over training epochs.\nTraining takes from several minutes to one hour at most.\n\n### Score\n\nUnfortunately, the result was unsatisfying: the score on the validation set was about **0.41** (the lower, the better),\nwhich is even worse than the baseline (0.38)!\n\n### Conclusion\n\nI believe the poor performance of the Transformer architecture\nis due to the **small training dataset**, which has only **294 records**!\nNote that you need to split this dataset into train and validation parts,\nso you are left with even fewer examples to learn from.\n\n## Second Attempt - CatBoostClassifier\n\nI decided to try\n[**CatBoostClassifier**](https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier),\nwhich is based on\n[**gradient boosting**](https://en.wikipedia.org/wiki/Gradient_boosting) over decision trees.\n\nStill, I wanted to avoid word tokens and use word embeddings instead.\nHowever, there was a big problem: CatBoostClassifier was not designed\nto take a sequence of embeddings of **arbitrary length** as its input.\n\nAfter a while, I came up with an idea of how to convert a text\nof arbitrary length to a set of numbers of a fixed length.\n(Disclaimer: I'm sure this idea was described somewhere long before it came to my mind.\nI just never heard about it.)\n\n### Description\n\nThe idea is to use some **keyword embeddings** as anchors,\nand measure the distance from **each word** of the text to all keyword embeddings,\nkeeping only the **minimal distances** found.\n\nThe idea is described in detail below.\n\nTake some keywords or phrases.\nIn this particular task they could be the **symptoms of different diseases**, like:\n\u003e `watery eyes`, `bloody urine`, `fever`, `convulsions`, etc. 
\n\nThe more varied the keywords, the better,\nbut training will also take more time to converge.\n\nWe need to compute their respective **embeddings**.\nWhen a keyword is a single word, it is merely that word's embedding.\nBut when you have a phrase, you need to combine the embeddings\nof all words of the phrase into one embedding.\n\nThere are different ways to do it.\nI used the **mean** operation in this project.\nSome papers say that using **addition** also works fine.\n\nThese **keyword embeddings** now become like **beacons** or **anchors** in the multidimensional embedding space.\nThey will help us to convert a text of *arbitrary length* to a *fixed set of distances* to these anchors.\n\n1. We divide each text into **tokens**: words and punctuation symbols.\n\n2. For each token we take its **embedding**.\n   For missing words we use a special embedding 'unk'.\n   For numbers we use a special embedding 'num'.\n\n3. For each token's embedding we compute the **distances** to all anchor embeddings.\n   Euclidean distance may not be the perfect solution for multidimensional space.\n   That is why we utilize 4 different distance functions: ``(cosine, cityblock, euclidean, braycurtis)``\n\n4. While processing all tokens, we keep only the **minimal distance** to each anchor embedding.\n\n5. This way, after processing we get a set of minimal distances to each anchor embedding,\n   namely 4 minimal distances per anchor, one for each of the four distance functions.\n   In the end, we have multiple new features: ``num_features = num_anchors * num_distance_functions``.\n\nFor this project I used **45** keyword phrases, which produced **180** numerical features (45 anchors x 4 distance functions)\nfor each text in the training dataset.\nNote that tree-based algorithms such as CatBoost do NOT need normalized features.\n\n\u003e This approach has its limitations. It cannot correctly understand negation. 
Example:\n\u003e \n\u003e The owner says: **\"My cow has no bleeding, only diarrhea\"**\n\u003e\n\u003e The model sees the word **bleeding** and may conclude that the cow has an injury,\n\u003e because the model cannot understand the phrase 'has no bleeding'.\n\nTo build the classifier I used the same approach as in the Baseline.\n\nThe model is based on `sklearn.multiclass.OneVsRestClassifier` from the scikit-learn package,\nwhich uses multiple `catboost.CatBoostClassifier` instances to predict the probability of each disease independently.\n\n```python\nimport logging\n\nfrom catboost import CatBoostClassifier\nfrom sklearn.multiclass import OneVsRestClassifier\n\nlog = logging.getLogger(__name__)\n\nRANDOM_SEED = 2021\n\n# Template estimator; OneVsRestClassifier fits one copy per disease label\nestimator = CatBoostClassifier(\n    max_depth=12,\n    iterations=1000,\n    verbose=False,\n    allow_writing_files=False,\n    random_seed=RANDOM_SEED,\n)\n\nmodel = OneVsRestClassifier(estimator=estimator)\nlog.info(f'Initialized model: {model}')\n\n# X_train and y_train are assumed to be prepared beforehand\nlog.info('Starting training...')\nmodel.fit(X_train, y_train)\nlog.info('Done')\n```\n\n### Score\n\nThe score on the validation set was about **0.03** (the lower, the better),\nwhich is far better than the baseline (0.38)!\n\n### Conclusion\n\nThis approach seems to work fine.\nIt does not treat words as a set of unique tokens.\nInstead, it relies on the semantic meaning of a word, which is encoded in its embedding.\n\nThus, the model can (hopefully) manage to work with texts and words **which it has never seen before**,\nsimply because the embeddings of the words in new texts will, somehow,\nbe related to the anchor embeddings which the model uses.\n\n## Final Thoughts\n\nI can hardly beat the *huge and heavily trained models* which gain the highest scores on the leaderboard.\nI believe, though, that my idea is right and my model is robust to new, unexpected texts.\n\nMy thoughts about huge, highly overfitted models, which gain top scores in competitions,\nare best expressed by Michał Marcinkiewicz in his article:\n[\"The Real World is not a Kaggle 
Competition\"](https://www.netguru.com/codestories/real-world-is-not-a-kaggle-competition).\n\nThe approach I used has its limitations.\nAs said before, it cannot correctly understand negation or any complex relations between words.\nA Transformer architecture is capable of doing so,\nbut it requires much more training data than was given in this competition.\n\nWorking on natural language processing is fun!\n\n:)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdiovisgood%2Fagrocode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdiovisgood%2Fagrocode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdiovisgood%2Fagrocode/lists"}