{"id":15787353,"url":"https://github.com/tomtung/omikuji","last_synced_at":"2025-04-09T20:12:49.833Z","repository":{"id":35069792,"uuid":"163709649","full_name":"tomtung/omikuji","owner":"tomtung","description":"An efficient implementation of Partitioned Label Trees \u0026 its variations for extreme multi-label classification","archived":false,"fork":false,"pushed_at":"2024-02-20T18:20:16.000Z","size":23225,"stargazers_count":87,"open_issues_count":9,"forks_count":11,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-09T20:12:37.808Z","etag":null,"topics":["classification","extreme-classification","extreme-multi-label-classification","machine-learning","multi-label-classification","rust","supervised-learning"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/omikuji","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomtung.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-01T02:51:59.000Z","updated_at":"2025-04-03T07:23:43.000Z","dependencies_parsed_at":"2022-09-26T16:20:24.293Z","dependency_job_id":"a14b34df-aa3f-4188-a9d5-9b1f36f5c20f","html_url":"https://github.com/tomtung/omikuji","commit_stats":{"total_commits":201,"total_committers":3,"mean_commits":67.0,"dds":"0.10447761194029848","last_synced_commit":"4807729ccbdc964e4619e571f760b56112cda6fe"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomtung%2Fomikuji","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/tomtung%2Fomikuji/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomtung%2Fomikuji/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomtung%2Fomikuji/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomtung","download_url":"https://codeload.github.com/tomtung/omikuji/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248103872,"owners_count":21048245,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","extreme-classification","extreme-multi-label-classification","machine-learning","multi-label-classification","rust","supervised-learning"],"created_at":"2024-10-04T21:07:54.209Z","updated_at":"2025-04-09T20:12:49.808Z","avatar_url":"https://github.com/tomtung.png","language":"Rust","funding_links":[],"categories":["Machine Learning"],"sub_categories":[],"readme":"# Omikuji\n[![Build Status](https://dev.azure.com/yubingdong/omikuji/_apis/build/status/tomtung.omikuji?branchName=master)](https://dev.azure.com/yubingdong/omikuji/_build/latest?definitionId=1\u0026branchName=master) [![Crate version](https://img.shields.io/crates/v/omikuji)](https://crates.io/crates/omikuji) [![PyPI version](https://img.shields.io/pypi/v/omikuji)](https://pypi.org/project/omikuji/)\n\nAn efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, written in Rust🦀 with love💖.\n\n## Features \u0026 Performance\n\nOmikuji has been 
tested on datasets from the [Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html). All tests below are run on a quad-core Intel® Core™ i7-6700 CPU, and we allowed as many cores to be utilized as possible. We measured training time, and calculated precisions at 1, 3, and 5. (Note that, due to randomness, results might vary from run to run, especially for smaller datasets.)\n\n### Parabel, better parallelized\n\nOmikuji provides a more parallelized implementation of Parabel (Prabhu et al., 2018) that trains faster when more CPU cores are available. Compared to the [original implementation](http://manikvarma.org/code/Parabel/download.html) written in C++, which can only utilize the same number of CPU cores as the number of trees (3 by default), Omikuji maintains the same level of precision but trains 1.3x to 1.7x faster on our quad-core machine. **Further speed-up is possible if more CPU cores are available**.\n\n| Dataset         \t| Metric     \t| Parabel \t| Omikuji\u003cbr/\u003e(balanced,\u003cbr/\u003ecluster.k=2) \t|\n|-----------------\t|------------\t|---------\t|------------------------------------------\t|\n|  EURLex-4K      \t| P@1        \t| 82.2    \t| 82.1                                     \t|\n|                 \t| P@3        \t| 68.8    \t| 68.8                                     \t|\n|                 \t| P@5        \t| 57.6    \t| 57.7                                     \t|\n|                 \t| Train Time \t| 18s     \t| 14s                                      \t|\n| Amazon-670K     \t| P@1        \t| 44.9    \t| 44.8                                     \t|\n|                 \t| P@3        \t| 39.8    \t| 39.8                                     \t|\n|                 \t| P@5        \t| 36.0    \t| 36.0                                     \t|\n|                 \t| Train Time \t| 404s    \t| 234s                                     \t|\n|  WikiLSHTC-325K \t| P@1        \t| 65.0    \t| 64.8                 
                    \t|\n|                 \t| P@3        \t| 43.2    \t| 43.1                                     \t|\n|                 \t| P@5        \t| 32.0    \t| 32.1                                     \t|\n|                 \t| Train Time \t| 959s    \t| 659s                                     \t|\n\n### Regular k-means for shallow trees\n\nFollowing Bonsai (Khandagale et al., 2019), Omikuji supports using regular k-means instead of balanced 2-means clustering for tree construction, which results in wider, shallower and unbalanced trees that train slower but have better precision. Compared to the [original Bonsai implementation](https://github.com/xmc-aalto/bonsai), Omikuji also achieves the same precisions while training 2.6x to 4.6x faster on our quad-core machine. (Similarly, further speed-up is possible if more CPU cores are available.)\n\n| Dataset         \t| Metric     \t| Bonsai  \t| Omikuji\u003cbr/\u003e(unbalanced,\u003cbr/\u003ecluster.k=100,\u003cbr/\u003emax\\_depth=3)\t|\n|-----------------\t|------------\t|---------\t|--------------------------------------------------------------\t|\n|  EURLex-4K      \t| P@1        \t| 82.8    \t| 83.0                                                         \t|\n|                 \t| P@3        \t| 69.4    \t| 69.5                                                         \t|\n|                 \t| P@5        \t| 58.1    \t| 58.3                                                         \t|\n|                 \t| Train Time \t| 87s     \t| 19s                                                          \t|\n| Amazon-670K     \t| P@1        \t| 45.5*   \t| 45.6                                                         \t|\n|                 \t| P@3        \t| 40.3*   \t| 40.4                                                         \t|\n|                 \t| P@5        \t| 36.5*   \t| 36.6                                                         \t|\n|                 \t| Train Time \t| 5,759s  \t| 1,753s            
                                           \t|\n|  WikiLSHTC-325K \t| P@1        \t| 66.6*   \t| 66.6                                                         \t|\n|                 \t| P@3        \t| 44.5*   \t| 44.4                                                         \t|\n|                 \t| P@5        \t| 33.0*   \t| 33.0                                                         \t|\n|                 \t| Train Time \t| 11,156s \t| 4,259s                                                       \t|\n\n*\\*Precision numbers as reported in the paper; our machine doesn't have enough memory to run the full prediction with their implementation.*\n\n### Balanced k-means for balanced shallow trees\n\nSometimes it's desirable to have shallow and wide trees that are also balanced, in which case Omikuji supports the balanced k-means algorithm used by HOMER (Tsoumakas et al., 2008) for clustering as well.\n\n| Dataset         \t| Metric     \t| Omikuji\u003cbr/\u003e(balanced,\u003cbr/\u003ecluster.k=100)\t|\n|-----------------\t|------------\t|------------------------------------------\t|\n|  EURLex-4K      \t| P@1        \t| 82.1                                    \t|\n|                 \t| P@3        \t| 69.4                                    \t|\n|                 \t| P@5        \t| 58.1                                    \t|\n|                 \t| Train Time \t| 19s                                     \t|\n| Amazon-670K     \t| P@1        \t| 45.4                                    \t|\n|                 \t| P@3        \t| 40.3                                    \t|\n|                 \t| P@5        \t| 36.5                                    \t|\n|                 \t| Train Time \t| 1,153s                                  \t|\n|  WikiLSHTC-325K \t| P@1        \t| 65.6                                    \t|\n|                 \t| P@3        \t| 43.6                                    \t|\n|                 \t| P@5        \t| 32.5                                    
\t|\n|                 \t| Train Time \t| 3,028s                                  \t|\n\n### Layer collapsing for balanced shallow trees\n\nAn alternative way for building balanced, shallow and wide trees is to collapse adjacent layers, similar to the tree compression step used in AttentionXML (You et al., 2019): intermediate layers are removed, and their children replace them as the children of their parents. For example, with balanced 2-means clustering, if we collapse 5 layers after each layer, we can increase the tree arity from 2 to 2⁵⁺¹ = 64.\n\n| Dataset         \t| Metric     \t| Omikuji\u003cbr/\u003e(balanced,\u003cbr/\u003ecluster.k=2,\u003cbr/\u003ecollapse 5 layers)\t|\n|-----------------\t|------------\t|---------------------------------------------------------------\t|\n|  EURLex-4K      \t| P@1        \t| 82.4                                                          \t|\n|                 \t| P@3        \t| 69.3                                                          \t|\n|                 \t| P@5        \t| 58.0                                                          \t|\n|                 \t| Train Time \t| 16s                                                           \t|\n| Amazon-670K     \t| P@1        \t| 45.3                                                          \t|\n|                 \t| P@3        \t| 40.2                                                          \t|\n|                 \t| P@5        \t| 36.4                                                          \t|\n|                 \t| Train Time \t| 460s                                                           \t|\n|  WikiLSHTC-325K \t| P@1        \t| 64.9                                                           \t|\n|                 \t| P@3        \t| 43.3                                                          \t|\n|                 \t| P@5        \t| 32.3                                                          \t|\n|                 \t| Train Time \t| 1,649s               
                                         \t|\n\n## Build \u0026 Install\nOmikuji can be easily built \u0026 installed with [Cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html) as a CLI app:\n```\ncargo install omikuji --features cli --locked\n```\n\nOr install from the latest source:\n```\ncargo install --git https://github.com/tomtung/omikuji.git --features cli --locked\n```\n\nThe CLI app will be available as `omikuji`. For example, to reproduce the results on the EURLex-4K dataset:\n```\nomikuji train eurlex_train.txt --model_path ./model\nomikuji test ./model eurlex_test.txt --out_path predictions.txt\n```\n\n\n### Python Binding\n\nA simple Python binding is also available for training and prediction. It can be installed via `pip`:\n```\npip install omikuji\n```\n\nNote that you might still need to install Cargo should compilation become necessary.\n\nYou can also install from the latest source:\n```\npip install git+https://github.com/tomtung/omikuji.git -v\n```\n\nThe following script demonstrates how to use the Python binding to train a model and make predictions:\n\n```python\nimport omikuji\n\n# Train\nhyper_param = omikuji.Model.default_hyper_param()\n# Adjust hyper-parameters as needed\nhyper_param.n_trees = 5\nmodel = omikuji.Model.train_on_data(\"./eurlex_train.txt\", hyper_param)\n\n# Serialize \u0026 de-serialize\nmodel.save(\"./model\")\nmodel = omikuji.Model.load(\"./model\")\n# Optionally densify model weights to trade off between prediction speed and memory usage\nmodel.densify_weights(0.05)\n\n# Predict\nfeature_value_pairs = [\n    (0, 0.101468),\n    (1, 0.554374),\n    (2, 0.235760),\n    (3, 0.065255),\n    (8, 0.152305),\n    (10, 0.155051),\n    # ...\n]\nlabel_score_pairs =  model.predict(feature_value_pairs)\n```\n\n## Usage\n```\n$ omikuji train --help\nTrain a new omikuji model\n\nUSAGE:\n    omikuji train [OPTIONS] \u003cTRAINING_DATA_PATH\u003e\n\nARGS:\n    \u003cTRAINING_DATA_PATH\u003e\n            Path to 
training dataset file\n\n            The dataset file is expected to be in the format of the Extreme Classification\n            Repository.\n\nOPTIONS:\n        --centroid_threshold \u003cTHRESHOLD\u003e\n            Threshold for pruning label centroid vectors\n\n            [default: 0]\n\n        --cluster.eps \u003cCLUSTER_EPS\u003e\n            Epsilon value for determining linear classifier convergence\n\n            [default: 0.0001]\n\n        --cluster.k \u003cK\u003e\n            Number of clusters\n\n            [default: 2]\n\n        --cluster.min_size \u003cMIN_SIZE\u003e\n            Labels in clusters with sizes smaller than this threshold are reassigned to other\n            clusters instead\n\n            [default: 2]\n\n        --cluster.unbalanced\n            Perform regular k-means clustering instead of balanced k-means clustering\n\n        --collapse_every_n_layers \u003cN_LAYERS\u003e\n            Number of adjacent layers to collapse\n\n            This increases tree arity and decreases tree depth.\n\n            [default: 0]\n\n    -h, --help\n            Print help information\n\n        --linear.c \u003cC\u003e\n            Cost coefficient for regularizing linear classifiers\n\n            [default: 1]\n\n        --linear.eps \u003cLINEAR_EPS\u003e\n            Epsilon value for determining linear classifier convergence\n\n            [default: 0.1]\n\n        --linear.loss \u003cLOSS\u003e\n            Loss function used by linear classifiers\n\n            [default: hinge]\n            [possible values: hinge, log]\n\n        --linear.max_iter \u003cM\u003e\n            Max number of iterations for training each linear classifier\n\n            [default: 20]\n\n        --linear.weight_threshold \u003cMIN_WEIGHT\u003e\n            Threshold for pruning weight vectors of linear classifiers\n\n            [default: 0.1]\n\n        --max_depth \u003cDEPTH\u003e\n            Maximum tree depth\n\n            [default: 20]\n\n        
--min_branch_size \u003cSIZE\u003e\n            Number of labels below which no further clustering \u0026 branching is done\n\n            [default: 100]\n\n        --model_path \u003cMODEL_PATH\u003e\n            Optional path of the directory where the trained model will be saved if provided\n\n            If an model with compatible settings is already saved in the given directory, the newly\n            trained trees will be added to the existing model\")\n\n        --n_threads \u003cN_THREADS\u003e\n            Number of worker threads\n\n            If 0, the number is selected automatically.\n\n            [default: 0]\n\n        --n_trees \u003cN_TREES\u003e\n            Number of trees\n\n            [default: 3]\n\n        --train_trees_1_by_1\n            Finish training each tree before start training the next\n\n            This limits initial parallelization but saves memory.\n\n        --tree_structure_only\n            Build the trees without training classifiers\n\n            Might be useful when a downstream user needs the tree structures only.\n```\n\n```\n$ omikuji test --help\nTest an existing omikuji model\n\nUSAGE:\n    omikuji test [OPTIONS] \u003cMODEL_PATH\u003e \u003cTEST_DATA_PATH\u003e\n\nARGS:\n    \u003cMODEL_PATH\u003e\n            Path of the directory where the trained model is saved\n\n    \u003cTEST_DATA_PATH\u003e\n            Path to test dataset file\n\n            The dataset file is expected to be in the format of the Extreme Classification\n            Repository.\n\nOPTIONS:\n        --beam_size \u003cBEAM_SIZE\u003e\n            Beam size for beam search\n\n            [default: 10]\n\n    -h, --help\n            Print help information\n\n        --k_top \u003cK\u003e\n            Number of top predictions to write out for each test example\n\n            [default: 5]\n\n        --max_sparse_density \u003cDENSITY\u003e\n            Density threshold above which sparse weight vectors are converted to dense format\n\n      
      Lower values speed up prediction at the cost of more memory usage.\n\n            [default: 0.1]\n\n        --n_threads \u003cN_THREADS\u003e\n            Number of worker threads\n\n            If 0, the number is selected automatically.\n\n            [default: 0]\n\n        --out_path \u003cOUT_PATH\u003e\n            Path to the which predictions will be written, if provided\n```\n\n### Data format\n\nOur implementation takes dataset files formatted as those provided in the [Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html). A data file starts with a header line with three space-separated integers: total number of examples, number of features, and number of labels. Following the header line, there is one line per example, starting with comma-separated labels, followed by space-separated feature:value pairs:\n```\nlabel1,label2,...labelk ft1:ft1_val ft2:ft2_val ft3:ft3_val .. ftd:ftd_val\n```\n\n## Trivia\n\nThe project name comes from [o-mikuji](https://en.wikipedia.org/wiki/O-mikuji) (御神籤), which are predictions about one's future written on strips of paper (labels?) at jinjas and temples in Japan, often tied to branches of pine trees after they are read.\n\n## References\n- Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 993–1002.\n- S. Khandagale, H. Xiao, and R. Babbar, “Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification,” Apr. 2019.\n- G. Tsoumakas, I. Katakis, and I. Vlahavas, “Effective and efficient multilabel classification in domains with large number of labels,” ECML, 2008.\n- R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu, “AttentionXML: Extreme Multi-Label Text Classification with Multi-Label Attention Based Recurrent Neural Networks,” Jun. 
2019.\n\n## License\nOmikuji is licensed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomtung%2Fomikuji","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomtung%2Fomikuji","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomtung%2Fomikuji/lists"}