{"id":26148827,"url":"https://github.com/hpcaitech/cachedembedding","last_synced_at":"2025-04-14T03:41:12.534Z","repository":{"id":56789292,"uuid":"493209224","full_name":"hpcaitech/CachedEmbedding","owner":"hpcaitech","description":"A memory efficient DLRM training solution using ColossalAI","archived":false,"fork":false,"pushed_at":"2022-11-22T08:17:59.000Z","size":1267,"stargazers_count":104,"open_issues_count":2,"forks_count":14,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-11T14:48:23.534Z","etag":null,"topics":["colossal-ai","deep-learning","dlrm","embeddings","nlp","pytorch","recommandation-system"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hpcaitech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"license","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-05-17T10:49:53.000Z","updated_at":"2025-03-29T09:38:36.000Z","dependencies_parsed_at":"2022-08-16T08:50:16.159Z","dependency_job_id":null,"html_url":"https://github.com/hpcaitech/CachedEmbedding","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FCachedEmbedding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FCachedEmbedding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FCachedEmbedding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FCachedEmbedding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hpcaitech","download_url":"https:/
/codeload.github.com/hpcaitech/CachedEmbedding/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248818525,"owners_count":21166438,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["colossal-ai","deep-learning","dlrm","embeddings","nlp","pytorch","recommandation-system"],"created_at":"2025-03-11T05:21:52.701Z","updated_at":"2025-04-14T03:41:12.505Z","avatar_url":"https://github.com/hpcaitech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## CachedEmbedding: larger embedding tables, smaller GPU memory budget.\n\nThe embedding tables in deep learning recommendation system models are becoming extremely large and no longer fit in GPU memory.\nThis project provides an efficient way to train extremely large recommendation system models.\nThe entire training runs on GPU with synchronous parameter updates.\n\nThis project applies CachedEmbedding, which extends the vanilla\n[PyTorch EmbeddingBag](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html#torch.nn.EmbeddingBag) \nwith the help of [ColossalAI](https://github.com/hpcaitech/ColossalAI).\nCachedEmbedding uses a [software cache approach](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.parallel.layers.html) to dynamically manage the extremely large embedding table across CPU and GPU memory.\nFor example, this repo can train a DLRM model with a **91.10 GB** embedding table on the Criteo 1TB dataset while allocating just **3.75 GB** of CUDA memory on a single GPU!\n\nIn 
order to reduce the cache overhead, we designed a \"far-sighted\" cache mechanism. \nInstead of performing cache operations only on the first mini-batch, we fetch several mini-batches that will be used later and perform their cache query operations together.\nIt also uses a pipeline to overlap data loading with model training, as shown in the following figure.\n\n\u003cimg src=\"./pics/prefetch.png\" width=800/\u003e\n\nDespite the extra cache indexing and CPU-GPU overhead, the end-to-end performance of our system drops very little compared to torchrec.\nHowever, torchrec usually requires an order of magnitude more CUDA memory.\nAlso, our software cache is implemented in PyTorch without any customized C++/CUDA kernels, so developers can customize or optimize it according to their needs.\n\n### Dataset  \n1. [Criteo Kaggle](https://www.kaggle.com/c/avazu-ctr-prediction/data)\n2. [Avazu](https://www.kaggle.com/c/avazu-ctr-prediction/data)\n3. [Criteo 1TB](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/) \n\nThe preprocessing is largely derived from \n[Torchrec's utilities](https://github.com/pytorch/torchrec/blob/main/torchrec/datasets/scripts/npy_preproc_criteo.py) \nand the [Avazu kaggle community](https://www.kaggle.com/code/leejunseok97/deepfm-deepctr-torch).\nPlease refer to the `scripts/preprocess` directory for details.\n\n### Usage\n\n1. 
Install Dependencies\n\nInstall [ColossalAI](https://github.com/hpcaitech/ColossalAI) (commit id e8d8eda5e7a0619bd779e35065397679e1536dcd)\n\nhttps://github.com/hpcaitech/ColossalAI\n\nInstall our customized [torchrec](https://github.com/hpcaitech/torchrec) (commit id e8d8eda5e7a0619bd779e35065397679e1536dcd)\n\nhttps://github.com/hpcaitech/torchrec\n\nAlternatively, build a Docker image using [docker/Dockerfile](./docker/Dockerfile),\nor use the prebuilt Docker image from Docker Hub.\n\n```\ndocker pull hpcaitech/cacheembedding:0.2.2\n```\n\nThen launch a Docker container.\n\n```\nbash ./docker/launch.sh\n```\n\n2. Run\n\nAll the commands to run DLRM on the three datasets are in `scripts/run.sh`.\n```\nbash scripts/run.sh\n```\n\nSet `--prefetch_num` to use prefetching.\n\n### Model  \nCurrently, this repo only contains Facebook DLRM models, and we are working on testing more recommendation models.\n\n### Performance\n\nThe DLRM performance on the three datasets using the ColossalAI version (this repo) and torchrec (with UVM) is shown below. The cache ratio of FreqAwareEmbedding is set to 1%. 
The evaluation was conducted on a single A100 GPU (80 GB memory) with an AMD 7543 32-core CPU (512 GB memory).\n\n|            |   method   | AUROC over Test after 1 Epoch | Acc over Test | Throughput | Time to Train 1 Epoch | GPU memory allocated (GB) | GPU memory reserved (GB) | CPU memory usage (GB) |\n|:----------:|:----------:|:-----------------------------:|:-------------:|:----------:|:---------------------:|:-------------------------:|:------------------------:|:---------------------:|\n| criteo 1TB | ColossalAI |          0.791299403          |  0.967155457  |   42 it/s  |         1h40m         |            3.75           |           5.04           |         94.39         |\n|            |  torchrec  |           0.79515636          |  0.967177451  |   45 it/s  |         1h35m         |           66.54           |           68.43          |          7.7          |\n|   kaggle   | ColossalAI |          0.776755869          |  0.779025435  |   50 it/s  |          49s          |            0.9            |           2.14           |         34.66         |\n|            |  torchrec  |          0.786652029          |  0.782288849  |   81 it/s  |          30s          |           16.13           |           17.99          |         13.89         |\n|   avazu    | ColossalAI |          0.72732079           |  0.824390948  |   72 it/s  |          31s          |            0.31           |           1.06           |         16.89         |\n|            |  torchrec  |          0.725972056          |  0.824484706  |  111 it/s  |          21s          |            4.53           |           5.83           |         12.25         |\n\n### Cite us\n```\n@article{fang2022frequency,\n  title={A Frequency-aware Software Cache for Large Recommendation System Embeddings},\n  author={Fang, Jiarui and Zhang, Geng and Han, Jiatong and Li, Shenggui and Bian, Zhengda and Li, Yongbin and Liu, Jin and You, Yang},\n  journal={arXiv preprint arXiv:2208.05321},\n  
year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcaitech%2Fcachedembedding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhpcaitech%2Fcachedembedding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcaitech%2Fcachedembedding/lists"}