{"id":15601004,"url":"https://github.com/lucidrains/coco-lm-pytorch","last_synced_at":"2025-04-30T11:33:35.446Z","repository":{"id":43898854,"uuid":"343864811","full_name":"lucidrains/coco-lm-pytorch","owner":"lucidrains","description":"Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch","archived":false,"fork":false,"pushed_at":"2021-03-03T20:32:45.000Z","size":123,"stargazers_count":45,"open_issues_count":2,"forks_count":7,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-15T01:05:25.823Z","etag":null,"topics":["artificial-intelligence","deep-learning","pre-training","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-02T17:57:55.000Z","updated_at":"2024-11-05T06:46:43.000Z","dependencies_parsed_at":"2022-09-17T06:40:47.919Z","dependency_job_id":null,"html_url":"https://github.com/lucidrains/coco-lm-pytorch","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcoco-lm-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcoco-lm-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcoco-lm-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcoco-lm-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/
lucidrains","download_url":"https://codeload.github.com/lucidrains/coco-lm-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251691659,"owners_count":21628367,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","pre-training","transformers"],"created_at":"2024-10-03T02:11:28.893Z","updated_at":"2025-04-30T11:33:35.378Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"./coco.png\" width=\"500px\"\u003e\u003c/img\u003e\n\n## COCO LM Pretraining (wip)\n\nImplementation of \u003ca href=\"https://arxiv.org/abs/2102.08473\"\u003eCOCO-LM\u003c/a\u003e, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch. They were able to make contrastive learning work in a self-supervised manner for language model pretraining. 
It seems like a solid successor to ELECTRA.\n\n## Install\n\n```bash\n$ pip install coco-lm-pytorch\n```\n\n## Usage\n\nAn example using the `x-transformers` library\n\n```bash\n$ pip install x-transformers\n```\nThen\n\n```python\nimport torch\nfrom coco_lm_pytorch import COCO\n\n# (1) instantiate the generator and discriminator, making sure that the generator is roughly a quarter to a half of the size of the discriminator\n\nfrom x_transformers import TransformerWrapper, Encoder\n\ngenerator = TransformerWrapper(\n    num_tokens = 20000,\n    emb_dim = 128,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 256,         # smaller hidden dimension\n        heads = 4,         # fewer heads\n        ff_mult = 2,       # smaller feedforward dimension\n        depth = 1\n    )\n)\n\ndiscriminator = TransformerWrapper(\n    num_tokens = 20000,\n    emb_dim = 128,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 1024,\n        heads = 16,\n        ff_mult = 4,\n        depth = 12\n    )\n)\n\n# (2) weight tie the token and positional embeddings of generator and discriminator\n\ngenerator.token_emb = discriminator.token_emb\ngenerator.pos_emb = discriminator.pos_emb\n\n# weight tie any other embeddings if available, token type embeddings, etc.\n\n# (3) instantiate COCO\n\ntrainer = COCO(\n    generator,\n    discriminator,\n    discr_dim = 1024,            # the embedding dimension of the discriminator\n    discr_layer = 'norm',        # name of the discriminator layer whose output is used to predict whether each token is original or replaced\n    cls_token_id = 1,            # a token id must be reserved for [CLS], which is prepended to the sequence for contrastive learning\n    mask_token_id = 2,           # the token id reserved for masking\n    pad_token_id = 0,            # the token id for padding\n    mask_prob = 0.15,            # masking probability for masked language modeling\n    mask_ignore_token_ids = [],  # ids of 
tokens to exclude from masking, e.g. (cls, sep)\n    cl_weight = 1.,              # weight for the contrastive learning loss\n    disc_weight = 1.,            # weight for the corrective learning loss\n    gen_weight = 1.              # weight for the MLM loss\n)\n\n# (4) train\n\ndata = torch.randint(0, 20000, (1, 1024))\n\nloss = trainer(data)\nloss.backward()\n\n# after much training, the discriminator should have improved\n\ntorch.save(discriminator, './pretrained-model.pt')\n```\n\n## Citations\n\n```bibtex\n@misc{meng2021cocolm,\n    title   = {COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining},\n    author  = {Yu Meng and Chenyan Xiong and Payal Bajaj and Saurabh Tiwary and Paul Bennett and Jiawei Han and Xia Song},\n    year    = {2021},\n    eprint  = {2102.08473},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fcoco-lm-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fcoco-lm-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fcoco-lm-pytorch/lists"}