{"id":18377494,"url":"https://github.com/lightning-universe/text-classification-component","last_synced_at":"2025-04-06T21:31:42.212Z","repository":{"id":64493841,"uuid":"569625017","full_name":"Lightning-Universe/Text-Classification-component","owner":"Lightning-Universe","description":null,"archived":false,"fork":false,"pushed_at":"2023-01-23T18:51:51.000Z","size":50,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-05-14T00:06:00.901Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Lightning-Universe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-23T08:55:39.000Z","updated_at":"2023-04-02T06:47:45.000Z","dependencies_parsed_at":"2023-02-13T02:00:55.605Z","dependency_job_id":null,"html_url":"https://github.com/Lightning-Universe/Text-Classification-component","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-Universe%2FText-Classification-component","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-Universe%2FText-Classification-component/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-Universe%2FText-Classification-component/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-Universe%2FText-Classification-component/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Lightning-Univer
se","download_url":"https://codeload.github.com/Lightning-Universe/Text-Classification-component/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223264269,"owners_count":17116084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T00:28:18.479Z","updated_at":"2024-11-06T00:28:18.916Z","avatar_url":"https://github.com/Lightning-Universe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003e\n        \u003cimg src=\"https://lightningaidev.wpengine.com/wp-content/uploads/2022/11/Asset-54-15.png\"\u003e\n        \u003cbr\u003e\n        Finetune large language models with Lightning\n        \u003c/br\u003e\n    \u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#run\"\u003eRun\u003c/a\u003e •\n  \u003ca href=\"https://www.lightning.ai/\"\u003eLightning AI\u003c/a\u003e •\n  \u003ca 
href=\"https://lightning.ai/lightning-docs/\"\u003eDocs\u003c/a\u003e\n\u003c/p\u003e\n\n[![ReadTheDocs](https://readthedocs.org/projects/pytorch-lightning/badge/?version=stable)](https://lightning.ai/lightning-docs/)\n[![Slack](https://img.shields.io/badge/slack-chat-green.svg?logo=slack)](https://www.pytorchlightning.ai/community)\n[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/Lightning-AI/lightning/blob/master/LICENSE)\n\n\u003c/div\u003e\n\u003c/div\u003e\n\n______________________________________________________________________\n\nUse Lightning Classify to pre-train or fine-tune a large language model for text classification, \nwith as many parameters as you want (up to billions!). \n\nYou can do this:\n* using multiple GPUs\n* across multiple machines\n* on your own data\n* all without any infrastructure hassle! \n\nAll of this is handled by the [Lightning Apps framework](https://lightning.ai/lightning-docs/).\n\n## Run\n\nTo run, paste the following code snippet into a file named `app.py`:\n\n```python\n#! pip install git+https://github.com/Lightning-AI/LAI-Text-Classification-Component\n#! curl -L https://bit.ly/yelp_train --create-dirs -o ${HOME}/data/yelp/train.csv -C -\n#! 
curl -L https://bit.ly/yelp_test --create-dirs -o ${HOME}/data/yelp/test.csv -C -\n\nimport lightning as L\n\nimport os, copy, torch\n\nfrom transformers import BloomForSequenceClassification, BloomTokenizerFast\nimport lai_textclf as txtclf\n\n\nclass MyTextClassification(L.LightningWork):\n    def __init__(self, *args, tb_drive, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.tensorboard_drive = tb_drive\n\n    def run(self):\n        txtclf.warn_if_drive_not_empty(self.tensorboard_drive)\n        txtclf.warn_if_local()\n\n        # --------------------\n        # CONFIGURE YOUR MODEL\n        # --------------------\n        # Choose from: bloom-560m, bloom-1b1, bloom-1b7, bloom-3b\n        # For local runs: Choose a small model (i.e. bloom-560m)\n        model_type = \"bigscience/bloom-3b\"\n        tokenizer = BloomTokenizerFast.from_pretrained(model_type)\n        tokenizer.pad_token = tokenizer.eos_token\n        tokenizer.padding_side = \"left\"\n        num_labels = 5\n        module = BloomForSequenceClassification.from_pretrained(\n            model_type, num_labels=num_labels, ignore_mismatched_sizes=True\n        )\n\n        # -------------------\n        # CONFIGURE YOUR DATA\n        # -------------------\n        data_path = os.path.expanduser(\"~/data/yelp\")\n        train_dataloader = txtclf.TextClassificationDataLoader(\n            dataset=txtclf.TextDataset(csv_file=os.path.join(data_path, \"train.csv\")),\n            tokenizer=tokenizer,\n        )\n        val_dataloader = txtclf.TextClassificationDataLoader(\n            dataset=txtclf.TextDataset(csv_file=os.path.join(data_path, \"test.csv\")),\n            tokenizer=tokenizer,\n        )\n\n        # -------------------\n        # RUN YOUR FINETUNING\n        # -------------------\n        pl_module = TextClassification(model=module, tokenizer=tokenizer,\n                                       metrics=txtclf.clf_metrics(num_labels))\n\n        # For local runs without 
multiple gpus, change strategy to \"ddp\"\n        trainer = L.Trainer(\n            max_epochs=2, limit_train_batches=100, limit_val_batches=100,\n            strategy=\"deepspeed_stage_3_offload\", precision=16,\n            callbacks=txtclf.default_callbacks(), log_every_n_steps=5,\n            logger=txtclf.DriveTensorBoardLogger(save_dir=\".\", drive=self.tensorboard_drive),\n        )\n        trainer.fit(pl_module, train_dataloader, val_dataloader)\n\n\nclass TextClassification(L.LightningModule):\n    def __init__(self, model, tokenizer, metrics=None):\n        super().__init__()\n        self.model = model\n        self.tokenizer = tokenizer\n        self.train_metrics = copy.deepcopy(metrics or {})\n        self.val_metrics = copy.deepcopy(metrics or {})\n\n    def training_step(self, batch, batch_idx):\n        output = self.model(**batch)\n        self.log(\"train_loss\", output.loss, prog_bar=True, on_epoch=True, on_step=True)\n        self.train_metrics(output.logits, batch[\"labels\"])\n        self.log_dict(self.train_metrics, on_epoch=True, on_step=True)\n        return output.loss\n\n    def validation_step(self, batch, batch_idx):\n        output = self.model(**batch)\n        self.log(\"val_loss\", output.loss, prog_bar=True)\n        self.val_metrics(output.logits, batch[\"labels\"])\n        self.log_dict(self.val_metrics)\n\n    def configure_optimizers(self):\n        return torch.optim.AdamW(self.parameters(), lr=0.0001)\n\n\ncomponent = txtclf.MultiNodeLightningTrainerWithTensorboard(\n    MyTextClassification, num_nodes=2, cloud_compute=L.CloudCompute(\"gpu-fast-multi\", disk_size=50)\n)\napp = L.LightningApp(component)\n```\n\n### Running on the cloud\n\n```bash\nlightning run app app.py --cloud\n```\n\nDon't want to use the public cloud? Contact us at `product@lightning.ai` for early access to run on your private cluster (BYOC)!\n\n\n### Running locally (limited)\nThis example is optimized for the cloud. 
To run it locally on your laptop, choose a smaller model and change the trainer settings like so:\n\n```python\nclass MyTextClassification(L.LightningWork):\n    def run(self):\n        ...\n        trainer = L.Trainer(strategy=\"ddp\")\n        ...\n```\nThen run the app with:\n\n```bash\nlightning run app app.py --setup\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flightning-universe%2Ftext-classification-component","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flightning-universe%2Ftext-classification-component","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flightning-universe%2Ftext-classification-component/lists"}