{"id":13609742,"url":"https://github.com/IndicoDataSolutions/finetune","last_synced_at":"2025-04-12T20:32:33.856Z","repository":{"id":33246335,"uuid":"137103159","full_name":"IndicoDataSolutions/finetune","owner":"IndicoDataSolutions","description":"Scikit-learn style model finetuning for NLP","archived":false,"fork":false,"pushed_at":"2024-08-25T07:39:58.000Z","size":443388,"stargazers_count":700,"open_issues_count":46,"forks_count":81,"subscribers_count":34,"default_branch":"development","last_synced_at":"2024-08-25T09:56:58.677Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://finetune.indico.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IndicoDataSolutions.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.txt","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-12T17:02:16.000Z","updated_at":"2024-08-25T09:56:58.678Z","dependencies_parsed_at":"2023-09-26T11:05:07.637Z","dependency_job_id":"4327fcb9-95e1-43e2-a0fb-d1e9f4fbc78c","html_url":"https://github.com/IndicoDataSolutions/finetune","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IndicoDataSolutions%2Ffinetune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IndicoDataSolutions%2Ffinetune/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IndicoDataSolutions%2Ffinetune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IndicoDataSolutions%2Ffinetune/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IndicoDataSolutions","download_url":"https://codeload.github.com/IndicoDataSolutions/finetune/tar.gz/refs/heads/development","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223539353,"owners_count":17162088,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:01:37.650Z","updated_at":"2024-11-07T15:31:50.536Z","avatar_url":"https://github.com/IndicoDataSolutions.png","language":"Python","funding_links":[],"categories":["文本数据和NLP","Python"],"sub_categories":[],"readme":"[![DOI](https://zenodo.org/badge/137103159.svg)](https://zenodo.org/badge/latestdoi/137103159)\n\n\u003cimg src=\"https://i.imgur.com/kYL058E.png\" width=\"100%\"\u003e\n\n**Scikit-learn style model finetuning for NLP**\n\nFinetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide variety of downstream tasks.\n\nFinetune currently supports TensorFlow implementations of the following models:\n\n1. **BERT**, from [\"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\"](https://arxiv.org/abs/1810.04805)\n2. **RoBERTa**, from [\"RoBERTa: A Robustly Optimized BERT Pretraining Approach\"](https://arxiv.org/abs/1907.11692)\n3. **GPT**, from [\"Improving Language Understanding by Generative Pre-Training\"](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)\n4. 
**GPT2**, from [\"Language Models are Unsupervised Multitask Learners\"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)\n5. **TextCNN**, from [\"Convolutional Neural Networks for Sentence Classification\"](https://arxiv.org/abs/1408.5882)\n6. **Temporal Convolution Network**, from [\"An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling\"](https://arxiv.org/pdf/1803.01271.pdf)\n7. **DistilBERT**, from [\"Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT\"](https://medium.com/huggingface/distilbert-8cf3380435b5)\n\n\n| Section | Description |\n|-|-|\n| [API Tour](#finetune-api-tour) | Base models, configurables, and more |\n| [Installation](#installation) | How to install using pip or directly from source |\n| [Finetune with Docker](#docker) | Finetune and run inference within a Docker container |\n| [Documentation](https://finetune.indico.io/) | Full API documentation |\n\n# Finetune API Tour\n\nFinetuning the base language model is as easy as calling `Classifier.fit`:\n\n```python3\nmodel = Classifier()               # Load base model\nmodel.fit(trainX, trainY)          # Finetune base model on custom data\nmodel.save(path)                   # Serialize the model to disk\n...\nmodel = Classifier.load(path)      # Reload models from disk at any time\npredictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]\n```\n\nChoose your desired base model from `finetune.base_models`:\n```python3\nfrom finetune import Classifier\nfrom finetune.base_models import BERT, RoBERTa, GPT, GPT2, TextCNN, TCN\nmodel = Classifier(base_model=BERT)\n```\n\nOptimize your model with a variety of configurables. 
A detailed list of all config items can be found [in the finetune docs](https://finetune.indico.io/config.html).\n```python3\nmodel = Classifier(low_memory_mode=True, lr_schedule=\"warmup_linear\", max_length=512, l2_reg=0.01, oversample=True, ...)\n```\n\nThe library supports finetuning for a number of tasks. A detailed description of all target models can be found [in the finetune API reference](https://finetune.indico.io/api.html).\n```python3\nfrom finetune import *\nmodels = (Classifier, MultiLabelClassifier, MultiFieldClassifier, MultipleChoice, # Classify one or more inputs into one or more classes\n          Regressor, OrdinalRegressor, MultifieldRegressor,                       # Regress on one or more inputs\n          SequenceLabeler, Association,                                           # Extract tokens from a given class, or infer relationships between them\n          Comparison, ComparisonRegressor, ComparisonOrdinalRegressor,            # Compare two documents for a given task\n          LanguageModel, MultiTask,                                               # Further pretrain your base models\n          DeploymentModel                                                         # Wrapper to optimize your serialized models for a production environment\n          )\n```\nFor example usage of each of these target types, see the [finetune/datasets directory](https://github.com/IndicoDataSolutions/finetune/tree/master/finetune/datasets).\nFor purposes of simplicity and runtime these examples use smaller versions of the published datasets.\n\n\n\n\n\n\nIf you have large amounts of unlabeled training data and only a small amount of labeled training data,\nyou can finetune in two steps for best performance.\n\n```python3\nmodel = Classifier()               # Load base model\nmodel.fit(unlabeledX)              # Finetune base model on unlabeled training data\nmodel.fit(trainX, trainY)          # Continue finetuning with a smaller amount of labeled 
data\npredictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]\nmodel.save(path)                   # Serialize the model to disk\n```\n\n# Installation\n\nFinetune can be installed directly from PyPI using `pip`:\n\n```\npip3 install finetune\n```\n\nor installed directly from source:\n\n```bash\ngit clone -b master https://github.com/IndicoDataSolutions/finetune \u0026\u0026 cd finetune\npython3 setup.py develop              # symlinks the git directory to your python path\npip3 install tensorflow-gpu --upgrade # or tensorflow-cpu\npython3 -m spacy download en          # download spacy tokenizer\n```\n\nTo run `finetune` on your host, you'll need a working copy of tensorflow-gpu \u003e= 1.14.0 and up-to-date NVIDIA drivers.\n\nYou can optionally run the provided test suite to confirm that installation completed successfully.\n\n```bash\npip3 install pytest\npytest\n```\n\n\n# Docker\n\nIf you'd prefer, you can also run `finetune` in a Docker container. 
The bash scripts provided assume you have a working installation of [docker](https://docs.docker.com/install/) and [nvidia-docker](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)).\n\n```\ngit clone https://github.com/IndicoDataSolutions/finetune \u0026\u0026 cd finetune\n\n# For usage with NVIDIA GPUs\n./docker/build_gpu_docker.sh      # builds a docker image\n./docker/start_gpu_docker.sh      # starts a docker container in the background, forwards $PWD to /finetune\n\ndocker exec -it finetune bash # starts a bash session in the docker container\n```\n\nFor CPU-only usage:\n```\n./docker/build_cpu_docker.sh\n./docker/start_cpu_docker.sh\n```\n\n# Documentation\nFull documentation and an API reference for `finetune` are available at [finetune.indico.io](https://finetune.indico.io).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIndicoDataSolutions%2Ffinetune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIndicoDataSolutions%2Ffinetune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIndicoDataSolutions%2Ffinetune/lists"}