{"id":13935800,"url":"https://github.com/labteral/ernie","last_synced_at":"2025-04-05T13:09:30.350Z","repository":{"id":45281263,"uuid":"240615607","full_name":"labteral/ernie","owner":"labteral","description":"Simple State-of-the-Art BERT-Based Sentence Classification with Keras / TensorFlow 2. Built with HuggingFace's Transformers.","archived":false,"fork":false,"pushed_at":"2024-05-26T21:14:58.000Z","size":334,"stargazers_count":200,"open_issues_count":3,"forks_count":31,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-29T12:09:06.024Z","etag":null,"topics":["albert","bert","bert-as-service","bert-embeddings","bert-model","bert-models","distilbert","huggingface","huggingface-transformer","keras","natural-language-processing","nlp","roberta","sentence-classification","tensorflow","tensorflow2","transformer-architecture","transformer-tensorflow2","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/labteral.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-14T23:22:43.000Z","updated_at":"2025-03-17T10:04:20.000Z","dependencies_parsed_at":"2024-01-17T08:44:24.945Z","dependency_job_id":"76a53706-5677-4431-a1ce-5adfacf6b9a8","html_url":"https://github.com/labteral/ernie","commit_stats":null,"previous_names":["brunneis/ernie"],"tags_count":5,"template":false,"template_full_name":"labteral/python-package","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labteral%2Fernie","tags_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/labteral%2Fernie/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labteral%2Fernie/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/labteral%2Fernie/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/labteral","download_url":"https://codeload.github.com/labteral/ernie/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247339158,"owners_count":20923014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albert","bert","bert-as-service","bert-embeddings","bert-model","bert-models","distilbert","huggingface","huggingface-transformer","keras","natural-language-processing","nlp","roberta","sentence-classification","tensorflow","tensorflow2","transformer-architecture","transformer-tensorflow2","transformers"],"created_at":"2024-08-07T23:02:06.294Z","updated_at":"2025-04-05T13:09:30.331Z","avatar_url":"https://github.com/labteral.png","language":"Python","funding_links":["https://www.buymeacoffee.com/brunneis"],"categories":["Python"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003ca href=\"https://github.com/labteral/ernie#stickers-by-sticker-mule\" alt=\"Stickers section\"\u003e\u003cimg src=\"misc/ernie-sticker-diecut.png\" alt=\"Ernie Logo\" width=\"150\"/\u003e\u003c/a\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pepy.tech/project/ernie/\"\u003e\u003cimg alt=\"Downloads\" 
src=\"https://img.shields.io/badge/dynamic/json?style=flat-square\u0026maxAge=3600\u0026label=downloads\u0026query=$.total_downloads\u0026url=https://analytics.pepy.tech/api/v2/projects/ernie\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.python.org/pypi/ernie/\"\u003e\u003cimg alt=\"PyPi\" src=\"https://img.shields.io/pypi/v/ernie.svg?style=flat-square\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/labteral/ernie/releases\"\u003e\u003cimg alt=\"GitHub releases\" src=\"https://img.shields.io/github/release/labteral/ernie.svg?style=flat-square\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/labteral/ernie/blob/master/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/labteral/ernie.svg?style=flat-square\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\n    \u003cb\u003eBERT's best friend.\u003c/b\u003e\n\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://www.buymeacoffee.com/brunneis\" target=\"_blank\"\u003e\u003cimg src=\"https://cdn.buymeacoffee.com/buttons/default-orange.png\" alt=\"Buy Me A Coffee\" height=\"35px\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cbr\u003e\n\nSponsored by \u003ca href=\"http://stickermule.com/supports/ernie20-sponsorship\"\u003e\u003cimg src=\"misc/stickermule-logo.png\" alt=\"Sticker Mule Logo\" width=\"80px\"/\u003e\u003c/a\u003e\n\n# Installation\n\u003e Ernie requires Python 3.6 or higher.\n```bash\npip install ernie\n```\n\u003ca href=\"https://colab.research.google.com/drive/10lmqZyAHFP_-x4LxIQxZCavYpPqcR28c\"\u003e\u003cimg alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg?style=flat-square\"\u003e\u003c/a\u003e\n\n# Fine-Tuning\n## Sentence Classification\n```python\nfrom ernie import SentenceClassifier, Models\nimport pandas as pd\n\ntuples = [\n    (\"This is a positive example. I'm very happy today.\", 1),\n    (\"This is a negative sentence. 
Everything was wrong today at work.\", 0)\n]\ndf = pd.DataFrame(tuples)\n\nclassifier = SentenceClassifier(\n    model_name=Models.BertBaseUncased,\n    max_length=64,\n    labels_no=2\n)\nclassifier.load_dataset(df, validation_split=0.2)\nclassifier.fine_tune(\n    epochs=4,\n    learning_rate=2e-5,\n    training_batch_size=32,\n    validation_batch_size=64\n)\n```\n\n# Prediction\n## Predict a single text\n```python\ntext = \"Oh, that's great!\"\n\n# It returns a tuple with the prediction\nprobabilities = classifier.predict_one(text)\n```\n\n## Predict multiple texts\n```python\ntexts = [\"Oh, that's great!\", \"That's really bad\"]\n\n# It returns a generator of tuples with the predictions\nprobabilities = classifier.predict(texts)\n```\n\n## Prediction Strategies\nTexts whose length in tokens exceeds the `max_length` used for fine-tuning are truncated. To avoid losing information, you can use a split strategy and aggregate the predictions in different ways.\n\n### Split Strategies\n- `SentencesWithoutUrls`. The text will be split into sentences.\n- `GroupedSentencesWithoutUrls`. 
The text will be split into groups of sentences with a length in tokens similar to `max_length`.\n\n### Aggregation Strategies\n- `Mean`: the prediction of the text will be the mean of the predictions of the splits.\n- `MeanTopFiveBinaryClassification`: the mean is computed over the 5 highest predictions only.\n- `MeanTopTenBinaryClassification`: the mean is computed over the 10 highest predictions only.\n- `MeanTopFifteenBinaryClassification`: the mean is computed over the 15 highest predictions only.\n- `MeanTopTwentyBinaryClassification`: the mean is computed over the 20 highest predictions only.\n\n```python\nfrom ernie import SplitStrategies, AggregationStrategies\n\ntexts = [\"Oh, that's great!\", \"That's really bad\"]\nprobabilities = classifier.predict(\n    texts,\n    split_strategy=SplitStrategies.GroupedSentencesWithoutUrls,\n    aggregation_strategy=AggregationStrategies.Mean\n)\n```\n\n\nYou can define custom strategies through the `AggregationStrategy` and `SplitStrategy` classes.\n```python\nfrom ernie import SplitStrategy, AggregationStrategy\n\n# Signature sketch: replace each type annotation with an actual value\nmy_split_strategy = SplitStrategy(\n    split_patterns: list,\n    remove_patterns: list,\n    remove_too_short_groups: bool,\n    group_splits: bool\n)\nmy_aggregation_strategy = AggregationStrategy(\n    method: function,\n    max_items: int,\n    top_items: bool,\n    sorting_class_index: int\n)\n```\n\n# Save and restore a fine-tuned model\n## Save model\n```python\nclassifier.dump('./model')\n```\n\n## Load model\n```python\nclassifier = SentenceClassifier(model_path='./model')\n```\n\n# Interrupted Training\nSince execution may break during training (especially if you are using Google Colab), you can save the model after every trained epoch, so training can be resumed without losing all the progress.\n\n```python\nclassifier = SentenceClassifier(\n    model_name=Models.BertBaseUncased,\n    max_length=64\n)\nclassifier.load_dataset(df, validation_split=0.2)\n\nfor epoch in range(1, 5):\n    if 
epoch == 3:\n        raise Exception(\"Forced crash\")\n\n    classifier.fine_tune(epochs=1)\n    classifier.dump(f'./my-model/{epoch}')\n```\n\n```python\nlast_training_epoch = 2\n\nclassifier = SentenceClassifier(model_path=f'./my-model/{last_training_epoch}')\nclassifier.load_dataset(df, validation_split=0.2)\n\nfor epoch in range(last_training_epoch + 1, 5):\n    classifier.fine_tune(epochs=1)\n    classifier.dump(f'./my-model/{epoch}')\n```\n\n# Autosave\nEven if you do not explicitly dump the model, it will be autosaved into `./ernie-autosave` every time `fine_tune` completes successfully.\n\n```\nernie-autosave/\n└── model_family/\n    └── timestamp/\n        ├── config.json\n        ├── special_tokens_map.json\n        ├── tf_model.h5\n        ├── tokenizer_config.json\n        └── vocab.txt\n```\n\nYou can easily clean the autosaved models by invoking `clean_autosave` after finishing a session or when starting a new one.\n```python\nfrom ernie import clean_autosave\nclean_autosave()\n```\n\n# Supported Models\n\nYou can access some of the official base model names through the `Models` class. You can also pass a HuggingFace model name directly, such as `bert-base-uncased` or `bert-base-chinese`, when instantiating a `SentenceClassifier`.\n\n\u003e See all the available models at [huggingface.co/models](https://huggingface.co/models).\n\n# Additional Info\n\n## Accessing the model and tokenizer\nYou can directly access both the model and tokenizer objects once the classifier has been instantiated:\n```python\nclassifier.model\nclassifier.tokenizer\n```\n\n## Keras `model.fit` arguments\nYou can pass arguments of the Keras `model.fit` method to the `classifier.fine_tune` method. 
For example:\n```python\nclassifier.fine_tune(class_weight={0: 0.2, 1: 0.8})\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabteral%2Fernie","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flabteral%2Fernie","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabteral%2Fernie/lists"}