# 🏡 TensorFlow Model Garden LMs
## 🔎 Overview
This repository showcases language model pretraining with the awesome [TensorFlow Model Garden](https://github.com/tensorflow/models) library.
The following LMs are currently supported:
* [BERT Pretraining](https://aclanthology.org/N19-1423/) - see [pretraining instructions](bert/README.md) (a minimal sketch of the MLM masking objective follows this list)
* [Token Dropping for efficient BERT Pretraining](https://aclanthology.org/2022.acl-long.262/) - see [pretraining instructions](token-dropping-bert/README.md)
* [Training ELECTRA Augmented with Multi-word Selection](https://aclanthology.org/2021.findings-acl.219/) (TEAMS) - see [pretraining instructions](teams/README.md)
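For context, BERT pretraining corrupts its input with the masked language modeling (MLM) objective. Below is a minimal, self-contained sketch of the 80/10/10 masking rule from the BERT paper — an illustration only, not the TF Model Garden implementation; the `-100` ignore-label convention is borrowed from the Hugging Face/PyTorch ecosystem.

```python
# Sketch of BERT's MLM input corruption (Devlin et al., 2019): 15% of tokens
# are selected; of those, 80% become [MASK], 10% a random token, 10% unchanged.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Return corrupted input ids and labels (-100 marks positions not in the loss)."""
    input_ids, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = random.random()
            if roll < 0.8:
                input_ids[i] = mask_id                       # 80%: replace with [MASK]
            elif roll < 0.9:
                input_ids[i] = random.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
        else:
            labels.append(-100)  # position excluded from the MLM loss
    return input_ids, labels
```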
## 💡 Features

Additionally, the following features are provided:
* A cheatsheet for TPU VM creation (including all necessary dependencies to pretrain models with TF Model Garden library), which can be found [here](cheatsheet/README.md).
* An extended pretraining data generation script that allows, for example, the use of tokenizers from the Hugging Face Model Hub or different data packing strategies (original BERT packing or RoBERTa-like packing); it can be found [here](utils/README.md), and a packing sketch follows this list.
* Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models, which can be found [here](conversion/README.md).
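To illustrate the RoBERTa-like packing strategy mentioned above (not the repository's actual script — see [utils/README.md](utils/README.md) for that), here is a minimal sketch using a Hugging Face tokenizer; `bert-base-uncased` is only a placeholder model ID:

```python
# RoBERTa-style packing: concatenate documents into one contiguous token
# stream and cut fixed-length blocks, crossing document boundaries.
# (Original BERT packing instead builds sentence-pair examples for NSP.)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def pack_roberta_like(documents, seq_len=512):
    stream = []
    for doc in documents:
        stream.extend(tokenizer.encode(doc, add_special_tokens=False))
        stream.append(tokenizer.sep_token_id)  # lightweight document separator
    # Keep only full blocks; the trailing remainder is dropped.
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
```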
## 🏡 Model Zoo

### FineWeb-LMs
The following LMs were pretrained on the 10BT subsets of the famous [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) datasets:
* BERT-based - find the [best model checkpoint here](https://huggingface.co/model-garden-lms/bert-base-finewebs-951k)
* Token Dropping BERT-based - find the [best model checkpoint here](https://huggingface.co/model-garden-lms/bert-base-token-dropping-finewebs-901k)
* TEAMS-based - find the [best model checkpoint here](https://huggingface.co/model-garden-lms/teams-base-finewebs-1m)

All models can be found in the [TensorFlow Model Garden LMs](https://huggingface.co/model-garden-lms) organization on the Model Hub and in [this collection](https://huggingface.co/collections/stefan-it/fineweb-lms-67561ed9d83c390221aaa2d4).
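Since the conversion scripts above produce Transformers-compatible weights, the BERT-style checkpoints should load with the standard Auto classes — a usage sketch, with the model ID taken from the links above (the TEAMS checkpoint is ELECTRA-style and may need a different head class):

```python
# Load a converted FineWeb BERT checkpoint via Hugging Face Transformers.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "model-garden-lms/bert-base-finewebs-951k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```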
Detailed evaluation results with the [ScandEval](https://github.com/ScandEval/ScandEval) library are available in [this repository](https://huggingface.co/datasets/model-garden-lms/finewebs-scandeval-results).
## ❤️ Acknowledgements
This repository is the outcome of the last two years of working with TPUs from the awesome [TRC program](https://sites.research.google/trc/about/) and with the [TensorFlow Model Garden](https://github.com/tensorflow/models) library.
Made in the Bavarian Oberland with ❤️ and 🥨.