{"id":13473515,"url":"https://github.com/dpressel/mint","last_synced_at":"2025-09-07T23:40:46.410Z","repository":{"id":38414991,"uuid":"473016854","full_name":"dpressel/mint","owner":"dpressel","description":"MinT: Minimal Transformer Library and Tutorials","archived":false,"fork":false,"pushed_at":"2022-07-26T20:14:44.000Z","size":126,"stargazers_count":254,"open_issues_count":2,"forks_count":14,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-08T21:26:52.214Z","etag":null,"topics":["bart","bert","gpt","gpt2","opt","pytorch","roberta","sentence-transformers","t5","transformer","transformer-models","transformers","tutorials"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dpressel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-23T02:59:44.000Z","updated_at":"2025-04-23T08:04:40.000Z","dependencies_parsed_at":"2022-07-11T19:49:52.309Z","dependency_job_id":null,"html_url":"https://github.com/dpressel/mint","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dpressel/mint","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fmint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fmint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fmint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fmint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dpressel","download_url":"https://codeload.github.com/dpr
essel/mint/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fmint/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274112667,"owners_count":25224328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bart","bert","gpt","gpt2","opt","pytorch","roberta","sentence-transformers","t5","transformer","transformer-models","transformers","tutorials"],"created_at":"2024-07-31T16:01:04.342Z","updated_at":"2025-09-07T23:40:46.361Z","avatar_url":"https://github.com/dpressel.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# MinT: Minimal Transformer Library and Tutorials\n\nA minimalistic implementation of common Transformers from scratch!\n\n## Colabs\n\nA series of tutorials on building common Transformer models from scratch. 
Each tutorial builds on the previous one, so they should be done in order.\n\n- [BERT from scratch](https://colab.research.google.com/drive/175hnhLkJcXH40tGGpO-1kbBrb2IIcIuT?usp=sharing)\n- [GPT \u0026 GPT2 from scratch](https://colab.research.google.com/drive/1svaeO-TF1UEEIq8aew4B5x-y4i79fIXv?usp=sharing)\n- [BART from scratch](https://colab.research.google.com/drive/12C764uTLwPMM9hUlprm_a4bUwHz91a7P?usp=sharing)\n- [T5 from scratch](https://colab.research.google.com/drive/1G3egJjNRrXog-8reY1Ssfoa6c92Dp4jh?usp=sharing)\n- [Build your own SentenceBERT](https://colab.research.google.com/drive/1P11ogAYU-EZ_Kbo7WorMM7p35qvwPuMo?usp=sharing)\n\nThe code is also factored out here as a Python package for easy use outside of the tutorials.\n\nBecause this is written for a tutorial to explain the modeling and training approach, we currently depend on the\nHuggingFace tokenizers library to implement subword tokenization.  I selected it because it's fast and widely used.\nThere are also other good, fast libraries (like BlingFire) that cover multiple subword approaches, but this library\ndoesn't support them at this time.\n\n\n## A Tiny Library for Transformers from the ground up\n\nMinimal PyTorch implementation of common Transformer architectures.  
Currently implements:\n\n- Encoder Only\n  - [BERT](https://aclanthology.org/N19-1423/) / [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf)\n- Decoder Only\n  - [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)\n  - [GPT2](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)\n- Encoder-Decoder\n  - [BART](https://arxiv.org/pdf/1910.13461v1.pdf)\n  - [T5](https://arxiv.org/pdf/1910.10683.pdf)\n- Dual-Encoder\n  - [SentenceBERT](https://aclanthology.org/D19-1410.pdf)\n\n\n## Pretraining\n\nThe repository currently includes example programs showing how to pretrain from scratch (or continue pretraining from pretrained models).\n\n### In-memory training on a small dataset\nThere are 2 pretraining examples. The first is a toy example suited to small datasets like Wikitext-2.\nThe loader preprocesses the data and slurps the tensors into a TensorDataset.\nIt uses the `SimpleTrainer` to train for several epochs.  Because the dataset is small and map-style, it makes sense to train for a whole epoch and then evaluate on the whole test dataset.  
For large datasets, I would not recommend this approach.\n\n### Out-of-memory training on a large dataset\nThe second example uses an infinite IterableDataset to read multiple files (shards) and convert them to tensors on the fly.\nThis program is a more realistic example of language modeling.\n\n### Out-of-memory preprocessed shards on a large dataset\n\nThe library also supports fully preprocessed datasets, but there is no example for that usage at this time.\n\n### Wikipedia\n\nTo pretrain on English Wikipedia with this program, you'll need an XML Wikipedia dump.\nThis is usually named `enwiki-latest-pages-articles.xml.bz2` and can be found on the [Wikipedia dump site](https://dumps.wikimedia.org/enwiki/latest/).\nFor example, this should work for downloading:\n\n```\nwget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2\n```\nYou also need the WikiExtractor repository, checked out at this commit:\n\n```\ngit clone https://github.com/attardi/wikiextractor\ncd wikiextractor\ngit checkout 16186e290d9eb0eb3a3784c6c0635a9ed7e855c3\n\n```\nHere is how I ran it for my example:\n\n```\npython WikiExtractor.py ${INPUT}/enwiki-latest-pages-articles.xml.bz2 \\\n       -q --json \\\n       --processes 7 \\\n       --output ${OUTPUT}/enwiki-extracted \\\n       --bytes 100M \\\n       --compress \\\n       --links \\\n       --discard_elements gallery,timeline,noinclude \\\n       --min_text_length 0 \\\n       --filter_disambig_pages\n```\nRegarding the command line above, only use `--compress` if you have bzip2 on your system and your Python can\n\n```python\nimport bz2\n```\n\nIn each target generated (e.g. AA, AB, AC), we are going to rename with a prefix (e.g. 
AA):\n\n```\nfor file in *.bz2; do mv \"$file\" \"AA_$file\"; done;\n```\nWe can then copy these to a single directory, or split them however we would like into train and test.\n\nHere is how you can train on multiple workers with DistributedDataParallel:\n\n```\nCUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 python -m torch.distributed.launch \\\n        --nnodes=1 \\\n        --nproc_per_node=8 \\\n        --node_rank=0 \\\n        --master_port=$PORT \\\n        pretrain_bert_wiki.py \\\n        --vocab_file /data/k8s/hf-models/bert-base-uncased/vocab.txt \\\n        --lowercase \\\n        --train_file \"/path/to/enwiki-extracted/train/\" \\\n        --valid_file \"/path/to/enwiki-extracted/valid/\" \\\n        --num_train_workers 4 \\\n        --num_valid_workers 1 --batch_size $B --num_steps $N --saves_per_cycle 1 \\\n        --train_cycle_size 10000 \\\n        --eval_cycle_size 500 \\\n        --distributed\n\n```\n\n## Fine-tuning\n\nThe [tune_bert_for_cls](src/tfs/examples/tune_bert_for_cls.py) program is a simple example of fine-tuning\nour BERT implementation from scratch.\n\n## Completer REPL\n\nThe [bert_completer](src/tfs/examples/bert_completer.py) program allows you to type in masked strings and\nsee how BERT would complete them.  At startup, you can pass `--sample` to sample from the output;\notherwise it uses the most likely values.  You can switch between the 2 modes at runtime using:\n\n```\nBERT\u003e\u003e :sample\n```\nor\n```\nBERT\u003e\u003e :max\n```\nThis example uses `prompt_toolkit`, which is not a core dependency, but you can install it like this:\n```\npip install .[examples]\n```\n\n\n## More Info Soon\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpressel%2Fmint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdpressel%2Fmint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpressel%2Fmint/lists"}