{"id":24630737,"url":"https://github.com/arrrrrmin/albert-guide","last_synced_at":"2025-10-05T17:12:23.423Z","repository":{"id":158780002,"uuid":"243263816","full_name":"arrrrrmin/albert-guide","owner":"arrrrrmin","description":"Understanding \"A Lite BERT\". An Transformer approach for learning self-supervised Language Models.","archived":false,"fork":false,"pushed_at":"2023-01-28T11:12:51.000Z","size":54,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-24T14:49:07.458Z","etag":null,"topics":["albert-guide","albert-models","guide","language-modeling","nlp","pretrain","pretraining"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arrrrrmin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-02-26T13:02:22.000Z","updated_at":"2024-08-16T09:43:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"c5eaa4d5-f172-42fc-8b32-8638ab3a31cb","html_url":"https://github.com/arrrrrmin/albert-guide","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/arrrrrmin/albert-guide","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arrrrrmin%2Falbert-guide","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arrrrrmin%2Falbert-guide/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arrrrrmin%2Falbert-guide/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/arrrrrmin%2Falbert-guide/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arrrrrmin","download_url":"https://codeload.github.com/arrrrrmin/albert-guide/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arrrrrmin%2Falbert-guide/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278486308,"owners_count":25994945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albert-guide","albert-models","guide","language-modeling","nlp","pretrain","pretraining"],"created_at":"2025-01-25T07:12:58.017Z","updated_at":"2025-10-05T17:12:23.417Z","avatar_url":"https://github.com/arrrrrmin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# albert-guide\n\nA guide to pretraining your own ALBERT model from scratch\n\n# Pretraining ALBERT models from scratch\n\n\u003e A detailed guide to getting started with ALBERT models as they were intended by google-research.\n\u003e Hints for usage in production can be found at the [end]() of this guide.\n\n# Stages\n\n* [__Environments__, __setups__ and __configurations__](#environments-setups-and-configurations)\n    * [Environments](#environments)\n    * [Setups](#setups)\n    * [Configuration 
objects](#configuration-objects)\n* [__Tokenizers__, __Raws__, __model tasks__ and __records__](#tokenizers-raws-model-tasks-and-records)\n    * [Tokenizers](#tokenizers)\n    * [Raws](#raws)\n    * [Model tasks](#model-tasks)\n    * [Records](#records)\n* [Main entry __run_pretraining__](#main-entry-run-pretraining)\n* [Usage Albert with HF __Transformers__](#usage-albert-with-hf-transformers)\n\n# Environments setups and configuration objects\n\n## Environments\n\nEverything the environment needs is listed in `requirements.txt`. Here is an example, where `==X.Y.Z` pins\nthe exact version of a dependency package.\n\n    transformers\n    tensorflow==1.15.2\n    tensorflow-gpu==1.15.2\n    tensorflow-estimator==1.15.1\n\nAn unpinned entry such as `transformers` resolves to the newest version available,\nwhereas `tensorflow==1.15.2` installs exactly this version together with the dependencies it declares.\nNote for the future: tools like [Poetry](https://github.com/python-poetry/poetry) handle these dependencies\npretty well as the requirements grow.\n\nThere's a difference between a local environment and production usage. On a server you most likely don't want to use\nan environment, since the server does not need to handle many projects. Thus one can skip the environment and install\npackages directly on the system.\n\nFor local development it's highly recommended to use a local environment. When handling different software projects,\nevery environment can define its own dependencies. 
For setting those up see [Setups](#setups).\n\n## Setups\n\n    # Create the virtual environment (please call venv as module with -m)\n    python3 -m venv env\n\n    # Enter the environment\n    source env/bin/activate\n\n    # Upgrade pip inside the environment (again -m is important)\n    python3 -m pip install --upgrade pip\n\n    # Install all packages mentioned in requirements.txt\n    # This call should be used with frozen requirements (==X.Y.Z)\n    pip3 install -r requirements.txt\n\n    # Upgrade what's possible\n    # Execute with --upgrade if you want to have the newest libraries\n    # Not recommended if for example tensorflow would upgrade to 2.Y.Z from 1.Y.Z\n    # pip3 install -r requirements.txt --upgrade\n\n\n## Configuration objects\n\n`ALBERT` has a large architecture configuration and also defines a lot of other parameters\nthat control how the pretraining is performed, like `max_seq_length`, `masked_lm_prob` and\n`dupe_factor`, or even newer parameters that didn't exist in the original [BERT](https://github.com/google-research/bert),\nlike `ngram`, `random_next_sentence`, or `poly_power`. 
`albert_config` is the common model architecture\nJSON config.\n\n    \"albert_config\": {\n        \"attention_probs_dropout_prob\": 0.1,\n        \"hidden_act\": \"gelu\",\n        \"hidden_dropout_prob\": 0.1,\n        \"embedding_size\": 128,\n        \"hidden_size\": 1024,\n        \"initializer_range\": 0.02,\n        \"intermediate_size\": 4096,\n        \"max_position_embeddings\": 512,\n        \"num_attention_heads\": 16,\n        \"num_hidden_layers\": 24,\n        \"num_hidden_groups\": 1,\n        \"net_structure_type\": 0,\n        \"gap_size\": 0,\n        \"num_memory_blocks\": 0,\n        \"inner_group_num\": 1,\n        \"down_scale_factor\": 1,\n        \"type_vocab_size\": 2,\n        \"vocab_size\": 30000\n    }\n\nAdditional parameters can be the following:\n\n|Parameter                  |Default    |\n|---                        |---        |\n|do_lower_case              | true      |\n|max_predictions_per_seq    | 20        |\n|random_seed                | 12345     |\n|dupe_factor                | 2         |\n|masked_lm_prob             | 0.15      |\n|short_seq_prob             | 0.2       |\n|do_permutation             | false     |\n|random_next_sentence       | false     |\n|do_whole_word_mask         | true      |\n|favor_shorter_ngram        | true      |\n|ngram                      | 3         |\n|optimizer                  | lamb      |\n|poly_power                 | 1.0       |\n|learning_rate              | 0.00176   |\n|max_seq_length             | 512       |\n|num_train_steps            | 125000    |\n|num_warmup_steps           | 3125      |\n|save_checkpoints_steps     | 5000      |\n|keep_checkpoint_max        | 5         |\n\nThere are even more, but these are (I think) the most important ones for ALBERT. I suggest also keeping these in a JSON or\nYAML file. 
If those are kept in a JSON file, one can easily read them and build pipelines around the commands provided in the\nALBERT repository.\n\n# Tokenizers raws model tasks and records\n\n## Tokenizers\n\n`ALBERT` supports the [`SentencePiece`-Tokenizer](https://github.com/google/sentencepiece/blob/master/python/README.md)\nnatively. It's fully integrated in the preprocessing pipeline. But to use it, one has to train a tokenizer on the\nprovided data. Google's standard tokenizers mostly do not support German, and even if they do it's a multilingual version\nin which each language is only given around 1000 individual tokens.\n\nFor most NLP applications and corpora a `vocab_size` between `20000` and `40000` should be fine.\nThe tokenizer itself is trained via:\n\n```python\nimport os\nimport logging\nimport sentencepiece\n\ntext_filepath = \"path/to/corpus.txt\"\n# Prefix of the output files; SentencePiece appends \".model\" and \".vocab\"\nmodel_filepath = \"path/to/model/tokenizer\"\nvocab_size = 25000\ncontrol_symbols = [\"[CLS]\", \"[SEP]\", \"[MASK]\"]\n\n# Fail early if the corpus file is missing\nif not os.path.isfile(text_filepath):\n    raise FileNotFoundError(f\"Could not train sp tokenizer, due to missing text file at {text_filepath}\")\n\n# Assemble the training flags as a single command string\ntrain_command = f\"--input={text_filepath} \" \\\n                f\"--model_prefix={model_filepath} \" \\\n                f\"--vocab_size={vocab_size - len(control_symbols)} \" \\\n                f\"--pad_id=0 --unk_id=1 --eos_id=-1 --bos_id=-1 \" \\\n                f\"--user_defined_symbols=(,),”,-,.,–,£,€ \" \\\n                f\"--control_symbols={','.join(control_symbols)} \" \\\n                f\"--shuffle_input_sentence=true --input_sentence_size=10000000 \" \\\n                f\"--character_coverage=0.99995 --model_type=unigram \"\n\nlogging.info(f\"Learning SentencePiece tokenizer with following train command: {train_command}\")\nsentencepiece.SentencePieceTrainer.Train(train_command)\n# The binary model file must exist after training\nassert os.path.isfile(f\"{model_filepath}.model\")\n```\n\nIt'll write two files with the prefix given in `--model_prefix`: `tokenizer.model` and `tokenizer.vocab`. 
The vocabulary\nfile lists all subtokens, and the model is a binary file from which the tokenizer is loaded.\n\nBut to train the tokenizer we need a file to pass to `text_filepath`. Creating this file is covered in the next section.\n\n## Raws\n\nThe only thing we need to train a tokenizer is a file that contains all our data. Since the\n[`SentencePiece`-Tokenizer](https://github.com/google/sentencepiece/blob/master/python/README.md) is trained on\nsentences to detect subtokens in a text, we need to find all sentences in our data. At this point\nwe can already think about the way in which we should provide data to tensorflow and the preprocessing pipeline of\n`ALBERT`.\n\nIn fact there is only a very slight difference between the input that\n[`sentencepiece.SentencePieceTrainer.Train`](https://github.com/google/sentencepiece/blob/master/python/sentencepiece.py)\nand [`create_pretraining_data.py`](https://github.com/google-research/ALBERT/blob/master/create_pretraining_data.py)\nfrom the original google-research `ALBERT` repository expect. ALBERT's preprocessing pipeline expects the data to be one sentence\nper line, just like sentencepiece, but the documents must be separated by an additional line break (`\\n`).\n\nSince `SentencePiece` is fine with empty lines, we can format the data such that we only need one file\ninstead of two: separate sentences from each other by `\\n` and documents by `\\n\\n`.\n\nBut we still don't know what a sentence is. This is a classic NLP problem: we need to find what defines a sentence. This question\nseems far too complex to tackle at this point, since we just want to format data for the first step on the way to training an\nALBERT model.\n\nBefore diving deep into designing regexes for the many special cases and exceptions in your data: my\nrecommendation is to pick up `NLTK` as another dependency in your project and download the tokenizer pickle\nfrom their repository, using the `nltk.download()` function in a Python session in your terminal. 
It covers a few languages and is\neasy to handle.\n\nOnce your models perform reasonably with a moderate number of training steps (like `100000` to `150000`), you can tackle\nthe problem and find your sentences in more accurate ways that fit your needs.\n\nNow we have our `raw.txt` file, which will most likely be large: around 1 to 2 GB or even more.\n\n## Model tasks\n\nBefore we look at how to create our preprocessed data, we need to have a look at\nwhat the pretraining tasks are for the model. So let's have a short look at what ALBERT is actually trying to learn\nwhen we pretrain it.\n\n### Masked LM Prediction\n\nFirst of all, no matter which task we are on, there is an interesting set of parameters in BERT/ALBERT. Since these\nmodels operate on sentences (or `sequences`), we need to set a maximum length a sequence can have. This parameter\nis limited to `512`; smaller values are usually `64`, `128` or `256`. This parameter later on\ninfluences the `batch_size`, which determines how many sequences are computed in one step.\n\nParameters like `short_seq_prob` are interesting independent of which task is\nperformed. The `short_seq_prob` parameter describes the probability with which a sequence is shortened to the length\ndescribed in `target_seq_length`.\n\nBut now let's get to the first task: __Masked LM Prediction__ is a task that takes a `sentence` as input. Additionally,\nsome other parameters like `do_lower_case` (used in the tokenization), `max_predictions_per_seq`,\n`do_whole_word_mask` and `masked_lm_prob` are passed to configure this task in detail. This task also exists in the\noriginal BERT model and masks tokens within a sentence. The model then tries to predict the masked words from the\nknown (passed) words.\n\nHere is an example that comes from the original [BERT repository](https://github.com/google-research/BERT):\n\n`\nInput: the man went to the [MASK1] . 
he bought a [MASK2] of milk.\nLabels: [MASK1] = store; [MASK2] = gallon\n`\n\nIn reality there is no token named `[MASK1]` or `[MASK2]`. Both positions are masked with the same token,\n`[MASK]`. As a binary flag per token this would look like `[0,0,0,0,0,1,0,0,0,0,1,0,0,0]`. All tokens at\npositions marked with `1` should be predicted, whereas tokens marked with `0` are passed as ids to the model.\n\nAdditionally, the parameter `masked_lm_prob` sets the fraction of a sequence's available tokens that are masked. This is done\nbefore padding the sequence up to 512, or whatever is set as `max_seq_length`. So `masked_lm_prob` is applied\nto the length of the raw sequence, not the padded length.\n\nAnother interesting parameter is `do_whole_word_mask`. This tells the pretraining data process to only mask full\nwords, instead of subwords. Tokenizers like `sentencepiece` use a special character to mark word boundaries,\nso that subtokens can be recombined into full tokens/words. In\n`sentencepiece` this special character is `▁` (looks like a normal underscore but it is not). This character\nmarks a word boundary, so when `do_whole_word_mask` is used it helps to find out whether the token before or after\nshould be masked too. Like this it's possible to mask full words instead of subwords.\n\n\n### Sentence Order Prediction\n\nSentence Order Prediction (SOP) is a new task in ALBERT and didn't exist in the original BERT. It replaces the\nNext Sentence Prediction (NSP) task. Basically both tasks aim to learn relationships between segments (sentences). Since\nMasked LM Prediction (MLM) only cares about tokens within a certain segment, these tasks are designed to learn\nlanguage properties that arise across sequences of tokens. One could say it's an inter-sequence\ntask.\nBERT originally tried to predict whether *\"two segments appear consecutively in the original text\"*.\n[Yang et 
al.](https://arxiv.org/abs/1906.08237) \u0026 [Liu et al.](https://arxiv.org/abs/1907.11692) eliminated that task\nand observed improvements on all fine-tuning tasks. [Lan et al.](https://arxiv.org/pdf/1909.11942) showed that the\nreason for this behaviour is that the task is too easy. The observation basically showed that NSP\nbenefited from MLM, as this single-sequence task was already learning a good portion of topical information, which\nhelped when predicting the similarity between two sentences. NSP simply learned to use the already existing knowledge\nof topics. Maybe this could also have been tackled with a different sampling strategy, but anyway, they replaced it\nwith SOP.\n\nThis task does not predict whether a segment is the next one, but whether two segments are in the correct order. Negative\nsamples are generated by swapping two consecutive sentences. Positive samples are two consecutive segments taken from a document as they are.\nSOP performs far more stably than NSP.\n\nAs in the original BERT model, sentences are marked by `[CLS]` for the start of the first segment, `[SEP]`\nfor the end of this segment, and another `[SEP]` token for the end of the second segment. The process works like\nthis:\n\n* Choose two sentences from the corpus\n    * When `random_next_sentence` is set we'll want to use a random sentence from another document\n    * When `random_next_sentence` is not set we'll just offset by one and take the one after the correct sentence\n* Apply subword tokenization\n    * `▁` helps to find out whether to aggregate tokens when `do_whole_word_mask` is set\n* Now finalize segments with `[CLS]` ... `[SEP]` ... 
`[SEP]`\n* Wrap all of this in a training `instance` and add `next_sentence_labels` with either `0` or `1`\n    * `0` labels the second segment as consecutive\n    * `1` labels the second segment as incorrect/*random*\n    * Note that `next_sentence_labels` was carried over from BERT unchanged\n        * I guess to make it easier for projects like huggingface/transformers or spacy/transformers to update their code\n\n\n## Records\n\nNow that we fully understand what the model should do, we can create `instances`, which will be written to a special\nformat used in `tensorflow` to pass data, called `tfrecords`. In such a record each training instance looks like\nthis:\n\n    INFO:tensorflow:tokens: [CLS] a man went to a [MASK] [SEP] he bou ▁ght a [MASK] of milk [SEP]\n    INFO:tensorflow:input_ids: 2 13 48 1082 2090 18275 7893 13 4 3 37 325 328 3235 48 4 44 1131 3 0 0 0 0 ...\n    INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 ...\n    INFO:tensorflow:segment_ids: 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ...\n    INFO:tensorflow:token_boundary: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ...\n    INFO:tensorflow:masked_lm_positions: 8 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...\n    INFO:tensorflow:masked_lm_ids: 65 2636 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... \n    INFO:tensorflow:masked_lm_weights: 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...\n    INFO:tensorflow:next_sentence_labels: 1\n\n\n`Instances` hold data for MLM and SOP and are written to `tfrecord` files. In this case\n`next_sentence_labels` refers to *Sentence Order Prediction*.\n\n\n# Main entry run pretraining\n\nWhen we have everything in place (`Config`, `Tokenizer` and `Pretraining Data`), we can start pretraining a\nfresh model. 
The shell call uses the provided script\n[`run_pretraining.py`](https://github.com/google-research/ALBERT/blob/master/run_pretraining.py).\n\nThe call involves some parameters, for example `input_file`, which can also be a directory containing multiple `records`\ncreated by [`create_pretraining_data.py`](https://github.com/google-research/ALBERT/blob/master/create_pretraining_data.py).\n`output_dir` is the directory which will contain our model. In case we have started training and it was aborted somehow,\nwe can use `init_checkpoint` to continue. Keep an eye on `save_checkpoints_steps`, since it tells us how frequently the\nmodel is saved during training. `num_warmup_steps` can be set to 2.5% of `num_train_steps`. This is the number of steps\nduring which the model applies a lower learning rate, until it reaches the passed `learning_rate` parameter.\n\n    pip install -r albert/requirements.txt\n    python -m albert.run_pretraining \\\n        --input_file=... \\\n        --output_dir=... \\\n        --init_checkpoint=... \\\n        --albert_config_file=... \\\n        --do_train \\\n        --do_eval \\\n        --train_batch_size=4096 \\\n        --eval_batch_size=64 \\\n        --max_seq_length=512 \\\n        --max_predictions_per_seq=20 \\\n        --optimizer='lamb' \\\n        --learning_rate=.00176 \\\n        --num_train_steps=125000 \\\n        --num_warmup_steps=3125 \\\n        --save_checkpoints_steps=5000\n\n\n# Usage Albert with HF Transformers\n\nIn order to use ALBERT as efficiently as possible I'd recommend using\n[Hugging Face (HF) Transformers](https://github.com/huggingface/transformers). It's an open source library that\nprovides many very useful interfaces and functionalities that make our lives easier as NLP developers/researchers.\nThe people at Hugging Face are very up to date on what's going on and also provide useful advice in case something is\nunclear. 
A very nice community.\n\n    TODO: WIP\n\n\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farrrrrmin%2Falbert-guide","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farrrrrmin%2Falbert-guide","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farrrrrmin%2Falbert-guide/lists"}