{"id":20807436,"url":"https://github.com/tikquuss/lm_hf","last_synced_at":"2025-10-28T18:08:06.111Z","repository":{"id":76992708,"uuid":"465077389","full_name":"Tikquuss/lm_hf","owner":"Tikquuss","description":"Causal and Mask Language Modeling with 🤗 Transformers","archived":false,"fork":false,"pushed_at":"2024-02-10T05:13:20.000Z","size":88,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-01-18T13:41:10.165Z","etag":null,"topics":["clm","mlm","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Tikquuss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-03-01T22:30:42.000Z","updated_at":"2022-08-06T04:50:04.000Z","dependencies_parsed_at":"2024-02-10T06:32:15.915Z","dependency_job_id":null,"html_url":"https://github.com/Tikquuss/lm_hf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Flm_hf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Flm_hf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Flm_hf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Flm_hf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Tikquuss","download_url":"https://codeload.github.com/Tikquuss/lm_hf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":243152875,"owners_count":20244657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clm","mlm","transformer"],"created_at":"2024-11-17T19:37:34.388Z","updated_at":"2025-10-28T18:08:06.012Z","avatar_url":"https://github.com/Tikquuss.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## 1. Setup\n```bash\ngit clone https://github.com/Tikquuss/lm_hf\ncd lm_hf\n\npython3 -m pip install --upgrade pip\npip install -r requirements.txt\n```\n\n## 2. Build a tokenizer from scratch if you are not going to use a pre-trained model (supports txt and csv)  \n\nSee [tokenizing.py](src/tokenizing.py) for all other parameters (and descriptions).\n```bash\nst=my/save/path\nmkdir -p $st\n\ndatapath=/path/to/data\ntext_column=text\npython -m src.tokenizing -fe gpt2 -p ${datapath}/data_train.csv,${datapath}/data_val.csv,${datapath}/data_test.csv -vs 25000 -mf 2 -st $st -tc $text_column\n\n#python -m src.tokenizing -fe bert-base-uncased -p wikitext -dn wikitext-2-raw-v1 --vocab_size 25000 -st $st\n\n# ...\n```\n\nThe tokenizer will be saved in ```${save_to}/tokenizer.pt```.\n\n## 3. 
Dictionary (works, but deprecated for the moment)\n\nInstead of pre-training a tokenizer, you can build a simple vocabulary (by splitting sentences on whitespace, which is the default option, or by splitting them into phonemes, ...), then build the tokenizer from this vocabulary during training/evaluation (```tokenizer_params=\"vocab_file=str(${save_to}/word_to_id.txt),t_class=str(bert_tokenizer),...\"```).\n\n```bash\nst=my/save/path\nmkdir -p $st\n\ndatapath=/path/to/data\ntext_column=text\n\npython -m src.utils -p ${datapath}/data_train.csv,${datapath}/data_val.csv,${datapath}/data_test.csv -st $st -tc $text_column\n```\n\nThis option is not recommended for the moment (no thorough sanity check has been done so far).\n\n## 4. Train and/or evaluate a model (from scratch or from a pre-trained model and/or tokenizer)  \nSee [trainer.py](src/trainer.py) and [train.sh](train.sh) for all other parameters (and descriptions).\n```bash\n. train.sh\n```\n\n## 5. TensorBoard (visualize the evolution of the loss/acc/... per step/epoch/...)\nSee https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html\n```\n%load_ext tensorboard\n\n%tensorboard --logdir ${log_dir}/${task}/lightning_logs\n```\n\n## 6. Prediction\n\nTo generate text or fill masks, use the ```predict_params``` parameter.\nBy default, prediction runs on the test dataset (or the dataset specified with the ```split``` parameter), but it is better to put your examples in a text, CSV, or JSON file (...) and use that instead of the test dataset (```test_data_files``` parameter).\nDon't forget to set the ```group_texts``` parameter to ```False``` in this case, and make sure that the length of the prompts or sentences (and the value of the ```max_length``` parameter) does not exceed the value of the ```max_position_embeddings```/```n_positions```/ ... 
parameter of your model.\n\n- For example, for text generation, the file can be in the following form:\n    * for a text file:\n    ```\n    prompt 1\n    prompt 2\n    ...\n    ```\n    * for a csv file:\n    ```\n    text_column | ...\n    prompt 1    | ...\n    prompt 2    | ...\n    ...\t    | ...\n    ```\n\n- For mask filling, replace the prompts above with the sentences on which to run MLM.\n\n\nThe result is stored by default in the ```${log_dir}/${task}/predict.txt``` file, but you can change this path by adding this value to the ```predict_params``` parameter:\n```bash\npredict_params=\"...,output_file=str(my_path/file.txt),...\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftikquuss%2Flm_hf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftikquuss%2Flm_hf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftikquuss%2Flm_hf/lists"}