{"id":23377506,"url":"https://github.com/facebookresearch/large_concept_model","last_synced_at":"2025-05-14T08:08:21.851Z","repository":{"id":267872537,"uuid":"902594666","full_name":"facebookresearch/large_concept_model","owner":"facebookresearch","description":"Large Concept Models: Language modeling in a sentence representation space","archived":false,"fork":false,"pushed_at":"2025-01-29T05:57:33.000Z","size":649,"stargazers_count":2086,"open_issues_count":13,"forks_count":186,"subscribers_count":35,"default_branch":"main","last_synced_at":"2025-04-13T15:54:24.962Z","etag":null,"topics":["language-models","nlp","pytorch","seq2seq","sequence-to-sequence"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-12T21:59:57.000Z","updated_at":"2025-04-13T03:44:20.000Z","dependencies_parsed_at":"2025-02-01T23:00:47.488Z","dependency_job_id":null,"html_url":"https://github.com/facebookresearch/large_concept_model","commit_stats":null,"previous_names":["facebookresearch/large_concept_model"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Flarge_concept_model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Flarge_concept_model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Flarge_concept_model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Flarge_concept_model/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/large_concept_model/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254101558,"owners_count":22014908,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-models","nlp","pytorch","seq2seq","sequence-to-sequence"],"created_at":"2024-12-21T18:14:54.429Z","updated_at":"2025-05-14T08:08:16.835Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","readme":"# Large Concept Models\n## Language Modeling in a Sentence Representation Space\n\n[[Blog]](https://ai.meta.com/blog/meta-fair-updates-agents-robustness-safety-architecture/) [[Paper]](https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space/)\n\nThis repository provides the official implementations and experiments for [Large Concept 
Models](https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space/) (**LCM**).\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"space.svg\" width=\"50%\"\u003e\n\u003c/p\u003e\n\n\n\nThe LCM operates on an explicit higher-level semantic representation,\nwhich we name a \"concept\". Concepts are language- and modality-agnostic and represent a higher\nlevel idea. In this work, a concept corresponds to a sentence, and we use the [SONAR](https://github.com/facebookresearch/SONAR)\nembedding space, which supports up to 200 languages in text and 57 languages in speech. See the list of supported languages [here](https://github.com/facebookresearch/SONAR?tab=readme-ov-file#supported-languages-and-download-links).\n\n\n## Approach\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"lcm.svg\" width=\"70%\"\u003e\n\u003c/p\u003e\n\n\n\nThe LCM is a sequence-to-sequence model in the concept space trained to perform auto-regressive sentence prediction.\nWe explore multiple approaches:\n- MSE regression (`base_lcm` in this code).\n- Variants of diffusion-based generation (we include `two_tower_diffusion_lcm` in this release).\n- Models operating in a quantized SONAR space (coming soon).\n\nThese explorations are performed using 1.6B parameter models and training data on the order of 1.3T tokens. We include in this repository recipes to reproduce the training and finetuning of the 1.6B MSE LCM and the Two-tower diffusion LCM. See instructions [below](#usage).\n\n## Installing\n\n### Using UV\n\nThe LCM repository relies on fairseq2. If you have `uv` installed on your system, you can install a virtual environment with all the necessary packages by running the following commands:\n```bash\nuv sync --extra cpu --extra eval --extra data\n```\n\nYou can also use `uv run` to run the demo commands with the correct environment.\n\nNote that we only provide requirements for `cpu` dependencies; if you want GPU support, you will have to choose the variants of torch and fairseq2 that work for your system.\nFor example, for torch 2.5.1 with CUDA 12.1, you would do something like:\n```\nuv pip install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/cu121 --upgrade\nuv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url  https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu121 --upgrade\n```\n\nCheck [fairseq2 variants](https://github.com/facebookresearch/fairseq2?tab=readme-ov-file#variants) for possible variants. Note that LCM currently relies on the release candidate fairseq2 0.3.0rc1.\n\n### Using pip\n\nTo install with pip, the commands are very similar, but you will have to manage your own environment and make sure to install fairseq2 manually first. 
For instance, for a `cpu` install:\n\n```bash\npip install --upgrade pip\npip install fairseq2==v0.3.0rc1 --pre --extra-index-url  https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cpu\npip install -e \".[data,eval]\"\n```\n\nIf [fairseq2](https://github.com/facebookresearch/fairseq2) does not provide a build for your machine, check the README of that project to build it locally.\n\n## Usage\n\n\u003e [!NOTE]\n\u003e If using `uv`, prefix all commands with `uv run` to use the environment created by default in `.venv`, e.g.,\n\u003e `uv run torchrun --standalone`.\n\u003e Alternatively, you can activate the environment once and for all with `source .venv/bin/activate`.\n\n### Preparing data\n\nThe LCM can be trained and evaluated using textual data split into sentences and embedded with [SONAR](https://github.com/facebookresearch/SONAR/). We provide a sample processing pipeline that can be used to prepare such training data; you can run it with:\n\n```\nuv run --extra data scripts/prepare_wikipedia.py /output/dir/for/the/data\n```\n\nThis pipeline shows how to get a dataset from Hugging Face and process it with SONAR and [SaT](https://arxiv.org/abs/2406.16678). Check out the file for more details on processing your own data. While the script provides an example pulling data from Hugging Face, we also provide [APIs](https://github.com/facebookresearch/stopes/tree/main/stopes/utils/sharding) to process JSONL, Parquet, and CSV files.\n\n### Datacards\n\nThe trainer described below relies on datacards to configure the datasets. These datacards are YAML files with pointers to the dataset files (local or on S3) and information on their schema. We provide some sample datacards in [`lcm/datacards/datacards.yaml`](https://github.com/facebookresearch/large_concept_model/blob/main/lcm/datacards/datacards.yaml). 
Once you have processed some data, you can update the datacards with your paths.\n\n#### Fitting a normalizer\nTo fit a new embedding space normalizer on a given weighted mixture of datasets,\none can use the following command:\n```bash\npython scripts/fit_embedding_normalizer.py --ds dataset1:4 dataset2:1 dataset3:10 --save_path \"path/to/new/normalizer.pt\" --max_nb_samples 1000000\n```\nHere, `dataset1`, `dataset2`, `dataset3` are the names of datasets declared in the datacards as shown above\nand `(4, 1, 10)` their respective relative weights.\nThe resulting normalizer can then be declared as a model as shown in `lcm/cards/sonar_normalizer.yaml`\nand referenced in all model training configs.\n\n\n### Pre-training models\n\n#### Base MSE LCM\n\nTo train an MSE LCM, we will use one of the following commands:\n\n**Option 1.** Training with SLURM using [submitit](https://github.com/facebookincubator/submitit) via [stopes](https://github.com/facebookresearch/stopes/tree/main)'s launcher:\n```sh\npython -m lcm.train \\\n    +pretrain=mse \\\n    ++trainer.output_dir=\"checkpoints/mse_lcm\" \\\n    ++trainer.experiment_name=training_mse_lcm \\\n```\nWith this command, we will submit a SLURM job named `training_mse_lcm` with the recipe's requirements, in this case:\n```yaml\nrequirements:\n  nodes: 4\n  tasks_per_node: 8\n  gpus_per_node: 8\n  cpus_per_task: 32\n  mem_gb: 0\n  timeout_min: 10000\n```\nYou can override the job's requirements, such as the timeout limit, and the launcher's SLURM partition with:\n```sh\npython -m lcm.train \\\n    +pretrain=mse \\\n    ++trainer.output_dir=\"checkpoints/mse_lcm\" \\\n    ++trainer.experiment_name=training_mse_lcm \\\n    ++trainer.requirements.timeout_min=100 \\\n    ++trainer.requirements.cpus_per_task=8 \\\n    ++launcher.partition=$partition_name\n```\n\n**Option 2.** Training locally with `torchrun` (e.g. 
using only 2 GPUs) with a smaller batch size (overriding `++trainer.data_loading_config.max_tokens=1000`):\n```sh\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +pretrain=mse \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    ++trainer.output_dir=\"checkpoints/mse_lcm\" \\\n    +trainer.use_submitit=false \\\n```\n\u003e [!IMPORTANT]\n\u003e Since we're changing the number of GPUs required by the recipe, this will not reproduce the experimental setup of the paper.\n\nThe checkpoints directory `checkpoints/mse_lcm` will be structured as:\n```\n.\n├── checkpoints\n│   ├── step_2000\n│   ├── ...\n│   └── step_250000\n├── config_logs\n├── executor_logs\n├── model_card.yaml\n├── tb   # tensorboard logs\n└── wandb  # W\u0026B logs\n```\nNote that W\u0026B logging is skipped unless `wandb` is available.\nYou can install `wandb` with `uv pip install wandb`.\nW\u0026B arguments can be changed by overriding Hydra config values in the recipe:\n\n```sh\n++trainer.wandb_project=$project_name\n++trainer.wandb_run_name=$run_name\n```\n\n#### Two-tower diffusion LCM\n\nSimilar to the base MSE LCM, we can submit a training job following the recipe in [./recipes/train/pretrain/two_tower.yaml](./recipes/train/pretrain/two_tower.yaml) via:\n\n```sh\npython -m lcm.train \\\n    +pretrain=two_tower \\\n    ++trainer.output_dir=\"checkpoints/two_tower_lcm\" \\\n    ++trainer.experiment_name=training_two_tower_lcm \\\n```\n\n\u003e [!TIP]\n\u003e To understand the different ingredients of training recipes, check [this README](./recipes/train/README.md).\n\n\n### Finetuning models\nTo finetune the previously pre-trained two-tower diffusion LCM on supervised data, follow these steps:\n\n**Step 1.** Register the pre-trained checkpoint as a fairseq2 asset.\n\nYou can finetune the final checkpoint with the card `checkpoints/two_tower_lcm/model_card.yaml` or any checkpoint after a specific number of training steps, e.g., `checkpoints/two_tower_lcm/checkpoints/step_2000/model_card.yaml`.\nTo register the selected checkpoint, copy the automatically created YAML file to `./lcm/cards/mycards.yaml` and rename the model to replace the default `on_the_fly_lcm`.\n`./lcm/cards/mycards.yaml` will look like:\n```yaml\n__source__: inproc\ncheckpoint: file://path_to/large_concept_model/checkpoints/two_tower_lcm/checkpoints/step_2000/model.pt\nmodel_arch: two_tower_diffusion_lcm_1_6B\nmodel_family: two_tower_diffusion_lcm\nname: my_pretrained_two_tower\n```\nFor more on how to manage fairseq2 assets, see the [documentation](https://facebookresearch.github.io/fairseq2/nightly/basics/assets.html).\n\n**Step 2.** Launch a finetuning job pointing to the model to finetune, in this instance `my_pretrained_two_tower`:\n```sh\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints/finetune_two_tower_lcm\" \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    +trainer.use_submitit=false \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\nor\n\n```sh\npython -m lcm.train \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints/finetune_two_tower_lcm\" \\\n    ++trainer.experiment_name=finetune_two_tower_lcm \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\n\nSimilarly, to finetune an MSE LCM, follow the same instructions for registering a pre-trained checkpoint and 
submit a finetuning job with the appropriate recipe ([./recipes/train/finetune/mse.yaml](./recipes/train/finetune/mse.yaml)) via:\n```sh\npython -m lcm.train \\\n    +finetune=mse \\\n    ++trainer.output_dir=\"checkpoints/finetune_mse_lcm\" \\\n    ++trainer.experiment_name=finetune_mse_lcm \\\n    ++trainer.model_config_or_name=my_pretrained_mse_lcm\n```\n\n### Evaluating models\n\n\u003e [!NOTE]\n\u003e For advanced evaluation (benchmarking different tasks, comparing results with LLMs, etc.), check [the evaluation documentation](./examples/evaluation/README.md).\n\n\n**Step 0.** Download NLTK data required for evaluating ROUGE:\n```sh\npython -m nltk.downloader punkt_tab\n```\n\n**Step 1.**\nGenerate and score outputs of a model either by pointing to its `model_card` YAML file or after registering it as a fairseq2 asset (the same way we registered `my_pretrained_two_tower`):\n```sh\nmodel_card=./checkpoints/finetune_two_tower_lcm/checkpoints/step_1000/model_card.yaml\nOUTPUT_DIR=evaluation_outputs/two_tower\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \\\n  --predictor two_tower_diffusion_lcm  \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n  --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4  --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR} \\\n  --inference_timesteps 40 \\\n  --initial_noise_scale 0.6 \\\n  --guidance_scale 3 \\\n  --guidance_rescale 0.7\n```\nwhere in this example we evaluate only 100 samples (`--data_loading.max_samples 100`) and limit the model output length to 10 sentences (`--task_args '{\"max_gen_len\": 10}'`).\n\nOutputs dumped in `./evaluation_outputs/two_tower` will be structured as:\n```\n.\n├── metadata.jsonl\n├── metrics.eval.jsonl\n├── raw_results\n├── results\n└── tb\n```\nwhere `metrics.eval.jsonl` contains corpus-level scores.\n\n\nTo evaluate an MSE LCM, we use the associated predictor (`base_lcm`) and evaluate with:\n\n```sh\nmodel_card=./checkpoints/finetune_mse_lcm/checkpoints/step_1000/model_card.yaml\nOUTPUT_DIR=evaluation_outputs/mse_lcm\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \\\n  --predictor base_lcm --sample_latent_variable False \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n  --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4  --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR}\n```\n\nNote that in this example, we only show how to evaluate the LCM on the same finetuning dataset (validation split). To evaluate on a downstream task and compare results with LLMs, refer to the [Evaluation documentation](./examples/evaluation/README.md).\n\n## Contributing\n\nSee the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.\n\n## Citation\n\nIf you use this codebase, please cite:\n```\n@article{lcm2024,\n  author = {{LCM team}, Lo\\\"{i}c Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. 
Costa-juss\\`{a}, David Dale, Hady Elsahar, Kevin Heffernan, Jo\\~{a}o Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk},\n  title = {{Large Concept Models}: Language Modeling in a Sentence Representation Space},\n  publisher = {arXiv},\n  year = {2024},\n  url = {https://arxiv.org/abs/2412.08821},\n}\n```\n\n## License\n\nThis code is released under the MIT license (see [LICENSE](./LICENSE)).\n","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["其他_文本生成_文本对话"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Flarge_concept_model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Flarge_concept_model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Flarge_concept_model/lists"}