{"id":19428917,"url":"https://github.com/eleutherai/semantic-memorization","last_synced_at":"2025-08-07T01:25:00.944Z","repository":{"id":157424030,"uuid":"614134585","full_name":"EleutherAI/semantic-memorization","owner":"EleutherAI","description":null,"archived":false,"fork":false,"pushed_at":"2024-11-17T14:55:08.000Z","size":160378,"stargazers_count":44,"open_issues_count":5,"forks_count":5,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-07-19T16:30:37.852Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EleutherAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-03-15T00:49:44.000Z","updated_at":"2025-02-04T04:17:29.000Z","dependencies_parsed_at":"2025-07-19T13:18:44.249Z","dependency_job_id":"f2374d40-9092-47d0-85d4-2fdcde85f7da","html_url":"https://github.com/EleutherAI/semantic-memorization","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EleutherAI/semantic-memorization","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fsemantic-memorization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fsemantic-memorization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fsemantic-memorization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fsemantic-memorization/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EleutherAI","download_url":"https://codeload.github.com/EleutherAI/semantic-memorization/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fsemantic-memorization/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269183969,"owners_count":24374417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-06T02:00:09.910Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T14:17:11.389Z","updated_at":"2025-08-07T01:25:00.914Z","avatar_url":"https://github.com/EleutherAI.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Semantic Memorization\n\nThis repository is for EleutherAI's project Semantic Memorization which defines a unique taxonomy for memorized sequences based on factors that influence memorization. 
For detailed information on how the likelihood of a sequence being memorized depends on the taxonomy, please see our paper [Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon](https://arxiv_link_here).\n\n## Contents\n- [Semantic Memorization](#semantic-memorization)\n    * [Motivation](#motivation)\n    * [Taxonomy](#taxonomy)\n- [Reproducing Results](#reproducing-results)\n    * [Filters](#filters)\n    * [Combining Filters](#combining-filters)\n    * [Training Taxonomic Model](#training-taxonomic-model)\n    * [Plots](#plots)\n- [Citation Details](#citation-details)\n\n\n## Motivation\nMemorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.\n## Taxonomy\n![](./readme-images/memorization_taxonomy.png)\nOur taxonomy, illustrated above, defines three types of LM memorization based on colloquial descriptions of human memorization. Humans recite direct quotes that they commit to memory through repeated exposure, so LMs recite highly duplicated sequences. Humans reconstruct a passage by remembering a general pattern and filling in the gaps, so LMs reconstruct inherently predictable boilerplate templates. 
Humans sporadically recollect an episodic memory or fragment after a single exposure, so LMs recollect other sequences seen rarely during training.\n\n# Reproducing Results\n## Filters\n### Code vs Natural Language\nTo train a [natural language vs code classifier](https://huggingface.co/usvsnsp/code-vs-nl), we used [Hugging Face's training pipeline](https://huggingface.co/docs/transformers/en/main_classes/trainer) on randomly sampled, equal-weight subsets of [bookcorpus](https://huggingface.co/datasets/bookcorpus/bookcorpus) and [github-code](https://huggingface.co/datasets/codeparrot/github-code). The following hyperparameters were used during training:\n- learning_rate: 1e-07\n- train_batch_size: 256\n- eval_batch_size: 1024\n- seed: 42\n- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08\n- lr_scheduler_type: linear\n- training_steps: 1000\n\nFollowing this, we used [this script](./working_dirs/orz/classify_code_vs_nl.py) to find probabilities of a sequence being memorized.\n### Highly Duplicated Filter\nTo replicate the duplication results, run the following scripts in sequence:\n- [script saving sequence hashes](./working_dirs/orz/sequence_duplication/save_sequence_hashes.py), which saves the hash of every 32-gram sequence in the Pile\n- [script saving zero offset hashes](./working_dirs/orz/sequence_duplication/save_zero_offset_duplicate_hashes.py), which saves only the hashes at the required offset (32 in our case)\n- [script saving approximate duplicates, based on hashes](./working_dirs/orz/sequence_duplication/calculate_asymptotic_duplicates.py). This produces a single numpy file that stores the hashes and sequence ids of all sequences whose hash matches that of at least one zero-offset sequence\n- [script calculating exact duplicates](./working_dirs/orz/sequence_duplication/save_true_duplicate_counts.cpp), which compares each sequence with all sequences that share its hash to get the exact count of duplicates. 
\n- Following this, you get a list of true counts, which you can combine using [this script](./working_dirs/orz/sequence_duplication/save_true_duplicates.py)\n- You can find the already processed lists of sequence ids with their duplicate counts in the [standard](https://huggingface.co/datasets/usvsnsp/duped-num-duplicates) and [deduped](https://huggingface.co/datasets/usvsnsp/deduped-num-duplicates) datasets.\n### Semantic and Textual Matches Filter\nTo replicate the semantic and textual matches filters, run the following scripts in sequence:\n- Create sentence embeddings for various datasets with [this script](./working_dirs/rintarou/sentence_embedding_maker.py)\n- Compute semantic filter counts with [this script](./working_dirs/rintarou/snowclones_maker.py)\n- Compute textual match counts with [this script](./working_dirs/rintarou/templating.py). For textual matches, we also need to create only the query sentences for each partition, as this filter compares the Levenshtein distance between queries. This can be achieved with [this script](working_dirs/rintarou/query_maker.py). \n  \n### Token frequencies\nTo replicate the token frequency results, run [this script](./working_dirs/orz/token_frequencies/calculate_token_frequencies.cpp). The full list of token frequencies can be found on Hugging Face for the [standard](https://huggingface.co/datasets/usvsnsp/duped-num-frequencies) and [deduped](https://huggingface.co/datasets/usvsnsp/deduped-num-frequencies) datasets.\n\n## Combining Filters\nTo combine all the existing filters, run the [combine metrics script](calculate_metrics.py). You will need to set up an appropriate JDK and install all requirements to run the script. 
Filter results can be found on [this Hugging Face dataset](https://huggingface.co/datasets/usvsnsp/semantic-filters).\n\nNote: Filters for templating (incrementing and repeating) as well as Huffman coding length are calculated while the [filters](./filters) are combined.\n## Training Taxonomic Model\nTo train the taxonomic model and launch the greedy taxonomic search, run [this script](model_training.py).\n## Plots\n- To replicate results on taxonomic model performance and the plots of model weights, refer to [this notebook](./plotting/model_perf_testing.ipynb).  \n- For results on correlation coefficients, refer to [this notebook](./plotting/Correlation_Coefficients.ipynb).\n- For the plot of optimal thresholds for the code classifier, refer to [this notebook](./working_dirs/alvin/code_classifier_evaluation/memorization_eval_analysis.ipynb).\n## Citation Details\n```\n@article{prashanth2024recite,\n  title={Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon},\n  author={Prashanth, USVSN Sai and Deng, Alvin and O'Brien, Kyle and SV, Jyothir and Khan, Mohammad Aflah and Borkar, Jaydeep and Choquette-Choo, Christopher A and Fuehne, Jacob Ray and Biderman, Stella and Ke, Tracy and Lee, Katherine and Saphra, Naomi},\n  journal={arXiv preprint arXiv:2406.17746},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleutherai%2Fsemantic-memorization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feleutherai%2Fsemantic-memorization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleutherai%2Fsemantic-memorization/lists"}