{"id":15600951,"url":"https://github.com/lucidrains/tf-bind-transformer","last_synced_at":"2025-10-04T05:30:00.153Z","repository":{"id":57474784,"uuid":"436068372","full_name":"lucidrains/tf-bind-transformer","owner":"lucidrains","description":"A repository with exploration into using transformers to predict DNA ↔ transcription factor binding","archived":false,"fork":false,"pushed_at":"2022-06-02T16:55:30.000Z","size":418,"stargazers_count":86,"open_issues_count":0,"forks_count":9,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-09-25T14:20:35.201Z","etag":null,"topics":["artificial-intelligence","attention-mechanism","deep-learning","gene-expression","transcription-factor-binding","transcription-factors","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-08T00:37:11.000Z","updated_at":"2025-08-25T22:51:43.000Z","dependencies_parsed_at":"2022-09-10T02:22:24.695Z","dependency_job_id":null,"html_url":"https://github.com/lucidrains/tf-bind-transformer","commit_stats":null,"previous_names":[],"tags_count":97,"template":false,"template_full_name":null,"purl":"pkg:github/lucidrains/tf-bind-transformer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Ftf-bind-transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Ftf-bind-transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Ftf-bind-transformer/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Ftf-bind-transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/tf-bind-transformer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Ftf-bind-transformer/sbom","scorecard":{"id":602670,"data":{"date":"2025-08-11","repo":{"name":"github.com/lucidrains/tf-bind-transformer","commit":"420d9382305d99de8a604a980099b634361d21d0"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.6,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/27 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/python-publish.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable 
(binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":2,"reason":"dependency not pinned by hash detected -- score normalized to 2","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-publish.yml:23: update your workflow using https://app.stepsecurity.io/secureworkflow/lucidrains/tf-bind-transformer/python-publish.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-publish.yml:25: update your workflow using https://app.stepsecurity.io/secureworkflow/lucidrains/tf-bind-transformer/python-publish.yml/main?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/python-publish.yml:30","Warn: pipCommand not pinned by hash: .github/workflows/python-publish.yml:31","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   1 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   2 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a 
license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 4 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-21T00:48:16.300Z","repository_id":57474784,"created_at":"2025-08-21T00:48:16.301Z","updated_at":"2025-08-21T00:48:16.301Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278267355,"owners_count":25958831,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanism","deep-learning","gene-expression","transcription-factor-binding","transcription-factors","transformers"],"created_at":"2024-10-03T02:10:08.872Z","updated_at":"2025-10-04T05:30:00.117Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Transcription Factor binding predictions with Attention and Transformers\n\nA repository with exploration into using transformers to predict DNA ↔ transcription factor binding.\n\n## Install\n\nRun the following at the project root to download dependencies\n\n```bash\n$ python setup.py install --user\n```\n\nThen you must install `pybedtools`  as well as `pyBigWig`\n\n```bash\n$ conda install --channel conda-forge --channel bioconda pybedtools pyBigWig\n```\n\n## Usage\n\n```python\nimport torch\nfrom 
tf_bind_transformer import AdapterModel\n\n# instantiate enformer or load pretrained\n\nfrom enformer_pytorch import Enformer\nenformer = Enformer.from_hparams(\n    dim = 1536,\n    depth = 2,\n    target_length = 256\n)\n\n# instantiate model wrapper that takes in enformer\n\nmodel = AdapterModel(\n    enformer = enformer,\n    aa_embed_dim = 512,\n    contextual_embed_dim = 256\n).cuda()\n\n# mock data\n\nseq = torch.randint(0, 4, (1, 196_608 // 2)).cuda() # for ACGT\n\naa_embed = torch.randn(1, 1024, 512).cuda()\naa_mask = torch.ones(1, 1024).bool().cuda()\n\ncontextual_embed = torch.randn(1, 256).cuda() # contextual embeddings, including cell type, species, experimental parameter embeddings\n\ntarget = torch.randn(1, 256).cuda()\n\n# train\n\nloss = model(\n    seq,\n    aa_embed = aa_embed,\n    aa_mask = aa_mask,\n    contextual_embed = contextual_embed,\n    target = target\n)\n\nloss.backward()\n\n# after a lot of training\n\ncorr_coef = model(\n    seq,\n    aa_embed = aa_embed,\n    aa_mask = aa_mask,\n    contextual_embed = contextual_embed,\n    target = target,\n    return_corr_coef = True\n)\n```\n\n## Using ESM or ProtAlbert for fetching of transcription factor protein embeddings\n\n```python\nimport torch\nfrom enformer_pytorch import Enformer\nfrom tf_bind_transformer import AdapterModel\n\nenformer = Enformer.from_hparams(\n    dim = 1536,\n    depth = 2,\n    target_length = 256\n)\n\nmodel = AdapterModel(\n    enformer = enformer,\n    use_aa_embeds = True,                            # set this to True\n    aa_embed_encoder = 'esm',                        # by default, will use esm, but can be set to 'protalbert', which has a longer context length of 2048 (vs esm's 1024)\n    contextual_embed_dim = 256\n).cuda()\n\n# mock data\n\nseq = torch.randint(0, 4, (1, 196_608 // 2)).cuda()\ntf_aa = torch.randint(0, 21, (1, 4)).cuda()           # transcription factor amino acid sequence, from 0 to 20\n\ncontextual_embed = torch.randn(1, 
256).cuda()\ntarget = torch.randn(1, 256).cuda()\n\n# train\n\nloss = model(\n    seq,\n    aa = tf_aa,\n    contextual_embed = contextual_embed,\n    target = target\n)\n\nloss.backward()\n```\n\n- [ ] add alphafold2\n\n## Context passed in as free text\n\nOne can also pass the context (cell type, experimental parameters) directly as free text, which will be encoded by a text transformer trained on PubMed abstracts.\n\n```python\nimport torch\nfrom tf_bind_transformer import AdapterModel\n\n# instantiate enformer or load pretrained\n\nfrom enformer_pytorch import Enformer\nenformer = Enformer.from_hparams(\n    dim = 1536,\n    depth = 2,\n    target_length = 256\n)\n\n# instantiate model wrapper that takes in enformer\n\nmodel = AdapterModel(\n    enformer = enformer,\n    use_aa_embeds = True,\n    use_free_text_context = True,        # this must be set to True\n    free_text_embed_method = 'mean_pool' # allow for mean pooling of embeddings, instead of using CLS token\n).cuda()\n\n# mock data\n\nseq = torch.randint(0, 4, (2, 196_608 // 2)).cuda() # for ACGT\ntarget = torch.randn(2, 256).cuda()\n\ntf_aa = [\n    'KVFGRCELAA',                  # single protein\n    ('AMKRHGLDNY', 'YNDLGHRKMA')   # complex, representations will be concatenated together\n]\n\ncontextual_texts = [\n    'cell type: GM12878 | dual cross-linked',\n    'cell type: H1-hESC'\n]\n\n# train\n\nloss = model(\n    seq,\n    target = target,\n    aa = tf_aa,\n    contextual_free_text = contextual_texts,\n)\n\nloss.backward()\n```\n\n## Binary prediction\n\nTo predict a binary outcome (bind or not bind), set `binary_target = True` when initializing the adapter.\n\nex.\n\n```python\nimport torch\nfrom tf_bind_transformer import AdapterModel\nfrom enformer_pytorch import Enformer\n\n# instantiate enformer or load pretrained\n\nenformer = Enformer.from_hparams(\n    dim = 1536,\n    depth = 2,\n    target_length = 256\n)\n\n# instantiate model wrapper that takes in enformer\n\nmodel = 
AdapterModel(\n    enformer = enformer,\n    use_aa_embeds = True,\n    use_free_text_context = True,\n    free_text_embed_method = 'mean_pool',\n    use_squeeze_excite = True,\n    binary_target = True,                  # set this to True\n    target_mse_loss = False                # whether to use MSE loss with target value\n).cuda()\n\n# mock data\n\nseq = torch.randint(0, 4, (2, 196_608 // 2)).cuda() # for ACGT, batch of 2 to match the targets below\nbinary_target = torch.randint(0, 2, (2,)).cuda()    # bind or not bind\n\ntf_aa = [\n    'KVFGRCELAA',\n    ('AMKRHGLDNY', 'YNDLGHRKMA')\n]\n\ncontextual_texts = [\n    'cell type: GM12878 | chip-seq dual cross-linked',\n    'cell type: H1-hESC | chip-seq single cross-linked'\n]\n\n# train\n\nloss = model(\n    seq,\n    target = binary_target,\n    aa = tf_aa,\n    contextual_free_text = contextual_texts,\n)\n\nloss.backward()\n```\n\n## Predicting Tracks from BigWig\n\n```python\nfrom pathlib import Path\nimport torch\nfrom enformer_pytorch import Enformer\n\nfrom tf_bind_transformer import AdapterModel\nfrom tf_bind_transformer.data_bigwig import BigWigDataset, get_bigwig_dataloader\n\n# constants\n\nROOT = Path('.')\nTFACTOR_TF = str(ROOT / 'tfactor.fastas')\nENFORMER_DATA = str(ROOT / 'chip_atlas' / 'sequences.bed')\nFASTA_FILE_PATH = str(ROOT / 'hg38.ml.fa')\nBIGWIG_PATH = str(ROOT / 'chip_atlas')\nANNOT_FILE_PATH = str(ROOT / 'chip_atlas' / 'annot.tab')\n\n# bigwig dataset and dataloader\n\nds = BigWigDataset(\n    factor_fasta_folder = TFACTOR_TF,\n    bigwig_folder = BIGWIG_PATH,\n    enformer_loci_path = ENFORMER_DATA,\n    annot_file = ANNOT_FILE_PATH,\n    fasta_file = FASTA_FILE_PATH\n)\n\ndl = get_bigwig_dataloader(ds, batch_size = 2)\n\n# enformer\n\nenformer = Enformer.from_hparams(\n    dim = 384,\n    depth = 1,\n    target_length = 896\n)\n\nmodel = AdapterModel(\n    enformer = enformer,\n    use_aa_embeds = True,\n    use_free_text_context = True\n).cuda()\n\n# mock data\n\nseq, tf_aa, context_str, target = next(dl)\nseq, target = 
seq.cuda(), target.cuda()\n\n# train\n\nloss = model(\n    seq,\n    aa = tf_aa,\n    contextual_free_text = context_str,\n    target = target\n)\n\nloss.backward()\n```\n\n## Data\n\nThe data needed for training is at \u003ca href=\"https://remap.univ-amu.fr/download_page\"\u003ethis download page\u003c/a\u003e.\n\n### Transcription factors for Human and Mouse\n\nTo fetch the protein sequences for both species, first download the ReMap CRMs bed files. All targets are extracted from these files, and their fastas are then downloaded from UniProt.\n\nDownload human remap CRMs\n\n```bash\n$ wget https://remap.univ-amu.fr/storage/remap2022/hg38/MACS2/remap2022_crm_macs2_hg38_v1_0.bed.gz\n$ gzip -d remap2022_crm_macs2_hg38_v1_0.bed.gz\n```\n\nDownload mouse remap CRMs\n\n```bash\n$ wget https://remap.univ-amu.fr/storage/remap2022/mm10/MACS2/remap2022_crm_macs2_mm10_v1_0.bed.gz\n$ gzip -d remap2022_crm_macs2_mm10_v1_0.bed.gz\n```\n\nDownloading all human transcription factors\n\n```bash\n$ python script/fetch_factor_fastas.py --species human\n```\n\nFor mouse transcription factors\n\n```bash\n$ python script/fetch_factor_fastas.py --species mouse\n```\n\n## Generating Negatives\n\n### Generating Hard Negatives\n\nFor starters, the `RemapAllPeakDataset` will allow you to load data easily from the full remap peaks bed file for training.\n\nFirst, generate the non-peaks dataset by running the following function\n\n```python\nfrom tf_bind_transformer.data import generate_random_ranges_from_fasta\n\ngenerate_random_ranges_from_fasta(\n    './hg38.ml.fa',\n    output_filename = './path/to/generated-non-peaks.bed',    # path to output file\n    context_length = 4096,\n    num_entries_per_key = 1_000_000,                          # number of negative samples\n    filter_bed_files = [\n        './remap_all.bed',                                    # filter out by all peak ranges (todo, allow filtering namespaced to experiment and target)\n        
'./hg38.blacklist.rep.bed'                            # further filtering by blacklisted regions (gs://basenji_barnyard/hg38.blacklist.rep.bed)\n    ]\n)\n```\n\n### Generating Scoped Negatives - Negatives per Dataset (experiment + target + cell type)\n\nTodo\n\n## Simple Trainer class for fine-tuning\n\nworking fine-tuning training loop for bind / no-bind prediction\n\n```python\nimport torch\nfrom enformer_pytorch import Enformer\n\nfrom tf_bind_transformer import AdapterModel, Trainer\n\n# instantiate enformer or load pretrained\n\nenformer = Enformer.from_pretrained('EleutherAI/enformer-official-rough', target_length = -1)\n\n# instantiate model wrapper that takes in enformer\n\nmodel = AdapterModel(\n    enformer = enformer,\n    use_aa_embeds = True,\n    use_free_text_context = True,\n    free_text_embed_method = 'mean_pool',\n    binary_target = True,\n    target_mse_loss = True,\n    use_squeeze_excite = True,\n    aux_read_value_loss = True     # use auxiliary read value loss, can be turned off\n).cuda()\n\n# pass the model (adapter + enformer) to the Trainer\n\ntrainer = Trainer(\n    model,\n    batch_size = 2,                                   # batch size\n    context_length = 4096,                            # genetic sequence length\n    grad_accum_every = 8,                             # gradient accumulation steps\n    grad_clip_norm = 2.0,                             # gradient clipping\n    validate_every = 250,\n    remap_bed_file = './remap2022_all.bed',           # path to remap bed peaks\n    negative_bed_file = './generated-non-peaks.bed',  # path to generated non-peaks\n    factor_fasta_folder = './tfactor.fastas',         # path to factor fasta files\n    fasta_file = './hg38.ml.fa',                      # human genome sequences\n    train_chromosome_ids = [*range(1, 24, 2), 'X'],   # chromosomes to train on\n    valid_chromosome_ids = [*range(2, 24, 2)],        # chromosomes to validate on\n    held_out_targets = ['AFF4'],                
      # targets to hold out for validation\n    experiments_json_path = './data/experiments.json' # path to all experiments data, at this path relative to the project root, if repository is git cloned\n)\n\nwhile True:\n    _ = trainer()\n\n```\n\nworking fine-tuning script for training on new enformer tracks, with cross-attending transcription factor protein embeddings and cell type conditioning\n\n```python\nfrom dotenv import load_dotenv\n\n# set path to cache in .env and uncomment the next line\n# load_dotenv()\n\nfrom enformer_pytorch import Enformer\nfrom tf_bind_transformer import AdapterModel, BigWigTrainer\n\n# training constants\n\nBATCH_SIZE = 1\nGRAD_ACCUM_STEPS = 8\n\n# effective batch size of BATCH_SIZE * GRAD_ACCUM_STEPS = 8\n\nVALIDATE_EVERY = 250\nGRAD_CLIP_MAX_NORM = 1.5\n\nTFACTOR_FOLDER = './tfactor.fastas'\nFASTA_FILE_PATH = './hg38.ml.fa'\n\nLOCI_PATH = './sequences.bed'\nBIGWIG_PATH = './bigwig_folder'\nANNOT_FILE_PATH =  './experiments.tab'\nTARGET_LENGTH = 896\n\nTRAIN_CHROMOSOMES = [*range(1, 24, 2), 'X'] # train on odd chromosomes\nVALID_CHROMOSOMES = [*range(2, 24, 2)]      # validate on even\n\nHELD_OUT_TARGET = ['SOX2']\n\n# instantiate enformer or load pretrained\n\nenformer = Enformer.from_pretrained('EleutherAI/enformer-official-rough', target_length = TARGET_LENGTH)\n\n# instantiate model wrapper that takes in enformer\n\nmodel = AdapterModel(\n    enformer = enformer,\n    use_aa_embeds = True,\n    use_free_text_context = True,\n    free_text_embed_method = 'mean_pool',\n    aa_embed_encoder = 'protalbert'\n).cuda()\n\n\n# trainer class for fine-tuning\n\ntrainer = BigWigTrainer(\n    model,\n    loci_path = LOCI_PATH,\n    bigwig_folder_path = BIGWIG_PATH,\n    annot_file_path = ANNOT_FILE_PATH,\n    target_length = TARGET_LENGTH,\n    batch_size = BATCH_SIZE,\n    validate_every = VALIDATE_EVERY,\n    grad_clip_norm = GRAD_CLIP_MAX_NORM,\n    grad_accum_every = GRAD_ACCUM_STEPS,\n    factor_fasta_folder = TFACTOR_FOLDER,\n    
fasta_file = FASTA_FILE_PATH,\n    train_chromosome_ids = TRAIN_CHROMOSOMES,\n    valid_chromosome_ids = VALID_CHROMOSOMES,\n    held_out_targets = HELD_OUT_TARGET\n)\n\n# do gradient steps in a while loop\n\nwhile True:\n    _ = trainer()\n```\n\n## Resources\n\nIf you are low on GPU memory, you can save by making sure the protein and contextual embeddings are executed on CPU\n\n```bash\nCONTEXT_EMBED_USE_CPU=1 PROTEIN_EMBED_USE_CPU=1 python train.py\n```\n\n## Data\n\nTranscription factor dataset\n\n```python\nfrom tf_bind_transformer.data import FactorProteinDataset\n\nds = FactorProteinDataset(\n    folder = 'path/to/tfactor/fastas'\n)\n\n# single factor\n\nds['ETV1'] # \u003cseq\u003e\n\n# multi-complexes\n\nds['PAX3-FOXO1'] # (\u003cseq1\u003e, \u003cseq2\u003e)\n\n```\n\n## Preprocessing (wip)\n\nget a copy of hg38 blacklist bed file from calico\n\n```bash\n$ gsutil cp gs://basenji_barnyard/hg38.blacklist.rep.bed ./\n```\n\nusing bedtools to filter out repetitive regions of the genome\n\n```bash\n$ bedtools intersect -v -a ./remap2022_all_macs2_hg38_v1_0.bed -b ./hg38.blacklist.rep.bed \u003e remap2022_all_filtered.bed\n```\n\n## Caching\n\nDuring training, protein sequences and contextual strings are cached to `~/.cache.tf.bind.transformer` directory. 
If you would like to make sure the caching is working, you just need to run your training script with `VERBOSE=1`\n\nex.\n\n```bash\n$ VERBOSE=1 python train.py\n```\n\nYou can also force a cache clearance\n\n```bash\n$ CLEAR_CACHE=1 python train.py\n```\n\n## Todo\n\n- [x] ESM and AF2 embedding fetching integrations\n- [x] HF transformers integration for conditioning on free text\n- [x] allow for fine-tuning layernorms of Enformer easily\n- [x] add caching for external embeddings\n- [x] figure out a way for external models (ESM, transformers) to be omitted from state dictionary on saving (use singletons)\n- [x] take care of caching genetic sequences when enformer is frozen\n- [x] offer a fully transformer variant with cross-attention with shared attention matrix and FiLM conditioning with contextual embed\n- [x] also offer using pooled genetic / protein sequence concatted with context -\u003e project -\u003e squeeze excitation type conditioning\n- [x] use checkpointing when fine-tuning enformer\n- [x] take care of prepping dataframe with proper chromosome and training / validation split\n- [x] use basenji blacklist bed file for filtering out rows in remap\n- [x] filter remap dataframe based on tfactor fasta folder\n- [x] filter remap dataframe with hg38 blacklist\n- [x] handle targets with modifications from remap with all peaks (underscore in name)\n- [x] grad clipping\n- [x] add a safe initialization whereby rows of dataframe with targets not found in the tfactor fasta folder will be filtered out\n- [x] add accuracy metric to fine tune script\n- [x] master trainer class that handles both training / validation splitting, efficient instantiation of dataframe, filtering etc\n- [x] write a simple trainer class that takes care of the training loop\n- [x] create faster protein and context embedding derivation by optionally moving model to gpu and back to cpu when done\n- [x] use ProtTrans for longer context proteins, look into AF2\n- [x] make protalbert usable with 
one flag\n- [x] log auxiliary losses appropriately (read value)\n- [x] write fine-tuning script for finetuning on merged genomic track(s) from remap\n- [ ] support for custom transformers other than enformer\n- [ ] warmup in training loop\n- [ ] mixed precision\n- [ ] use wandb for experiment tracking\n- [ ] cleanup tech debt in data and protein_utils\n- [ ] explore protein model fine-tuning of layernorm\n- [ ] auto-auroc calc\n- [ ] k-fold cross validation\n- [ ] output attention intermediates (or convolution output for hypertransformer), for interpreting binding site\n- [ ] use prefect.io to manage downloading of tfactors fastas, remap scoped negative peaks, blacklist filtering etc\n\n## Appreciation\n\nThis work was generously sponsored by \u003ca href=\"https://github.com/jeffhsu3\"\u003eJeff Hsu\u003c/a\u003e to be done completely open sourced.\n\n## Citations\n\n```bibtex\n@article {Avsec2021.04.07.438649,\n    author  = {Avsec, {\\v Z}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. 
and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},\n    title   = {Effective gene expression prediction from sequence by integrating long-range interactions},\n    elocation-id = {2021.04.07.438649},\n    year    = {2021},\n    doi     = {10.1101/2021.04.07.438649},\n    publisher = {Cold Spring Harbor Laboratory},\n    URL     = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649},\n    eprint  = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649.full.pdf},\n    journal = {bioRxiv}\n}\n```\n\n```bibtex\n@misc{yao2021filip,\n    title   = {FILIP: Fine-grained Interactive Language-Image Pre-Training},\n    author  = {Lewei Yao and Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},\n    year    = {2021},\n    eprint  = {2111.07783},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{tay2020hypergrid,\n    title   = {HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections},\n    author  = {Yi Tay and Zhe Zhao and Dara Bahri and Donald Metzler and Da-Cheng Juan},\n    year    = {2020},\n    eprint  = {2007.05891},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{lowe2021logavgexp,\n    title   = {LogAvgExp Provides a Principled and Performant Global Pooling Operator},\n    author  = {Scott C. 
Lowe and Thomas Trappenberg and Sageev Oore},\n    year    = {2021},\n    eprint  = {2111.01742},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@article{10.1093/nar/gkab996,\n    author  = {Hammal, Fayrouz and de Langen, Pierre and Bergon, Aurélie and Lopez, Fabrice and Ballester, Benoit},\n    title   = \"{ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments}\",\n    journal = {Nucleic Acids Research},\n    issn    = {0305-1048},\n    doi     = {10.1093/nar/gkab996},\n    url     = {https://doi.org/10.1093/nar/gkab996},\n    eprint  = {https://academic.oup.com/nar/article-pdf/50/D1/D316/42058627/gkab996.pdf},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Ftf-bind-transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Ftf-bind-transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Ftf-bind-transformer/lists"}