{"id":35176670,"url":"https://github.com/asigalov61/midisim","last_synced_at":"2026-01-13T22:59:01.768Z","repository":{"id":331029241,"uuid":"1123948957","full_name":"asigalov61/midisim","owner":"asigalov61","description":"Calculate, search, and analyze MIDI-to-MIDI similarity at scale","archived":false,"fork":false,"pushed_at":"2025-12-31T10:24:55.000Z","size":4783,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-02T05:09:49.861Z","etag":null,"topics":["midi","midi-search","midi-similarity","music","music-search","music-similarity","similarity-search"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/midisim/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asigalov61.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-28T01:32:59.000Z","updated_at":"2025-12-31T10:16:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/asigalov61/midisim","commit_stats":null,"previous_names":["asigalov61/midisim"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/asigalov61/midisim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asigalov61%2Fmidisim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asigalov61%2Fmidisim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asigalov
61%2Fmidisim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asigalov61%2Fmidisim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asigalov61","download_url":"https://codeload.github.com/asigalov61/midisim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asigalov61%2Fmidisim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28400893,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-13T14:36:09.778Z","status":"ssl_error","status_checked_at":"2026-01-13T14:35:19.697Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["midi","midi-search","midi-similarity","music","music-search","music-similarity","similarity-search"],"created_at":"2025-12-28T22:51:59.174Z","updated_at":"2026-01-13T22:59:01.760Z","avatar_url":"https://github.com/asigalov61.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# midisim\n## Calculate, search, and analyze MIDI-to-MIDI similarity at scale\n\n\u003cimg width=\"1536\" height=\"1024\" alt=\"midisim\" src=\"https://github.com/user-attachments/assets/0b379b3a-ec9f-42c7-ba09-6b7cce87a338\" /\u003e\n\n***\n\n## Main features\n\n* Ultra-fast and flexible GPU/CPU MIDI-to-MIDI similarity calculation, search and analysis\n* Quality pre-trained models and 
comprehensive pre-computed embeddings sets\n* Stand-alone, versatile, and extensive codebase for general or custom MIDI-to-MIDI similarity tasks\n* Full cross-platform compatibility and support\n\n***\n\n## [Pre-trained models](https://huggingface.co/projectlosangeles/midisim)\n\n* ```midisim_small_pre_trained_model_2_epochs_43117_steps_0.3148_loss_0.9229_acc.pth``` - Very fast and accurate small model, suitable for all tasks. This model is included in PyPI package or it can be downloaded from Hugging Face\n* ```midisim_large_pre_trained_model_2_epochs_86275_steps_0.2054_loss_0.9385_acc.pth``` - Fast large model for more nuanced embeddings generation. Download checkpoint from Hugging Face\n\n#### Both pre-trained models were trained on full [Godzilla Piano](https://huggingface.co/datasets/asigalov61/Godzilla-Piano) dataset for 2 complete epochs\n\n***\n\n## [Pre-computed embeddings sets](https://huggingface.co/datasets/projectlosangeles/midisim-embeddings)\n\n### For small pre-trained model\n\n```discover_midi_dataset_37292_genres_midis_embeddings_cc_by_nc_sa.npy``` - 37292 genre MIDIs embeddings for genre (artist and song) identification tasks\n\n```discover_midi_dataset_202400_identified_midis_embeddings_cc_by_nc_sa.npy``` - 202400 identified MIDIs embeddings for MIDI identification tasks\n\n```discover_midi_dataset_3480123_clean_midis_embeddings_cc_by_nc_sa.npy``` - 3480123 select clean MIDIs embeddings for large scale similarity search and analysis tasks\n\n### For large pre-trained model\n\n```discover_midi_dataset_37303_genres_midis_embeddings_large_cc_by_nc_sa.npy``` - 37303 genre MIDIs embeddings for genre (artist and song) identification tasks\n\n```discover_midi_dataset_202400_identified_midis_embeddings_large_cc_by_nc_sa.npy``` - 202400 identified MIDIs embeddings for MIDI identification tasks\n\n```discover_midi_dataset_3480123_clean_midis_embeddings_large_cc_by_nc_sa.npy``` - 3480123 select clean MIDIs embeddings for large scale similarity search and 
analysis tasks\n\n#### Source MIDI dataset: [Discover MIDI Dataset](https://huggingface.co/datasets/projectlosangeles/Discover-MIDI-Dataset)\n\n***\n\n### [Similarity search output samples](https://huggingface.co/datasets/projectlosangeles/midisim-samples)\n\n```midisim-similarity-search-output-samples-CC-BY-NC-SA.zip``` - ~300000 MIDIs identified with the midisim music discovery pipeline using both pre-trained models\n\n#### Source MIDI dataset: [Discover MIDI Dataset](https://huggingface.co/datasets/projectlosangeles/Discover-MIDI-Dataset)\n\n***\n\n## Installation\n\n### midisim PyPI package (for general use)\n\n```sh\n!pip install -U midisim\n```\n\n### x-transformers 2.3.1 (for raw/custom tasks)\n\n```sh\n!pip install x-transformers==2.3.1\n```\n\n***\n\n## Basic use guide\n\n### General use example\n\n```python\n# ================================================================================================\n# Initialize midisim\n# ================================================================================================\n\n# Import main midisim module\nimport midisim\n\n# ================================================================================================\n# Prepare midisim embeddings\n# ================================================================================================\n\n# Option 1: Download sample pre-computed embeddings corpus from Hugging Face\nemb_path = midisim.download_embeddings()\n\n# Option 2: use custom pre-computed embeddings corpus\n# See custom embeddings generation section of this README for details\n# emb_path = './custom_midis_embeddings_corpus.npy'\n\n# Load downloaded embeddings corpus\ncorpus_midi_names, corpus_emb = midisim.load_embeddings(emb_path)\n\n# ================================================================================================\n# Prepare midisim model\n# ================================================================================================\n\n# Option 1: Download main 
pre-trained midisim model from Hugging Face\nmodel_path = midisim.download_model()\n\n# Option 2: Use main pre-trained midisim model included in midisim PyPI package\n# model_path = get_package_models()[0]['path']\n\n# Load midisim model\nmodel, ctx, dtype = midisim.load_model(model_path)\n\n# ================================================================================================\n# Prepare source MIDI\n# ================================================================================================\n\n# Load source MIDI\ninput_toks_seqs = midisim.midi_to_tokens('Come To My Window.mid')\n\n# ================================================================================================\n# Calculate and analyze embeddings\n# ================================================================================================\n\n# Compute source/query embeddings\nquery_emb = midisim.get_embeddings_bf16(model, input_toks_seqs)\n\n# Calculate cosine similarity between source/query MIDI embeddings and embeddings corpus\nidxs, sims = midisim.cosine_similarity_topk(query_emb, corpus_emb)\n\n# ================================================================================================\n# Process, print and save results\n# ================================================================================================\n\n# Convert the results to sorted list with transpose values\nidxs_sims_tvs_list = midisim.idxs_sims_to_sorted_list(idxs, sims)\n\n# Print corpus matches and (optionally) convert the final result to a handy list for further processing\ncorpus_matches_list = midisim.print_sorted_idxs_sims_list(idxs_sims_tvs_list, corpus_midi_names, return_as_list=True)\n\n# ================================================================================================\n# Copy matched MIDIs from the MIDI corpus for listening and further evaluation and analysis\n# ================================================================================================\n\n# 
Copy matched corpus MIDI to a desired directory for easy evaluation and analysis\nout_dir_path = midisim.copy_corpus_files(corpus_matches_list)\n\n# ================================================================================================\n```\n\n### Raw/custom use example\n\n#### Small model (2 epochs)\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\n# Original model hyperparameters\nSEQ_LEN = 3072\n\nMASK_IDX     = 384 # Use this value for masked modelling\nPAD_IDX      = 385 # Model pad index\nVOCAB_SIZE   = 386 # Total vocab size\n\nMASK_PROB    = 0.15 # Original training mask probability value (use for masked modelling)\n\nDEVICE = 'cuda' # You can use any compatible device or CPU\nDTYPE  = torch.bfloat16 # Original training dtype\n\n# Official main midisim model checkpoint name\nMODEL_CKPT = 'midisim_small_pre_trained_model_2_epochs_43117_steps_0.3148_loss_0.9229_acc.pth'\n\n# Model architecture using x-transformers\nmodel = TransformerWrapper(\n    num_tokens = VOCAB_SIZE,\n    max_seq_len = SEQ_LEN,\n    attn_layers = Encoder(\n        dim   = 512,\n        depth = 8,\n        heads = 8,\n        rotary_pos_emb = True,\n        attn_flash = True,\n    ),\n)\n\nmodel.load_state_dict(torch.load(MODEL_CKPT, map_location=DEVICE))\n\nmodel.to(DEVICE)\nmodel.eval()\n\n# Original training autocast setup\nautocast_ctx = torch.amp.autocast(device_type=DEVICE, dtype=DTYPE)\n```\n\n#### Large model (2 epochs)\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\n# Original model hyperparameters\nSEQ_LEN = 3072\n\nMASK_IDX     = 384 # Use this value for masked modelling\nPAD_IDX      = 385 # Model pad index\nVOCAB_SIZE   = 386 # Total vocab size\n\nMASK_PROB    = 0.15 # Original training mask probability value (use for masked modelling)\n\nDEVICE = 'cuda' # You can use any compatible device or CPU\nDTYPE  = torch.bfloat16 # Original training dtype\n\n# Official main midisim model checkpoint 
name\nMODEL_CKPT = 'midisim_large_pre_trained_model_2_epochs_86275_steps_0.2054_loss_0.9385_acc.pth'\n\n# Model architecture using x-transformers\nmodel = TransformerWrapper(\n    num_tokens = VOCAB_SIZE,\n    max_seq_len = SEQ_LEN,\n    attn_layers = Encoder(\n        dim   = 512,\n        depth = 16,\n        heads = 8,\n        rotary_pos_emb = True,\n        attn_flash = True,\n    ),\n)\n\nmodel.load_state_dict(torch.load(MODEL_CKPT, map_location=DEVICE))\n\nmodel.to(DEVICE)\nmodel.eval()\n\n# Original training autocast setup\nautocast_ctx = torch.amp.autocast(device_type=DEVICE, dtype=DTYPE)\n```\n\n***\n\n## Creating custom MIDI corpus embeddings\n\n```python\n# ================================================================================================\n\n# Load main midisim module\nimport midisim\n\n# Import helper modules\nimport os\nimport tqdm\n\n# ================================================================================================\n\n# Call included TMIDIX module through midisim to create MIDI files list\ncustom_midi_corpus_file_names = midisim.TMIDIX.create_files_list(['./custom_midi_corpus_dir/'])\n\n# ================================================================================================\n\n# Create two lists: one with MIDI corpus file names \n# and another with MIDI corpus tokens representations suitable for embeddings generation\nmidi_corpus_file_names = []\nmidi_corpus_tokens = []\n\nfor midi_file in tqdm.tqdm(custom_midi_corpus_file_names):\n    midi_corpus_file_names.append(os.path.splitext(os.path.basename(midi_file))[0])\n    \n    midi_tokens = midisim.midi_to_tokens(midi_file, transpose_factor=0, verbose=False)[0]\n    midi_corpus_tokens.append(midi_tokens)\n\n# It is highly recommended to sort the resulting corpus by tokens sequence length\n# This greatly speeds up embeddings calculations\nsorted_midi_corpus = sorted(zip(midi_corpus_file_names, midi_corpus_tokens), key=lambda x: len(x[1]))\nmidi_corpus_file_names, 
midi_corpus_tokens = map(list, zip(*sorted_midi_corpus))\n\n# ================================================================================================\n# Now you are ready to generate embeddings as follows:\n# ================================================================================================\n\n# Load main midisim model\nmodel, ctx, dtype = midisim.load_model(verbose=False)\n\n# Generate MIDI corpus embeddings\nmidi_corpus_embeddings = midisim.get_embeddings_bf16(model, midi_corpus_tokens)\n\n# ================================================================================================\n\n# Save generated MIDI corpus embeddings and MIDI corpus file names in one handy NumPy file\nmidisim.save_embeddings(midi_corpus_file_names,\n                        midi_corpus_embeddings,\n                        verbose=False\n                       )\n\n# ================================================================================================\n\n# You now can use this saved custom MIDI corpus NumPy file with midisim.load_embeddings()\n# and the rest of the pipeline outlined in the general use section above\n```\n\n***\n\n## Music discovery pipeline\nHere is a complete MIDI music discovery pipeline example using midisim and [Discover MIDI Dataset](https://huggingface.co/datasets/projectlosangeles/Discover-MIDI-Dataset)\n\n### Install midisim and discovermidi PyPI packages\n\n```sh\n!pip install -U midisim\n```\n\n```sh\n!pip install -U discovermidi\n```\n\n### Download and unzip Discover MIDI Dataset\n\n```python\nimport discovermidi\nfrom discovermidi import fast_parallel_extract\n\ndiscovermidi.download_dataset()\n\nfast_parallel_extract.fast_parallel_extract()\n```\n\n### Choose and prepare one midisim model and corresponding embeddings set\n\n#### Small model\n\n```python\nmodel_ckpt = 'midisim_small_pre_trained_model_2_epochs_43117_steps_0.3148_loss_0.9229_acc.pth'\nmodel_depth = 8\n\nembeddings_file = 
'discover_midi_dataset_3480123_clean_midis_embeddings_cc_by_nc_sa.npy'\n```\n\n#### Large model\n\n```python\nmodel_ckpt = 'midisim_large_pre_trained_model_2_epochs_86275_steps_0.2054_loss_0.9385_acc.pth'\nmodel_depth = 16\n\nembeddings_file = 'discover_midi_dataset_3480123_clean_midis_embeddings_large_cc_by_nc_sa.npy'\n```\n\n### Create Master MIDI dataset directory and upload your source/master MIDIs in it\n\n```python\nimport os\n\nos.makedirs('./Master-MIDI-Dataset/', exist_ok=True)\n```\n\n### Initialize midisim, download and load chosen midisim model and embeddings set\n\n```python\n# Import main midisim module\nimport midisim\n\n# Download embeddings from Hugging Face\nemb_path = midisim.download_embeddings(filename=embeddings_file)\n\n# Load downloaded embeddings corpus\ncorpus_midi_names, corpus_emb = midisim.load_embeddings(embeddings_path=emb_path)\n\n# Download midisim model from Hugging Face\nmodel_path = midisim.download_model(filename=model_ckpt)\n\n# Load midisim model\nmodel, ctx, dtype = midisim.load_model(model_path,\n                                       depth=model_depth\n                                      )\n```\n\n### Create Master MIDI dataset files list\n\n```python\nfilez = midisim.TMIDIX.create_files_list(['./Master-MIDI-Dataset/'])\n```\n\n### Launch the search\n\n```python\nimport os\nimport tqdm\n\nfor fa in tqdm.tqdm(filez):\n    \n    # Load source MIDI\n    input_toks_seqs = midisim.midi_to_tokens(fa, verbose=False)\n\n    if input_toks_seqs:\n    \n        # ================================================================================================\n        # Calculate and analyze embeddings\n        # ================================================================================================\n        \n        # Compute source/query embeddings\n        query_emb = midisim.get_embeddings_bf16(model, input_toks_seqs, verbose=False)\n    \n        # Calculate cosine similarity between source/query MIDI embeddings and 
embeddings corpus\n        idxs, sims = midisim.cosine_similarity_topk(query_emb, corpus_emb, verbose=False)\n       \n        # ================================================================================================\n        # Process, print and save results\n        # ================================================================================================\n         \n        # Convert the results to sorted list with transpose values\n        idxs_sims_tvs_list = midisim.idxs_sims_to_sorted_list(idxs, sims)\n       \n        # Print corpus matches and (optionally) convert the final result to a handy list for further processing\n        corpus_matches_list = midisim.print_sorted_idxs_sims_list(idxs_sims_tvs_list,\n                                                                  corpus_midi_names,\n                                                                  return_as_list=True\n                                                                 )\n         \n        # ================================================================================================\n        # Copy matched MIDIs from the MIDI corpus for listening and further evaluation and analysis\n        # ================================================================================================\n        \n        # Copy matched corpus MIDI to a desired directory for easy evaluation and analysis\n        out_dir_path = midisim.copy_corpus_files(corpus_matches_list,\n                                                 corpus_midis_dirs=['./Discover-MIDI-Dataset/MIDIs/'],\n                                                 main_output_dir='Output-MIDI-Dataset',\n                                                 sub_output_dir=os.path.splitext(os.path.basename(fa))[0],\n                                                 verbose=False\n                                                )\n        # 
================================================================================================\n```\n\n***\n\n## midisim functions reference lists\n\n### Main functions\n\n- ```midisim.copy_corpus_files``` — *Copy or synchronize MIDI corpus files from a source directory to a target corpus location.*  \n- ```midisim.cosine_similarity_topk``` — *Compute cosine similarities between a query embedding and a set of embeddings and return the top‑K matches.*  \n- ```midisim.download_all_embeddings``` — *Download an entire embeddings dataset snapshot from a Hugging Face dataset repository to a local directory.*  \n- ```midisim.download_embeddings``` — *Download a single precomputed embeddings `.npy` file from a Hugging Face dataset repository.*  \n- ```midisim.download_model``` — *Download a pre-trained model checkpoint file from a Hugging Face model repository to a local directory.*  \n- ```midisim.get_embeddings_bf16``` — *Compute embeddings for one or more token sequences with a loaded model, running inference in bfloat16 on supported hardware.*  \n- ```midisim.idxs_sims_to_sorted_list``` — *Convert parallel index and similarity arrays into a single sorted list of (index, similarity) pairs ordered by similarity.*  \n- ```midisim.load_embeddings``` — *Load a saved NumPy embeddings file and return the arrays of MIDI names and corresponding embedding vectors.*  \n- ```midisim.load_model``` — *Construct a Transformer model, load weights from a checkpoint, move it to the requested device, and return the model with an AMP autocast context and dtype.*  \n- ```midisim.masked_mean_pool``` — *Compute a masked mean pooling over sequence embeddings, ignoring padded positions via a boolean or numeric mask.*  \n- ```midisim.midi_to_tokens``` — *Convert a single-track MIDI file into one or more compact integer token sequences (with optional transpositions) suitable for model input.*  \n- ```midisim.pad_and_mask``` — *Pad a batch of variable-length token sequences to a common length and produce an 
attention/mask tensor indicating real tokens vs padding.*  \n- ```midisim.print_sorted_idxs_sims_list``` — *Pretty-print a sorted list of (index, similarity) pairs, optionally annotating entries with filenames or metadata.*  \n- ```midisim.save_embeddings``` — *Save a list of name strings and their corresponding embedding vectors into a structured NumPy array and optionally persist it to disk.*\n\n### Helper functions\n\n- ```midisim.helpers.get_package_models``` — *Return a sorted list of packaged model files and their paths.*\n- ```midisim.helpers.get_package_embeddings``` — *Return a sorted list of packaged embedding files and their paths.*\n- ```midisim.helpers.get_normalized_midi_md5_hash``` — *Compute original and normalized MD5 hashes for a MIDI file.*\n- ```midisim.helpers.normalize_midi_file``` — *Normalize a MIDI file and write the result to disk.*\n- ```midisim.helpers.install_apt_package``` — *Idempotently install an apt package with retries and optional python‑apt.*\n\n***\n\n## Limitations\n\n* Current code and models support only MIDI music elements similarity (start-times, durations and pitches)\n* MIDI channels, instruments, velocities and drums similarities are not currently supported due to complexity and practicality considerations\n* Current pre-trained models are limited to a 3k sequence length (~1000 MIDI music notes), so long-running MIDIs can only be analyzed in chunks\n* Solo drum track MIDIs are not currently supported and can't be analyzed\n\n***\n\n## Citations\n\n```bibtex\n@misc{project_los_angeles_2025,\n\tauthor       = { Project Los Angeles },\n\ttitle        = { midisim (Revision 707e311) },\n\tyear         = 2025,\n\turl          = { https://huggingface.co/projectlosangeles/midisim },\n\tdoi          = { 10.57967/hf/7383 },\n\tpublisher    = { Hugging Face }\n}\n```\n\n```bibtex\n@misc{project_los_angeles_2025,\n\tauthor       = { Project Los Angeles },\n\ttitle        = { midisim-embeddings (Revision 8ebb453) },\n\tyear         = 
2025,\n\turl          = { https://huggingface.co/datasets/projectlosangeles/midisim-embeddings },\n\tdoi          = { 10.57967/hf/7382 },\n\tpublisher    = { Hugging Face }\n}\n```\n\n```bibtex\n@misc{project_los_angeles_2025,\n\tauthor       = { Project Los Angeles },\n\ttitle        = { midisim-samples (Revision 79afcc1) },\n\tyear         = 2025,\n\turl          = { https://huggingface.co/datasets/projectlosangeles/midisim-samples },\n\tdoi          = { 10.57967/hf/7388 },\n\tpublisher    = { Hugging Face }\n}\n```\n\n```bibtex\n@misc{project_los_angeles_2025,\n\tauthor       = { Project Los Angeles },\n\ttitle        = { Discover-MIDI-Dataset (Revision 0eaecb5) },\n\tyear         = 2025,\n\turl          = { https://huggingface.co/datasets/projectlosangeles/Discover-MIDI-Dataset },\n\tdoi          = { 10.57967/hf/7361 },\n\tpublisher    = { Hugging Face }\n}\n```\n\n***\n\n### Project Los Angeles\n### Tegridy Code 2025\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasigalov61%2Fmidisim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasigalov61%2Fmidisim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasigalov61%2Fmidisim/lists"}