{"id":19090731,"url":"https://github.com/kodedninja/beatsbeats","last_synced_at":"2026-05-24T20:30:16.029Z","repository":{"id":143857845,"uuid":"506339672","full_name":"kodedninja/beatsbeats","owner":"kodedninja","description":"audio onset, tempo and beat detection for uni","archived":false,"fork":false,"pushed_at":"2022-06-22T17:18:24.000Z","size":216,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-02T22:30:12.899Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kodedninja.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-22T17:14:58.000Z","updated_at":"2024-01-12T19:47:23.000Z","dependencies_parsed_at":"2023-07-14T11:46:58.320Z","dependency_job_id":null,"html_url":"https://github.com/kodedninja/beatsbeats","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kodedninja%2Fbeatsbeats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kodedninja%2Fbeatsbeats/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kodedninja%2Fbeatsbeats/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kodedninja%2Fbeatsbeats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kodedninja","download_url":"https://codeload.github.com/kodedninja/beatsbe
ats/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240138111,"owners_count":19753866,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T03:08:50.136Z","updated_at":"2026-05-24T20:30:15.987Z","avatar_url":"https://github.com/kodedninja.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Audio Onset, Tempo and Beat Detection\r\n\r\nFor a course at uni.\r\n\r\n## Usage\r\n\r\nTraining data for the onset detection (`.wav` and `.gt` files) goes to `data/train_onsets`, then \r\nthe model can be trained by running the `train.ipynb` notebook from top to bottom.\r\n\r\nFor getting the submission file, `onsets.py`, `tempo.py` and `beats.py` can be run respectively.\r\n\r\n## Description\r\n\r\nAs our strategy for solving the three challenges, we followed the approach of using a neural-\r\nnetwork architecture for the onset prediction, and then using the onset envelope we get from this\r\nmodel as the input for the tempo and beats estimation, where we rely on classical, algorithmic\r\nmethods. 
Most of our effort therefore went into pre-processing the data and trying out\r\ndifferent training setups and network architectures for the onset detection challenge, while for\r\ntempo estimation and beat tracking we implemented solutions described on the slides, which didn’t require much\r\nexperimentation.\r\n\r\n#### Onset Detection\r\n\r\nOur approach to training a neural network for onset detection was based on [^1], whose strong\r\nperformance motivated us to use a “deep” method for this challenge. We applied the same data\r\npre-processing as they did, namely transforming the audio signals into three logarithmically scaled\r\nMel spectrograms with window sizes of 23 ms, 46 ms and 93 ms and a hop size of 10 ms. Further,\r\nwe normalized each frequency band to zero mean and unit variance. The spectrograms were then\r\nchopped into pieces of 15 frames (±70 ms around the center frame) and fed into the neural network.\r\nThe most prominent difference between our model and the model used in [^1] is that\r\ntheir model classifies only the frame at the center of the 15-frame piece as an onset or not, treating\r\nit as a binary classification problem, whereas our model doesn’t reduce the input in length and classifies all\r\n15 frames at the same time. The model therefore directly learns to output an onset envelope\r\nover time. To achieve this, we took the CNN architecture from [^1] and replaced the final\r\nfully-connected layers with common solutions from modern vision models: global average / max pooling\r\nand 1x1 convolutions. We also added padding to the convolutional layers so that the input\r\ndoesn’t reduce in length. This model already learned to identify some onsets, but as most of\r\nthe parameters in the model from the paper were in the final two layers, our model needed more\r\ncomplexity. 
Therefore, we scaled it up and introduced residual blocks to arrive at the final architecture\r\ndescribed below.\r\n\r\nAs targets, we used binary vectors of length 15, where most commonly a single element\r\nrepresented an onset and the others were negative examples. Because of this heavy imbalance, we\r\nweighted positive frames 14x more than negative frames, which seemed to be essential not only to\r\nspeed up learning, but also to stop the model from classifying everything as negative. Additionally,\r\nto counter the issue described in [^1] (“Some onsets have a soft attack, though, or are not annotated\r\nwith 10 ms precision, resulting in actual onsets being presented to the network as negative training\r\nexamples. To counter this, we would like to train on less sharply defined ground truth.”), we applied\r\nthe same solution, namely also treating the ±1 frames around a positive target as positive, but with a total\r\nweight of 0.5.\r\n\r\nOur final architecture is as follows: a 7x3 convolutional layer with 16 feature maps, a 3x3\r\nresidual block with 16 kernels, 3x1 max-pooling, a 3x3 convolutional layer with 32 feature maps, a\r\n3x3 residual block with 32 kernels, 3x1 max-pooling, a final 3x3 convolutional layer with 64 feature\r\nmaps, global max-pooling reducing the remaining 6 bands to a single value per feature map (64 features), and\r\na final 1x1 convolution reducing the 64 feature maps to a single dimension. After each pooling\r\noperation we apply dropout with p = 0.4, and we use spatial batch norm in the residual blocks. 
As\r\nan activation we use ReLU, except on the final outputs, where we use the sigmoid.\r\nTo classify onsets, we pass the whole spectrogram through the network, smooth the resulting output\r\nwith a Hamming window of length 5 and then choose local maxima above a certain threshold.\r\nWe believe the model is far from its potential, as we didn’t have enough compute to perform a\r\nhyperparameter search (and it was out of scope as well).\r\n\r\n#### Tempo Estimation\r\n\r\nFor estimating the tempo we apply the autocorrelation method described in the slides. We\r\nautocorrelate the smoothed output signal of the onset detector over lags τ corresponding to tempi between 60 and 200 BPM. We\r\ntake the two highest peaks of this autocorrelation signal and report them as the tempo estimates.\r\n\r\n#### Beat Tracking\r\n\r\nFor beat tracking we put the two solutions from above together. Using the smoothed output of the\r\nonset detector we estimate the most likely tempo (taking only the highest peak in the autocorrelation\r\nsignal), and with both of these we run the dynamic programming algorithm from [^2]. First we convolve\r\nthe onset signal with the given period and then run the DP algorithm on this signal, getting a\r\ncumulative score for being a beat at every time point, together with a backlink from each time point to the\r\nmost prominent preceding beat. Finally, we take the highest point of the cumulative score signal\r\nand backtrace from there to the first beat using the backlinks.\r\n\r\nWindowed beat estimation didn’t improve our score, and interestingly, running the beat tracking\r\nDP algorithm on the onset envelope directly, without convolving it with the given period, didn’t\r\nworsen the performance.\r\n\r\n[^1]: Jan Schlüter and Sebastian Böck. Improved Musical Onset Detection with Convolutional Neural Networks.\r\nIn Proceedings of the IEEE International Conference on Acoustics, Speech and\r\nSignal Processing (ICASSP), pages 6979–6983, 2014.\r\n\r\n[^2]: Daniel P.W. Ellis. Beat Tracking by Dynamic Programming. 
2007\r\nhttps://www.ee.columbia.edu/~dpwe/pubs/Ellis07-beattrack.pdf\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkodedninja%2Fbeatsbeats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkodedninja%2Fbeatsbeats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkodedninja%2Fbeatsbeats/lists"}