{"id":25975758,"url":"https://github.com/arnab-0053/song-identifier","last_synced_at":"2026-04-28T16:35:41.041Z","repository":{"id":280326055,"uuid":"941620717","full_name":"ArNAB-0053/Song-Identifier","owner":"ArNAB-0053","description":"It identifies songs and artists from lyric snippets using two distinct methods - simple NLP based approach and BM25(Best Match 25) approach.","archived":false,"fork":false,"pushed_at":"2025-03-02T18:26:07.000Z","size":20628,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-02T19:28:39.503Z","etag":null,"topics":["bm25","nlp","nltk","python","rank-bm25","scikit-learn","song-lyrics","spotify-dataset","text-preprocessing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArNAB-0053.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-02T18:03:57.000Z","updated_at":"2025-03-02T18:28:40.000Z","dependencies_parsed_at":"2025-03-02T19:38:57.126Z","dependency_job_id":null,"html_url":"https://github.com/ArNAB-0053/Song-Identifier","commit_stats":null,"previous_names":["arnab-0053/song-identifier"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArNAB-0053%2FSong-Identifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArNAB-0053%2FSong-Identifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArNAB-0053%2FSong-Identifier
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArNAB-0053%2FSong-Identifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArNAB-0053","download_url":"https://codeload.github.com/ArNAB-0053/Song-Identifier/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241956617,"owners_count":20048668,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bm25","nlp","nltk","python","rank-bm25","scikit-learn","song-lyrics","spotify-dataset","text-preprocessing"],"created_at":"2025-03-05T03:23:59.002Z","updated_at":"2026-04-28T16:35:41.002Z","avatar_url":"https://github.com/ArNAB-0053.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Song Identifier\n\nA project that identifies songs and artists from lyric snippets using two distinct methods: a simple NLP-based approach and a BM25 (Best Match 25) approach.\n\n## 1. Introduction\n\nThis project identifies songs and artists from lyric snippets using two distinct methods, tested on the Spotify dataset (57,650 songs) from [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/57651-spotify-songs/). It includes a Jupyter notebook (`spotify_song_pred.ipynb`) that explores both approaches; each approach is also available as a standalone Python script (`normal_nlp.py` and `bm25_approach.py`) for ease of use.\n\nExample queries such as \"sing us a song you're the piano man...\" and \"She's just my kind of girl...\" are matched accurately, with BM25 reporting a confidence of 1.00.\n\n## 2. 
Approaches\n\n### Normal NLP (Hybrid TF-IDF + Cosine + Jaccard)\n- **Description**: A hybrid approach combining TF-IDF vectorization with both Cosine and Jaccard similarity measures to optimize lyric matching.\n- **Performance**: Correctly identifies songs (e.g., \"Piano Man by Billy Joel\" at 0.33 confidence), suitable for exact-match tasks.\n- **File**: `normal_nlp.py`\n- **Technical Details**:\n  1. **TF-IDF Vectorization**:\n     * `vectorizer = TfidfVectorizer(ngram_range=(1,3))`: Converts lyrics to TF-IDF vectors, capturing unigrams, bigrams, and trigrams (e.g., \"piano,\" \"piano man,\" \"sing us song\").\n     * `input_vector = vectorizer.transform([input_lyric])`: Turns the query into a TF-IDF vector.\n  2. **Cosine Similarity**:\n     * `cosine_scores = cosine_similarity(input_vector, tfidf_matrix).flatten()`: Measures similarity between the query's TF-IDF vector and every song vector as the cosine of the angle between them.\n     * Range: 0 to 1 for TF-IDF vectors, which are non-negative (1 = vectors point in the same direction).\n  3. **Jaccard Similarity**:\n     * `jaccard_scores = np.array([jaccard_similarity(input_lyric, song_lyric) for song_lyric in cleaned_lyrics])`: Calculates word overlap between the query and each song's cleaned text as the size of the set intersection divided by the size of the set union.\n     * Range: 0 to 1 (1 = identical word sets).\n  4. 
**Combined Scores**:\n     * `combined_scores = 0.7 * cosine_scores + 0.3 * jaccard_scores`: Blends Cosine (70%) and Jaccard (30%) into a single score.\n     * This combination leverages TF-IDF's weighted term importance through Cosine similarity, while Jaccard ensures exact word matches boost the score.\n\n### BM25 (via rank_bm25)\n- **Description**: Uses the BM25Okapi algorithm from the rank_bm25 library, tuned with k1=1.5 and b=0.75, optimizing for retrieval by balancing term frequency and lyric length.\n- **Performance**: Achieves perfect matches (e.g., \"Piano Man by Billy Joel\" and \"She's My Kind Of Girl by ABBA\" at 1.00 confidence), outperforming Normal NLP.\n- **File**: `bm25_approach.py`\n\n## 3. Setup\n\n### Cloning the Repository\n```bash\ngit clone https://github.com/ArNAB-0053/Song-Identifier.git\ncd Song-Identifier\n```\n\n### Installing Dependencies\nManually install the required Python libraries, as no requirements.txt is provided (version compatibility may vary—use latest versions unless issues arise):\n\n* For both approaches:\n```bash\npip install pandas numpy nltk\n```\n\n* For Normal NLP:\n```bash\npip install scikit-learn\n```\n\n* For BM25:\n```bash\npip install rank_bm25\n```\n\n### Dataset\n* You can use the `spotify_songs_dataset.csv` file included in this repository.\n* If you want the latest version (or if any updates occur), you can download it from [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/57651-spotify-songs/).\n* Place the file in the project folder (alongside the scripts) before running the code.\n\n## 4. 
Running the Code\n\n### Explore Both Approaches\n* Open `spotify_song_pred.ipynb` in Jupyter Notebook to see the combined implementation and experimentation:\n```bash\njupyter notebook spotify_song_pred.ipynb\n```\n\n### Run Individual Scripts\n* Normal NLP:\n```bash\npython normal_nlp.py\n```\n\n* BM25:\n```bash\npython bm25_approach.py\n```\n\n* Edit `query`, `query1`, or `query2` in the scripts for custom snippets.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnab-0053%2Fsong-identifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farnab-0053%2Fsong-identifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnab-0053%2Fsong-identifier/lists"}