{"id":15912989,"url":"https://github.com/x-tabdeveloping/language-analytics-assignment3","last_synced_at":"2025-04-03T03:15:58.476Z","repository":{"id":227847541,"uuid":"772509121","full_name":"x-tabdeveloping/language-analytics-assignment3","owner":"x-tabdeveloping","description":"Third Assignment for Language Analytics in Cultural Data Science","archived":false,"fork":false,"pushed_at":"2024-05-10T12:30:05.000Z","size":16,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-08T17:14:07.994Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x-tabdeveloping.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-15T10:31:03.000Z","updated_at":"2024-05-10T12:30:08.000Z","dependencies_parsed_at":"2024-10-28T14:27:48.322Z","dependency_job_id":"56a5abc7-47a8-4299-809b-b7fb047b2408","html_url":"https://github.com/x-tabdeveloping/language-analytics-assignment3","commit_stats":null,"previous_names":["x-tabdeveloping/language-analytics-assignment3"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment3/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x-tabdeveloping","download_url":"https://codeload.github.com/x-tabdeveloping/language-analytics-assignment3/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927843,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T16:22:14.885Z","updated_at":"2025-04-03T03:15:58.422Z","avatar_url":"https://github.com/x-tabdeveloping.png","language":"Python","readme":"# language-analytics-assignment3\nThird Assignment for Language Analytics in Cultural Data Science\n\nThe assignment is about using word embeddings to expand search queries in a lyrics database.\nThe code finds how many, and what proportion, of a given artist's songs contain a given keyword or words semantically closely related to that keyword.\n\nI chose a fault-tolerant approach, which has the following implications:\n - If the given artist can't be exactly matched, you will get a warning, but the code will continue running with the closest fuzzy match. \n - If the given keyword can't be exactly matched in the embedding model's vocabulary, you will get a warning, but the code will continue running with the closest fuzzy match. 
\n\n## Setup\n\nYou will need to download the [Spotify Million Song dataset](https://www.kaggle.com/datasets/joebeachcapital/57651-spotify-songs).\nThe file should be placed in a `dat` directory.\n\n```\n- dat/\n    - \"Spotify Million Song Dataset_exported.csv\"\n```\n\nInstall dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n## Usage\n\nYou can use the `src/query.py` command line interface to check how many songs of a given artist contain a keyword or a closely related word.\n\n```bash\npython3 src/query.py -a \"Steely Dan\" -w \"cousin\"\n```\n```\nThe most similar terms are the following: cousin, nephew, brother, son, uncle, eldest, grandson, daughter, father, grandfather, niece\n-------------------------------------------------------------\n\nArtist Steely Dan has 15 songs (17.05%) that contain words related to cousin.\n-------------------------------------------------------------\n```\n\nBy passing the `--print_songs` flag you can also see the individual songs containing these terms.\n\n```\nThese are: \n - Cousin Dupree\n - Almost Gothic\n - Babylon Sisters\n - Chain Lightning\n - Deacon Blues\n - Don't Take Me Alive\n - Godwhacker\n - Green Flower Street\n - Kid Charlemagne\n - Pixeleen\n - Pretzel Logic\n - Sign In Stranger\n - Time Out Of Mind\n - Turn That Heartbeat Over Again\n - Two Against Nature\n```\n\n### Parameters\n\n| Parameter | Description | Default |\n| - | - | - |\n| `-a` or `--artist` | Name of the artist to query the songs of. | - |\n| `-w` or `--query_word` | The seed word to base the semantic query on. | - |\n| `-k` or `--k_expansion` | Number of terms most similar to the seed term to include in the query. | `10` |\n| `--print_songs` | Flag to indicate whether the names of the songs should be printed. 
| `False` |\n\n\u003e Additionally, the script will produce CSV files with the CO2 emissions of the subtasks in the code (`emissions/`).\n\u003e This is necessary for Assignment 5, and is not directly relevant to this assignment.\n\n\u003e Note: The `emissions/emissions.csv` file should be ignored, because codecarbon can't track process and task emissions at the same time.\n\n## Potential Limitations\n\nIt would be good to know how the number of expansion terms affects results and to what extent the pipeline is sensitive to this parameter.\nSystematic evaluations might give us more information about the implications of setting higher or lower values.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment3/lists"}