{"id":18842815,"url":"https://github.com/aamend/lastfm-mapreduce","last_synced_at":"2025-10-30T15:07:28.170Z","repository":{"id":17892632,"uuid":"20842273","full_name":"aamend/lastfm-mapreduce","owner":"aamend","description":null,"archived":false,"fork":false,"pushed_at":"2014-06-14T21:45:14.000Z","size":148,"stargazers_count":1,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-14T07:47:09.041Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aamend.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-06-14T21:25:16.000Z","updated_at":"2022-09-01T05:45:04.000Z","dependencies_parsed_at":"2022-09-07T15:12:12.104Z","dependency_job_id":null,"html_url":"https://github.com/aamend/lastfm-mapreduce","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aamend/lastfm-mapreduce","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Flastfm-mapreduce","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Flastfm-mapreduce/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Flastfm-mapreduce/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Flastfm-mapreduce/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aamend","download_url":"https://codeload.github.com/aamend/lastfm-mapreduce/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/aamend%2Flastfm-mapreduce/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261386810,"owners_count":23150873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T02:55:48.107Z","updated_at":"2025-10-30T15:07:28.091Z","avatar_url":"https://github.com/aamend.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"LastFM public data set\n======================\n\nMapReduce jobs against the LastFM 1K users dataset (http://www.last.fm/)\nhttp://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html\n\n## Why Hadoop \u0026 MapReduce?\n\nEven for only 1000 users, the data file is already about 600MB. A sequential analysis would take a long time to complete, so let's distribute the work across a Hadoop environment. Furthermore, the tab-separated format can be processed directly by a MapReduce job without having to preformat the data.\n\n## Unique Users\n\n### Exercise: \n*Create a list of unique users, and the number of distinct songs played by each user*\n\n### MapReduce:\n\n* In the Mapper phase, for each record, output the user as key and the song as value\n* In the Combiner phase, remove the duplicates. 
Output the user as key and the distinct (local) songs as value\n* In the Reducer phase, remove the duplicates and output the user as key together with the total number of distinct songs as value\n\n### Execution\n\n`hadoop jar lastfm-1.0-SNAPSHOT.jar com.aamend.hadoop.lastfm.UniqueUsers -D mapred.reduce.tasks=\u003cnumber of reducers\u003e \u003cinput\u003e \u003coutput\u003e`\n\n* (Required) Input Directory\n* (Required) Output Directory\n* (Optional) The number of reducers (default 1). The suggested value is 1 for the 1K user dataset\n\n## Top N songs\n\n### Exercise:\n*Create a list of the top 100 played songs (artist and title) in the dataset, with the number of times each song was played.*\n\n### MapReduce:\n\nTwo MapReduce jobs are required here. The jobs are chained together from the Driver code:\n* Count the distinct songs based on traId\n* Sort data (topN)\n\n#### 1st job\n\n* In the Mapper phase, for each song as key, output the value 1\n* In the Combiner phase, for each song, sum up the '1' values. Output the song tuple as key and the sum as value.\n* In the Reducer phase, for each song, sum up the partial counts. Output the song tuple as key and the sum as value.\n\nBe aware that the key is a Tuple (including Id and Name). One needs to use a custom partitioner based on the tuple's traId to make sure all data belonging to the same traId is sent to the same reducer.\n\n#### 2nd job\n\nUse the 1st job's output as input.\n* In the Mapper phase, for each song (tuple including trackId, trackName and ArtistName) and counter, swap them: output the counter as key and the song as value\n* In the Combiner phase, output the TopN songs (local TopN), counter as key and song as value\n* In the Reducer phase, output the TopN songs (global TopN), song as key and counter as value\n\nI decided to let the Hadoop framework deal with the sort phase. By outputting the counter as key **on a single reducer**, I let the reduce sort phase order my records (i.e. by song popularity). 
I then only need to output the first N records (100 here) from the reducer to get my top 100 songs. Be aware that the default sort order is ascending, so I had to create a custom comparator to sort the data in descending order (top results first).\n\n### Execution\n\n`hadoop jar lastfm-1.0-SNAPSHOT.jar com.aamend.hadoop.lastfm.TopSongs -D mapred.reduce.tasks=\u003cnumber of reducers\u003e -D top.n=\u003cN\u003e \u003cinput\u003e \u003coutput\u003e`\n\n* (Required) Input Directory\n* (Required) Output Directory\n* (Required) The N parameter for topN\n* (Optional) The number of reducers (default 1) for the 1st job. The second job uses only 1\n\n## Top N sessions\n\n### Exercise:\n*Say we define a user’s “session” of Last.fm usage to be comprised of one or more songs played by that user, where each song is started within 20 minutes of the previous song’s start time. Create a list of the top 100 longest sessions, with the following information about each session: userid, timestamp of first and last songs in the session, and the list of songs played in the session (in order of play).*\n\n### MapReduce:\n\nTwo MapReduce jobs are required here. The jobs are chained together from the Driver code:\n* Build sessions\n* Sort data (topN)\n\n#### 1st job\n\n* In the Mapper phase, for each user, output the song and timestamp\n* In the Reducer phase, group each user's songs into sessions according to the above definition. Output each session as a custom Tuple\n\n#### 2nd job\n\nUse the 1st job's output as input.\n* In the Mapper phase, for each session tuple, output the number of records in the session as key and the session as value. As in the TopSongs job, Hadoop deals with the sort phase.\n* In the Combiner phase, output only the TopN sessions (local TopN). 
Session size as key and session as value.\n* In the Reducer phase, output the TopN sessions (global TopN), rank in the topN as key and session as value\n\n### Execution\n\n`hadoop jar lastfm-1.0-SNAPSHOT.jar com.aamend.hadoop.lastfm.TopSessions -D mapred.reduce.tasks=\u003cnumber of reducers\u003e -D top.n=\u003cN\u003e \u003cinput\u003e \u003coutput\u003e`\n\n* (Required) Input Directory\n* (Required) Output Directory\n* (Required) The N parameter for topN\n* (Optional) The number of reducers (default 1) for the 1st job. The second job uses only 1\n\n## High availability\n\n### Exercise\n\n*Load the results from task 3 into a highly available data-store that can be queried.*\n\nProvide:\n* a. The code you used to load data into the store.\n* b. A sample query that you used to retrieve the longest session.\n* c. The raw output from the data-store.\n\nSince the data is on HDFS, the de facto queryable data store would be HBase. However, the sample query required above shows that such a key-value store would not be very efficient: we do not want all the data for a given row ID (userId), but rather to filter on session values (such as the number of songs). And what if one would like to run queries such as:\n\n* find all sessions for UserId1\n* find all topNSessions, worstNSessions\n* find all sessions started between Date1 and Date2\n* find all sessions including Song1 \n\nFor that purpose, we could use ElasticSearch, which indexes all fields of the Session documents.\n\n\n### a. ETL\n\nLoading data from Hadoop into an ElasticSearch cluster is quite easy using the dependency below.\n\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.elasticsearch\u003c/groupId\u003e\n    \u003cartifactId\u003eelasticsearch-hadoop-mr\u003c/artifactId\u003e\n    \u003cversion\u003e${elasticsearch.version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nWe create a new MapReduce job against the TopN session data created earlier. We use `ESOutputFormat` as the output format and `MapWritable` as the value type. 
We supply the ES settings through the Hadoop configuration.\n\n`hadoop jar lastfm-1.0-SNAPSHOT-jar-with-dependencies.jar com.aamend.hadoop.lastfm.TopSessionsETL -D es.nodes=localhost:9200 -D es.resource=radio/sessions input`\n\nDocuments will be indexed in ES under `radio/sessions`\n\n### b. Query\n\nGet the longest session. First, find the maximum size of any session:\n```\ncurl -XGET 'http://localhost:9200/radio/sessions/_search?search_type=count' -d '\n{\n\"aggs\" : \n    {\n    \"max_session\" : \n        { \n        \"max\" : \n            { \n            \"field\" : \"count\" \n            }\n        }\n    }\n}'\n```\nThen retrieve the session with this size:\n```\ncurl -XGET 'http://localhost:9200/radio/sessions/_search?' -d '\n{\n\"fields\" : [\"user\", \"start\", \"stop\", \"count\"], \n\"query\" : \n    {\n    \"term\" : \n        { \n        \"count\" : \"4969\" \n        }\n    }\n}'\n```\nAlternatively, this should work as well: sort the data and retrieve the first row.\n```\ncurl -XGET 'http://localhost:9200/radio/sessions/_search?size=1' -d '\n{\n\"fields\" : [\"user\", \"start\", \"stop\", \"count\"],\n\"query\" : \n    {\n    \"match_all\" : {}\n    },\n\"sort\" : [\n    {\n    \"count\" : \n        {\n        \"order\" : \"desc\", \"mode\" : \"avg\"\n        }\n    }\n    ]\n}' \n```\n\n### c. 
Output\n\n```\n{\n  \"_source\": {\n    \"stop\": 1133914596000,\n    \"user\": \"user_000949\",\n    \"tracks\": [\n      \"117f4438-64e5-4ef9-bebf-908f2d14a7f0\",\n      \"c14cc283-80da-40f8-a838-b880ccbcf50a\",\n      \"951a2bef-a129-4762-9e16-22c84c5d438e\",\n      \"9b63aa59-89bd-404d-b7bf-80a1e49d144a\",\n      ...\n      \"8038cce3-f643-4d74-af17-609dffd30d8e\",\n      \"8f9dec01-c479-4ad4-96b5-0950a97c1a91\",\n      \"2d7d6714-8b41-4894-af3a-1cf0a092a7ee\"\n    ],\n    \"id\": 181,\n    \"start\": 1133739654000,\n    \"count\": 1385\n  },\n  \"found\": true,\n  \"_version\": 1,\n  \"_id\": \"m2V39AKbQhi--lL_B5DNHw\",\n  \"_type\": \"sessions\",\n  \"_index\": \"radio\"\n}\n```\n\n\n## Build\n\n`mvn clean package`\n\nThis will create a fat JAR including all dependencies required for the project execution (ESOutputFormat)\n\n## Authors\n\nAntoine Amend \u003cantoine.amend@gmail.com\u003e\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faamend%2Flastfm-mapreduce","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faamend%2Flastfm-mapreduce","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faamend%2Flastfm-mapreduce/lists"}