{"id":39081748,"url":"https://github.com/lovit/kmrd","last_synced_at":"2026-01-17T18:31:01.115Z","repository":{"id":46557094,"uuid":"228449051","full_name":"lovit/kmrd","owner":"lovit","description":"Synthetic dataset for recommender system created from Naver Movie rating system","archived":false,"fork":false,"pushed_at":"2023-12-08T16:25:51.000Z","size":188845,"stargazers_count":24,"open_issues_count":1,"forks_count":9,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-13T10:49:11.835Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lovit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-16T18:24:13.000Z","updated_at":"2024-03-14T09:58:25.000Z","dependencies_parsed_at":"2022-09-24T16:13:18.358Z","dependency_job_id":null,"html_url":"https://github.com/lovit/kmrd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lovit/kmrd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fkmrd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fkmrd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fkmrd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fkmrd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lovit","download_url":"https://codeload.github.com/lovit/kmrd/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fkmrd/sbom","scorecard":null,
"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28515740,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T18:28:00.501Z","status":"ssl_error","status_checked_at":"2026-01-17T18:28:00.150Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-17T18:31:00.359Z","updated_at":"2026-01-17T18:31:01.050Z","avatar_url":"https://github.com/lovit.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Korean Movie Recommender system Dataset\n\nMovieLens style synthetic dataset built from Naver Movie rating systems with [Naver Movie Scraper][scraper]\n\n[scraper]: https://github.com/lovit/naver_movie_scraper\n\n## Install\n\nClone this repository, and execute python script\n\n```\ngit clone https://github.com/lovit/kmrd\npython setup.py install\n```\n\n## Load data\n\n`load_rates` function returns sparse matrix formed user-item-rate matrix and numpy.ndarray formed timestamp. All identifier of users are masked. `timestamps` format is UNIX time (second). 
Choose the size from ['small', '2m', '5m'].\n\n```python\nfrom kmr_dataset import load_rates\nfrom kmr_dataset import get_paths\n\npaths = get_paths(size='small')\n# paths = get_paths(size='2m')\nrates, timestamps = load_rates(size='small')\n# rates, timestamps = load_rates(size='5m')\n```\n\nThe `load_histories` function returns a dict of lists holding each user's rating history.\n\n```python\nfrom kmr_dataset import load_histories\n\nhistories = load_histories(size='small')\n```\n\nTo see the history of user 0:\n\n```python\nhistories[0]\n```\n\nEach entry has the form (item, rate, UNIX time).\n\n```\n[(10003, 7, 1494128040),\n (10004, 7, 1467529800),\n (10018, 9, 1513344120),\n (10021, 9, 1424497980),\n (10022, 7, 1427627340),\n (10023, 7, 1428738480),\n (10024, 4, 1429359420),\n ...\n```\n\n## Statistics\n\n### KMRD-small\n\nSome users in KMRD-small have rated only one item. The first comment, `73.3% rates, 16292 users (31.3%, # \u003e= 2)`, means that 73.3% of the (user, item) elements come from the 16292 users who rated at least 2 items. There are also heavy users who have rated many movies, so the same statistics are also listed after removing them.\n\n```\nDescription\n - num user : 52028\n - num item : 10999\n - num unique user : 52028 (100.0 %)\n - num unique item : 600 (5.455 %)\n - num of nonzero : 134331\n - sparsity : 0.9997652606410895\n - sparsity (compatified) : 0.9956968363189052\n```\n\n![](./figures/kmrd-small-dist.png)\n\n### KMRD-2m\n\nAll users in KMRD-2m and KMRD-5m have rated at least 20 times. However, some users in the KMRD dataset have rated the same item more than once. 
\n\n```\nDescription\n - num user : 32151\n - num item : 191238\n - num unique user : 32151 (100.0 %)\n - num unique item : 41706 (21.81 %)\n - num of nonzero : 2569799\n - sparsity : 0.9995820440836619\n - sparsity (compatified) : 0.9980835118800974\n```\n\n![](./figures/kmrd-2m-dist.png)\n\n### KMRD-5m\n\n```\nDescription\n - num user : 86457\n - num item : 191238\n - num unique user : 86457 (100.0 %)\n - num unique item : 48840 (25.54 %)\n - num of nonzero : 4941301\n - sparsity : 0.9997011405760968\n - sparsity (compatified) : 0.9988297854523261\n```\n\n![](./figures/kmrd-5m-dist.png)\n\n### MovieLens-20m\n\nIn contrast, the MovieLens dataset does not include duplicated (user, item) elements.\n\n```\nDescription\n - num user : 138494\n - num item : 131263\n - num unique user : 138493 (100.0 %)\n - num unique item : 26744 (20.37 %)\n - num of nonzero : 20000263\n - sparsity : 0.9988998233532408\n - sparsity (compatified) : 0.9946001521864456\n```\n\n![](./figures/movielens-20m-dist.png)\n\n## Files of dataset\n\nThe dataset consists of the following files.\n\n### Movie Information File, `movies.txt`\n\nTab-separated metadata table with columns (movie idx, Korean title, English title, first open year, grade)\n\n```\nmovie\ttitle\ttitle_eng\tyear\tgrade\n10107\t아웃 오브 아프리카\tOut Of Africa , 1985\t1986\tPG\n13252\t시계태엽 오렌지\tA Clockwork Orange , 1971\t\t청소년 관람불가\n24452\t매트릭스\tThe Matrix , 1999\t2016\t12세 관람가\n39516\t달콤한 인생\tA Bittersweet Life , 2005\t2005\t청소년 관람불가\n...\n```\n\n```python\nimport pandas as pd\nfrom kmr_dataset import get_paths\n\npath = get_paths(size='small')[3]\n# movies.txt is tab separated, so pass the separator explicitly\ndf = pd.read_csv(path, sep='\\t')\ndf.head()\n```\n\n|  | movie | title | title_eng | year | grade |\n| --- | --- | --- | --- | --- | --- |\n| 0 | 10001 | 시네마 천국 | Cinema Paradiso , 1988 | 2013.0 | 전체 관람가 |\n| 1 | 10002 | 빽 투 더 퓨쳐 | Back To The Future , 1985 | 2015.0 | 12세 관람가 |\n| 2 | 10003 | 빽 투 더 퓨쳐 2 | Back To The Future Part 2 , 1989 | 2015.0 | 12세 관람가 |\n| 3 | 10004 | 빽 투 더 퓨쳐 3 | Back To The Future Part III , 
1990 | 1990.0 | 전체 관람가 |\n| 4 | 10005 | 스타워즈 에피소드 4 - 새로운 희망 | Star Wars , 1977 | 1997.0 | PG |\n\n### People Information File, `peoples.txt`\n\nTab-separated people name table with columns (people id, Korean name, English name)\n\n```\npeople\tkorean\toriginal\n73\t릴리 워쇼스키\tLilly Wachowski\n214\t캐리 앤 모스\tCarrie-Anne Moss\n554\t헬레나 본햄 카터\tHelena Bonham Carter\n581\t류승완\tRYOO Seung-wan\n688\t제프 다니엘스\tJeff Daniels\n1824\t송강호\tSong Kang-ho\n1897\t이범수\t\n1898\t이병헌\tByung-hun Lee\n1969\t전도연\t\n2009\t천호진\t\n...\n```\n\n### Casting Information File, `castings.csv`\n\nComma-separated table with columns (movie id, people id, credit order, leading role)\n\n- `leading` is 1 if the person plays a leading role\n\n```\nmovie,people,order,leading \n10107,1336,1,1\n10107,1061,2,1\n10107,892,3,0\n10107,4879,4,0\n10107,11143,5,0\n10107,7020,6,0\n...\n```\n\n### Rating matrix, `ratings.csv`\n\nComma-separated table with columns (user index, movie id, rate, time)\n\n- `rate` is an integer score from 1 to 10\n- `time` is in UNIX time format\n\n```\nuser,movie,rate,time\n0,10107,10,1452358200\n1,10107,5,1406125440\n2,10107,8,1255014420\n3,10107,7,1169798460\n```\n\n### Countries, `countries.csv`\n\nComma-separated table with columns (movie id, country)\n\n```\nmovie,country\n10001,이탈리아\n10001,프랑스\n10002,미국\n10003,미국\n10004,미국\n10005,미국\n...\n```\n\n### Genres, `genres.csv`\n\nComma-separated table with columns (movie id, genre)\n\n```\nmovie,genre\n10001,드라마\n10001,멜로/로맨스\n10002,SF\n10002,코미디\n10003,SF\n10003,코미디\n...\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flovit%2Fkmrd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flovit%2Fkmrd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flovit%2Fkmrd/lists"}