https://github.com/lovit/kmrd

Synthetic dataset for recommender system created from Naver Movie rating system
https://github.com/lovit/kmrd

Last synced: 5 months ago
JSON representation

Synthetic dataset for recommender system created from Naver Movie rating system

Host: GitHub
URL: https://github.com/lovit/kmrd
Owner: lovit
Created: 2019-12-16T18:24:13.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2023-12-08T16:25:51.000Z (over 2 years ago)
Last Synced: 2024-11-13T10:49:11.835Z (over 1 year ago)
Language: Python
Homepage:
Size: 180 MB
Stars: 24
Watchers: 2
Forks: 9
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Korean Movie Recommender system Dataset

MovieLens style synthetic dataset built from Naver Movie rating systems with [Naver Movie Scraper][scraper]

[scraper]: https://github.com/lovit/naver_movie_scraper

## Install

Clone this repository, and execute python script

```

git clone https://github.com/lovit/kmrd

python setup.py install

```

## Load data

`load_rates` function returns sparse matrix formed user-item-rate matrix and numpy.ndarray formed timestamp. All identifier of users are masked. `timestamps` format is UNIX time (second). Choose the size from ['small', '2m', '5m']

```python

from kmr_dataset import load_rates

from kmr_dataset import get_paths

paths = get_paths(size='small')

# paths = get_paths(size='2m')

rates, timestamps = load_rates(size='small')

# rates, timestamps = load_rates(size='5m')

```

`load_histories` function returns dict of list formed user histories

```python

from kmr_dataset import load_histories

histories = load_histories(size='small')

```

To see the histories of user 0,

```python

historeis[0]

```

The result follows the format of (item, rate, UNIX time).

```

[(10003, 7, 1494128040),

 (10004, 7, 1467529800),

 (10018, 9, 1513344120),

 (10021, 9, 1424497980),

 (10022, 7, 1427627340),

 (10023, 7, 1428738480),

 (10024, 4, 1429359420),

 ...

```

## Statistics

### KMRD-small

Some users in KMRD-small have rated only one item. The first comment, `73.3% rates, 16292 users (31.3%, # >= 2)`, means that 73.3% (user, item) elements consists of 16292 users who did rating at least 2 items. There are also heavy users who have rated many movies. Therefore, after removing them, we listed the results of performing the same statistics.

```

Description

 - num user : 52028

 - num item : 10999

 - num unique user : 52028 (100.0 %)

 - num unique item : 600 (5.455 %)

 - num of nonzero : 134331

 - sparsity : 0.9997652606410895

 - sparsity (compatified) : 0.9956968363189052

```

![](./figures/kmrd-small-dist.png)

### KMRD-2m

All users in KMRD-2m and KMRD-5m have rated at least 20 times. However, some users have done to same items duplicatedly in KMRD dataset. 

```

Description

 - num user : 32151

 - num item : 191238

 - num unique user : 32151 (100.0 %)

 - num unique item : 41706 (21.81 %)

 - num of nonzero : 2569799

 - sparsity : 0.9995820440836619

 - sparsity (compatified) : 0.9980835118800974

```

![](./figures/kmrd-2m-dist.png)

### KMRD-5m

```

Description

 - num user : 86457

 - num item : 191238

 - num unique user : 86457 (100.0 %)

 - num unique item : 48840 (25.54 %)

 - num of nonzero : 4941301

 - sparsity : 0.9997011405760968

 - sparsity (compatified) : 0.9988297854523261

```

![](./figures/kmrd-5m-dist.png)

### MovieLens-20m

In contrast to, MovieLens dataset does not include the duplicated (user, item) elements.

```

Description

 - num user : 138494

 - num item : 131263

 - num unique user : 138493 (100.0 %)

 - num unique item : 26744 (20.37 %)

 - num of nonzero : 20000263

 - sparsity : 0.9988998233532408

 - sparsity (compatified) : 0.9946001521864456

```

![](./figures/movielens-20m-dist.png)

## Files of dataset

Dataset consists of following files.

### Movie Information File, `movies.txt`

Tap separated metadata table, (movie idx, Korean title, English title, first open year, grade)

```

movie	title	title_eng	year	grade

10107	아웃 오브 아프리카	Out Of Africa , 1985	1986	PG

13252	시계태엽 오렌지	A Clockwork Orange , 1971		청소년 관람불가

24452	매트릭스	The Matrix , 1999	2016	12세 관람가

39516	달콤한 인생	A Bittersweet Life , 2005	2005	청소년 관람불가

...

```

```python

import pandas as pd

from kmr_dataset import get_paths

path = get_paths(size='small')[3]

df = pd.read_csv(path)

df.head()

```

|  | movie | title | title_eng | year | grade |

| --- | --- | --- | --- | --- | --- |

| 0 | 10001 | 시네마 천국 | Cinema Paradiso , 1988 | 2013.0 | 전체 관람가 |

| 1 | 10002 | 빽 투 더 퓨쳐 | Back To The Future , 1985 | 2015.0 | 12세 관람가 |

| 2 | 10003 | 빽 투 더 퓨쳐 2 | Back To The Future Part 2 , 1989 | 2015.0 | 12세 관람가 |

| 3 | 10004 | 빽 투 더 퓨쳐 3 | Back To The Future Part III , 1990 | 1990.0 | 전체 관람가 |

| 4 | 10005 | 스타워즈 에피소드 4 - 새로운 희망 | Star Wars , 1977 | 1997.0 | PG |

### People Information File, `peoples.txt`

Tap separated people name table, (people id, Korean name, English name)

```

people	korean	original

73	릴리 워쇼스키	Lilly Wachowski

214	캐리 앤 모스	Carrie-Anne Moss

554	헬레나 본햄 카터	Helena Bonham Carter

581	류승완	RYOO Seung-wan

688	제프 다니엘스	Jeff Daniels

1824	송강호	Song Kang-ho

1897	이범수	

1898	이병헌	Byung-hun Lee

1969	전도연	

2009	천호진	

...

```

### Casting Information File, `castings.csv`

Comma separated table, (movie id, people id, credit order, leading role)

- `reading` 1 means the people acts as leading role

```

movie,people,order,leading 

10107,1336,1,1

10107,1061,2,1

10107,892,3,0

10107,4879,4,0

10107,11143,5,0

10107,7020,6,0

...

```

### Rating matrix , `ratings.csv`

Comma separated table, (user index, movie id, rate, time)

- `rate` is 1 - 10 integer score

- `time` is UNIX time format

```

user,movie,rate,time

0,10107,10,1452358200

1,10107,5,1406125440

2,10107,8,1255014420

3,10107,7,1169798460

```

### Countries, `countries.csv`

Comma separated table, (movie id, country)

```

movie,country

10001,이탈리아

10001,프랑스

10002,미국

10003,미국

10004,미국

10005,미국

...

```

### Genres, `genres.csv`

Comma separated table, (movie id, genre)

```

movie,genre

10001,드라마

10001,멜로/로맨스

10002,SF

10002,코미디

10003,SF

10003,코미디

...

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lovit/kmrd

Awesome Lists containing this project

README