PySpark Loglikelihood Similarity Examples
- Host: GitHub
- URL: https://github.com/talmago/pyspark-loglikelihood
- Owner: talmago
- License: MIT
- Created: 2017-05-17T15:03:42.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-07-11T13:50:42.000Z (about 8 years ago)
- Last Synced: 2025-02-03T02:30:28.290Z (8 months ago)
- Topics: mahout, pyspark, recommendation-engine, spark
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 1
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pyspark-loglikelihood
PySpark Loglikelihood Examples, inspired by [Mahout](http://mahout.apache.org/):
* [LogLikelihood](https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java)
* [User / Item Similarity](https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java)
* [Nearest N-Neighborhood](https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java)
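For reference, the statistic behind all three is compact enough to sketch in plain Python. The following mirrors the entropy formulation of Mahout's `LogLikelihood.java`; the function names here are illustrative, not part of this package's API:

```python
from math import log

def x_log_x(x):
    # x * ln(x), with the convention 0 * ln(0) == 0
    return x * log(x) if x else 0.0

def entropy(*counts):
    # unnormalized entropy of a list of counts, as in Mahout's LogLikelihood
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    # LLR of a 2x2 contingency table of event counts:
    # k11 = both events, k12 / k21 = only one of them, k22 = neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def llr_similarity(k11, k12, k21, k22):
    # squash the unbounded LLR into [0, 1), as Mahout's
    # LogLikelihoodSimilarity does
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))
```

Since the raw LLR is unbounded, Mahout's `LogLikelihoodSimilarity` squashes it with `1 - 1 / (1 + llr)`; that squashing is why the scores in the example output further down sit just below `1.0`.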
### Installation

```sh
$ pip install https://github.com/talmago/pyspark-loglikelihood/archive/master.zip
```

> **NOTICE:** PySpark Loglikelihood requires [Python 2.7](https://www.python.org/download/releases/2.7/). [pyenv](https://github.com/yyuu/pyenv) and [virtualenv](https://virtualenv.pypa.io/en/stable/) are recommended for setting up an isolated Python environment.
### Usage
After installation, use the `spark-submit` command line to run the item-similarity and user-similarity jobs on your dataset.
#### Item-Item Similarity (LogLikelihood)
```sh
$ spark-submit item_similarity.py \
input.csv \
output \
--maxPrefs=10000 \
--maxSimilaritiesPerItem 100
```
> **NOTICE:** Input file lines are expected to be comma-separated pairs of `USER_ID`,`ITEM_ID`. Each output line consists of `ITEM_ID1`,`ITEM_ID2`,`SCORE`.
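To make the data flow concrete, here is a rough RDD sketch of how such an item-item job can be assembled from `(USER_ID, ITEM_ID)` pairs. It is an illustration under the input format above, not the repository's actual implementation; the file path and variable names are assumptions:

```python
from math import log
from pyspark import SparkContext

def xlogx(x):
    # x * ln(x) with 0 * ln(0) == 0
    return x * log(x) if x else 0.0

def llr(k11, k12, k21, k22):
    # log-likelihood ratio of a 2x2 contingency table (see the sketch above)
    def ent(*ks):
        return xlogx(sum(ks)) - sum(xlogx(k) for k in ks)
    return max(0.0, 2.0 * (ent(k11 + k12, k21 + k22)
                           + ent(k11 + k21, k12 + k22)
                           - ent(k11, k12, k21, k22)))

sc = SparkContext(appName="item-similarity-sketch")

# (USER_ID, ITEM_ID) preference pairs, deduplicated
prefs = (sc.textFile("input.csv")
           .map(lambda line: tuple(line.strip().split(",")))
           .distinct())

n_users = prefs.keys().distinct().count()        # total distinct users
item_counts = (prefs.map(lambda p: (p[1], 1))
                    .reduceByKey(lambda a, b: a + b)
                    .collectAsMap())             # users per item

# self-join on USER_ID yields item co-occurrences within each user's history
cooc = (prefs.join(prefs)
             .values()
             .filter(lambda ab: ab[0] < ab[1])   # each unordered pair once
             .map(lambda ab: (ab, 1))
             .reduceByKey(lambda a, b: a + b))

def similarity(pair_count):
    (a, b), k11 = pair_count                 # k11: users with both a and b
    k12 = item_counts[a] - k11               # users with a but not b
    k21 = item_counts[b] - k11               # users with b but not a
    k22 = n_users - k11 - k12 - k21          # users with neither
    return a, b, 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))

similarities = cooc.map(similarity)          # (ITEM_ID1, ITEM_ID2, SCORE) triples
```

Collecting the per-item counts to the driver keeps the sketch short; a real job at scale would broadcast them or compute the counts via joins.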
#### User-User Similarity (N-neighborhood + Loglikelihood)

```sh
$ spark-submit user_similarity.py \
input.csv \
output \
--numOfNeighbors=40 \
--numOfRecommendations 1000
```
> **NOTICE:** Input file lines are expected to be comma-separated pairs of `USER_ID`,`ITEM_ID`. Each output line consists of `USER_ID`,`ITEM_ID`,`SCORE`.
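For the neighborhood step, here is a minimal sketch of the idea behind Mahout's `NearestNUserNeighborhood`: keep only each user's N highest-scoring neighbors. The toy triples below stand in for a real job's user-user similarities; the names are assumptions for this sketch:

```python
from pyspark import SparkContext

sc = SparkContext(appName="nearest-n-sketch")
N = 40

# toy (USER_ID1, USER_ID2, SCORE) triples; a real run would compute these
user_sims = sc.parallelize([
    ("196", "186", 0.97),
    ("196", "22", 0.91),
    ("186", "22", 0.88),
])

neighborhoods = (user_sims
    # emit each similarity in both directions so every user gets a neighbor list
    .flatMap(lambda t: [(t[0], (t[1], t[2])), (t[1], (t[0], t[2]))])
    .groupByKey()
    # keep only the N highest-scoring neighbors per user
    .mapValues(lambda nbrs: sorted(nbrs, key=lambda s: -s[1])[:N]))

print(neighborhoods.collectAsMap())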
### Example

##### Run the [example](https://github.com/talmago/pyspark-loglikelihood/blob/master/examples/item-similarity-ml-100k-dataset) from the command line
```sh
wget -O - https://raw.githubusercontent.com/talmago/pyspark-loglikelihood/master/examples/item-similarity-ml-100k-dataset | bash -x
```

##### Step by Step
Step 1: Download and reformat the [MovieLens 100k](https://grouplens.org/datasets/movielens/100k/) dataset.
```sh
$ wget -O - http://files.grouplens.org/datasets/movielens/ml-100k/u.data | cut -f1,2 | tr '\t' ',' > input.csv
```

Step 2: Upload the dataset to HDFS
```sh
$ hadoop fs -rm -r /item-sim
$ hadoop fs -mkdir -p /item-sim
$ hadoop fs -moveFromLocal input.csv /item-sim/input.csv
```

Step 3: Run the item-similarity job on the HDFS dataset
```sh
$ spark-submit item_similarity.py \
/item-sim/input.csv \
/item-sim/output \
--maxPrefs=10000 \
--maxSimilaritiesPerItem 100
```

Step 4: Merge the output part files into a single CSV file
```sh
$ hadoop fs -getmerge /item-sim/output result.csv
```

Step 5: Analyze / process / visualize the result set
```sh
$ head result.csv
26,381,0.9889748
26,732,0.9876871
26,70,0.98738647
26,715,0.98685825
26,238,0.98625606
26,58,0.98580784
26,1,0.985786
26,83,0.9857064
26,88,0.9856318
26,367,0.9854448
```
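As a starting point for step 5, here is a minimal plain-Python sketch that loads the merged file and lists the closest neighbors of one item. Column meanings follow the notice above, and `result.csv` is the file produced in step 4:

```python
import csv
from collections import defaultdict

# result.csv rows are ITEM_ID1,ITEM_ID2,SCORE (see the notice above)
similar = defaultdict(list)
with open("result.csv") as f:
    for item1, item2, score in csv.reader(f):
        similar[item1].append((item2, float(score)))

# the five most similar items to item 26
print(sorted(similar["26"], key=lambda s: -s[1])[:5])
```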