https://github.com/hb-chen/spark-elasticsearch-recommender
A Zeppelin v0.8.0 notebook demonstrating how to build a recommender system with Spark v2.3.2 and Elasticsearch v6.3.2
elasticsearch recommender spark zeppelin
- Host: GitHub
- URL: https://github.com/hb-chen/spark-elasticsearch-recommender
- Owner: hb-chen
- License: apache-2.0
- Created: 2018-10-24T12:35:17.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-07-05T02:44:52.000Z (over 5 years ago)
- Last Synced: 2024-11-11T02:38:46.159Z (3 months ago)
- Topics: elasticsearch, recommender, spark, zeppelin
- Homepage:
- Size: 19.5 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# spark-elasticsearch-recommender
A Zeppelin notebook demonstrating how to build a recommender system with Spark + Elasticsearch.

### Components
- Zeppelin `0.8.0`
- Spark `2.3.2`
- Elasticsearch `6.3.2`

#### 1. Environment setup
> Mac OSX
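
The steps in this section call `wget`, `tar`, `unzip` and `pip`, and both Spark and Elasticsearch need a Java runtime. Some of these may not be installed by default on macOS. As a minimal sketch, a small helper (the name `check_tools` is hypothetical, not part of this project) can report which commands are missing before you start:

```bash
# check_tools CMD...
# Hypothetical helper: prints whether each command is on PATH and
# returns non-zero if at least one is missing.
check_tools() {
  status=0
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "found    $cmd"
    else
      echo "missing  $cmd"
      status=1
    fi
  done
  return "$status"
}

# Example: check_tools wget tar unzip pip java
```

The helper only checks that a command exists; it does not verify versions.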
##### Zeppelin
```bash
# http://www.apache.org/dyn/closer.cgi/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-netinst.tgz
$ wget http://mirrors.shu.edu.cn/apache/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-netinst.tgz
$ tar -zxf zeppelin-0.8.0-bin-netinst.tgz
$ cd zeppelin-0.8.0-bin-netinst

# Install the required interpreters
$ ./bin/install-interpreter.sh --name md,elasticsearch
$ ./bin/zeppelin-daemon.sh start
```

##### Spark
```bash
# http://spark.apache.org/downloads.html
# Download 2.3.2 (from the Apache archive) to match the version used in the rest of this guide
$ wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
$ tar -zxf spark-2.3.2-bin-hadoop2.7.tgz
```

##### Elasticsearch
```bash
# https://www.elastic.co/downloads/past-releases
# Elasticsearch + 6.3.2
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.zip
$ unzip elasticsearch-6.3.2.zip

# ES-Hadoop + 6.3.2
$ wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.3.2.zip
$ unzip elasticsearch-hadoop-6.3.2.zip
```
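
Later steps depend on specific files from these two archives: `bin/elasticsearch` and `bin/elasticsearch-plugin` from the Elasticsearch distribution, and the `elasticsearch-spark-20_2.11-6.3.2.jar` under `dist/` in ES-Hadoop. The sketch below checks that they were extracted where expected. The helper name `check_es_artifacts` is hypothetical, and it assumes both archives were unzipped into the same base directory:

```bash
# check_es_artifacts [BASE_DIR]
# Hypothetical helper: verifies that the files used later in this guide
# exist under BASE_DIR (defaults to the current directory).
check_es_artifacts() {
  base=${1:-.}
  status=0
  for path in \
    "elasticsearch-6.3.2/bin/elasticsearch" \
    "elasticsearch-6.3.2/bin/elasticsearch-plugin" \
    "elasticsearch-hadoop-6.3.2/dist/elasticsearch-spark-20_2.11-6.3.2.jar"
  do
    if [ -e "$base/$path" ]; then
      echo "ok       $path"
    else
      echo "missing  $path"
      status=1
    fi
  done
  return "$status"
}

# Example: check_es_artifacts /path/to/downloads
```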
###### Elasticsearch vector scoring plugin
- [muhleder/elasticsearch-vector-scoring](https://github.com/muhleder/elasticsearch-vector-scoring)
```groovy
// Modify build.gradle so that checking out Elasticsearch is not required
// https://github.com/muhleder/elasticsearch-vector-scoring/issues/1#issuecomment-415267767
buildscript {
    repositories {
        jcenter()
        mavenLocal()
    }
    dependencies {
        classpath "org.elasticsearch.gradle:build-tools:6.3.2"
    }
}

apply plugin: 'idea'
apply plugin: 'java'
apply plugin: 'elasticsearch.esplugin'

licenseFile = rootProject.file('LICENSE')
noticeFile = rootProject.file('NOTICE')

esplugin {
    name 'elasticsearch-vector-scoring'
    description 'Provides a fast vector multiplication script.'
    classname 'com.gosololaw.elasticsearch.VectorScoringPlugin'
}

dependencies {
    compile "org.elasticsearch:elasticsearch:6.3.2"
}
```
```bash
# Install the plugin
$ ./bin/elasticsearch-plugin install {file:///path/to/plugin.zip}
```

##### Python dependencies
```bash
$ pip install elasticsearch
$ pip install numpy
$ pip install tmdbsimple # optional; not currently used
```

##### Download the [MovieLens dataset](https://grouplens.org/datasets/movielens/)
```bash
$ cd data # sibling directory of zeppelin-0.8.0-bin-netinst; the note sets PATH_TO_DATA = "../data/ml-latest-small"
$ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
$ unzip ml-latest-small.zip
```

#### 2. Start the services
##### Start Elasticsearch
```bash
$ ./bin/elasticsearch
```

##### Configure and start Zeppelin
```bash
$ cp conf/shiro.ini.template conf/shiro.ini
$ vim conf/shiro.ini
# Administrator account and password
[users]
admin = 123456, admin

$ cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
$ vim conf/zeppelin-env.sh
# Spark settings
export SPARK_HOME=/{apache-spark-path}/spark-2.3.2-bin-hadoop2.7
export SPARK_SUBMIT_OPTIONS="--driver-memory 2G"

$ cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
$ vim conf/zeppelin-site.xml
# Adjust settings such as zeppelin.server.port as needed

# Start Zeppelin
$ ./bin/zeppelin-daemon.sh start
```

#### 3. Notebook
http://localhost:8080
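
Before configuring interpreters, it can help to confirm that both services respond. The following is a minimal sketch, assuming `curl` is installed and the default ports are in use (9200 for Elasticsearch, 8080 for Zeppelin). The helper name `wait_for_http` is hypothetical:

```bash
# wait_for_http URL [RETRIES] [DELAY_SECONDS]
# Hypothetical helper: polls URL until it answers with a successful
# HTTP status, or gives up after RETRIES attempts.
wait_for_http() {
  url=$1
  retries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: per-request timeout
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_http http://localhost:9200 && wait_for_http http://localhost:8080` returns success only once both endpoints answer. If you changed `zeppelin.server.port`, adjust the second URL accordingly.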
```bash
# Create new interpreter
# md

# elasticsearch
elasticsearch.client.type http
elasticsearch.port 9200

# spark
# Add under Dependencies
artifact /{elasticsearch-hadoop-path}/elasticsearch-hadoop-6.3.2/dist/elasticsearch-spark-20_2.11-6.3.2.jar
```

#### References
- [Building a recommender system with Apache Spark and Elasticsearch](https://github.com/IBM/elasticsearch-spark-recommender)