https://github.com/Major-333/BigDataWareHouse

Last synced: 2 months ago
JSON representation

Host: GitHub
URL: https://github.com/Major-333/BigDataWareHouse
Owner: Major-333
Created: 2020-12-02T16:31:18.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2021-02-03T12:36:44.000Z (over 4 years ago)
Last Synced: 2025-04-14T11:55:37.238Z (2 months ago)
Language: Java
Size: 8.96 MB
Stars: 15
Watchers: 1
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-tjsse-project - Major-333/BigDataWareHouse

README

# BigDataWarehouse

> 2020年数据仓库大项目
### 数据集信息
**原始数据来源**：http://snap.stanford.edu/data/web-Movies.html

该数据集包含来自亚马逊的电影评论。数据的使用期超过10年，包括截至2012年10月的所有约800万条评论。评论包括产品和用户信息，评分以及纯文本评论。

**数据扩充**：爬取原始数据集中涉及到的所有产品ID对应的亚马逊网页

### 数据仓储

使用Neo4j, Mysql 和 Hive 三种仓储方式进行仓储，并进行查询性能比较。

## Folder Structure

```
.
├── ETL # ETL脚本项目
│ ├── processedData # 处理后的数据
│ │ └── neo4jcsv # neo4j入库需要的csv文件
│ ├── rawData # 存放原始数据集(movies.txt)
│ └── utils # ETL工具模块
│ ├── deduplicate # 原始数据去重的模块
│ └── neo4jPreprocess # 处理数据为neo4j所需的csv格式的模块
│ ├── csvData
│ └── model
├── mysql
│ ├── config
│ └── data
│ ├── #innodb_temp
│ ├── mysql
│ ├── performance_schema
│ └── sys
├── neo4j
│ ├── conf
│ ├── data
│ │ ├── databases
│ │ │ ├── neo4j
│ │ │ └── system
│ │ │ └── schema
│ │ └── transactions
│ │ ├── neo4j
│ │ └── system
│ ├── import
│ ├── logs
│ └── plugins
└── ...
```

## Requirements

### ETL Project

```shell
$ cd ETL/
$ pip install -r requirements.txt
```

### Database

Install docker on macOS

```shell
$ brew cask install docker
$ brew install docker-compose
```

Install docker on others
https://docs.docker.com/compose/install/

## How to Start ETL

### 🚚 Extract and Transform

```shell
$ cd ETL/
$ python run.py --review-extract --movie-extract --review-transform --movie-transform
```

- ⚠️ Extract should be done before Transform

### 🚀 Load data into database

```bash
$ cd ..
$ docker-compose up
```

- **Neo4j-db**

```bash
$ docker exec -it neo4j-db /bin/bash
$ cat /load/create_indexes.cql | cypher-shell -u neo4j -p {your password}
$ cat /load/load_data.cql | cypher-shell -u neo4j -p {your password}
$ exit
```

- ⏰ **you can check your neo4j-db via localhost:7474 in browser**

- **Mysql-db**

```bash
$ docker exec -it mysql-db /bin/bash
$ cat /script/createdatabase.sql | mysql -u root -p
$ exit
```

- ⚠️ **next, your need to use dw_backend to create table**

```bash
$ docker exec -it mysql-db /bin/bash
$ cat /script/dataloader.sql | mysql -u root -p
$ cat /script/procedure.sql.sql | mysql -u root -p
$ exit
```

## How to Start Database

### Database

```shell
$ docker-compose up
```

### test hive

```shell
# 进入bash
$ docker-compose exec hive-server /bin/bash

# 使用beeline客户端连接
$ /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000

# 执行SQL。这两句是可以直接执行的，镜像带了example文件
> CREATE TABLE pokes (foo INT, bar STRING);
> LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

# 查询
> select * from pokes;
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/Major-333/BigDataWareHouse

Awesome Lists containing this project

README