https://github.com/nladuo/taobao_bra_crawler

a taobao web crawler just for fun.


Host: GitHub
URL: https://github.com/nladuo/taobao_bra_crawler
Owner: nladuo
License: mit
Created: 2016-02-21T13:32:13.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2018-11-30T06:28:08.000Z (about 7 years ago)
Last Synced: 2024-11-14T06:33:51.812Z (about 1 year ago)
Language: Python
Homepage: http://nladuo.github.io/bra
Size: 5.92 MB
Stars: 196
Watchers: 9
Forks: 61
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-hacking-lists - nladuo/taobao_bra_crawler - a taobao web crawler just for fun. (Python)

README

# taobao_bra_crawler
a taobao web crawler just for fun.

## Description
Crawls the reviews of bra products on Taobao and runs a simple analysis on them.

## Product review data
### Download
Baidu Cloud link: [https://pan.baidu.com/s/19S5ziVX8kXhk7LtgKn94qw](https://pan.baidu.com/s/19S5ziVX8kXhk7LtgKn94qw) Password: fmgy

Google Drive link: https://drive.google.com/file/d/1fJtXqDtuFVL7d61GkfL3pr2wDvMzKgJx/view?usp=sharing

### Import the data
``` bash
mongoimport -d taobao -c rates --file ./rates.dat
```

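## Deploying the crawler
Websites change often, so a crawler that works today may stop working tomorrow. If the crawler no longer works, import the data from the Baidu Cloud link instead.
### Deployment environment
Test environment: one Tencent Cloud host

Operating system: ubuntu-14.04

Database: mongodb

### Install dependencies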
``` bash
pip install -r requirements.txt
```
### Edit the configuration file
``` python
config = {
    'timeout': 3,
    'db_user': '',
    'db_pass': '',
    'db_host': 'localhost',
    'db_port': 27017,
    'db_name': 'taobao',
    'use_tor_proxy': False,
    'tor_proxy_port': 9050
}
```
Note: at a normal crawl rate, IP bans do not occur.
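
How the crawler consumes these settings is not shown here, so the following is only a minimal sketch (Python 3) of how a config dict of this shape could be turned into a MongoDB connection URI and a proxy mapping for the `requests` library. The helper names `build_mongo_uri` and `build_proxies` are hypothetical, and the assumption that Tor listens on `127.0.0.1` is mine, not the repository's.

```python
from urllib.parse import quote_plus


def build_mongo_uri(config):
    """Build a MongoDB connection URI from a config dict (hypothetical helper)."""
    host = f"{config['db_host']}:{config['db_port']}"
    if config['db_user']:
        # Percent-encode credentials so characters such as '@' or ':' stay valid.
        auth = f"{quote_plus(config['db_user'])}:{quote_plus(config['db_pass'])}@"
    else:
        auth = ''
    return f"mongodb://{auth}{host}/{config['db_name']}"


def build_proxies(config):
    """Return a requests-style proxies dict, or None when Tor is disabled."""
    if not config['use_tor_proxy']:
        return None
    # Assumption: Tor's SOCKS port is on the local machine. The socks5h scheme
    # resolves DNS through the proxy and needs `pip install "requests[socks]"`.
    proxy = f"socks5h://127.0.0.1:{config['tor_proxy_port']}"
    return {'http': proxy, 'https': proxy}


config = {
    'timeout': 3,
    'db_user': '',
    'db_pass': '',
    'db_host': 'localhost',
    'db_port': 27017,
    'db_name': 'taobao',
    'use_tor_proxy': False,
    'tor_proxy_port': 9050,
}

print(build_mongo_uri(config))
print(build_proxies(config))
```

The resulting URI can be passed to `pymongo.MongoClient`, and the proxies dict to the `proxies` argument of `requests.get`.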
### Run the crawler
``` bash
python crawler/item_crawler.py # crawl bra product information
python crawler/rate_crawler.py # crawl bra review information
```

## Data processing
### Simple statistics and visualization
#### 1. Run the script
``` sh
cd simple_analyzer
python simple_analyzer.py # simple statistics
cp bra.json data_visualization/static/ # copy the statistics result
```
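
The kind of counting a "simple statistics" step performs can be sketched as follows. This is an illustration only: the review fields `size` and `color`, the sample records, and the output shape are assumptions made for the example, not the actual schema of the `rates` collection or of `bra.json`.

```python
import json
from collections import Counter


def count_field(reviews, field):
    """Count how often each value of `field` occurs, skipping missing values."""
    counter = Counter(r[field] for r in reviews if r.get(field))
    # most_common() sorts by descending count.
    return [{'name': name, 'value': count} for name, count in counter.most_common()]


# Hypothetical review records; real documents would come from MongoDB.
sample_reviews = [
    {'size': '75B', 'color': 'black'},
    {'size': '75B', 'color': 'nude'},
    {'size': '80C', 'color': 'black'},
    {'color': 'black'},  # no size recorded
]

stats = {
    'size': count_field(sample_reviews, 'size'),
    'color': count_field(sample_reviews, 'color'),
}

# ensure_ascii=False keeps any Chinese values readable in the JSON output.
print(json.dumps(stats, ensure_ascii=False, indent=2))
```

A list of `name`/`value` pairs is a common input shape for charting libraries, which is why it is used here.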
#### 2. Run the web page
``` sh
cd data_visualization
npm install # install dependencies
npm run dev # run in development mode for debugging
npm run build # build dist
```
#### Demo
See: [http://nladuo.github.io/bra](http://nladuo.github.io/bra)

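### Keyword analysis
#### Run the scripts
``` sh
cd keyword_analyzer
python create_corpus.py # 1. load the review data
python extract_tags.py # 2. extract keywords (about 20 minutes; you can skip this and use my model directly for step 3)
python create_wordcloud.py # 3. generate the word cloud image
```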
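
To give a feel for what keyword extraction over reviews involves, here is a generic TF-IDF sketch on documents that are already tokenized. It is a stand-in for illustration and is not taken from `extract_tags.py`, which may use a different method and a Chinese word segmenter; the function name `top_keywords` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Rank terms by TF-IDF summed over pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # skip empty reviews to avoid dividing by zero
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])  # 0 for terms in every document
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(k)]


# Hypothetical tokenized reviews.
docs = [
    ['comfortable', 'soft', 'good'],
    ['comfortable', 'fits', 'good'],
    ['comfortable', 'soft', 'cheap'],
]

print(top_keywords(docs, k=2))
```

Terms that appear in every review, such as `comfortable` here, get a score of zero, so the ranking favors words that distinguish reviews from one another.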
#### Result
![word_cloud](./keyword_analyzer/assets/word_cloud1.png)

#### References
- [Python pytagcloud: Chinese word segmentation and tag cloud generation series (part 1)](http://www.cnblogs.com/Yiutto/p/5998262.html)
- [Building a word cloud from 100 GB of Amazon user reviews with pandas + Python](http://www.jianshu.com/p/c862130f322d)

## LICENSE
MIT
