https://github.com/handexing/jdbee

Crawls JD.com data using selenium + phantomjs + WebCollector together, and persists the data.

httpclient jsoup phantomjs selenium selenium-java webcollector



Host: GitHub
URL: https://github.com/handexing/jdbee
Owner: handexing
Created: 2017-05-24T05:42:06.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2017-06-10T08:10:54.000Z (over 8 years ago)
Last Synced: 2025-04-13T05:12:18.834Z (11 months ago)
Topics: httpclient, jsoup, phantomjs, selenium, selenium-java, webcollector
Language: Java
Homepage:
Size: 20.5 MB
Stars: 49
Watchers: 6
Forks: 25
Open Issues: 0
Metadata Files:
- Readme: README.md


# JdBee
## Scraping JD.com data with jsoup

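> **For learning and exchange only. If you use it for any other purpose on your own, you bear the consequences!**

> Currently only snack-related data is crawled, since that is all that is needed right now; other categories will be considered later.

> The snack data is crawled to support further development of the [vipsnacks](https://github.com/handexing/vipsnacks) project.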

## Dependencies

- httpclient
- jsoup
- slf4j
- selenium
- phantomjs
- WebCollector

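## Changelog

- Initialized the project; finished crawling the first- and second-level categories (*2017-05-24*)
- Switched to selenium to fetch page data; crawled the third-, fourth-, and fifth-level categories (*2017-05-25*)
- Multi-threaded concurrent crawling of paginated category data (*2017-05-26*)
- Multi-threaded crawling of product skuid values (*2017-05-28*)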
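
As a rough illustration of the multi-threaded page crawling listed above, here is a minimal sketch using a fixed thread pool from the JDK. It is not taken from the JdBee source: `CategoryPageCrawler`, `crawlPages`, and the `fetchPage` callback are hypothetical names, and the fetcher in `main` is a stand-in that returns a string instead of downloading a real category page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of the JdBee code base.
final class CategoryPageCrawler {

    /**
     * Fetches pages 1..pageCount on a fixed-size thread pool and returns
     * the results in page order. fetchPage is a placeholder for whatever
     * actually downloads and parses a single page.
     */
    static <T> List<T> crawlPages(int pageCount, int threads, Function<Integer, T> fetchPage)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; the pool runs up to `threads` at a time.
            List<Future<T>> futures = new ArrayList<>();
            for (int page = 1; page <= pageCount; page++) {
                final int p = page;
                Callable<T> task = () -> fetchPage.apply(p);
                futures.add(pool.submit(task));
            }
            // Collect in submission order, so results stay sorted by page number.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that page has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher: a real one would request the category URL for `page`.
        List<String> pages = crawlPages(5, 3, page -> "page-" + page);
        System.out.println(pages);
    }
}
```

Collecting the futures in submission order keeps the output sorted by page even though the pages finish in an arbitrary order. The pool size is a tuning knob; a value that is too high may simply get the crawler throttled or blocked.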

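**Crawling with selenium is too slow, and it has to open a web page every time. It is usable for small amounts of data, but it cannot cope with large volumes, so I have been looking for another way to crawl.**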

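- Crawl products using WebCollector + selenium + phantomjs (*2017-06-01, only one category crawled as a test*)
- Database insertion test (*2017-06-02*)
- Test crawl of one small category: 200,000 records in 21 minutes (*2017-06-03*)
- Data is written to the database correctly; **285330** records crawled (*2017-06-04*)
- Optimized the product-fetching code: fetching one page of products went from 19664 ms to about 7000 ms (*2017-06-07*)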
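
To put the figures above in perspective, the following small sketch works out the implied throughput and speedup. The class and method names are hypothetical, and since the reported numbers are themselves approximate (about 200,000 records, about 7000 ms), the results are only rough: close to 159 records per second for the test crawl, and a per-page fetch roughly 2.8 times faster after the optimization.

```java
// Hypothetical helper for illustration; not part of the JdBee code base.
final class CrawlStats {

    /** Records per second for a crawl of `records` items that took `minutes` minutes. */
    static double recordsPerSecond(long records, double minutes) {
        return records / (minutes * 60.0);
    }

    /** How many times faster the new per-page time is compared with the old one. */
    static double speedup(double oldMillis, double newMillis) {
        return oldMillis / newMillis;
    }

    public static void main(String[] args) {
        // Figures reported in the changelog: 200,000 records in 21 minutes,
        // and one page going from 19664 ms to about 7000 ms.
        System.out.println(recordsPerSecond(200_000, 21) + " records/s");
        System.out.println(speedup(19664, 7000) + "x faster per page");
    }
}
```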

> If you find this project useful, a star, watch, or fork would be an encouragement to me.
