https://github.com/lucasxlu/LagouJob
Data Analysis & Mining for lagou.com
data-analysis data-mining lagou machine-learning nlp python3 web-crawler
- Host: GitHub
- URL: https://github.com/lucasxlu/LagouJob
- Owner: lucasxlu
- License: apache-2.0
- Created: 2016-03-15T01:51:57.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2019-04-19T01:59:03.000Z (almost 6 years ago)
- Last Synced: 2024-08-07T22:35:30.125Z (6 months ago)
- Topics: data-analysis, data-mining, lagou, machine-learning, nlp, python3, web-crawler
- Language: Python
- Homepage: https://www.zhihu.com/question/36132174/answer/94392659
- Size: 28.1 MB
- Stars: 259
- Watchers: 29
- Forks: 129
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Data Analysis of [Lagou Job](http://www.lagou.com/)
![Lagou](http://pstatic.lagou.com/www/static/common/widgets/header_c/modules/img/logo_d0915a9.png)

## Introduction
This repository holds the code for job data analysis of [Lagou](http://www.lagou.com/).
The main functions are:

1. Crawl job data from [Lagou](http://www.lagou.com/) to get the latest listings for Internet-industry jobs.
2. Collect proxies from [XiCiDaiLi](https://www.xicidaili.com/nn/1).
3. Analyze and visualize the data.
4. Crawl job detail pages and generate a word cloud as the __Job Impression__.
5. Store interviewees' comments in [mongodb](https://docs.mongodb.com/) so they can be used to train an [NLP](http://baike.baidu.com/item/nlp/25220#viewPageContent) model with machine learning.

## Prerequisites
1. Install 3rd party libraries:
   ```
   sudo pip3 install -r requirements.txt
   ```
2. Install [mongodb](https://docs.mongodb.com/) and start the [mongodb](https://docs.mongodb.com/) service [optional]:
   ```
   sudo service mongod start
   ```
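
Once the `mongod` service is running, storing interviewee comments could look roughly like the sketch below. It is an illustration only, not the repository's code: the function names, the document fields, and the `lagou` / `interview_comments` database and collection names are all assumptions, and it presumes `pymongo` is installed.

```python
from datetime import datetime, timezone


def build_comment_doc(company_id, text, tags=()):
    """Shape one interviewee comment as a MongoDB document (hypothetical schema)."""
    return {
        "company_id": int(company_id),
        "text": text.strip(),
        "tags": list(tags),
        "crawled_at": datetime.now(timezone.utc),
    }


def save_comments(docs, uri="mongodb://localhost:27017"):
    """Insert the documents into MongoDB and return how many were written."""
    docs = list(docs)
    if not docs:
        # insert_many rejects an empty list, so return early.
        return 0

    # Imported lazily so build_comment_doc stays usable without pymongo.
    from pymongo import MongoClient

    with MongoClient(uri) as client:
        # "lagou" and "interview_comments" are placeholder names.
        result = client["lagou"]["interview_comments"].insert_many(docs)
    return len(result.inserted_ids)
```

Keeping the document-building step separate from the database call makes the schema easy to check without a running MongoDB instance.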
## How to Use
1. Clone this project from [GitHub](https://github.com/lucasxlu/LagouJob.git).
2. Lagou's anti-spider strategy has been upgraded frequently of late. I suggest running [proxy_crawler.py](./spider/proxy_crawler.py) to collect IP proxies and executing the code with [PhantomJS](http://phantomjs.org/).
3. Run [m_lagou_spider.py](spider/m_lagou_spider.py) to crawl job data; it generates a collection of Excel files in the ```./data``` directory.
4. Run [hot_words_generator.py](analysis/hot_words_generator.py) to segment the text into words; it returns the __TOP-30__ hot words and a word cloud figure.

## Analysis Results
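
The hot-word ranking behind the word cloud (step 4 of How to Use) can be sketched roughly as follows. This is a minimal illustration, not the code in `hot_words_generator.py`: the function name, the tiny stop-word list, and the regex tokenizer are assumptions. Chinese job descriptions would need a proper word-segmentation step in place of the regex.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "and", "of", "to", "in", "a", "with", "for"}


def top_hot_words(descriptions, n=30):
    """Return the n most frequent tokens across all job descriptions."""
    counts = Counter()
    for text in descriptions:
        # Lowercase, then keep runs of letters, digits, '+' and '#'
        # so that terms such as "c++" and "c#" survive tokenization.
        tokens = re.findall(r"[a-z0-9+#]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 1)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["Python and SQL required", "Experience with Python, Spark"]
    for word, freq in top_hot_words(sample, n=5):
        print(word, freq)
```

The resulting `(word, count)` pairs can be converted to a dict and handed to a word-cloud renderer as frequencies.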
> ![Image1](https://pic2.zhimg.com/a0c42bc6bd7c8743687ba50305c85821_b.jpg)
> ![Image2](https://pic3.zhimg.com/f89ca5a008f8ad84a1a2121888aa10c2_b.jpg)
> ![Image3](https://pic1.zhimg.com/85b930c6aff823a3b8ee73973d20f274_b.jpg)
> ![Image4](https://pic1.zhimg.com/v2-b5ef151109c8787a0a46efed111d3884_b.png)
> ![Image5](https://pic3.zhimg.com/v2-aae9b487a843b00298166b6335b061aa_b.png)
> ![Image6](https://pic3.zhimg.com/9c2e99674bcb59e0ff54ca0a3fbe4142_b.jpg)
> ![Image7](https://pic3.zhimg.com/6ea06ad7dd376f51e629635a69b09cba_b.jpg)

## Report
* For technical details, please refer to my answer at [Zhihu](https://www.zhihu.com/question/36132174/answer/94392659).
* The PDF report can be downloaded from [here](https://lucasxlu.github.io/blog/projects/LagouJob.pdf).

## Change Log
- [V2.0] - 2019.04. Upgraded to [PhantomJS](http://phantomjs.org/) and IP proxies.
- [V1.2] - 2017.05. Rewrote the WordCloud visualization module.
- [V1.0] - 2017.04. Upgraded to mobile Lagou.
- [V0.8] - 2016.05. Finished the Lagou PC web spider.

## LICENSE
[Apache-2.0](./LICENSE)