{"id":13748578,"url":"https://github.com/wangjksjtu/Data-Mining-51Job","last_synced_at":"2025-05-09T11:31:04.667Z","repository":{"id":67121973,"uuid":"138133417","full_name":"wangjksjtu/Data-Mining-51Job","owner":"wangjksjtu","description":"Data-mining on 51Job website","archived":false,"fork":false,"pushed_at":"2018-06-25T23:55:05.000Z","size":8136,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-05-28T21:06:44.900Z","etag":null,"topics":["51job","data-mining","machine-learning","scikit-learn","seaborn","web-crawling"],"latest_commit_sha":null,"homepage":"http://wangjksjtu.github.io/Data-Mining-51Job/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wangjksjtu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-06-21T07:13:39.000Z","updated_at":"2021-08-16T03:29:06.000Z","dependencies_parsed_at":"2023-06-09T20:15:31.692Z","dependency_job_id":null,"html_url":"https://github.com/wangjksjtu/Data-Mining-51Job","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangjksjtu%2FData-Mining-51Job","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangjksjtu%2FData-Mining-51Job/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangjksjtu%2FData-Mining-51Job/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangjksjtu%2FData-Mining-51Job/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wangjksjtu","download_url":"https://codeload.github.com/wangjksjtu/Data-Mining-51Job/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224859699,"owners_count":17381676,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["51job","data-mining","machine-learning","scikit-learn","seaborn","web-crawling"],"created_at":"2024-08-03T07:00:44.586Z","updated_at":"2024-11-15T23:30:30.067Z","avatar_url":"https://github.com/wangjksjtu.png","language":"Jupyter Notebook","funding_links":[],"categories":["资源清单"],"sub_categories":["Data Mining"],"readme":"# Data-Mining-51Job\nThis repository explores the data on the [51Job website](https://www.51job.com/), where companies post their open positions and employees can share their own profiles to advance their careers. Overall, the work in this repo can be summarized in the following aspects:\n\n- Collect the job information using a Python crawler. 
\n- Preprocess the data (clean, discretize, match, normalize, etc.).\n- Conduct feature engineering to analyse the data.\n- Design two tasks for real scenarios (salary and job type prediction).\n- Apply various machine learning algorithms to our tasks.\n\nThe documents of our work are available here: [[report]](https://github.com/wangjksjtu/Data-Mining-51Job/blob/master/docs/report.pdf), [[notebook]](https://github.com/wangjksjtu/Data-Mining-51Job/blob/master/feature_analysis/feature_analysis.ipynb).\n\n## Requirements\n- [scrapy](https://scrapy.org/) (web crawling)\n- [numpy](http://www.numpy.org/) and [pandas](http://pandas.pydata.org/) (data preprocessing)\n- [scikit-learn](http://scikit-learn.org/stable/index.html) (ML algorithms)\n- [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/index.html) (data visualization)\n\n## QuickStart\n### Web Crawling\nWe use [scrapy](https://scrapy.org/) to crawl raw data from the [51Job website](https://www.51job.com/). See the directory ```/job51spider``` for the code. XPath is used to parse the HTML and extract the data.\n\nAfter entering the directory, run the following command (for example, in cmd.exe) to start the spider.\n```\nscrapy crawl 51job\n```\n### Data Preprocessing\nWe use the Python libraries [pandas](http://pandas.pydata.org/) (its DataFrame class) and re to preprocess the raw data. See /preprocess/preprocess.py for the code.\nYou can find the preprocessed data in /data, where middleData.csv is the preprocessed data suitable for plotting,\n while quantityData.csv quantifies all fields and is suited to further data analysis.\n\n### Feature Engineering\n\nSee directory ```/pics```.\n We analyzed feature correlation and feature distribution. 
We found three main features that affect salary level: education level requirements, work experience requirements, and area location.\n\n\u003cimg class =\"left\" src=\"docs/figures/map_num.png\" width=\"45%\"\u003e \u003cimg class =\"right\" src=\"docs/figures/map_salary.png\" width=\"45%\" height=\"459\"\u003e\n\n\n### Salary Prediction\n\n| Model | R2 value | Mean Error ￥/year | time / s |  \n| :---- |:------------:| :----: | :----: |\n| Ada-Boost | 0.2350 | 37483.9 | 0.79 |\n| Grad-Boost | __0.3237__ | __34031.4__ | 3.13 |\n| SVR (RBF) | 0.0092 | 43301.9 | 350.08 |\n| Bayesian Ridge | 0.2667 | 34031.4 | 0.05 |\n| Elastic Net | 0.0426 | 44784.4 | __0.03__ |\n| MLPs | 0.2682 | 36207.3 | 19.29 |\n\n\n### Job Area Prediction\n\n| Model | Accuracy / % | time / s | Model | Accuracy / % | time / s |  \n| :---- |:------------:| :----: | :---- |:------------:| :----: |\n| LP | 7.79% | 3.89 | MLP | 20.91% | 20.05 | \n| GNB | 7.32% | __0.23__ | SVM | __29.31%__ | 1032.75 |\n| KNN | 25.19% | 2.60 | XGBoost | 27.53%  | 303.21 |\n| RF | 28.44% | 1.80 | | | |\n\n\nThe accuracy \u0026 time plot of the above models:\n\u003cfigure class=\"half\"\u003e\n    \u003cimg src=\"docs/figures/acc.png\" width=\"70%\"\u003e\n\u003c/figure\u003e\n\n\n## Team Members\n- [Jingkang Wang](https://github.com/wangjksjtu)\n- [Jilai Zheng](https://github.com/zhengjilai)\n- [Qingzhao Zhang](https://github.com/zqzqz)\n- [Lei Wang](https://github.com/Dulou)\n- [Jinrui Sha](https://github.com/sjrGCkym)\n- [Zhongwei Chen]()\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwangjksjtu%2FData-Mining-51Job","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwangjksjtu%2FData-Mining-51Job","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwangjksjtu%2FData-Mining-51Job/lists"}