{"id":18859786,"url":"https://github.com/sty945/news_spider","last_synced_at":"2025-09-04T07:40:14.464Z","repository":{"id":109500274,"uuid":"113196759","full_name":"sty945/news_spider","owner":"sty945","description":"Crawls the society news section of China News Service (chinanews.com) and analyzes trending news events by keyword","archived":false,"fork":false,"pushed_at":"2020-03-08T13:15:23.000Z","size":43887,"stargazers_count":19,"open_issues_count":0,"forks_count":9,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-14T12:21:22.990Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sty945.png","metadata":{"files":{"readme":"ReadMe.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-05T15:05:48.000Z","updated_at":"2025-01-27T14:02:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"a22344b2-6e0a-4393-8a0a-641782d9d8b9","html_url":"https://github.com/sty945/news_spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sty945/news_spider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sty945%2Fnews_spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sty945%2Fnews_spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sty945%2Fnews_spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sty945%2Fnews_spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sty945","download_url":"https://codeload.github.com/sty945/ne
ws_spider/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sty945%2Fnews_spider/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273574102,"owners_count":25129882,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T04:19:12.794Z","updated_at":"2025-09-04T07:40:14.443Z","avatar_url":"https://github.com/sty945.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Visual Analysis of Trending News Data from News Media\r\nAnyone interested in data visualization or data mining is welcome to help build this project.\r\n**Welcome to fork**\r\n\r\n## Code Explanation\r\nFor the project's design rationale, explanations of parts of the code, and a walkthrough of using NLP techniques for simple data visualization analysis, scan the QR code below with WeChat to open the detailed tutorial:\r\n\r\n![QR code](https://img-blog.csdnimg.cn/20200308211258100.png)\r\n\r\n\r\n## Current Features\r\n\r\nThe crawler targets the society news section of China News Service (chinanews.com) and uses keywords to analyze trending news events:\r\n[China News Service society section](http://www.chinanews.com/society.shtml)\r\n\r\nThe code is currently configured to crawl all news from November 2017 for later visualization analysis. You can set a different time range to crawl in homework1.py.\r\n\r\n[Project source repository](https://github.com/sty945/news_spider)\r\n\r\n[Current results](https://github.com/sty945/news_spider/blob/master/result/news_spider_vision.ipynb)\r\n\r\n[Visualization analysis report](https://mp.weixin.qq.com/s/LOEuUQe9rsv87S8KISGHJg)\r\n\r\n## Planned Features\r\nBuild a complete pipeline from news mining through analysis to visualization,\r\nso that users can follow current trending news and the full life cycle of each topic: how it starts, unfolds, develops, and fades.\r\n\r\nAn article once published by the official WeChat account on visualizing trending news, for reference:\r\n\r\n[微信小秘密: 2016 
年那些 10w+ 文章是怎么刷爆朋友圈的？](http://mp.weixin.qq.com/s/hlWAW8UybzF5jzhNyRx_Bg)\r\n\r\n## Popular Trending-Search Lists in China\r\n[Weibo Hot Search](http://s.weibo.com/top/summary?cate=homepage)\r\n\r\n[Baidu Top Searches](http://top.baidu.com/)\r\n\r\n[Sogou Hot Search](http://top.sogou.com/)\r\n\r\n[360 Real-Time Hot Topics](https://trends.so.com/hot)\r\n\r\n[360 Trends](https://trends.so.com/)\r\n\r\n\r\n## Related References\r\n\r\n[pyLDA系列 考量时间因素的动态主题模型（Dynamic Topic Models)](https://blog.csdn.net/sinat_26917383/article/details/79377761)\r\n\r\n[LDA(Latent Dirichlet Allocation)主题模型](https://blog.csdn.net/aws3217150/article/details/53840029)\r\n\r\n\r\n## Runtime Environment\r\nOS: Windows\r\n\r\nPython version: Python 3.6.3\r\n\r\nDatabase: MongoDB 3.4.9\r\n\r\nWord segmentation: ICTCLAS segmenter from the Chinese Academy of Sciences, available at https://github.com/sty945/NLPIR\r\n\r\nTool for converting segmenter output files to JSON: http://tools.jb51.net/code/excel_col2json\r\n\r\n## Directory Layout and File Descriptions\r\n```\r\nnews_spider\r\n│  readme.txt\r\n│  \r\n├─bin  program files\r\n│  │  countDatabase.py     counts the records in the database while crawling is in progress\r\n│  │  deal_network_failed.py    resumes crawling from the point of interruption after a network drop or other failure\r\n│  │  writefile.py   writes all news data in the database to a txt file\r\n│  │  news_spider.py  main crawler program\r\n│          \r\n├─contents text resources\r\n│  │  03_content.txt   result data text for November 2017\r\n│          \r\n└─result      results\r\n        11month_view .html    data visualization, saved as HTML from a Jupyter notebook; Firefox is recommended because charts render incorrectly in Chrome\r\n        11result.json        extracted keyword results after processing, saved as JSON\r\n        raw_result.json      extracted keyword results before processing, saved as JSON\r\n        news_spider_vision.ipynb  results in Jupyter notebook format\r\n        Locators_table_cheat_sheet.pdf  CSS selector reference\r\n        stop_words*   stop-word lists\r\n```     \r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsty945%2Fnews_spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsty945%2Fnews_spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsty945%2Fnews_spider/lists"}