{"id":31579145,"url":"https://github.com/blue-catblues/tieba-integratedanalysis","last_synced_at":"2025-10-05T20:47:39.620Z","repository":{"id":317740393,"uuid":"1068656392","full_name":"Blue-CatBlues/Tieba-IntegratedAnalysis","owner":"Blue-CatBlues","description":"Python final course project: an integrated data analysis of Baidu Tieba covering web crawling (scrapy), statistical analysis (pandas), visualization (matplotlib), and machine-learning classification (scikit-learn)","archived":false,"fork":false,"pushed_at":"2025-10-02T18:11:30.000Z","size":304,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-02T19:32:27.774Z","etag":null,"topics":["matplotlib","nlp-machine-learning","pandas","python","scikit-learn","scrapy","seaborn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Blue-CatBlues.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-02T17:56:03.000Z","updated_at":"2025-10-02T18:19:18.000Z","dependencies_parsed_at":"2025-10-02T19:32:29.869Z","dependency_job_id":null,"html_url":"https://github.com/Blue-CatBlues/Tieba-IntegratedAnalysis","commit_stats":null,"previous_names":["blue-catblues/tieba-integratedaanalysis"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Blue-CatBlues/Tieba-IntegratedAnalysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blue-CatBlues%2FTieba-IntegratedAnalysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/Blue-CatBlues%2FTieba-IntegratedAnalysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blue-CatBlues%2FTieba-IntegratedAnalysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blue-CatBlues%2FTieba-IntegratedAnalysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Blue-CatBlues","download_url":"https://codeload.github.com/Blue-CatBlues/Tieba-IntegratedAnalysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blue-CatBlues%2FTieba-IntegratedAnalysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278518121,"owners_count":26000176,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["matplotlib","nlp-machine-learning","pandas","python","scikit-learn","scrapy","seaborn"],"created_at":"2025-10-05T20:47:34.546Z","updated_at":"2025-10-05T20:47:39.613Z","avatar_url":"https://github.com/Blue-CatBlues.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Baidu Tieba Integrated Analysis Project\n\nThis project combines web crawling, data cleaning, statistical analysis, visualization, and machine learning in a single system. It automatically collects data from several Baidu Tieba categories, then analyzes it and trains a classifier on it, to help users understand the Tieba ecosystem and how its content is distributed.\n\n---\n\n## 🚀 Feature Modules\n\n### 🕷️ Data Collection (Scrapy)\n\n- Supports 8 Tieba categories (music, games, sports, regions, anime, novels, celebrities, society)\n- Automatically crawls each forum's name, follower count, post count, description, and category\n- Outputs a structured 
Excel file: `Tieba_output.xlsx`\n\n### 📊 Data Analysis (pandas)\n\n- Per-category statistics on the number of forums, follower counts, and post counts\n- Computes posts per follower and the proportional distribution across categories\n- Writes the analysis results to `统计结果_比例.xlsx`\n\n### 📈 Visualization (matplotlib + seaborn)\n\n- Box plots: distribution ranges of forum follower counts and post counts\n- Bar chart: posts per follower for each category\n- Pie charts: each category's share of followers and posts\n- Heatmaps: classification accuracy and the confusion matrix\n\n### 🤖 Machine Learning Classification (scikit-learn)\n\n- Vectorizes forum description text with TF-IDF\n- Performs multi-class prediction with naive Bayes\n- Outputs accuracy, the confusion matrix, and a classification report\n- Supports pairwise classification accuracy analysis for every two-category combination, shown as a heatmap\n\n---\n\n## 🛠️ Usage\n\n### 1️⃣ Install dependencies\n\n```bash\nconda create -n tieba_insight python=3.10\nconda activate tieba_insight\npip install scrapy pandas matplotlib seaborn scikit-learn jieba openpyxl\n```\n### 2️⃣ Run the crawler\n```bash\nscrapy crawl life\n```\nThe output is saved to `data/Tieba_output.xlsx`.\n\n### 3️⃣ Analyze the data\n```bash\npython analyse/proportion_analysis.py\n```\nThe analysis results are written to `data/统计结果_比例.xlsx`.\n\n### 4️⃣ Visualize\n```bash\npython analyse/visualization.py\n```\nGenerates the box plots, bar chart, pie charts, and other figures automatically.\n\n### 5️⃣ Machine learning classification\n```bash\npython ml/classifier.py\npython ml/pairwise_analysis.py\n```\nOutputs the classification accuracy, the confusion matrix, and the pairwise category comparison heatmap.\n\nSample charts\n\n📦 Box plots of forum follower counts and post counts\n\n🧱 Bar chart of posts per follower by category\n\n🥧 Pie charts of category proportions\n\n🔥 Classification accuracy heatmap\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblue-catblues%2Ftieba-integratedanalysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblue-catblues%2Ftieba-integratedanalysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblue-catblues%2Ftieba-integratedanalysis/lists"}