{"id":19285979,"url":"https://github.com/chenjiandongx/stackoverflow-spider","last_synced_at":"2025-04-10T01:14:19.084Z","repository":{"id":92058856,"uuid":"86238887","full_name":"chenjiandongx/stackoverflow-spider","owner":"chenjiandongx","description":"📖 爬取 Stackoverflow 100万 条问答并简单分析","archived":false,"fork":false,"pushed_at":"2023-03-12T07:25:54.000Z","size":449,"stargazers_count":214,"open_issues_count":6,"forks_count":79,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-10T01:14:14.181Z","etag":null,"topics":["python","spider","stackoverflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chenjiandongx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-26T14:31:10.000Z","updated_at":"2025-03-26T02:44:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"6923c1a8-9bdf-47c6-9dbd-e3a6de42ac88","html_url":"https://github.com/chenjiandongx/stackoverflow-spider","commit_stats":{"total_commits":36,"total_committers":1,"mean_commits":36.0,"dds":0.0,"last_synced_commit":"6d60fc2d19d22cfeccdb8bca4aa144f3b633850c"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenjiandongx%2Fstackoverflow-spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenjiandongx%2Fstackoverflow-spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenjiandongx%2Fstackoverflow-spider/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenjiandongx%2Fstackoverflow-spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chenjiandongx","download_url":"https://codeload.github.com/chenjiandongx/stackoverflow-spider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137891,"owners_count":21053775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","spider","stackoverflow"],"created_at":"2024-11-09T21:47:34.902Z","updated_at":"2025-04-10T01:14:19.047Z","avatar_url":"https://github.com/chenjiandongx.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Crawling 1 million Stack Overflow questions and answers\n\nAs a university student who loves programming, how could I not know about Stack Overflow-oriented programming?\n\nOpen the Stack Overflow home page, sort the questions page by votes, and crawl the first 20,000 pages at 50 questions per page, for 1 million questions in total. (I originally wanted to crawl all 13 million, but beyond the first 1 million nearly every question has only 1 or 0 answers, so the first 1 million will do.)\n\nAfter deduplication in the database, only 999,654 questions remained.\n\n# A simple analysis of the crawled data  \n## Votes analysis\n### Line chart of vote counts in descending order  \n\n![Votes line chart](https://github.com/chenjiandongx/stackoverflow/blob/master/images/votes_0.png)  \nAfter the first 2,000 questions the vote count is already below 400, and from there on the curve hugs the floor.  \nMost votes: [Why is it faster to process a sorted array than an unsorted array?](http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array)\n\n### Continuous distribution of vote counts  \n\n![Votes distribution chart](https://github.com/chenjiandongx/stackoverflow/blob/master/images/votes_1.png)  \nMost of the values cluster between 1k and 2k, and from 6k upward the distribution essentially breaks off.  \n\n### Detailed numbers  \n\n| description  | count |\n| -----------  | ----- |\n| 
votes \u003e= 500 | 1630  |\n| votes \u003e= 400 | 2325  |\n| votes \u003e= 300 | 3782  |\n| votes \u003e= 200 | 7062  |\n| votes \u003e= 100 | 19781 |  \n\nUsing 100 votes as the dividing line gives this pie chart:  \n\n![pie_votes_1](https://github.com/chenjiandongx/stackoverflow/blob/master/images/pie_votes_1.png)  \nQuestions with 100 or more votes make up less than 2% of the total.  \n\nNow for the low end of the data:  \n\n| description | count   |\n| ----------- | -----   |\n| 1 \u003c= votes \u003c= 5    | 211804  |\n| 6 \u003c= votes \u003c= 10   | 430935  |\n| 11 \u003c= votes \u003c= 15  | 136647  |\n| 16 \u003c= votes \u003c= 20  | 64541   |\n| votes \u003c= 20        | 843927  |  \n\nQuestions with 20 votes or fewer number roughly 840,000.  \nHere is the overall breakdown:  \n![pie_votes_2](https://github.com/chenjiandongx/stackoverflow/blob/master/images/pie_votes_2.png)  \n\n\n## Answers analysis\n### Line chart of answer counts in descending order  \n  \n![Answers line chart](https://github.com/chenjiandongx/stackoverflow/blob/master/images/answers_0.png)  \nClearly, after the first 3,000 questions the answer count is mostly below 20.  \nMost answers: [What is the best comment in source code you have ever encountered? 
[closed]](http://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered)  \n\n### Continuous distribution of answer counts  \n\n![Answers distribution chart](https://github.com/chenjiandongx/stackoverflow/blob/master/images/answers_1.png)  \nBeyond 150 answers the distribution breaks off; very few questions ever reach that many answers.  \n\n### Detailed numbers  \n  \n| description   | count |\n| -----------   | ----- |\n| answers \u003e= 5   | 218059 |\n| answers \u003e= 10  | 34500  |\n| answers \u003e= 20  | 3808   |\n| answers \u003e= 30  | 968    |  \n\nQuestions with 30 or more answers are rare indeed. Here is the overall picture:  \n![pie_answer_1](https://github.com/chenjiandongx/stackoverflow/blob/master/images/pie_answer_1.png)  \n\n\n## Views analysis\n### Line chart of view counts in descending order  \n\n![Views line chart](https://github.com/chenjiandongx/stackoverflow/blob/master/images/views_0.png)  \nThe highest view count reaches about 4.5 million; after the first 100,000 questions, view counts are mostly below 28,000.  \nMost views: [How to undo last commit(s) in Git?](http://stackoverflow.com/questions/927358/how-to-undo-last-commits-in-git)\n\n\n### Continuous distribution of view counts  \n\n![Views distribution chart](https://github.com/chenjiandongx/stackoverflow/blob/master/images/views_1.png)\n\n### Detailed numbers  \n\n| description   | count |\n| -----------   | ----- |\n| views \u003e= 5000    | 486466  |\n| views \u003e= 10000   | 315576  |\n| views \u003e= 20000   | 171873  |\n| views \u003e= 50000   | 59363   |\n| views \u003e= 100000  | 22224   |\n| views \u003e= 200000  | 7030    |  \n\nMost questions have fewer than 20,000 views.  \nHere is the overall distribution:  \n![bubble_views](https://github.com/chenjiandongx/stackoverflow/blob/master/images/bubble_views.png)\n\n## Scatter plots of votes, views, and answers against each other  \n### votes - views  \n\n![votes-views scatter plot](https://github.com/chenjiandongx/stackoverflow/blob/master/images/views_votes.png)  \n### votes - answers  \n\n![votes-answers scatter plot](https://github.com/chenjiandongx/stackoverflow/blob/master/images/answers_votes.png)\n### views - answers  \n\n![views-answers scatter plot](https://github.com/chenjiandongx/stackoverflow/blob/master/images/view_answers.png)  \n\n\nOverall, the relationship among the three metrics resembles a pyramid. In all three plots the region near the origin in the lower-left corner is densely filled, which means the vast majority of questions sit in the bottom tier for votes, answers, and views. High-quality, active questions sit at the top of the pyramid. The maxima of the three metrics show no obvious correspondence either: the most-voted, most-answered, and most-viewed questions are three different questions.\n\n\nFrom the tags of all questions I extracted the 200 most frequent keywords (the first 51 are listed below). c# ranks first and python ranks fifth.\n\n```python\n('c#', 94614),\n('java', 93244),\n('javascript', 76722),\n('android', 69321),\n('python', 62502),\n('c++', 58173),\n('php', 42596),\n('ios', 37773),\n('jquery', 37405),\n('.net', 36180),\n('html', 28536),\n('css', 26174),\n('c', 24699),\n('objective-c', 23253),\n('iphone', 22171),\n('ruby-on-rails', 20143),\n('sql', 19171),\n('asp.net', 18060),\n('mysql', 17559),\n('ruby', 16397),\n('r', 15670),\n('git', 13139),\n('linux', 13080),\n('asp.net-mvc', 12857),\n('angularjs', 12606),\n('sql-server', 12473),\n('node.js', 12212),\n('django', 11576),\n('arrays', 11006),\n('algorithm', 10959),\n('wpf', 10631),\n('performance', 10619),\n('xcode', 10613),\n('string', 10426),\n('windows', 10132),\n('eclipse', 10117),\n('scala', 9942),\n('regex', 9685),\n('multithreading', 9601),\n('json', 9266),\n('swift', 8950),\n('c++11', 8939),\n('haskell', 8823),\n('osx', 8159),\n('visual-studio', 8140),\n('html5', 7627),\n('database', 7567),\n('xml', 7478),\n('spring', 7464),\n('unit-testing', 7253),\n('bash', 6825)\n```\n\n### A raw list is not very intuitive, so here is a word cloud generated from the tag frequencies  \n\n![Word cloud](https://github.com/chenjiandongx/stackoverflow/blob/master/images/word_cloud.jpg)\n\n\n## Since the crawler is written in Python, here is a closer look at the Python questions\n### Top 10 by votes\n* 6162 : [What does the “yield” keyword do in Python?](http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python)\n* 3529 : [What is a metaclass in Python?](http://stackoverflow.com/questions/100003/what-is-a-metaclass-in-python)\n* 3098 : [How do I check whether a file exists using Python?](http://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python)\n* 3035 : [Does Python have a ternary conditional operator?](http://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator)\n* 2620 : [Calling an external command in 
Python](http://stackoverflow.com/questions/89228/calling-an-external-command-in-python)\n* 2605 : [What does if __name__ == “__main__”: do?](http://stackoverflow.com/questions/419163/what-does-if-name-main-do)\n* 2194 : [How to merge two Python dictionaries in a single expression?](http://stackoverflow.com/questions/38987/how-to-merge-two-python-dictionaries-in-a-single-expression)\n* 2123 : [Sort a Python dictionary by value](http://stackoverflow.com/questions/613183/sort-a-python-dictionary-by-value)\n* 2058 : [How to make a chain of function decorators?](http://stackoverflow.com/questions/739654/how-to-make-a-chain-of-function-decorators)\n* 1984 : [How to check if a directory exists and create it if necessary?](http://stackoverflow.com/questions/273192/how-to-check-if-a-directory-exists-and-create-it-if-necessary)\n\n\n### Top 10 by answers\n* 191 : [Hidden features of Python [closed]](http://stackoverflow.com/questions/101268/hidden-features-of-python)\n* 87 : [Best ways to teach a beginner to program? 
[closed]](http://stackoverflow.com/questions/3088/best-ways-to-teach-a-beginner-to-program)\n* 55 : [Favorite Django Tips \u0026 Features?](http://stackoverflow.com/questions/550632/favorite-django-tips-features)\n* 50 : [How do you split a list into evenly sized chunks?](http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks)\n* 44 : [Calling an external command in Python](http://stackoverflow.com/questions/89228/calling-an-external-command-in-python)\n* 43 : [How can I represent an 'Enum' in Python?](http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python)\n* 38 : [How to merge two Python dictionaries in a single expression?](http://stackoverflow.com/questions/38987/how-to-merge-two-python-dictionaries-in-a-single-expression)\n* 38 : [Finding local IP addresses using Python's stdlib](http://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib)\n* 37 : [Reverse a string in python without using reversed or [::-1]](http://stackoverflow.com/questions/18686860/reverse-a-string-in-python-without-using-reversed-or-1)\n* 37 : [How do I check whether a file exists using Python?](http://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python)\n\n\n### Top 10 by views\n* 2121621 : [Parse String to Float or Int](http://stackoverflow.com/questions/379906/parse-string-to-float-or-int)\n* 1905938 : [Using global variables in a function other than the one that created them](http://stackoverflow.com/questions/423379/using-global-variables-in-a-function-other-than-the-one-that-created-them)\n* 1888666 : [How do I check whether a file exists using Python?](http://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python)\n* 1827126 : [Calling an external command in Python](http://stackoverflow.com/questions/89228/calling-an-external-command-in-python)\n* 1699574 : [Converting integer to string in 
Python?](http://stackoverflow.com/questions/961632/converting-integer-to-string-in-python)\n* 1686230 : [How do I read a file line-by-line into a list?](http://stackoverflow.com/questions/3277503/how-do-i-read-a-file-line-by-line-into-a-list)\n* 1682307 : [Iterating over dictionaries using 'for' loops in Python](http://stackoverflow.com/questions/3294889/iterating-over-dictionaries-using-for-loops-in-python)\n* 1569205 : [How to get the size of a list](http://stackoverflow.com/questions/1712227/how-to-get-the-size-of-a-list)\n* 1554755 : [How do I install pip on Windows?](http://stackoverflow.com/questions/4750806/how-do-i-install-pip-on-windows)\n* 1515505 : [Finding the index of an item given a list containing it in Python](http://stackoverflow.com/questions/176918/finding-the-index-of-an-item-given-a-list-containing-it-in-python)  \n\n### Two questions appear in all three top-10 lists:\n* [How do I check whether a file exists using Python?](http://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python)\n* [Calling an external command in Python](http://stackoverflow.com/questions/89228/calling-an-external-command-in-python)\n\n\n### Forks and stars are welcome","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchenjiandongx%2Fstackoverflow-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchenjiandongx%2Fstackoverflow-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchenjiandongx%2Fstackoverflow-spider/lists"}