{"id":16525642,"url":"https://github.com/camdavidsonpilon/pyconcanada2015","last_synced_at":"2025-10-28T08:31:36.851Z","repository":{"id":66073640,"uuid":"41968983","full_name":"CamDavidsonPilon/PyconCanada2015","owner":"CamDavidsonPilon","description":"My scrapers, data and analysis for PyCon Canada 2015 Keynote","archived":false,"fork":false,"pushed_at":"2015-11-07T19:51:43.000Z","size":15604,"stargazers_count":26,"open_issues_count":1,"forks_count":14,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-02-01T12:51:12.959Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CamDavidsonPilon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-05T17:05:39.000Z","updated_at":"2023-09-08T17:01:22.000Z","dependencies_parsed_at":"2023-02-19T22:15:40.238Z","dependency_job_id":null,"html_url":"https://github.com/CamDavidsonPilon/PyconCanada2015","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamDavidsonPilon%2FPyconCanada2015","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamDavidsonPilon%2FPyconCanada2015/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamDavidsonPilon%2FPyconCanada2015/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamDavidsonPilon%2FPyconCanada2015/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CamDavidsonPilon","download_url":"https://codeload.github.com/CamDav
idsonPilon/PyconCanada2015/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238623011,"owners_count":19502964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T17:04:27.496Z","updated_at":"2025-10-28T08:31:33.558Z","avatar_url":"https://github.com/CamDavidsonPilon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyConCanada2015\n\n\n\n#### My scrapers + data + analysis for PyConCanada2015 Keynote\n\n(sorry github)\n\n\n\n\n## Frequency of libraries in `requirements.txt` files in Github Python repositories\n\n\nThis was done by scraping 10k+ Python repositories on Github that contain a `requirements.txt` file. This file is commonly used to store dependencies of the repository. \n\n![freq_libs](http://i.imgur.com/Kft8vUl.png)\n\nEither the majority of Python repositories on Github are web development related, or web developers are the most likely to include a proper `requirements.txt` file in their repositories. \n\n\n## Relationships between libraries\n\nUsing the data in `requirements.txt` files, we can find common co-occurrences of libraries. For example, it's not hard to imagine that whenever django is a requirement, so is psycopg2. In fact, in the dataset I had, 41% of all django apps also included psycopg2. These relationships can be mined using a simple algorithm called the apriori algorithm. Its history goes back to large department stores that were interested in what products were commonly bought together. 
The naive solution, comparing all possible pairs, results in a quadratic algorithm - and if you have thousands of products, this becomes inefficient quickly. The apriori algorithm intelligently cuts through this massive space. \n\nHere are the other common libraries paired with django:\n\nConfidence is defined as \n```\nconfidence = P(ending_with | starting_with)\n           = P(starting_with and ending_with) / P(starting_with)\n           = #{requirement.txts with both} / #{requirement.txts with starting_with}\n\n```\n\n\n|starting_with | ending_with      |  starting_with_occurrences |confidence     | occurrences |ending_with_occurrences |\n|------------|-----------------|-------------------------|---------------|-------------|-----------------------|\n|django      | requests        |  2714                   |0.243920412675 | 662         |2463                   |\n|django      | wheel           |  2714                   |0.22402358143  | 608         |1649                   |\n|django      | six             |  2714                   |0.245394252027 | 666         |1985                   |\n|django      | psycopg2        |  2714                   |0.411569638909 | 1117        |1573                   |\n|django      | gunicorn        |  2714                   |0.320191599116 | 869         |1531                   |\n|django      | dj-database-url |  2714                   |0.263448784083 | 715         |728                    |\n\n\n[Here are the results for other libraries](https://github.com/CamDavidsonPilon/PyconCanada2015/blob/master/analysis/library_association_rules.csv)\n including some metrics to sort on. To read more about these metrics, see [this link](http://michael.hahsler.net/research/association_rules/measures.html).\n\n## Now, let's recommend libraries based on these relationships\n\nSo, if we know a user installed django, we can perhaps recommend that they also install psycopg2 (according to the table above, we would be right 41% of the time). 
We can turn these co-occurrences into a very simple recommendation algorithm for Python libraries! So I've gone ahead and done that. \n\n### `pipp`: one of the `p`s stands for personalized!\n\nYes, that's right - we can bring you library recommendations right to the command line. Try it out!\n\n`pip install pipp`\n\n```\n$ pipp install jsmin\nRequirement already satisfied (use --upgrade to upgrade): jsmin in /Users/camerondavidson-pilon/.virtualenvs/data/lib/python2.7/site-packages\npipp: Other users who installed jsmin also installed cssmin\n```\n\nCommand line too nerdy for you? How about recommendations on PyPI?\n\n![pypi](http://i.imgur.com/BCyumQV.png)\n\n## Network force-layout of libraries in `requirements.txt` files in Github Python repositories\n\nThis is biased, as some libraries have their own requirements. For example, Pandas depends on Numpy, so it would be less common to have both Pandas and Numpy in a requirements.txt file. \n\n![network_graph](http://i.imgur.com/MesvpSa.png)\n\n## The Plural of Anecdote is Data!\n\nI've often heard techies say the plural of anecdote is not data. I see where they are coming from, however I think they are being shortsighted. Whereas one or two occurrences of something are not enough evidence to prove a fact, they are *evidence* of something interesting. And if you have a tool to quickly confirm or deny further occurrences of this anecdote, then yes, you have data. For example, how often do you see links to stackoverflow questions in code? I have seen it before, and I wondered, how common is this? \n\nUsing one of the greatest anecdote validation tools, Search, we can validate this idea:\n\n![search](http://i.imgur.com/kdqLVTz.png)\n\nGreat, now let's start scraping. 
Here are the most common questions linked in Python code:\n\n```\nstackoverflow.com/questions/19622133    1173\nstackoverflow.com/questions/279237       887\nstackoverflow.com/questions/5658622      320\nstackoverflow.com/questions/22019341     134\nstackoverflow.com/questions/35817        117\nstackoverflow.com/questions/1769332       89\nstackoverflow.com/questions/377017        86\nstackoverflow.com/questions/1189781       73\nstackoverflow.com/questions/4124220       70\nstackoverflow.com/questions/701802        66\n```\n\nLet's investigate the first one. It's a very specific question about Windows and ctypes - not a common problem in the first place. If we [search for just that url on Github](https://github.com/search?q=stackoverflow.com%2Fquestions%2F19622133\u0026type=Code\u0026utf8=%E2%9C%93), we see it's all from the same file, `windows_support.py`. Investigating the repos containing that url, we see that (1) this comes from code inside Python 2.7 itself, and (2) people are including all of Python 2.7 in their Github repos! \n\n\n## Most controversial Python StackOverflow answer\n\nStackOverflow has become the most popular forum for developers to ask, answer and importantly *promote* or *demote* content. StackOverflow does something even more incredible: they expose all their interaction data (questions, answers, views, votes) through a [public query interface](http://data.stackexchange.com/). Using this, we can ask: what is the most controversial Python answer?\n\n\nTo do this, we will use the following algorithm: find the answer whose fraction of upvotes (upvotes divided by total votes) is close to 0.5, and that also has lots of votes. The former requirement is a good definition of \"controversial\", and the latter requirement protects us against answers with trivial counts (ex: 1 upvote and 1 downvote). 
Think of it as a balancing act between the two requirements, answering \"how confident are we that this answer is indeed the most controversial?\" The following query accomplishes this (based on a similar equation in [this post](http://camdp.com/blogs/how-sort-comments-intelligently-reddit-and-hacker-)):\n\n```SQL\ndeclare @VoteStats table (parentid int, id int, U float, D float) \n\n-- U and D are the upvote (VoteTypeID = 2) and downvote (VoteTypeID = 3) counts, each plus 1\ninsert @VoteStats\nSELECT \n  a.parentid,\n  a.id,\n  CAST(SUM(case when (VoteTypeID = 2) then 1. else 0. end) + 1. as float) as U,\n  CAST(SUM(case when (VoteTypeID = 3) then 1. else 0. end) + 1. as float) as D\nFROM Posts q\nJOIN PostTags qt \n  ON qt.postid = q.ID\nJOIN Tags T \n  ON T.Id = qt.TagId\nJOIN Posts a \n  ON q.id = a.parentid\nJOIN Votes \n  ON Votes.PostId = a.Id\nWHERE TagName  = 'python'\n   and a.PostTypeID = 2 -- these are answers\nGroup BY a.id, a.parentid\n\nset nocount off\n\n-- Score sums the distances from 0.5 to U/(U+D) minus and plus 3.5 times\n-- SQRT(U*D / ((U+D)^2 * (U+D+1))); smaller means more controversial\nSELECT \n TOP 100\n parentid,\n id,\n U, D,\n ABS(0.5 - U/(U+D) - 3.5*SQRT(U*D / ((U+D) * (U+D) * (U+D+1)))) + \n   ABS(0.5 - U/(U+D) + 3.5*SQRT(U*D / ((U+D) * (U+D) * (U+D+1)))) as Score\nFROM @VoteStats \nORDER BY Score \n```\n\nRunning this produces the following table (as of Oct. 
24, 2015):\n\n| parentid | url      | U   | D  | Score             |\n|----------|---------|-----|----|-------------------|\n| 1641219  | http://stackoverflow.com/questions/1641305 | 100 | 58 | 0.267581687129904 |\n| 366980   | http://stackoverflow.com/questions/367082  | 55  | 29 | 0.360985397926758 |\n| 904928   | http://stackoverflow.com/questions/904941  | 44  | 40 | 0.379197639329681 |\n| 1641219  | http://stackoverflow.com/questions/1945699 | 49  | 23 | 0.382002382488145 |\n| 734368   | http://stackoverflow.com/questions/734910  | 48  | 30 | 0.38315203605798  |\n| 7479442  | http://stackoverflow.com/questions/7479473 | 46  | 23 | 0.394405318873308 |\n| 620367   | http://stackoverflow.com/questions/620397  | 42  | 24 | 0.411383595098925 |\n| 969285   | http://stackoverflow.com/questions/969324  | 49  | 20 | 0.420289855072464 |\n| 1566266  | http://stackoverflow.com/questions/1566285 | 39  | 24 | 0.424918292799399 |\n\n\nThe closer the score is to 0, the more controversial it is. Take a look at the answers' comments to see debates about why each answer is controversial. \n\n\n## 2-Spaces vs 4-Spaces\n\nLet's not argue: let's look at the empirical data. I looked at over 23 thousand Python repos and [computed what the most common](https://github.com/CamDavidsonPilon/PyconCanada2015/blob/master/analysis/indent_analysis.py) indenting practice was in each repo. The results were strongly in favor of 4-spaces: **88% of repos used 4-spaces, and only 7% of repos used 2-spaces**. What about the remaining 5%? Well, some repos used 8-spaces, and some used 1-space! Examples: https://github.com/aqt01/UnderWaterWorld uses 8-spaces, and https://github.com/sanglech/CSC326 uses 1-space. 
\n\n![indent](http://i.imgur.com/6jSwYrO.png)\n\n## What is the most popular testing framework?\n\nPassing through the tens of thousands of repos, [I looked for imports](https://github.com/CamDavidsonPilon/PyconCanada2015/blob/master/analysis/test_frameworks.py) of the most popular testing libraries: pytest, unittest, nose and testify. Here are the results:\n\n| package  | count | percent of total |\n|----------|-------|------------------|\n| None     | 22162 |       86%        |\n| unittest |  3032 |       12%        |\n| nose     |   379 |      1.5%        |\n| pytest   |   293 |        1%        |\n| testify  |     4 |       ~0%        |\n\n\n![testing](http://i.imgur.com/XfgpH4O.png)\n\n## What about using Python for functional programming?\n\nIf you are going to use Python for functional programming, or semi-functional programming, you're probably going to be using libraries like `functools`, `itertools`, `toolz` and others. How many Python repos use this style of programming? The data shows that about 15% of repos do. \n\n\n## How often do we disobey *flat is better than nested*?\n\n```\nfrom com.sun.org.apache.xerces.internal.impl.io import \\\n            MalformedByteSequenceException\n```\n(from [here](https://github.com/kurtmckee/listparser/blob/c280d6619241cb6e46ec1f708063f0c05b28933c/listparser/__init__.py#L66-L67))\n \nIs this ugly or beautiful? Python says it's ugly - after all, *flat is better than nested*. How often do we break this? For this, I looked at the *maximum* import nest in each repo. Here's the breakdown: \n\n\n![nest_dist](http://i.imgur.com/ShGQFlF.png)\n\n\n## Topic modelling Python source code using LDA\n\nWhat happens when we apply a topic modelling algorithm, like Latent Dirichlet Allocation, to hundreds of thousands of Python source code files? To be clear: this is not something you usually do! Topic modelling is meant for articles and reviews: human-readable text. 
Python code, on the other hand, is full of keywords in illogical order, repeats the same words over and over again, and developers use odd acronyms and abbreviations for all their variables! But, let's try it anyway. \n\nAfter training LDA on the repos and library I downloaded, I came out with [these topics](https://github.com/CamDavidsonPilon/PyconCanada2015/blob/master/analysis/python_topics.txt). For example, we can see the topic:\n\n\u003e python, version, package, author, setup, description, language, copyright, packages, license\n\nobviously this is the setup.py topic\n\n\u003e test, equal, case, tests, foo, unittest, equals, suite, result, expected\n\nthis is the testing topic,\n\n\u003e grid, color, plot, plt, label, step, data, width, ax, size\n\nthe matplotlib plotting topic.\n\n\nSee if you can find others in the output above. \n\n\n## Conclusion\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamdavidsonpilon%2Fpyconcanada2015","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamdavidsonpilon%2Fpyconcanada2015","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamdavidsonpilon%2Fpyconcanada2015/lists"}