{"id":13418491,"url":"https://github.com/anvaka/common-words","last_synced_at":"2025-05-07T21:10:39.502Z","repository":{"id":141680365,"uuid":"76376974","full_name":"anvaka/common-words","owner":"anvaka","description":"visualization of common words in different programming languages","archived":false,"fork":false,"pushed_at":"2024-07-25T06:17:37.000Z","size":12020,"stargazers_count":505,"open_issues_count":3,"forks_count":27,"subscribers_count":21,"default_branch":"master","last_synced_at":"2024-07-31T22:43:12.485Z","etag":null,"topics":["clouds","language","visualization"],"latest_commit_sha":null,"homepage":"https://anvaka.github.io/common-words","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anvaka.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-12-13T16:22:38.000Z","updated_at":"2024-07-25T06:17:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"4447400c-3318-4a1d-9241-2741ad1b4b40","html_url":"https://github.com/anvaka/common-words","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fcommon-words","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fcommon-words/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fcommon-words/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fcommon-words/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/anvaka","download_url":"https://codeload.github.com/anvaka/common-words/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252954410,"owners_count":21830905,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clouds","language","visualization"],"created_at":"2024-07-30T22:01:02.892Z","updated_at":"2025-05-07T21:10:39.485Z","avatar_url":"https://github.com/anvaka.png","language":"JavaScript","funding_links":[],"categories":["TODO scan for Android support in followings"],"sub_categories":[],"readme":"# [Common words](https://anvaka.github.io/common-words/#?lang=js)\n\nThis visualization shows which words are used most often in different programming\nlanguages.\n\nThe index was built between mid/end of 2016 from `~3 million` public open source\nGitHub repositories. Results are presented as word clouds and text:\n\n![demo](https://raw.githubusercontent.com/anvaka/common-words/master/docs/main_screen.png)\n\nBelow is description of hows and whys. If you want to explore visualizations -\nplease click here: [common words](https://anvaka.github.io/common-words/#?lang=js).\n\n# Tidbits\n\n* I store the most common words from many different programming languages as part of this\nrepository. GitHub's language recognition treats this repository as mostly C++. It makes sense\nbecause many of those languages were inspired by C/C++:\n![github thinks it C++](https://raw.githubusercontent.com/anvaka/common-words/master/docs/languages.png)\n\n* License text is commonly put into comments in every programming language. 
Of all languages,\nJava code was the winner with `127` words out of `966` coming from license text:\n![lots of license in Java](https://raw.githubusercontent.com/anvaka/common-words/master/docs/java-license.png)\n  * In fact, it was so overwhelming that I decided to filter out license text.\n\n* `Lua` is the only programming language that has a swear word in the top 1,000. [Can you find it?](https://anvaka.github.io/common-words/#?lang=lua)\n* In `Go` [`err` is as popular as `return`](https://anvaka.github.io/common-words/#?lang=go).\nHere is [why](https://twitter.com/anvaka/status/813505093458767873).\n\nIf you find more interesting discoveries - please let me know. I'd be happy to include them here.\n\n# How?\n\nI extracted individual words from the [github_repos](https://bigquery.cloud.google.com/dataset/bigquery-public-data:github_repos)\ndata set using BigQuery. A word is extracted along with the top 10 lines of code where\nthis word appeared.\n\nI apply several constraints before saving individual words:\n\n* The line where this word appears should be shorter than 120 characters. This helps\nme filter out code not written by a human, like minified JavaScript.\n* I ignore punctuation (`, ; : .`), operators (`+ - * ...`) and `numbers`. So if the line is\n`a+b + 42`, then only two words are extracted: `a` and `b`.\n* I ignore lines with \"license markers\" - words that predominantly appear inside license text\n(e.g. `license`, `noninfringement`, [etc.](https://github.com/anvaka/common-words/blob/master/data-extract/ignore/index.js)). License text is very common in code.\n It was interesting to see at the beginning, but overwhelming at the end, so I filtered it out.\n* Words are case sensitive: `This` and `this` are counted as two separate words.\n\n## How was the data collected?\n\n\u003eIn this section we take a deeper look at word extraction. 
If you are not interested, [jump to the word cloud algorithm](#how-word-clouds-are-rendered).\n\nData comes from GitHub's public data set, indexed by BigQuery: [github_repos](https://bigquery.cloud.google.com/dataset/bigquery-public-data:github_repos)\n\nBigQuery stores the contents of each indexed file in a table as plain text:\n\n| File Id   | Content                                       |\n| ----------|:---------------------------------------------:|\n| File 1.h  | // File 1 content\\n#ifndef FOO\\n#define FOO...|\n| File 2.h  | // File 2 content\\n#ifndef BAR\\n#define BAR...|\n\nTo build a word cloud we need a `weight` to scale each word accordingly.\n\nTo get the weight we could split text into individual words, and then group the table by each word:\n\n| Word    | Count|\n|---------|:----:|\n| File    | 2    |\n| content | 2    |\n| ...     | ...  |\n\nUnfortunately, this naive approach does exactly what people don't like about word\nclouds - each word will be taken out of context.\n\nI wanted to avoid this problem, and allow people to explore each word along with\nits contexts:\n\n![context demo](https://raw.githubusercontent.com/anvaka/common-words/master/docs/context_demo.gif)\n\nTo achieve this, I created a temporary table ([code](https://github.com/anvaka/common-words/blob/master/data-extract/sql/get_all_top_lines.sql))\nthat counts lines instead of individual words:\n\n| Line              | Count |\n|-------------------|:-----:|\n| // File 1 content |  1    |\n| #ifndef FOO       |  1    |\n| #define FOO       |  1    |\n| ...               | ...   |\n\nThis gave me \"contexts\" for each word and reduced the overall data size from a couple of terabytes\nto `~12GB`.\n\nTo get the top words from this table we can employ the previously mentioned technique of splitting line content\ninto individual words and then grouping the table by each word. 
We can also get a word's\ncontext if we keep the original line in an intermediate table:\n\n\n| Line              | Word     |\n|-------------------|:--------:|\n| // File 1 content | File     |\n| // File 1 content | content  |\n| #ifndef FOO       | ifndef   |\n| #ifndef FOO       | FOO      |\n| ...               | ...      |\n\nFrom this intermediate representation we can use a SQL window function to group by word\nand get the top 10 lines for each word (more info here: [Select top 10 records for each category](http://stackoverflow.com/questions/176964/select-top-10-records-for-each-category)).\n\nThe current extraction code can be found here: [extract_words.sql](https://github.com/anvaka/common-words/blob/master/data-extract/sql/extract_words.sql)\n\n**Note 1:** My SQL-fu is in kindergarten, so please let me know if you find an error or\na more appropriate way to get the data. While the current script is working, I think\nthere may be cases where results are slightly skewed.\n\n**Note 2:** [BigQuery](https://bigquery.cloud.google.com/) is amazing. It is powerful, flexible, and fast. 
Huge kudos\nto the amazing people who work on it.\n\n## How are word clouds rendered?\n\nAt the heart of word clouds lies a very simple algorithm:\n\n```\nfor each word `w`:\n  repeat:\n    place word `w` at random point (x, y)\n  until `w` does not intersect any other word\n```\n\nTo prevent the inner loop from running indefinitely we can try only a limited number of\ntimes and/or reduce the word's font size if it doesn't fit.\n\nIf we step back a little bit from the words, we can formulate this problem in terms\nof rectangles: for each rectangle, try to place it onto a canvas until it doesn't\nintersect any occupied pixel.\n\nObviously, when the canvas is heavily occupied, finding a spot for a new rectangle can\nbecome challenging or not even possible.\n\nVarious implementations tried to speed up this algorithm by indexing occupied space:\n\n* Use a [summed area table](https://en.wikipedia.org/wiki/Summed_area_table) to quickly,\nin O(1) time, tell if a new candidate rectangle intersects anything\nunder it. The downside of this method is that each canvas update requires updating the\nentire table, which is slow;\n* Maintain some sort of [`R-tree`](https://en.wikipedia.org/wiki/R-tree) to quickly\ntell if a new candidate rectangle intersects anything under it. Intersection lookup\nin this approach is slower than in summed area tables, but index maintenance is faster.\n\nI think the main downside of both of these methods is that we can still pick a wrong\ninitial point many times before we find a spot that fits the new rectangle.\n\nI wanted to try something different. I wanted to build an index that would let me\nquickly pick a rectangle large enough to fit my new incoming rectangles.\nMake an index of the free space, not the occupied one.\n\nI chose a [quadtree](https://en.wikipedia.org/wiki/Quadtree) to be my index.\nEach non-leaf node in the tree contains information about how many free pixels\nare available underneath. 
At the most basic level this can immediately answer\nthe question: \"Is there enough space to fit `M` pixels?\". If a quad has fewer available\npixels than `M`, then there is no need to look inside.\n\nTake a look at this quadtree for the JavaScript logo:\n\n![javascript quadtree](https://raw.githubusercontent.com/anvaka/common-words/master/docs/js-quad-tree.png)\n\nEmpty white rectangles are quads with available space. If our candidate rectangle\nis smaller than one of these empty quads, we could immediately place it inside that quad.\n\nA simple approach with a quadtree index gives decent results; however, it is\nalso susceptible to visual artifacts. You can see quadrant borders - no text can\nbe placed on the intersection of quads:\n\n![quad tree artifacts](https://raw.githubusercontent.com/anvaka/common-words/master/docs/quad-tree-split.gif)\n\nThe `largest quad` approach can also miss opportunities. What if there is no single\nquad large enough to fit a new rectangle, but a fit can be found if it is united with\nneighboring quads?\n\nIndeed, uniting quads helps to find spots for new words, and also removes visual\nartifacts. Many quads are united, and the text is likely to appear on the intersection\nof two quads:\n\n![quad tree no artifacts](https://raw.githubusercontent.com/anvaka/common-words/master/docs/quad-tree-no-artifact.gif)\n\n\u003e My final code for quadtree word cloud generation is not released. I don't think\n\u003e it is ready to be reused anywhere else.\n\n## How was the website created?\n\n### Rendering text\n\nOverall I was [happy](https://twitter.com/anvaka/status/801869174502879232) with the achieved\nspeed of word cloud generation. Yet, it was still too slow for the `common-words` website.\n\nI'm using SVG to render each word on a screen. Rendering so many text elements alone\ncan halt the UI thread for a couple of seconds. There is just not enough\nCPU time to squeeze in text layout computation. 
The good news - we don't have to.\n\nInstead of computing the layout of words over and over again every time you open\nthe page, I decided to compute the layout once and store the results in a JSON file.\nThis helped me to focus on UI thread optimization.\n\nTo prevent UI blocking for long periods of time, we need to add words asynchronously.\nWithin one event loop cycle we add N words, and let the browser handle user commands\nand updates. On the second loop cycle we add more, and so on. For these purposes\nI made [anvaka/rafor](https://github.com/anvaka/rafor), which is an asynchronous `for` loop\niterator that adapts and distributes CPU load across multiple event loop cycles.\n\n### Pan and zoom\n\nThe website supports Google-maps-like navigation on the SVG scene. It is also mobile and keyboard friendly.\nAll these features are implemented by the [panzoom](https://github.com/anvaka/panzoom) library.\n\n### Application structure\n\nI'm using [vue.js](https://vuejs.org/) as my rendering framework, mostly because it's very simple and fast.\nSingle file components and hot reload make it fast to develop in.\n\nThe entire application state is stored in a [single object](https://github.com/anvaka/common-words/blob/master/web/src/state/appState.js)\nand individual language files are loaded when the user selects the corresponding element from a drop-down.\n\nAs my message dispatcher I'm using [ngraph.events](https://github.com/anvaka/ngraph.events), a\nvery small message passing library with a focus on speed.\n\nI use [anvaka/query-state](https://github.com/anvaka/query-state) to store the currently\nselected language in the query string.\n\n![query state](https://raw.githubusercontent.com/anvaka/common-words/master/docs/query-state.gif)\n\n# Tools summary\n\n* https://github.com/anvaka/query-state - allows storing application state in\nthe query string. 
Supports bidirectional updates: `query string \u003c-\u003e application state`\n* https://github.com/anvaka/rafor - asynchronous iteration over an array, without\nblocking the UI thread. This module adapts to the amount of work per cycle, so that\nthere is enough CPU time to keep the UI responsive.\n* https://github.com/anvaka/simplesvg - a very simple wrapper on top of SVG DOM\nelements, providing easy manipulation.\n* https://github.com/anvaka/panzoom - a library that allows Google-maps-like panning\nand zooming of an SVG scene.\n\n# Why word clouds?\n\nWord clouds in general are considered bad for several reasons:\n\n* They take words out of their context. So `good` does not necessarily mean something is good (e.g.\nwhen the word `not` was dropped from the visualization)\n* They scale words to fit inside a picture. So the size of a word cannot be trusted\n* They drop some common words (like `a`, `the`, `not`, etc.)\n\nHowever, I was always fascinated by algorithms that fit words inside a given shape to\nproduce a word cloud.\n\nI spent the last couple of months of my spare time developing my own word cloud algorithm.\nAnd this website was born. It was fun :).\n\n# Thank you!\n\nThank you, dear reader, for being curious. I hope you enjoyed this small exploration.\nAlso special thanks to my co-worker, Ryan, who showed me word clouds in the first\nplace. And to my lovely wife who inspires me and encourages me in all my pursuits.\n\n## PS\n\nI also tried to bring word clouds into \"real life\" and created several printed\nproducts (T-Shirts, hoodies and mugs). However, I didn't like the T-Shirts very much,\nso I'm not going to show them here.\n\n[The javascript mug](http://www.zazzle.com/javascript_word_cloud_mug-168756031080597723) -\nI think it is my best real-world word cloud:\n\n![js mug](http://i.imgur.com/2dBcvXU.gif)\n\nFeel free to [buy it](http://www.zazzle.com/javascript_word_cloud_mug-168756031080597723)\nif you love javascript. 
I hope you enjoy it!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanvaka%2Fcommon-words","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanvaka%2Fcommon-words","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanvaka%2Fcommon-words/lists"}