{"id":16114725,"url":"https://github.com/solarliner/twemoji-zipf-test","last_synced_at":"2025-04-06T08:17:02.720Z","repository":{"id":80989329,"uuid":"101471516","full_name":"SolarLiner/twemoji-zipf-test","owner":"SolarLiner","description":"An experiment in Node to test Twemoji usage and correlation to Zipf's law.","archived":false,"fork":false,"pushed_at":"2017-08-26T19:20:23.000Z","size":50,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"develop","last_synced_at":"2025-02-12T13:48:16.540Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SolarLiner.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-26T07:51:44.000Z","updated_at":"2017-08-26T19:20:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"d5030372-3250-486e-b9c6-1dbf14fd8d4b","html_url":"https://github.com/SolarLiner/twemoji-zipf-test","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolarLiner%2Ftwemoji-zipf-test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolarLiner%2Ftwemoji-zipf-test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolarLiner%2Ftwemoji-zipf-test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolarLiner%2Ftwemoji-zipf-test/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/SolarLiner","download_url":"https://codeload.github.com/SolarLiner/twemoji-zipf-test/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247451665,"owners_count":20940944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-09T20:15:34.220Z","updated_at":"2025-04-06T08:17:02.700Z","avatar_url":"https://github.com/SolarLiner.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"  \n  \n# Twemoji and the Zipf law\n  \n  \n* [Twemoji and the Zipf law](#twemoji-and-the-zipf-law )\n\t* [Preface](#preface )\n\t\t* [The Zipf law?](#the-zipf-law )\n\t\t\t* [It's \"everywhere\"](#its-everywhere )\n\t\t\t* [Emoji and Zipf's law](#emoji-and-zipfs-law )\n\t* [The experiment](#the-experiment )\n\t\t* [Getting data from Twitter](#getting-data-from-twitter )\n\t\t* [Processing data from Twitter](#processing-data-from-twitter )\n* [Project Log](#project-log )\n\t* [Day 1: Starting the project... by doing something else.](#day-1-starting-the-project-by-doing-something-else )\n\t* [Day 2: Nodejs x Twitter themed allnighter](#day-2-nodejs-x-twitter-themed-allnighter )\n\t\t* [Actually coding...](#actually-coding )\n  \n## Preface\n  \n  \nI recently came across the Twemoji tracker from a [Vsauce DONG](https://www.youtube.com/watch?v=d1RPFzZN3Ro ) and marvelled at all the real-time, flashing lights of emojis being used all over Twitter. But then, as I looked at the pattern, my subconscious started screaming \"ZIPF Law!\". 
That is, indeed, a weird thing to scream; but this is what got me started on this journey. I started wondering about the frequency at which emoji were used.\n  \nAt the time of writing, the *face with tears of joy emoji* 😂 is the most used emoji on [Emoji tracker](http://emojitracker.com ), and is also the most frequently sent.\n  \n### The Zipf law?\n  \n  \n\u003e Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.\n  \nThis quote from the [Wikipedia entry on Zipf's law](http://en.wikipedia.org/wiki/Zipf%27s_law ) states, in more mathematical terms, that the most used word in a language is used twice as often as the second most used word, three times as often as the third most used word, and so on.\n  \n#### It's \"everywhere\"\n  \n  \nLet's try and see where Zipf's law can be found, starting with the alphabet.  \nI've copied the Wikipedia article on [Frequency Analysis](https://en.wikipedia.org/wiki/Frequency_analysis ) (for a bit of sweet irony) and after some processing to reduce the content to only lowercase letters, this is the resulting bar graph:\n  \n  \nSuch a high correlation does point a finger right at a relation between rank and frequency.\n  \nI know, I know, correlation doesn't imply causation. What if this was the result of any large enough set of letters?\n  \nHere's the same graph but for random data generated from a [random letter sequence generator](http://www.dave-reed.com/Nifty/randSeq.html ):\n  \n  \nYou can see the difference without me having to overlay them. In the first graph, the letter `e` is the most used letter of the whole page, with `t` second, then `a`, then `s`, etc. This actually follows English's letter distribution pretty accurately.  \nThe second set shows `u` to be the most used letter, which is only the 14th most used letter in the plain English set.\n  \nDoes it happen with other languages? 
Sure.\n  \n*(Characters taken from the Japanese version of the Frequency Analysis page)*\n  \nZipf is everywhere you can count the frequency of some natural occurrence. City populations? Check. Earthquake numbers and magnitudes? Check and check.\n  \n#### Emoji and Zipf's law\n  \n  \n*So, do emoji occurrences follow Zipf's law?*\n  \nWell, that's the whole point of this experiment. My hypothesis, when I started this, was that because emoji are a natural occurrence (much like the words we write down), their use should follow Zipf's law. \n  \n## The experiment\n  \n  \nI believe that Twitter was the service of choice when [Matthew Rothenberg](https://github.com/mroth ) chose to display realtime emoji usage. For one, lots of people are using it. And *a lot of data* is coming in: according to the [Emoji tracker source code](https://github.com/mroth/emojitrack-feeder#development-setup ), receiving *all the tweets* will take about 1 MB/s of your bandwidth. This is 8 Mbps, or about the same as a decently compressed 720p video stream from the Internet.\n  \nI wish I had a server in the cloud to make the computation available without downtime, but for now I don't - so if anyone wants to reproduce this at home, you'll need to have a decent internet connection.\n  \n### Getting data from Twitter\n  \n  \nThis is the easy part. The Twitter Streaming API allows us to receive all the tweets, in real time. This is the same API that powers the emoji tracker, and is the perfect fit for the job.\n  \n* Note: API keys are not provided in the repository for security purposes. 
Generate your own and add them to 'api/streaming.json' like so:\n  \n```json\n{\n    \"consumer_key\": \"your_consumer_key\",\n    \"consumer_secret\": \"your_consumer_secret\",\n    \"token\": \"your_access_token_key\",\n    \"token_secret\": \"your_access_token_secret\"\n}\n```\n  \n*The JSON layout follows the [`npm twitter-stream-api`](https://www.npmjs.com/package/twitter-stream-api ) `var keys` format.*\n  \n### Processing data from Twitter\n  \n  \nNow comes the hard part. We need to get the mangled, minified data from the Streaming API and process it to get the message content, and from there, extract the emoji(s).\n  \n# Project Log\n  \n  \nHere lie my thoughts and notes from building the project.  \nAs this will surely become longer than the article itself, it will be removed from here. It is already available [here](Project Log.md ).\n  \n  \n  \n  \n  \n  \n  \n  \n## Day 1: Starting the project... by doing something else.\n  \n  \nI started the project after seeing the Emoji tracker website displaying real time data about emoji usage on Twitter. This gave way to this project, made to test the hypothesis that emoji follow Zipf's law.\n  \nBut I didn't really do anything *directly* aimed at the project. I instead started documenting my intentions first, padded with a bit of an explanation for what's what and what's happening. This in turn gave way to an overly long showcase of Zipf's law through character frequency analysis and usage of Matplotlib - which I finally had a use for. I wanted to work with it quite badly after being first introduced to it. But learning *yet another Python library* takes time and took me away from actually starting to build the project. Oh well...\n  \n## Day 2: Nodejs x Twitter themed allnighter\n  \n  \nI am currently in Vietnam for unrelated matters and my plane back to France leaves at 8:30am, which, with a 2-hour buffer just in case plus the drive there, means I need to wake up at 4am tomorrow.  
\nScrew this \"sleeping\" thing, the flight will be a long 13 hours cramped in an Economy class seat and I will have time to sleep there.\n  \nLet's start writing some actual JS code!\n  \nOr should I say *TS code*, because I'm so fond of strict typing that I'm happy to add another compilation step just for it. Plus VS Code is awesome with TS.\n  \nI'm building the project with Node v8 and TS v2.5. Nodejs stuff is still new to me so I'm bound to make mistakes.\n  \nWish me luck...\n  \n### Actually coding...\n  \n  \n... But only after forcing Git to get its act together. Wasted time for nothing trying to sync the project to GitHub. Oh well, I have (half of) my night for myself.\n  \nLet's start by defining the processing classes first, this way I can save the headaches of OAuth2 for later.\n  \nThis is how I want to do things: basically, you need to process the input stream as fast as possible in order to cope with the amount of information and not queue items on Twitter's side (and eventually end up being shut out). I want to implement a multi-process setup but first we'll do it by separating the reading stream from the processing stream.\n  \n```mermaid\ngraph TD\n    subgraph Stream parser\n    in[Stream Input] --\u003e tsh[Twitter Stream Handler]\n    tsh --Creates Tweet instances for each delimited tweet in chunk--\u003e tweetraw[Create tweet from data]\n    tweetraw --\u003e tweetqueue[\"Push to Queue (FIFO)\"]\n    tweetqueue--\u003ein\n    end\n    subgraph Data handler\n    tweetqueue.pop[Get from Queue]--\u003eprocess[Parse Emoji]\n    process--\u003eaddtodb[Add row to CSV]\n    addtodb--\u003etweetqueue.pop\n    end\n    tweetqueue-.Shared data array.-tweetqueue.pop\n```\n  \nThis system decouples the parsing and the processing - and the shared data array could easily be atomic. \n  \nBut I also want snapshots from different running lengths (30 minutes, 1 hour and 6 hours) from which I could derive data. 
This can be done like this:\n```mermaid\ngraph LR\n    start[Start]--\u003etimetest{Check timing}\n    longesttest{Was it the longest interval?}-- Yes --\u003eerasecurrent[Recreate working database]\n    erasecurrent--\u003etimetest\n    longesttest-- No --\u003etimetest\n    timetest-- It's been 30 minutes --\u003ecopy30[\"Copy file to 30min.csv\"]\n    copy30--\u003elongesttest\n    timetest-- It's been 1 hour --\u003ecopy60[\"Copy file to hour.csv\"]\n    copy60--\u003elongesttest\n    timetest-- It's been 6 hours --\u003ecopy3600[\"Copy file to 6hour.csv\"]\n    copy3600--\u003elongesttest\n```\n  \nThis allows the snapshots to remain available until overwritten, and lets the data be written continuously.\n  \nDeleting and recreating the file might take some time, and several tweets might arrive in the meantime. I can see two solutions for this:\n  \n1. Explicitly hold off consuming the queue while waiting for the file to be available again.\n1. Be careless about the lost tweets.\n  \nThe first solution might appear good to a perfectionist - but here the goal is to collect data about emoji usage, not about Twitter itself; therefore I can accept losing tweets for a few seconds. This is the reason I won't add the REST API for times when the tweets don't come through completely - loss of information is okay, only the tally matters, in the end.\n  \n  ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsolarliner%2Ftwemoji-zipf-test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsolarliner%2Ftwemoji-zipf-test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsolarliner%2Ftwemoji-zipf-test/lists"}