https://github.com/keimeno/word-growth-rate-analyzer
Analyzing the growth rate of all words that are actively written by Reddit users.
data-science nodejs statistics typescript
- Host: GitHub
- URL: https://github.com/keimeno/word-growth-rate-analyzer
- Owner: Keimeno
- Created: 2020-09-11T16:50:57.000Z (about 4 years ago)
- Default Branch: develop
- Last Pushed: 2023-01-09T14:40:59.000Z (almost 2 years ago)
- Last Synced: 2023-03-05T13:50:59.475Z (over 1 year ago)
- Topics: data-science, nodejs, statistics, typescript
- Language: TypeScript
- Homepage:
- Size: 344 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# Word Growth Rate Analyzer
## The Goal of this Project
The objective of this project is to automatically analyse, store, and notify about long- and short-term trends. This is done by counting and comparing the occurrences of any given word across given time frames.
## Project Architecture
![WGRA Architecture](WGRA-Architecture.png)
## Technical Risks
### Fluctuation between peak hours
Since users are more active at certain times of the day and on certain days of the week, the raw growth rate of each word will fluctuate strongly. This is not a direct problem, because the growth rate stays meaningful relative to other words, but it makes our dataset harder to read.
To resolve this problem, we calculate the growth rate of each word relative to the occurrences of the most common word, in our case "the". As an example, to calculate the growth rate of the word "hello", we need the following inputs.
| Word | Time Frame | Occurrences |
| :---- | :------------ | :---------- |
| hello | 15:00 - 16:00 | 4000 |
| hello | 16:00 - 17:00 | 6000 |
| the | 15:00 - 16:00 | 120000 |
| the | 16:00 - 17:00 | 150000 |

Given our example input, it may seem as if the word "hello" has a growth rate of 50% between the two time frames (`6000 / 4000 = 1.5`). However, we first need to take the word "the", our most frequent word, as a baseline. Dividing the occurrences of "the" in the second time frame by its occurrences in the first, `150000 / 120000`, gives `1.25`, which is the growth in user activity between those time frames. Next, we multiply the occurrences of "hello" in the first time frame by this activity growth factor, which results in `4000 * 1.25 = 5000`. We can now calculate the true growth rate by dividing the occurrences in the second time frame by the adjusted occurrences of the first, `6000 / 5000 = 1.2`. The true growth rate is therefore 20%, not 50%.
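The calculation above can be sketched in TypeScript as follows. This is only an illustration under stated assumptions: the `FrameCounts` interface and the `normalizedGrowthFactor` function are hypothetical names and are not taken from the project's code.

```typescript
// Hypothetical shape: occurrences of one word in two consecutive time frames.
interface FrameCounts {
  previous: number; // e.g. 15:00 - 16:00
  current: number; // e.g. 16:00 - 17:00
}

// Returns the growth factor of a word, adjusted by the growth of the
// baseline word (the most common word, "the" in the example).
function normalizedGrowthFactor(word: FrameCounts, baseline: FrameCounts): number {
  if (word.previous <= 0 || baseline.previous <= 0) {
    throw new RangeError("previous counts must be positive");
  }
  // Growth in overall user activity: 150000 / 120000 = 1.25
  const activityGrowth = baseline.current / baseline.previous;
  // What the word's previous count would be at the current activity level:
  // 4000 * 1.25 = 5000
  const adjustedPrevious = word.previous * activityGrowth;
  // True growth factor: 6000 / 5000 = 1.2
  return word.current / adjustedPrevious;
}

const hello: FrameCounts = { previous: 4000, current: 6000 };
const the: FrameCounts = { previous: 120000, current: 150000 };

const factor = normalizedGrowthFactor(hello, the);
console.log(`${((factor - 1) * 100).toFixed(1)}%`); // prints "20.0%"
```

Rejecting non-positive previous counts is a design choice made here to avoid dividing by zero; how the project treats words that did not occur in the earlier time frame is not stated in this README.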
### Spam messages
If a user decides to write the word "foobar" hundreds of times in one comment, all occurrences will be added to our database, and the resulting growth rate would be enormous. To resolve this problem, we can implement two potential solutions.
1. We require a minimum threshold. If a word is below this threshold, say 50 occurrences, we won't calculate its growth rate, since the value is not significant enough.
2. Spam messages usually only occur once. If the occurrence count of the word is back to normal after one time frame, we know that it was spam and can ignore it.
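Both ideas can be sketched as small TypeScript checks. Everything here is an assumption made for illustration: the function names, the requirement that both time frames meet the threshold, and the `SPIKE_FACTOR` used to decide what counts as "back to normal" are not taken from the project.

```typescript
// Assumed values; only the threshold of 50 is suggested by the text above.
const MIN_OCCURRENCES = 50;
const SPIKE_FACTOR = 5;

// Solution 1: only calculate a growth rate when the word is frequent enough.
// Assumption: both time frames must meet the threshold.
function isSignificant(previous: number, current: number, min: number = MIN_OCCURRENCES): boolean {
  return previous >= min && current >= min;
}

// Solution 2: treat a jump as spam when the count spikes in one time frame
// and drops back below the spike level in the next one.
function isTransientSpike(
  before: number,
  during: number,
  after: number,
  spikeFactor: number = SPIKE_FACTOR
): boolean {
  const normalLevel = Math.max(before, 1); // avoid a zero baseline
  const spiked = during >= normalLevel * spikeFactor;
  const recovered = after < normalLevel * spikeFactor;
  return spiked && recovered;
}

console.log(isSignificant(30, 400)); // false: previous count is below 50
console.log(isTransientSpike(10, 500, 12)); // true: one-off spike
console.log(isTransientSpike(10, 500, 600)); // false: the word stays frequent
```

Note that the second check can only be evaluated once the following time frame is known, so a notification based on it would be delayed by one time frame.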