Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jeremy-rifkin/tccpp-ngrams
Trends of words and phrases over time on the Together C & C++ discord server
https://github.com/jeremy-rifkin/tccpp-ngrams
Last synced: 14 days ago
JSON representation
Trends of words and phrases over time on the Together C & C++ discord server
- Host: GitHub
- URL: https://github.com/jeremy-rifkin/tccpp-ngrams
- Owner: jeremy-rifkin
- License: mit
- Created: 2024-09-11T03:23:26.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-29T03:25:10.000Z (3 months ago)
- Last Synced: 2024-11-07T22:13:46.437Z (2 months ago)
- Language: C++
- Homepage: https://projects.rifkin.dev/tccpp-ngrams
- Size: 428 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Together C & C++ Ngrams
This is a project exploring trends of words and phrases (ngrams) over time on the Together C & C++ discord server. It
was inspired by the [google books ngrams project](https://books.google.com/ngrams/).![demo](./screenshots/demo.png)
## Table of Contents
- [About](#about)
- [Privacy](#privacy)
- [Future work](#future-work)# About
On the Together C & C++ Discord server we have built up a database of messages sent on the server since it was created
in 2017. This is something we did out of moderation necessity - we've had to inspect edit and deletion logs countless
times. Having this data, though, provides a cool opportunity for analysis of trends within the data.This repository contains two parts: The aggregation in `src/` and the application code in `ui/` and `server/`.
The messages are stored in a MongoDB database used by the server's discord bot, [Wheatley][wheatley]. The aggregator
code reads documents from MongoDB, excluding private channels, bot ids, and deleted messages, and then tokenizes the
messages. Words are tokenized based on being any alphanumeric string of characters, allowing for `_` as well as `'` and
`-` as long as not at the edges of words. Two passes are performed over the database, the first computes totals for
given ngram sequences. Any ngram sequences that occur fewer than 40 times are excluded. This first pass uses a lot of
memory (the hash maps built up keep count of hundreds of millions of unique ngram sequences) and could be optimized
later if needed, but, for now it's fine. After this the actual aggregation pass is done which performs aggregation for
each counted ngram sequence for every month. Ngram frequency is computed simply as `count / total_tokens_for_month`.
While this simple aggregation of frequency data of short phrases from a fully public message set should not pose privacy
concerns, as a safety measure a small amount of artificial noise is added in, +/-1% uniformly. This is done with RNGs
that are seeded based on a hash of the ngrams and a secret nonce, which is very over-engineered but whatever. Monthly
frequency numbers are written to a DuckDB database which I found was far faster than Sqlite for the types of queries the
application ends up doing, glob queries that don't lend themselves to indexing. DuckDB ends up being really good at
these while Sqlite is more optimized for taking advantage of indices.# Privacy
The data are frequencies of short phrases from messages sent in public channels in the Together C & C++ Discord server.
Messages in private channels, messages sent by bots, and deleted messages are excluded. The frequency data are monthly
aggregates for the whole server and thus is not tied to any particular user. Additionally, the following steps are taken
to further ensure privacy:- [Discord snowflakes][snowflakes] (17 to 19 digit sequences used by discord to uniquely identify everything from users
to channels to emojis) are filtered out. These most commonly appear in user mentions, which appear textually as
`<@331718482485837825>`.
- Noise is added to results (+/-1%, uniformly). This is a measure inspired by [differential privacy][diff] but not using
the same level of mathematical rigor as differential privacy is more tailored to other types of data sets.
- Any words or phrases used less than 40 times throughout the server's history are excluded.Neither the raw nor aggregate data have been made available for download and I have no plans to change that.
# Future work
Future work on this, if I do more, will likely center around performance exploration.
[snowflakes]: https://discord.com/developers/docs/reference#snowflakes
[diff]: https://en.wikipedia.org/wiki/Differential_privacy
[wheatley]: https://github.com/TCCPP/wheatley