{"id":13436679,"url":"https://github.com/Codecademy/EventHub","last_synced_at":"2025-03-18T21:30:50.710Z","repository":{"id":16709787,"uuid":"19466626","full_name":"Codecademy/EventHub","owner":"Codecademy","description":"An open source event analytics platform","archived":false,"fork":false,"pushed_at":"2022-04-05T11:51:24.000Z","size":838,"stargazers_count":1330,"open_issues_count":1,"forks_count":140,"subscribers_count":125,"default_branch":"master","last_synced_at":"2025-03-11T12:01:45.591Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://tinyurl.com/eventhub","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"flexd/slackinviter","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Codecademy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-05T18:37:02.000Z","updated_at":"2025-02-14T15:51:55.000Z","dependencies_parsed_at":"2022-07-26T08:48:08.266Z","dependency_job_id":null,"html_url":"https://github.com/Codecademy/EventHub","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Codecademy%2FEventHub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Codecademy%2FEventHub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Codecademy%2FEventHub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Codecademy%2FEventHub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Codecademy","download_url":"https://codeload.github.com/Codecademy/EventHub/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.co
m","kind":"github","repositories_count":244217929,"owners_count":20417677,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:00:51.214Z","updated_at":"2025-03-18T21:30:50.691Z","avatar_url":"https://github.com/Codecademy.png","language":"Java","readme":"# EventHub\nEventHub enables companies to do cross device event tracking. Events are joined by their associated user on EventHub and can be visualized by the built-in dashboard to answer the following common business questions\n* what is my funnel conversion rate?\n* what is my cohorted KPI retention?\n* which variant in my A/B test has a higher conversion rate?\n\nMost important of all, EventHub is free and open source.\n\n**Table of Contents**\n- [Quick Start](#quick-start)\n- [Server](#server)\n- [Dashboard](#dashboard)\n- [Javascript Library](#javascript-library)\n- [Ruby Library](#ruby-library)\n\n## Quick Start\n### Playground\nA [demo server](http://codecademy:codecademy@floating-mesa-9408.herokuapp.com/) is available on Heroku and the username/password to access the dashboard is `codecademy/codecademy`.\n\n- [Example funnel query](http://codecademy:codecademy@54.193.159.140/?start_date=20130101\u0026end_date=20130107\u0026num_days_to_complete_funnel=7\u0026funnel_steps%5B%5D=receive_email\u0026funnel_steps%5B%5D=view_track_page\u0026funnel_steps%5B%5D=finish_course\u0026type=funnel)\n- [Example cohort 
query](http://codecademy:codecademy@54.193.159.140/?start_date=20130101\u0026end_date=20130107\u0026row_event_type=receive_email\u0026column_event_type=start_track\u0026num_days_per_row=1\u0026num_columns=11\u0026type=cohort)\n\n### Screenshots\n![Funnel screenshot](https://raw.githubusercontent.com/Codecademy/EventHub/master/funnel-screenshot.png)\n![Cohort screenshot](https://raw.githubusercontent.com/Codecademy/EventHub/master/cohort-screenshot.png)\n\n### Deploy with Heroku\nDevelopers who want to try EventHub can quickly set the server up on Heroku with the following commands. However, please be aware that Heroku's file system is ephemeral and your data will be wiped after the instance is closed.\n```bash\ngit clone https://github.com/Codecademy/EventHub.git\n\ncd EventHub\nheroku create\ngit push heroku master\n\nheroku open\n```\n\n### Required dependencies\n* [java sdk7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html)\n* [maven](http://maven.apache.org)\n\n### Compile and run\n```bash\n# set up proper JAVA_HOME for mac\nexport JAVA_HOME=$(/usr/libexec/java_home)\n\ngit clone https://github.com/Codecademy/EventHub.git\ncd EventHub\nexport EVENT_HUB_DIR=`pwd`\nmvn -am -pl web clean package\njava -jar web/target/web-1.0-SNAPSHOT.jar\n```\n\n### How to run all the tests\n#### Unit/Integration/Functional testing\n```bash\nmvn -am -pl web clean test\n```\n\n#### Manual testing with curl\nComprehensive examples can be found in `script.sh`.\n```bash\ncd ${EVENT_HUB_DIR}; ./script.sh\n```\n\nTest all event related endpoints\n* Add new event\n    ```bash\n    curl -X POST http://localhost:8080/events/track --data \"event_type=signup\u0026external_user_id=foobar\u0026event_property_1=1\"\n    ```\n\n* Batch add new event\n    ```bash\n    curl -X POST http://localhost:8080/events/batch_track --data \"events=[{event_type: signup, external_user_id: foobar, date: 20130101, event_property_1: 1}]\"\n    ```\n\n* Show all event types\n   
 ```bash\n    curl http://localhost:8080/events/types\n    ```\n\n* Show events for a given user\n    ```bash\n    curl http://localhost:8080/users/timeline\\?external_user_id\\=chengtao@codecademy.com\\\u0026offset\\=0\\\u0026num_records\\=1\n    ```\n\n* Show all property keys for the given event type\n    ```bash\n    curl 'http://localhost:8080/events/keys?event_type=signup'\n    ```\n\n* Show all property values for the given event type and property key\n    ```bash\n    curl 'http://localhost:8080/events/values?event_type=signup\u0026event_key=treatment'\n    ```\n\n* Show all property values for the given event type, property key and value prefix\n    ```bash\n    curl 'http://localhost:8080/events/values?event_type=signup\u0026event_key=treatment\u0026prefix=fa'\n    ```\n\n* Show server stats\n    ```bash\n    curl http://localhost:8080/varz\n    ```\n\n* Funnel query\n    ```bash\n    today=`date +'%Y%m%d'`\n    end_date=`(date -d '+7day' +'%Y%m%d' || date -v '+7d' +'%Y%m%d') 2\u003e /dev/null`\n\n    curl -X POST \"http://localhost:8080/events/funnel\" --data \"start_date=${today}\u0026end_date=${end_date}\u0026funnel_steps[]=signup\u0026funnel_steps[]=view_shopping_cart\u0026funnel_steps[]=checkout\u0026num_days_to_complete_funnel=7\u0026eck=event_property_1\u0026ecv=1\"\n    ```\n\n* Retention query\n    ```bash\n    today=`date +'%Y%m%d'`\n    end_date=`(date -d '+7day' +'%Y%m%d' || date -v '+7d' +'%Y%m%d') 2\u003e /dev/null`\n\n    curl -X POST \"http://localhost:8080/events/cohort\" --data \"start_date=${today}\u0026end_date=${end_date}\u0026row_event_type=signup\u0026column_event_type=view_shopping_cart\u0026num_days_per_row=1\u0026num_columns=2\"\n    ```\n\nTest all user related endpoints\n* show paginated events for a given user\n    ```bash\n    curl http://localhost:8080/users/timeline\\?external_user_id\\=chengtao@codecademy.com\\\u0026offset\\=0\\\u0026num_records\\=5\n    ```\n\n* show information of users who have matched property keys 
\u0026 values\n    ```bash\n    curl -X POST http://localhost:8080/users/find --data \"ufk[]=external_user_id\u0026ufv[]=chengtao1@codecademy.com\"\n    ```\n\n* add or update user information\n    ```bash\n    curl -X POST http://localhost:8080/users/add_or_update --data \"external_user_id=chengtao@codecademy.com\u0026foo=bar\u0026hello=world\"\n    ```\n\n* Show all property keys for users\n    ```bash\n    curl 'http://localhost:8080/users/keys'\n    ```\n\n* Show all property values for users given property key and (optional) value prefix\n    ```bash\n    curl 'http://localhost:8080/users/values?user_key=hello\u0026prefix=w'\n    ```\n\n#### Load testing with Jmeter\nWe use [Apache Jmeter](http://jmeter.apache.org) for load testing, and the load testing script can be found in `${EVENT_HUB_DIR}/jmeter.jmx`.\n```bash\nexport JMETER_DIR=~/Downloads/apache-jmeter-2.11/\njava -jar ${JMETER_DIR}/bin/ApacheJMeter.jar -JnumThreads=1 -n -t jmeter.jmx -p jmeter.properties\njava -jar ${JMETER_DIR}/bin/ApacheJMeter.jar -JnumThreads=5 -n -t jmeter.jmx -p jmeter.properties\njava -jar ${JMETER_DIR}/bin/ApacheJMeter.jar -JnumThreads=10 -n -t jmeter.jmx -p jmeter.properties\n\n# generate graph (requires matplotlib)\n./plot_jmeter_performance.py 1-jmeter-performance.csv 5-jmeter-performance.csv 10-jmeter-performance.csv\n\n# open \"Track Event.png\"\n```\n\n## Server\n\n### Key observations \u0026 design decisions\nOur goal is to build something usable on a single machine with a reasonably large SSD. Let's say, hypothetically, the server receives 100M events monthly (which might cost you a few thousand dollars per month with a SaaS provider), and each event is 500 bytes without compression. 
In this situation, storing all the events likely takes only a few hundred GB with compression, and chances are that only the data from recent months are of interest.\n\nAlso, to efficiently run basic funnel and cohort queries without filtering, only two forward indices are needed: an event index sharded by event type and an event index sharded by user. Therefore, our strategy is to make those two indices as small as possible so that they fit in memory, and if the client wants to filter events, we build a bloom filter to reject most of the events that are not exact matches. Imagine we are running another hypothetical query, assuming both indices and the bloom filters fit in memory. Say there are 1M events that cannot be rejected and need to hit the disk. Assuming each SSD read takes 16 microseconds, that is roughly 16 seconds of disk reads, so we are talking about sub-minute query time even when none of the data are in memory. In practice, the situation is likely much better, as we cache all the recently hit records and most queries likely only care about the most recent data.\n\nTo simplify the design of the server and store indices compactly so that they fit in memory, we made the following two assumptions.\n\n1. A time is associated with an event when the server receives the event\n2. Date is the finest level of granularity\n\nWith the above two assumptions, we can rely on the server-generated, monotonically increasing id to maintain the total order of the events. In addition, as long as we track the id of the first event on any given date, we do not need to store the time information in the indices (which greatly reduces the size of the indices). 
The direct implications of those assumptions are, first, that if the client chooses to cache some events locally and send them later, the time for those events will be recorded as when the server receives them, not when the user performed those actions; and second, that although the server maintains the total ordering of all events, it cannot answer questions like what the conversion rate is for a given funnel between 2pm and 3pm on a given date.\n\nLastly, since both indices are sharded by event type or user, we can expect their size to shrink significantly with proper compression.\n\n### Architecture\nAt the highest level, `com.codecademy.evenhub.web.EventHubHandler` is the main entry point. It runs a [Jetty](http://www.eclipse.org/jetty) server, reflectively collects supported commands under `com.codecademy.evenhub.web.commands`, handles JSONP requests transparently, handles requests to static resources like the dashboard, and most importantly, acts as a proxy which translates HTTP requests and responses to and from method calls on `com.codecademy.evenhub.EventHub`.\n\n`com.codecademy.evenhub.EventHub` can be thought of as a facade to the key components `UserStorage`, `EventStorage`, `ShardedEventIndex`, `DatedEventIndex`, `UserEventIndex` and `PropertiesIndex`.\n\nFor `UserStorage` and `EventStorage`, at the lowest level, we implemented `Journal{User,Event}Storage` backed by [HawtJournal](https://github.com/fusesource/hawtjournal/) to store underlying records reliably. In addition, when clients are querying records which cannot be filtered by the supported indices, the server will loop through all the potential hits, look up the properties from the `Journal` and then filter accordingly. For better performance, there are also decorators for each storage like `Cached{User,Event}Storage` to support caching and `BloomFiltered{User,Event}Storage` to support fast rejection for filters like `ExactMatch`. 
Please also be aware that each `Storage` maintains a monotonically increasing counter as the internal id generator for each event and user received.\n\nTo make the funnel and cohort queries fast, `EventHub` also maintains three indices, `ShardedEventIndex`, `UserEventIndex`, and `DatedEventIndex`, behind the scenes. `DatedEventIndex` simply tracks the mapping from a given date to the id of the first event received on that day. `ShardedEventIndex` can be thought of as sorted event ids sharded by event type. `UserEventIndex` can be thought of as sorted event ids sharded by user.\n\nLastly, `EventHub` maintains a `PropertiesIndex` backed by [LevelDB JNI](https://github.com/fusesource/leveldbjni) to track which property keys are available for a given event type and which property values are available for a given event type and property key.\n\n### Horizontal scalability\nBecause EventHub never needs to combine information across different users, it can be easily sharded by user and scaled horizontally by placing a broker in front of the EventHub servers.\n\n### Performance\nThe spec of the computer used in the following three experiments can be found in the table below.\n\n| Component      | Spec                                    |\n|----------------|-----------------------------------------|\n| Computer Model | MacBook Pro, Retina 15-inch, Late 2013  |\n| Processor      | 2GHz Intel Core i7                      |\n| Memory         | 8GB 1600 MHz DDR3                       |\n| Software       | OS X 10.9.2                             |\n| JVM            | Oracle JDK 1.7                          |\n\n#### Write performance\nThe following graph was generated as described in [Load testing with Jmeter](#load-testing-with-jmeter). 
The graph shows both the throughput and latency of adding the first one million events (without batching) with different numbers of threads (1, 5, 10, 15).\n![Throughput and latency by threads](http://i60.tinypic.com/16ad66b.png)\n\n#### Query performance\nWhile it is difficult to come up with a generic benchmark, we would rather show something than nothing. After generating about one million events with the load testing script as described in [Load testing with Jmeter](#load-testing-with-jmeter), we ran the four types of queries twice, once right after the server started cleanly and once while the cache was still warm.\n\n| Query                   | 1st execution | 2nd execution | Command |\n|-------------------------|---------------|---------------|---------|\n| Funnel without filters  | 1.15s         | 0.19s         | curl -X POST \"http://localhost:8080/events/funnel\" --data \"start_date=20130101\u0026end_date=20130130\u0026funnel_steps[]=receive_email\u0026funnel_steps[]=view_track_page\u0026funnel_steps[]=start_track\u0026num_days_to_complete_funnel=30\" |\n| Funnel with filters     | 1.31s         | 0.43s         | curl -X POST \"http://localhost:8080/events/funnel\" --data \"start_date=20130101\u0026end_date=20130130\u0026funnel_steps[]=receive_email\u0026funnel_steps[]=view_track_page\u0026funnel_steps[]=start_track\u0026num_days_to_complete_funnel=30\u0026efk0[]=event_property_1\u0026efv0[]=1\" |\n| Cohort without filters  | 0.63s         | 0.13s         | curl -X POST \"http://localhost:8080/events/cohort\" --data \"start_date=20130101\u0026end_date=20130130\u0026row_event_type=receive_email\u0026column_event_type=start_track\u0026num_days_per_row=1\u0026num_columns=7\" |\n| Cohort with filters     | 1.20s         | 0.32s         | curl -X POST \"http://localhost:8080/events/cohort\" --data 
\"start_date=20130101\u0026end_date=20130130\u0026row_event_type=receive_email\u0026column_event_type=start_track\u0026num_days_per_row=1\u0026num_columns=7\u0026refk[]=event_property_1\u0026refv[]=1\" |\n\n#### Memory footprint\nIn the experiment, the server was bootstrapped differently. Instead of using the load testing script, we used subset of data from Codecademy, which has around 53M events and 2.4M users. Please be aware that the current storage format on disk is fairly inefficient and has serious internal fragmentation. However, when the data are loaded to memory, it will be much more efficient as we would never load those \"hole\" pages into memory.\n\n| Key Component             | Size in memory  | Note |\n|---------------------------|-----------------|------|\n| ShardedEventIndex         | 424Mb           | (data size) + (index size) \u003cbr\u003e= (event id size * number of events) + negligible\u003cbr\u003e= (8 * 53M) |\n| UserEventIndex            | 722Mb           | (data size) + (index size) \u003cbr\u003e= (event id size * number of events) + (index entry size * number of users)\u003cbr\u003e= (8 * 53M) + ((numPointersPerIndexEntry * 2 + 1) * 8 + 4) * 2.4M)\u003cbr\u003e= (8 * 53M) + (124 * 2.4M) |\n| BloomFilteredEventStorage | 848Mb           | (bloomfilter size) * (number of events) \u003cbr\u003e= 16 * 53M |\n\n## Dashboard\nThe server comes with a built-in dashboard which is simply some static resources stored in `/web/src/main/resources/frontend` and gets compiled into the server jar file. After running the server, the dashboard can be accessed at [http://localhost:8080](http://localhost:8080). Through the dashboard, you can access the server for your funnel and cohort analysis.\n\n#### Password protection\nThe dashboard comes with insecure basic authentication which send unencrypted information without SSL. Please use it at your own discretion. 
The default username/password is codecademy/codecademy, and you can change it by modifying your web.properties file or by using the following command to start your server:\n```bash\nUSERNAME=foo\nPASSWORD=bar\njava -Deventhubhandler.username=${USERNAME} -Deventhubhandler.password=${PASSWORD} -jar web/target/web-1.0-SNAPSHOT.jar\n```\n\n## Javascript Library\nThe project comes with a JavaScript library which can be integrated with your website as a way to send events to your EventHub server.\n\n### How to run JS tests\n#### Install [karma](http://karma-runner.github.io/0.12/index.html)\n```bash\ncd ${EVENT_HUB_DIR}\n\nnpm install -g karma\nnpm install -g karma-jasmine@2_0\nnpm install -g karma-chrome-launcher\n\nkarma start karma.conf.js\n```\n\n### API\nThe JavaScript library is extremely simple and heavily inspired by Mixpanel. There are only five methods that a developer needs to understand. Be aware that behind the scenes, the library maintains a queue backed by localStorage, buffers the events in the queue, and has a timer that regularly clears the queue. If the browser doesn't support localStorage, an in-memory queue will be created when the EventHub is created. Also, our implementation relies on the server to track the timestamp of each event. Therefore, if a browser session disconnects before all the events are sent, the remaining events will be sent in the next browser session and thus have their timestamp recorded as when the next session starts.\n\n#### window.newEventHub()\nThis method creates an EventHub and starts the timer which clears out the event queue every second (default).\n```javascript\nvar name = \"EventHub\";\nvar options = {\n  url: 'http://example.com',\n  flushInterval: 10 /* in seconds */\n};\nvar eventHub = window.newEventHub(name, options);\n```\n\n#### eventHub.track()\nThis method enqueues the given event, which will be sent in a batch at every flushInterval. 
Be aware that if the identify method has not been called before the track method is called, the library will automatically generate a user id which remains the same for the entire session (it is cleared after the browser tab is closed), and send the generated user id along with the queued event. On the other hand, if `eventHub.identify()` is called before the track method is called, the user information passed along with the identify method call will be merged into the queued event.\n```javascript\neventHub.track(\"signup\", {\n  property_1: 'value1',\n  property_2: 'value2'\n});\n```\n\n#### eventHub.alias()\nThis method links the given user to the automatically generated user. Typically, you only want to call this method once, right after the user successfully signs up.\n```javascript\neventHub.alias('chengtao@codecademy.com');\n```\n\n#### eventHub.identify()\nThis method tells the library to use the given information instead of the automatically generated user information.\n```javascript\neventHub.identify('chengtao@codecademy.com', {\n  user_property_1: 'value1',\n  user_property_2: 'value2'\n});\n```\n\n#### eventHub.register()\nThis method allows the developer to add additional information to the generated user.\n```javascript\neventHub.register({\n  user_property_1: 'value1',\n  user_property_2: 'value2'\n});\n```\n\n### Scenarios and Recipes\n#### Link the events sent before and after a user signs up\nThe following code\n```javascript\nvar eventHub = window.newEventHub('EventHub', { url: 'http://example.com' });\neventHub.track('pageview', { page: 'home' });\neventHub.register({\n  ip: '10.0.0.1'\n});\n\n// after user signup\neventHub.alias('chengtao@codecademy.com');\neventHub.identify('chengtao@codecademy.com', {\n  gender: 'male'\n});\neventHub.track('pageview', { page: 'learn' });\n```\nwill result in a funnel like\n```javascript\n{\n  user: 'something generated',\n  event: 'pageview',\n  page: 'home',\n  ip: '10.0.0.1'\n}\nlink 
'chengtao@codecademy.com' to 'something generated'\n{\n  user: 'chengtao@codecademy.com',\n  event: 'pageview',\n  page: 'learn',\n  gender: 'male'\n}\n```\n\n#### A/B testing\nThe following code\n```javascript\nvar eventHub = window.newEventHub('EventHub', { url: 'http://example.com' });\neventHub.identify('chengtao@codecademy.com', {});\neventHub.track('pageview', {\n  page: 'javascript exercise 1',\n  experiment: 'fancy feature',\n  treatment: 'new'\n});\neventHub.track('submit', {\n  page: 'javascript exercise 1'\n});\n```\nand\n```javascript\nvar eventHub = window.newEventHub('EventHub', { url: 'http://example.com' });\neventHub.identify('bob@codecademy.com', {});\neventHub.track('pageview', {\n  page: 'javascript exercise 1',\n  experiment: 'fancy feature',\n  treatment: 'control'\n});\neventHub.track('skip', {\n  page: 'javascript exercise 1'\n});\n```\nwill result in two funnels like\n```javascript\n{\n  user: 'chengtao@codecademy.com',\n  event: 'pageview',\n  page: 'javascript exercise 1',\n  experiment: 'fancy feature',\n  treatment: 'new'\n}\n{\n  user: 'chengtao@codecademy.com',\n  event: 'submit',\n  page: 'javascript exercise 1'\n}\n```\nand\n```javascript\n{\n  user: 'bob@codecademy.com',\n  event: 'pageview',\n  page: 'javascript exercise 1',\n  experiment: 'fancy feature',\n  treatment: 'control'\n}\n{\n  user: 'bob@codecademy.com',\n  event: 'skip',\n  page: 'javascript exercise 1'\n}\n```\n\n## Ruby Library\nA separate Ruby gem is also available at [https://github.com/Codecademy/EventHubClient](https://github.com/Codecademy/EventHubClient)\n\n## License\nMIT License.  \nCopyright (c) 2022 Codecademy LLC\n\n","funding_links":[],"categories":["Java","Applications","I. Development"],"sub_categories":["4. 
Business"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCodecademy%2FEventHub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCodecademy%2FEventHub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCodecademy%2FEventHub/lists"}