{"id":16283372,"url":"https://github.com/zhudotexe/fireball-data-processing","last_synced_at":"2025-08-26T06:36:44.132Z","repository":{"id":93897783,"uuid":"525070859","full_name":"zhudotexe/FIREBALL-data-processing","owner":"zhudotexe","description":"The data preprocessing code for FIREBALL","archived":false,"fork":false,"pushed_at":"2023-05-08T18:07:28.000Z","size":1750,"stargazers_count":9,"open_issues_count":0,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-03T09:21:23.734Z","etag":null,"topics":["aws-kinesis-firehose","dataset"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhudotexe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-08-15T17:06:49.000Z","updated_at":"2024-09-20T18:54:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"990cce0d-ca3d-4505-bf23-2bcb49b43062","html_url":"https://github.com/zhudotexe/FIREBALL-data-processing","commit_stats":null,"previous_names":["zhudotexe/fireball-data-processing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zhudotexe/FIREBALL-data-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhudotexe%2FFIREBALL-data-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhudotexe%2FFIREBALL-data-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhudotexe%2FFIREBALL-data-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhudotex
e%2FFIREBALL-data-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhudotexe","download_url":"https://codeload.github.com/zhudotexe/FIREBALL-data-processing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhudotexe%2FFIREBALL-data-processing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272186211,"owners_count":24888333,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-26T02:00:07.904Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-kinesis-firehose","dataset"],"created_at":"2024-10-10T19:13:18.951Z","updated_at":"2025-08-26T06:36:44.111Z","avatar_url":"https://github.com/zhudotexe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FIREBALL Data Processing\n\n## Downloading raw data\n\nI recommend downloading the raw dataset into a directory at `data/` (relative to this repo's root). 
Usually this is done\nwith `aws s3 sync`.\n\nFor the FIREBALL project, you can extract the dataset release here.\n\n## Requirements\n\nThe pipeline requires Python 3.10+.\n\nI recommend creating a virtual environment to install the Python requirements:\n\n```bash\n# installing Python requirements\n$ python --version\nPython 3.10.2\n$ python -m venv venv\n$ source venv/bin/activate\n# If the venv is already set up, you can skip to this step\n(venv) $ pip install -r requirements.txt\n```\n\n# FIREBALL Processing Pipeline\n\nTo reproduce the FIREBALL data processing, run these python scripts in this order:\n\n- `distill1_time_group.py`\n- `distill2_authors.py`\n- `distill3a_ic_regex.py`\n- `distill3b_ic_classifier_gpt.py`\n- `distill4_normalize.py` (the results of this step are included in the FIREBALL release for all instances)\n- `finetune_prep.py`\n\nThe resulting data will be in the `extract/` directory.\n\n# Data Explorer\n*originally AWS Kinesis Dataset Exploration Tool*\n\n\u003e **NOTE**: This code is not part of the FIREBALL preprocessing pipeline; it was used to explore the dataset and iterate\n\u003e on heuristics early in the project. 
You can use this code to explore the dataset as well.\n\nThis repo contains a set of tools for exploring datasets collected via AWS Kinesis Firehose quickly and\nintuitively, while providing the framework to quickly iterate on heuristics and visualize raw data.\n\n- Operates directly on gzipped JSONL files output by AWS Kinesis Firehose, no extraction needed\n- Memory efficient (streaming heuristic applicator)\n- High-throughput and horizontally scalable (multiprocessing out of the box)\n- Low-latency (streaming API \u0026 client)\n\nI built this tool for the FIREBALL project (https://www.cis.upenn.edu/~ccb/language-to-avrae.html) and this is the\ndataset this repo is implemented for, but the tool is designed with some degree of dataset-agnosticism in mind.\n\nFor help using this tool in your own data science project, contact me at andrz@seas.upenn.edu or view \"Customizing the\nExplorer\" below.\n\n## Step 1: Heuristics\n\nThe first step in exploring the dataset is to define and apply heuristics to it.\n\n### Defining Heuristics\n\nTo define a heuristic, add a function to the ``heuristics`` module that takes an iterator of event dicts (a combat\nsession) and returns a single float (we'll use this later - the scale and meaning can be fairly arbitrary). Make sure to\nimport any added heuristics in ``heuristics/__init__.py``.\n\n### Applying Heuristics\n\nNext, you should compute each heuristic over the dataset - to do this efficiently, run `python heuristic_worker.py`.\nThis will compute each defined heuristic for each combat instance in parallel and save the results\nto `heuristic_results/`.\n\nIf a heuristic has been computed for the dataset previously (based on heuristic name and dataset checksum), it will not\nbe recomputed. 
**Make sure to delete any prior result from your output directory or run with `--force-recompute` after\nmodifying heuristic code.**\n\nThe heuristic worker includes some additional arguments for more fine-grained control. You can view these arguments\nwith `python heuristic_worker.py --help`:\n\n```text\nusage: heuristic_worker.py [-d DATA_DIR] [-o OUTPUT_DIR] [-h HEURISTIC] [--force-recompute] [--help]\n\nApplies defined heuristics to a dataset.\n\noptions:\n  -d DATA_DIR, --data-dir DATA_DIR\n                        the directory containing the raw data (default: data/)\n  -o OUTPUT_DIR, --output-dir OUTPUT_DIR\n                        the directory to save the heuristic results to (default: heuristic_results/)\n  -h HEURISTIC, --heuristic HEURISTIC\n                        the heuristic(s) to run (defaults to all)\n  --force-recompute     forces the worker to recompute regardless of prior computation\n  --help                displays CLI help\n```\n\n## Step 2: Exploration\n\nAfter defining and computing some heuristics, the next step is to open up the dataset in the *dataset explorer* and\nview each recording instance empirically alongside the computed heuristics.\n\n### Requirements\n\nBuilding the explorer app locally is optional - the prebuilt files can be downloaded from TODO.\n\nTo build the explorer web app locally, Node.js 16+ is required.\n\nThe explorer app uses some [modern web technologies](https://caniuse.com/mdn-api_textdecoderstream) that are not yet\nsupported by all browsers; Chrome 71+, Firefox 105+, Edge 79+, Safari 14.1+, or Opera 58+ is required (IE is not\nsupported).\n\n```bash\n# installing Node requirements (optional)\n$ node --version\nv16.17.0\n$ npm --version\n8.17.0\n$ cd explorer\n$ npm install\n```\n\n### Build Explorer App\n\nThe explorer app is a Vue app that lives in `explorer/`. 
To build it, run `npm run build` from the explorer directory.\n\nAlternatively, you can download a prebuilt distribution from \n[this repo's CI pipeline](https://github.com/zhudotexe/aws-kinesis-dataset-exploration-tool/actions?query=branch%3Amain+is%3Asuccess)\n(click on the latest run and download the `explorer-dist` artifact).\nTo use the prebuilt distribution, create the `explorer/dist` directory and extract it to that directory. The project \nfile structure should look like this:\n\n```text\naws-kinesis-dataset-exploration-tool/\n    explorer/\n        dist/\n            assets/\n                index.***.css\n                index.***.js\n            index.html\n```\n\n### Run Explorer App\n\nThis project provides a simple local web app for exploring the dataset. Run `python explorer_server.py` and the explorer will\nbe served at `http://127.0.0.1:31415/explorer`.\n\nSimilarly to the heuristic worker, you can point the explorer to an alternate dataset directory and heuristic results\ndirectory by setting the `DATA_DIR` and `HEURISTIC_DIR` environment variables, respectively.\n\n### Customizing the Explorer\n\nThe implementation of the explorer in this repo is built for the Avrae NLP project. To use this tool for your own\nproject, you will need to change the event visualizer and models.\n\n#### Define Custom Event Models (Optional)\n\nThis step provides typed interfaces to make building the custom event visualizer easier. You can skip this step\nby setting `type AnyEvent = any` in `explorer/src/events.ts` if you do not need static typing.\n\nOtherwise, create a directory in `explorer/src` for your own dataset, and define your event type(s) as a TypeScript\ninterface. Once you've done that, set the `AnyEvent` type to your newly defined type(s) in `explorer/src/events.ts`.\n\n#### Define Event Key\n\nTo use this tool's event annotation features, each event should have a unique ID. 
To define how to extract this\nID from an event, implement `getEventId(event: AnyEvent): string | null` in `explorer/src/events.ts`.\n\nIf the function returns `null` for an event, that event will not support annotations.\n\n#### Custom Event Visualizer\n\nTo visualize an event, define a Vue component in your dataset-specific directory that takes your event as a prop. Use\nthis template:\n\n```vue\n\u003c!-- explorer/src/(dataset)/EventComponent.vue --\u003e\n\u003cscript setup lang=\"ts\"\u003e\n// don't change this!\nimport type {AnyEvent} from \"@/events\";\ndefineProps\u003c{event: AnyEvent}\u003e();\n// any custom JS logic goes here\n\u003c/script\u003e\n\n\u003ctemplate\u003e\n\u003c!-- by default, this displays the event as JSON - update the template to your liking --\u003e\n\u003cpre\u003e\n    {{ event }}\n\u003c/pre\u003e\n\u003c/template\u003e\n\n\u003cstyle scoped\u003e\n/* css goes here */\n\u003c/style\u003e\n```\n\nThen, update the import in `explorer/src/views/InstanceViewer.vue` to use your dataset-specific component:\n\n```vue\n\u003c!-- explorer/src/views/InstanceViewer.vue --\u003e\n\u003cscript setup lang=\"ts\"\u003e\n[...]\nimport EventComponent from \"@/(dataset)/EventComponent.vue\";  // change this line!\n[...]\n\u003c/script\u003e\n```\n\n### Gotchas\n\nAs JavaScript's `number` type loses precision for integers greater than 2^53, the explorer will automatically parse\nany integer that would otherwise cause rounding as a [`BigNumber`](https://github.com/MikeMcl/bignumber.js).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhudotexe%2Ffireball-data-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhudotexe%2Ffireball-data-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhudotexe%2Ffireball-data-processing/lists"}