{"id":18074981,"url":"https://github.com/lnsp/trace-explorer","last_synced_at":"2025-04-12T07:10:53.873Z","repository":{"id":43231544,"uuid":"464544518","full_name":"lnsp/trace-explorer","owner":"lnsp","description":"Toolset to explain and visualize database workload traces and benchmark data points.","archived":false,"fork":false,"pushed_at":"2024-01-15T23:14:51.000Z","size":143,"stargazers_count":8,"open_issues_count":2,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-26T02:22:48.261Z","etag":null,"topics":["database","duckdb","parquet","python","traces"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lnsp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-28T15:48:57.000Z","updated_at":"2025-01-05T12:32:00.000Z","dependencies_parsed_at":"2024-10-31T10:44:08.004Z","dependency_job_id":"0e797260-e0b8-412e-808b-877cb2016349","html_url":"https://github.com/lnsp/trace-explorer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lnsp%2Ftrace-explorer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lnsp%2Ftrace-explorer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lnsp%2Ftrace-explorer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lnsp%2Ftrace-explorer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/lnsp","download_url":"https://codeload.github.com/lnsp/trace-explorer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248530575,"owners_count":21119600,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","duckdb","parquet","python","traces"],"created_at":"2024-10-31T10:44:02.148Z","updated_at":"2025-04-12T07:10:53.831Z","avatar_url":"https://github.com/lnsp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Trace Explorer\n\nThis repository contains the source code of Trace Explorer, a toolset to explain and visualize database workload traces and benchmark datapoints.\n\n## Installation\n\n```bash\n# Download the latest trace-explorer.zip from nightly.link\nwget -O trace-explorer.zip https://nightly.link/lnsp/trace-explorer/workflows/lint/main/trace-explorer.zip\n# Unzip the archive\nunzip trace-explorer.zip\n# Install onto your machine\npip install trace_explorer-1.1-py3-none-any.whl\n# Remove zip folder and wheel\nrm trace-explorer.zip trace_explorer-1.1-py3-none-any.whl\n```\n\n**Congrats, you can now use the `trace_explorer` command.**\n\nYou can also go to [the latest Build \u0026 Lint workflow](../../actions/workflows/lint.yml) and download the latest `trace-explorer.zip` under the *Artifacts* section manually.\n\n## Web interface\n\nTrace Explorer comes with an integrated web interface. 
To start it on your local machine, enter\n\n```\ntrace_explorer web\n```\n\nand a web server will run on port 5000.\n\n\u003e Note: The web UI may not expose all features of the command-line interface.\n\n## Command-line interface\n\nThe first step in exploring your measurements is data preparation. Trace Explorer assists you in a multitude of ways, such as automatically exploring different strategies to maximize dataset variance.\n\n```mermaid\nflowchart TD\nA[Original dataset] --\u003e |User-specified transformer handling JSON, XML, ...| B[Common parquet format]\nB --\u003e |Specify new columns via SQL| B{Parquet file}\nB --\u003e |Join with other dataset| B\nB --\u003e |Filter outliers or other conditions| B\nB --\u003e |Visualize samples| C[Plot as PDF]\nB --\u003e |Compare with other dataset| C\n```\n\n### Turning the raw data into a parquet table\n\n\u003e In case you want to follow along and test out the commands but do not have the required data at hand, you can generate sample data using the `generator.py` script in the *sample-data* folder. Go down to the section on *Generating sample data*, follow the examples there or return to this step later on.\n\nFirst, you have to convert your data into a common format. We use parquet for storing datasets because of its widespread compatibility and integrated compression. Each measurement must be converted into one row in the common dataset format. We provide [the `Transformer` interface](trace_explorer/transformer.py) to allow users to provide their own format converter. 
You can find examples of custom transformers in the `transformers/` directory.\n\n```bash\ntrace_explorer convert --using myconverter.py --source 'mydataset/*.merged' --output mydatasetcommon.parquet\n```\n\nThe provided transformer has to export a single class called `Transformer` implementing the `trace_explorer.Transformer` abstract class.\n\n```python3\nimport json\nimport trace_explorer.transformer\n\nclass Transformer(trace_explorer.transformer.Transformer):\n    def columns(self):\n        return ['scan', 'join', 'filter']\n\n    def transform(self, content: str):\n        obj = json.loads(content)\n        return [obj['scan'], obj['join'], obj['filter']]\n```\n\n### Find a good preprocessing pipeline\n\nTo maximize the chance of drawing conclusions from the data, a good preprocessing pipeline is essential. We provide a set of common preprocessing primitives, and allow for automatic tuning by optimizing for global variance.\n\nTo speed up processing, we use [DuckDB](https://duckdb.com) and Parquet for storing intermediate data.\n\n```bash\n# Clean up the dataset by dropping entries with ANY column abs zscore \u003e 5\ntrace_explorer clean --zscore 5 --source mydataset.parquet --output mydataset_cleaned.parquet\n\n# Add a new generated column from existing data\ntrace_explorer generate --source mydataset.parquet --query 'select log(1 + execTime) as execTimeLog from dataset'\n\n# Only keep read-only queries\ntrace_explorer generate --source mydataset.parquet --no_copy --query 'select * from dataset where writtenBytes = 0'\n\n# Print out useful dataset stats\ntrace_explorer stats --source mydataset.parquet\n```\n\n### Visualize your dataset\n\nTo make the most sense of your trace, you probably want to visualize your dataset. 
Trace Explorer supports clustering, auto-labeling and visualizing dataset clusters by\n\n- performing clustering on a transformed subset of your original dataset\n- auto-labeling the discovered clusters\n- plotting them in a 2D scatter plot using a TSNE embedding\n- (optional) training a tree classifier to cluster the entire dataset\n\n```bash\n# Generate a plot with auto-labeled clusters\ntrace_explorer visualize --source mydataset.parquet --threshold 5\n```\n\n### Compare different traces\n\nFinding a good way to compare cluster traces is difficult. A good approach when operating on a common or subset/superset feature space is to\n\n- either limit the feature superset to the feature subset OR use a good imputation strategy to generate the missing columns\n- concatenate both datasets\n- take a subset of data to cluster via agglomerative clustering\n- use a random forest classifier to 'learn' the classification\n- apply the classification to a larger set of data and compute a visualization for it as well\n- visualize the larger set of data with the trained classification\n\n```bash\n# Compare both datasets in a single visualization\ntrace_explorer compare --superset dataset1.parquet --subset dataset2.parquet --exclude badcolumns\n```\n\n## Generating sample data\n\nThe repository contains a simple `generator.py` script in the `sample-data` directory. 
It generates multi-dimensional clustered data sampled from Laplace distributions.\n\n```bash\n# Generate a 3-column dataset with a join, scan and filter field with 5 clusters and 100 samples per cluster\nsample-data/generator.py -c join -c scan -c filter -n 300 -k 5 -d sample-data/raw/\n\n# Convert the dataset into parquet\ntrace_explorer convert --using sample-data/transformer.py --destination sample-data/raw.parquet --source 'sample-data/raw/*.json'\n\n# Visualize the dataset, writes a file named plot.pdf\ntrace_explorer visualize --source sample-data/raw.parquet --threshold 20\n\n# Compare two sampled datasets\ntrace_explorer compare --superset sample-data/raw.parquet --subset sample-data/raw2.parquet\n```\n\n## Run in debug mode\n\nIf you want to extend Trace Explorer or fix a bug, you might want to run the web frontend\nin debug mode (enabling hot reloading of all relevant components).\n\n```bash\n# NOBROWSER will disable opening a browser up for the application.\nNOBROWSER=yes flask --app trace_explorer.web --debug run\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flnsp%2Ftrace-explorer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flnsp%2Ftrace-explorer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flnsp%2Ftrace-explorer/lists"}