{"id":25247062,"url":"https://github.com/gem5/stats-schema","last_synced_at":"2025-10-26T22:30:56.504Z","repository":{"id":55035091,"uuid":"304470196","full_name":"gem5/stats-schema","owner":"gem5","description":"A proposal for a shared statistics schema","archived":false,"fork":false,"pushed_at":"2021-02-05T17:48:27.000Z","size":64,"stargazers_count":5,"open_issues_count":6,"forks_count":3,"subscribers_count":5,"default_branch":"main","last_synced_at":"2023-02-27T05:03:03.399Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gem5.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-15T23:25:08.000Z","updated_at":"2022-04-15T18:09:45.000Z","dependencies_parsed_at":"2022-08-14T09:40:56.637Z","dependency_job_id":null,"html_url":"https://github.com/gem5/stats-schema","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gem5%2Fstats-schema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gem5%2Fstats-schema/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gem5%2Fstats-schema/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gem5%2Fstats-schema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gem5","download_url":"https://codeload.github.com/gem5/stats-schema/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238404863,"owners_count":19466393,"icon_
url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-12T02:56:59.819Z","updated_at":"2025-10-26T22:30:51.227Z","avatar_url":"https://github.com/gem5.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# simstats schema\n\nA proposal for a shared statistics schema for computer architecture simulators.\n\nInitially, we will be targeting the JSON output format, but we plan to support many other output formats in the future (e.g., CSV, HDF5, pandas, etc.).\n\nThis repository is currently maintained by Jason Lowe-Power \u003cjason@lowepower.com\u003e.\nAll questions/comments can be directed toward Jason via email or by creating an issue on this repository.\n\nThe current working group members are\n\n- Jonathan Beard (Arm)\n- Bobby Bruce (gem5, UC Davis)\n- Ahmed Gheith (Arm)\n- Jason Lowe-Power (gem5, UC Davis)\n- Andreas Sandberg (gem5, Arm)\n- Arun Rodrigues (SST, Sandia)\n- Gwen Voskuilen (SST, Sandia)\n\n## Background\n\nThere are many computer architecture simulators (e.g., [gem5](http://www.gem5.org/), [SST](http://sst-simulator.org/), [DRAMSim](https://github.com/umd-memsys/DRAMsim3), and [GPGPU-Sim](http://www.gpgpu-sim.org/)), and each has its own output format, which is often poorly defined.\nThis causes pain for researchers and students using these simulators.\n\nSome pain points include:\n\n- Writing custom text parsing code for each simulator (or multiple times for the same simulator!)\n- Confusion about the meaning of statistics\n- Incompatibility between simulators, especially when used together (e.g., 
gem5+DRAMSim)\n\n**The goal of this working group is to define a common schema for computer architecture simulator statistics.**\nWith this common schema, we hope to enable better compatibility between simulators and to ease the burden on simulator users.\n\n## This repository\n\nThis repository contains a proposal for a statistics schema using [JSON Schema](https://json-schema.org).\n\n### JSON Schema\n\nA good starting guide to JSON Schema is [Understanding JSON Schema](https://json-schema.org/understanding-json-schema/index.html).\nJSON Schema is most closely related to [database schemas](https://en.wikipedia.org/wiki/Database_schema) and simply defines the *format* of statistics.\nSimulators must implement statistic output that follows this schema.\n\nJSON Schema also has the ability to validate an output against the schema.\nHowever, we expect this schema to be used mainly by simulator developers to design their statistic outputs and by visualization developers to visualize and represent those outputs.\nGeneral simulator users shouldn't have to worry about this schema and can simply use the output from the simulators.\n\nThe file [simstats.schema.json](./simstats.schema.json) contains the current draft of the schema.\n\n### Testing the schema\n\nThe `tests` directory contains a simple Python script to test the schema.\n\nTo run the tests, you can use the following code:\n\n```sh\npip3 install -r requirements.txt\ncd tests\npython3 test.py\n```\n\nThe test first validates the schema itself.\nThen, it validates all of the files in [tests/examples](tests/examples).\nDetails of these files can be found in the [README](tests/examples/README.md) in that directory.\nThat directory contains a set of valid and invalid examples of statistics files in JSON format.\n\n## Understanding the schema file\n\nThe schema file begins with a title and description of the overall schema as well as some JSON Schema-specific information.\n\nThen, there is a section of \"definitions.\"\nThese are 
\"types\" that can be used throughout the rest of the schema.\nYou can think of these as specializations of the built-in [JSON types](https://json-schema.org/understanding-json-schema/reference/type.html).\n\n\u003e Note: We may want to break this schema into multiple documents, which JSON Schema supports.\n\nEach of these types also has a title and a description.\nThese serve as documentation that helps the user (the simulator developer, in this case) understand what each definition is supposed to represent.\n\nFor objects, we specify the *properties* that we expect these objects to have.\nFor the most part, properties are *optional* to support simpler/smaller/compressed files.\nHowever, all statistics must have a `value`, and there are a few other required properties.\nSee comments in the schema for details.\n\n### Style\n\n\u003e Note: We will almost certainly want to revisit this.\n\n#### Types\n\nThe current style for defining \"types\" is camel case with a lowercase first letter.\n\n#### Property names\n\nThe current style for property names is camel case with a lowercase first letter.\n\n------------\n\nThe information below is older.\n\n## Related works\n\n### Some generally related works\n\n- Generally, data serialization\n- HPC serialization\n  - Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats\n    - \u003chttps://www.tacc.utexas.edu/documents/1084364/1191938/tacc_stats_hust.pdf\u003e (HPC users workshop)\n    - No schema\n  - Serialization and deserialization of complex data structures, and applications in high performance computing\n    - \u003chttps://mosaic.mpi-cbg.de/docs/Zaluzhnyi2016.pdf\u003e (master's thesis)\n    - Avro\n- System monitoring\n  - Prometheus\n    - \u003chttps://prometheus.io/\u003e\n    - Focused on time series\n    - Doesn't have a set schema except for simple data types\n\n### Projects\n\n- Advanced Scientific Data Format (ASDF, pronounced AZ-diff)\n  - https://asdf-standard.readthedocs.io/en/1.5.0/index.html\n 
 - For astronomical data\n- Apache Avro\n  - Support for code auto generation\n  - More about messages than data storage\n- JSON Schema\n  - https://json-schema.org/\n- Google Protocol Buffers and Apache Thrift\n  - Not really a schema, but can specify one\n- CDR (Common data representation)\n  - Not human readable\n  - Quite \"industrial\"\n- ASN.1\n  - For telecommunications (mostly)\n- Data Documentation Initiative\n  - https://ddialliance.org/\n- USGS data dictionaries\n  - https://www.usgs.gov/products/data-and-tools/data-management/data-dictionaries\n\n## Other links\n\n- \u003chttps://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats\u003e\n\n## Requirements\n\nThe main purpose of this schema is documentation.\nMore people will look at this schema to define stats than machines will read it.\n\n- Easy to understand for users who will be creating new stats\n- Compatible with standard APIs for Python and other languages\n\n### Requirements of our stats output\n\n- Possible to make human readable (concise, clear, etc.)\n- Possible to parse/write easily\n  - pandas\n  - json\n  - hdf5\n  - csv\n- Compact\n\n### Nice to haves\n\n- Standardized, but compatibility with our tools (Python, C++, etc.) is the real requirement.\n\n## The general schema\n\n- Base file\n  - Contains global stats and other models\n- Model\n  - Has a type (e.g., \"Cache\", \"CPU\", etc.)\n    - We could specialize this and have different types of models that match across simulators\n  - Can contain models and statistics\n- Statistic\n  - Name\n  - Type\n  - Value\n  - Unit\n  - Description\n\n## An example JSON Schema approach\n\nFirst, some data that we want to put in the schema:\n\nHere's the system that generated this \"data\":\n\n```python\nmy_system = System()\nmy_system.cpus = [TimingSimpleCPU() for i in range(2)]\nmy_system.l2_cache = L2Cache()\nfor cpu in my_system.cpus:\n  cpu.tlb = X86TLB()\n  cpu.l1_cache = L1Cache()\n  
cpu.l1_cache.connectMemSide(my_system.l2_cache)\n```\n\n\n```json\n{\n  \"my_system\": {\n    \"type\": \"System\",\n    \"cpus\": [\n      {\n        \"type\": \"CPU\",\n        \"committed_instructions\": {\n          \"value\": 0,\n          \"type\": \"Scalar\",\n          \"unit\": \"Count\",\n          \"description\": \"The number of instructions committed\"\n        },\n        \"tlb\": {\n          \"type\": \"TLB\",\n          \"data_hits\": {\n            \"value\": 0,\n            \"type\": \"Scalar\",\n            \"unit\": \"Count\",\n            \"description\": \"The number of hits from data accesses\"\n          },\n          \"inst_hits\": {\n            \"value\": 0,\n            \"type\": \"Scalar\",\n            \"unit\": \"Count\",\n            \"description\": \"The number of hits from instruction accesses\"\n          }\n        },\n        \"l1_cache\": {\n          \"type\": \"Cache\",\n          \"miss_latency\": {\n            \"type\": \"Distribution\",\n            \"bins\": [0, 1.0e-8, 2.0e-8, 3.0e-8],\n            \"value\":[0, 0, 0, 0],\n            \"unit\": \"Time\",\n            \"description\": \"Latency of cache misses (includes both reads \u0026 writes)\"\n          }\n        }\n      },\n      {\n        \"tlb\": {\n          \"type\": \"TLB\",\n          \"data_hits\": {\n            \"value\": 0,\n            \"type\": \"Scalar\",\n            \"unit\": \"Count\",\n            \"description\": \"The number of hits from data accesses\"\n          },\n          \"inst_hits\": {\n            \"value\": 0,\n            \"type\": \"Scalar\",\n            \"unit\": \"Count\",\n            \"description\": \"The number of hits from instruction accesses\"\n          }\n        },\n        \"l1_cache\": {\n          \"type\": \"Cache\",\n          \"miss_latency\": {\n            \"type\": \"Distribution\",\n            \"bins\": [0, 1.0e-8, 2.0e-8, 3.0e-8],\n            \"value\":[0, 0, 0, 0],\n            \"unit\": \"Time\",\n    
        \"description\": \"Latency of cache misses (includes both reads \u0026 writes)\"\n          }\n        }\n      }\n    ],\n    \"l2_cache\": {\n      \"type\": \"Cache\",\n      \"miss_latency\": {\n        \"type\": \"Distribution\",\n        \"bins\": [0, 1.0e-8, 2.0e-8, 3.0e-8],\n        \"value\":[0, 0, 0, 0],\n        \"unit\": \"Time\",\n        \"description\": \"Latency of cache misses (includes both reads \u0026 writes)\"\n      }\n    }\n  }\n}\n```\n\nNot every output format of the stats has to carry all of the data described by the schema.\nI think this flexibility is key to making a shared schema workable.\n\n### Example of the base file\n\n```json\n{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"$id\": \"http://gem5.org/simstats.schema.json\",\n  \"title\": \"Architecture Simulator Statistics\",\n  \"description\": \"A set of statistics or results output from a computer architecture simulation\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"creationTime\": {\n      \"description\": \"The time this output was generated (wall clock time) in Date format\",\n      \"type\": \"string\",\n      \"format\": \"date-time\"\n    },\n    \"globalStatistics\": {\n      \"description\": \"Statistics not associated with a particular model (e.g., total ticks, total instructions, etc.)\",\n      \"type\": \"array\",\n      \"items\": { \"$ref\": \"http://gem5.org/statistic.schema.json\" }\n    }\n  },\n  \"additionalProperties\": { \"$ref\": \"http://gem5.org/model.schema.json\" }\n}\n```\n\nWe could also do the following for additional properties and drop \"globalStatistics\".\nThis would allow us to say \"The file contains a set of named stats and/or models with stats or other sub-models\"\n\n```json\n{\n  \"additionalProperties\": {\n    \"anyOf\": [\n      { \"$ref\": \"http://gem5.org/model.schema.json\" },\n      { \"$ref\": \"http://gem5.org/statistic.schema.json\" },\n      { \"$ref\": \"http://gem5.org/scalar-statistic.schema.json\" },\n      { \"$ref\": 
\"http://gem5.org/distribution-statistic.schema.json\" }\n    ]\n  }\n}\n```\n\n### Example for a general statistic\n\n```json\n{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"$id\": \"http://gem5.org/statistic.schema.json\",\n  \"title\": \"Statistic\",\n  \"description\": \"A single statistic output\",\n  \"properties\": {\n    \"type\": {\"type\": \"string\" },\n    \"value\": {},\n    \"unit\": {\"type\": \"string\" },\n    \"description\": {\"type\": \"string\" }\n  },\n  \"required\": [\"value\"]\n}\n```\n\n### Example for a specific statistic (Scalar)\n\n```json\n{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"$id\": \"http://gem5.org/scalar-statistic.schema.json\",\n  \"title\": \"Scalar\",\n  \"description\": \"A scalar statistic value (e.g., a count, latency, etc.)\",\n  \"properties\": {\n    \"type\": { \"const\": \"Scalar\" },\n    \"value\": { \"type\": \"number\" },\n    \"unit\": {\"type\": \"string\" },\n    \"description\": {\"type\": \"string\" }\n  },\n  \"required\": [\"value\"]\n}\n```\n\n### Example for a specific statistic (Distribution)\n\n```json\n{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"$id\": \"http://gem5.org/distribution-statistic.schema.json\",\n  \"title\": \"Distribution\",\n  \"description\": \"A distribution of statistic values\",\n  \"properties\": {\n    \"type\": { \"const\": \"Distribution\" },\n    \"value\": {\n      \"type\": \"array\",\n      \"items\": { \"type\": \"integer\", \"minimum\": 0 }\n    },\n    \"bins\": {\n      \"type\": \"array\",\n      \"items\": { \"type\": \"number\" }\n    },\n    \"binSize\": { \"type\": \"number\" },\n    \"numBins\": { \"type\": \"integer\", \"minimum\": 1 },\n    \"unit\": { \"type\": \"string\" },\n    \"description\": { \"type\": \"string\" }\n  },\n  \"required\": [\"value\"]\n}\n```\n\n### Example for a model\n\n```json\n{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"$id\": 
\"http://gem5.org/model.schema.json\",\n  \"title\": \"Model\",\n  \"description\": \"A simulated model which has statistics and possibly other sub models\",\n  \"properties\": {\n    \"type\": { \"type\": \"string\" }\n  },\n  \"additionalProperties\": {\n    \"anyOf\": [\n      { \"$ref\": \"http://gem5.org/model.schema.json\" },\n      { \"$ref\": \"http://gem5.org/statistic.schema.json\" },\n      {\n        \"type\": \"array\",\n        \"items\": { \"$ref\": \"http://gem5.org/model.schema.json\" }\n      }\n    ]\n  }\n}\n```\n\n## Ideas\n\n- We can have generic models which have \"shared\" stats between simulators (and other things)\n  - E.g., Caches\n    - Just hits/misses\n    - Keep it simple\n- We allow simulators to have more specific stats\n- Allow the simulator to output the metadata once and then output only the values in all other outputs.\n  - Want to have dynamic components\n  - Need to support stats appearing in the middle of simulation\n- Still need to define \"alternative\" schemas\n  - CSV\n  - ???\n\n### Other potential info to include\n\n- Data type of the result (e.g., u64, s32, f32)\n  - The other option is to just use \"integer\" and \"number\" from JSON Schema\n- Other metadata with each dump/file\n\n## Other questions to answer or potential features\n\n- How to represent time-series data\n  - How to \"tag\" each dump\n- Dump different stats at different frequencies\n  - How to represent this in the above schema\n  - gem5 doesn't currently support this (easily), but could be extended\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgem5%2Fstats-schema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgem5%2Fstats-schema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgem5%2Fstats-schema/lists"}