{"id":16329124,"url":"https://github.com/SOM-Research/DescribeML","last_synced_at":"2025-10-25T21:31:11.470Z","repository":{"id":43666668,"uuid":"447183317","full_name":"SOM-Research/DescribeML","owner":"SOM-Research","description":"DescribeML is a Visual Studio Code language plug-in to describe machine-learning datasets in a structured format. Build better data describing the composition, provenance and social concerns of your dataset.","archived":false,"fork":false,"pushed_at":"2023-09-15T08:13:21.000Z","size":101198,"stargazers_count":27,"open_issues_count":2,"forks_count":3,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-10-11T23:16:55.754Z","etag":null,"topics":["data-science","dataset-generation","datasets","describeml","langium","machine-learning","model-driven","modeling","open-data","open-datasets","visual-studio-code","vscode"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SOM-Research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"MIT-LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-01-12T11:01:14.000Z","updated_at":"2024-04-28T16:41:03.000Z","dependencies_parsed_at":"2023-02-02T12:31:12.388Z","dependency_job_id":null,"html_url":"https://github.com/SOM-Research/DescribeML","commit_stats":{"total_commits":149,"total_committers":6,"mean_commits":"24.833333333333332","dds":0.4429530201342282,"last_synced_commit":"994e105f626e295ddfb0c28c2c380ecea945d790"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SOM-Research%2FDes
cribeML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SOM-Research%2FDescribeML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SOM-Research%2FDescribeML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SOM-Research%2FDescribeML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SOM-Research","download_url":"https://codeload.github.com/SOM-Research/DescribeML/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238212377,"owners_count":19434954,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","dataset-generation","datasets","describeml","langium","machine-learning","model-driven","modeling","open-data","open-datasets","visual-studio-code","vscode"],"created_at":"2024-10-10T23:14:41.841Z","updated_at":"2025-10-25T21:31:11.465Z","avatar_url":"https://github.com/SOM-Research.png","language":"TypeScript","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# DescribeML ![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/SOM-Research/DescribeML?label=Version\u0026style=for-the-badge)\n\nDescribeML is a VSCode language plugin to describe machine-learning datasets. \u003cbr\u003e\n\nPrecisely describe your data's provenance, composition, and social concerns in a structured format.\n\n\nMake it easy to **reproduce your experiments** to others when you cannot share your data. 
\u003cbr\u003e\n\u003cbr\u003e\nCheck out the quick video [presentation](https://www.youtube.com/watch?v=Bf3bhWB-UJY) of the tool, and the [tutorial](https://www.youtube.com/watch?v=1Of1qfuJKvY) presented at the MODELS '22 Conference.\n\n\u003c/div\u003e\n\n## Installation \n\n### Via marketplace\n\nThe easiest way to install the plugin is through the **Visual Studio Code Marketplace**. Just type \"describeML\" in the Extensions view, and that's it!\n\n### Manually\n\nAlternatively, you can install it manually using the packaged release of the plugin, which can be found at the root of this [repository](https://github.com/SOM-Research/DescribeML). \n\nThe file is **DescribeML-1.2.1.vsix**\n\nOpen your terminal (or the integrated terminal in VSCode) and run:\n\n```\n\ngit clone https://github.com/SOM-Research/DescribeML.git datasets\ncd datasets \ncode --install-extension DescribeML-1.2.1.vsix\n```\n\n\u003cspan style=\"font-size:0.7em;\"\u003e*Troubleshooting: If syntax highlighting does not appear in the example files (e.g. *Melanoma.descml*), reload the VSCode editor and run the code --install-extension command again.* \u003c/span\u003e\n\nGreat! 
That's it.\n\n\n\n## Getting Started\n\n1) The first step is to create a *.descml* file.\n\n2) The easiest way to start using the tool is the *preloader data service*, located at the top left of your editor. Launch it by clicking: \u003cimg\n  src=\"https://github.com/SOM-Research/DescribeML/blob/main/fileicons/cloud-computing.png?raw=true\"\n  alt=\"preloader service\"\n  title=\"Optional title\"\n  style=\"display: inline-block; margin: 0 auto; width: 40px\"\u003e\n\n3) Select your dataset file (*.csv*), and the tool will generate a draft of your description file.\n\n4) For help, refer to the [Language Reference Guide](https://github.com/SOM-Research/DescribeML/blob/main/documentation/language-reference-guide.md) and follow the examples in the **examples/evaluation** [folder](https://github.com/SOM-Research/DescribeML/tree/main/examples/evaluation) to get a sense of the tool's possibilities. Take a look at the *Melanoma.descml* file, for example.\n5) During the documentation process, pressing CTRL + Space (or the equivalent on other operating systems) gives you auto-completion help. 
In addition, the part marked with the points below gives you hints to complete the documentation, and the outline on the right shows you the document structure.\n\n\u003cdiv align=\"center\"\u003e\n\n![Autocompletion feature](https://github.com/SOM-Research/DescribeML/blob/main/fileicons/Autcomplete.gif?raw=true)\n\n\u003c/div\u003e\n\n6) Once you are happy with your documentation, you can generate HTML documentation by clicking the generator button next to the preloader service: \u003cimg\n  src=\"https://github.com/SOM-Research/DescribeML/blob/main/fileicons/html.png?raw=true\"\n  alt=\"HTML generator\"\n  title=\"Optional title\"\n  style=\"display: inline-block; margin: 0 auto; width: 40px\"\u003e\n\n\n\n\n\n\n\nFor more information, check out the **quick [presentation](https://www.youtube.com/watch?v=Bf3bhWB-UJY) video** and the [**tutorial**](https://www.youtube.com/watch?v=1Of1qfuJKvY) presented at the MODELS '22 Conference.\n\n\n\n\n## Contributing\n\nThis project is being developed as part of a research line of the [SOM Research Lab](https://som-research.github.io/), but we are open to contributions from the community. If you are interested in contributing to this project, please first read the [CONTRIBUTING.md](CONTRIBUTING.md) guidelines file.\n\n### Repository structure\n\nThe following list and tree show the repository's relevant sections:\n\n- The *documentation* and *examples* folders contain the mentioned examples and the language reference guide.\n- The *out* folder contains the executable plugin in JS. You may not want to dive into it, as it is generated by the TypeScript compiler.\n- The *src* folder contains the project's source code.\n  - The *cli* folder is the generated grammar and AST from Langium. You may not want to dive into it, as it is a generated asset.\n  - The *generator-service* folder contains all the code of the generation service. 
It could be a good place to start if you want to improve the generation of the tool.\n  - The *uploader-service* folder contains all the code of the uploader service. It could be a good place to contribute new statistical metrics or ML techniques for dataset reverse engineering.\n  - The *language-server* folder contains all the language features and the grammar declaration. If you want to improve the grammar or some of the features the plugin offers, this is the place to start.\n    - The *dataset-description.langium* file contains the main grammar declaration. This grammar is developed using the [Langium Grammar Language](https://langium.org/docs/grammar-language/). Please refer to the linked documentation for more insights on how to develop the grammar.\n\n\n\n\n```\n├── documentation\n│   └── language-reference-guide.md         // The language reference guide\n├── examples\n│     ├── evaluation\n│       ├── Gender.descml                   // Gender dataset example\n|       ├── Melanoma.descml                 // Melanoma dataset example\n|       └── Polarity.descml                 // Polarity dataset example\n├── out                                     // The generated JS from the src folder\n└── src                                     // The source code of the project\n  ├── cli                                     // Langium framework utils\n  ├── generator-service                       // The tool's HTML generator service\n  ├── uploader-service                        // The tool's HTML uploader service\n  └── language-server                         // The tool's language features\n        ├── generated                           // Generated grammar and AST from Langium\n        ├── dataset-description-index.ts        // Custom index feature\n        ├── dataset-description-module.ts       // Declaration of the custom language features\n        ├── dataset-description-validator.ts    // Custom language features \n        └── dataset-description.langium 
        // The main grammar file of the tool\n  \n```\n\n\n\n\n#### Debugging the extensions\n\nThis repo comes with an already built-in config to debug. Just go to Debug in VSCode, and launch the Extension config. Please check your port 6009 is free.\n  \nFor more information about how the framework works and how the language can be extended, please refer to https://github.com/langium/langium or the VSCode extension API documentation https://code.visualstudio.com/api\n\n## Research background and citation\n\nDescribeML is part of an ongoing research project to improve dataset documentation for machine learning. The core of our proposal is a domain-specific language published in the [Journal of Computer Languages](https://www.sciencedirect.com/science/article/pii/S2590118423000199) that allows data creators to describe relevant aspects of their data for the machine learning field and beyond. The [Critical Dataset Studios](https://knowingmachines.org/reading-list#dataset_documentation_practices) of the [Knowing Machines](https://knowingmachines.org) project have compiled an excellent list of current documentation practices.\n\nTo cite the domain-specific language:\n```\nGiner-Miguelez, J., Gómez, A., \u0026 Cabot, J. (2023). A domain-specific language for describing machine learning datasets. Journal of Computer Languages, 76, 101209.\n```\n\nThe tool has been presented at the [ACM/IEEE 25th International Conference on Model Driven Engineering Languages and Systems](https://conf.researchr.org/details/models-2022/models-2022-tools---demonstrations/5/DescribeML-a-tool-for-describing-machine-learning-datasets) and published as an Original Software Publication in the [Science of Computer Programming](https://www.sciencedirect.com/science/article/pii/S0167642323001120) journal. \n\nTo cite the tool:\n```\nGiner-Miguelez, J., Gómez, A., \u0026 Cabot, J. (2023). DescribeML: A dataset description tool for machine learning. 
Science of Computer Programming, 2023, 103030, ISSN 0167-6423, https://doi.org/10.1016/j.scico.2023.103030.\n```\n\n\n\n# Code of Conduct\n\nAt SOM Research Lab we are dedicated to creating and maintaining welcoming, inclusive, safe, and harassment-free development spaces. Anyone participating will be subject to and agrees to sign on to our [Code of Conduct](CODE_OF_CONDUCT.md).\n\n## License\n\nShield: [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\nThe source code for the site is licensed under the MIT license, which you can find in the MIT-LICENSE file.\n\nAll graphical assets are licensed under the\n[Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSOM-Research%2FDescribeML","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSOM-Research%2FDescribeML","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSOM-Research%2FDescribeML/lists"}