{"id":15660403,"url":"https://github.com/smola/language-dataset","last_synced_at":"2025-10-22T10:55:40.915Z","repository":{"id":36037380,"uuid":"143737125","full_name":"smola/language-dataset","owner":"smola","description":"Dataset for programming language identification.","archived":false,"fork":false,"pushed_at":"2023-03-06T05:01:18.000Z","size":12235,"stargazers_count":22,"open_issues_count":14,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-30T23:41:11.335Z","etag":null,"topics":["dataset","language-detection","language-identification","programming-language-identification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smola.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-06T14:05:52.000Z","updated_at":"2025-03-21T16:49:13.000Z","dependencies_parsed_at":"2024-10-03T13:21:51.623Z","dependency_job_id":"e42435fb-7973-4320-b951-bb42a370bc63","html_url":"https://github.com/smola/language-dataset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smola%2Flanguage-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smola%2Flanguage-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smola%2Flanguage-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smola%2Flanguage-dataset/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/smola","download_url":"https://codeload.github.com/smola/language-dataset/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252577015,"owners_count":21770721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","language-detection","language-identification","programming-language-identification"],"created_at":"2024-10-03T13:21:31.014Z","updated_at":"2025-10-22T10:55:35.886Z","avatar_url":"https://github.com/smola.png","language":"Python","readme":"# language-dataset\n\nA dataset for programming language identification.\n\n## Methodology\n\n* Available languages are fetched from [github/linguist](https://github.com/github/linguist/)'s [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) and [acmeism/RosettaCodeData](https://github.com/acmeism/RosettaCodeData)'s [Lang.yaml](https://github.com/acmeism/RosettaCodeData/blob/master/Meta/Lang.yaml).\n* For each language, initial samples are fetched from GitHub as follows:\n  * [GitHub Search API](https://developer.github.com/v4/query/#search) is used to get a list of repositories.\n  * Each repository is cloned and languages are detected with [github/linguist](https://github.com/github/linguist/).\n  * One sample is added from each repository.\n* Samples are later reviewed by humans.\n\nRules for sample inclusion are:\n\n* No more than one sample from each repository.\n* Sample is at least 500b and at most 100kb.\n\n## Dataset\n\nThe dataset is stored in the 
`data` directory. It contains:\n\n* `meta.yml`: metadata about the dataset and available languages.\n* `dataset.yml`: collection of all samples, with pointers to sample paths relative to `data`.\n\nCheck a summary of the dataset at [REPORT.md](REPORT.md).\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## Tooling\n\nThe `tools` directory contains various Python utilities to maintain the dataset:\n* `tools/gen_meta.py`: Generates `data/meta.yml`. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.\n* `tools/harvest.py`: Fetches samples from GitHub.\n* `tools/vote.py`: Updates the `vote` annotation.\n* `tools/lint.py`: Checks the dataset for potential problems.\n* `tools/prepare_commit.py`: Updates generated files, required before any commit.\n* `tools/classify_linguist.py`: Updates linguist labels.\n* `tools/classify_pygments.py`: Updates pygments labels.\n\nTo run the tools, first create the virtual environment:\n\n```\npip install poetry\npoetry install\n```\n\nThen run the tool with `python -m`:\n\n```\npoetry run python -m tools.gen_meta\n```\n\n## License\n\nEach sample in `data` has its own license. Check the origin repository for details.\n\nEverything else is licensed under the [MIT License](LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmola%2Flanguage-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmola%2Flanguage-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmola%2Flanguage-dataset/lists"}