{"id":20718511,"url":"https://github.com/knowm/datasets","last_synced_at":"2025-04-23T14:13:00.807Z","repository":{"id":13841821,"uuid":"16538895","full_name":"knowm/Datasets","owner":"knowm","description":"Datasets is a Java library for conveniently working with machine learning datasets. ","archived":false,"fork":false,"pushed_at":"2018-06-19T05:31:14.000Z","size":78923,"stargazers_count":21,"open_issues_count":2,"forks_count":10,"subscribers_count":6,"default_branch":"develop","last_synced_at":"2025-04-23T14:12:52.251Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/knowm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-02-05T08:23:23.000Z","updated_at":"2024-04-25T16:34:06.000Z","dependencies_parsed_at":"2022-08-23T16:00:25.384Z","dependency_job_id":null,"html_url":"https://github.com/knowm/Datasets","commit_stats":null,"previous_names":["timmolter/datasets"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowm%2FDatasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowm%2FDatasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowm%2FDatasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowm%2FDatasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/knowm","download_url":"https://codeload.github.com/knowm/Datasets/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250447990,"owners_count":21432165,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-17T03:13:52.074Z","updated_at":"2025-04-23T14:13:00.750Z","avatar_url":"https://github.com/knowm.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Knowm Datasets\n\nKnowm Datasets is a Java library for conveniently working with machine learning datasets.  \n\n## Description \n\nThe philosophy of this open source project is simple - take several diverse datasets, which all have their own custom formats, and convert them all into a unified \nformat with a unified API for accessing the data. Each module has a `RawData2DB` class, which parses the raw data and puts each data object into a file-based HSQLDB database. \nNo separate database installation is necessary. The generated database files have been uploaded to Knowm's Google Drive account [here](https://drive.google.com/folderview?id=0ByP7_A9vXm17VXhuZzBrcnNubEE\u0026usp=sharing#list).\nThe data is accessed for client apps through a DAO class, with methods so easy, even a child could understand:\n\nSample code:\n\n    LSHTC4DAO.init(\"/Users/timmolter/Documents/Datasets\"); // setup data\n\n    // print number of objects\n    long count = LSHTC4DAO.selectCount();\n    System.out.println(\"count= \" + count);\n    \n    // loop through first 10 LSHTC4 objects\n    for (int i = 1; i \u003c= 10; i++) {\n\n      LSHTC4 lSHTC4 = LSHTC4DAO.selectSingle(i);\n      System.out.println(lSHTC4.toString());\n    }\n    \n    LSHTC4DAO.release(); // release data resources\n    \n    \nOutput:\n\n    count= 2817603\n    \n    LSHTC4 [id=1, labels=, features=139:1,153:4,199:1,212:1,232:1,282:1,307:3,310:1,428:1,510:1,528:1,609:1,700:2,709:1,727:1,765:1,791:1,798:2,838:1,872:1,1007:1,1170:2,1374:1,1388:1,1409:1,1435:1,1892:1,2190:1,2197:1,2253:1,2348:2,2570:1,2628:1,2713:1,3066:1,3406:1,3619:2,3628:2,3636:1,3649:2,5068:1,8385:1,9371:1,11248:1,11806:1,]\n    LSHTC4 [id=2, labels=, features=41:3,131:2,218:1,254:1,289:1,501:4,511:1,519:3,526:1,527:1,539:1,542:1,543:2,551:2,558:3,605:2,977:2,2748:1,2867:1,3849:1,4032:1,5030:1,19156:1,]\n    LSHTC4 [id=3, labels=, features=41:1,519:2,532:1,574:1,576:1,1032:1,1413:1,4285:1,8865:1,11071:1,24481:1,83715:1,]\n    LSHTC4 [id=4, labels=, features=8:1,26:1,29:1,44:1,48:1,107:1,118:1,137:1,145:1,196:1,197:1,211:1,354:1,400:1,403:1,409:1,415:1,432:1,439:1,442:1,459:1,536:2,551:1,558:1,605:1,612:1,661:1,689:1,695:1,805:3,816:1,834:1,854:5,867:1,883:1,889:1,891:1,902:1,944:2,980:1,1139:1,1273:1,1287:1,1345:1,1415:1,1614:2,1664:1,1713:1,1776:2,1817:1,1861:1,1956:1,2100:1,2105:1,2121:1,2558:2,2564:1,2619:1,3018:1,3045:1,3055:1,3061:2,3217:2,3233:1,3301:1,3755:1,5504:1,6555:1,6942:1,7102:1,7901:1,10298:1,11317:1,12780:1,14305:1,16756:1,27769:1,28416:1,29278:3,32759:1,181529:1,1003324:1,]\n    LSHTC4 [id=5, labels=, features=11:1,26:1,40:1,49:1,139:1,146:1,153:3,175:1,197:1,198:1,199:2,215:2,226:1,228:1,237:2,238:1,239:2,240:1,242:1,253:1,262:1,274:1,286:1,297:1,307:1,316:2,317:1,318:4,326:1,354:1,364:1,375:1,430:1,439:2,463:1,474:1,490:1,491:1,583:3,596:1,597:1,605:1,614:1,615:2,647:1,730:2,752:1,765:1,769:1,777:3,791:1,793:1,798:6,867:2,874:1,891:1,1006:1,1018:1,1092:1,1099:2,1106:2,1116:1,1138:1,1155:1,1159:3,1167:1,1169:1,1171:1,1180:1,1184:2,1317:1,1330:1,1394:1,1398:1,1414:3,1449:1,1467:1,1469:1,1515:1,1547:1,1575:1,1771:1,1797:1,1842:2,1918:1,1932:1,2009:1,2066:1,2103:1,2115:1,2135:1,2143:1,2180:1,2184:1,2192:1,2196:1,2197:1,2220:2,2275:1,2306:1,2334:1,2342:1,2344:1,2419:1,2557:2,2610:1,2652:1,2934:1,2969:1,3023:1,3026:1,3032:1,3048:3,3053:2,3380:2,3403:2,3507:1,3664:1,3849:1,3964:16,3970:1,3984:1,4016:1,4017:4,4205:1,4302:1,4336:1,4353:1,4524:1,4548:1,4571:1,4665:1,4667:1,4672:1,5083:2,5134:1,5930:1,6229:1,6738:1,6977:1,7404:1,8540:1,9532:2,11399:1,12822:1,15406:1,16929:1,17726:1,19875:1,20093:1,20597:1,20641:1,20655:1,26618:1,27756:1,36028:1,63893:1,70093:1,121950:1,171358:1,191665:1,866061:1,]\n    LSHTC4 [id=6, labels=, features=18:1,19:1,64:1,89:1,123:1,147:1,198:1,264:1,356:1,387:1,491:2,511:2,521:1,527:1,529:2,561:4,632:1,712:1,761:1,903:1,991:1,1002:1,1105:1,1299:1,1565:1,1620:1,1651:1,1697:1,1832:1,3591:1,4607:1,4718:1,6248:1,7963:1,23274:2,]\n    LSHTC4 [id=7, labels=, features=11:2,26:2,36:1,62:2,67:1,70:1,81:1,99:1,155:1,185:1,197:3,204:3,211:5,229:1,230:1,231:1,246:1,344:2,347:1,375:1,397:1,401:2,413:1,415:1,458:2,491:1,497:1,539:1,558:1,587:1,692:2,745:1,752:1,761:1,812:2,815:1,827:1,829:1,854:12,944:1,978:2,991:1,1001:2,1109:1,1159:1,1193:1,1247:1,1300:1,1380:1,1414:3,1518:1,1544:1,1634:1,1661:16,1670:1,1788:2,1813:2,1834:1,1846:1,1879:1,2062:1,2128:1,2220:1,2236:2,2562:2,2578:2,2586:7,2683:1,2962:1,3014:1,3019:1,3734:2,3826:1,3999:1,4052:1,4267:1,4471:1,4752:1,4756:1,4811:1,4850:2,4963:1,5071:1,5317:2,5459:1,5497:1,5509:3,5698:2,6899:1,7045:1,7217:1,7641:1,7924:1,7985:1,8010:1,8176:1,8482:1,8942:1,10605:1,10682:1,10706:1,12306:1,12307:1,12425:2,12555:1,12681:1,12961:1,13995:1,13998:1,14000:1,14214:1,14826:1,15493:1,16852:1,21690:3,26455:1,26503:1,34393:1,35307:1,42172:1,43814:1,47525:1,50601:1,65466:1,74704:1,93306:1,93846:1,98361:1,143927:1,512967:1,581083:1,892311:1,922750:1,]\n    LSHTC4 [id=8, labels=, features=20:1,30:1,32:1,44:1,81:1,104:1,114:1,122:1,133:1,135:2,140:1,178:1,202:1,211:1,215:1,219:2,228:2,229:1,312:2,367:1,475:1,587:1,740:1,750:1,769:1,777:1,778:3,829:1,830:1,834:1,856:1,1024:1,1083:5,1099:1,1100:2,1102:5,1106:12,1118:1,1129:1,1156:1,1176:1,1377:1,1681:1,1786:1,1804:2,2088:1,2126:1,2295:1,3018:2,3044:2,3127:1,4175:1,4440:1,5115:1,5568:1,5774:1,5913:2,5923:1,7958:1,8112:1,9324:3,10808:1,12594:2,12692:1,12715:1,16618:1,18828:1,18829:1,19913:1,19920:4,20093:5,20193:1,21208:1,21213:1,25433:1,36336:1,55404:1,69755:1,113192:1,]\n    LSHTC4 [id=9, labels=, features=24:1,41:1,81:1,122:2,131:2,196:1,197:1,199:2,219:1,230:3,310:1,318:2,328:1,346:2,354:2,375:1,378:1,395:1,400:1,415:1,430:1,464:1,501:1,559:3,561:3,567:2,570:4,576:1,589:1,601:1,605:1,633:1,692:3,717:1,721:3,765:1,773:1,791:3,818:1,841:1,903:1,916:1,977:1,1000:1,1019:1,1046:1,1078:1,1106:1,1109:1,1163:1,1249:2,1266:1,1413:1,1556:1,1563:1,1664:1,1716:1,1742:2,1756:1,1782:1,1793:1,1915:1,1966:1,2032:1,2369:1,2687:2,2695:1,2957:1,3365:1,3519:1,3581:1,3698:1,4548:1,4570:1,5126:3,5526:3,5954:2,6014:1,7104:1,7124:1,7652:1,8532:1,10305:1,10637:1,10774:1,11256:2,11892:1,12116:1,14386:1,14732:1,17880:5,19492:4,23460:1,23618:1,30520:2,33822:1,42461:1,57833:1,386140:1,691708:1,1558913:1,]\n    LSHTC4 [id=10, labels=, features=40:1,41:1,44:1,48:2,49:1,68:1,95:1,111:1,153:4,162:1,196:1,219:1,228:1,229:1,232:1,238:1,239:2,242:2,247:2,276:1,297:2,306:1,307:1,316:1,317:1,375:1,430:1,510:1,516:1,582:1,612:1,717:1,728:2,761:1,764:1,776:1,783:1,797:1,815:1,915:1,1116:1,1337:1,1441:1,1680:1,2116:2,2118:1,2119:1,2192:1,2194:1,2322:1,2347:1,2354:1,2613:1,2636:1,2748:1,2930:1,3048:1,3057:1,3140:1,3229:1,3893:1,4030:1,4252:1,4984:1,5068:1,6599:1,7108:1,8540:1,10639:1,10666:1,10670:2,10676:1,14070:5,14321:1,14364:2,24700:1,26766:1,27895:1,63406:1,166985:1,601892:1,]\n\nThe first time the DAO class is used, it attempts to download the database files from Google Drive. If there are problems, like when the file is too big, a message is printed \ndirecting you to download the files manually.\n\nIf you prefer to build the project yourself, note that the actual data is not hosted in the repo with the code, but must be downloaded separately first. Each module in this \nprojects has its own README file with instructions on where to get the data and how to build the modules. \n\n## License\n\n[MIT](http://opensource.org/licenses/MIT)\n\nSource code from other open source projects has been bundled with this project either directly or in modified form. \nThe original copyright and license notices have been preserved in their original forms in the following source code files:\n\n[musicg](https://code.google.com/p/musicg/) datasets-common/com/musicg (apache-2.0)\n[snowball](http://snowball.tartarus.org/license.php) datasets-common/org/taratrus/snowball (BSD)\n[mnist-tools](https://code.google.com/p/mnist-tools/) (Artistic License/GPL)\n\n## Included Datasets\n\n* [Breast Cancer Wisconsin (Original)](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29)\n* [Census Income](http://archive.ics.uci.edu/ml/datasets/Census+Income)\n* [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html)\n* [Higgs-Boson](https://www.kaggle.com/c/higgs-boson)\n* [HJA Birdsong](http://web.engr.oregonstate.edu/~briggsf/kdd2012datasets/hja_birdsong/)\n* [LSHTC4](http://www.kaggle.com/c/lshtc/data)\n* [MNIST](http://yann.lecun.com/exdb/mnist/)\n* [NSL-KDD](http://nsl.cs.unb.ca/NSL-KDD/)\n* [Reuters-21578](http://archive.ics.uci.edu/ml/support/Reuters-21578+Text+Categorization+Collection)\n* [UCSD Anomaly](http://www.svcl.ucsd.edu/projects/anomaly/dataset.html)\n* [Numenta Anomaly](https://github.com/numenta/NAB) \n* [PCB](https://www.caa.tuwien.ac.at/cvl/research/cvl-databases/pcb-dslr-dataset/) \n\n\n## Example Usage\n\n### Include Jar in Your Project\n\nDownload Datasets Release Jars: http://search.maven.org/#search%7Cga%7C1%7Cknowm%20datasets\n\nDownload Datasets Snapshot Jars: https://oss.sonatype.org/content/groups/public/org/knowm/datasets\n\n### Via Maven\n\nThe Datasets release artifacts are hosted on Maven Central.\n\nAdd the Datasets library as a dependency to your pom.xml file:\n\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.knowm.datasets\u003c/groupId\u003e\n        \u003cartifactId\u003edatasets-breast-cancer-wisconsin-orginal\u003c/artifactId\u003e\n        \u003cversion\u003e2.1.0\u003c/version\u003e\n    \u003c/dependency\u003e\n    \n, adjusting the particular dataset you want, in this case `datasets-breast-cancer-wisconsin-orginal`.\n\nFor snapshots, add the following to your pom.xml file:\n\n    \u003crepository\u003e\n      \u003cid\u003esonatype-oss-snapshot\u003c/id\u003e\n      \u003csnapshots/\u003e\n      \u003curl\u003ehttps://oss.sonatype.org/content/repositories/snapshots\u003c/url\u003e\n    \u003c/repository\u003e\n\nThe current snapshot version is:\n\n    2.2.0-SNAPSHOT\n    \n\n## Building\n\nKnowm Datasets is built with Maven.\n\n    cd path/to/datasets-parent\n    \n#### Install to local repo\n\n    mvn clean install\n    \n#### maven-license-plugin\n\n    mvn license:check\n    mvn license:format\n    mvn license:remove\n    \n#### JavaDocs\n\n    mvn javadoc:aggregate \n\n## Continuous Integration\n\n[![Build Status](https://travis-ci.org/timmolter/Datasets.png?branch=develop)](https://travis-ci.org/timmolter/Datasets.png)  \n[Build History](https://travis-ci.org/timmolter/Datasets/builds)  \n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknowm%2Fdatasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fknowm%2Fdatasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknowm%2Fdatasets/lists"}