{"id":48081592,"url":"https://github.com/rom1mouret/ml-essentials","last_synced_at":"2026-04-04T14:53:58.878Z","repository":{"id":57564043,"uuid":"334572127","full_name":"rom1mouret/ml-essentials","owner":"rom1mouret","description":"dataframe library for machine learning","archived":false,"fork":false,"pushed_at":"2021-02-02T06:30:46.000Z","size":113,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-06-20T11:58:35.201Z","etag":null,"topics":["dataframe","dataframe-library","machine-learning","ml","one-hot-encoding","preprocessing"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rom1mouret.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-31T04:29:21.000Z","updated_at":"2021-02-03T05:42:03.000Z","dependencies_parsed_at":"2022-09-16T13:35:19.028Z","dependency_job_id":null,"html_url":"https://github.com/rom1mouret/ml-essentials","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/rom1mouret/ml-essentials","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fml-essentials","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fml-essentials/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fml-essentials/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fml-essentials/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rom1mouret","download
_url":"https://codeload.github.com/rom1mouret/ml-essentials/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fml-essentials/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31403781,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataframe","dataframe-library","machine-learning","ml","one-hot-encoding","preprocessing"],"created_at":"2026-04-04T14:53:58.215Z","updated_at":"2026-04-04T14:53:58.871Z","avatar_url":"https://github.com/rom1mouret.png","language":"Go","readme":"ml-essentials is a data frame library for Go in the same vein as [gota](https://github.com/go-gota/gota) and [qframe](https://github.com/tobgu/qframe).\n\nIt draws inspiration from [pandas](https://pandas.pydata.org/) and [numpy](https://numpy.org/).\n\nUnlike [gota](https://github.com/go-gota/gota) and [qframe](https://github.com/tobgu/qframe),\nml-essentials doesn't cater for data scientists, e.g. 
with functions to load Excel files, SQL databases or functions to help with [EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis).\nIt is best suited for machine learning engineers who want to serve their models in a safe and predictable manner.\nIt is also smaller, with a focus on simplicity, stability and clarity.\n\nI hope that ml-essentials is transparent enough for users to glance at their code and get a sense of what ml-essentials does under the hood and how much it is going to cost in CPU and RAM usage.\nTo illustrate my point, I am enumerating below all the view-returning functions.\nThese features are only available through views, so users have no choice but to spell out what their code should do.\n\n```\n(df *DataFrame) IndexView(indices []int) *DataFrame\n(df *DataFrame) SliceView(from int, to int) *DataFrame\n(df *DataFrame) MaskView(mask []bool) *DataFrame\n(df *DataFrame) ColumnView(columns ...string) *DataFrame\n(df *DataFrame) ShuffleView() *DataFrame\n(df *DataFrame) SampleView(n int, replacement bool) *DataFrame\n(df *DataFrame) SplitNView(n int) []*DataFrame\n(df *DataFrame) SplitView(batchSize int) []*DataFrame\n(df *DataFrame) SplitTrainTestViews(testingRatio float64) (*DataFrame, *DataFrame)\n(df *DataFrame) SortedView(byColumn string) *DataFrame\n(df *DataFrame) TopView(byColumn string, n int, ascending bool, sorted bool) *DataFrame\n(df *DataFrame) ReverseView() *DataFrame\n(df *DataFrame) HashStringsView(columns ...string) *DataFrame\n(df *DataFrame) DetachedView(columns ...string) *DataFrame\n(df *DataFrame) ResetIndexView() *DataFrame\n(df *DataFrame) ShallowCopy() *DataFrame\n(df *DataFrame) ColumnConcatView(dfs ...*DataFrame) (*DataFrame, error)\n```\n\nView-returning functions are guaranteed not to copy any large chunk of data.\n\n### Documentation and examples\n\n- [dataframe package](dataframe/)\n- [preprocessing package](preprocessing/)\n- [algorithms package](algorithms/)\n- [A-to-Z example](examples/linreg.go)\n\n### 
Benchmarks\n\ndataset: [kddcup98](https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html)\n\ntask: linear regression\n\n|                             | ml-essentials CPU=1 | ml-essentials CPU=16 | python (pandas + pytorch) |\n|-----------------------------|---------------------|----------------------|---------------------------|\n| reading CSV                 | 18.3                | 3.3                  | 4.3                       |\n| shuffling and splitting     | 0.003               | 0.003                | 0.4                       |\n| preprocessing fit_transform | 2.4                 | 0.8                  | 2.2                       |\n| linreg training (1 epoch)   | 6.9                 | 6.9                  | 3.4                       |\n| preprocessor on test data   | 1                   | 0.5                  | 0.77                      |\n| writing predictions         | 33                  | 4.7                  | 426                       |\n| reading written rows        | 170                 | 71                   | 410                       |\n\nThe reason it takes so long to read/write predictions is because one-hot encoding creates over 20,000 columns.\n\nReproduction\n\n```bash\ncd examples\ngo run linreg.go -momentum=0.2 -epochs=1 -testratio=0.33 -batchsize 256 cup98LRN.txt TARGET_B CONTROLN\npython3 linreg.py -momentum=0.2 -epochs=1 -testratio=0.33 -batchsize 256 cup98LRN.txt TARGET_B CONTROLN\n```\n\n### Design choices\n\n##### Native types\n\nHere are the benchmarks that have motivated my decision to use 3 native types alongside `interface{}`.\nThose benchmarks measure the time to copy a slice at specific indices (from a slice of indices).\n\n| type           | speed      | storage choice |  missing value |\n|----------------|------------|----------------|----------------|\n|`[]interface{}` | 4.51 ns/op | `[]interface{}`| nil            |\n|`[]string`      | 4.26 ns/op | `[]interface{}`| nil            |\n|`[]float64`     | 1.97 ns/op | 
`[]float64`    | NaN            |\n|`[]int`         | 1.80 ns/op | `[]int`        | -1             |\n|`[]bool`        | 1.38 ns/op | `[]bool`       | not applicable |\n\n\n`float64` was chosen over `float32` for the sake of compatibility with [gonum](https://github.com/gonum).\n\n\n##### `interface{}` type for all the columns\n\nStoring all the data slices as `interface{}` is sound.\nFor one thing, this requires only one `map[string]interface{}`.\nBy contrast, ml-essentials allocates 5 `map[string]T`, even when empty.\nAlso, some functions get to be very succinct. For instance,\n`rename` can move the data from one column to another without ever knowing the\ntype of the data.\n\nUltimately, it was decided not to use `interface{}` for everything. Most functions\ndo rely on knowing the precise type and casting the values anyway. The first version\nused `interface{}` everywhere and lots of type assertion errors popped up. Although\nthey were easy to fix, the new implementation brings more peace of mind.\n\n### Roadmap\n\n\n- functions to store/retrieve gonum's blas vectors in the df.objects map\n- functions to store/retrieve/sort datetime objects in the df.objects map\n- functions to create masks, e.g. 
mask := df.Test(\"age\").Lower(15).Mask()\n- smarter ColumnSmartConcat function\n- ordinal encoder as an alternative to Hash Encoder\n- more methods to RawData, like some sort of concat\n- optimization of TopView\n- more options to CSV reader and writer, such as BOM parsing\n- inverse transform for OneHot\n- `RepeatView(n int, interleaved bool)`\n- more evaluation metrics, such as cross entropy\n- reading/writing data in JSON\n- release as a Go module\n\n### External Contributions\n\nml-essentials is not affiliated with any organization.\nContributions are welcome.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2From1mouret%2Fml-essentials","html_url":"https://awesome.ecosyste.ms/projects/github.com%2From1mouret%2Fml-essentials","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2From1mouret%2Fml-essentials/lists"}