{"id":47540641,"url":"https://github.com/rbmuller/datatrax","last_synced_at":"2026-04-01T18:01:11.495Z","repository":{"id":208884267,"uuid":"722690145","full_name":"rbmuller/datatrax","owner":"rbmuller","description":"Data engineering and machine learning toolkit for Go.","archived":false,"fork":false,"pushed_at":"2026-03-26T19:58:43.000Z","size":66,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-27T07:52:31.605Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rbmuller.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-11-23T17:52:41.000Z","updated_at":"2026-03-27T05:40:24.000Z","dependencies_parsed_at":"2023-12-08T22:38:38.769Z","dependency_job_id":null,"html_url":"https://github.com/rbmuller/datatrax","commit_stats":null,"previous_names":["rbmuller/devtools","rbmuller/datatrax"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/rbmuller/datatrax","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fdatatrax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fdatatrax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fdatatrax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fdatatrax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rbmuller","download_url":"https://codeload.github.com/rbmuller/datatrax/tar.gz/re
fs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fdatatrax/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31290740,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-03-28T14:00:25.163Z","updated_at":"2026-04-01T18:01:11.479Z","avatar_url":"https://github.com/rbmuller.png","language":"Go","readme":"\u003cdiv align=\"center\"\u003e\n\n# Datatrax\n\n**Data engineering and machine learning toolkit for Go.**\n\nBatch processing, type coercion, deduplication, date utilities, and classic ML algorithms — all in pure Go with zero external dependencies.\n\n[![Go Reference](https://pkg.go.dev/badge/github.com/rbmuller/datatrax.svg)](https://pkg.go.dev/github.com/rbmuller/datatrax)\n[![CI](https://github.com/rbmuller/datatrax/actions/workflows/ci.yml/badge.svg)](https://github.com/rbmuller/datatrax/actions/workflows/ci.yml)\n[![Go Report Card](https://goreportcard.com/badge/github.com/rbmuller/datatrax)](https://goreportcard.com/report/github.com/rbmuller/datatrax)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![Go 
Version](https://img.shields.io/github/go-mod/go-version/rbmuller/datatrax)](go.mod)\n\n\u003c/div\u003e\n\n---\n\n## Why Datatrax?\n\nMost data engineers use **Go for pipelines** and **Python for everything else**. Datatrax eliminates the context switch — type coercion, batch processing, deduplication, and classic ML all in one Go module.\n\n- **Zero dependencies** — pure Go stdlib, nothing to audit\n- **Generics-first** — built for Go 1.21+, type-safe by default\n- **Battle-tested utilities** — born from real-world ETL pipelines processing 500k+ records/day\n- **ML without Python** — classic algorithms with a scikit-learn-simple API (coming soon)\n\n## Install\n\n```bash\ngo get github.com/rbmuller/datatrax\n```\n\n## Packages\n\n| Package | Description | Key Functions |\n|---------|-------------|---------------|\n| [`batch`](batch/) | Split slices into chunks for parallel processing | `ChunkArray[T]` |\n| [`coerce`](coerce/) | Convert `interface{}` to typed values safely | `Floatify`, `Integerify`, `Boolify`, `Stringify` |\n| [`dateutil`](dateutil/) | Date/time parsing, conversion, and math | `EpochToTimestamp`, `DaysDifference`, `StringToDate` |\n| [`dedup`](dedup/) | Remove duplicates from any comparable slice | `Deduplicate[T]` |\n| [`errutil`](errutil/) | Errors with automatic file:line location | `NewError` |\n| [`maputil`](maputil/) | Map operations — copy, generate from JSON | `CopyMap[K,V]`, `GenerateMap` |\n| [`mathutil`](mathutil/) | Safe math operations | `Divide` (zero-safe) |\n| [`strutil`](strutil/) | String utilities and generic search | `Contains[T]`, `TrimQuotes`, `SplitByRegexp` |\n| [`ml`](ml/) | ML algorithms — 8 models, metrics, preprocessing | `LinearRegression`, `KNN`, `KMeans`, `RandomForest`, ... 
|\n\n## Quick Start\n\n### Batch Processing\n\nSplit large datasets into manageable chunks for parallel processing:\n\n```go\nimport \"github.com/rbmuller/datatrax/batch\"\n\nrecords := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}\nchunks := batch.ChunkArray(records, 3)\n// [[1 2 3] [4 5 6] [7 8 9] [10]]\n\n// Process chunks in parallel\nfor _, chunk := range chunks {\n    go processChunk(chunk)\n}\n```\n\n### Type Coercion\n\nSafely convert untyped data from JSON, CSV, or database results:\n\n```go\nimport \"github.com/rbmuller/datatrax/coerce\"\n\nf, err := coerce.Floatify(\"3.14\")    // 3.14, nil\ng, err := coerce.Floatify(42)        // 42.0, nil\nn, err := coerce.Integerify(\"100\")   // 100, nil\nb, err := coerce.Boolify(1)          // true, nil\ns, err := coerce.Stringify(3.14)     // \"3.14\", nil\n```\n\n### Deduplication\n\nRemove duplicates from any comparable slice — strings, ints, structs:\n\n```go\nimport \"github.com/rbmuller/datatrax/dedup\"\n\nnames := []string{\"Alice\", \"Bob\", \"Alice\", \"Charlie\", \"Bob\"}\nuniqueNames := dedup.Deduplicate(names)\n// [\"Alice\", \"Bob\", \"Charlie\"]\n\nids := []int{1, 2, 3, 2, 1, 4}\nuniqueIDs := dedup.Deduplicate(ids)\n// [1, 2, 3, 4]\n```\n\n### Date Utilities\n\nParse, convert, and calculate date differences:\n\n```go\nimport \"github.com/rbmuller/datatrax/dateutil\"\n\n// Convert epoch milliseconds to readable timestamp\nts, ok := dateutil.EpochToTimestamp(1684624830053)\n// \"2023-05-21 02:00:30\"\n\n// Calculate days between dates\ndays, err := dateutil.DaysDifference(\"2024-01-01\", \"2024-03-15\", \"2006-01-02\")\n// 74\n\n// Parse date strings\nt, err := dateutil.StringToDate(\"2024-03-15\", \"2006-01-02\")\n```\n\n### Error Utilities\n\nWrap errors with automatic source file and line number:\n\n```go\nimport \"github.com/rbmuller/datatrax/errutil\"\n\noriginalErr := errors.New(\"connection timeout\")\nerr := errutil.NewError(originalErr)\nfmt.Println(err)\n// \"main.go:42 - connection timeout\"\n\n// Supports errors.Is / errors.As 
via Unwrap()\nerrors.Is(err, originalErr) // true\n```\n\n### String Utilities\n\nGeneric search, trimming, and formatting:\n\n```go\nimport \"github.com/rbmuller/datatrax/strutil\"\n\n// Generic contains — works with any comparable type\nstrutil.Contains([]string{\"a\", \"b\", \"c\"}, \"b\")  // true\nstrutil.Contains([]int{1, 2, 3}, 5)              // false\n\n// Trim surrounding quotes\nstrutil.TrimQuotes(`\"hello world\"`)  // \"hello world\"\n\n// Join with quotes for SQL\nstrutil.StringifyWithQuotes([]string{\"a\", \"b\"})  // \"'a','b'\"\n\n// Safe index access — no panics\nstrutil.SafeIndex([]string{\"a\", \"b\"}, 5)  // \"\", false\n```\n\n### Map Utilities\n\nCopy maps and parse JSON:\n\n```go\nimport \"github.com/rbmuller/datatrax/maputil\"\n\n// Generic shallow copy\noriginal := map[string]int{\"a\": 1, \"b\": 2}\ncopied := maputil.CopyMap(original)\n\n// Parse JSON bytes to map\ndata := []byte(`{\"name\": \"datatrax\", \"version\": 1}`)\nm, err := maputil.GenerateMap(data)\n```\n\n### Safe Math\n\nDivision without panics:\n\n```go\nimport \"github.com/rbmuller/datatrax/mathutil\"\n\nmathutil.Divide(10, 3)  // 3.333...\nmathutil.Divide(10, 0)  // 0 (no panic)\n```\n\n## Machine Learning\n\n8 ML algorithms with a consistent `Fit` / `Predict` API — pure Go, zero dependencies.\n\n| Algorithm | Type | Key Config |\n|-----------|------|------------|\n| `LinearRegression` | Regression | LearningRate, Epochs (+ Normal Equation) |\n| `LogisticRegression` | Classification | LearningRate, Epochs, Threshold |\n| `KNN` | Classification | K, Distance (euclidean/manhattan), Weighted |\n| `KMeans` | Clustering | K, MaxIter (K-Means++ init) |\n| `DecisionTree` | Classification | MaxDepth, MinSamples, Criterion (gini/entropy) |\n| `RandomForest` | Classification | NTrees, MaxDepth, MaxFeatures, OOB Score |\n| `GaussianNB` | Classification | — (parameter-free) |\n| `MultinomialNB` | Classification | Alpha (Laplace smoothing) |\n\n**Infrastructure:** Dataset (CSV loading, 
train/test split), Preprocessing (MinMaxScale, StandardScale), Encoding (OneHot, Label), Metrics (Accuracy, Precision, Recall, F1, MSE, RMSE, MAE, R², ConfusionMatrix), K-Fold Cross Validation.\n\n### Benchmarks\n\nAll benchmarks on Apple M4, 1000 samples, 10 features:\n\n| Algorithm | Fit | Predict (100 samples) | Allocs |\n|-----------|-----|----------------------|--------|\n| LinearRegression | 828µs | 0.4µs | 1 |\n| LogisticRegression | 2.5ms | 1.3µs | 2 |\n| KNN | — (stores data) | 10.1ms | 601 |\n| KMeans | 1.9ms | — | 223 |\n| DecisionTree | 849ms | 1.4µs | 1 |\n| GaussianNB | 41µs | 36µs | 102 |\n\n| Utility | Operation | Speed | Allocs |\n|---------|-----------|-------|--------|\n| ChunkArray | 10k items, chunks of 100 | 377ns | 1 |\n| Deduplicate | 10k strings, 50% dupes | 314µs | 3 |\n| Floatify | Single conversion | 27ns | 0 |\n| Contains | 10k elements, worst case | 20µs | 0 |\n\n### Linear Regression\n\n```go\nimport \"github.com/rbmuller/datatrax/ml\"\n\nmodel := ml.NewLinearRegression()\nmodel.Fit(xTrain, yTrain)\npredictions := model.Predict(xTest)\nfmt.Println(\"R²:\", ml.R2Score(yTest, predictions))\n```\n\n### Classification (KNN)\n\n```go\nclf := ml.NewKNN(ml.KNNConfig{K: 5, Distance: \"euclidean\"})\nclf.Fit(xTrain, yTrain)\npredictions := clf.Predict(xTest)\nfmt.Println(\"Accuracy:\", ml.Accuracy(yTest, predictions))\nfmt.Println(\"F1:\", ml.F1Score(yTest, predictions, 1.0))\n```\n\n### Clustering (K-Means)\n\n```go\nkm := ml.NewKMeans(ml.KMeansConfig{K: 3, MaxIter: 100})\nkm.Fit(data)\nlabels := km.Predict(data)\nfmt.Println(\"Inertia:\", km.Inertia())\n```\n\n### Decision Tree\n\n```go\ndt := ml.NewDecisionTree(ml.DecisionTreeConfig{\n    MaxDepth:   5,\n    MinSamples: 2,\n    Criterion:  \"gini\",\n})\ndt.Fit(xTrain, yTrain)\npredictions := dt.Predict(xTest)\nfmt.Println(\"Importance:\", dt.FeatureImportance())\n```\n\n### Random Forest\n\n```go\nrf := ml.NewRandomForest(ml.RandomForestConfig{\n    NTrees:    100,\n    MaxDepth:  10,\n    
Criterion: \"gini\",\n})\nrf.Fit(xTrain, yTrain)\npredictions := rf.Predict(xTest)\nfmt.Println(\"Accuracy:\", ml.Accuracy(yTest, predictions))\nfmt.Println(\"OOB Score:\", rf.OOBScore(xTrain, yTrain))\nfmt.Println(\"Importance:\", rf.FeatureImportance())\n```\n\n### Preprocessing \u0026 Evaluation\n\n```go\n// Scale features\nxScaled := ml.MinMaxScale(xTrain)\n\n// Cross validation\nfolds := ml.KFoldSplit(x, y, 5)\nfor _, fold := range folds {\n    model.Fit(fold.XTrain, fold.YTrain)\n    pred := model.Predict(fold.XTest)\n    fmt.Println(\"Fold R²:\", ml.R2Score(fold.YTest, pred))\n}\n\n// Full metrics\nfmt.Println(\"Accuracy:\", ml.Accuracy(yTrue, yPred))\nfmt.Println(\"Precision:\", ml.Precision(yTrue, yPred, 1.0))\nfmt.Println(\"Recall:\", ml.Recall(yTrue, yPred, 1.0))\nfmt.Println(\"Confusion:\", ml.ConfusionMatrix(yTrue, yPred))\n```\n\n### Load Dataset from CSV\n\n```go\ndataset, err := ml.LoadCSV(\"data.csv\", 4) // target is column 4\nxTrain, xTest, yTrain, yTest := dataset.Split(0.8)\n```\n\n## Roadmap\n\n| Version | What | Status |\n|---------|------|--------|\n| **v0.1.0** | Core utilities — 8 packages, 47 tests, zero deps | **Done** |\n| **v0.5.0** | Classic ML — 6 algorithms, preprocessing, metrics, cross-validation | **Done** |\n| **v1.1.0** | Full ML — 7 algorithms, benchmarks, encoding, tree viz, examples | **Done** |\n| **v2.0.0** | Random Forest, SVM, PCA, ensemble methods | Planned |\n\n## Design Principles\n\n1. **Zero dependencies** — If it can be done with stdlib, it will be\n2. **Generics everywhere** — Type safety is not optional\n3. **No silent failures** — Functions return `(value, error)`, not zero values\n4. **Pipeline-ready** — Every function works with slices and streams\n5. 
**Documentation-driven** — If it's not documented, it doesn't exist\n\n## Why Datatrax over existing Go ML libs?\n\n| Library | Status | How Datatrax compares |\n|---------|--------|----------------------|\n| goml | Abandoned (2019) | Active, modern Go 1.21+ with generics |\n| golearn | Abandoned (2020) | Simpler API, batteries included |\n| gorgonia | Active but complex | scikit-learn-simple, not TensorFlow-complex |\n| sajari/regression | Regression only | Full toolkit: utilities + ML + preprocessing |\n\nDatatrax is NOT competing with deep learning frameworks. It's the **scikit-learn of Go** — classic ML with a clean API, plus data engineering utilities that no other Go ML lib offers.\n\n## Contributing\n\nContributions are welcome! Please:\n\n1. Fork the repo\n2. Create a feature branch (`git checkout -b feat/amazing-feature`)\n3. Write tests for your changes\n4. Ensure `go test -race ./...` passes\n5. Open a PR\n\n## License\n\n[MIT](LICENSE) — Robson Bayer Müller, 2026\n","funding_links":[],"categories":["Machine Learning"],"sub_categories":["Search and Analytic Databases"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frbmuller%2Fdatatrax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frbmuller%2Fdatatrax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frbmuller%2Fdatatrax/lists"}