{"id":13413351,"url":"https://github.com/nikolaydubina/go-featureprocessing","last_synced_at":"2025-10-10T03:30:42.105Z","repository":{"id":38327381,"uuid":"322598448","full_name":"nikolaydubina/go-featureprocessing","owner":"nikolaydubina","description":"🔥 Fast, simple sklearn-like feature processing for Go","archived":false,"fork":false,"pushed_at":"2024-11-25T03:11:32.000Z","size":1420,"stargazers_count":121,"open_issues_count":4,"forks_count":8,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-20T07:07:32.820Z","etag":null,"topics":["feature-engineering","go","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nikolaydubina.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"nikolaydubina"}},"created_at":"2020-12-18T13:09:18.000Z","updated_at":"2024-11-25T03:11:35.000Z","dependencies_parsed_at":"2024-12-15T19:11:33.703Z","dependency_job_id":"8e36410f-aafc-4e95-b9f2-d5c3286ff82b","html_url":"https://github.com/nikolaydubina/go-featureprocessing","commit_stats":{"total_commits":87,"total_committers":2,"mean_commits":43.5,"dds":"0.11494252873563215","last_synced_commit":"65da480e931adefbfbe6581eb0046df8f53f4c32"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikolaydubina%2Fgo-featureprocessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikolaydubina%
2Fgo-featureprocessing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikolaydubina%2Fgo-featureprocessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nikolaydubina%2Fgo-featureprocessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nikolaydubina","download_url":"https://codeload.github.com/nikolaydubina/go-featureprocessing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235907288,"owners_count":19064177,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["feature-engineering","go","machine-learning"],"created_at":"2024-07-30T20:01:38.399Z","updated_at":"2025-10-10T03:30:36.664Z","avatar_url":"https://github.com/nikolaydubina.png","language":"Go","readme":"# go-featureprocessing\n\n[![Tests](https://github.com/nikolaydubina/go-featureprocessing/workflows/Tests/badge.svg)](https://github.com/nikolaydubina/go-featureprocessing/workflows/Tests/badge.svg)\n[![Go Report Card](https://goreportcard.com/badge/github.com/nikolaydubina/go-featureprocessing)](https://goreportcard.com/report/github.com/nikolaydubina/go-featureprocessing)\n[![codecov](https://codecov.io/gh/nikolaydubina/go-featureprocessing/branch/main/graph/badge.svg?token=02QNME4TNT)](https://codecov.io/gh/nikolaydubina/go-featureprocessing)\n[![Go Reference](https://pkg.go.dev/badge/github.com/nikolaydubina/go-featureprocessing.svg)](https://pkg.go.dev/github.com/nikolaydubina/go-featureprocessing)\n[![Mentioned in Awesome 
Go](https://awesome.re/mentioned-badge.svg)](https://github.com/avelino/awesome-go)\n[![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/nikolaydubina/go-featureprocessing/badge)](https://securityscorecards.dev/viewer/?uri=github.com/nikolaydubina/go-featureprocessing)\n\n[Fast](https://github.com/nikolaydubina/go-ml-benchmarks), simple [sklearn](https://scikit-learn.org/stable/modules/preprocessing.html)-like feature processing for Go\n\n- [x] Does not cross `cgo` boundary\n- [x] No memory allocation\n- [x] No reflection\n- [x] Convenient serialization\n- [x] Generated code has 100% test coverage and benchmarks\n- [x] Fitting\n- [x] UTF-8\n- [x] Parallel batch transform\n- [x] Faster than sklearn in batch mode\n\n```go\n//go:generate go run github.com/nikolaydubina/go-featureprocessing/cmd/generate -struct=Employee\n\ntype Employee struct {\n\tAge         int     `feature:\"identity\"`\n\tSalary      float64 `feature:\"minmax\"`\n\tKids        int     `feature:\"maxabs\"`\n\tWeight      float64 `feature:\"standard\"`\n\tHeight      float64 `feature:\"quantile\"`\n\tCity        string  `feature:\"onehot\"`\n\tCar         string  `feature:\"ordinal\"`\n\tIncome      float64 `feature:\"kbins\"`\n\tDescription string  `feature:\"tfidf\"`\n\tSecretValue float64\n}\n```\n\nThe code above generates a new struct, as well as _benchmarks_ and _tests_ using [google/gofuzz](https://github.com/google/gofuzz).\n```go\nemployee := Employee{\n   Age:         22,\n   Salary:      1000.0,\n   Kids:        2,\n   Weight:      85.1,\n   Height:      160.0,\n   City:        \"Pangyo\",\n   Car:         \"Tesla\",\n   Income:      9000.1,\n   SecretValue: 42,\n   Description: \"large text fields is not a problem neither, tf-idf can help here too! 
more advanced NLP will be added later!\",\n}\n\nvar fp EmployeeFeatureTransformer\n\n// load a previously serialized transformer configuration\nconfig, _ := ioutil.ReadFile(\"employee_feature_processor.json\")\njson.Unmarshal(config, \u0026fp)\n\nfeatures := fp.Transform(\u0026employee)\n// []float64{22, 1, 0.5, 1.0039999999999998, 1, 1, 0, 0, 0, 1, 5, 0.7674945674619879, 0.4532946552278861, 0.4532946552278861}\n\nnames := fp.FeatureNames()\n// []string{\"Age\", \"Salary\", \"Kids\", \"Weight\", \"Height\", \"City_Pangyo\", \"City_Seoul\", \"City_Daejeon\", \"City_Busan\", \"Car\", \"Income\", \"Description_text\", \"Description_problem\", \"Description_help\"}\n```\n\nYou can also fit the transformer on data:\n```go\nfp := EmployeeFeatureTransformer{}\nfp.Fit([]Employee{...})\n\n// serialize the fitted transformer\nconfig, _ := json.Marshal(fp)\n_ = ioutil.WriteFile(\"employee_feature_processor.json\", config, 0644)\n```\n\nThe transformer can be serialized and de-serialized with standard Go routines.\nThe serialized transformer is easy to read, update, and integrate with other tools.\n```json\n{\n   \"Age_identity\": {},\n   \"Salary_minmax\": {\"Min\": 500, \"Max\": 900},\n   \"Kids_maxabs\": {\"Max\": 4},\n   \"Weight_standard\": {\"Mean\": 60, \"STD\": 25},\n   \"Height_quantile\": {\"Quantiles\": [20, 100, 110, 120, 150]},\n   \"City_onehot\": {\"Mapping\": {\"Pangyo\": 0, \"Seoul\": 1, \"Daejeon\": 2, \"Busan\": 3}},\n   \"Car_ordinal\": {\"Mapping\": {\"BMW\": 90000, \"Tesla\": 1}},\n   \"Income_kbins\": {\"Quantiles\": [1000, 1100, 2000, 3000, 10000]},\n   \"Description_tfidf\": {\n      \"Mapping\": {\"help\": 2, \"problem\": 1, \"text\": 0},\n      \"Separator\": \" \",\n      \"DocCount\": [1, 2, 2],\n      \"NumDocuments\": 2,\n      \"Normalizer\": {}\n   }\n}\n```\n\nOr you can initialize it manually.\n```go\nfp := EmployeeFeatureTransformer{\n   Salary: MinMaxScaler{Min: 500, Max: 900},\n   Kids:   MaxAbsScaler{Max: 4},\n   Weight: StandardScaler{Mean: 60, STD: 25},\n   Height: QuantileScaler{Quantiles: []float64{20, 100, 110, 120, 150}},\n   
City:   OneHotEncoder{Mapping: map[string]uint{\"Pangyo\": 0, \"Seoul\": 1, \"Daejeon\": 2, \"Busan\": 3}},\n   Car:    OrdinalEncoder{Mapping: map[string]uint{\"Tesla\": 1, \"BMW\": 90000}},\n   Income: KBinsDiscretizer{QuantileScaler: QuantileScaler{Quantiles: []float64{1000, 1100, 2000, 3000, 10000}}},\n   Description: TFIDFVectorizer{\n      NumDocuments:    2,\n      DocCount:        []uint{1, 2, 2},\n      CountVectorizer: CountVectorizer{Mapping: map[string]uint{\"text\": 0, \"problem\": 1, \"help\": 2}, Separator: \" \"},\n   },\n}\n```\n\n### Benchmarks\n\nFor typical use, with this struct encoder you can get ~100ns processing time for a single sample. How fast you need to get? Here are some numbers:\n\n```\n                       0 - C++ FlatBuffers decode\n                     ...\n                   200ps - 4.6GHz single cycle time\n                1ns      - L1 cache latency\n               10ns      - L2/L3 cache SRAM latency\n               20ns      - DDR4 CAS, first byte from memory latency\n               20ns      - C++ raw hardcoded structs access\n               80ns      - C++ FlatBuffers decode/traverse/dealloc\n ----------\u003e  100ns      - go-featureprocessing typical processing\n              150ns      - PCIe bus latency\n              171ns      - Go cgo call boundary, 2015\n              200ns      - some High Frequency Trading FPGA claims\n              800ns      - Go Protocol Buffers Marshal\n              837ns      - Go json-iterator/go json decode\n           1µs           - Go Protocol Buffers Unmarshal\n           1µs           - High Frequency Trading FPGA\n           3µs           - Go JSON Marshal\n           7µs           - Go JSON Unmarshal\n           9µs           - Go XML Marshal\n          10µs           - PCIe/NVLink startup time\n          17µs           - Python JSON encode or decode times\n          30µs           - UNIX domain socket, eventfd, fifo pipes latency\n          30µs           - Go XML Unmarshal\n      
   100µs           - Redis intrinsic latency\n         100µs           - AWS DynamoDB + DAX\n         100µs           - KDB+ queries\n         100µs           - High Frequency Trading direct market access range\n         200µs           - 1GB/s network air latency\n         200µs           - Go garbage collector latency 2018\n         500µs           - NGINX/Kong added latency\n     10ms                - AWS DynamoDB\n     10ms                - WIFI6 \"air\" latency\n     15ms                - AWS Sagemaker latency\n     30ms                - 5G \"air\" latency\n    100ms                - typical roundtrip from mobile to backend\n    200ms                - AWS RDS MySQL/PostgreSQL or AWS Aurora\n 10s                     - AWS Cloudfront 1MB transfer time\n```\n\nThis is significantly faster than sklearn, or calling sklearn from Go, for few samples.\nAnd it performs similarly or faster than sklearn for large number of samples.\n![bench_log](docs/bench_log.png)\n![bench_lin](docs/bench_lin.png)\n\nFor full benchmarks go to `/docs/benchmarks`, some extract for typical struct: \n```\ngoos: darwin\ngoarch: amd64\npkg: github.com/nikolaydubina/go-featureprocessing/cmd/generate/tests\nBenchmarkEmployeeFeatureTransformer_Transform-8                                  \t62135674\t        206 ns/op\t       208 B/op\t       1 allocs/op\nBenchmarkEmployeeFeatureTransformer_Transform_Inplace-8                          \t89993084\t        123 ns/op\t         0 B/op\t       0 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_10elems-8                       \t 5921253\t       1881 ns/op\t      2048 B/op\t       1 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_100elems-8                      \t  528890\t      20532 ns/op\t     21760 B/op\t       1 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_1000elems-8                     \t   53524\t     238542 ns/op\t    221185 B/op\t       1 
allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_10000elems-8                    \t    4879\t    2267683 ns/op\t   2007048 B/op\t       1 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_100000elems-8                   \t     475\t   23257147 ns/op\t  20004876 B/op\t       1 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_1000000elems-8                  \t      46\t  284763749 ns/op\t 192004098 B/op\t       1 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_10elems_8workers-8              \t 1552704\t       7362 ns/op\t      2064 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_100elems_8workers-8             \t  412455\t      29814 ns/op\t     21776 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_1000elems_8workers-8            \t   63822\t     177183 ns/op\t    213008 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_10000elems_8workers-8           \t    8704\t    1505994 ns/op\t   2162707 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_100000elems_8workers-8          \t     800\t   15840396 ns/op\t  21602323 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_1000000elems_8workers-8         \t      72\t  139700740 ns/op\t 192004112 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_5000000elems_8workers-8         \t       9\t 1720488586 ns/op       1040007184 B/op\t       2 allocs/op\nBenchmarkEmployeeFeatureTransformer_TransformAll_15000000elems_8workers-8        \t       1\t14009776007 ns/op       3240001552 B/op\t       2 allocs/op\n```\n\n### [beta] Reflection based version\n\nIf you can't use the `go:generate` version, you can try the reflection-based version.\nNote that the reflection version introduces overhead that is particularly noticeable if your struct has many fields.\nYou would get a ~2x time increase for a struct with large composite transformers. 
\nAnd you would get a ~20x time increase for a struct with 32 fields.\nNote that some features, such as serialization and de-serialization, are not supported yet.\n\nBenchmarks:\n```bash\ngo test -timeout=1h -bench=. -benchtime=10s -benchmem ./...\n```\n\n```\ngoos: darwin\ngoarch: amd64\n\n// reflection\npkg: github.com/nikolaydubina/go-featureprocessing/structtransformer\nBenchmarkStructTransformerTransform_32fields-4                           1732573              2079 ns/op             512 B/op          2 allocs/op\n\n// non-reflection\npkg: github.com/nikolaydubina/go-featureprocessing/cmd/generate/tests\nBenchmarkWith32FieldsFeatureTransformer_Transform-8                     31678317\t       116 ns/op\t     256 B/op\t       1 allocs/op\nBenchmarkWith32FieldsFeatureTransformer_Transform_Inplace-8           \t80729049\t        43 ns/op\t       0 B/op\t       0 allocs/op\n```\n\n### Profiling\n\nProfiling the benchmarks for a struct with 32 fields shows that the reflection version takes much longer and spends its time in what looks like reflection-related code.\nMeanwhile, the `go:generate` version is fast enough to be comparable to the testing routines themselves and spends 50% of its time allocating the single output slice, which is good because it means memory access is the bottleneck.\nFlamegraphs were produced from pprof output by https://www.speedscope.app/.\n\nGenerate the profiles with:\n```bash\nmkdir -p docs/benchmark_profiles\ngo test -bench=BenchmarkWith32FieldsFeatureTransformer_Transform -benchtime=3s -benchmem -memprofile docs/benchmark_profiles/codegen_transform_mem.profile -cpuprofile docs/benchmark_profiles/codegen_transform_cpu.profile ./cmd/generate/tests\ngo test -bench=BenchmarkStructTransformer_Transform_32fields -benchtime=3s -benchmem -memprofile docs/benchmark_profiles/reflect_transform_mem.profile -cpuprofile docs/benchmark_profiles/reflect_transform_cpu.profile 
./structtransformer\n```\n\ngencode:\n![gencode](docs/codegen_transform_cpu_profile.png)\n![gencode_selected](docs/codegen_transform_cpu_profile_selected.png)\n\nreflect:\n![reflect](docs/reflect_transform_cpu_profile.png)\n      \n### Reference\n\n- https://dave.cheney.net/2016/01/18/cgo-is-not-go\n- https://github.com/json-iterator/go\n- https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/go.html\n- https://github.com/shmuelamar/python-serialization-benchmarks\n- https://shijuvar.medium.com/benchmarking-protocol-buffers-json-and-xml-in-go-57fa89b8525\n- https://gist.github.com/shijuvar/25ad7de9505232c87034b8359543404a#file-order_test-go\n- https://google.github.io/flatbuffers/flatbuffers_benchmarks.html\n- https://www.cockroachlabs.com/blog/the-cost-and-complexity-of-cgo/\n- https://en.wikipedia.org/wiki/CAS_latency\n","funding_links":["https://github.com/sponsors/nikolaydubina"],"categories":["Machine Learning","AutoML","Go","机器学习","Relational Databases"],"sub_categories":["Search and Analytic Databases","[Tools](#tools-1)","Advanced Console UIs","检索及分析资料库"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikolaydubina%2Fgo-featureprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnikolaydubina%2Fgo-featureprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikolaydubina%2Fgo-featureprocessing/lists"}