{"id":20285448,"url":"https://github.com/mramshaw/pickles","last_synced_at":"2026-05-05T00:31:53.003Z","repository":{"id":92905439,"uuid":"155597487","full_name":"mramshaw/Pickles","owner":"mramshaw","description":"Experiments in pickling (data serialization) in Golang and Python","archived":false,"fork":false,"pushed_at":"2019-01-20T02:27:14.000Z","size":156,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-14T08:12:10.720Z","etag":null,"topics":["go","golang","machine-learning","pickle","picklers","pickles","protobuf","protobuf3","python","python3"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mramshaw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-31T17:26:12.000Z","updated_at":"2020-12-11T03:31:31.000Z","dependencies_parsed_at":"2023-04-13T05:01:15.328Z","dependency_job_id":null,"html_url":"https://github.com/mramshaw/Pickles","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FPickles","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FPickles/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FPickles/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FPickles/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/m
ramshaw","download_url":"https://codeload.github.com/mramshaw/Pickles/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241780465,"owners_count":20019058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","golang","machine-learning","pickle","picklers","pickles","protobuf","protobuf3","python","python3"],"created_at":"2024-11-14T14:26:44.944Z","updated_at":"2026-05-05T00:31:52.958Z","avatar_url":"https://github.com/mramshaw.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pickles\n\nExperiments in pickling (data serialization) in Golang and Python.\n\n## Motivation\n\nParsing input data for machine learning is a time-intensive process.\n\nIt's a recommended practice to serialize parsed data (pickle it) so as\nto reduce the overhead of continually processing input data - especially\nfor static datasets.\n\nFor a number of reasons, machine learning is normally carried out\nin Python. 
But as I was trying out\n[Golang for machine learning](http://github.com/mramshaw/gophernet) \nit seemed like a good idea to look into pickling and whether or not\nit could be done in a language-agnostic way.\n\n## Python options\n\nBasically, for Python there's `pickle`:\n\n    http://docs.python.org/3/library/pickle.html\n\nThere are other options but the use of `pickle` is pretty widespread.\n\nIf the intention is a solution that is language-agnostic, it\nis probably a good idea (for compatibility reasons) to avoid\nspecifying `protocol=pickle.HIGHEST_PROTOCOL` as this may result\nin the use of an unsupported format. That said, it is probably\na good idea to use the highest protocol that the other language can read (2 or 3)\nas they seem to support more datatypes than earlier versions.\n\nNote that the __default__ protocol is __3__ (a Python 3 format) for\nPython 3.0 to 3.7; from Python 3.8 the default is __4__.\n\nIf performance is a concern (when is it not?) then in Python 2 there is\n`cPickle` (Python 3's `pickle` uses the C implementation automatically\nwhen it is available):\n\n    http://pymotw.com/2/pickle/\n\n\u003e The `cPickle` module implements the same algorithm, in C instead of Python. It is many times faster than the Python implementation, but does not allow the user to subclass from Pickle. 
If subclassing is not important for your use, you probably want to use cPickle.\n\n#### Other options\n\nOther options are as follows:\n\n* [msgpack](http://pypi.org/project/msgpack-python/)\n* [HDF5](http://docs.h5py.org/en/latest/quick.html)\n* [dill](http://pypi.org/project/dill/)\n* [cloudpickle](http://pypi.org/project/cloudpickle/)\n* [anycache](http://pypi.org/project/anycache/)\n\n## Golang options\n\nThere are options for storing data in Golang, listed below.\n\n#### Golang-only options\n\nFor storing binary data with Golang:\n\n    http://golang.org/pkg/encoding/gob/\n\nI would be very surprised if this format supported compression well.\n\nIt is probably possible to compress the binary data, but binary data\ngenerally does not compress well.\n\n#### Pickles options\n\nFor reading \u0026 writing pickled data with Golang there is `ogórek`:\n\n    http://godoc.org/github.com/kisielk/og-rek\n\nAccording to the docs, it is safer than reading pickled data with Python:\n\n\u003e In particular on Go side it is thus by default safe to decode pickles from untrusted sources(^).\n\nAs `ogórek` supports Protocol 3 (the Python 3 variety), as well as being able to ___write___ pickles, it is probably the option of choice:\n\n```Golang\ne := ogórek.NewEncoderWithConfig(w, \u0026ogórek.EncoderConfig{\n\tProtocol: 3,\n})\nerr := e.Encode(obj)\n```\n\nOf course, for reading pickled data with Golang there is also `stalecucumber`:\n\n    http://godoc.org/github.com/hydrogen18/stalecucumber\n\nNote that `stalecucumber` only supports Python 2 pickle formats:\n\n\u003e Protocols 0,1,2 are implemented. These are the versions written by the Python 2.x series. Python 3 defines newer protocol versions, but can write the older protocol versions so they are readable by this package.\n\nAs far as I can tell, the higher the protocol number, the\nmore compact the encoding (`pickle` itself applies no compression). For the most compact output, it\nis probably necessary to use Python 3. 
Likewise, higher\nversions generally feature more binary-encoded data.\n\nThere are examples on the GitHub repo:\n\n    http://github.com/hydrogen18/stalecucumber\n\nThere is a good writeup on this package here:\n\n    http://www.hydrogen18.com/blog/reading-pickled-data-in-go.html\n\nIf you are planning on using `pickle` this article is well worth a read\nas it gives useful information on the internals of the `pickle` format.\nNote however the following:\n\n\u003e In the future, I plan on adding to the library the writing of pickled objects from Go.\n\n## Language-agnostic options\n\nGoogle developed `protobuf` which supports a number of languages (among them, Python and Go):\n\n    http://developers.google.com/protocol-buffers/\n\nThey describe protobufs as:\n\n\u003e a language-neutral, platform-neutral, extensible way of serializing structured data for use\n\u003e in communications protocols, data storage, and more\n\nCheck out my [protobuf experiments with Python and Go](http://github.com/mramshaw/protobufs) repo.\n\nProbably the best approach is to use the `pickle` format.\n\nOf course, for a truly language-agnostic option, there is [Apache Arrow](http://arrow.apache.org/):\n\n![Apache Arrow](images/Apache_Arrow.png)\n\n[Graphic stolen from the Apache Arrow website]\n\nFor a summary of its advantages, there is this blog post from Wes McKinney:\n\n    http://wesmckinney.com/blog/pandas-and-apache-arrow/\n\n[Wes McKinney describes himself as the creator of Pandas and Ibis, and an Apache Arrow committer]\n\nThe general consensus on Apache Arrow seems to be that it obviates\nthe expensive serialization/de-serialization overhead found with\nother options.\n\nIt also enables interoperability between Python and R.\n\nFrom the launch release:\n\n\u003eA high-performance cross-system data layer for columnar in-memory analytics, Apache Arrow provides the following benefits for Big Data workloads:\n\u003e\n\u003e-  Accelerates the performance of analytical workloads by 
more than 100x in some cases\n\u003e-  Enables multi-system workloads by eliminating cross-system communication overhead\n\n    http://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces87\n\nAs contrasted with other systems (from a [blog post by Cloudera](http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/)):\n\n\u003e Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.\n\nFor a deep dive into Apache Arrow, this post is worth a read:\n\n    http://www.dremio.com/origin-history-of-apache-arrow/\n\n## To Do\n\n- [ ] Investigate Golang serialization formats (gob)\n- [ ] Investigate whether or not compression is a good idea (probably)\n- [x] Investigate `protobuf` support in Python (see my [protobufs](http://github.com/mramshaw/protobufs) repo)\n- [ ] Investigate [Apache Arrow](http://github.com/apache/arrow)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fpickles","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmramshaw%2Fpickles","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fpickles/lists"}