https://github.com/ptiger10/pd
A fast, tested, and predictable way to clean, aggregate, and transform data
- Host: GitHub
- URL: https://github.com/ptiger10/pd
- Owner: ptiger10
- License: mit
- Created: 2019-04-15T20:39:15.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2019-07-20T03:00:20.000Z (over 6 years ago)
- Last Synced: 2025-08-15T16:16:27.009Z (6 months ago)
- Topics: analytics, data, go, spreadsheet
- Language: Go
- Homepage:
- Size: 6.79 MB
- Stars: 35
- Watchers: 4
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# pd
[Go Report Card](https://goreportcard.com/report/github.com/ptiger10/pd)
[GoDoc](https://godoc.org/github.com/ptiger10/pd)
[Travis CI build](https://travis-ci.org/ptiger10/pd)
[Codecov coverage](https://codecov.io/gh/ptiger10/pd)
[MIT License](https://opensource.org/licenses/MIT)
pd (informally known as "GoPandas") is a library for cleaning, aggregating, and transforming data using Series and DataFrames. GoPandas combines a flexible API familiar to Python pandas users with the qualities of Go, including type safety, predictable error handling, and fast concurrent processing.
The API is still version 0 and subject to major revisions. Use in production code at your own risk.
Some notable features of GoPandas:
* flexible constructor that supports float, int, string, bool, time.Time, and interface Series
* seamlessly handles null data and type conversions
* well-suited to either the Jupyter notebook style of data exploration or conventional programming
* advanced filtering, grouping, and pivoting
* hierarchical indexing (i.e., multi-level indexes and columns)
* reads from CSV or from any spreadsheet or tabular data structured as `[][]interface{}` (e.g., Google Sheets)
* complete test coverage
* minimal dependencies (total package size is <10MB, compared to pandas at >200MB)
* uses concurrent processing to run faster than pandas on many fundamental operations, with a performance gap that widens at scale (more than 6x faster when summing two columns in a 500k-row spreadsheet; see the most recent [benchmarking table](benchmarking/profiler/comparison_summary.txt))
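
To make the null-handling, `[][]interface{}`, and concurrency points above more concrete, here is a minimal standard-library sketch of summing two columns across goroutines. It does not use the pd API and is not how pd is implemented internally. The names `toFloat` and `sumColumns`, the chunking strategy, and the rule that non-numeric cells count as null are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// toFloat converts a spreadsheet-style cell to float64.
// Hypothetical helper: the second return value is false for cells that this
// sketch treats as null (nil, strings, and any other non-numeric type).
func toFloat(cell interface{}) (float64, bool) {
	switch v := cell.(type) {
	case float64:
		return v, true
	case int:
		return float64(v), true
	default:
		return 0, false
	}
}

// sumColumns adds columns a and b row by row. Rows are split into chunks and
// each chunk is handled by its own goroutine. A result is nil when either
// cell is null or the row is too short to contain both columns.
func sumColumns(rows [][]interface{}, a, b, workers int) []interface{} {
	out := make([]interface{}, len(rows))
	if workers < 1 {
		workers = 1
	}
	chunk := (len(rows) + workers - 1) / workers // ceiling division
	var wg sync.WaitGroup
	for start := 0; start < len(rows); start += chunk {
		end := start + chunk
		if end > len(rows) {
			end = len(rows)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				row := rows[i]
				if a >= len(row) || b >= len(row) {
					continue // short row: leave the result as nil
				}
				x, okX := toFloat(row[a])
				y, okY := toFloat(row[b])
				if okX && okY {
					out[i] = x + y // each goroutine writes only to its own indexes
				}
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	// Tabular data shaped as [][]interface{}, as a spreadsheet API might return it.
	rows := [][]interface{}{
		{1, 2.5},
		{3, nil},
		{"n/a", 4},
		{10, 20},
	}
	fmt.Println(sumColumns(rows, 0, 1, 2))
}
```

Because every goroutine writes to a disjoint range of `out`, the sketch needs no mutex. Real speedups depend on data size and on how many cores are available.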
## Getting Started
Check out the Jupyter notebook examples in the [guides](https://github.com/ptiger10/pd/tree/master/guides). GitHub sometimes has trouble rendering .ipynb files, so backup views are available here: [Series](https://nbviewer.jupyter.org/github/ptiger10/pd/blob/master/guides/Series.ipynb?flush_cache=true), [DataFrame](https://nbviewer.jupyter.org/github/ptiger10/pd/blob/master/guides/DataFrame.ipynb?flush_cache=true), [Options](https://nbviewer.jupyter.org/github/ptiger10/pd/blob/master/guides/Options.ipynb?flush_cache=true).
To run the Jupyter notebooks yourself, I recommend lgo (Docker required):
* `cd guides/docker`
* start: `./up.sh`
* stop: `./down.sh`
* rebuild package to newest version: `./up.sh -r`
## Replicating Benchmark Tests
* Requires Python 3.x and pandas
* Download the test data from [pdTestData](https://github.com/ptiger10/pdTestData) and save it in `benchmarking/profiler`
* `go run -tags=benchmarks benchmarking/profiler/main.go`