{"id":18267713,"url":"https://github.com/maebert/knyfe","last_synced_at":"2026-03-01T16:04:38.435Z","repository":{"id":3019527,"uuid":"4039045","full_name":"maebert/knyfe","owner":"maebert","description":"knyfe is a python utility for rapid exploration of datasets.","archived":false,"fork":false,"pushed_at":"2015-04-03T01:41:29.000Z","size":405,"stargazers_count":54,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"gh-pages","last_synced_at":"2026-02-05T11:27:44.051Z","etag":null,"topics":["datascience","dataset","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maebert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-04-16T08:49:24.000Z","updated_at":"2023-10-17T04:58:47.000Z","dependencies_parsed_at":"2022-08-26T03:02:04.381Z","dependency_job_id":null,"html_url":"https://github.com/maebert/knyfe","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/maebert/knyfe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maebert%2Fknyfe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maebert%2Fknyfe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maebert%2Fknyfe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maebert%2Fknyfe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maebert","download_url":"https://codeload.github.com/maebert/knyfe/tar.gz/refs/heads/gh-pages","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/maebert%2Fknyfe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29974336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T15:41:30.362Z","status":"ssl_error","status_checked_at":"2026-03-01T15:37:07.343Z","response_time":124,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datascience","dataset","python"],"created_at":"2024-11-05T11:28:36.313Z","updated_at":"2026-03-01T16:04:38.415Z","avatar_url":"https://github.com/maebert.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"What is knyfe?\n==============\n\nknyfe is a python utility for rapid exploration of datasets. Use it when you have some kind of dataset and you want to get a feel for how it is composed, run some simple tests on it, or prepare it for further processing. The great thing about knyfe is that you don't have to know much about how your dataset is designed. You shouldn't have to remember which variable resides in which column of your data matrix or how your `structs` are nested. Just get shit done.\n\n![knyfe in an iPython shell](http://maebert.github.com/knyfe/img/interactive.png)\n\nQuickstart\n----------\n\nknyfe is awesome on its own, but it's really good friends with the [iPython](http://ipython.org/) console. 
Just fire it up with `ipython qtconsole --pylab=inline` and get rockin':\n\n    \u003e\u003e\u003e cereals = knyfe.Data(\"examples/cereals.json\")\n    \u003e\u003e\u003e print cereals.summary\n\n    Unnamed Dataset (75 samples)\n    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''\n    rating       : 18.04 - 93.70         Mean: 42.59 +- 14.05   \n    potass       : 15.00 - 330.00        Mean: 99.25 +- 70.74   (missing in 2 samples)\n    fiber        : 0.00 - 320.00         Mean: 161.27 +- 82.20  \n    vitamins     : 0.00 - 100.00         Mean: 28.33 +- 22.48   \n    name         : [Mueslix Crispy ...]                         \n    weight       : 0.50 - 1.50           Mean: 1.03 +- 0.15     \n    sodium       : 0.00 - 5.00           Mean: 1.01 +- 1.01     \n    shelf        : 1 - 3                                        \n    sugars       : 5.00 - 23.00          Mean: 14.77 +- 3.93    (missing in 1 samples)\n    calories     : 50 - 160                                     \n    fat          : 1.00 - 6.00           Mean: 2.53 +- 1.09     \n    protein      : 1.00 - 6.00           Mean: 2.53 +- 1.09     \n    cups         : 0.25 - 1.50           Mean: 0.82 +- 0.23     \n    type         : [cold, hot]                                  \n    carbo        : 0.00 - 14.00          Mean: 2.20 +- 2.38     \n    manufacturer : [Kelloggs, Nabis...]                         
\n    ==================================================================================\n\n    \u003e\u003e\u003e print set(cereals.manufacturer)\n    set(['Kelloggs', 'Nabisco', 'Ralston Purina', 'Quaker Oats', 'Post', 'General Mills'])\n    \u003e\u003e\u003e kellogs_products = cereals.filter(manufacturer=\"Kelloggs\")\n    \u003e\u003e\u003e hist(kellogs_products.sugars)\n\n![Histogram of Kellogg's Cereals sugar](http://maebert.github.com/knyfe/img/kellogs-sugar-hist.png)\n\n    \u003e\u003e\u003e kellogs_products.export(\"kellogs.xls\")\n\nLoading Data\n------------\n\nData objects can be created using\n\n* strings, interpreted as paths (or glob patterns) to JSON files\n* dictionaries, interpreted as single samples\n* lists of dictionaries\n* other Data instances\n\nSo any of these will work:\n\n    cereals = knyfe.Data(\"examples/cereals.json\")\n    all_examples = knyfe.Data(\"examples/*.json\")\n    bruce = knyfe.Data({\"name\": \"Bruce Schneier\", \"awesomeness\": 8.7})\n    people = knyfe.Data([\n      {\"name\": \"Justin Bieber\", \"awesomeness\": 1.3}, \n      {\"name\": \"Nikola Tesla\", \"awesomeness\": 9.8}\n    ])\n    copy_of_singleton = knyfe.Data(bruce)\n\nExploring Data\n--------------\n\nAt any time, you can print the `summary` of a data set to get a quick peek into what's inside:\n\n    \u003e\u003e\u003e print people.summary\n    Unnamed Dataset (2 samples)\n    ''''''''''''''''''''''''''''''''''''''''''''''''''''''\n    awesomeness : 1.30 - 9.80          Mean: 5.55 +- 4.25     \n    name        : [Nikola Tesla, ...]                        
\n    ======================================================\n\n`attributes` will give you all attributes in a dataset:\n\n    \u003e\u003e\u003e print people.attributes\n    set(['awesomeness', 'name'])\n\nYou can access the values of an attribute using the `get` method, or the shorthand `.`-notation:\n\n    \u003e\u003e\u003e print people.get(\"awesomeness\")\n    [ 1.3,  9.8]\n    \u003e\u003e\u003e print people.awesomeness\n    [ 1.3,  9.8]\n\nNote that while `get` works on any attribute, the dot-notation requires attribute names to be valid Python identifiers. In either case, the values are returned as a `numpy` array. If some samples are missing the attribute, the returned array will be shorter than the data set itself. You can tell `get` to replace missing values, though:\n\n    \u003e\u003e\u003e people += {\"name\": \"The Yeti\"}\n    \u003e\u003e\u003e print people.get(\"awesomeness\")\n    [ 1.3,  9.8]\n    \u003e\u003e\u003e people.get(\"awesomeness\", missing=NaN)\n    [ 1.3,  9.8, nan]\n\nManipulating Data\n-----------------\n\n### Adding Data, Unions and Differences\n\nThe `+` and `-` operators work as expected:\n\n    \u003e\u003e\u003e yeti = {\"name\": \"The Yeti\"}\n    \u003e\u003e\u003e people += yeti                   # Adds 1 sample to people (now 3)\n    \u003e\u003e\u003e more_people = people + bruce     # Creates new Dataset with 4 samples\n    \u003e\u003e\u003e real_people = more_people - yeti # Creates new Dataset with Bruce, Nikola and Justin\n\n### Filtering\n\nBut the real awesomeness happens in `filter`. Back to our cereals:\n\n    \u003e\u003e\u003e cereals.filter(manufacturer=\"Kelloggs\")\n\nwill return a data set with only those samples from `cereals` where `manufacturer` is `Kelloggs`. 
\n\n    \u003e\u003e\u003e cereals.filter(shelf=(2,3))\n\nwill get all cereals with `shelf` being _either_ `2` or `3`, and \n\n    \u003e\u003e\u003e cereals.filter(\"sugars\")\n\nwill get all samples where the `sugars` attribute is present and does not evaluate to `False` (i.e. is not `NaN` or `0`). You can also filter by an array of booleans, which is very handy for situations like this:\n\n    \u003e\u003e\u003e cereals.filter(cereals.calories \u003e 60)\n\nNote that in this case `cereals.calories` must not have any missing values, because otherwise `cereals.calories \u003e 60` would be shorter than the data set itself. In such a case, you can use `cereals.get(\"calories\", missing=NaN) \u003e 60` (samples with `calories` missing will not be part of the filtered dataset this way). You can also pass an arbitrary function as a filter:\n\n    \u003e\u003e\u003e cereals.filter(lambda c: 12.0 \u003c= c['sugars'] \u003c 15.0)\n\nThis gets all the cereals that have at least 12 but less than 15 grams of sugar.\n\n### Daisy-chaining\n\nSince `filter` returns a new data set, you can also chain methods:\n\n    \u003e\u003e\u003e cereals.filter(manufacturer=\"Kelloggs\").filter(shelf=(2,3))\n\nOf course, you can also write \n\n    \u003e\u003e\u003e cereals.filter(manufacturer=\"Kelloggs\", shelf=(2,3))\n\nand get the same effect, but chaining methods allows you to do a few other operations in a single line.\n\n\n### Other functions:\n\n- `map`\n- `median_split`\n- `toggle_verbose`\n- `remove_outliers`\n- `label`\n- `dependent_vars`\n\nSaving and Exporting\n--------------------\n\nSaving to JSON is as easy as\n\n    cereals.save(\"new_dataset.json\")\n\nBut exporting is just as swift:\n\n    cereals.save(\"excel_worksheet.xlsx\")\n\nknyfe will guess the format by the extension. 
\n\n### Formats\n\nCurrently, the following formats are supported:\n\n- `csv` for comma-separated values\n- `xlsx` for Excel 2007 or newer\n- `xls` for legacy Excel\n- `ods` for OpenDocument spreadsheets\n- `html` for an HTML file\n\nNative Datasets: JSON\n---------------------\n\nNatively, knyfe treats data like JSON objects, that is, as key-value pairs. If you know what JSON is, skip this section.\n\n### Why JSON?\n\nAny data format should be built on three principles:\n\n1. Human-readable\n2. Explicit (i.e. self-contained)\n3. Flexible\n\nIn other words, a dataset shouldn't look like this: `PK\\x03\\x04\\x14\\x00\\x00\\x00\\x00\\x00\\xce\\xad` and it also shouldn't look like `5.1,3.5,1.4,0.2;4.6,3.1,1.5,0.2`. Why? For two reasons:\n\n1. If other people want to use your data, they should know what they're dealing with.\n2. Human-readable means anybody will be able to open the data set, now and in 50 years.\n\n### What does JSON look like?\n\nIf you know Python, JSON will look very familiar: it translates to Python `dict` and `list` types almost directly. The main differences are that `None` in Python is `null` in JSON, and that JSON keys must be double-quoted strings, whereas Python `dict` keys can be any hashable value. So a Dataset in JSON may look like this:\n\n    [\n      {\n        \"species\": \"Elephant\",\n        \"weight\": 8014.2,\n        \"age\": 31,\n        \"name\": \"Dumbo\"\n      },\n      {\n        \"species\": \"Squirrel\",\n        \"weight\": 0.021,\n        \"age\": 0.7,\n        \"name\": null\n      }\n    ]\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaebert%2Fknyfe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaebert%2Fknyfe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaebert%2Fknyfe/lists"}