{"id":22052706,"url":"https://github.com/digitalnz/extraction_point","last_synced_at":"2026-04-13T12:01:16.051Z","repository":{"id":138874030,"uuid":"191655582","full_name":"DigitalNZ/extraction_point","owner":"DigitalNZ","description":"Tools for processing Kete data after it has been exported in PostgreSQL importable form","archived":false,"fork":false,"pushed_at":"2021-12-15T15:02:15.000Z","size":108,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-23T15:38:56.324Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DigitalNZ.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-12T22:59:03.000Z","updated_at":"2021-06-16T21:46:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"4ee8b772-cf8c-4778-b39e-f1cf4f86df9e","html_url":"https://github.com/DigitalNZ/extraction_point","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DigitalNZ/extraction_point","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalNZ%2Fextraction_point","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalNZ%2Fextraction_point/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalNZ%2Fextraction_point/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalNZ%2Fextraction_point/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DigitalNZ","download_url":"https://codeload.github.com/DigitalNZ/extraction_point/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalNZ%2Fextraction_point/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31751705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T09:16:15.125Z","status":"ssl_error","status_checked_at":"2026-04-13T09:16:05.023Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-30T15:14:00.377Z","updated_at":"2026-04-13T12:01:16.010Z","avatar_url":"https://github.com/DigitalNZ.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ExtractionPoint\n\nTools for processing Kete data after it has been exported in\nPostgreSQL importable form\n\n## Requirements\n\nExpects [Docker](https://www.docker.com) to be up and running and\n[`docker-compose`](https://docs.docker.com/compose/) to be available\non machine.\n\n## Overview\n\nExtraction Point migrates Kete's data to modern PostgreSQL based\ntables and columns using standard datatypes rather than Kete specific\nExtended Fields, etc.\n\nThen it makes this new version of the data available via a temporary\nAPI that can return JSON or 
TSV files (like CSV files, but tab\ndelimited) so that you can import the data into other systems.\n\nA typical workflow will look like this:\n\n* grab your Kete site's original data via [Kete extraction](https://github.com/walter/kete_extraction)\n* use this project to standardize and modernize the data via initial data migration\n* use this project to download JSON or TSV files for the new version of the data\n* import and map the data into the new system, including steps to map\n  audio, document, image, and video files to their corresponding data\n  (this may vary a lot depending on the new system)\n\nExtraction Point or the workflow you use should be straightforward to\nmodify or extend for a given project's needs.\n\nSee sections below for detailed instructions or guidance.\n\n_Note: This is a non-destructive set of tools and the original tables and data (along with new versions of the data) are available for further querying or use via SQL, etc. or even using common tools such as `pg_dump` to create a full or partial backup of the database to be imported elsewhere._\n\n## Current limitations\n\nDue to time constraints and the desire to cover the most common needs\nof Kete sites first, some data from Kete has not yet been modernized\nand therefore not made available via the included API:\n\n* comments\n* private items\n* full version history metadata\n* basket membership and roles\n* fully resolving values of fields that included a \"base URL\" as\n  prefix for field value via Kete's Extended Field and Choice system\n* replacing URL values of fields that refer to old site URL as\n  prefix for field value via Kete's Extended Field and Choice system\n  -- this is best handled as a bulk search and replace operation\n  separately once new system URL patterns are determined\n\nA further limitation is that no constraints or indexes are in place\nfor either legacy tables or modernized tables as the database is\nintended as an intermediate tool rather than a final 
application\ndatabase. If you want to base an application's database off the\nExtraction Point one, we recommend you evaluate what constraints and\nother changes are needed.\n\nAgain, keep in mind that all of the original tables and data are\navailable to work around these limitations. The project also welcomes\nopen source contributions or further funding to improve it.\n\n## Installation\n\n* set up Docker and docker-compose locally\n* clone this repository and `cd` into its directory\n\n## Usage\n\n### Initial data migration\n\nYou'll need a sql file that is an export of a Kete site's data via\n[Kete extraction](https://github.com/walter/kete_extraction).\n\nThe first step is to import that data. This will trigger `docker` and\n`docker-compose` to set up the necessary images and containers on the\nmachine. So the first time you run any of the commands, it may be slow.\n\n```sh\ndocker-compose run app mix extraction_point.load_sql _the_kete_export_file.sql_\n```\n\nThe next step is to trigger the migration of data from Kete's specific\nformat to modern PostgreSQL tables and columns.\n\nThis includes _per row_ updates, so it can take a long time depending\non how much data your Kete site had.\n\n```sh\ndocker-compose run app mix ecto.migrate\n```\n\nIf you have a lot of data, you may want to suppress standard output\nlike this instead:\n\n```sh\ndocker-compose run app mix ecto.migrate \u003e /dev/null\n```\n\n_Note: If you need to undo the last migration (you can do this repeatedly\nuntil you are at the right spot), you can roll back._\n\n```sh\ndocker-compose run app mix ecto.rollback\n```\n\nNow you are ready to download the migrated data via the included\nscripts that use the provided API.\n\n### Things to know about the standardized and modernized data\n\nSome background. 
Kete's fundamental organizing container is the basket\nwith content items (audio recordings, documents, images, videos, and\nweb links) and topics (which in turn are organized into site defined\ntopic types) grouped by basket. Comments (also referred to as\n\"discussion\" in the UI) were then associated with a content item or\ntopic and also \"owned\" by the same basket.\n\nThese content types and topic types are the primary data that\nExtraction Point exports.\n\nHere are the common fields for the content type and topic type based\nrecords with explanatory comments:\n\nColumn | Notes\n------ | -----\n`id` | id field has same value as original data table\n`title` | name of item\n`description` | description of item in HTML, can be long\n`tags` | names of tags associated with item, multiple value field\n`version` | number of last published version of item\n`inserted_at` | when the item was created in the database\n`updated_at` | when the item was last modified in the database\n`basket_id` | id of basket in legacy baskets table\n`basket_key` | name of basket as it was reflected in urls, for convenience\n`license_id` | id of license in legacy licenses table\n`license` | text title of license for convenience\n`previous_oai_identifier` | unique id in Kete's included OAI-PMH repository - useful for tracking where the record was previously in any aggregation system, E.g. 
Digital New Zealand, that harvested the repository\n`previous_url_patterns` | url patterns (using glob wildcards) for where the item was previously found in Kete for various actions, useful for setting up redirects to the new system\n`creator_id` | id of user who created item in legacy users table\n`creator_login` | unique login of user who created item in legacy users table, for convenience\n`creator_name` | user chosen display name of user who created item in legacy users table, for convenience\n`contributor_ids` | list of ids of users who modified item in legacy users table\n`contributor_logins` | unique logins of users who modified item in legacy users table, for convenience\n`contributor_names` | user chosen display names of users who modified item in legacy users table, for convenience\n\n_Note: contributors (and creator) are only listed once even though they may have\ncontributed multiple versions of the item. So order in fields does not\ncorrespond to order or number of contributions._\n\nEach type may also have _additional fields_, depending on two\nthings:\n\n1. fields that were specific to the type, E.g. `url` for `web_links`\n   or `content_type` for `documents` to store whether the document is\n   a `pdf` or `doc` file\n2. what Extended Fields were set up for the type, E.g. `dob` could\n   have been set up for the `person` type\n\nExtraction Point includes a report that will describe each extracted\ntype and all of its columns along with how many records it has for\neach type. We'll cover that in the next section.\n\n### Downloading data in TSV or JSON format\n\nFirst we need to spin up the API server with the following command in\nits own shell:\n\n```sh\ndocker-compose up # wait for the output to report \"[info] Access ExtractionPointWeb.Endpoint at http://localhost:4000\"\n```\n\nNow in another shell within the same directory, you can start\ndownloading the data. 
First up is the meta report which will guide our\nfurther downloads.\n\n```sh\n./bin/extract_as_json.sh meta meta.json # this says to request the meta report and output it to a file named meta.json\n```\n\n`json` is recommended for the meta report. By default the API\noutputs json without extra whitespace, which is hard to read. Either\nopen the file in your favorite editor and use the editor's capability\nto pretty print the json or, if you have\n[jq](https://stedolan.github.io/jq/) handy, use `jq . meta.json` to\nexamine it. This goes for any of the json output.\n\nYou can see that the meta report lists the extracted types with their\ncolumns with datatypes, the number of rows they have under `count`, as\nwell as which baskets the rows are in.\n\nUse this information to determine your download plan. Obviously you\ncan skip types that have a `count` of 0.\n\nYou may also want to skip types that are only `within_baskets` of\n\"about\" or \"help\" as these are generic Kete baskets that may not\ncontain anything specific to your site or are not relevant to your new\nsystem. There is also the option to skip particular baskets, which\nwe'll talk about shortly, and these baskets are good candidates for\nthis option.\n\nNow determine whether you want to download your data as `json` or\n`tsv`. This is going to depend on what you plan to do next with the\ndata and what requirements the system you plan on importing to has.\n\n`tsv` is functionally equivalent to `csv`, but is handled better\nby Excel when the data contains Unicode characters from what we have\nread. Outside of Excel, most comma separated values parser libraries\nshould allow for specifying tab as a delimiter.\n\n#### Downloading the standard content types\n\nKete standard content types are `audio_recordings`, `documents`,\n`still_images`, and `web_links`. 
There are also two special case\ncontent types, `comments` which we are skipping for now, and `users`\nwhich we'll cover later.\n\nStart by downloading these types. Check the meta report for those that\nhave a `count` of more than zero for their content.\n\nHere's the most basic way to download `still_images`; the same pattern\ncan be used for the other content types.\n\n```sh\n./bin/extract_as_json.sh still-images\n```\n\nFirst use the appropriate export script, in this case\n`bin/extract_as_json.sh`, but if you want `tsv`, use\n`bin/extract_as_tsv.sh`. Then specify the type in url style dash\nseparated plural form (i.e. kebab case).\n\nThere are two additional arguments that you can specify for the\nscript: output file and options.\n\nOutput file is the relative path of the file you would like created with\nthe data.\n\n```sh\n./bin/extract_as_json.sh still-images path-to/file.json\n```\n\nOptions are in the form of a URL query string and they should be\nseparated by an escaped ampersand, `\\\u0026`. They specify any\noptions you want for manipulating the data.\n\nHere are the options currently supported:\n\nOption | Notes\n----- | -----\n`except_baskets` | comma separated (no spaces) list of basket keys that should be excluded from results\n`only_baskets` | comma separated (no spaces) list of basket keys of those baskets that data should be limited to\n`limit` | number of results to limit to, used in combination with `offset` to paginate results\n`offset` | number of the record in results to start after, E.g. when used in combination with `limit`, `limit=10\\\u0026offset=10` would say to only return results 11 - 20 or \"page 2\"\n`include_related` | call as `included_related=source` to use this option with content types or topics. 
Lists ids and titles columns for all topics (by type) that item is in\n\nArguments for the scripts are positional, so options require that the\noutput file parameter is also specified!\n\nHere's how to request the first and second 100 results for still images in two\nsuccessive files:\n\n```sh\n./bin/extract_as_json.sh still-images still-images-page-1.json limit=100\\\u0026offset=0\n./bin/extract_as_json.sh still-images still-images-page-2.json limit=100\\\u0026offset=100\n```\n\nUsing limit and offset for pagination is probably most useful for\nbreaking up `json` into more manageable chunks.\n\nHere's how to request results in `tsv` for all documents only in a\nspecific basket:\n\n```sh\n./bin/extract_as_tsv.sh documents community-group-a-documents.tsv only_baskets=community_group_a\n```\n\nHere's how to request results in `tsv` for all web links as long as\nthey are not in the generic \"about\" and \"help\" baskets:\n\n```sh\n./bin/extract_as_tsv.sh web-links web-links.tsv except_baskets=about,help\n```\n\n#### Downloading topic types\n\nThe other types listed in the meta report, except for the special\ncontent type `users` and the `relations` type for linking records which\nwe'll cover last, are topic types.\n\nSome of these come standard with Kete, such as `topic`, while others\nare dynamically added by site admins and therefore their names are not\nknown ahead of time.\n\nYou can derive the type name for a topic type by looking at the\n`table_name` in the meta report and dropping the `extracted_` prefix.\n\nHow you download the data for a type is the same as for content types\n_except you use singular form for the type argument_! 
Here's how to\nget the person type's data in `json` with limit and offset:\n\n```sh\n./bin/extract_as_json.sh person people-page-1.json limit=100\\\u0026offset=0\n```\n\nHere are the options currently supported:\n\nOption | Notes\n----- | -----\n`except_baskets` | comma separated (no spaces) list of basket keys that should be excluded from results\n`only_baskets` | comma separated (no spaces) list of basket keys of those baskets that data should be limited to\n`limit` | number of results to limit to, used in combination with `offset` to paginate results\n`offset` | number of the record in results to start after, E.g. when used in combination with `limit`, `limit=10\\\u0026offset=10` would say to only return results 11 - 20 or \"page 2\"\n`include_related` | call as `included_related=source` to list ids and titles columns for all topics (by type) that item is in OR call as `included_related=target` to list ids and titles columns for all topics and content items (by type) that a topic contains\n\n##### Known issues\n\nTopic type names are expected to be in singular form, e.g. \"Town\",\nrather than plural like \"Towns\". Because of this convention, the\nextraction process will produce some unexpected names for types and\ntables when a topic type name is plural.\n\nWe recommend renaming the topic type name to singular form before\nrunning the data export step or directly in the resulting sql export file.\n\n#### Downloading users and relations\n\nYou may also want to bring across `users` for your new system as well\nas the mapping of which pieces of content are related to which topics\nvia the `relations` data.\n\nThe same scripts will work for them; however, the `except_baskets`\nand `only_baskets` options are not relevant.\n\nE.g.\n\n```sh\n./bin/extract_as_json.sh users # or relations as type\n```\n\nwill give you the data for each.\n\n_Note: extracted `users` data doesn't include the hashed passwords,\nalthough they are available in the legacy table. 
Email addresses are\nincluded, so be careful not to expose extracted data publicly._\n\nThat's everything at this point.\n\n#### Shutting down temporary API and cleaning out docker images\n\nOnce you are done with downloading data from the API, you can shut it\ndown in the shell it was running in with `control-c control-c`.\n\nThen you can clean out any stuff `docker-compose up` left around with\n`docker-compose down`.\n\n### Recommended order of import into new systems\n\n* import users first\n* then content types and topic types\n* last import relations\n\n### Mapping corresponding audio, document, image, and video files to data\n\n* content types with associated files that were uploaded to Kete have\n  a `relative_file_path` column (`still_images` have different column\n  names, but same idea, should be self explanatory)\n* you have to prefix this path with the appropriate path that corresponds\n  to whether it is public or private and its type for where the file\n  will be found in the exported files from Kete Extraction. E.g. if\n  `relative_file_path` for a document is `0000/0000/001/file.pdf`\n  then it will be found at `public/documents/0000/0000/001/file.pdf`.\n* in the future we will also handle private items and they will be\n  found under the `private` directory\n* still images are a special case as they have multiple image files\n  associated with them, the original and also resized versions such as\n  thumbnails. They have columns for `relative_original_file_path` and\n  `relative_resized_file_paths` accordingly - the resized files will\n  only be present if you opted to export them\n\n## Appendix\n\n### Wiping and starting again\n\n_Note: this tool spins up a container for PostgreSQL to run. 
It\npersists data under `docker/data/postgres` between command runs._\n\n_If you want to start from scratch or clean things out, do\n`rm -rf docker/data/postgres/*`._\n\n### Lower level data access tools\n\nYou can also examine the data via a `psql` session or through the\nElixir application via `iex`.\n\nUse `psql` for direct SQL access:\n\n```sh\ndocker-compose run app psql -h db -U 'postgres' extraction_point_dev\n```\n\nExtraction Point is built using [Elixir](https://elixir-lang.org) and\n[Phoenix](https://phoenixframework.org) and has the\n[`iex`](https://hexdocs.pm/iex/IEx.html) interactive shell available\nfor interacting with the application via Elixir:\n\n```sh\ndocker-compose run app iex -S mix\n```\n\n### Tips and Tricks\n\nIf you run the `extraction_point.load_sql` command and it fails with\nsomething like this:\n\n``` sh\nERROR:  invalid byte sequence for encoding \"UTF8\": ...\n```\n\nthen you may have data that has not been properly encoded when it was\nadded to the database. The most likely scenario is that someone has\ncopied and pasted from a Microsoft Word document that is encoded in\n`WINDOWS-1252`.\n\nYou'll need to re-encode the data before importing it. 
This was\nsuccessful for me with the caveat that there is a danger of double encoding:\n\n``` sh\niconv -f WINDOWS-1252 -t UTF-8 source_data_file.sql \u003e source_data_file.utf8.sql\n```\n\nThen use this new sql file to re-run the `extraction_point.load_sql` command.\n\n## Credits\n\nThis project was developed by [Walter McGinnis](waltermcginnis.com) for\nmigrating data from the [Kete](old.kete.net.nz) open source\napplication and was funded by [Digital New Zealand](digitalnz.org).\n\n## COPYRIGHT AND LICENSING\n\nGNU GENERAL PUBLIC LICENCE, VERSION 3\n\nExcept as indicated in code, this project is Crown copyright (C) 2019,\nNew Zealand Government.\n\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or (at\nyour option) any later version.\n\nThis program is distributed in the hope that it will be useful, but\nWITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU\nGeneral Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program. If not, see http://www.gnu.org/licenses /\nhttp://www.gnu.org/licenses/gpl-3.0.txt\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdigitalnz%2Fextraction_point","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdigitalnz%2Fextraction_point","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdigitalnz%2Fextraction_point/lists"}