{"id":13558000,"url":"https://github.com/ckan/datapusher","last_synced_at":"2025-04-05T02:12:22.518Z","repository":{"id":5450945,"uuid":"6644549","full_name":"ckan/datapusher","owner":"ckan","description":"A standalone web service that pushes data files from a CKAN site resources into its DataStore","archived":false,"fork":false,"pushed_at":"2024-03-13T10:35:02.000Z","size":743,"stargazers_count":80,"open_issues_count":115,"forks_count":157,"subscribers_count":22,"default_branch":"master","last_synced_at":"2025-03-29T01:14:49.101Z","etag":null,"topics":["ckan","datastore"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ckan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2012-11-11T21:08:59.000Z","updated_at":"2025-02-22T13:12:47.000Z","dependencies_parsed_at":"2024-03-13T11:54:21.416Z","dependency_job_id":null,"html_url":"https://github.com/ckan/datapusher","commit_stats":{"total_commits":320,"total_committers":40,"mean_commits":8.0,"dds":0.64375,"last_synced_commit":"73ed6cc6f613484999f9e5c25e9fdf645d3425e6"},"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ckan%2Fdatapusher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ckan%2Fdatapusher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ckan%2Fdatapusher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ckan%2Fdatapushe
r/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ckan","download_url":"https://codeload.github.com/ckan/datapusher/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247276189,"owners_count":20912288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ckan","datastore"],"created_at":"2024-08-01T12:04:40.642Z","updated_at":"2025-04-05T02:12:22.494Z","avatar_url":"https://github.com/ckan.png","language":"Python","funding_links":[],"categories":["Python","others"],"sub_categories":[],"readme":"[![Tests](https://github.com/ckan/datapusher/actions/workflows/test.yml/badge.svg)](https://github.com/ckan/datapusher/actions/workflows/test.yml)\n[![Latest Version](https://img.shields.io/pypi/v/datapusher.svg)](https://pypi.python.org/pypi/datapusher/)\n[![Downloads](https://img.shields.io/pypi/dm/datapusher.svg)](https://pypi.python.org/pypi/datapusher/)\n[![Supported Python versions](https://img.shields.io/pypi/pyversions/datapusher.svg)](https://pypi.python.org/pypi/datapusher/)\n[![License](https://img.shields.io/badge/license-GPL-blue.svg)](https://pypi.python.org/pypi/datapusher/)\n\n[CKAN Service Provider]: https://github.com/ckan/ckan-service-provider\n[Messytables]: https://github.com/okfn/messytables\n\n\n# DataPusher\n\nDataPusher is a standalone web service that automatically downloads any tabular\ndata files like CSV or Excel from a CKAN site's resources when they are added to the\nCKAN site, parses them to pull out the actual data, then uses the DataStore API\nto push the data into 
the CKAN site's DataStore.\n\nThis makes the data from the resource files available via CKAN's DataStore API.\nIn particular, many of CKAN's data preview and visualization plugins will only\nwork (or will work much better) with files whose contents are in the DataStore.\n\nTo get it working you have to:\n\n1. Deploy a DataPusher instance to a server (or use an existing DataPusher\n   instance).\n2. Enable and configure the `datastore` plugin on your CKAN site.\n3. Enable and configure the `datapusher` plugin on your CKAN site.\n\nNote that if you installed CKAN using the _package install_ option then a\nDataPusher instance should be automatically installed and configured to work\nwith your CKAN site.\n\nDataPusher is built using [CKAN Service Provider][] and [Messytables][].\n\nThe original author of DataPusher was\nDominik Moritz \u003cdominik.moritz@okfn.org\u003e. For the current list of contributors\nsee [github.com/ckan/datapusher/contributors](https://github.com/ckan/datapusher/contributors)\n\n## Development installation\n\nInstall the required packages:\n\n    sudo apt-get install python-dev python-virtualenv build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev\n\nGet the code:\n\n    git clone https://github.com/ckan/datapusher\n    cd datapusher\n\nInstall the dependencies:\n\n    pip install -r requirements.txt\n    pip install -r requirements-dev.txt\n    pip install -e .\n\nRun the DataPusher:\n\n    python datapusher/main.py deployment/datapusher_settings.py\n\nBy default DataPusher should be running at the following address:\n\n    http://localhost:8800/\n\nIf you need to change the host or port, copy `deployment/datapusher_settings.py` to\n`deployment/datapusher_local_settings.py` and modify the file to suit your needs. 
Also, if running a production setup, make sure that the host and port match the `http` settings in the uWSGI configuration.\n\nTo run the tests:\n\n    pytest\n\n## Production deployment\n\n*Note*: If you installed CKAN via a [package install](http://docs.ckan.org/en/latest/install-from-package.html), the DataPusher has already been installed and deployed for you. You can skip directly to the [Configuring](#configuring) section.\n\n\nThese instructions assume you already have CKAN installed on this server in the default\nlocation described in the CKAN install documentation\n(`/usr/lib/ckan/default`).  If this is correct you should be able to run the\nfollowing commands directly; if not, you will need to adapt the previous path to\nyour needs.\n\nThese instructions set up the DataPusher web service on [uWSGI](https://uwsgi-docs.readthedocs.io/en/latest/) running on port 8800, but can be easily adapted to other WSGI servers like Gunicorn. You'll\nprobably need to set up Nginx as a reverse proxy in front of it and something like\nSupervisor to keep the process up.\n\n\n     # Install requirements for the DataPusher\n     sudo apt install python3-venv python3-dev build-essential\n     sudo apt-get install python-dev python-virtualenv build-essential libxslt1-dev libxml2-dev git libffi-dev\n\n     # Create a virtualenv for datapusher\n     sudo python3 -m venv /usr/lib/ckan/datapusher\n\n     # Create a source directory and switch to it\n     sudo mkdir /usr/lib/ckan/datapusher/src\n     cd /usr/lib/ckan/datapusher/src\n\n     # Clone the source (you should target the latest tagged version)\n     sudo git clone -b 0.0.17 https://github.com/ckan/datapusher.git\n\n     # Install the DataPusher and its requirements\n     cd datapusher\n     sudo /usr/lib/ckan/datapusher/bin/pip install -r requirements.txt\n     sudo /usr/lib/ckan/datapusher/bin/python setup.py develop\n\n     # Create a user to run the web service (if necessary)\n     sudo addgroup www-data\n     sudo 
adduser -G www-data www-data\n\n     # Install uWSGI\n     sudo /usr/lib/ckan/datapusher/bin/pip install uwsgi\n\nAt this point you can run DataPusher with the following command:\n\n    /usr/lib/ckan/datapusher/bin/uwsgi -i /usr/lib/ckan/datapusher/src/datapusher/deployment/datapusher-uwsgi.ini\n\n\n*Note*: If you are installing the DataPusher in a different location than the default\none you need to adapt the relevant paths in the `datapusher-uwsgi.ini` to the ones you are using. Also you might need to change the `uid` and `gid` settings when using a different user.\n\n\n### High Availability Setup\n\nThe default DataPusher configuration uses SQLite as the backend for the jobs database and a single uWSGI thread. To increase performance and concurrency you can configure DataPusher in the following way:\n\n1. Use Postgres as the database backend, which will allow concurrent writes (and provide a more reliable backend anyway). To use Postgres, create a user and a database and update the `SQLALCHEMY_DATABASE_URI` setting accordingly:\n\n    ```\n    # This assumes DataPusher is already installed\n    sudo apt-get install postgresql libpq-dev\n    sudo -u postgres createuser -S -D -R -P datapusher_jobs\n    sudo -u postgres createdb -O datapusher_jobs datapusher_jobs -E utf-8\n\n    # Run this in the virtualenv where DataPusher is installed\n    pip install psycopg2\n\n    # Edit SQLALCHEMY_DATABASE_URI in datapusher_settings.py accordingly\n    # eg SQLALCHEMY_DATABASE_URI=postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs\n    ```\n\n2. Start more uWSGI threads. In the `deployment/datapusher-uwsgi.ini` file, set `workers` and `threads` to a value that suits your needs, and add the `lazy-apps=true` setting to avoid concurrency issues with SQLAlchemy, eg:\n\n    ```\n    # ... 
rest of datapusher-uwsgi.ini\n    workers         =  3\n    threads         =  3\n    lazy-apps       =  true\n    ```\n\n## Configuring\n\n\n### CKAN Configuration\n\nAdd `datapusher` to the plugins in your CKAN configuration file\n(generally located at `/etc/ckan/default/production.ini` or `/etc/ckan/default/ckan.ini`):\n\n    ckan.plugins = \u003cother plugins\u003e datapusher\n\nIn order to tell CKAN where this webservice is located, the following must be\nadded to the `[app:main]` section of your CKAN configuration file:\n\n    ckan.datapusher.url = http://127.0.0.1:8800/\n   \nStarting from CKAN 2.10, DataPusher requires a valid API token to operate (see [the documentation on API tokens](https://docs.ckan.org/en/latest/api/index.html#authentication-and-api-tokens)), and will fail to start if the following option is not set:\n\n    ckan.datapusher.api_token = \u003capi_token\u003e\n\nThere are other CKAN configuration options that allow you to customize the CKAN - DataPusher\nintegration. Please refer to the [DataPusher Settings](https://docs.ckan.org/en/latest/maintaining/configuration.html#datapusher-settings) section in the CKAN documentation for more details.\n\n\n### DataPusher Configuration\n\nThe DataPusher instance is configured in the `deployment/datapusher_settings.py` file.\nHere's a summary of the options available.\n\n| Name | Default | Description |\n| -- | -- | -- |\n| HOST | '0.0.0.0' | Web server host |\n| PORT | 8800 | Web server port |\n| SQLALCHEMY_DATABASE_URI | 'sqlite:////tmp/job_store.db' | SQLAlchemy Database URL. See note about database backend below. 
|\n| MAX_CONTENT_LENGTH | '1024000' | Max size of files to process in bytes |\n| CHUNK_SIZE | '16384' | Chunk size when processing the data file |\n| CHUNK_INSERT_ROWS | '250' | Number of records to send per request to the DataStore |\n| DOWNLOAD_TIMEOUT | '30' | Download timeout for requesting the file |\n| SSL_VERIFY | False | Do not validate SSL certificates when requesting the data file (*Warning*: Do not use this setting in production) |\n| TYPES | [messytables.StringType, messytables.DecimalType, messytables.IntegerType, messytables.DateUtilType] | [Messytables][] types used internally, can be modified to customize the type guessing |\n| TYPE_MAPPING | {'String': 'text', 'Integer': 'numeric', 'Decimal': 'numeric', 'DateUtil': 'timestamp'} | Internal Messytables type mapping |\n| LOG_FILE | `/tmp/ckan_service.log` | Where to write the logs. Use an empty string to disable |\n| STDERR | `True` | Log to stderr? |\n\n\nMost of the configuration options above can also be provided as environment variables by prefixing the name with `DATAPUSHER_`, eg `DATAPUSHER_SQLALCHEMY_DATABASE_URI`, `DATAPUSHER_PORT`, etc. In the specific case of `DATAPUSHER_STDERR` the possible values are `1` and `0`.\n\n\nBy default, DataPusher uses SQLite as the database backend for jobs information. This is fine for local development and sites with low activity, but for sites that need more performance, Postgres should be used as the backend for the jobs database (eg `SQLALCHEMY_DATABASE_URI=postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs`). See also [High Availability Setup](#high-availability-setup). If SQLite is used, it's probably a good idea to store the database in a location other than `/tmp`. This will prevent the database being dropped, causing out of sync errors on the CKAN side. 
A good place to store it is the CKAN storage folder (if DataPusher is installed on the same server), generally in `/var/lib/ckan/`.\n\n\n## Usage\n\nDataPusher will attempt to load any file that has one of the supported formats (defined in [`ckan.datapusher.formats`](https://docs.ckan.org/en/latest/maintaining/configuration.html#ckan-datapusher-formats))\ninto the DataStore.\n\nYou can also manually trigger resources to be resubmitted. When editing a resource in CKAN (clicking the \"Manage\" button on a resource page), a new tab named \"DataStore\" will appear. This will contain a log of the last attempted upload and a button to retry the upload.\n\n![DataPusher UI](images/ui.png)\n\n### Command line\n\nRun the following command to submit all resources to DataPusher; it will skip resources whose data file hash has not changed:\n\n    ckan -c /etc/ckan/default/ckan.ini datapusher resubmit\n\nOn CKAN\u003c=2.8:\n\n    paster --plugin=ckan datapusher resubmit -c /etc/ckan/default/ckan.ini\n\nTo resubmit a specific resource, whether or not the hash of the data file has changed:\n\n    ckan -c /etc/ckan/default/ckan.ini datapusher submit {dataset_id}\n\nOn CKAN\u003c=2.8:\n\n    paster --plugin=ckan datapusher submit \u003cpkgname\u003e -c /etc/ckan/default/ckan.ini\n\n\n## License\n\nThis material is copyright (c) 2020 Open Knowledge Foundation and other contributors\n\nIt is open and licensed under the GNU Affero General Public License (AGPL) v3.0\nwhose full text may be found at:\n\n[http://www.fsf.org/licensing/licenses/agpl-3.0.html](http://www.fsf.org/licensing/licenses/agpl-3.0.html)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fckan%2Fdatapusher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fckan%2Fdatapusher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fckan%2Fdatapusher/lists"}