{"id":31673250,"url":"https://github.com/reelyactive/paretl-postgres","last_synced_at":"2025-10-08T03:37:22.772Z","repository":{"id":312318987,"uuid":"1047104338","full_name":"reelyactive/paretl-postgres","owner":"reelyactive","description":"Pareto Anywhere ETL for PostgreSQL.  We believe in an open Internet of Things.","archived":false,"fork":false,"pushed_at":"2025-10-02T16:33:10.000Z","size":1744,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-02T18:28:33.998Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/reelyactive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-29T18:42:53.000Z","updated_at":"2025-10-02T16:33:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"b5337190-cd92-4709-94c9-6dc6eb03f8b8","html_url":"https://github.com/reelyactive/paretl-postgres","commit_stats":null,"previous_names":["reelyactive/paretl-postgres"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/reelyactive/paretl-postgres","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reelyactive%2Fparetl-postgres","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reelyactive%2Fparetl-postgres/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/reelyactive%2Fparetl-postgres/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reelyactive%2Fparetl-postgres/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/reelyactive","download_url":"https://codeload.github.com/reelyactive/paretl-postgres/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reelyactive%2Fparetl-postgres/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278886174,"owners_count":26062972,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-08T03:37:21.606Z","updated_at":"2025-10-08T03:37:22.757Z","avatar_url":"https://github.com/reelyactive.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"paretl-postgres\n===============\n\nPareto Anywhere ETL (\"paretl\") for PostgreSQL.\nThis program applies transformation and filter operations to the raw data and loads the result into a new table.\nThe processed data is therefore expected to be cleaner and easier to retrieve.\nA watchdog process provides an overview of the full operation.\n\n# Quick HOW-TO\n\nYou can use the ETL process either by running the scripts directly or by using a Docker image.\n\n## Conditions\n\nYou need a 
local PostgreSQL database with a `raddec` table whose columns are:\n* **transmitterid** (alphanumeric variable): MAC address of the transmitting device  \n* **receiverid** (alphanumeric variable): MAC address of the receiving device  \n* **numberofdecodings** (integer variable): number of signals from the transmitter observed by the receiver during one minute  \n* **rssi** (integer variable): received signal strength indicator (detection power)  \n* **timestamp** (time variable YYYY-MM-DD HH:MM:SS): time of detection  \n\n\n## Using Docker\n\nThe ETL is available as a Docker image hosted on Docker Hub.\n\nMake sure you have Docker installed:\n\n`docker --version`\n\nIf you don't have Docker, install it:\n\n`sudo snap install docker`\n\nPull the Docker image:\n\n`sudo docker pull reelyactive/paretl-postgres:latest`\n\nCheck that you have the image:\n\n`sudo docker images`\n\nDownload the configuration file into a `config` directory (use the raw file URL, since the `blob` URL returns an HTML page rather than the JSON file):\n\n`mkdir config`\n\n`wget -P config https://raw.githubusercontent.com/reelyactive/paretl-postgres/ba47af6cf082b0998bd76e4b162a28f9adafa697/config/config.json` \n\nRun the ETL:\n\n`sudo docker run \\\n  --add-host=host.docker.internal:host-gateway \\\n  -v $(pwd)/config:/app/config \\\n  reelyactive/paretl-postgres:latest python -m src.main -c config/config.json`\n\nIf it fails, you may need to allow PostgreSQL to accept connections from Docker by editing its configuration files as follows and restarting the service:\n\n`sudo nano /etc/postgresql/16/main/postgresql.conf \nlisten_addresses = '*'\nsudo nano /etc/postgresql/16/main/pg_hba.conf \nhost all all 172.17.0.0/16 md5\nsudo systemctl restart postgresql`\n\n\n## Using the plain code\n\nMake sure you have the following configuration (consider using a [dedicated environment](https://www.youtube.com/watch?v=IAvAlS0CuxI) like [anaconda](https://youtu.be/hVcEv7rEN24?si=xHN6zLnYidVYLEej)):\n\n* Python 3.13\n\nPython libraries:\n* pandas\n* psycopg2-binary\n* sqlalchemy\n* psutil\n* tabulate\n* logging\n* argparse\n\n`logging` and `argparse` ship with the Python standard library and do not need to be installed. If any of the others are missing, you can install them using:\n\n`pip install \u003cLIBRARY NAME\u003e`\n\nRetrieve the code:\n\n`git clone 
https://github.com/reelyactive/paretl-postgres.git`\n\nGo to the ETL repository:\n\n`cd paretl-postgres`\n\nIn the configuration file `config/config.json`, make sure that the database host is set to:\n\n`\"db_host\": \"localhost\"`\n\nRun the ETL:\n\n`python -m src.main -c config/config.json`\n\n## Result\n\nYour database now contains two additional tables:\n* etl_raddec: filtered data\n* etl_watchdog: performance of the ETL process\n\nThe `etl_raddec` table contains the rows of the `raddec` table that passed the filters defined in the configuration file. Its columns are the same as those of `raddec`, plus the following metrics:\n\n* **time_window** (numeric variable): duration in seconds between the first and the last observation of a transmitter over all the receivers  \n* **max_rssi** (numeric variable): maximum detection power (rssi) observed for a transmitter over all the receivers  \n* **nb_counts** (integer variable): total number of observations of a transmitter over all the receivers  \n* **digit_2** (alphanumeric variable): second character of the MAC address of the transmitter  \n* **isPrivate** (boolean): whether the MAC address of the transmitter corresponds to a private device  \n* **date** (date variable): simple conversion of timestamp to date  \n* **watchdog_id** (table key): processing index, to be joined with the primary key of the `etl_watchdog` table  \n\n\nThe `etl_watchdog` table contains one row per ETL run with the following columns:\n\n* **id** (table primary key): processing index  \n* **event_name** (alphanumeric variable): name of the event during which the receivers were deployed, defined in the configuration file  \n* **ts** (time variable): timestamp of the processing  \n* **rows** (integer variable): number of rows  \n* **duration_sec** (integer variable): duration in seconds of the ETL processing  \n* **cpu_percent** (numeric variable): fraction of the local CPU used for the ETL processing  \n* **memory_mb** (numeric variable): RAM in 
MB used for the ETL processing  \n* **n_transmitters** (integer variable): number of transmitters observed in the filtered raddec table  \n* **n_transmitters_per_day** (alphanumeric variable YYYY-MM-DD: N): number of transmitters per day observed in the filtered raddec table  \n* **median_time_window** (numeric variable): [median](https://en.wikipedia.org/wiki/Median) time window of the transmitters in the filtered raddec table  \n* **mean_time_window** (numeric variable): [mean](https://en.wikipedia.org/wiki/Arithmetic_mean) time window of the transmitters in the filtered raddec table  \n* **std_time_window** (numeric variable): [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) (spread) of the time window of the transmitters in the filtered raddec table  \n\n\n\n# Configuring the ETL\n\nThe only input the ETL process requires is a JSON configuration file. An example of such a file can be found in:\n\n`config/config.json`\n\n## Content of the json configuration file\n\nThe configuration file requires the following fields:\n\n* **event_name** : name of the event (for instance \"F1 Montreal 2022\")\n* **start_ts** : start of the time range to extract from the raddec table, in the format YYYY-MM-DD HH:MM:SS\n* **end_ts** : end of the time range to extract from the raddec table, in the format YYYY-MM-DD HH:MM:SS\n* **receivers_id** : list of receiver IDs (array of strings)\n* **db_type** : database type (currently supported: \"postgresql\")\n* **db_host** : database host (\"host.docker.internal\" if running with Docker, \"localhost\" if running the plain code)\n* **db_port** : database port (typically 5432)\n* **db_user** : database username\n* **db_pass** : database password\n* **db_name** : database name\n* **source_table** : source table name (typically \"raddec\")\n* **target_table** : target table name (typically \"etl_raddec\")\n* **watchdog_table** : watchdog table name (typically \"etl_watchdog\")\n* **log_level** : logging level (\"INFO\", \"DEBUG\", \"ERROR\")\n* **dry_run** : boolean 
flag (true/false) to enable dry-run mode (no DB writes)\n* **filtering** : list of filtering rules, each object containing:\n  * **name** : user-defined filter name (for instance \"Trying a filter on time window\")\n  * **col** : name of the column to filter on; it must be a column of the `etl_raddec` table (see above), for instance \"time_window\"\n  * **op** : operator (==, !=, \u003e=, \u003c=, \u003c, \u003e)\n  * **val** : filter value (string, number, or boolean)\n\n\n## Creating the json configuration file\n\nAlthough such a JSON file can easily be written in any text editor, you can also generate it with the following local web page:\n\n`tools/create_config.html`\n\n1. click on the HTML document, which should open in a browser\n2. fill in the configuration fields\n3. click generate JSON\n\nYou will need a CSV file listing all the receivers, from which you can then pick the ones used at the event, for instance:\n\n|                 |\n|-----------------|\n| 02a3416dc4f7    |\n| 02a3bd59e2dc    |\n| 02a3e5351a16    |\n| 02a37341e3ff    |\n| 02a384aafe9f    |\n| 02a38b484e43    |\n\n\n\nYou can then add an arbitrary number of user-defined filters.\n\n# For the developers\n\n## Building the ETL docker image\n\nBuild the image and push it to Docker Hub:\n\n`sudo docker build -t etl_app .`\n\n`sudo docker images`\n\n`sudo docker tag etl_app reelyactive/paretl-postgres:latest`\n\n`sudo docker login -u \u003cYOUR DOCKER USER NAME\u003e`\n\n`sudo docker push reelyactive/paretl-postgres:latest`\n\nRemove all containers and images from your local Docker installation:\n\n`sudo docker container prune -f`\n\n`sudo docker rmi $(sudo docker images | awk '/\u003cnone\u003e/ {print $3}')`\n\n`sudo docker stop $(sudo docker ps -aq)`\n\n`sudo docker rm $(sudo docker ps -aq)`\n\n`sudo docker rmi -f $(sudo docker images -aq)`\n\n## Testing the ETL\n\nTesting involves:\n1. the creation of a PostgreSQL database\n2. the upload of the test dataset\n3. 
the test\n\n### (1.1) Install and start postgresql\n\n`sudo apt update`\n\n`sudo apt upgrade -y`\n\n`sudo apt install postgresql postgresql-contrib -y`\n\n`sudo systemctl enable postgresql`\n\n`sudo systemctl start postgresql`\n\n`sudo systemctl status postgresql`\n\n### (1.2) Create the user\n\n`sudo -i -u postgres`\n\n`psql -c \"CREATE USER reelyactive WITH PASSWORD 'paretoanywhere';\"`\n\n### (1.3) Create database owned by the user and grant privileges\n\n`psql -c \"CREATE DATABASE pareto_anywhere OWNER reelyactive;\"`\n\n`psql -c \"GRANT ALL PRIVILEGES ON DATABASE pareto_anywhere TO reelyactive;\"`\n\n### (1.4) Check the users and databases\n\n`psql`\n\n`\\l+`\n\n`\\du`\n\n`exit`\n\n### (2.1) Create the table\n\n`psql -U reelyactive -d pareto_anywhere -h localhost`\n\n`CREATE TABLE raddec (\n    transmitterId TEXT,\n    numberOfDecodings INT,\n    receiverId TEXT,\n    rssi INT,\n    timestamp TIMESTAMP\n);`\n\n### (2.2) Upload the test dataset\n\nThe psql `\\copy` meta-command must be entered on a single line:\n\n`\\copy raddec(transmitterId, numberOfDecodings, receiverId, rssi, timestamp) FROM '/home/full/path/to/data.csv' DELIMITER ',' CSV HEADER`\n\n### Empty the tables if needed (TRUNCATE keeps the structure, DROP removes the table)\n\n`sudo -i -u postgres`\n\n`psql`\n\n`\\c pareto_anywhere`\n\n`\\dt+`\n\n`TRUNCATE TABLE etl_raddec;`\n\n`TRUNCATE TABLE etl_watchdog;`\n\n`DROP TABLE etl_raddec;`\n\n`DROP TABLE etl_watchdog;`\n \n### (2.3) Check that the data has been uploaded to the database\n\n`\\dt+`\n\n`SELECT COUNT(*) FROM raddec;`\n\n`SELECT * FROM raddec LIMIT 5;`\n\n`exit`\n\n\n### (3.1) Run the ETL\n\n`cd ..`\n\nFor a local test (no Docker), replace in `config.json`\n\"db_host\": \"host.docker.internal\",\nwith\n\"db_host\": \"localhost\",\nthen run the ETL locally:\n\n`python -m src.main -c config/config.json`\n\n\n## Structure of the ETL\n\nThe ETL follows the standard extract, transform, load structure, framed by a configuration step and a logging step:\n\n1. Configuration: once all libraries are loaded, the input JSON configuration file is read and checked.\n2. 
Extraction: using the information in the configuration file, the data are extracted from the PostgreSQL database.\n3. Transformation: metrics are built and the filters defined in the configuration file are applied to the extracted data.\n4. Loading: the filtered data are loaded into the output table specified in the configuration file.\n5. Logging: process information (CPU, RAM, duration) and general metrics of the filtered data are loaded as a single row in a watchdog table.\n\n\nContributing\n------------\n\nDiscover [how to contribute](CONTRIBUTING.md) to this open source project which upholds a standard [code of conduct](CODE_OF_CONDUCT.md).\n\n\nSecurity\n--------\n\nConsult our [security policy](SECURITY.md) for best practices using this open source software and to report vulnerabilities.\n\n\nLicense\n-------\n\nMIT License\n\nCopyright (c) 2025 [reelyActive](https://www.reelyactive.com)\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR \nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, \nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE \nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER \nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, \nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN \nTHE SOFTWARE.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freelyactive%2Fparetl-postgres","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Freelyactive%2Fparetl-postgres","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freelyactive%2Fparetl-postgres/lists"}