{"id":14067747,"url":"https://github.com/adjust/rport","last_synced_at":"2025-05-10T08:31:09.382Z","repository":{"id":10920530,"uuid":"13221418","full_name":"adjust/rport","owner":"adjust","description":"Connection management and SQL parallelisation for R analytics on big database clusters","archived":true,"fork":false,"pushed_at":"2020-09-16T10:38:30.000Z","size":110,"stargazers_count":22,"open_issues_count":4,"forks_count":7,"subscribers_count":76,"default_branch":"master","last_synced_at":"2025-03-22T07:51:26.192Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adjust.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-09-30T17:06:28.000Z","updated_at":"2024-09-10T12:51:48.000Z","dependencies_parsed_at":"2022-09-03T00:11:48.109Z","dependency_job_id":null,"html_url":"https://github.com/adjust/rport","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Frport","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Frport/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Frport/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Frport/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adjust","download_url":"https://codeload.github.com/adjust/rport/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253389736,"owners_count":21900805
,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-13T07:05:45.459Z","updated_at":"2025-05-10T08:31:08.675Z","avatar_url":"https://github.com/adjust.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"# rport - Parallel Querying on Sharded PostgreSQL Clusters for Analytics in R\n\nQuerying PostgreSQL from R is typically done using the [RPostgreSQL][rpostgresql] driver (or the newer\n[RPostgres][rpostgres]). However, in both cases it's the responsibility of the analyst to maintain the connection objects\nand pass them on every query. In analytical contexts, where data resides in multiple databases (e.g. sharded setup,\nmicroservice setup, etc.), the task of maintaining all connection objects quickly becomes very tedious.\n\nFurthermore, in many partitioning and sharding data architectures, queries could be parallelized and run simultaneously to\nget the necessary data more efficiently. However, parallelizing the querying could mean even more complexity for the analyst.\n\n`rport` solves both of these issues by:\n\n* allowing data scientists to maintain DB connection details outside of their analytics\ncodebase (e.g. 
in a `database.yml` config)\n* providing a parallelisation facility for easy SQL query distribution\n\n## Installation\n\nRport is distributed as a lightweight R package and you can get the most up-to-date version\nfrom GitHub, directly from within an R session:\n\n    \u003e library(devtools); install_github('adjust/rport')\n\nNext you'll have to define some PostgreSQL connection settings in YML format,\nby default in `config/database.yml`. See [the example\ndatabase.yml](https://github.com/adjust/rport/blob/master/tests/database.yml)\nfor reference.\n\nGiven that you have a connection named `db1` (and a running PostgreSQL database),\nyou can test the setup with:\n\n```r\nlibrary(rport)\n\ndb('db1', 'select 1')\n```\n\nIf successful, you should see the following output:\n\n```\n\u003e db('db1', 'select 1')\n2018-04-10 17:09:04 -- 1468 Executing: select 1 on db1\n2018-04-10 17:09:05 -- 1468 Done: db1\n   ?column?\n1:        1\n```\n\n## Usecases\n\nManaging the PostgreSQL connectivity in an analytics environment with even a\nsingle database can already be very beneficial. This usecase is popular and\nencouraged. The full benefit of `rport` is however unlocked in contexts where\ndata is partitioned/sharded.\n\nBelow are some of the usecases for `rport` which emphasize the benefits it\noffers in handling DB connection objects and distributing SQL queries.\n\n### rport on Sharded Database Cluster\n\nSuppose we have 16 database nodes (shards), where data is distributed by some\nkey. 
Below is a sample `config/database.yml`, which we might have in our\nworkspace defining all connection settings.\n\n```YML\nshard1:\n  database: db1\n  username: postgres\n  port: 5432\n  application_name: rport\nshard2:\n  database: db2\n  username: postgres\n  port: 5432\n  application_name: rport\n\n...\n\nshard16:\n  database: db16\n  username: postgres\n  port: 5432\n  application_name: rport\n```\n\nLet's say we want to run the following SQL query on every shard and combine the\nresults for analysis in R.\n\n```SQL\nSELECT id, name, city, country, sum(events) as events\nFROM events\nWHERE country IN ('de', 'fr', 'bg')\nGROUP BY id, name, city, country\n```\n\nTo distribute this SQL on all 16 database servers (shards), we can use the\nfollowing R code.\n\n```r\nlibrary(rport)\n\nsql \u003c- \"\n  SELECT id, name, city, country, sum(events) as events\n  FROM events\n  WHERE country IN ('de', 'fr', 'bg')\n  GROUP BY id, name, city, country\n\"\n# Perform intermediate (per-shard) aggregation, parallel on 4 cores by default.\nevents \u003c- db(paste0('shard', 1:16), sql)\n\n# Perform final (in-memory) aggregation on the resulting `data.table`\nevents \u003c- events[, .(events=sum(events)), by='country']\n```\n\n### Multiple Queries on Single Database\n\nThe database model of one of our products has data partitioned over several thousand\nPostgreSQL tables. All tables have the same schema, so we often want to run\nanalytics on data from many of these tables. 
See the example data model below\nwhere each app's data is stored in its own table.\n\n```SQL\ncreate table app_1 (id int, title text, created_at date, installs int,...);\ncreate table app_2 (id int, title text, created_at date, installs int,...);\n...\ncreate table app_100 (id int, title text, created_at date, installs int,...);\n```\n\nTo distribute a query on all apps, using `rport` in R you can do:\n\n```R\nsql \u003c- sprintf(\"\n  SELECT\n    id AS app_id,\n    created_at AS date,\n    sum(installs) installs\n  FROM app_%s\n  WHERE created_at \u003e '2018-01-01'\n  GROUP BY 1, 2\n\", 1:100)\n\ndat \u003c- db('apps-db', sql)\n```\n\nNote that the `sql` variable above is a vector of queries, each differing\nfrom the others only in the table name it reads from. `db('apps-db', sql)` will\ndistribute those queries in parallel on the single PostgreSQL instance.\n\n### Multiple Queries on Multiple Databases\n\nScaling the usecase above, let's model the raw data on the usage of apps. Each\nuser's interaction with an app will produce rows in our tables:\n\n```SQL\ncreate table app_1_20180101 (device_id uuid, created_at timestamp, os_name text, os_version ...);\ncreate table app_1_20180102 (device_id uuid, created_at timestamp, os_name text, os_version ...);\ncreate table app_1_20180103 (device_id uuid, created_at timestamp, os_name text, os_version ...);\n...\ncreate table app_2_20180101 (device_id uuid, created_at timestamp, os_name text, os_version ...);\n...\n```\n\nWe'll also put these tables into multiple PostgreSQL instances.\n\n```SQL\ncreate database db1;\ncreate database db2;\ncreate database db3;\n...\ncreate database db50;\n```\n\nAt adjust we actually query hundreds of PostgreSQL databases, where petabytes of\ndata live according to a similar partitioning scheme. We have a master\nPostgreSQL node, which contains the metadata determining which database\nthe data is stored on.  
Suppose the master instance manages the metadata in a table\nlike this:\n\n```SQL\nCREATE TABLE metadata (\n  connection_name text,\n  app_id          int,\n  created_at      date\n)\n```\n\nLet's look at how we can run analytical queries using `rport` in R on such a\nsetup. We are interested in estimating the adoption rates of iOS versions and\nthe activity we see on each version over the last 6 months from our distributed\nraw data.\n\n```R\nlibrary(rport)\n\n# Get all DB connections containing data for the last 180 days.\nmetadata \u003c- db('master', '\n  SELECT connection_name, app_id, created_at\n  FROM metadata\n  WHERE created_at \u003e current_date - 180\n')\n\n# SQL query that we want to run on every relevant node.\nsql.template \u003c- \"\n  SELECT os_version, created_at::date, count(*) AS events\n  FROM app_%d_%d\n  WHERE os_name = 'ios'\n  GROUP BY 1, 2\n\"\n\n# data.table syntax to pair connection names with the relevant SQL\nmetadata[, sql:=sprintf(sql.template, app_id, created_at)]\n\ndat \u003c- db(metadata$connection_name, metadata$sql)\n```\n\nWe expect that the database connections are defined at runtime in\n`database.yml`. This doesn't have to be the case and at adjust we define these\nconnections dynamically using `register.connections()` after reading\nfrom the master node.\n\n### rport on PostgreSQL and Shiny\n\n[Shiny][shiny] is a popular framework for interactive data visualisations in R.\nUsing the [Pool][pool] project and `rport` you can connect Shiny to either your\ndistributed cluster or simply to all the different databases you might have. Managing\nDB configurations in a centralized file makes it much easier to deploy multiple\nShiny apps.\n\n## Other Features\n\nThe main function that `rport` provides is `db()` and it's exemplified in the\nUsecases section above. 
Here's an overview of the rest of `rport`'s functions.\n\n```R\ndb.connection        # retrieve a connection object from a connection name\ndb.disconnect        # disconnect either all open connections or by connection name\nlist.connections     # get a list of all open database connections\nregister.connections # register a list of new connections (other than those defined in `database.yml`)\nreload.db.config     # reload the `database.yml` connection config file\n```\n\nFor more details on each of those functions, check their help from R - for\nexample `?db.connection`.\n\n## Flexible data inserts using PostgreSQL COPY\n\nThe R DBI interface already supports a function called `dbWriteTable`, which the\nPostgres driver implements using SQL `COPY`. However, the implementation isn't\nflexible enough and among other shortcomings it:\n\n* doesn't allow custom columns for the `COPY` and thus you can't benefit from\n  default values on the underlying PostgreSQL table.\n* can't benefit from transactions that `COPY` into temp tables (e.g. `CREATE\n  TABLE my_temp_table (...) ON COMMIT DROP`)\n\n`rport` has a stripped-down implementation which gives you the flexibility to\naddress these. Example usage:\n\n```R\n# Suppose we have a table called my_pg_table\ndat \u003c- data.table(ts=as.POSIXct('2013-01-01 00:00:10'), id=1:10, bl=TRUE)\ncon \u003c- db.connection('db1')\ntbl.name \u003c- 'my_pg_table'\n\npg.copy(con, tbl.name, dat)\n```\n\n## Configuration\n\n`rport` allows some configuration through R's `options()` functionality.\n\n### Custom Database Config\n\nBy default `rport` looks for a `config/database.yml` file. A custom `database.yml`\nlocation can be given in two ways:\n\n* by calling `options('rport-database-yml-file'='~/my-dir/my-config.yml')`\n* by setting an environment variable `RPORT_DB_CONFIG=~/my-dir/my-config.yml`\n\n### Length of the SQL log\n\n`rport` logs SQL statements on `db()` call. By default only the first 100\ncharacters are logged. 
This length can be changed with:\n\n* `options('rport-max-sql-query-log-length'=111)`\n\n### FAQ\n\n* Why did you choose only PostgreSQL as the supported backend?\n\nThe development of Rport has been driven by the internal needs at Adjust, which is a PostgreSQL company. However,\nabstracting the RDBMS backend is easily achievable and could be done in future iterations of the project. Contributions\nare also welcome.\n\n* Why not make the project even more lightweight by dropping the YML dependency?\n\nThe concept of `database.yml` connection definitions has been borrowed from the `Ruby on Rails` world. For the time\nbeing this will stay part of `rport`, but we might in the future offer support for other configuration formats and even\nmake the YML dependency obsolete.\n\n* Why did you switch the goal of the project away from a generic framework for analytics apps?\n\nThe idea of a framework for analytics apps is not dead for us. In a possible future development of such a framework,\n`rport` would definitely be a part of it. However, we chose to focus the project on addressing our growing analytics\nneeds and we realized we were mainly using the DB connectivity feature of `rport`, so we further developed that. Caching\nwas one example where we found that the `memoise` package was exactly what we needed for the purpose and so there was no\npoint in us duplicating the functionality in `rport`.\n\n* Why don't you consider the newer `RPostgres` driver for PostgreSQL?\n\nWe follow the development of the [RPostgres][rpostgres] project closely and we might switch to it as a supported\nPostgreSQL driver in the future.\n\n## Contributing\n\nTo run the test suite of `rport` you'll need [Docker][docker]. Check the\nproject's Makefile to find your way in the test suite. Build your feature and\nsend a Pull Request on GitHub. 
Or just write an issue first.\n\n## Author\n\nNikola Chochkov nikola@adjust.com, Berlin, adjust GmbH, Germany\n\n## License\n\nThis Software is licensed under the MIT License.\n\nCopyright (c) 2018 adjust GmbH, http://www.adjust.com\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of\nthis software and associated documentation files (the \"Software\"), to deal in\nthe Software without restriction, including without limitation the rights to\nuse, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of\nthe Software, and to permit persons to whom the Software is furnished to do so,\nsubject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS\nFOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR\nCOPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER\nIN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN\nCONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n\n[shiny]: https://shiny.rstudio.com \"Shiny\"\n[data_table]: https://github.com/Rdatatable/data.table \"The Data Table R Package\"\n[adjust]: http://adjust.com \"Adjust\"\n[rpostgres]: https://github.com/r-dbi/RPostgres\n[rpostgresql]: https://cran.r-project.org/web/packages/RPostgreSQL/index.html\n[pool]: https://github.com/rstudio/pool\n[docker]: https://www.docker.com/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadjust%2Frport","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadjust%2Frport","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadjust%2Frport/lists"}