{"id":13400880,"url":"https://github.com/hrbrmstr/sergeant","last_synced_at":"2025-03-16T20:31:12.568Z","repository":{"id":45887545,"uuid":"60310735","full_name":"hrbrmstr/sergeant","owner":"hrbrmstr","description":":guardsman: Tools to Transform and Query Data with 'Apache' 'Drill'","archived":false,"fork":false,"pushed_at":"2022-04-18T13:42:13.000Z","size":18621,"stargazers_count":126,"open_issues_count":7,"forks_count":13,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-02-27T13:19:08.292Z","etag":null,"topics":["apache-drill","dplyr","drill","parquet-files","r","r-cyber","rstats","sql"],"latest_commit_sha":null,"homepage":"https://hrbrmstr.github.io/sergeant/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-06-03T02:03:16.000Z","updated_at":"2024-09-05T05:47:56.000Z","dependencies_parsed_at":"2022-09-23T09:50:16.460Z","dependency_job_id":null,"html_url":"https://github.com/hrbrmstr/sergeant","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fsergeant","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fsergeant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fsergeant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fsergeant/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.github.com/hrbrmstr/serge
ant/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243830912,"owners_count":20354848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-drill","dplyr","drill","parquet-files","r","r-cyber","rstats","sql"],"created_at":"2024-07-30T19:00:56.633Z","updated_at":"2025-03-16T20:31:12.147Z","avatar_url":"https://github.com/hrbrmstr.png","language":"R","funding_links":[],"categories":["R","Backend"],"sub_categories":["Database"],"readme":"\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1248912.svg)](https://doi.org/10.5281/zenodo.1248912)\n[![CRAN\\_Status\\_Badge](https://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant)\n\n# 💂 sergeant\n\nTools to Transform and Query Data with ‘Apache’ ‘Drill’\n\n## \\*\\* IMPORTANT \\*\\*\n\nVersion 0.7.0+ (a.k.a. the main branch) splits off the JDBC interface\ninto a separate package `sergeant.caffeinated`\n([GitHub](https://github.com/hrbrmstr/sergeant-caffeinated)).\n\n# Description\n\nDrill + `sergeant` is (IMO) a streamlined alternative to Spark +\n`sparklyr` if you don’t need the ML components of Spark (i.e. just need\nto query “big data” sources, need to interface with parquet, need to\ncombine disparate data source types — json, csv, parquet, rdbms - for\naggregation, etc). 
Drill also has support for spatial queries.\n\nUsing Drill SQL queries that reference parquet files on a local Linux or\nmacOS workstation can often be more performant than doing the same data\ningestion \u0026 wrangling work with R (especially for large or disparate\ndata sets). Drill can often help further streamline workflows that\ninvolve wrangling many tiny JSON files on a daily basis.\n\nDrill can be obtained from \u003chttps://drill.apache.org/download/\u003e (use\n“Direct File Download”). Drill can also be installed via\n[Docker](https://drill.apache.org/docs/running-drill-on-docker/). For\nlocal installs on Unix-like systems, a common/suggested location for\nthe Drill install directory is `/usr/local/drill`.\n\nDrill embedded (started using the `$DRILL_BASE_DIR/bin/drill-embedded`\nscript) is a super-easy way to get started playing with Drill on a\nsingle workstation, and many workflows can “get by” using Drill\nthis way.\n\nThere are a few convenience wrappers for various informational SQL\nqueries (like `drill_version()`). Please file a PR if you add more.\n\nSome of the more “controlling vs data ops” REST API functions aren’t\nimplemented. Please file a PR if you need those.\n\nThe following functions are implemented:\n\n**`DBI`** (REST)\n\n  - A “just enough” feature complete R `DBI` driver has been implemented\n    using the Drill REST API, mostly to facilitate the `dplyr`\n    interface. Use the `RJDBC` driver interface if you need more `DBI`\n    functionality.\n  - This also means that SQL functions unique to Drill have been\n    “implemented” (i.e. made accessible to the `dplyr` interface). If\n    you have custom Drill SQL functions that need to be implemented,\n    please file an issue on GitHub. 
Many should work without it, but\n    some may require a custom interface.\n\n**`dplyr`**: (REST)\n\n  - `src_drill`: Connect to Drill (using `dplyr`) + supporting functions\n\nNote that a number of Drill SQL functions have been mapped to R\nfunctions (e.g. `grepl`) to make it easier to transition from\nnon-database-backed SQL ops to Drill. See the help on\n`drill_custom_functions` for more info on these helper Drill custom\nfunction mappings.\n\n**Drill APIs**:\n\n  - `drill_connection`: Setup parameters for a Drill server/cluster\n    connection\n  - `drill_active`: Test whether Drill HTTP REST API server is up\n  - `drill_cancel`: Cancel the query that has the given query id\n  - `drill_functions`: Show all the available Drill built-in functions \u0026\n    UDFs (Apache Drill 1.15.0+ required)\n  - `drill_jdbc`: Connect to Drill using JDBC\n  - `drill_metrics`: Get the current memory metrics\n  - `drill_options`: List the name, default, and data type of the system\n    and session options\n  - `drill_popts`: Show all the available Drill options (1.15.0+)\n  - `drill_profile`: Get the profile of the query that has the given\n    query id\n  - `drill_profiles`: Get the profiles of running and completed queries\n  - `drill_query`: Submit a query and return results\n  - `drill_set`: Set Drill SYSTEM or SESSION options\n  - `drill_settings_reset`: Changes (optionally, all) session settings\n    back to system defaults\n  - `drill_show_files`: Show files in a file system schema.\n  - `drill_show_schemas`: Returns a list of available schemas.\n  - `drill_stats`: Get Drillbit information, such as port numbers\n  - `drill_status`: Get the status of Drill\n  - `drill_storage`: Get the list of storage plugin names and\n    configurations\n  - `drill_system_reset`: Changes (optionally, all) system settings back\n    to system defaults\n  - `drill_threads`: Get information about threads\n  - `drill_uplift`: Turn columnar query results into a type-converted\n    tbl\n  - 
`drill_use`: Change to a particular schema.\n  - `drill_version`: Identify the version of Drill running\n\n**Helpers**\n\n  - `ctas_profile`: Generate a Drill CTAS Statement from a Query\n  - `drill_up`: Start a Dockerized Drill Instance\n  - `drill_down`: Stop a Dockerized Drill Instance by container id\n  - `showall_drill`: Show all dead and running Drill Docker containers\n  - `stopall_drill`: Prune all dead and running Drill Docker containers\n\n# Installation\n\n``` r\ninstall.packages(\"sergeant\", repos = \"https://cinc.rud.is\")\n# or\ndevtools::install_git(\"https://git.rud.is/hrbrmstr/sergeant.git\")\n# or\ndevtools::install_git(\"https://git.sr.ht/~hrbrmstr/sergeant\")\n# or\ndevtools::install_gitlab(\"hrbrmstr/sergeant\")\n# or\ndevtools::install_bitbucket(\"hrbrmstr/sergeant\")\n# or\ndevtools::install_github(\"hrbrmstr/sergeant\")\n```\n\n# Usage\n\n### `dplyr` interface\n\n``` r\nlibrary(sergeant)\nlibrary(tidyverse)\n\n# use localhost if running standalone on the same system; otherwise use the host or IP of your Drill server\nds \u003c- src_drill(\"localhost\")\ndb \u003c- tbl(ds, \"cp.`employee.json`\") \n\n# without `collect()`:\ncount(db, gender, marital_status)\n##  # Source:   lazy query [?? 
x 3]\n##  # Database: DrillConnection\n##  # Groups:   gender\n##    gender marital_status     n\n##    \u003cchr\u003e  \u003cchr\u003e          \u003cdbl\u003e\n##  1 F      S                297\n##  2 M      M                278\n##  3 M      S                276\n##  4 F      M                304\n\ncount(db, gender, marital_status) %\u003e% collect()\n##  # A tibble: 4 x 3\n##  # Groups:   gender [2]\n##    gender marital_status     n\n##    \u003cchr\u003e  \u003cchr\u003e          \u003cdbl\u003e\n##  1 F      S                297\n##  2 M      M                278\n##  3 M      S                276\n##  4 F      M                304\n\ngroup_by(db, position_title) %\u003e%\n  count(gender) -\u003e tmp2\n\ngroup_by(db, position_title) %\u003e%\n  count(gender) %\u003e%\n  ungroup() %\u003e%\n  mutate(full_desc = ifelse(gender == \"F\", \"Female\", \"Male\")) %\u003e%\n  collect() %\u003e%\n  select(Title = position_title, Gender = full_desc, Count = n)\n##  # A tibble: 30 x 3\n##     Title                  Gender Count\n##     \u003cchr\u003e                  \u003cchr\u003e  \u003cdbl\u003e\n##   1 President              Female     1\n##   2 VP Country Manager     Male       3\n##   3 VP Country Manager     Female     3\n##   4 VP Information Systems Female     1\n##   5 VP Human Resources     Female     1\n##   6 Store Manager          Female    13\n##   7 VP Finance             Male       1\n##   8 Store Manager          Male      11\n##   9 HQ Marketing           Female     2\n##  10 HQ Information Systems Female     4\n##  # … with 20 more rows\n\narrange(db, desc(employee_id)) %\u003e% print(n = 20)\n##  # Source:     table\u003ccp.`employee.json`\u003e [?? 
x 20]\n##  # Database:   DrillConnection\n##  # Ordered by: desc(employee_id)\n##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date\n##     \u003cchr\u003e       \u003cchr\u003e     \u003cchr\u003e      \u003cchr\u003e     \u003cchr\u003e       \u003cchr\u003e          \u003cchr\u003e    \u003cchr\u003e         \u003cchr\u003e      \u003cchr\u003e    \n##   1 999         Beverly … Beverly    Dittmar   17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   2 998         Elizabet… Elizabeth  Jantzer   17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   3 997         John Swe… John       Sweet     17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   4 996         William … William    Murphy    17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   5 995         Carol Li… Carol      Lindsay   17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   6 994         Richard … Richard    Burke     17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   7 993         Ethan Bu… Ethan      Bunosky   17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   8 992         Claudett… Claudette  Cabrera   17          Store Permane… 8        17            1914-02-02 1998-01-…\n##   9 991         Maria Te… Maria      Terry     17          Store Permane… 8        17            1914-02-02 1998-01-…\n##  10 990         Stacey C… Stacey     Case      17          Store Permane… 8        17            1914-02-02 1998-01-…\n##  11 99          Elizabet… Elizabeth  Horne     18          Store Tempora… 6        18            1976-10-05 1997-01-…\n##  12 989         Dominick… Dominick   Nutter    17          Store Permane… 8        17            1914-02-02 1998-01-…\n##  13 988         Brian Wi… Brian      Willeford 17          Store Permane… 8        17            1914-02-02 
1998-01-…\n##  14 987         Margaret… Margaret   Clendenen 17          Store Permane… 8        17            1914-02-02 1998-01-…\n##  15 986         Maeve Wa… Maeve      Wall      17          Store Permane… 8        17            1914-02-02 1998-01-…\n##  16 985         Mildred … Mildred    Morrow    16          Store Tempora… 8        16            1914-02-02 1998-01-…\n##  17 984         French W… French     Wilson    16          Store Tempora… 8        16            1914-02-02 1998-01-…\n##  18 983         Elisabet… Elisabeth  Duncan    16          Store Tempora… 8        16            1914-02-02 1998-01-…\n##  19 982         Linda An… Linda      Anderson  16          Store Tempora… 8        16            1914-02-02 1998-01-…\n##  20 981         Selene W… Selene     Watson    16          Store Tempora… 8        16            1914-02-02 1998-01-…\n##  # … with more rows, and 6 more variables: salary \u003cchr\u003e, supervisor_id \u003cchr\u003e, education_level \u003cchr\u003e,\n##  #   marital_status \u003cchr\u003e, gender \u003cchr\u003e, management_role \u003cchr\u003e\n\nmutate(db, position_title = tolower(position_title)) %\u003e%\n  mutate(salary = as.numeric(salary)) %\u003e%\n  mutate(gender = ifelse(gender == \"F\", \"Female\", \"Male\")) %\u003e%\n  mutate(marital_status = ifelse(marital_status == \"S\", \"Single\", \"Married\")) %\u003e%\n  group_by(supervisor_id) %\u003e%\n  summarise(underlings_count = n()) %\u003e%\n  collect()\n##  # A tibble: 112 x 2\n##     supervisor_id underlings_count\n##     \u003cchr\u003e                    \u003cdbl\u003e\n##   1 0                            1\n##   2 1                            7\n##   3 5                            9\n##   4 4                            2\n##   5 2                            3\n##   6 20                           2\n##   7 21                           4\n##   8 22                           7\n##   9 6                            4\n##  10 36                           2\n##  # … with 
102 more rows\n```\n\n### REST API\n\n``` r\ndc \u003c- drill_connection(\"localhost\") \n\ndrill_active(dc)\n##  [1] TRUE\n\ndrill_version(dc)\n##  [1] \"1.15.0\"\n\ndrill_storage(dc)$name\n##   [1] \"cp\"       \"dfs\"      \"drilldat\" \"hbase\"    \"hdfs\"     \"hive\"     \"kudu\"     \"mongo\"    \"my\"       \"s3\"\n\ndrill_query(dc, \"SELECT * FROM cp.`employee.json` limit 100\")\n##  # A tibble: 100 x 16\n##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date\n##     \u003cchr\u003e       \u003cchr\u003e     \u003cchr\u003e      \u003cchr\u003e     \u003cchr\u003e       \u003cchr\u003e          \u003cchr\u003e    \u003cchr\u003e         \u003cchr\u003e      \u003cchr\u003e    \n##   1 1           Sheri No… Sheri      Nowmer    1           President      0        1             1961-08-26 1994-12-…\n##   2 2           Derrick … Derrick    Whelply   2           VP Country Ma… 0        1             1915-07-03 1994-12-…\n##   3 4           Michael … Michael    Spence    2           VP Country Ma… 0        1             1969-06-20 1998-01-…\n##   4 5           Maya Gut… Maya       Gutierrez 2           VP Country Ma… 0        1             1951-05-10 1998-01-…\n##   5 6           Roberta … Roberta    Damstra   3           VP Informatio… 0        2             1942-10-08 1994-12-…\n##   6 7           Rebecca … Rebecca    Kanagaki  4           VP Human Reso… 0        3             1949-03-27 1994-12-…\n##   7 8           Kim Brun… Kim        Brunner   11          Store Manager  9        11            1922-08-10 1998-01-…\n##   8 9           Brenda B… Brenda     Blumberg  11          Store Manager  21       11            1979-06-23 1998-01-…\n##   9 10          Darren S… Darren     Stanz     5           VP Finance     0        5             1949-08-26 1994-12-…\n##  10 11          Jonathan… Jonathan   Murraiin  11          Store Manager  1        11            1967-06-20 1998-01-…\n##  # … with 90 
more rows, and 6 more variables: salary \u003cchr\u003e, supervisor_id \u003cchr\u003e, education_level \u003cchr\u003e,\n##  #   marital_status \u003cchr\u003e, gender \u003cchr\u003e, management_role \u003cchr\u003e\n\ndrill_query(dc, \"SELECT COUNT(gender) AS gct FROM cp.`employee.json` GROUP BY gender\")\n\ndrill_options(dc)\n##  # A tibble: 179 x 6\n##     name                                                        value    defaultValue accessibleScopes kind   optionScope\n##     \u003cchr\u003e                                                       \u003cchr\u003e    \u003cchr\u003e        \u003cchr\u003e            \u003cchr\u003e  \u003cchr\u003e      \n##   1 debug.validate_iterators                                    FALSE    false        ALL              BOOLE… BOOT       \n##   2 debug.validate_vectors                                      FALSE    false        ALL              BOOLE… BOOT       \n##   3 drill.exec.functions.cast_empty_string_to_null              FALSE    false        ALL              BOOLE… BOOT       \n##   4 drill.exec.hashagg.fallback.enabled                         FALSE    false        ALL              BOOLE… BOOT       \n##   5 drill.exec.hashjoin.fallback.enabled                        FALSE    false        ALL              BOOLE… BOOT       \n##   6 drill.exec.memory.operator.output_batch_size                16777216 16777216     SYSTEM           LONG   BOOT       \n##   7 drill.exec.memory.operator.output_batch_size_avail_mem_fac… 0.1      0.1          SYSTEM           DOUBLE BOOT       \n##   8 drill.exec.storage.file.partition.column.label              dir      dir          ALL              STRING BOOT       \n##   9 drill.exec.storage.implicit.filename.column.label           filename filename     ALL              STRING BOOT       \n##  10 drill.exec.storage.implicit.filepath.column.label           filepath filepath     ALL              STRING BOOT       \n##  # … with 169 more rows\n\ndrill_options(dc, \"json\")\n##  # A 
tibble: 10 x 6\n##     name                                                    value defaultValue accessibleScopes kind    optionScope\n##     \u003cchr\u003e                                                   \u003cchr\u003e \u003cchr\u003e        \u003cchr\u003e            \u003cchr\u003e   \u003cchr\u003e      \n##   1 store.hive.maprdb_json.optimize_scan_with_native_reader FALSE false        ALL              BOOLEAN BOOT       \n##   2 store.json.all_text_mode                                TRUE  false        ALL              BOOLEAN SYSTEM     \n##   3 store.json.extended_types                               TRUE  false        ALL              BOOLEAN SYSTEM     \n##   4 store.json.read_numbers_as_double                       FALSE false        ALL              BOOLEAN BOOT       \n##   5 store.json.reader.allow_nan_inf                         TRUE  true         ALL              BOOLEAN BOOT       \n##   6 store.json.reader.print_skipped_invalid_record_number   TRUE  false        ALL              BOOLEAN SYSTEM     \n##   7 store.json.reader.skip_invalid_records                  TRUE  false        ALL              BOOLEAN SYSTEM     \n##   8 store.json.writer.allow_nan_inf                         TRUE  true         ALL              BOOLEAN BOOT       \n##   9 store.json.writer.skip_null_fields                      TRUE  true         ALL              BOOLEAN BOOT       \n##  10 store.json.writer.uglify                                TRUE  false        ALL              BOOLEAN SYSTEM\n```\n\n## Working with parquet files\n\n``` r\ndrill_query(dc, \"SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5\")\n##  # A tibble: 5 x 4\n##    N_NATIONKEY N_NAME    N_REGIONKEY N_COMMENT           \n##          \u003cdbl\u003e \u003cchr\u003e           \u003cdbl\u003e \u003cchr\u003e               \n##  1           0 ALGERIA             0 haggle. 
carefully f \n##  2           1 ARGENTINA           1 al foxes promise sly\n##  3           2 BRAZIL              1 y alongside of the p\n##  4           3 CANADA              1 eas hang ironic, sil\n##  5           4 EGYPT               4 y above the carefull\n```\n\nIncluding multiple parquet files in different directories (note the\nwildcard support):\n\n``` r\ndrill_query(dc, \"SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5\")\n##  # A tibble: 5 x 5\n##    dir0      N_NATIONKEY N_NAME    N_REGIONKEY N_COMMENT           \n##    \u003cchr\u003e           \u003cdbl\u003e \u003cchr\u003e           \u003cdbl\u003e \u003cchr\u003e               \n##  1 nationsSF           0 ALGERIA             0 haggle. carefully f \n##  2 nationsSF           1 ARGENTINA           1 al foxes promise sly\n##  3 nationsSF           2 BRAZIL              1 y alongside of the p\n##  4 nationsSF           3 CANADA              1 eas hang ironic, sil\n##  5 nationsSF           4 EGYPT               4 y above the carefull\n```\n\n### Drill has built-in support for spatial ops\n\nVia: \u003chttps://github.com/k255/drill-gis\u003e\n\nA common use case is to select data within the boundary of a given polygon:\n\n``` r\ndrill_query(dc, \"\nselect columns[2] as city, columns[4] as lon, columns[3] as lat\n    from cp.`sample-data/CA-cities.csv`\n    where\n        ST_Within(\n            ST_Point(columns[4], columns[3]),\n            ST_GeomFromText(\n                'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'\n                )\n            )\n\")\n##  # A tibble: 7 x 3\n##    city        lon          lat       \n##    \u003cchr\u003e       \u003cchr\u003e        \u003cchr\u003e     \n##  1 Burbank     -121.9316233 37.3232752\n##  2 San Jose    -121.8949555 37.3393857\n##  3 Lick        -121.8457863 37.2871647\n##  4 Willow Glen -121.8896771 37.3085532\n##  5 Buena Vista -121.9166227 37.3213308\n##  6 Parkmoor    
-121.9307898 37.3210531\n##  7 Fruitdale   -121.932746  37.31086\n```\n\n### sergeant Metrics\n\n| Lang | \\# Files | (%) | LoC | (%) | Blank lines | (%) | \\# Lines | (%) |\n| :--- | -------: | --: | --: | --: | ----------: | --: | -------: | --: |\n| Rmd  |        1 |   1 |  55 |   1 |          54 |   1 |       89 |   1 |\n\n## Code of Conduct\n\nPlease note that this project is released with a Contributor Code of\nConduct. By participating in this project you agree to\nabide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fsergeant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fsergeant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fsergeant/lists"}