{"id":14066854,"url":"https://github.com/globalgov/manydata","last_synced_at":"2025-04-16T14:39:58.829Z","repository":{"id":38312377,"uuid":"291753904","full_name":"globalgov/manydata","owner":"globalgov","description":"The portal for global governance data","archived":false,"fork":false,"pushed_at":"2025-03-21T17:23:50.000Z","size":11525,"stargazers_count":9,"open_issues_count":15,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-29T05:23:31.401Z","etag":null,"topics":["data","dataset","r","r-package"],"latest_commit_sha":null,"homepage":"https://manydata.ch","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/globalgov.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-31T15:31:53.000Z","updated_at":"2025-03-21T16:55:12.000Z","dependencies_parsed_at":"2023-12-14T15:30:31.715Z","dependency_job_id":"5bdc4361-9cf2-4895-9989-dd80cadf9d23","html_url":"https://github.com/globalgov/manydata","commit_stats":{"total_commits":1617,"total_committers":8,"mean_commits":202.125,"dds":"0.36363636363636365","last_synced_commit":"54c9f9c312fe02896b9a80af1fb7fdb93ca88a19"},"previous_names":[],"tags_count":31,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/globalgov%2Fmanydata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/globalgov%2Fmanydata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/globalgov%2Fmanydat
a/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/globalgov%2Fmanydata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/globalgov","download_url":"https://codeload.github.com/globalgov/manydata/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249250971,"owners_count":21237965,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","dataset","r","r-package"],"created_at":"2024-08-13T07:05:17.942Z","updated_at":"2025-04-16T14:39:58.821Z","avatar_url":"https://github.com/globalgov.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n```{r setup, include=FALSE, warning=FALSE, message=FALSE}\nknitr::opts_chunk$set(warning   = FALSE, message   = FALSE, out.width = \"100%\",\n                      comment   = \"#\u003e\", fig.path  = \"man/figures/README-\")\n```\n\n# manydata \u003cimg src=\"man/figures/manydataLogo.png\" alt=\"The manydata logo\" align=\"right\" width=\"220\"/\u003e\n\n\u003c!-- badges: start --\u003e\n[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n![GitHub release (latest by date)](https://img.shields.io/github/v/release/globalgov/manydata)\n![GitHub Release Date](https://img.shields.io/github/release-date/globalgov/manydata)\n![GitHub issues](https://img.shields.io/github/issues-raw/globalgov/manydata)\n\u003c!-- 
[![HitCount](http://hits.dwyl.com/globalgov/manydata.svg)](http://hits.dwyl.com/globalgov/manydata) --\u003e\n[![Codecov test coverage](https://codecov.io/gh/globalgov/manydata/branch/main/graph/badge.svg)](https://app.codecov.io/gh/globalgov/manydata?branch=main)\n[![CodeFactor](https://www.codefactor.io/repository/github/globalgov/manydata/badge)](https://www.codefactor.io/repository/github/globalgov/manydata)\n[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/4562/badge)](https://bestpractices.coreinfrastructure.org/projects/4562)\n\u003c!-- ![GitHub All Releases](https://img.shields.io/github/downloads/jhollway/roctopus/total) --\u003e\n\u003c!-- badges: end --\u003e\n\n`{manydata}` is a portal to 'many' packages containing many datacubes,\neach containing many related datasets on many issue-domains,\nactors and institutions of global governance.\n`{manytreaties}` contains data on international environmental, trade, and health agreements, and `{manystates}`: contains data on states throughout history.\n\nDatasets are related to one another within a datacube through a particular coding system which follows the same principles across the different packages.\n\nFor instance, in `{manytreaties}`,\nthe `agreements` and `parties` datacubes have standardised IDs (`manyID`),\nand date variables such as `Begin` and `End` that denote the beginning and end dates of treaties respectively.\nThe beginning date is derived from the signature or entry into force date,\nwhichever is the earliest available date for the treaty.\nStandardised IDs across datasets allow the same observations to be matched across datasets so that the values can be compared or expanded where relevant.\nThese specific variable names allows the comparison of information across \ndatasets that have different sources.\nIt enables users to point out the recurrence, \ndifference or absence of observations between the datasets and\nextract more robust data when researching on a 
particular governance domain. \n\nThe parties datacube contains additional date variables on each state member's ratification,\nsignature, entry into force, and end dates for each treaty.\nData in the memberships datacube is comparable across datasets through standardised state names and stateIDs,\nmade possible with the `manypkgs::code_states()` function.\nMore information on each state, including its `Begin` and `End` date,\ncan be found in the `{manystates}` package.\n\nTo enable users to work with the data in these packages,\n`{manydata}` contains tools for: \n\n- _calling_ data packages, \n- _comparing_ individual datasets, and\n- _consolidating_ datacubes in different ways.\n\nWe intend for `{manydata}` to be useful: \n\n- at the **start** of a research project, \nto access and gather recent versions of well-regarded datasets, \nsee what is available, describe, and explore the data, \n- in the **middle** of a project, \nto facilitate analysis, comparison and modelling, and\n- at the **end** of the project, \nto help with conducting robustness checks, preparing replication scripts,\nand writing the next grant application.\n\n## Call 'many' packages\n\nThe easiest way to install `{manydata}` is directly from CRAN.\n\n```{r install, eval=FALSE}\ninstall.packages(\"manydata\")\n```\n\nThe development version of the package `{manydata}` can also be downloaded from GitHub. \n\n```{r git, eval=FALSE}\n# install.packages(\"remotes\")\nremotes::install_github(\"globalgov/manydata\")\n```\n\n```{r, include=FALSE, message=FALSE, warning=FALSE}\nlibrary(manydata)\n```\n\nOnce `{manydata}` is installed, the `call_` functions can be used to discover\nthe 'many packages' currently available and/or download or update these\npackages when needed. 
For this, the `call_packages()` function can be used.\n\n```{r get, eval=FALSE}\nlibrary(manydata)\ncall_packages() # lists all packages currently available\ncall_packages(\"manytrade\") # downloads and installs this package\n```\n\nThe `call_sources()` function obtains information about the sources and original locations of the desired datasets.\n\n```{r source}\ncall_sources(\"emperors\")\n```\n\n## Comparing 'many' data\n\nThe first thing users of the data packages may want to do is to identify\ndatasets that might contribute to their research goals.\nOne major advantage of storing datasets in datacubes is that it facilitates the\ncomparison and analysis of multiple datasets in a specific domain of global governance.\nTo aid in the selection of datasets and the use of data within datacubes,\nthe `compare_` functions in `{manydata}` allow users to quickly compare\ndifferent information on datacubes and/or datasets across 'many packages'.\nThese include comparisons of observations, variables, and ranges,\noverlap among observations, missing observations,\nand conflicts among observations.\n\nFor now, let's work with the Roman Emperors datacube included in `{manydata}`. 
\nWe can get a quick summary of the datasets included in this\npackage with the following commands:\n\n```{r load, eval=FALSE}\ndata(package = \"manydata\")\ndata(emperors, package = \"manydata\")\nemperors\n```\n\nWe can see that there are three named datasets relating to emperors here:\n`wikipedia` (dataset assembled from Wikipedia pages),\n`UNRV` (United Nations of Roman Victrix),\nand `britannica` (Britannica Encyclopedia List of Roman Emperors).\nEach of these datasets has its advantages, so we may wish\nto understand their similarities and differences,\nsummarise variables across them, and perhaps also rerun models across them.\n\nThe `compare_dimensions()` function returns a tibble with the observations and variables \nof each dataset within the specified datacube of a many package.\n\n```{r compare data}\ncompare_dimensions(emperors)\n```\n\n\u003c!-- The `compare_ranges()` function returns a tibble with the date range using the --\u003e\n\u003c!-- earliest and latest dates of each dataset within the specified datacube of a many package. 
--\u003e\n\n\u003c!-- ```{r compare range} --\u003e\n\u003c!-- compare_ranges(emperors, variable = c(\"Begin\", \"End\")) --\u003e\n\u003c!-- ``` --\u003e\n\nThe `compare_overlap()` function returns a tibble with the number of overlapping observations for a specified variable (specified using the `key` argument) across datasets within the datacube.\n\n```{r overlap, fig.alt=\"A Venn diagram of overlapping observations\"}\nplot(compare_overlap(emperors, key = \"ID\"))\n```\n\nThe `compare_missing()` function returns a tibble with the number and percentage of missing observations in datasets within a datacube.\n\n```{r missing, fig.alt=\"A heatmap of proportion missing observations\"}\nplot(compare_missing(emperors))\n```\n\nFinally, the `compare_categories()` function helps researchers identify how variables across datasets within a datacube relate to one another in five categories.\nObservations are matched by an \"ID\" variable to facilitate comparison.\nThe categories are 'confirmed', 'majority', 'unique', 'missing', and 'conflict'.\nObservations are 'confirmed' if all non-NA values are the same across all datasets,\nand 'majority' if the non-NA values are the same across most datasets.\n'Unique' observations are present in only one dataset and\n'missing' observations indicate there are no non-NA values across all datasets for that variable.\nObservations are in 'conflict' if datasets have different non-NA values.\n\n```{r categories, fig.alt=\"Stack chart of observations that are missing, in conflict, etc\"}\nplot(compare_categories(emperors, key = \"ID\"))\n```\n\n## Consolidating 'many' data\n\nTo retrieve an individual dataset from this datacube,\nwe can use the `pluck()` function.\n\n```{r pluck., eval=FALSE}\npluck(emperors, \"Wikipedia\")\n```\n\nHowever, the real value of the various 'many packages' is that multiple datasets\nrelating to the same phenomenon are presented together.\n`{manydata}` contains flexible methods for consolidating the different datasets in a datacube into a single dataset.\nFor example, you could have the rows (observations) from one dataset,\nbut add on some columns (variables) from another dataset.\nWhere there are conflicts in the values across the different datasets,\nthere are several ways that these may be resolved.\n\nThe `consolidate()` function facilitates consolidating a set of datasets, or a datacube,\nfrom a 'many' package into a single dataset with some combination of the rows and columns.\nThe function includes separate arguments for rows and columns,\nas well as for how to resolve conflicts in observations across datasets.\nThe `key` argument indicates the column to collapse datasets by.\nThis provides users with considerable flexibility in how they combine data.\n\nFor example, users may wish to see units and variables coded in \"any\" dataset\n(i.e. units or variables present in at least one of the datasets in the \ndatacube) or units and variables coded in \"every\" dataset (i.e. units or\nvariables present in all of the datasets in the datacube).\n\n```{r consolidate}\nconsolidate(datacube = emperors, join = \"full\",\n            resolve = \"coalesce\", key = \"ID\")\nconsolidate(datacube = emperors, join = \"inner\",\n            resolve = \"coalesce\", key = \"ID\")\n```\n\nUsers can also choose how they want to resolve conflicts between observations in\n`consolidate()` with several 'resolve' methods:\n\n* coalesce: the first non-NA value \n* max: the largest value\n* min: the smallest value\n* mean: the average value\n* median: the median value\n* random: a random value\n\n```{r resolve}\nconsolidate(datacube = emperors, join = \"full\", resolve = \"max\", key = \"ID\")\nconsolidate(datacube = emperors, join = \"inner\", resolve = \"min\", key = \"ID\")\n```\n\nAlternatively, users can \"favour\" a dataset in a datacube over others:\n\n```{r favour}\nconsolidate(emperors[c(\"UNRV\",\"Britannica\",\"Wikipedia\")], join = \"left\", resolve = \"coalesce\", key = 
\"ID\")\n```\n\n## Contributing to the many packages universe\n\nFor more information for developers and data contributors to 'many packages', please see the `{manypkgs}` [website](https://globalgov.github.io/manypkgs/).\n\n## Funding details\n\nDevelopment of this package has been funded by the Swiss National Science Foundation (SNSF)\n[Grant Number 188976](https://data.snf.ch/grants/grant/188976): \n\"Power and Networks and the Rate of Change in Institutional Complexes\" (PANARCHIC).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglobalgov%2Fmanydata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglobalgov%2Fmanydata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglobalgov%2Fmanydata/lists"}