{"id":13738395,"url":"https://github.com/DataHerb/dataherb-python","last_synced_at":"2025-05-08T16:33:43.313Z","repository":{"id":40420244,"uuid":"240361497","full_name":"DataHerb/dataherb-python","owner":"DataHerb","description":"Python Package for DataHerb: create, search, and load datasets.","archived":false,"fork":false,"pushed_at":"2024-08-14T15:27:42.000Z","size":1172,"stargazers_count":3,"open_issues_count":3,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-31T23:46:50.216Z","etag":null,"topics":["data","data-analysis","data-mining","database","dataset","python"],"latest_commit_sha":null,"homepage":"https://dataherb.github.io/dataherb-python","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataHerb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-13T21:08:03.000Z","updated_at":"2024-08-14T15:27:46.000Z","dependencies_parsed_at":"2024-01-12T18:31:05.271Z","dependency_job_id":"dc9a2fb9-35c7-42f8-a2cb-9aa04e239cc4","html_url":"https://github.com/DataHerb/dataherb-python","commit_stats":{"total_commits":80,"total_committers":4,"mean_commits":20.0,"dds":0.35,"last_synced_commit":"35b89cd03392d3e88a0d8992704be6605cff40e9"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHerb%2Fdataherb-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHerb%2Fdataherb-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/DataHerb%2Fdataherb-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHerb%2Fdataherb-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DataHerb","download_url":"https://codeload.github.com/DataHerb/dataherb-python/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224746665,"owners_count":17363088,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-analysis","data-mining","database","dataset","python"],"created_at":"2024-08-03T03:02:21.128Z","updated_at":"2024-11-15T07:30:57.405Z","avatar_url":"https://github.com/DataHerb.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003cbr\u003e\n  \u003ca href=\"https://dataherb.github.io\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/DataHerb/dataherb.github.io/master/assets/favicon/ms-icon-310x310.png\" alt=\"Markdownify\" width=\"200\"\u003e\u003c/a\u003e\n  \u003cbr\u003e\n  The Python Package for DataHerb\n  \u003cbr\u003e\n\u003c/h1\u003e\n\n\u003ch4 align=\"center\"\u003eA \u003ca href=\"https://dataherb.github.io\" target=\"_blank\"\u003eDataHerb\u003c/a\u003e Core Service to Create and Load Datasets.\u003c/h4\u003e\n\n\u003cp align=\"center\"\u003e\n\n\u003c/p\u003e\n\n\n\n## Install\n\n```\npip install dataherb\n```\n\nDocumentation: [dataherb.github.io/dataherb-python](https://dataherb.github.io/dataherb-python)\n\n## The DataHerb Command-Line Tool\n\n\u003e 
Requires Python 3\n\nThe DataHerb CLI provides tools to create dataset metadata, validate metadata, search datasets in a flora, and download datasets.\n\n### Search and Download\n\nSearch by keyword\n\n```\ndataherb search covid19\n# Shows the minimal metadata\n```\n\nSearch by dataherb id\n\n```\ndataherb search -i covid19_eu_data\n# Shows the full metadata\n```\n\nDownload dataset by dataherb id\n\n```\ndataherb download covid19_eu_data\n# Downloads this dataset: http://dataherb.io/flora/covid19_eu_data\n```\n\n\n### Create Dataset Using Command Line Tool\n\nWe provide a template for dataset creation.\n\nWithin a dataset folder where the data files are located, use the following command line tool to create the metadata template.\n\n```bash\ndataherb create\n```\n\n### Upload dataset to remote\n\nWithin the dataset folder, run\n\n```bash\ndataherb upload\n```\n\n### UI for all the datasets in a flora\n\n\n```bash\ndataherb serve\n```\n\n\n## Use DataHerb in Your Code\n\n### Load Data into DataFrame\n\n```\n# Load the packages\nimport pandas as pd\nfrom dataherb.flora import Flora\n\n# Initialize Flora service\n# The Flora service holds all the dataset metadata\nuse_flora = \"path/to/my/flora.json\"\ndataherb = Flora(flora=use_flora)\n\n# Search datasets with keyword(s)\ngeo_datasets = dataherb.search(\"geo\")\nprint(geo_datasets)\n\n# Get a specific file from a dataset and load as DataFrame\ntz_df = pd.read_csv(\n  dataherb.herb(\n      \"geonames_timezone\"\n  ).get_resource(\n      \"dataset/geonames_timezone.csv\"\n  )\n)\nprint(tz_df)\n```\n\n\n## The DataHerb Project\n\n\n### What is DataHerb\n\nDataHerb is an open-source data discovery and management tool.\n\n- A **DataHerb** or **Herb** is a dataset. 
A dataset comes with its data files and the metadata describing them.\n- A **Herb Resource** or **Resource** is a data file in the DataHerb.\n- A **Flora** is the combination of all the DataHerbs.\n\nIn many data projects, finding the right datasets to enhance your data is one of the most time-consuming parts. DataHerb adds flavor to your data project. By creating metadata and managing the datasets systematically, locating a dataset becomes much easier.\n\nCurrently, dataherb supports syncing datasets between local storage and S3/git. Each dataset can have its own remote location.\n\n### What is DataHerb Flora\n\nWe designed the following workflow to share and index open datasets.\n\n![DataHerb Workflow](https://raw.githubusercontent.com/DataHerb/dataherb.github.io/master/assets/images/dataherb-components.png)\n\n\u003e The repo [dataherb-flora](https://github.com/DataHerb/dataherb-flora) is a demo flora that lists some datasets and is demonstrated on the website [https://dataherb.github.io](https://dataherb.github.io). At this moment, the whole system is being renovated.\n\n## Development\n\n1. Create a conda environment.\n2. Install requirements: `pip install -r requirements.txt`\n\n## Documentation\n\nThe source of the documentation for this package is located at `docs`.\n\n\n## References and Acknowledgement\n\n- `dataherb` uses `datapackage` in the core. `datapackage` is a Python library for the [data-package standard](https://specs.frictionlessdata.io/data-package/). The core schema of the dataset is essentially the data-package standard.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDataHerb%2Fdataherb-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDataHerb%2Fdataherb-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDataHerb%2Fdataherb-python/lists"}