{"id":19888276,"url":"https://github.com/project-codeflare/data-integration","last_synced_at":"2025-06-26T15:33:02.016Z","repository":{"id":93848916,"uuid":"375429826","full_name":"project-codeflare/data-integration","owner":"project-codeflare","description":"Object Storage data processing for Ray framework","archived":false,"fork":false,"pushed_at":"2021-07-11T14:35:57.000Z","size":21,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-05-02T17:58:49.128Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/project-codeflare.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-06-09T17:00:40.000Z","updated_at":"2025-02-05T08:00:17.000Z","dependencies_parsed_at":"2023-05-19T01:00:37.610Z","dependency_job_id":null,"html_url":"https://github.com/project-codeflare/data-integration","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/project-codeflare/data-integration","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fdata-integration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fdata-integration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fdata-integration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-cod
eflare%2Fdata-integration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/project-codeflare","download_url":"https://codeload.github.com/project-codeflare/data-integration/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fdata-integration/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260497217,"owners_count":23018237,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T18:06:48.485Z","updated_at":"2025-06-26T15:33:02.005Z","avatar_url":"https://github.com/project-codeflare.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lithops4Ray - Object Storage data processing for Ray\n\nObject storage is a widely used platform for persisting large amounts of unstructured data. The goal of the Lithops4Ray project is to enable [Ray](https://ray.io) tasks or actors to access object data without forcing developers to write additional boilerplate code or address advanced aspects of accessing Big Data persisted in object storage. Lithops4Ray supports almost any object storage platform, such as IBM Cloud Object Storage, Amazon S3, Azure, Google, CEPH, and so on.\n\n## Lithops4Ray\n\nLithops4Ray is based on the [Lithops](http://lithops.cloud) framework, which lets Ray tasks and actors process data persisted in object storage. 
To integrate Lithops with Ray, install Lithops on both the head and worker nodes and configure it to access the object storage backend.\n\n\n## Installation\nConfigure Lithops to access the storage backend. Edit the provided `../scripts/lithops_config.yaml` and update the IBM Cloud Object Storage access details, including the storage bucket. For other object storage providers, follow [storage backends](https://github.com/lithops-cloud/lithops/blob/master/config/README.md#compute-and-storage-backends). \n\nNow edit the Ray cluster's `cluster.yaml` file and configure the following:\n\n```\nfile_mounts: {\n  \"~/lithops/lithops_config.yaml\":\"project-codeflare/data-integration/blob/main/scripts/lithops_config.yaml\" \n }\nsetup_commands:\n - echo 'export LITHOPS_CONFIG_FILE=~/lithops/lithops_config.yaml' \u003e\u003e ~/.bashrc\n - pip install lithops\n```\nMore details on the `cluster.yaml` file can be found [in the Ray documentation](https://docs.ray.io/en/master/cluster/config.html).\n\n## Usage example\n\nThe following example reads CSV files from object storage and searches them for a string. 
The `examples/data` folder contains the two CSV files that are searched.\n\n\timport lithops\n\timport ray\n\timport csv\n\timport io\n\n\tdef read_csv(obj, name):\n\t    # Lithops passes each object found under the given prefix as `obj`\n\t    buff = io.StringIO(obj.data_stream.read().decode())\n\t    reader = csv.reader(buff, delimiter=',')\n\t    for row in reader:\n\t        if name in row[0]:\n\t            return '{} is found in {}'.format(name, obj.key)\n\t    return '{} not found in {}'.format(name, obj.key)\n\n\t@ray.remote\n\tdef test_csv(data):\n\t    # Block until the Lithops future completes and return its value\n\t    return data.result()\n\n\tif __name__ == '__main__':\n\n\t    ray.init(ignore_reinit_error=True)\n\t    fexec = lithops.LocalhostExecutor(log_level=None)\n\n\t    my_data = fexec.map(read_csv, 'data-integration/examples/data/', extra_args=['John'])\n\t    results = [test_csv.remote(d) for d in my_data]\n\n\t    for res in results:\n\t        print(ray.get(res))\n\nRunning the code should print:\n\n\tJohn is found in ages-part1.csv\n\tJohn not found in ages-part2.csv\n\n\n## Additional material\n[Accelerating object storage processing for Ray framework](https://medium.com/p/f581863c7662)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproject-codeflare%2Fdata-integration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fproject-codeflare%2Fdata-integration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproject-codeflare%2Fdata-integration/lists"}