{"id":17951058,"url":"https://github.com/leehuwuj/python-data-library-comparision","last_synced_at":"2025-10-12T04:06:35.432Z","repository":{"id":113012469,"uuid":"608716405","full_name":"leehuwuj/python-data-library-comparision","owner":"leehuwuj","description":"Do the test to compare python data libraries","archived":false,"fork":false,"pushed_at":"2023-03-02T15:39:06.000Z","size":2,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-04T15:40:01.030Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leehuwuj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-02T15:38:31.000Z","updated_at":"2023-03-06T08:45:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"b89d63cd-0bad-4829-905d-85c4495cc4b7","html_url":"https://github.com/leehuwuj/python-data-library-comparision","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/leehuwuj/python-data-library-comparision","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehuwuj%2Fpython-data-library-comparision","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehuwuj%2Fpython-data-library-comparision/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehuwuj%2Fpython-data-library-comparision/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leeh
uwuj%2Fpython-data-library-comparision/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leehuwuj","download_url":"https://codeload.github.com/leehuwuj/python-data-library-comparision/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehuwuj%2Fpython-data-library-comparision/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279010237,"owners_count":26084718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-29T09:42:31.964Z","updated_at":"2025-10-12T04:06:35.404Z","avatar_url":"https://github.com/leehuwuj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Description\nA performance test that reads data from S3 and runs a simple aggregation with three Python libraries: `pandas`, `polars`, and `duckdb`.\n\n## How to:\n### Edit the code:\nUpdate `S3_FILE`: the path to the S3 data  \nUpdate `AWS_S3_ACCESS_KEY` and `AWS_S3_SECRET_KEY` with your credentials\n\n### Install packages:\n```\npip install -r requirements.txt\n```\n\n### Run the test\n```\nkernprof -l -v tester.py\n```\n\n-\u003e Results:\n- In brief:\n```\nPandas: Total time: 0.195698s\nDuckDB: Total time: 0.0209346s  -\u003e about 9x faster than pandas\nPolars: Total time: 0.046093s   -\u003e about 4x faster than 
pandas\n```\n\n- Details:\n```\nTimer unit: 1e-09 s\n\nTotal time: 0.0209346 s\nFile: tester.py\nFunction: the_duck_sql at line 15\n\nLine #      Hits         Time  Per Hit   % Time  Line Contents\n==============================================================\n    15                                           @profile\n    16                                           def the_duck_sql(duckdb_s3_sql):\n    17         1     512848.0 512848.0      2.4      duckdb.sql(duckdb_s3_sql)\n    18         1    9762616.0 9762616.0     46.6      rs = duckdb.sql(\n    19         1       2864.0   2864.0      0.0          f\"\"\"\n    20                                                   SELECT max(id), avg(parent), count(*)\n    21         1       2375.0   2375.0      0.0          FROM '{S3_FILE}'\n    22                                                   \"\"\"\n    23         1   10636336.0 10636336.0     50.8      ).fetchall()\n    24         1       5797.0   5797.0      0.0      return {\n    25         1       5657.0   5657.0      0.0          'max_id': rs[0][0],\n    26         1       3352.0   3352.0      0.0          'avg_parent': rs[0][1],\n    27         1       2793.0   2793.0      0.0          'count': rs[0][2],\n    28                                               }\n\nTotal time: 0.0276167 s\nFile: tester.py\nFunction: the_duck_programmatic at line 30\n\nLine #      Hits         Time  Per Hit   % Time  Line Contents\n==============================================================\n    30                                           @profile\n    31                                           def the_duck_programmatic(duckdb_s3_sql):\n    32         1     641985.0 641985.0      2.3      duckdb.sql(duckdb_s3_sql)\n    33         1    7874445.0 7874445.0     28.5      duck = duckdb.sql(f\"SELECT * FROM '{S3_FILE}'\")\n    34                                           \n    35         1    6354549.0 6354549.0     23.0      count = duck.count('*')\n    36         1    6454214.0 
6454214.0     23.4      avg_parent = duck.mean('parent')\n    37         1    6281564.0 6281564.0     22.7      max_id = duck.max('id')\n    38                                           \n    39         1       4679.0   4679.0      0.0      return {\n    40         1       1397.0   1397.0      0.0          'max_id': max_id,\n    41         1       1956.0   1956.0      0.0          'avg_parent': avg_parent,\n    42         1       1886.0   1886.0      0.0          'count': count\n    43                                               }\n\nTotal time: 0.195698 s\nFile: tester.py\nFunction: the_pandas at line 45\n\nLine #      Hits         Time  Per Hit   % Time  Line Contents\n==============================================================\n    45                                           @profile\n    46                                           def the_pandas(storage_options: dict):\n    47         1  192850043.0 192850043.0     98.5      df = pd.read_parquet(\n    48         1       1885.0   1885.0      0.0          S3_FILE,\n    49         1       1886.0   1886.0      0.0          storage_options=storage_options\n    50                                               )\n    51         1     473876.0 473876.0      0.2      max_id = df.id.max()\n    52         1     303811.0 303811.0      0.2      avg_parent = df.parent.mean()\n    53         1    2059143.0 2059143.0      1.1      count = df.count()\n    54                                           \n    55         1       2304.0   2304.0      0.0      return {\n    56         1       1816.0   1816.0      0.0          'max_id': max_id,\n    57         1       1396.0   1396.0      0.0          'avg_parent': avg_parent,\n    58         1       1397.0   1397.0      0.0          'count': count\n    59                                               }\n\nTotal time: 0.046093 s\nFile: tester.py\nFunction: the_polar at line 61\n\nLine #      Hits         Time  Per Hit   % Time  Line 
Contents\n==============================================================\n    61                                           @profile\n    62                                           def the_polar(arrow_opts):\n    63         1      73334.0  73334.0      0.2      polars_arrow_fs = s3fs.S3FileSystem(**arrow_opts)\n    64         1   20141585.0 20141585.0     43.7      dataset = pq.ParquetDataset(S3_FILE, filesystem=polars_arrow_fs)\n    65         1   24806943.0 24806943.0     53.8      pl_df = pl.from_arrow(dataset.read())\n    66         1     650855.0 650855.0      1.4      rs = pl_df.select([\n    67         1      58597.0  58597.0      0.1          pl.col(\"id\").count(),\n    68         1      16133.0  16133.0      0.0          pl.col(\"parent\").mean(),\n    69         1       8171.0   8171.0      0.0          pl.count()\n    70                                               ])\n    71         1     337405.0 337405.0      0.7      return rs.to_dicts()[0]\n\nWrote profile results to tester.py.lprof\nTimer unit: 1e-06 s\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleehuwuj%2Fpython-data-library-comparision","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleehuwuj%2Fpython-data-library-comparision","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleehuwuj%2Fpython-data-library-comparision/lists"}