{"id":37715542,"url":"https://github.com/canimus/alphareader","last_synced_at":"2026-01-16T13:25:09.513Z","repository":{"id":57409937,"uuid":"250228581","full_name":"canimus/alphareader","owner":"canimus","description":"A custom reader for delimited files in Python. Ability to ingest big data files.","archived":false,"fork":false,"pushed_at":"2020-04-01T13:35:15.000Z","size":38,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-09-03T17:58:46.884Z","etag":null,"topics":["bigdata","chunked","csv","csv-parser","hdfs","parser","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/canimus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-26T10:33:55.000Z","updated_at":"2024-11-18T12:51:29.000Z","dependencies_parsed_at":"2022-08-24T19:00:39.763Z","dependency_job_id":null,"html_url":"https://github.com/canimus/alphareader","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/canimus/alphareader","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canimus%2Falphareader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canimus%2Falphareader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canimus%2Falphareader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canimus%2Falphareader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/canimus","download_url":"https://codeload.github.c
om/canimus/alphareader/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canimus%2Falphareader/sbom","scorecard":{"id":264376,"data":{"date":"2025-08-11","repo":{"name":"github.com/canimus/alphareader","commit":"bc9712e514837b637e3d26201247d6afbadb795a"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.3,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily 
download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities 
detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":-1,"reason":"internal error: internal error: Client.Checks.ListCheckRunsForRef: error during graphqlHandler.setupCheckRuns: non-200 OK status code: 502 Bad Gateway body: \"\u003chtml\u003e\\r\\n\u003chead\u003e\u003ctitle\u003e502 Bad Gateway\u003c/title\u003e\u003c/head\u003e\\r\\n\u003cbody\u003e\\r\\n\u003ccenter\u003e\u003ch1\u003e502 Bad Gateway\u003c/h1\u003e\u003c/center\u003e\\r\\n\u003chr\u003e\u003ccenter\u003enginx\u003c/center\u003e\\r\\n\u003c/body\u003e\\r\\n\u003c/html\u003e\\r\\n\"","details":null,"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-17T11:36:40.549Z","repository_id":57409937,"created_at":"2025-08-17T11:36:40.549Z","updated_at":"2025-08-17T11:36:40.549Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479033,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","chunked","csv","csv-parser","hdfs","parser","python"],"created_at":"2026-01-16T13:25:07.411Z","updated_at":"2026-01-16T13:25:09.502Z","avatar_url":"https://github.com/canimus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AlphaReader\n\n[![canimus](https://circleci.com/gh/canimus/alphareader.svg?style=svg)](https://circleci.com/gh/canimus/alphareader)\n\nAfter several attempts to use the `csv` package or `pandas` for reading large files with custom delimiters, I ended up writing a little program that does the job without complaints.\n\n__AlphaReader__ is a high-performance, pure-Python, 15-line library that reads chunks of bytes from your files and returns their content line by line.\n\nThe inspiration for this library came from having to extract data from an MS-SQL 
Server database, and having to deal with the `CP1252` encoding. By default, AlphaReader uses this encoding, as it was useful in our use case.\n\nIt also works with `HDFS` through the `pyarrow` library, but `pyarrow` is not a dependency.\n\n## CSVs\n```python\n# !cat file.csv\n# 1,John,Doe,2010\n# 2,Mary,Smith,2011\n# 3,Peter,Jones,2012\n\n\u003e reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44)\n\u003e next(reader)\n\u003e ['1','John','Doe','2010']\n```\n\n## TSVs\n```python\n# !cat file.tsv\n# 1    John    Doe    2010\n# 2    Mary    Smith  2011\n# 3    Peter   Jones  2012\n\n\u003e reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)\n\u003e next(reader)\n\u003e ['1','John','Doe','2010']\n```\n\n## XSVs\n```python\n# !cat file.tsv\n# 1¦John¦Doe¦2010\n# 2¦Mary¦Smith¦2011\n# 3¦Peter¦Jones¦2012\n\n\u003e ord('¦')\n\u003e 166\n\u003e chr(166)\n\u003e '¦'\n\u003e reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=166)\n\u003e next(reader)\n\u003e ['1','John','Doe','2010']\n```\n\n## HDFS\n```python\n# !hdfs dfs -cat /raw/tsv/file.tsv\n# 1    John    Doe    2010\n# 2    Mary    Smith  2011\n# 3    Peter   Jones  2012\n\n\u003e import pyarrow as pa\n\u003e fs = pa.hdfs.connect()\n\u003e reader = AlphaReader(fs.open('/raw/tsv/file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)\n\u003e next(reader)\n\u003e ['1','John','Doe','2010']\n```\n\n## Transformations\n```python\n# !cat file.csv\n# 1,2,3\n# 10,20,30\n# 100,200,300\n\n\u003e fn = lambda x: int(x)\n\u003e reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=fn)\n\u003e next(reader)\n\u003e [1,2,3]\n\u003e next(reader)\n\u003e [10,20,30]\n```\n\n## Chain Transformations\n```python\n# !cat file.csv\n# 1,2,3\n# 10,20,30\n# 100,200,300\n\n\u003e fn_1 = lambda x: x+1\n\u003e fn_2 = lambda x: x*10\n\u003e reader = 
AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=[int, fn_1, fn_2])\n\u003e next(reader)\n\u003e [20,30,40]\n\u003e next(reader)\n\u003e [110,210,310]\n```\n\n## Caution\n```python\n\u003e reader = AlphaReader(open('large_file.xsv', 'rb'), encoding='cp1252', terminator=172, delimiter=173)\n\u003e records = list(reader) # Avoid this, as it loads the whole file into memory\n```\n\n## Limitations\n- No support for `multi-byte` delimiters\n- Slower than the `csv` library. Use `csv` and dialects when your files have `\\r\\n` terminators\n- Transformations are applied per row; vectorization could perhaps aid performance\n\n## Performance\n- 24 MB file loaded with `list(AlphaReader(file_handle))`\n```bash\ntests/test_profile.py::test_alphareader_with_encoding\n--------------------------------------------------------------------------------- live log call \nINFO     root:test_profile.py:22          252343 function calls in 0.386 seconds\n\n    Ordered by: cumulative time\n\n   ncalls  tottime  percall  cumtime  percall filename:lineno(function)\n   119605    0.039    0.000    0.386    0.000 .\\alphareader\\__init__.py:39(AlphaReader)\n   122228    0.266    0.000    0.266    0.000 {method 'split' of 'str' objects}\n     2625    0.005    0.000    0.054    0.000 {method 'decode' of 'bytes' objects}\n     2624    0.001    0.000    0.049    0.000 .\\Python-3.7.4\\lib\\encodings\\cp1252.py:14(decode)\n     2624    0.048    0.000    0.048    0.000 {built-in method _codecs.charmap_decode}\n     2625    0.027    0.000    0.027    0.000 {method 'read' of '_io.BufferedReader' objects}\n        1    0.000    0.000    0.000    0.000 .\\__init__.py:5(_validate)\n        1    0.000    0.000    0.000    0.000 {built-in method 
_codecs.lookup}\n\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanimus%2Falphareader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcanimus%2Falphareader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanimus%2Falphareader/lists"}