{"id":17124332,"url":"https://github.com/wittline/csv-schema-inference","last_synced_at":"2025-04-13T06:09:46.046Z","repository":{"id":41314964,"uuid":"509118457","full_name":"Wittline/csv-schema-inference","owner":"Wittline","description":"A tool to automatically infer columns data types in .csv files","archived":false,"fork":false,"pushed_at":"2023-01-28T03:30:40.000Z","size":95,"stargazers_count":35,"open_issues_count":3,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-13T06:09:42.541Z","etag":null,"topics":["big-data","csv","csv-files","inference","large-csv","large-files","parallel-programming","schema-inference"],"latest_commit_sha":null,"homepage":"https://wittline.github.io/csv-schema-inference/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Wittline.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-30T14:41:44.000Z","updated_at":"2024-11-23T00:31:51.000Z","dependencies_parsed_at":"2023-02-15T14:16:19.099Z","dependency_job_id":null,"html_url":"https://github.com/Wittline/csv-schema-inference","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wittline%2Fcsv-schema-inference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wittline%2Fcsv-schema-inference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wittline%2Fcsv-schema-inference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wittline%2Fcsv-sch
ema-inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Wittline","download_url":"https://codeload.github.com/Wittline/csv-schema-inference/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248670434,"owners_count":21142904,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","csv","csv-files","inference","large-csv","large-files","parallel-programming","schema-inference"],"created_at":"2024-10-14T18:42:23.882Z","updated_at":"2025-04-13T06:09:46.014Z","avatar_url":"https://github.com/Wittline.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Csv Schema Inference**\nA tool to automatically infer columns data types in .csv files\n\n### Check the article here:  \u003ca href=\"https://itnext.io/building-a-schema-inference-data-pipeline-for-large-csv-files-7a45d41ad4df\"\u003eBuilding a Schema Inference Data Pipeline for Large CSV files\u003c/a\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg \n    src=\"https://user-images.githubusercontent.com/8701464/178112999-a80d984c-5dd7-44a6-bc83-a6eeaa2bf0c5.png\"\n  \u003e\n\u003c/p\u003e\n\n\n\u003cdiv class=\"cell markdown\" id=\"bDEfBKw0v5Gl\"\u003e\n\n## **Installing csv-schema-inference** 🔧\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"5\" data-colab=\"{\u0026quot;base_uri\u0026quot;:\u0026quot;https://localhost:8080/\u0026quot;}\" id=\"NW7FOsRhtptl\" data-outputId=\"2ad79008-9ec3-44e7-8e64-f990533c1fdc\"\u003e\n\n``` python\npip install 
csv-schema-inference\n```\n\n\u003cdiv class=\"output stream stdout\"\u003e\n\n    Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n    Collecting csv-schema-inference\n      Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)\n    Installing collected packages: csv-schema-inference\n    Successfully installed csv-schema-inference-0.0.9\n\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell markdown\" id=\"fciY6CMswOcV\"\u003e\n\n## **Importing csv-schema-inference library** ⚡\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"6\" id=\"ZCe2cOfJtxbB\"\u003e\n\n``` python\nfrom csv_schema_inference import csv_schema_inference\n```\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell markdown\" id=\"ejVS9wb1wYK5\"\u003e\n\n## **Setting csv-schema-inference configuration** ✍\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"7\" id=\"MxqPQHl4t03W\"\u003e\n\n``` python\n\n# If the inferred data type is INTEGER but FLOAT values are also found in the column, the result will be FLOAT\nconditions = {\"INTEGER\":\"FLOAT\"}\n\ncsv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size = 200000, acc = 0.8, seed=2, header=True, sep=\",\", conditions = conditions)\npathfile = \"/content/file__500k.csv\"\n```\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell markdown\" id=\"-DbG_LFKwvD0\"\u003e\n\n## **Run inference** 🏃\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"8\" id=\"Ta4HiDbDwuXO\"\u003e\n\n``` python\naprox_schema = csv_infer.run_inference(pathfile)\n```\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell markdown\" id=\"sN5Y5Uktwryp\"\u003e\n\n## **Showing the approximate data type inference for each column** 🔍\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"9\" data-colab=\"{\u0026quot;base_uri\u0026quot;:\u0026quot;https://localhost:8080/\u0026quot;}\" 
id=\"lxUwb3hKwsKZ\" data-outputId=\"d269d7d9-ea0b-490d-d83f-353b8548b179\"\u003e\n\n``` python\ncsv_infer.pretty(aprox_schema)\n```\n\n\u003cdiv class=\"output stream stdout\"\u003e\n\n    0\n    \tname\n    \t\tid\n    \ttype\n    \t\tINTEGER\n    \tnullable\n    \t\tFalse\n    1\n    \tname\n    \t\tfull_name\n    \ttype\n    \t\tSTRING\n    \tnullable\n    \t\tTrue\n    2\n    \tname\n    \t\tage\n    \ttype\n    \t\tINTEGER\n    \tnullable\n    \t\tFalse\n    3\n    \tname\n    \t\tcity\n    \ttype\n    \t\tSTRING\n    \tnullable\n    \t\tTrue\n    4\n    \tname\n    \t\tweight\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    5\n    \tname\n    \t\theight\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    6\n    \tname\n    \t\tisActive\n    \ttype\n    \t\tBOOLEAN\n    \tnullable\n    \t\tFalse\n    7\n    \tname\n    \t\tcol_int1\n    \ttype\n    \t\tINTEGER\n    \tnullable\n    \t\tFalse\n    8\n    \tname\n    \t\tcol_int2\n    \ttype\n    \t\tINTEGER\n    \tnullable\n    \t\tFalse\n    9\n    \tname\n    \t\tcol_int3\n    \ttype\n    \t\tINTEGER\n    \tnullable\n    \t\tFalse\n    10\n    \tname\n    \t\tcol_float1\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    11\n    \tname\n    \t\tcol_float2\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    12\n    \tname\n    \t\tcol_float3\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    13\n    \tname\n    \t\tcol_float4\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    14\n    \tname\n    \t\tcol_float5\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    15\n    \tname\n    \t\tcol_float6\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    16\n    \tname\n    \t\tcol_float7\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    17\n    \tname\n    \t\tcol_float8\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    18\n    \tname\n    \t\tcol_float9\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    
\t\tFalse\n    19\n    \tname\n    \t\tcol_float10\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n    20\n    \tname\n    \t\ttest_column\n    \ttype\n    \t\tFLOAT\n    \tnullable\n    \t\tFalse\n\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell markdown\" id=\"LMP0nZNtxUvy\"\u003e\n\n## **Checking schema values for specific columns** ✔\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"10\" data-colab=\"{\u0026quot;base_uri\u0026quot;:\u0026quot;https://localhost:8080/\u0026quot;}\" id=\"_fxgKtFDt3aH\" data-outputId=\"0d09760a-a6b8-49f3-9230-61f8e61510d6\"\u003e\n\n``` python\nresult = csv_infer.get_schema_columns(columns = {\"test_column\"})\ncsv_infer.pretty(result)\n```\n\n\u003cdiv class=\"output stream stdout\"\u003e\n\n    20\n    \t_name\n    \t\ttest_column\n    \ttypes_found\n    \t\tINTEGER\n    \t\t\tcnt\n    \t\t\t\t406130\n    \t\tFLOAT\n    \t\t\tcnt\n    \t\t\t\t50964\n    \tnullable\n    \t\tFalse\n    \ttype\n    \t\tFLOAT\n\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell markdown\" id=\"tWIdQXTfx3hW\"\u003e\n\n## **Explore all possible data types for a specific column** ✅\n\n\u003c/div\u003e\n\n\u003cdiv class=\"cell code\" data-execution_count=\"11\" data-colab=\"{\u0026quot;base_uri\u0026quot;:\u0026quot;https://localhost:8080/\u0026quot;}\" id=\"d93OWWDMt5Qy\" data-outputId=\"db73203d-9dcb-49de-dd00-8287ae9ca7d6\"\u003e\n\n``` python\nresult = csv_infer.explore_schema_column(column = \"test_column\")\ncsv_infer.pretty(result)\n```\n\n\u003cdiv class=\"output stream stdout\"\u003e\n\n    20\n    \tname\n    \t\ttest_column\n    \ttypes_found\n    \t\tINTEGER\n    \t\t\t88.85043339006856\n    \t\tFLOAT\n    \t\t\t11.149566609931437\n    \tnullable\n    \t\tFalse\n\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n## Benchmark\nThe tests used 9 .csv files with 21 columns each and varying sizes and record counts. Each process was averaged over 5 executions: shuffle 
time and inferring time.\n\n- file__20m.csv: 20 million records\n- file__15m.csv: 15 million records\n- file__12m.csv: 12 million records\n- file__10m.csv: 10 million records\n- And so on...\n\nIf you want to know more about the shuffling process, you can check this other repository: \u003ca href=\"https://github.com/Wittline/csv-shuffler\"\u003eA tool to automatically Shuffle lines in .csv files\u003c/a\u003e. The shuffling process helps us to:\n\n1. Increase the probability of finding all the data types present in a single column.\n2. Avoid iterating over the entire dataset.\n3. Avoid biases in the data that may stem from its natural ordering, given that we do not know how it was constructed.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg \n    src=\"https://user-images.githubusercontent.com/8701464/180056932-7e34c2a6-7107-48f2-9245-95af8bb354b5.png\"\n  \u003e\n\u003c/p\u003e\n\n## Contributing and Feedback\nAny ideas or feedback about this repository? Help me improve it.\n\n## Authors\n- Created by \u003ca href=\"https://twitter.com/RamsesCoraspe\"\u003e\u003cstrong\u003eRamses Alexander Coraspe Valdez\u003c/strong\u003e\u003c/a\u003e\n- Created in 2022\n\n## License\nThis project is licensed under the terms of the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwittline%2Fcsv-schema-inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwittline%2Fcsv-schema-inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwittline%2Fcsv-schema-inference/lists"}