{"id":47626864,"url":"https://github.com/nao1215/fileprep","last_synced_at":"2026-04-01T22:51:38.386Z","repository":{"id":328129179,"uuid":"1111195999","full_name":"nao1215/fileprep","owner":"nao1215","description":"struct-tag preprocessing and validation for CSV/TSV/LTSV, JSON/JSONL, Parquet, Excel.","archived":false,"fork":false,"pushed_at":"2026-03-19T08:52:18.000Z","size":4051,"stargazers_count":17,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-20T01:57:31.661Z","etag":null,"topics":["csv","excel","go","golang","json","ltsv","parquet","preprocess","preprocessing","struct-tag","tsv","validation"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nao1215.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"nao1215"}},"created_at":"2025-12-06T13:15:59.000Z","updated_at":"2026-03-19T08:52:14.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nao1215/fileprep","commit_stats":null,"previous_names":["nao1215/fileprep"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/nao1215/fileprep","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nao1215%2Ffileprep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nao1215%2Ffileprep/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nao1215%2Ffileprep/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nao1215%2Ffileprep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nao1215","download_url":"https://codeload.github.com/nao1215/fileprep/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nao1215%2Ffileprep/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31292708,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T21:15:39.731Z","status":"ssl_error","status_checked_at":"2026-04-01T21:15:34.046Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","excel","go","golang","json","ltsv","parquet","preprocess","preprocessing","struct-tag","tsv","validation"],"created_at":"2026-04-01T22:51:37.806Z","updated_at":"2026-04-01T22:51:38.377Z","avatar_url":"https://github.com/nao1215.png","language":"Go","funding_links":["https://github.com/sponsors/nao1215"],"categories":[],"sub_categories":[],"readme":"# fileprep\n\n[![Go Reference](https://pkg.go.dev/badge/github.com/nao1215/fileprep.svg)](https://pkg.go.dev/github.com/nao1215/fileprep)\n[![Go Report 
Card](https://goreportcard.com/badge/github.com/nao1215/fileprep)](https://goreportcard.com/report/github.com/nao1215/fileprep)\n[![MultiPlatformUnitTest](https://github.com/nao1215/fileprep/actions/workflows/unit_test.yml/badge.svg)](https://github.com/nao1215/fileprep/actions/workflows/unit_test.yml)\n![Coverage](https://raw.githubusercontent.com/nao1215/octocovs-central-repo/main/badges/nao1215/fileprep/coverage.svg)\n\n[日本語](doc/ja/README.md) | [Español](doc/es/README.md) | [Français](doc/fr/README.md) | [한국어](doc/ko/README.md) | [Русский](doc/ru/README.md) | [中文](doc/zh-cn/README.md)\n\n![fileprep-logo](./doc/images/fileprep-logo-small.png)\n\n**fileprep** is a Go library for cleaning, normalizing, and validating structured data—CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel—through lightweight struct-tag rules, with seamless support for gzip, bzip2, xz, zstd, zlib, snappy, s2, and lz4 streams.\n\n## Why fileprep?\n\nI developed [nao1215/filesql](https://github.com/nao1215/filesql), which allows you to execute SQL queries on files like CSV, TSV, LTSV, Parquet, and Excel. 
I also created [nao1215/csv](https://github.com/nao1215/csv) for CSV file validation.\n\nWhile studying machine learning, I realized: \"If I extend [nao1215/csv](https://github.com/nao1215/csv) to support the same file formats as [nao1215/filesql](https://github.com/nao1215/filesql), I could combine them to perform ETL-like operations.\" This idea led to the creation of **fileprep**—a library that bridges data preprocessing/validation with SQL-based file querying.\n\n## Features\n\n- Multiple file format support: CSV, TSV, LTSV, JSON (.json), JSONL (.jsonl), Parquet, Excel (.xlsx)\n- Compression support: gzip (.gz), bzip2 (.bz2), xz (.xz), zstd (.zst), zlib (.z), snappy (.snappy), s2 (.s2), lz4 (.lz4)\n- Name-based column binding: Fields auto-match `snake_case` column names, customizable via `name` tag\n- Struct tag-based preprocessing (`prep` tag): trim, lowercase, uppercase, default values\n- Struct tag-based validation (`validate` tag): required, omitempty, and more\n- Processor options: `WithStrictTagParsing()` for catching tag misconfigurations, `WithValidRowsOnly()` for filtering output\n- Seamless [filesql](https://github.com/nao1215/filesql) integration: Returns `io.Reader` for direct use with filesql\n- Detailed error reporting: Row and column information for each error\n\n## Installation\n\n```bash\ngo get github.com/nao1215/fileprep\n```\n\n## Requirements\n\n- Go Version: 1.25 or later\n- Operating Systems:\n  - Linux\n  - macOS  \n  - Windows\n\n\n## Quick Start\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"strings\"\n\n    \"github.com/nao1215/fileprep\"\n)\n\n// User represents a user record with preprocessing and validation\ntype User struct {\n    Name  string `prep:\"trim\" validate:\"required\"`\n    Email string `prep:\"trim,lowercase\"`\n    Age   string\n}\n\nfunc main() {\n    csvData := `name,email,age\n  John Doe  ,JOHN@EXAMPLE.COM,30\nJane Smith,jane@example.com,25\n`\n\n    processor := 
fileprep.NewProcessor(fileprep.FileTypeCSV)\n    var users []User\n\n    reader, result, err := processor.Process(strings.NewReader(csvData), \u0026users)\n    if err != nil {\n        fmt.Printf(\"Error: %v\\n\", err)\n        return\n    }\n\n    fmt.Printf(\"Processed %d rows, %d valid\\n\", result.RowCount, result.ValidRowCount)\n\n    for _, user := range users {\n        fmt.Printf(\"Name: %q, Email: %q\\n\", user.Name, user.Email)\n    }\n\n    // reader can be passed directly to filesql\n    _ = reader\n}\n```\n\nOutput:\n```\nProcessed 2 rows, 2 valid\nName: \"John Doe\", Email: \"john@example.com\"\nName: \"Jane Smith\", Email: \"jane@example.com\"\n```\n\n## Gotchas\n\nA few things worth knowing before you start.\n\n**JSON/JSONL → single `\"data\"` column.** fileparser flattens each JSON array element or JSONL line into one column called `\"data\"`. Your struct needs a field that maps to it:\n\n```go\ntype JSONRecord struct {\n    Data string `name:\"data\" prep:\"trim\" validate:\"required\"`\n}\n```\n\nOutput is always compact JSONL. A prep tag that breaks JSON structure causes `ErrInvalidJSONAfterPrep`; all-empty output causes `ErrEmptyJSONOutput`.\n\n**Column matching is case-sensitive.** Field `UserName` auto-converts to `user_name`. Headers spelled differently (`User_Name`, `USERNAME`, `userName`) won't match. Override with the `name` tag:\n\n```go\ntype Record struct {\n    UserName string                 // matches \"user_name\" only\n    Email    string `name:\"EMAIL\"`  // matches \"EMAIL\" exactly\n}\n```\n\n**Duplicate headers → first column wins.** Given `id,id,name`, only the first `id` binds.\n\n**Missing columns → empty string.** If a column is absent, the field gets `\"\"`. Use `validate:\"required\"` to catch this.\n\n**Excel → first sheet only.** Additional sheets in `.xlsx` are silently skipped.\n\n**Saving output memory → use `ProcessToWriter`.** `Process` buffers the entire output in memory. 
`ProcessToWriter` skips that buffer and writes directly to any `io.Writer`. Note that input records are still loaded into memory for preprocessing; this only eliminates the output copy:\n\n```go\nf, _ := os.Create(\"output.csv\")\ndefer f.Close()\n\nresult, err := processor.ProcessToWriter(input, \u0026records, f)\n```\n\n## Advanced Examples\n\n### Complex Data Preprocessing and Validation\n\nThis example demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"strings\"\n\n    \"github.com/nao1215/fileprep\"\n)\n\n// Employee represents employee data with comprehensive preprocessing and validation\ntype Employee struct {\n    // ID: pad to 6 digits, must be numeric\n    EmployeeID string `name:\"id\" prep:\"trim,pad_left=6:0\" validate:\"required,numeric,len=6\"`\n\n    // Name: clean whitespace, required alphabetic with spaces\n    FullName string `name:\"name\" prep:\"trim,collapse_space\" validate:\"required,alphaspace\"`\n\n    // Email: normalize to lowercase, validate format\n    Email string `prep:\"trim,lowercase\" validate:\"required,email\"`\n\n    // Department: normalize case, must be one of allowed values\n    Department string `prep:\"trim,uppercase\" validate:\"required,oneof=ENGINEERING SALES MARKETING HR\"`\n\n    // Salary: keep only digits, validate range\n    Salary string `prep:\"trim,keep_digits\" validate:\"required,numeric,gte=30000,lte=500000\"`\n\n    // Phone: extract digits, validate E.164 format after adding country code\n    Phone string `prep:\"trim,keep_digits,prefix=+1\" validate:\"e164\"`\n\n    // Start date: validate datetime format\n    StartDate string `name:\"start_date\" prep:\"trim\" validate:\"required,datetime=2006-01-02\"`\n\n    // Manager ID: required only if department is not HR\n    ManagerID string `name:\"manager_id\" prep:\"trim,pad_left=6:0\" validate:\"required_unless=Department 
HR\"`\n\n    // Website: fix missing scheme, validate URL\n    Website string `prep:\"trim,lowercase,fix_scheme=https\" validate:\"url\"`\n}\n\nfunc main() {\n    // Messy real-world CSV data\n    csvData := `id,name,email,department,salary,phone,start_date,manager_id,website\n  42,  John   Doe  ,JOHN.DOE@COMPANY.COM,engineering,\"$75,000\",555-123-4567,2023-01-15,000001,company.com/john\n7,Jane Smith,jane@COMPANY.com,  Sales  ,\"$120,000\",(555) 987-6543,2022-06-01,000002,WWW.LINKEDIN.COM/in/jane\n123,Bob Wilson,bob.wilson@company.com,HR,45000,555.111.2222,2024-03-20,,\n99,Alice Brown,alice@company.com,Marketing,$88500,555-444-3333,2023-09-10,000003,https://alice.dev\n`\n\n    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)\n    var employees []Employee\n\n    _, result, err := processor.Process(strings.NewReader(csvData), \u0026employees)\n    if err != nil {\n        fmt.Printf(\"Fatal error: %v\\n\", err)\n        return\n    }\n\n    fmt.Printf(\"=== Processing Result ===\\n\")\n    fmt.Printf(\"Total rows: %d, Valid rows: %d\\n\\n\", result.RowCount, result.ValidRowCount)\n\n    for i, emp := range employees {\n        fmt.Printf(\"Employee %d:\\n\", i+1)\n        fmt.Printf(\"  ID:         %s\\n\", emp.EmployeeID)\n        fmt.Printf(\"  Name:       %s\\n\", emp.FullName)\n        fmt.Printf(\"  Email:      %s\\n\", emp.Email)\n        fmt.Printf(\"  Department: %s\\n\", emp.Department)\n        fmt.Printf(\"  Salary:     %s\\n\", emp.Salary)\n        fmt.Printf(\"  Phone:      %s\\n\", emp.Phone)\n        fmt.Printf(\"  Start Date: %s\\n\", emp.StartDate)\n        fmt.Printf(\"  Manager ID: %s\\n\", emp.ManagerID)\n        fmt.Printf(\"  Website:    %s\\n\\n\", emp.Website)\n    }\n}\n```\n\nOutput:\n```\n=== Processing Result ===\nTotal rows: 4, Valid rows: 4\n\nEmployee 1:\n  ID:         000042\n  Name:       John Doe\n  Email:      john.doe@company.com\n  Department: ENGINEERING\n  Salary:     75000\n  Phone:      +15551234567\n  Start Date: 
2023-01-15\n  Manager ID: 000001\n  Website:    https://company.com/john\n\nEmployee 2:\n  ID:         000007\n  Name:       Jane Smith\n  Email:      jane@company.com\n  Department: SALES\n  Salary:     120000\n  Phone:      +15559876543\n  Start Date: 2022-06-01\n  Manager ID: 000002\n  Website:    https://www.linkedin.com/in/jane\n\nEmployee 3:\n  ID:         000123\n  Name:       Bob Wilson\n  Email:      bob.wilson@company.com\n  Department: HR\n  Salary:     45000\n  Phone:      +15551112222\n  Start Date: 2024-03-20\n  Manager ID: 000000\n  Website:\n\nEmployee 4:\n  ID:         000099\n  Name:       Alice Brown\n  Email:      alice@company.com\n  Department: MARKETING\n  Salary:     88500\n  Phone:      +15554443333\n  Start Date: 2023-09-10\n  Manager ID: 000003\n  Website:    https://alice.dev\n```\n\n\n### Detailed Error Reporting\n\nWhen validation fails, fileprep provides precise error information including row number, column name, and specific validation failure reason.\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"strings\"\n\n    \"github.com/nao1215/fileprep\"\n)\n\n// Order represents an order with strict validation rules\ntype Order struct {\n    OrderID    string `name:\"order_id\" validate:\"required,uuid4\"`\n    CustomerID string `name:\"customer_id\" validate:\"required,numeric\"`\n    Email      string `validate:\"required,email\"`\n    Amount     string `validate:\"required,number,gt=0,lte=10000\"`\n    Currency   string `validate:\"required,len=3,uppercase\"`\n    Country    string `validate:\"required,alpha,len=2\"`\n    OrderDate  string `name:\"order_date\" validate:\"required,datetime=2006-01-02\"`\n    ShipDate   string `name:\"ship_date\" validate:\"datetime=2006-01-02,gtfield=OrderDate\"`\n    IPAddress  string `name:\"ip_address\" validate:\"required,ip_addr\"`\n    PromoCode  string `name:\"promo_code\" validate:\"alphanumeric\"`\n    Quantity   string `validate:\"required,numeric,gte=1,lte=100\"`\n    UnitPrice  string 
`name:\"unit_price\" validate:\"required,number,gt=0\"`\n    TotalCheck string `name:\"total_check\" validate:\"required,eqfield=Amount\"`\n}\n\nfunc main() {\n    // CSV with multiple validation errors\n    csvData := `order_id,customer_id,email,amount,currency,country,order_date,ship_date,ip_address,promo_code,quantity,unit_price,total_check\n550e8400-e29b-41d4-a716-446655440000,12345,alice@example.com,500.00,USD,US,2024-01-15,2024-01-20,192.168.1.1,SAVE10,2,250.00,500.00\ninvalid-uuid,abc,not-an-email,-100,US,USA,2024/01/15,2024-01-10,999.999.999.999,PROMO-CODE-TOO-LONG!!,0,0,999\n550e8400-e29b-41d4-a716-446655440001,,bob@test,50000,EURO,J1,not-a-date,,2001:db8::1,VALID20,101,-50,50000\n123e4567-e89b-42d3-a456-426614174000,99999,charlie@company.com,1500.50,JPY,JP,2024-02-28,2024-02-25,10.0.0.1,VIP,5,300.10,1500.50\n`\n\n    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)\n    var orders []Order\n\n    _, result, err := processor.Process(strings.NewReader(csvData), \u0026orders)\n    if err != nil {\n        fmt.Printf(\"Fatal error: %v\\n\", err)\n        return\n    }\n\n    fmt.Printf(\"=== Validation Report ===\\n\")\n    fmt.Printf(\"Total rows:     %d\\n\", result.RowCount)\n    fmt.Printf(\"Valid rows:     %d\\n\", result.ValidRowCount)\n    fmt.Printf(\"Invalid rows:   %d\\n\", result.RowCount-result.ValidRowCount)\n    fmt.Printf(\"Total errors:   %d\\n\\n\", len(result.ValidationErrors()))\n\n    if result.HasErrors() {\n        fmt.Println(\"=== Error Details ===\")\n        for _, e := range result.ValidationErrors() {\n            fmt.Printf(\"Row %d, Column '%s': %s\\n\", e.Row, e.Column, e.Message)\n        }\n    }\n}\n```\n\nOutput:\n```\n=== Validation Report ===\nTotal rows:     4\nValid rows:     1\nInvalid rows:   3\nTotal errors:   23\n\n=== Error Details ===\nRow 2, Column 'order_id': value must be a valid UUID version 4\nRow 2, Column 'customer_id': value must be numeric\nRow 2, Column 'email': value must be a valid email 
address\nRow 2, Column 'amount': value must be greater than 0\nRow 2, Column 'currency': value must have exactly 3 characters\nRow 2, Column 'country': value must have exactly 2 characters\nRow 2, Column 'order_date': value must be a valid datetime in format: 2006-01-02\nRow 2, Column 'ip_address': value must be a valid IP address\nRow 2, Column 'promo_code': value must contain only alphanumeric characters\nRow 2, Column 'quantity': value must be greater than or equal to 1\nRow 2, Column 'unit_price': value must be greater than 0\nRow 2, Column 'ship_date': value must be greater than field OrderDate\nRow 2, Column 'total_check': value must equal field Amount\nRow 3, Column 'customer_id': value is required\nRow 3, Column 'email': value must be a valid email address\nRow 3, Column 'amount': value must be less than or equal to 10000\nRow 3, Column 'currency': value must have exactly 3 characters\nRow 3, Column 'country': value must contain only alphabetic characters\nRow 3, Column 'order_date': value must be a valid datetime in format: 2006-01-02\nRow 3, Column 'quantity': value must be less than or equal to 100\nRow 3, Column 'unit_price': value must be greater than 0\nRow 3, Column 'ship_date': value must be greater than field OrderDate\nRow 4, Column 'ship_date': value must be greater than field OrderDate\n```\n\n## Preprocessing Tags (`prep`)\n\nMultiple tags can be combined: `prep:\"trim,lowercase,default=N/A\"`\n\n### Basic Preprocessors\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `trim` | Remove leading/trailing whitespace | `prep:\"trim\"` |\n| `ltrim` | Remove leading whitespace | `prep:\"ltrim\"` |\n| `rtrim` | Remove trailing whitespace | `prep:\"rtrim\"` |\n| `lowercase` | Convert to lowercase | `prep:\"lowercase\"` |\n| `uppercase` | Convert to uppercase | `prep:\"uppercase\"` |\n| `default=value` | Set default if empty | `prep:\"default=N/A\"` |\n\n### String Transformation\n\n| Tag | Description | Example 
|\n|-----|-------------|---------|\n| `replace=old:new` | Replace all occurrences | `prep:\"replace=;:,\"` |\n| `prefix=value` | Prepend string to value | `prep:\"prefix=ID_\"` |\n| `suffix=value` | Append string to value | `prep:\"suffix=_END\"` |\n| `truncate=N` | Limit to N characters | `prep:\"truncate=100\"` |\n| `strip_html` | Remove HTML tags | `prep:\"strip_html\"` |\n| `strip_newline` | Remove newlines (LF, CRLF, CR) | `prep:\"strip_newline\"` |\n| `collapse_space` | Collapse multiple spaces into one | `prep:\"collapse_space\"` |\n\n### Character Filtering\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `remove_digits` | Remove all digits | `prep:\"remove_digits\"` |\n| `remove_alpha` | Remove all alphabetic characters | `prep:\"remove_alpha\"` |\n| `keep_digits` | Keep only digits | `prep:\"keep_digits\"` |\n| `keep_alpha` | Keep only alphabetic characters | `prep:\"keep_alpha\"` |\n| `trim_set=chars` | Remove specified characters from both ends | `prep:\"trim_set=@#$\"` |\n\n### Padding\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `pad_left=N:char` | Left-pad to N characters | `prep:\"pad_left=5:0\"` |\n| `pad_right=N:char` | Right-pad to N characters | `prep:\"pad_right=10: \"` |\n\n### Advanced Preprocessors\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `normalize_unicode` | Normalize Unicode to NFC form | `prep:\"normalize_unicode\"` |\n| `nullify=value` | Treat specific string as empty | `prep:\"nullify=NULL\"` |\n| `coerce=type` | Type coercion (int, float, bool) | `prep:\"coerce=int\"` |\n| `fix_scheme=scheme` | Add or fix URL scheme | `prep:\"fix_scheme=https\"` |\n| `regex_replace=pattern:replacement` | Regex-based replacement | `prep:\"regex_replace=\\\\d+:X\"` |\n\n## Validation Tags (`validate`)\n\nMultiple tags can be combined: `validate:\"required,email\"`\n\n### Basic Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `required` | Field 
must not be empty | `validate:\"required\"` |\n| `omitempty` | Skip subsequent validators if value is empty | `validate:\"omitempty,email\"` |\n| `boolean` | Must be true, false, 0, or 1 | `validate:\"boolean\"` |\n\n### Character Type Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `alpha` | ASCII alphabetic characters only | `validate:\"alpha\"` |\n| `alphaunicode` | Unicode letters only | `validate:\"alphaunicode\"` |\n| `alphaspace` | Alphabetic characters or spaces | `validate:\"alphaspace\"` |\n| `alphanumeric` | ASCII alphanumeric characters | `validate:\"alphanumeric\"` |\n| `alphanumunicode` | Unicode letters or digits | `validate:\"alphanumunicode\"` |\n| `numeric` | Valid integer | `validate:\"numeric\"` |\n| `number` | Valid number (integer or decimal) | `validate:\"number\"` |\n| `ascii` | ASCII characters only | `validate:\"ascii\"` |\n| `printascii` | Printable ASCII characters (0x20-0x7E) | `validate:\"printascii\"` |\n| `multibyte` | Contains multibyte characters | `validate:\"multibyte\"` |\n\n### Numeric Comparison Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `eq=N` | Value equals N | `validate:\"eq=100\"` |\n| `ne=N` | Value not equals N | `validate:\"ne=0\"` |\n| `gt=N` | Value greater than N | `validate:\"gt=0\"` |\n| `gte=N` | Value greater than or equal to N | `validate:\"gte=1\"` |\n| `lt=N` | Value less than N | `validate:\"lt=100\"` |\n| `lte=N` | Value less than or equal to N | `validate:\"lte=99\"` |\n| `min=N` | Value at least N | `validate:\"min=0\"` |\n| `max=N` | Value at most N | `validate:\"max=100\"` |\n| `len=N` | Exactly N characters | `validate:\"len=10\"` |\n\n### String Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `oneof=a b c` | Value is one of the allowed values | `validate:\"oneof=active inactive\"` |\n| `lowercase` | Value is all lowercase | `validate:\"lowercase\"` |\n| `uppercase` | Value is all uppercase | 
`validate:\"uppercase\"` |\n| `eq_ignore_case=value` | Case-insensitive equality | `validate:\"eq_ignore_case=yes\"` |\n| `ne_ignore_case=value` | Case-insensitive not equal | `validate:\"ne_ignore_case=no\"` |\n\n### String Content Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `startswith=prefix` | Value starts with prefix | `validate:\"startswith=http\"` |\n| `startsnotwith=prefix` | Value does not start with prefix | `validate:\"startsnotwith=_\"` |\n| `endswith=suffix` | Value ends with suffix | `validate:\"endswith=.com\"` |\n| `endsnotwith=suffix` | Value does not end with suffix | `validate:\"endsnotwith=.tmp\"` |\n| `contains=substr` | Value contains substring | `validate:\"contains=@\"` |\n| `containsany=chars` | Value contains any of the chars | `validate:\"containsany=abc\"` |\n| `containsrune=r` | Value contains the rune | `validate:\"containsrune=@\"` |\n| `excludes=substr` | Value does not contain substring | `validate:\"excludes=admin\"` |\n| `excludesall=chars` | Value does not contain any of the chars | `validate:\"excludesall=\u003c\u003e\"` |\n| `excludesrune=r` | Value does not contain the rune | `validate:\"excludesrune=$\"` |\n\n### Format Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `email` | Valid email address | `validate:\"email\"` |\n| `uri` | Valid URI | `validate:\"uri\"` |\n| `url` | Valid URL | `validate:\"url\"` |\n| `http_url` | Valid HTTP or HTTPS URL | `validate:\"http_url\"` |\n| `https_url` | Valid HTTPS URL | `validate:\"https_url\"` |\n| `url_encoded` | URL encoded string | `validate:\"url_encoded\"` |\n| `datauri` | Valid data URI | `validate:\"datauri\"` |\n| `datetime=layout` | Valid datetime matching Go layout | `validate:\"datetime=2006-01-02\"` |\n| `uuid` | Valid UUID (any version) | `validate:\"uuid\"` |\n| `uuid3` | Valid UUID version 3 | `validate:\"uuid3\"` |\n| `uuid4` | Valid UUID version 4 | `validate:\"uuid4\"` |\n| `uuid5` | Valid UUID 
version 5 | `validate:\"uuid5\"` |\n| `ulid` | Valid ULID | `validate:\"ulid\"` |\n| `e164` | Valid E.164 phone number | `validate:\"e164\"` |\n| `latitude` | Valid latitude (-90 to 90) | `validate:\"latitude\"` |\n| `longitude` | Valid longitude (-180 to 180) | `validate:\"longitude\"` |\n| `hexadecimal` | Valid hexadecimal string | `validate:\"hexadecimal\"` |\n| `hexcolor` | Valid hex color code | `validate:\"hexcolor\"` |\n| `rgb` | Valid RGB color | `validate:\"rgb\"` |\n| `rgba` | Valid RGBA color | `validate:\"rgba\"` |\n| `hsl` | Valid HSL color | `validate:\"hsl\"` |\n| `hsla` | Valid HSLA color | `validate:\"hsla\"` |\n\n### Network Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `ip_addr` | Valid IP address (v4 or v6) | `validate:\"ip_addr\"` |\n| `ip4_addr` | Valid IPv4 address | `validate:\"ip4_addr\"` |\n| `ip6_addr` | Valid IPv6 address | `validate:\"ip6_addr\"` |\n| `cidr` | Valid CIDR notation | `validate:\"cidr\"` |\n| `cidrv4` | Valid IPv4 CIDR | `validate:\"cidrv4\"` |\n| `cidrv6` | Valid IPv6 CIDR | `validate:\"cidrv6\"` |\n| `mac` | Valid MAC address | `validate:\"mac\"` |\n| `fqdn` | Valid fully qualified domain name | `validate:\"fqdn\"` |\n| `hostname` | Valid hostname (RFC 952) | `validate:\"hostname\"` |\n| `hostname_rfc1123` | Valid hostname (RFC 1123) | `validate:\"hostname_rfc1123\"` |\n| `hostname_port` | Valid hostname:port | `validate:\"hostname_port\"` |\n\n### Cross-Field Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `eqfield=Field` | Value equals another field | `validate:\"eqfield=Password\"` |\n| `nefield=Field` | Value not equals another field | `validate:\"nefield=OldPassword\"` |\n| `gtfield=Field` | Value greater than another field | `validate:\"gtfield=MinPrice\"` |\n| `gtefield=Field` | Value \u003e= another field | `validate:\"gtefield=StartDate\"` |\n| `ltfield=Field` | Value less than another field | `validate:\"ltfield=MaxPrice\"` |\n| 
`ltefield=Field` | Value \u003c= another field | `validate:\"ltefield=EndDate\"` |\n| `fieldcontains=Field` | Value contains another field's value | `validate:\"fieldcontains=Keyword\"` |\n| `fieldexcludes=Field` | Value excludes another field's value | `validate:\"fieldexcludes=Forbidden\"` |\n\n### Conditional Required Validators\n\n| Tag | Description | Example |\n|-----|-------------|---------|\n| `required_if=Field value` | Required if field equals value | `validate:\"required_if=Status active\"` |\n| `required_unless=Field value` | Required unless field equals value | `validate:\"required_unless=Type guest\"` |\n| `required_with=Field` | Required if field is present | `validate:\"required_with=Email\"` |\n| `required_without=Field` | Required if field is absent | `validate:\"required_without=Phone\"` |\n\n**Examples:**\n\n```go\ntype User struct {\n    Role    string\n    // Profile is required when Role is \"admin\", optional for other roles\n    Profile string `validate:\"required_if=Role admin\"`\n    // Bio is required unless Role is \"guest\"\n    Bio     string `validate:\"required_unless=Role guest\"`\n}\n\ntype Contact struct {\n    Email string\n    Phone string\n    // Name is required when Email is non-empty\n    Name  string `validate:\"required_with=Email\"`\n    // At least one of Email or BackupEmail must be provided\n    BackupEmail string `validate:\"required_without=Email\"`\n}\n```\n\n## Supported File Formats\n\n| Format | Extension | Compressed Extensions |\n|--------|-----------|----------------------|\n| CSV | `.csv` | `.csv.gz`, `.csv.bz2`, `.csv.xz`, `.csv.zst`, `.csv.z`, `.csv.snappy`, `.csv.s2`, `.csv.lz4` |\n| TSV | `.tsv` | `.tsv.gz`, `.tsv.bz2`, `.tsv.xz`, `.tsv.zst`, `.tsv.z`, `.tsv.snappy`, `.tsv.s2`, `.tsv.lz4` |\n| LTSV | `.ltsv` | `.ltsv.gz`, `.ltsv.bz2`, `.ltsv.xz`, `.ltsv.zst`, `.ltsv.z`, `.ltsv.snappy`, `.ltsv.s2`, `.ltsv.lz4` |\n| JSON | `.json` | `.json.gz`, `.json.bz2`, `.json.xz`, `.json.zst`, `.json.z`, 
`.json.snappy`, `.json.s2`, `.json.lz4` |\n| JSONL | `.jsonl` | `.jsonl.gz`, `.jsonl.bz2`, `.jsonl.xz`, `.jsonl.zst`, `.jsonl.z`, `.jsonl.snappy`, `.jsonl.s2`, `.jsonl.lz4` |\n| Excel | `.xlsx` | `.xlsx.gz`, `.xlsx.bz2`, `.xlsx.xz`, `.xlsx.zst`, `.xlsx.z`, `.xlsx.snappy`, `.xlsx.s2`, `.xlsx.lz4` |\n| Parquet | `.parquet` | `.parquet.gz`, `.parquet.bz2`, `.parquet.xz`, `.parquet.zst`, `.parquet.z`, `.parquet.snappy`, `.parquet.s2`, `.parquet.lz4` |\n\n### Supported Compression Formats\n\n| Format | Extension | Library | Notes |\n|--------|-----------|---------|-------|\n| gzip | `.gz` | compress/gzip | Standard library |\n| bzip2 | `.bz2` | compress/bzip2 | Standard library |\n| xz | `.xz` | github.com/ulikunitz/xz | Pure Go |\n| zstd | `.zst` | github.com/klauspost/compress/zstd | Pure Go, high performance |\n| zlib | `.z` | compress/zlib | Standard library |\n| snappy | `.snappy` | github.com/klauspost/compress/snappy | Pure Go, high performance |\n| s2 | `.s2` | github.com/klauspost/compress/s2 | Snappy-compatible, faster |\n| lz4 | `.lz4` | github.com/pierrec/lz4/v4 | Pure Go |\n\n**Note on Parquet compression**: The external compression (`.parquet.gz`, etc.) is for the container file itself. 
Parquet files may also use internal compression (Snappy, GZIP, LZ4, ZSTD) which is handled transparently by the parquet-go library.\n\n## Integration with filesql\n\n```go\n// Process file with preprocessing and validation\nprocessor := fileprep.NewProcessor(fileprep.FileTypeCSV)\nvar records []MyRecord\n\nreader, result, err := processor.Process(file, \u0026records)\nif err != nil {\n    return err\n}\n\n// Check for validation errors\nif result.HasErrors() {\n    for _, e := range result.ValidationErrors() {\n        log.Printf(\"Row %d, Column %s: %s\", e.Row, e.Column, e.Message)\n    }\n}\n\n// Pass preprocessed data to filesql using Builder pattern\nctx := context.Background()\nbuilder := filesql.NewBuilder().\n    AddReader(reader, \"my_table\", filesql.FileTypeCSV)\n\nvalidatedBuilder, err := builder.Build(ctx)\nif err != nil {\n    return err\n}\n\ndb, err := validatedBuilder.Open(ctx)\nif err != nil {\n    return err\n}\ndefer db.Close()\n\n// Execute SQL queries on preprocessed data\nrows, err := db.QueryContext(ctx, \"SELECT * FROM my_table WHERE age \u003e 20\")\n```\n\n## Processor Options\n\n`NewProcessor` accepts functional options to customize behavior:\n\n### WithStrictTagParsing\n\nBy default, invalid tag arguments (e.g., `eq=abc` where a number is expected) are silently ignored. Enable strict mode to catch these misconfigurations:\n\n```go\nprocessor := fileprep.NewProcessor(fileprep.FileTypeCSV, fileprep.WithStrictTagParsing())\nvar records []MyRecord\n\n// Returns an error if any tag argument is invalid (e.g., \"eq=abc\", \"truncate=xyz\")\n_, _, err := processor.Process(input, \u0026records)\n```\n\n### WithValidRowsOnly\n\nBy default, the output includes all rows (valid and invalid). 
Use `WithValidRowsOnly` to filter the output to only valid rows:\n\n```go\nprocessor := fileprep.NewProcessor(fileprep.FileTypeCSV, fileprep.WithValidRowsOnly())\nvar records []MyRecord\n\nreader, result, err := processor.Process(input, \u0026records)\n// reader contains only rows that passed all validations\n// records contains only valid structs\n// result.RowCount includes all rows; result.ValidRowCount has the valid count\n// result.Errors still reports all validation failures\n```\n\nOptions can be combined:\n\n```go\nprocessor := fileprep.NewProcessor(fileprep.FileTypeCSV,\n    fileprep.WithStrictTagParsing(),\n    fileprep.WithValidRowsOnly(),\n)\n```\n\n## Design Considerations\n\n### Name-Based Column Binding\n\nStruct fields are mapped to file columns **by name**, not by position. Field names are automatically converted to `snake_case` to match column headers. Column order in the file does not matter.\n\n```go\ntype User struct {\n    UserName string `name:\"user\"`       // matches \"user\" column (not \"user_name\")\n    Email    string `name:\"mail_addr\"`  // matches \"mail_addr\" column (not \"email\")\n    Age      string                     // matches \"age\" column (auto snake_case)\n}\n```\n\nIf your LTSV keys use hyphens (`user-id`) or Parquet/XLSX columns use camelCase (`userId`), use the `name` tag to specify the exact column name.\n\nSee [Gotchas](#gotchas) for case-sensitivity rules, duplicate header behavior, and missing column handling.\n\n### Memory Usage\n\nfileprep loads the **entire file into memory** for processing. This enables random access and multi-pass operations but has implications for large files:\n\n| File Size | Approx. 
Memory | Recommendation |\n|-----------|----------------|----------------|\n| \u003c 100 MB | ~2-3x file size | Direct processing |\n| 100-500 MB | ~500 MB - 1.5 GB | Monitor memory, consider chunking |\n| \u003e 500 MB | \u003e 1.5 GB | Split files or use streaming alternatives |\n\nFor compressed inputs (gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4), memory usage depends on the **decompressed** size.\n\n## Performance\n\nBenchmark results for processing CSV files with a complex struct containing 21 columns. Each field uses multiple preprocessing and validation tags:\n\n**Preprocessing tags used:** trim, lowercase, uppercase, keep_digits, pad_left, strip_html, strip_newline, collapse_space, truncate, fix_scheme, default\n\n**Validation tags used:** required, alpha, numeric, email, uuid, ip_addr, url, oneof, min, max, len, printascii, ascii, eqfield\n\n| Records | Time | Memory | Allocs/op |\n|--------:|-----:|-------:|----------:|\n| 100 | 0.6 ms | 0.9 MB | 7,654 |\n| 1,000 | 6.1 ms | 9.6 MB | 74,829 |\n| 10,000 | 69 ms | 101 MB | 746,266 |\n| 50,000 | 344 ms | 498 MB | 3,690,281 |\n\n```bash\n# Quick benchmark (100 and 1,000 records)\nmake bench\n\n# Full benchmark (all sizes including 50,000 records)\nmake bench-all\n```\n\n*Tested on AMD Ryzen AI MAX+ 395, Go 1.24, Linux. Results vary by hardware.*\n\n## Related or Inspired Projects\n\n- [nao1215/filesql](https://github.com/nao1215/filesql) - SQL driver for CSV, TSV, LTSV, Parquet, Excel with gzip, bzip2, xz, zstd support.\n- [nao1215/fileframe](https://github.com/nao1215/fileframe) - DataFrame API for CSV/TSV/LTSV, Parquet, Excel. 
\n- [nao1215/csv](https://github.com/nao1215/csv) - read csv with validation and simple DataFrame in golang.\n- [go-playground/validator](https://github.com/go-playground/validator) - Go Struct and Field validation, including Cross Field, Cross Struct, Map, Slice and Array diving\n- [shogo82148/go-header-csv](https://github.com/shogo82148/go-header-csv) - go-header-csv is encoder/decoder csv with a header.\n\n## Contributing\n\nContributions are welcome! Please see the [Contributing Guide](./CONTRIBUTING.md) for more details.\n\n## Support\n\nIf you find this project useful, please consider:\n\n- Giving it a star on GitHub - it helps others discover the project\n- [Becoming a sponsor](https://github.com/sponsors/nao1215) - your support keeps the project alive and motivates continued development\n\nYour support, whether through stars, sponsorships, or contributions, is what drives this project forward. Thank you!\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnao1215%2Ffileprep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnao1215%2Ffileprep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnao1215%2Ffileprep/lists"}