https://github.com/nao1215/fileprep
struct-tag preprocessing and validation for CSV/TSV/LTSV, JSON/JSONL, Parquet, Excel.
https://github.com/nao1215/fileprep
csv excel go golang json ltsv parquet preprocess preprocessing struct-tag tsv validation
Last synced: about 23 hours ago
JSON representation
struct-tag preprocessing and validation for CSV/TSV/LTSV, JSON/JSONL, Parquet, Excel.
- Host: GitHub
- URL: https://github.com/nao1215/fileprep
- Owner: nao1215
- License: mit
- Created: 2025-12-06T13:15:59.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-19T08:52:18.000Z (15 days ago)
- Last Synced: 2026-03-20T01:57:31.661Z (14 days ago)
- Topics: csv, excel, go, golang, json, ltsv, parquet, preprocess, preprocessing, struct-tag, tsv, validation
- Language: Go
- Homepage:
- Size: 3.86 MB
- Stars: 17
- Watchers: 0
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
README
# fileprep
[](https://pkg.go.dev/github.com/nao1215/fileprep)
[](https://goreportcard.com/report/github.com/nao1215/fileprep)
[](https://github.com/nao1215/fileprep/actions/workflows/unit_test.yml)

[日本語](doc/ja/README.md) | [Español](doc/es/README.md) | [Français](doc/fr/README.md) | [한국어](doc/ko/README.md) | [Русский](doc/ru/README.md) | [中文](doc/zh-cn/README.md)

**fileprep** is a Go library for cleaning, normalizing, and validating structured data—CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel—through lightweight struct-tag rules, with seamless support for gzip, bzip2, xz, zstd, zlib, snappy, s2, and lz4 streams.
## Why fileprep?
I developed [nao1215/filesql](https://github.com/nao1215/filesql), which allows you to execute SQL queries on files like CSV, TSV, LTSV, Parquet, and Excel. I also created [nao1215/csv](https://github.com/nao1215/csv) for CSV file validation.
While studying machine learning, I realized: "If I extend [nao1215/csv](https://github.com/nao1215/csv) to support the same file formats as [nao1215/filesql](https://github.com/nao1215/filesql), I could combine them to perform ETL-like operations." This idea led to the creation of **fileprep**—a library that bridges data preprocessing/validation with SQL-based file querying.
## Features
- Multiple file format support: CSV, TSV, LTSV, JSON (.json), JSONL (.jsonl), Parquet, Excel (.xlsx)
- Compression support: gzip (.gz), bzip2 (.bz2), xz (.xz), zstd (.zst), zlib (.z), snappy (.snappy), s2 (.s2), lz4 (.lz4)
- Name-based column binding: Fields auto-match `snake_case` column names, customizable via `name` tag
- Struct tag-based preprocessing (`prep` tag): trim, lowercase, uppercase, default values
- Struct tag-based validation (`validate` tag): required, omitempty, and more
- Processor options: `WithStrictTagParsing()` for catching tag misconfigurations, `WithValidRowsOnly()` for filtering output
- Seamless [filesql](https://github.com/nao1215/filesql) integration: Returns `io.Reader` for direct use with filesql
- Detailed error reporting: Row and column information for each error
## Installation
```bash
go get github.com/nao1215/fileprep
```
## Requirements
- Go Version: 1.25 or later
- Operating Systems:
- Linux
- macOS
- Windows
## Quick Start
```go
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// User represents a user record with preprocessing and validation
type User struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase"`
Age string
}
func main() {
csvData := `name,email,age
John Doe ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var users []User
reader, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
for _, user := range users {
fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
}
// reader can be passed directly to filesql
_ = reader
}
```
Output:
```
Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"
```
## Gotchas
A few things worth knowing before you start.
**JSON/JSONL → single `"data"` column.** fileparser flattens each JSON array element or JSONL line into one column called `"data"`. Your struct needs a field that maps to it:
```go
type JSONRecord struct {
Data string `name:"data" prep:"trim" validate:"required"`
}
```
Output is always compact JSONL. A prep tag that breaks JSON structure causes `ErrInvalidJSONAfterPrep`; all-empty output causes `ErrEmptyJSONOutput`.
**Column matching is case-sensitive.** Field `UserName` auto-converts to `user_name`. Headers spelled differently (`User_Name`, `USERNAME`, `userName`) won't match. Override with the `name` tag:
```go
type Record struct {
UserName string // matches "user_name" only
Email string `name:"EMAIL"` // matches "EMAIL" exactly
}
```
**Duplicate headers → first column wins.** Given `id,id,name`, only the first `id` binds.
**Missing columns → empty string.** If a column is absent, the field gets `""`. Use `validate:"required"` to catch this.
**Excel → first sheet only.** Additional sheets in `.xlsx` are silently skipped.
**Saving output memory → use `ProcessToWriter`.** `Process` buffers the entire output in memory. `ProcessToWriter` skips that buffer and writes directly to any `io.Writer`. Note that input records are still loaded into memory for preprocessing; this only eliminates the output copy:
```go
f, _ := os.Create("output.csv")
defer f.Close()
result, err := processor.ProcessToWriter(input, &records, f)
```
## Advanced Examples
### Complex Data Preprocessing and Validation
This example demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.
```go
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// Employee represents employee data with comprehensive preprocessing and validation
type Employee struct {
// ID: pad to 6 digits, must be numeric
EmployeeID string `name:"id" prep:"trim,pad_left=6:0" validate:"required,numeric,len=6"`
// Name: clean whitespace, required alphabetic with spaces
FullName string `name:"name" prep:"trim,collapse_space" validate:"required,alphaspace"`
// Email: normalize to lowercase, validate format
Email string `prep:"trim,lowercase" validate:"required,email"`
// Department: normalize case, must be one of allowed values
Department string `prep:"trim,uppercase" validate:"required,oneof=ENGINEERING SALES MARKETING HR"`
// Salary: keep only digits, validate range
Salary string `prep:"trim,keep_digits" validate:"required,numeric,gte=30000,lte=500000"`
// Phone: extract digits, validate E.164 format after adding country code
Phone string `prep:"trim,keep_digits,prefix=+1" validate:"e164"`
// Start date: validate datetime format
StartDate string `name:"start_date" prep:"trim" validate:"required,datetime=2006-01-02"`
// Manager ID: required only if department is not HR
ManagerID string `name:"manager_id" prep:"trim,pad_left=6:0" validate:"required_unless=Department HR"`
// Website: fix missing scheme, validate URL
Website string `prep:"trim,lowercase,fix_scheme=https" validate:"url"`
}
func main() {
// Messy real-world CSV data
csvData := `id,name,email,department,salary,phone,start_date,manager_id,website
42, John Doe ,JOHN.DOE@COMPANY.COM,engineering,"$75,000",555-123-4567,2023-01-15,000001,company.com/john
7,Jane Smith,jane@COMPANY.com, Sales ,"$120,000",(555) 987-6543,2022-06-01,000002,WWW.LINKEDIN.COM/in/jane
123,Bob Wilson,bob.wilson@company.com,HR,45000,555.111.2222,2024-03-20,,
99,Alice Brown,alice@company.com,Marketing,$88500,555-444-3333,2023-09-10,000003,https://alice.dev
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var employees []Employee
_, result, err := processor.Process(strings.NewReader(csvData), &employees)
if err != nil {
fmt.Printf("Fatal error: %v\n", err)
return
}
fmt.Printf("=== Processing Result ===\n")
fmt.Printf("Total rows: %d, Valid rows: %d\n\n", result.RowCount, result.ValidRowCount)
for i, emp := range employees {
fmt.Printf("Employee %d:\n", i+1)
fmt.Printf(" ID: %s\n", emp.EmployeeID)
fmt.Printf(" Name: %s\n", emp.FullName)
fmt.Printf(" Email: %s\n", emp.Email)
fmt.Printf(" Department: %s\n", emp.Department)
fmt.Printf(" Salary: %s\n", emp.Salary)
fmt.Printf(" Phone: %s\n", emp.Phone)
fmt.Printf(" Start Date: %s\n", emp.StartDate)
fmt.Printf(" Manager ID: %s\n", emp.ManagerID)
fmt.Printf(" Website: %s\n\n", emp.Website)
}
}
```
Output:
```
=== Processing Result ===
Total rows: 4, Valid rows: 4
Employee 1:
ID: 000042
Name: John Doe
Email: john.doe@company.com
Department: ENGINEERING
Salary: 75000
Phone: +15551234567
Start Date: 2023-01-15
Manager ID: 000001
Website: https://company.com/john
Employee 2:
ID: 000007
Name: Jane Smith
Email: jane@company.com
Department: SALES
Salary: 120000
Phone: +15559876543
Start Date: 2022-06-01
Manager ID: 000002
Website: https://www.linkedin.com/in/jane
Employee 3:
ID: 000123
Name: Bob Wilson
Email: bob.wilson@company.com
Department: HR
Salary: 45000
Phone: +15551112222
Start Date: 2024-03-20
Manager ID: 000000
Website:
Employee 4:
ID: 000099
Name: Alice Brown
Email: alice@company.com
Department: MARKETING
Salary: 88500
Phone: +15554443333
Start Date: 2023-09-10
Manager ID: 000003
Website: https://alice.dev
```
### Detailed Error Reporting
When validation fails, fileprep provides precise error information including row number, column name, and specific validation failure reason.
```go
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// Order represents an order with strict validation rules
type Order struct {
OrderID string `name:"order_id" validate:"required,uuid4"`
CustomerID string `name:"customer_id" validate:"required,numeric"`
Email string `validate:"required,email"`
Amount string `validate:"required,number,gt=0,lte=10000"`
Currency string `validate:"required,len=3,uppercase"`
Country string `validate:"required,alpha,len=2"`
OrderDate string `name:"order_date" validate:"required,datetime=2006-01-02"`
ShipDate string `name:"ship_date" validate:"datetime=2006-01-02,gtfield=OrderDate"`
IPAddress string `name:"ip_address" validate:"required,ip_addr"`
PromoCode string `name:"promo_code" validate:"alphanumeric"`
Quantity string `validate:"required,numeric,gte=1,lte=100"`
UnitPrice string `name:"unit_price" validate:"required,number,gt=0"`
TotalCheck string `name:"total_check" validate:"required,eqfield=Amount"`
}
func main() {
// CSV with multiple validation errors
csvData := `order_id,customer_id,email,amount,currency,country,order_date,ship_date,ip_address,promo_code,quantity,unit_price,total_check
550e8400-e29b-41d4-a716-446655440000,12345,alice@example.com,500.00,USD,US,2024-01-15,2024-01-20,192.168.1.1,SAVE10,2,250.00,500.00
invalid-uuid,abc,not-an-email,-100,US,USA,2024/01/15,2024-01-10,999.999.999.999,PROMO-CODE-TOO-LONG!!,0,0,999
550e8400-e29b-41d4-a716-446655440001,,bob@test,50000,EURO,J1,not-a-date,,2001:db8::1,VALID20,101,-50,50000
123e4567-e89b-42d3-a456-426614174000,99999,charlie@company.com,1500.50,JPY,JP,2024-02-28,2024-02-25,10.0.0.1,VIP,5,300.10,1500.50
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var orders []Order
_, result, err := processor.Process(strings.NewReader(csvData), &orders)
if err != nil {
fmt.Printf("Fatal error: %v\n", err)
return
}
fmt.Printf("=== Validation Report ===\n")
fmt.Printf("Total rows: %d\n", result.RowCount)
fmt.Printf("Valid rows: %d\n", result.ValidRowCount)
fmt.Printf("Invalid rows: %d\n", result.RowCount-result.ValidRowCount)
fmt.Printf("Total errors: %d\n\n", len(result.ValidationErrors()))
if result.HasErrors() {
fmt.Println("=== Error Details ===")
for _, e := range result.ValidationErrors() {
fmt.Printf("Row %d, Column '%s': %s\n", e.Row, e.Column, e.Message)
}
}
}
```
Output:
```
=== Validation Report ===
Total rows: 4
Valid rows: 1
Invalid rows: 3
Total errors: 23
=== Error Details ===
Row 2, Column 'order_id': value must be a valid UUID version 4
Row 2, Column 'customer_id': value must be numeric
Row 2, Column 'email': value must be a valid email address
Row 2, Column 'amount': value must be greater than 0
Row 2, Column 'currency': value must have exactly 3 characters
Row 2, Column 'country': value must have exactly 2 characters
Row 2, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 2, Column 'ip_address': value must be a valid IP address
Row 2, Column 'promo_code': value must contain only alphanumeric characters
Row 2, Column 'quantity': value must be greater than or equal to 1
Row 2, Column 'unit_price': value must be greater than 0
Row 2, Column 'ship_date': value must be greater than field OrderDate
Row 2, Column 'total_check': value must equal field Amount
Row 3, Column 'customer_id': value is required
Row 3, Column 'email': value must be a valid email address
Row 3, Column 'amount': value must be less than or equal to 10000
Row 3, Column 'currency': value must have exactly 3 characters
Row 3, Column 'country': value must contain only alphabetic characters
Row 3, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 3, Column 'quantity': value must be less than or equal to 100
Row 3, Column 'unit_price': value must be greater than 0
Row 3, Column 'ship_date': value must be greater than field OrderDate
Row 4, Column 'ship_date': value must be greater than field OrderDate
```
## Preprocessing Tags (`prep`)
Multiple tags can be combined: `prep:"trim,lowercase,default=N/A"`
### Basic Preprocessors
| Tag | Description | Example |
|-----|-------------|---------|
| `trim` | Remove leading/trailing whitespace | `prep:"trim"` |
| `ltrim` | Remove leading whitespace | `prep:"ltrim"` |
| `rtrim` | Remove trailing whitespace | `prep:"rtrim"` |
| `lowercase` | Convert to lowercase | `prep:"lowercase"` |
| `uppercase` | Convert to uppercase | `prep:"uppercase"` |
| `default=value` | Set default if empty | `prep:"default=N/A"` |
### String Transformation
| Tag | Description | Example |
|-----|-------------|---------|
| `replace=old:new` | Replace all occurrences | `prep:"replace=;:,"` |
| `prefix=value` | Prepend string to value | `prep:"prefix=ID_"` |
| `suffix=value` | Append string to value | `prep:"suffix=_END"` |
| `truncate=N` | Limit to N characters | `prep:"truncate=100"` |
| `strip_html` | Remove HTML tags | `prep:"strip_html"` |
| `strip_newline` | Remove newlines (LF, CRLF, CR) | `prep:"strip_newline"` |
| `collapse_space` | Collapse multiple spaces into one | `prep:"collapse_space"` |
### Character Filtering
| Tag | Description | Example |
|-----|-------------|---------|
| `remove_digits` | Remove all digits | `prep:"remove_digits"` |
| `remove_alpha` | Remove all alphabetic characters | `prep:"remove_alpha"` |
| `keep_digits` | Keep only digits | `prep:"keep_digits"` |
| `keep_alpha` | Keep only alphabetic characters | `prep:"keep_alpha"` |
| `trim_set=chars` | Remove specified characters from both ends | `prep:"trim_set=@#$"` |
### Padding
| Tag | Description | Example |
|-----|-------------|---------|
| `pad_left=N:char` | Left-pad to N characters | `prep:"pad_left=5:0"` |
| `pad_right=N:char` | Right-pad to N characters | `prep:"pad_right=10: "` |
### Advanced Preprocessors
| Tag | Description | Example |
|-----|-------------|---------|
| `normalize_unicode` | Normalize Unicode to NFC form | `prep:"normalize_unicode"` |
| `nullify=value` | Treat specific string as empty | `prep:"nullify=NULL"` |
| `coerce=type` | Type coercion (int, float, bool) | `prep:"coerce=int"` |
| `fix_scheme=scheme` | Add or fix URL scheme | `prep:"fix_scheme=https"` |
| `regex_replace=pattern:replacement` | Regex-based replacement | `prep:"regex_replace=\\d+:X"` |
## Validation Tags (`validate`)
Multiple tags can be combined: `validate:"required,email"`
### Basic Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `required` | Field must not be empty | `validate:"required"` |
| `omitempty` | Skip subsequent validators if value is empty | `validate:"omitempty,email"` |
| `boolean` | Must be true, false, 0, or 1 | `validate:"boolean"` |
### Character Type Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `alpha` | ASCII alphabetic characters only | `validate:"alpha"` |
| `alphaunicode` | Unicode letters only | `validate:"alphaunicode"` |
| `alphaspace` | Alphabetic characters or spaces | `validate:"alphaspace"` |
| `alphanumeric` | ASCII alphanumeric characters | `validate:"alphanumeric"` |
| `alphanumunicode` | Unicode letters or digits | `validate:"alphanumunicode"` |
| `numeric` | Valid integer | `validate:"numeric"` |
| `number` | Valid number (integer or decimal) | `validate:"number"` |
| `ascii` | ASCII characters only | `validate:"ascii"` |
| `printascii` | Printable ASCII characters (0x20-0x7E) | `validate:"printascii"` |
| `multibyte` | Contains multibyte characters | `validate:"multibyte"` |
### Numeric Comparison Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `eq=N` | Value equals N | `validate:"eq=100"` |
| `ne=N` | Value not equals N | `validate:"ne=0"` |
| `gt=N` | Value greater than N | `validate:"gt=0"` |
| `gte=N` | Value greater than or equal to N | `validate:"gte=1"` |
| `lt=N` | Value less than N | `validate:"lt=100"` |
| `lte=N` | Value less than or equal to N | `validate:"lte=99"` |
| `min=N` | Value at least N | `validate:"min=0"` |
| `max=N` | Value at most N | `validate:"max=100"` |
| `len=N` | Exactly N characters | `validate:"len=10"` |
### String Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `oneof=a b c` | Value is one of the allowed values | `validate:"oneof=active inactive"` |
| `lowercase` | Value is all lowercase | `validate:"lowercase"` |
| `uppercase` | Value is all uppercase | `validate:"uppercase"` |
| `eq_ignore_case=value` | Case-insensitive equality | `validate:"eq_ignore_case=yes"` |
| `ne_ignore_case=value` | Case-insensitive not equal | `validate:"ne_ignore_case=no"` |
### String Content Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `startswith=prefix` | Value starts with prefix | `validate:"startswith=http"` |
| `startsnotwith=prefix` | Value does not start with prefix | `validate:"startsnotwith=_"` |
| `endswith=suffix` | Value ends with suffix | `validate:"endswith=.com"` |
| `endsnotwith=suffix` | Value does not end with suffix | `validate:"endsnotwith=.tmp"` |
| `contains=substr` | Value contains substring | `validate:"contains=@"` |
| `containsany=chars` | Value contains any of the chars | `validate:"containsany=abc"` |
| `containsrune=r` | Value contains the rune | `validate:"containsrune=@"` |
| `excludes=substr` | Value does not contain substring | `validate:"excludes=admin"` |
| `excludesall=chars` | Value does not contain any of the chars | `validate:"excludesall=<>"` |
| `excludesrune=r` | Value does not contain the rune | `validate:"excludesrune=$"` |
### Format Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `email` | Valid email address | `validate:"email"` |
| `uri` | Valid URI | `validate:"uri"` |
| `url` | Valid URL | `validate:"url"` |
| `http_url` | Valid HTTP or HTTPS URL | `validate:"http_url"` |
| `https_url` | Valid HTTPS URL | `validate:"https_url"` |
| `url_encoded` | URL encoded string | `validate:"url_encoded"` |
| `datauri` | Valid data URI | `validate:"datauri"` |
| `datetime=layout` | Valid datetime matching Go layout | `validate:"datetime=2006-01-02"` |
| `uuid` | Valid UUID (any version) | `validate:"uuid"` |
| `uuid3` | Valid UUID version 3 | `validate:"uuid3"` |
| `uuid4` | Valid UUID version 4 | `validate:"uuid4"` |
| `uuid5` | Valid UUID version 5 | `validate:"uuid5"` |
| `ulid` | Valid ULID | `validate:"ulid"` |
| `e164` | Valid E.164 phone number | `validate:"e164"` |
| `latitude` | Valid latitude (-90 to 90) | `validate:"latitude"` |
| `longitude` | Valid longitude (-180 to 180) | `validate:"longitude"` |
| `hexadecimal` | Valid hexadecimal string | `validate:"hexadecimal"` |
| `hexcolor` | Valid hex color code | `validate:"hexcolor"` |
| `rgb` | Valid RGB color | `validate:"rgb"` |
| `rgba` | Valid RGBA color | `validate:"rgba"` |
| `hsl` | Valid HSL color | `validate:"hsl"` |
| `hsla` | Valid HSLA color | `validate:"hsla"` |
### Network Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `ip_addr` | Valid IP address (v4 or v6) | `validate:"ip_addr"` |
| `ip4_addr` | Valid IPv4 address | `validate:"ip4_addr"` |
| `ip6_addr` | Valid IPv6 address | `validate:"ip6_addr"` |
| `cidr` | Valid CIDR notation | `validate:"cidr"` |
| `cidrv4` | Valid IPv4 CIDR | `validate:"cidrv4"` |
| `cidrv6` | Valid IPv6 CIDR | `validate:"cidrv6"` |
| `mac` | Valid MAC address | `validate:"mac"` |
| `fqdn` | Valid fully qualified domain name | `validate:"fqdn"` |
| `hostname` | Valid hostname (RFC 952) | `validate:"hostname"` |
| `hostname_rfc1123` | Valid hostname (RFC 1123) | `validate:"hostname_rfc1123"` |
| `hostname_port` | Valid hostname:port | `validate:"hostname_port"` |
### Cross-Field Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `eqfield=Field` | Value equals another field | `validate:"eqfield=Password"` |
| `nefield=Field` | Value not equals another field | `validate:"nefield=OldPassword"` |
| `gtfield=Field` | Value greater than another field | `validate:"gtfield=MinPrice"` |
| `gtefield=Field` | Value >= another field | `validate:"gtefield=StartDate"` |
| `ltfield=Field` | Value less than another field | `validate:"ltfield=MaxPrice"` |
| `ltefield=Field` | Value <= another field | `validate:"ltefield=EndDate"` |
| `fieldcontains=Field` | Value contains another field's value | `validate:"fieldcontains=Keyword"` |
| `fieldexcludes=Field` | Value excludes another field's value | `validate:"fieldexcludes=Forbidden"` |
### Conditional Required Validators
| Tag | Description | Example |
|-----|-------------|---------|
| `required_if=Field value` | Required if field equals value | `validate:"required_if=Status active"` |
| `required_unless=Field value` | Required unless field equals value | `validate:"required_unless=Type guest"` |
| `required_with=Field` | Required if field is present | `validate:"required_with=Email"` |
| `required_without=Field` | Required if field is absent | `validate:"required_without=Phone"` |
**Examples:**
```go
type User struct {
Role string
// Profile is required when Role is "admin", optional for other roles
Profile string `validate:"required_if=Role admin"`
// Bio is required unless Role is "guest"
Bio string `validate:"required_unless=Role guest"`
}
type Contact struct {
Email string
Phone string
// Name is required when Email is non-empty
Name string `validate:"required_with=Email"`
// At least one of Email or BackupEmail must be provided
BackupEmail string `validate:"required_without=Email"`
}
```
## Supported File Formats
| Format | Extension | Compressed Extensions |
|--------|-----------|----------------------|
| CSV | `.csv` | `.csv.gz`, `.csv.bz2`, `.csv.xz`, `.csv.zst`, `.csv.z`, `.csv.snappy`, `.csv.s2`, `.csv.lz4` |
| TSV | `.tsv` | `.tsv.gz`, `.tsv.bz2`, `.tsv.xz`, `.tsv.zst`, `.tsv.z`, `.tsv.snappy`, `.tsv.s2`, `.tsv.lz4` |
| LTSV | `.ltsv` | `.ltsv.gz`, `.ltsv.bz2`, `.ltsv.xz`, `.ltsv.zst`, `.ltsv.z`, `.ltsv.snappy`, `.ltsv.s2`, `.ltsv.lz4` |
| JSON | `.json` | `.json.gz`, `.json.bz2`, `.json.xz`, `.json.zst`, `.json.z`, `.json.snappy`, `.json.s2`, `.json.lz4` |
| JSONL | `.jsonl` | `.jsonl.gz`, `.jsonl.bz2`, `.jsonl.xz`, `.jsonl.zst`, `.jsonl.z`, `.jsonl.snappy`, `.jsonl.s2`, `.jsonl.lz4` |
| Excel | `.xlsx` | `.xlsx.gz`, `.xlsx.bz2`, `.xlsx.xz`, `.xlsx.zst`, `.xlsx.z`, `.xlsx.snappy`, `.xlsx.s2`, `.xlsx.lz4` |
| Parquet | `.parquet` | `.parquet.gz`, `.parquet.bz2`, `.parquet.xz`, `.parquet.zst`, `.parquet.z`, `.parquet.snappy`, `.parquet.s2`, `.parquet.lz4` |
### Supported Compression Formats
| Format | Extension | Library | Notes |
|--------|-----------|---------|-------|
| gzip | `.gz` | compress/gzip | Standard library |
| bzip2 | `.bz2` | compress/bzip2 | Standard library |
| xz | `.xz` | github.com/ulikunitz/xz | Pure Go |
| zstd | `.zst` | github.com/klauspost/compress/zstd | Pure Go, high performance |
| zlib | `.z` | compress/zlib | Standard library |
| snappy | `.snappy` | github.com/klauspost/compress/snappy | Pure Go, high performance |
| s2 | `.s2` | github.com/klauspost/compress/s2 | Snappy-compatible, faster |
| lz4 | `.lz4` | github.com/pierrec/lz4/v4 | Pure Go |
**Note on Parquet compression**: The external compression (`.parquet.gz`, etc.) is for the container file itself. Parquet files may also use internal compression (Snappy, GZIP, LZ4, ZSTD) which is handled transparently by the parquet-go library.
## Integration with filesql
```go
// Process file with preprocessing and validation
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var records []MyRecord
reader, result, err := processor.Process(file, &records)
if err != nil {
return err
}
// Check for validation errors
if result.HasErrors() {
for _, e := range result.ValidationErrors() {
log.Printf("Row %d, Column %s: %s", e.Row, e.Column, e.Message)
}
}
// Pass preprocessed data to filesql using Builder pattern
ctx := context.Background()
builder := filesql.NewBuilder().
AddReader(reader, "my_table", filesql.FileTypeCSV)
validatedBuilder, err := builder.Build(ctx)
if err != nil {
return err
}
db, err := validatedBuilder.Open(ctx)
if err != nil {
return err
}
defer db.Close()
// Execute SQL queries on preprocessed data
rows, err := db.QueryContext(ctx, "SELECT * FROM my_table WHERE age > 20")
```
## Processor Options
`NewProcessor` accepts functional options to customize behavior:
### WithStrictTagParsing
By default, invalid tag arguments (e.g., `eq=abc` where a number is expected) are silently ignored. Enable strict mode to catch these misconfigurations:
```go
processor := fileprep.NewProcessor(fileprep.FileTypeCSV, fileprep.WithStrictTagParsing())
var records []MyRecord
// Returns an error if any tag argument is invalid (e.g., "eq=abc", "truncate=xyz")
_, _, err := processor.Process(input, &records)
```
### WithValidRowsOnly
By default, the output includes all rows (valid and invalid). Use `WithValidRowsOnly` to filter the output to only valid rows:
```go
processor := fileprep.NewProcessor(fileprep.FileTypeCSV, fileprep.WithValidRowsOnly())
var records []MyRecord
reader, result, err := processor.Process(input, &records)
// reader contains only rows that passed all validations
// records contains only valid structs
// result.RowCount includes all rows; result.ValidRowCount has the valid count
// result.Errors still reports all validation failures
```
Options can be combined:
```go
processor := fileprep.NewProcessor(fileprep.FileTypeCSV,
fileprep.WithStrictTagParsing(),
fileprep.WithValidRowsOnly(),
)
```
## Design Considerations
### Name-Based Column Binding
Struct fields are mapped to file columns **by name**, not by position. Field names are automatically converted to `snake_case` to match column headers. Column order in the file does not matter.
```go
type User struct {
UserName string `name:"user"` // matches "user" column (not "user_name")
Email string `name:"mail_addr"` // matches "mail_addr" column (not "email")
Age string // matches "age" column (auto snake_case)
}
```
If your LTSV keys use hyphens (`user-id`) or Parquet/XLSX columns use camelCase (`userId`), use the `name` tag to specify the exact column name.
See [Gotchas](#gotchas) for case-sensitivity rules, duplicate header behavior, and missing column handling.
### Memory Usage
fileprep loads the **entire file into memory** for processing. This enables random access and multi-pass operations but has implications for large files:
| File Size | Approx. Memory | Recommendation |
|-----------|----------------|----------------|
| < 100 MB | ~2-3x file size | Direct processing |
| 100-500 MB | ~500 MB - 1.5 GB | Monitor memory, consider chunking |
| > 500 MB | > 1.5 GB | Split files or use streaming alternatives |
For compressed inputs (gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4), memory usage is based on **decompressed** size.
## Performance
Benchmark results processing CSV files with a complex struct containing 21 columns. Each field uses multiple preprocessing and validation tags:
**Preprocessing tags used:** trim, lowercase, uppercase, keep_digits, pad_left, strip_html, strip_newline, collapse_space, truncate, fix_scheme, default
**Validation tags used:** required, alpha, numeric, email, uuid, ip_addr, url, oneof, min, max, len, printascii, ascii, eqfield
| Records | Time | Memory | Allocs/op |
|--------:|-----:|-------:|----------:|
| 100 | 0.6 ms | 0.9 MB | 7,654 |
| 1,000 | 6.1 ms | 9.6 MB | 74,829 |
| 10,000 | 69 ms | 101 MB | 746,266 |
| 50,000 | 344 ms | 498 MB | 3,690,281 |
```bash
# Quick benchmark (100 and 1,000 records)
make bench
# Full benchmark (all sizes including 50,000 records)
make bench-all
```
*Tested on AMD Ryzen AI MAX+ 395, Go 1.24, Linux. Results vary by hardware.*
## Related or inspired Projects
- [nao1215/filesql](https://github.com/nao1215/filesql) - sql driver for CSV, TSV, LTSV, Parquet, Excel with gzip, bzip2, xz, zstd support.
- [nao1215/fileframe](https://github.com/nao1215/fileframe) - DataFrame API for CSV/TSV/LTSV, Parquet, Excel.
- [nao1215/csv](https://github.com/nao1215/csv) - read csv with validation and simple DataFrame in golang.
- [go-playground/validator](https://github.com/go-playground/validator) - Go Struct and Field validation, including Cross Field, Cross Struct, Map, Slice and Array diving
- [shogo82148/go-header-csv](https://github.com/shogo82148/go-header-csv) - go-header-csv is encoder/decoder csv with a header.
## Contributing
Contributions are welcome! Please see the [Contributing Guide](./CONTRIBUTING.md) for more details.
## Support
If you find this project useful, please consider:
- Giving it a star on GitHub - it helps others discover the project
- [Becoming a sponsor](https://github.com/sponsors/nao1215) - your support keeps the project alive and motivates continued development
Your support, whether through stars, sponsorships, or contributions, is what drives this project forward. Thank you!
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.