# CSV Blueprint

[![CI](https://github.com/JBZoo/Csv-Blueprint/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/JBZoo/Csv-Blueprint/actions/workflows/main.yml?query=branch%3Amaster)
[![CI](https://github.com/JBZoo/Csv-Blueprint/actions/workflows/demo.yml/badge.svg)](https://github.com/JBZoo/Csv-Blueprint/actions/workflows/demo.yml)
[![Psalm Coverage](https://shepherd.dev/github/JBZoo/Csv-Blueprint/coverage.svg)](https://shepherd.dev/github/JBZoo/Csv-Blueprint)
[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=JBZoo_Csv-Blueprint&metric=coverage)](https://sonarcloud.io/code?id=JBZoo_Csv-Blueprint&selected=JBZoo_Csv-Blueprint%3Asrc)
[![Bugs](https://sonarcloud.io/api/project_badges/measure?project=JBZoo_Csv-Blueprint&metric=bugs)](https://sonarcloud.io/project/issues?resolved=false&id=JBZoo_Csv-Blueprint)
[![Code smells](https://sonarcloud.io/api/project_badges/measure?project=JBZoo_Csv-Blueprint&metric=code_smells)](https://sonarcloud.io/project/issues?resolved=false&id=JBZoo_Csv-Blueprint)
[![Docker Pulls](https://img.shields.io/docker/pulls/jbzoo/csv-blueprint.svg)](https://hub.docker.com/r/jbzoo/csv-blueprint/tags)

[![Static Badge](https://img.shields.io/badge/Rules-341-green?label=Total%20number%20of%20rules&labelColor=darkgreen&color=gray)](schema-examples/full.yml)
[![Static Badge](https://img.shields.io/badge/Rules-127-green?label=Cell%20rules&labelColor=blue&color=gray)](src/Rules/Cell)
[![Static Badge](https://img.shields.io/badge/Rules-206-green?label=Aggregate%20rules&labelColor=blue&color=gray)](src/Rules/Aggregate)
[![Static Badge](https://img.shields.io/badge/Rules-8-green?label=Extra%20checks&labelColor=blue&color=gray)](#extra-checks)
[![Static Badge](https://img.shields.io/badge/Rules-31/11/20-green?label=Plan%20to%20add&labelColor=gray&color=gray)](tests/schemas/todo.yml)

Strict and automated line-by-line CSV validation tool based on customizable Yaml schemas.

In seconds, make sure every character in a gigabyte-sized file meets your expectations.

![Intro](.github/assets/intro.png)

I believe it is the simplest, yet most flexible and powerful, CSV validator in the world. ☺️

## Features

- Just create a simple and [friendly Yaml](#schema-definition) file with your CSV schema, and the tool will validate your
files line by line. You will get a very [detailed report](#report-examples) that pinpoints the row, column, and rule for each issue.
- Out of the box, you have access to [over 330 validation rules](schema-examples/full.yml) that can be combined to
control the severity of validation.
- You can validate each value (e.g., a date must have a strict format on each line) or the entire column
(e.g., the median of all values must be within limits). It's up to you to choose the severity of the rules.
- Use it anywhere as it is packaged in [Docker](#usage) or even as part of your [GitHub Actions](#github-action-format).
- Create a CSV in your pipelines/ETL/CI and ensure that it meets the most stringent expectations.
- Prepare your own libraries with complex rules using [presets](#presets-and-reusable-schemas). This will help you work
with hundreds of different files at the same time.
- [Create a schema on the fly](#complete-cli-help-message) based on an existing CSV file, and analyze CSV data:
find out what is stored in your file and get a summary report.

CLICK to see a typical cross-team workflow

1. **Data Preparation:** Team A generates CSV data adhering to a predefined format and places the file in a shared
location accessible to Team B (e.g., a shared repository or cloud storage).

2. **Notification:** Team A notifies Team B that the data is ready via corporate communication channels (email, chat,
task management system).

3. **Validation:** Team B uses predefined validation rules stored in the repository to check the CSV file for accuracy
and integrity before importing. This includes verifying date formats, numerical values, and the presence of required
columns.

4. **Data Import:** After successful validation, Team B imports the data from the CSV file into their system for further
processing.

5. **Error Handling:** If validation identifies errors, the process halts, and Team B provides feedback to Team A for
data correction.

**Why Validation is Necessary:**

- **Reduce Errors:** Validating data before import minimizes the likelihood of errors, enhancing data quality.
- **Efficiency:** Prevents time loss on manual error correction post-import.
- **Data Consistency:** Ensures data meets the expectations and requirements of Team B, facilitating accurate processing
and analysis.
- **Automation:** Storing validation rules in the repository eases the automation of checks and simplifies updating
validation criteria.

### Live demo

As a live demonstration of how the tool works, you can explore the super minimal repository
at [demo](https://github.com/jbzoo/csv-blueprint-demo). For more complex examples and various reporting methods, take a look at
the [demo pipeline](https://github.com/JBZoo/CSV-Blueprint/actions/runs/8667852752/job/23771733937) with different report types.

See also:

- [PR as a live demo](https://github.com/jbzoo/csv-blueprint-demo/pull/1/files) - Note the automatic comments in the diff of the PR.
- [.github/workflows/demo.yml](.github/workflows/demo.yml)
- [demo_invalid.yml](tests/schemas/demo_invalid.yml)
- [demo_valid.yml](tests/schemas/demo_valid.yml)
- [demo.csv](tests/fixtures/demo.csv)

## Table of contents

- [Features](#features)
- [Table of contents](#table-of-contents)
- [Usage](#usage)
- [Schema definition](#schema-definition)
- [Presets and reusable schemas](#presets-and-reusable-schemas)
- [Parallel processing](#parallel-processing)
- [Complete CLI help message](#complete-cli-help-message)
- [Report examples](#report-examples)
- [Benchmarks](#benchmarks)
- [Disadvantages?](#disadvantages)
- [Coming soon](#coming-soon)
- [Contributing](#contributing)
- [License](#license)
- [See also](#see-also)

## Usage

### Docker container

Ensure you have Docker installed on your machine.

```sh
# Pull the Docker image
docker pull jbzoo/csv-blueprint:latest

# Run the tool inside Docker.
# `validate-csv` is the command (see available commands and options below);
# `--csv` points to your CSV file(s), `--schema` to your schema file(s).
docker run --rm \
    --workdir=/parent-host \
    -v $(pwd):/parent-host \
    jbzoo/csv-blueprint:latest \
    validate-csv \
    --csv=./tests/fixtures/demo.csv \
    --schema=./tests/schemas/demo_invalid.yml \
    --ansi

# OR build it from source.
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint
make docker-build # The local tag is "jbzoo/csv-blueprint:local".
```

### GitHub Action

You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv-Blueprint-Demo/blob/master/.github/workflows/demo.yml#L18-L22).

```yml
- uses: jbzoo/csv-blueprint@master # See the specific version on the releases page. `@master` is the latest.
  with:
    # Specify the path(s) to the CSV files you want to validate.
    # This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
    # Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
    csv: './tests/**/*.csv'

    # Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
    # Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
    # Examples: p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
    schema: './tests/**/*.yml'

    # Report format. Available options: text, table, github, gitlab, teamcity, junit.
    # Default value: 'table'
    report: 'table'

    # Apply all schemas (even those without `filename_pattern`) to all found CSV files as global rules.
    # Available options:
    #   auto: If no glob pattern (*) is used for --schema, the schema is applied to all found CSV files.
    #   yes: Apply all schemas to all CSV files. Schemas without `filename_pattern` are applied as global rules.
    #   no: Apply only schemas with a non-empty `filename_pattern` that matches the CSV files.
    # Default value: 'auto'
    apply-all: 'auto'

    # Quick mode. It will not validate all rows; it will stop after the first error.
    # Default value: 'no'
    quick: 'no'

    # Skip schema validation. If you are sure that the schema is correct, you can skip this check.
    # Default value: 'no'
    skip-schema: 'no'

    # Extra options for CSV Blueprint. Only for debugging and profiling.
    # Available options:
    #   Add the flag `--parallel` if you want to validate CSV files in parallel.
    #   Add the flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
    #   Add the flag `--debug` if you want to see really deep details.
    #   Add the flag `--profile` if you want to see profiling info. Add details with `-vvv`.
    #   Verbosity level: `-v`, `-vv`, `-vvv`.
    #   ANSI output: you can disable ANSI colors with `--no-ansi`.
    # Default value: 'options: --ansi'
    # You can skip it.
    extra: 'options: --ansi'
```

### Phar binary

CLICK to see how to use the PHAR file

Ensure you have PHP installed on your machine.

```sh
# Just download the latest version
wget https://github.com/jbzoo/csv-blueprint/releases/latest/download/csv-blueprint.phar
chmod +x ./csv-blueprint.phar
./csv-blueprint.phar validate-csv \
    --csv=./tests/fixtures/demo.csv \
    --schema=./tests/schemas/demo_invalid.yml

# OR create a project via Composer (--no-dev is optional)
composer create-project --no-dev jbzoo/csv-blueprint
cd ./csv-blueprint
./csv-blueprint validate-csv \
    --csv=./tests/fixtures/demo.csv \
    --schema=./tests/schemas/demo_invalid.yml

# OR build from source
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint
make build
./csv-blueprint validate-csv \
    --csv=./tests/fixtures/demo.csv \
    --schema=./tests/schemas/demo_invalid.yml
```

## Schema definition

Define your CSV validation schema in YAML for clear and structured configuration. Alternative formats are also
supported: [JSON](schema-examples/full.json) and [PHP](schema-examples/full.php), accommodating various preferences and workflow requirements.

The provided example illustrates a schema for a CSV file with a header row. It mandates that the `id`
column must not be empty and should only contain integer values. Additionally, the `name` column is required to have a
minimum length of 3 characters, ensuring basic data integrity and usefulness.

### Example schema in YAML

```yml
name: Simple CSV Schema
filename_pattern: /my-favorite-csv-\d+\.csv$/i
csv:
  delimiter: ';'

columns:
  - name: id
    rules:
      not_empty: true
      is_int: true
    aggregate_rules:
      is_unique: true
      sorted: [ asc, numeric ]

  - name: name
    rules:
      length_min: 3
    aggregate_rules:
      count: 10
```
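
For illustration, here is a hypothetical CSV file that satisfies the schema above, saved under a name matching the `filename_pattern` (e.g., `my-favorite-csv-1.csv`): the `id` column is unique and sorted ascending, every `name` has at least 3 characters, and there are exactly 10 data rows to satisfy `count: 10`. Note the `;` delimiter.

```csv
id;name
1;Alice
2;Bob Smith
3;Carol
4;Dave
5;Eve
6;Frank
7;Grace
8;Heidi
9;Ivan
10;Judy
```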

### Full schema description

In the [example YAML file](schema-examples/full.yml), a detailed description of all features is provided. This
documentation is verified through automated tests, ensuring it remains current.

**Notes:**

- The traditional typing of columns (e.g., `type: integer`) has been intentionally omitted in favor of rules. These
rules can be sequenced and combined freely, offering extensive flexibility for CSV file validation.
- All options are optional unless stated otherwise. You have the liberty to include or omit them as you see fit.
- Specifying an incorrect rule name, using non-existent values (not listed below), or assigning an incompatible variable
type for any option will result in a schema validation error. To bypass these errors, you may opt to use
the `--skip-schema` flag at your discretion, allowing the use of your custom keys in the schema.
- All rules ignore empty strings (length 0) except `not_empty`, which is the only rule that checks for them. To enforce
a non-empty value, apply `not_empty: true`. Note that a single space counts as a character, making the string length `1`.
To prevent such values, also include `is_trimmed: true` (see the short example after this list).
- Rules operate independently; they have no knowledge of or influence over one another.
- When a rule's value is `is_some_rule: true`, it merely serves as an activation toggle. Other values represent rule
parameters.
- The sequence of rule execution follows their order in the schema, affecting only the order of error messages in the
report.
- Unless explicitly stated, most rules are case-sensitive.
- As a fallback, the `regex` rule is available. However, using clear rule combinations is recommended for greater
clarity on validation errors.
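
As a quick illustration of the empty-string and whitespace notes above, here is a minimal (hypothetical) column definition that rejects both `""` and `" "`:

```yml
columns:
  - name: username
    rules:
      not_empty: true  # Rejects "" (length 0).
      is_trimmed: true # Rejects " ", which `not_empty` alone would accept (its length is 1).
```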

Below is a comprehensive list of rules, each accompanied by a brief explanation and example for clarity. This section is
also validated through automated tests, ensuring the information is consistently accurate.

CLICK to see details about each rule

```yml
# It's a complete example of the CSV schema file in YAML format.
# See a copy of the file without comments here: ./schema-examples/full_clean.yml

# Just meta
name: CSV Blueprint Schema Example # Name of a CSV file. Not used in the validation process.
description: |- # Any description of the CSV file. Not used in the validation process.
  This YAML file provides a detailed description and validation rules for CSV files
  to be processed by CSV Blueprint tool. It includes specifications for file name patterns,
  CSV formatting options, and extensive validation criteria for individual columns and their values,
  supporting a wide range of data validation rules from basic type checks to complex regex validations.
  This example serves as a comprehensive guide for creating robust CSV file validations.

# Include another schema and define an alias for it.
presets:
  my-preset: ./preset_users.yml # Define preset alias "my-preset". See README.md for details.

# Regular expression to match the file name. If not set, then no pattern check.
# This allows you to pre-validate the file name before processing its contents.
# Feel free to check parent directories as well.
# See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
filename_pattern: /\.csv$/i
# preset: my-preset # See README.md for details.

# Here are the default values used to parse the CSV file.
# You can skip this section if you don't need to override the default values.
csv:
  preset: my-preset # See README.md for details.
  header: true # If the first row is a header. If true, the name of each column is required.
  delimiter: , # Delimiter character in the CSV file.
  quote_char: \ # Quote character in the CSV file.
  enclosure: '"' # Enclosure for each field in the CSV file.
  encoding: utf-8 # (Experimental) Only utf-8, utf-16, utf-32.
  bom: false # (Experimental) If the file has a BOM (Byte Order Mark) at the beginning.

# Structural rules for the CSV file. These rules are applied to the entire CSV file.
# They are not(!) related to the data in the columns.
# You can skip this section if you don't need to override the default values.
structural_rules: # Here are default values.
  preset: my-preset # See README.md for details.
  strict_column_order: true # Ensure columns in the CSV follow the same order as defined in this YML schema. It works only if "csv.header" is true.
  allow_extra_columns: false # Allow CSV files to have more columns than specified in this YML schema.

# Add any extra data you want. It will be ignored by the tool but available for your own code.
# You can use any format and store anything. Examples:
#   extra: 'some text'
#   extra: [some, options, here]
#   extra: 42
extra:
  - key: "value"

# Description of each column in the CSV.
# It is recommended to present each column in the same order as in the CSV file.
# This will not affect the validator, but it will make it easier for you to navigate.
# For convenience, use the first line as a header (if possible).
columns:
  - preset: my-preset/login # Add preset rules for the column. See README.md for details.
    name: Column Name (header) # Any custom name of the column in the CSV file (first row). Required if "csv.header" is true.
    description: Lorem ipsum # Description of the column. Not used in the validation process.
    example: Some example # Example of the column value. The schema will also check this value on its own.

    # If the column is required. If true, the column must be present in the CSV file. If false, the column can be missing in the CSV file.
    # So, if you want to make the column optional, set this value to false, and it will validate the column only if it is present.
    # By default, the column is required. It works only if "csv.header" is true and "structural_rules.allow_extra_columns" is false.
    required: true

    # Add any extra data you want. It will be ignored by the tool but available for your own code.
    # You can use any format and store anything. Examples:
    #   extra: 'some text'
    #   extra: [some, options, here]
    #   extra: 42
    extra:
      - key: "value"

    ####################################################################################################################
    # Data validation for each(!) value in the column. Please see the notes in README.md.
    # Every rule is optional.
    rules:
      preset: my-preset/login # Add preset rules for the column. See README.md for details.

      # General rules
      not_empty: true # Value is not an empty string. Actually checks if the string length is not 0.
      exact_value: Some string # Exact value for the string in the column.
      allow_values: [ y, n, "" ] # Strict set of values that are allowed.
      not_allow_values: [ invalid ] # Strict set of values that are NOT allowed.

      # Any valid regex pattern. See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
      # Of course, it's a super powerful tool to verify any sort of string data.
      # Please be careful. Regex is a powerful tool, but it can be very dangerous if used incorrectly.
      # Remember that if you want to solve a problem with regex, you now have two problems.
      # But have it your way, then happy debugging! https://regex101.com
      regex: /^[\d]{2}$/

      # Checks the length of a string, including spaces (multibyte safe).
      length_min: 1 # x >= 1
      length_greater: 2 # x > 2
      length_not: 0 # x != 0
      length: 7 # x == 7
      length_less: 8 # x < 8
      length_max: 9 # x <= 9

      # Basic string checks
      is_trimmed: true # Only trimmed strings. Example: "Hello World" (not " Hello World ").
      is_lowercase: true # String is only lower-case. Example: "hello world".
      is_uppercase: true # String is only upper-case. Example: "HELLO WORLD".
      is_capitalize: true # String is only capitalized. Example: "Hello World".
      is_sentence: true # Sentence with at least one space. Example: "Hello world!".

      # Count the number of words in a string.
      # Note that multibyte locales are not supported.
      # Example: "Hello World, 123" - 2 words only (123 is not a word).
      word_count_min: 1 # x >= 1
      word_count_greater: 2 # x > 2
      word_count_not: 0 # x != 0
      word_count: 7 # x == 7
      word_count_less: 8 # x < 8
      word_count_max: 9 # x <= 9

      # Contains rules
      contains: World # Example: "Hello World!". The string must contain "World" somewhere.
      contains_none: [ a, b ] # None of the strings may be part of the CSV value.
      contains_one: [ a, b ] # Exactly one of the strings must be part of the CSV value.
      contains_any: [ a, b ] # At least one of the strings must be part of the CSV value.
      contains_all: [ a, b ] # All of the strings must be part of the CSV value.
      starts_with: 'prefix ' # Example: "prefix Hello World".
      ends_with: ' suffix' # Example: "Hello World suffix".

      # Numeric
      is_int: true # Checks format only. Can be negative or positive. Without any separators.
      is_float: true # Checks format only. Can be negative or positive. Dot as decimal separator.

      # Under the hood, it converts and compares values as floats.
      # Comparison accuracy is 10 digits after the dot.
      # Scientific number format is also supported. Example: "1.2e3"
      num_min: 1.0 # x >= 1.0
      num_greater: 2.0 # x > 2.0
      num_not: 5.0 # x != 5.0
      num: 7.0 # x == 7.0
      num_less: 8.0 # x < 8.0
      num_max: 9.0 # x <= 9.0

      # Number of digits after the decimal point (with zeros)
      precision_min: 1 # x >= 1
      precision_greater: 2 # x > 2
      precision_not: 0 # x != 0
      precision: 7 # x == 7
      precision_less: 8 # x < 8
      precision_max: 9 # x <= 9

      # Date & time
      is_date: true # Accepts an arbitrary date format. It shows an error if the value cannot be converted to a timestamp.
      is_timezone: true # Allow only timezone identifiers. Case-insensitive. Example: "Europe/London", "utc".
      is_timezone_offset: true # Allow only timezone offsets. Example: "+03:00".
      is_time: true # Check if the cell value is a valid time in the format "HH:MM:SS AM/PM" / "HH:MM:SS" / "HH:MM". Case-insensitive.
      is_leap_year: true # Check if the cell value is a leap year. Example: "2008", "2008-02-29 23:59:59 UTC".

      # Dates. Under the hood, the strings are converted to timestamps and compared.
      # This gives you the ability to use relative dates and any formatting you want.
      # By default, it works in UTC. But you can specify your own timezone as part of the date string.
      # Format: https://www.php.net/manual/en/datetime.format.php
      # Parsing: https://www.php.net/manual/en/function.strtotime.php
      # Timezones: https://www.php.net/manual/en/timezones.php
      date_min: -100 years # Example of a relative past date
      date_greater: -99 days # Example of relative formats
      date_not: 2006-01-02 15:04:05 -0700 Europe/Rome
      date: 01 Jan 2000 # You can use any string that can be parsed by the strtotime function
      date_less: now # Example of the current date and time
      date_max: +1 day # Example of a relative future date
      date_format: Y-m-d # Check the strict format of the date.

      # Date intervals. Under the hood, the strings are converted to seconds and compared.
      # See: https://www.php.net/manual/en/class.dateinterval.php
      # See: https://www.php.net/manual/en/dateinterval.createfromdatestring.php
      date_interval_min: PT0S # 0 seconds
      date_interval_greater: 1day 1sec # 1 day and 1 second
      date_interval_not: 100 days # Except for the 100 days
      date_interval: P2W # Exactly 2 weeks
      date_interval_less: PT23H59M59S # 23 hours, 59 minutes, and 59 seconds
      date_interval_max: P1Y # 1 year

      # Check an arbitrary date in a CSV cell for age (years).
      # Actually, it calculates the difference between the date and the current date.
      # Convenient to use for age restrictions based on birthday.
      # See the description of the `date_*` rules for details on date formats.
      date_age_min: 1 # x >= 1
      date_age_greater: 14 # x > 14
      date_age_not: 18 # x != 18
      date_age: 21 # x == 21
      date_age_less: 99 # x < 99
      date_age_max: 100 # x <= 100

      # Specific formats
      is_bool: true # Allow only boolean values "true" and "false", case-insensitive.
      is_binary: true # Both: with or without the "0b" prefix. Example: "0b10" or "10".
      is_octal: true # Validates octal numbers in the format "0o123".
      is_hex: true # Both: with or without the "0x" prefix. Example: "0x1A".
      is_uuid: true # Validates whether the input is a valid UUID. It also supports validation of specific versions 1, 3, 4 and 5.
      is_slug: true # Only slug format. Example: "my-slug-123". It can contain letters, numbers, and dashes.
      is_currency_code: true # Validates an ISO 4217 currency code like GBP or EUR. Case-sensitive. See: https://en.wikipedia.org/wiki/ISO_4217.
      is_base64: true # Validates if a string is Base64-encoded. Example: "cmVzcGVjdCE=".
      is_angle: true # Check if the cell value is a valid angle (0.0 to 360.0).

      # Safety checks
      # Password strength calculation criteria include: length (max 5 points, +1 every 2 characters),
      # presence of uppercase letters (+1), lowercase letters (+1), numbers (+1), special characters (+1),
      # spaces (+1), and penalties for consecutive sequences of uppercase, lowercase, or
      # numbers (-0.5 each), repetitive sequences (-0.75 each), common weak passwords like "qwerty",
      # and passwords under 6 characters (-2). Scores are adjusted to a 0 to 10 scale, with a minimum score of 0.
      password_strength_min: 1 # x >= 1
      password_strength_greater: 2 # x > 2
      password_strength_not: 0 # x != 0
      password_strength: 7 # x == 7
      password_strength_less: 8 # x < 8
      password_strength_max: 9 # x <= 9
      is_password_safe_chars: true # Check that the cell value contains only safe characters for regular passwords. Allowed characters: a-z, A-Z, 0-9, !@#$%^&*()_+-=[]{};:'"|,.<>/?~.

      # Internet
      is_ip: true # Both: IPv4 or IPv6.
      is_ip_v4: true # Only IPv4. Example: "127.0.0.1".
      is_ip_v6: true # Only IPv6. Example: "2001:0db8:85a3:08d3:1319:8a2e:0370:7334".
      is_ip_private: true # IPv4 has ranges: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. IPv6 has ranges starting with FD or FC.
      is_ip_reserved: true # IPv4 has ranges: 0.0.0.0/8, 169.254.0.0/16, 127.0.0.0/8 and 240.0.0.0/4. IPv6 has ranges: ::1/128, ::/128, ::ffff:0:0/96 and fe80::/10.
      ip_v4_range: [ '127.0.0.1-127.0.0.5', '127.0.0.0/21' ] # Check subnet mask or range for IPv4. The address must be in one of the ranges.
      is_mac_address: true # The input is a valid MAC address. Example: 00:00:5e:00:53:01
      is_domain: true # Only a domain name. Example: "example.com".
      is_public_domain_suffix: true # The input is a public ICANN domain suffix. Example: "com", "nom.br", "net", etc.
      is_url: true # Only URL format. Example: "https://example.com/page?query=string#anchor".
      is_email: true # Only email format. Example: "user@example.com".

      # Validates if the given input is valid JSON.
      # This is possible if you escape all special characters correctly and use a special CSV format.
      is_json: true # Example: {"foo":"bar"}.

      # Geography
      is_latitude: true # Can be integer or float. Example: 50.123456.
      is_longitude: true # Can be integer or float. Example: -89.123456.
      is_geohash: true # Check if the value is a valid geohash. Example: "u4pruydqqvj".
      is_cardinal_direction: true # Valid cardinal direction. Case-insensitive. Available values: ["N", "S", "E", "W", "NE", "SE", "NW", "SW", "NONE"]
      is_usa_market_name: true # Check if the value is a valid USA market name. Example: "New York, NY".
      is_usa_state: true # Name or code of a USA state. Case-insensitive. Example: "CA" or "California".

      # Validates whether the input is a country code in the ISO 3166-1 standard.
      # Available options: "alpha-2" (Ex: "US"), "alpha-3" (Ex: "USA"), "numeric" (Ex: "840").
      # The rule uses data from iso-codes: https://salsa.debian.org/iso-codes-team/iso-codes.
      country_code: alpha-2 # Country code in the ISO 3166-1 standard. Examples: "US", "USA", "840"

      # Validates whether the input is a language code based on ISO 639.
      # Available options: "alpha-2" (Ex: "en"), "alpha-3" (Ex: "eng").
      # See: https://en.wikipedia.org/wiki/ISO_639.
      language_code: alpha-2 # Examples: "en", "eng"

      # Filesystem (with IO!)
      is_file_exists: true # Check if the file exists on the filesystem (it's an FS IO operation!).
      is_dir_exists: true # Check if the directory exists on the filesystem (it's an FS IO operation!).

      # Mathematical
      is_fibonacci: true # Validates whether the input follows the Fibonacci integer sequence. Example: "8", "13".
      is_prime_number: true # Validates a prime number. Example: "3", "5", "7", "11".
      is_even: true # Check if the value is an even number. Example: "2", "4", "6".
      is_odd: true # Check if the value is an odd number. Example: "1", "7", "11".
      is_roman: true # Validates if the input is a Roman numeral. Example: "I", "IV", "XX".
      is_luhn: true # Luhn algorithm. See: https://en.wikipedia.org/wiki/Luhn_algorithm

      # Identifications
      phone: ALL # Validates if the input is a phone number. Specify the country code to validate the phone number for a specific country. Example: "ALL", "US", "BR".
      postal_code: US # Validate the postal code by country code (alpha-2). Example: "02179". Extracted from https://www.geonames.org
      is_iban: true # IBAN - International Bank Account Number. See: https://en.wikipedia.org/wiki/International_Bank_Account_Number
      is_bic: true # Validates a Bank Identifier Code (BIC) according to the ISO 9362 standard. See: https://en.wikipedia.org/wiki/ISO_9362
      is_imei: true # Validates an International Mobile Equipment Identity (IMEI). See: https://en.wikipedia.org/wiki/International_Mobile_Station_Equipment_Identity
      is_isbn: true # Validates an International Standard Book Number (ISBN). See: https://www.isbn-international.org/content/what-isbn

      # Misc
      is_version: true # Validates the string as a version number using Semantic Versioning. Example: "1.2.3".
      is_punct: true # Validates whether the input is composed of only punctuation characters. Example: "!@#$%^&*()".
      is_vowel: true # Validates whether the input contains only vowels. Example: "aei".
      is_consonant: true # Validates if the input contains only consonants. Example: "bcd".
      is_alnum: true # Validates whether the input is only alphanumeric. Example: "aBc123".
      is_alpha: true # This is similar to `is_alnum`, but it does not allow numbers. Example: "aBc".
      is_hex_rgb_color: true # Validates whether the input is a hex RGB color. Examples: "#FF0000", "#123", "ffffff", "fff".

      # Check if the value is a valid hash. Supported algorithms:
      #  - md5, md4, md2, sha1, sha224, sha256, sha384, sha512/224, sha512/256, sha512
      #  - sha3-224, sha3-256, sha3-384, sha3-512, ripemd128, ripemd160, ripemd256, ripemd320, whirlpool, tiger128,3
      #  - tiger160,3, tiger192,3, tiger128,4, tiger160,4, tiger192,4, snefru, snefru256, gost, gost-crypto, crc32
      #  - crc32b, crc32c, adler32, fnv132, fnv1a32, fnv164, fnv1a64, joaat, murmur3a, murmur3c
      #  - murmur3f, xxh32, xxh64, xxh3, xxh128, haval128,3, haval160,3, haval192,3, haval224,3, haval256,3
      #  - haval128,4, haval160,4, haval192,4, haval224,4, haval256,4, haval128,5, haval160,5, haval192,5, haval224,5, haval256,5
      hash: set_algo # Example: "1234567890abcdef".

      # Check if a string is in a specific charset. Available charsets:
      #  - 7bit, 8bit, ASCII, ArmSCII-8, BASE64, BIG-5, CP850, CP866, CP932, CP936
      #  - CP950, CP50220, CP50221, CP50222, CP51932, EUC-CN, EUC-JP, EUC-JP-2004, EUC-KR, EUC-TW
      #  - GB18030, GB18030-2022, HTML-ENTITIES, HZ, ISO-2022-JP, ISO-2022-JP-2004, ISO-2022-JP-MOBILE#KDDI, ISO-2022-JP-MS, ISO-2022-KR, ISO-8859-1
      #  - ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13
      #  - ISO-8859-14, ISO-8859-15, ISO-8859-16, JIS, KOI8-R, KOI8-U, Quoted-Printable, SJIS, SJIS-2004, SJIS-Mobile#DOCOMO
      #  - SJIS-Mobile#KDDI, SJIS-Mobile#SOFTBANK, SJIS-mac, SJIS-win, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE
      #  - UHC, UTF-7, UTF-8, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, UTF-16, UTF-16BE, UTF-16LE
      #  - UTF-32, UTF-32BE, UTF-32LE, UTF7-IMAP, UUENCODE, Windows-1251, Windows-1252, Windows-1254, eucJP-win
      charset: charset_code # Validates if a string is in a specific charset. Example: "UTF-8".

      # Validates whether the input is a credit card number.
      # Available credit card brands: "Any", "American Express", "Diners Club", "Discover", "JCB", "MasterCard", "Visa", "RuPay".
      credit_card: Any # Example: "5376-7473-9720-8720"

    ####################################################################################################################
    # Data validation for the entire(!) column using different data aggregation methods.
    # Every rule is optional.
    aggregate_rules:
      preset: my-preset/login # Add preset aggregate rules for the column. See README.md for details.

      is_unique: true # All values in the column are unique.

      # Check if the column is sorted in a specific order.
      #  - Direction: ["asc", "desc"].
      #  - Method: ["numeric", "string", "natural", "regular"].
      # See: https://www.php.net/manual/en/function.sort.php
      sorted: [ asc, natural ] # Expected ascending order, natural sorting.

      # First number in the column. Expected value is float or integer.
      first_num_min: 1.0 # x >= 1.0
      first_num_greater: 2.0 # x > 2.0
      first_num_not: 5.0 # x != 5.0
      first_num: 7.0 # x == 7.0
      first_num_less: 8.0 # x < 8.0
      first_num_max: 9.0 # x <= 9.0
      first: Expected # First value in the column. Will be compared as strings.
      first_not: Not expected # Not allowed as the first value in the column. Will be compared as strings.

      # N-th value in the column.
      # The rule expects exactly two arguments: the first is the line number (without the header), the second is the expected value.
      # Example: `[ 42, 5.0 ]` means that on line 42 (disregarding the header) we expect 5.0. The comparison is always done as float.
      nth_num_min: [ 42, 1.0 ] # x >= 1.0
      nth_num_greater: [ 42, 2.0 ] # x > 2.0
      nth_num_not: [ 42, 5.0 ] # x != 5.0
      nth_num: [ 42, 7.0 ] # x == 7.0
      nth_num_less: [ 42, 8.0 ] # x < 8.0
      nth_num_max: [ 42, 9.0 ] # x <= 9.0
      nth: [ 2, Expected ] # N-th value in the column. Will be compared as strings.
      nth_not: [ 2, Not expected ] # Not allowed as the N-th value in the column. Will be compared as strings.

      # Last number in the column. Expected value is float or integer.
      last_num_min: 1.0 # x >= 1.0
      last_num_greater: 2.0 # x > 2.0
      last_num_not: 5.0 # x != 5.0
      last_num: 7.0 # x == 7.0
      last_num_less: 8.0 # x < 8.0
      last_num_max: 9.0 # x <= 9.0
      last: Expected # Last value in the column. Will be compared as strings.
      last_not: Not expected # Not allowed as the last value in the column. Will be compared as strings.

      # Sum of the numbers in the column. Example: [1, 2, 3] => 6.
      sum_min: 1.0 # x >= 1.0
      sum_greater: 2.0 # x > 2.0
      sum_not: 5.0 # x != 5.0
      sum: 7.0 # x == 7.0
      sum_less: 8.0 # x < 8.0
      sum_max: 9.0 # x <= 9.0

      # The regular arithmetic mean. The sum of the numbers divided by the count.
      average_min: 1.0 # x >= 1.0
      average_greater: 2.0 # x > 2.0
      average_not: 5.0 # x != 5.0
      average: 7.0 # x == 7.0
      average_less: 8.0 # x < 8.0
      average_max: 9.0 # x <= 9.0

      # Total number of rows in the CSV file.
      # Since any(!) values are taken into account, it only makes sense to use these rules once, in any column.
      count_min: 1 # x >= 1
      count_greater: 2 # x > 2
      count_not: 0 # x != 0
      count: 7 # x == 7
      count_less: 8 # x < 8
      count_max: 9 # x <= 9

      # Counts only empty values (string length is 0).
      count_empty_min: 1 # x >= 1
      count_empty_greater: 2 # x > 2
      count_empty_not: 0 # x != 0
      count_empty: 7 # x == 7
      count_empty_less: 8 # x < 8
      count_empty_max: 9 # x <= 9

      # Counts only not empty values (string length is not 0).
      count_not_empty_min: 1 # x >= 1
      count_not_empty_greater: 2 # x > 2
      count_not_empty_not: 0 # x != 0
      count_not_empty: 7 # x == 7
      count_not_empty_less: 8 # x < 8
      count_not_empty_max: 9 # x <= 9

      # Number of unique values.
      count_distinct_min: 1 # x >= 1
      count_distinct_greater: 2 # x > 2
      count_distinct_not: 0 # x != 0
      count_distinct: 7 # x == 7
      count_distinct_less: 8 # x < 8
      count_distinct_max: 9 # x <= 9

      # Number of positive values.
      count_positive_min: 1 # x >= 1
      count_positive_greater: 2 # x > 2
      count_positive_not: 0 # x != 0
      count_positive: 7 # x == 7
      count_positive_less: 8 # x < 8
      count_positive_max: 9 # x <= 9

      # Number of negative values.
      count_negative_min: 1 # x >= 1
      count_negative_greater: 2 # x > 2
      count_negative_not: 0 # x != 0
      count_negative: 7 # x == 7
      count_negative_less: 8 # x < 8
      count_negative_max: 9 # x <= 9

      # Number of zero values. Any text and spaces (i.e., anything that doesn't look like a number) will be converted to 0.
      count_zero_min: 1 # x >= 1
      count_zero_greater: 2 # x > 2
      count_zero_not: 0 # x != 0
      count_zero: 7 # x == 7
      count_zero_less: 8 # x < 8
      count_zero_max: 9 # x <= 9

      # Number of even values.
      count_even_min: 1 # x >= 1
      count_even_greater: 2 # x > 2
      count_even_not: 0 # x != 0
      count_even: 7 # x == 7
      count_even_less: 8 # x < 8
      count_even_max: 9 # x <= 9

      # Number of odd values.
      count_odd_min: 1 # x >= 1
      count_odd_greater: 2 # x > 2
      count_odd_not: 0 # x != 0
      count_odd: 7 # x == 7
      count_odd_less: 8 # x < 8
      count_odd_max: 9 # x <= 9

      # Number of prime values.
      count_prime_min: 1 # x >= 1
      count_prime_greater: 2 # x > 2
      count_prime_not: 0 # x != 0
      count_prime: 7 # x == 7
      count_prime_less: 8 # x < 8
      count_prime_max: 9 # x <= 9

      # Calculate the median of a list of numbers.
      # See: https://en.wikipedia.org/wiki/Median
      median_min: 1.0 # x >= 1.0
      median_greater: 2.0 # x > 2.0
      median_not: 5.0 # x != 5.0
      median: 7.0 # x == 7.0
      median_less: 8.0 # x < 8.0
      median_max: 9.0 # x <= 9.0

      # Harmonic mean (subcontrary mean). The harmonic mean can be expressed as the reciprocal of the arithmetic mean of the reciprocals.
      # Appropriate for situations where the average of rates is desired.
      # See: https://en.wikipedia.org/wiki/Harmonic_mean
      harmonic_mean_min: 1.0 # x >= 1.0
      harmonic_mean_greater: 2.0 # x > 2.0
      harmonic_mean_not: 5.0 # x != 5.0
      harmonic_mean: 7.0 # x == 7.0
      harmonic_mean_less: 8.0 # x < 8.0
      harmonic_mean_max: 9.0 # x <= 9.0

      # Geometric mean. A type of mean which indicates the central tendency or typical value of a set of numbers
      # by using the product of their values (as opposed to the arithmetic mean, which uses their sum).
      # See: https://en.wikipedia.org/wiki/Geometric_mean
      geometric_mean_min: 1.0 # x >= 1.0
      geometric_mean_greater: 2.0 # x > 2.0
      geometric_mean_not: 5.0 # x != 5.0
      geometric_mean: 7.0 # x == 7.0
      geometric_mean_less: 8.0 # x < 8.0
      geometric_mean_max: 9.0 # x <= 9.0

      # Contraharmonic mean. A function complementary to the harmonic mean. A special case of the Lehmer mean, L₂(x), where p = 2.
      # See: https://en.wikipedia.org/wiki/Contraharmonic_mean
      contraharmonic_mean_min: 1.0 # x >= 1.0
      contraharmonic_mean_greater: 2.0 # x > 2.0
      contraharmonic_mean_not: 5.0 # x != 5.0
      contraharmonic_mean: 7.0 # x == 7.0
      contraharmonic_mean_less: 8.0 # x < 8.0
      contraharmonic_mean_max: 9.0 # x <= 9.0

      # Root mean square (quadratic mean). The square root of the arithmetic mean of the squares of a set of numbers.
      # See: https://en.wikipedia.org/wiki/Root_mean_square
      root_mean_square_min: 1.0 # x >= 1.0
      root_mean_square_greater: 2.0 # x > 2.0
      root_mean_square_not: 5.0 # x != 5.0
      root_mean_square: 7.0 # x == 7.0
      root_mean_square_less: 8.0 # x < 8.0
      root_mean_square_max: 9.0 # x <= 9.0

      # Trimean (TM, or Tukey's trimean).
      # A measure of a probability distribution's location defined as a weighted average of the distribution's median and its two quartiles.
      # See: https://en.wikipedia.org/wiki/Trimean
      trimean_min: 1.0 # x >= 1.0
      trimean_greater: 2.0 # x > 2.0
      trimean_not: 5.0 # x != 5.0
      trimean: 7.0 # x == 7.0
      trimean_less: 8.0 # x < 8.0
      trimean_max: 9.0 # x <= 9.0

      # Cubic mean. See: https://en.wikipedia.org/wiki/Cubic_mean
      cubic_mean_min: 1.0 # x >= 1.0
      cubic_mean_greater: 2.0 # x > 2.0
      cubic_mean_not: 5.0 # x != 5.0
      cubic_mean: 7.0 # x == 7.0
      cubic_mean_less: 8.0 # x < 8.0
      cubic_mean_max: 9.0 # x <= 9.0

      # Compute the P-th percentile of a list of numbers.
      # Uses the linear interpolation between closest ranks method (second variant, C = 1):
      # the P-th percentile (0 <= P <= 100) of a list of N ordered values (sorted from least to greatest).
      # A similar method is used in NumPy and Excel.
      # See: https://en.wikipedia.org/wiki/Percentile#Second_variant.2C_.7F.27.22.60UNIQ--postMath-00000043-QINU.60.22.27.7F
      # Example: `[ 95.5, 1.234 ]` means the 95.5th percentile in the column must be 1.234 (float).
      percentile_min: [ 95.0, 1.0 ] # x >= 1.0
      percentile_greater: [ 95.0, 2.0 ] # x > 2.0
      percentile_not: [ 95.0, 5.0 ] # x != 5.0
      percentile: [ 95.0, 7.0 ] # x == 7.0
      percentile_less: [ 95.0, 8.0 ] # x < 8.0
      percentile_max: [ 95.0, 9.0 ] # x <= 9.0

      # Quartiles. Three points that divide the data set into four equal groups, each comprising a quarter of the data.
      # See: https://en.wikipedia.org/wiki/Quartile
      # There are multiple methods for computing quartiles: ["exclusive", "inclusive"]. Exclusive is usually the classic one.
      # Available types: ["0%", "Q1", "Q2", "Q3", "100%", "IQR"] ("IQR" is the Interquartile Range).
      # Example: `[ inclusive, 'Q3', 42.0 ]` means the inclusive Q3 quartile must be 42.0.
      quartiles_min: [ exclusive, '0%', 1.0 ] # x >= 1.0
      quartiles_greater: [ inclusive, 'Q1', 2.0 ] # x > 2.0
      quartiles_not: [ exclusive, 'Q2', 5.0 ] # x != 5.0
      quartiles: [ inclusive, 'Q3', 7.0 ] # x == 7.0
      quartiles_less: [ exclusive, '100%', 8.0 ] # x < 8.0
      quartiles_max: [ inclusive, 'IQR', 9.0 ] # x <= 9.0

      # Midhinge. The average of the first and third quartiles; thus a measure of location.
      # Equivalently, it is the 25% trimmed mid-range or 25% midsummary; it is an L-estimator.
      # See: https://en.wikipedia.org/wiki/Midhinge
      # Midhinge = (first quartile + third quartile) / 2
      midhinge_min: 1.0 # x >= 1.0
      midhinge_greater: 2.0 # x > 2.0
      midhinge_not: 5.0 # x != 5.0
      midhinge: 7.0 # x == 7.0
      midhinge_less: 8.0 # x < 8.0
      midhinge_max: 9.0 # x <= 9.0

      # MAD - mean absolute deviation. The average of the absolute deviations from a central point.
      # It is a summary statistic of statistical dispersion or variability.
      # See: https://en.wikipedia.org/wiki/Average_absolute_deviation
      mean_abs_dev_min: 1.0 # x >= 1.0
      mean_abs_dev_greater: 2.0 # x > 2.0
      mean_abs_dev_not: 5.0 # x != 5.0
      mean_abs_dev: 7.0 # x == 7.0
      mean_abs_dev_less: 8.0 # x < 8.0
      mean_abs_dev_max: 9.0 # x <= 9.0

      # MAD - median absolute deviation. The median of the absolute deviations from the data's median.
      # It is a summary statistic of statistical dispersion or variability.
      # It is a robust measure of the variability of a univariate sample of quantitative data.
      # See: https://en.wikipedia.org/wiki/Median_absolute_deviation
      median_abs_dev_min: 1.0 # x >= 1.0
      median_abs_dev_greater: 2.0 # x > 2.0
      median_abs_dev_not: 5.0 # x != 5.0
      median_abs_dev: 7.0 # x == 7.0
      median_abs_dev_less: 8.0 # x < 8.0
      median_abs_dev_max: 9.0 # x <= 9.0

      # Population variance. Use when all possible observations of the system are present.
      # If used with a subset of data (sample variance), it will be a biased variance.
      # n degrees of freedom, where n is the number of observations.
      population_variance_min: 1.0 # x >= 1.0
      population_variance_greater: 2.0 # x > 2.0
      population_variance_not: 5.0 # x != 5.0
      population_variance: 7.0 # x == 7.0
      population_variance_less: 8.0 # x < 8.0
      population_variance_max: 9.0 # x <= 9.0

      # Unbiased sample variance. Use when only a subset of all possible observations of the system is present.
      # n - 1 degrees of freedom, where n is the number of observations.
      sample_variance_min: 1.0 # x >= 1.0
      sample_variance_greater: 2.0 # x > 2.0
      sample_variance_not: 5.0 # x != 5.0
      sample_variance: 7.0 # x == 7.0
      sample_variance_less: 8.0 # x < 8.0
      sample_variance_max: 9.0 # x <= 9.0

      # Standard deviation (for a sample; uses sample variance). It is also known as SD or StdDev.
      # StdDev is a measure used to quantify the amount of variation or dispersion of a set of data values.
      #  - A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set.
      #  - A high standard deviation indicates that the data points are spread out over a wider range of values.
      # See: https://en.wikipedia.org/wiki/Standard_deviation
      stddev_min: 1.0 # x >= 1.0
      stddev_greater: 2.0 # x > 2.0
      stddev_not: 5.0 # x != 5.0
      stddev: 7.0 # x == 7.0
      stddev_less: 8.0 # x < 8.0
      stddev_max: 9.0 # x <= 9.0

      # SD+ (standard deviation for a population; uses population variance).
      stddev_pop_min: 1.0 # x >= 1.0
      stddev_pop_greater: 2.0 # x > 2.0
      stddev_pop_not: 5.0 # x != 5.0
      stddev_pop: 7.0 # x == 7.0
      stddev_pop_less: 8.0 # x < 8.0
      stddev_pop_max: 9.0 # x <= 9.0

      # Coefficient of variation (cᵥ). Also known as relative standard deviation (RSD).
      # A standardized measure of dispersion of a probability distribution or frequency distribution.
      # It is often expressed as a percentage. The ratio of the standard deviation to the mean.
      # See: https://en.wikipedia.org/wiki/Coefficient_of_variation
      coef_of_var_min: 1.0 # x >= 1.0
      coef_of_var_greater: 2.0 # x > 2.0
      coef_of_var_not: 5.0 # x != 5.0
      coef_of_var: 7.0 # x == 7.0
      coef_of_var_less: 8.0 # x < 8.0
      coef_of_var_max: 9.0 # x <= 9.0

      # Interquartile mean (IQM). A measure of central tendency based on the truncated mean of the interquartile range.
      # Only the data in the second and third quartiles is used (as in the interquartile range), and the lowest 25% and the highest 25% of the scores are discarded.
      # See: https://en.wikipedia.org/wiki/Interquartile_mean
      # Note: It's SUPER slow!!!
      interquartile_mean_min: 1.0 # x >= 1.0
      interquartile_mean_greater: 2.0 # x > 2.0
      interquartile_mean_not: 5.0 # x != 5.0
      interquartile_mean: 7.0 # x == 7.0
      interquartile_mean_less: 8.0 # x < 8.0
      interquartile_mean_max: 9.0 # x <= 9.0

  - name: another_column
    rules:
      not_empty: true

  - name: inherited_column_login
    preset: my-preset/login

  - name: inherited_column_full_name
    preset: my-preset/full_name
```

### Extra checks

In addition to the schema outlined in the YAML configuration, the tool performs several underlying checks by default to
ensure thorough validation of your CSV files.

- The `filename_pattern` rule verifies that the file name adheres to the specified regex pattern, ensuring file naming conventions are followed.
- Ensures that the `name` property is defined for each column, applicable only when `csv.header` is set to `true`, to guarantee header integrity.
- The `required` property, when set to `true`, mandates the presence of the specified column in the CSV file, enhancing data completeness.
This is only relevant if `csv.header` is true.
- Validates that each row contains the correct number of columns, aligning with the schema's defined structure, to prevent data misalignment.
- The `strict_column_order` rule checks for the correct sequential order of columns as defined in the schema, ensuring structural consistency.
- The `allow_extra_columns` rule asserts no additional columns are present in the CSV file beyond those specified in the schema,
maintaining strict data fidelity.
- For `csv.header: true`, it checks if the schema contains any column `name` not found in the CSV file, addressing header discrepancies.
- For `csv.header: false`, it compares the number of columns in the schema against those in the CSV file, ensuring schema conformity.
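
For example, a schema that locks down the file structure could combine these options as follows. This is a minimal sketch based on the checks above; the column names are hypothetical.

```yml
csv:
  header: true

structural_rules:
  strict_column_order: true   # Columns must appear exactly in the order defined below.
  allow_extra_columns: false  # Any CSV column not listed below is an error.

columns:
  - name: id
    required: true  # Must be present in the CSV header.
  - name: name
    required: false # Optional: validated only if present.
```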

## Presets and reusable schemas

Presets significantly enhance the efficiency and reusability of schema definitions in CSV file validation by ensuring
consistency across various files with common validation rules for fields like user IDs and email addresses.

This uniformity maintains data integrity and simplifies maintenance by allowing centralized updates that automatically apply
to all linked schemas. Moreover, presets support customization through field-specific rule overrides, facilitating both
standardization and specific needs adaptation.

Focus on your task rather than copy&pasting.

### Example with presets

Let's look at a real-life example. Suppose you have a "library" of different user profile validation rules that can be
used in a wide variety of CSV files.

To avoid worrying about integrity, and to avoid suffering from copy-paste, you can reuse ANY(!) existing schema.
In fact, this can be considered partial inheritance.

#### Important notes

- You can make the chain of inheritance infinitely long,
i.e., chains of the form `grand-parent.yml` -> `parent.yml` -> `child.yml` -> `grandchild.yml` -> etc.
Of course, if you like to take risks. ;)
- But be careful with circular dependencies. The tool cannot handle them, and you may end up in an infinite loop.
- Any(!) of the schema files can be used alone or as a library. The syntax is the same.
- Schemas with presets validate themselves and if there are any obvious issues, you will see them when you try to use
the schema. But logical conflicts between rules are not checked (It's almost impossible from a code perspective).
As mentioned above, rules work in isolation and are not aware of each other. So the set of rules is your responsibility as always.
- Alias in presets must match the regex pattern `/^[a-z0-9-_]+$/i`. Otherwise, it might break the syntax.

#### If something went wrong

If you're having trouble working with presets and don't understand how CSV Blueprint interprets them under the hood,
just add `--dump-schema` to see the final schema. There is also a separate CLI command for dumping schemas:

```shell
./csv-blueprint debug-schema -s ./your/schema.yml
```

Let's take a look at what this looks like in code.

- Define a couple of basic rules for [database columns](schema-examples/preset_database.yml).
- Also, one of the files will contain rules specific only to the [users profile](schema-examples/preset_users.yml).
- And of course, let's [make a schema](schema-examples/preset_usage.yml) that will simultaneously reuse the rules from these two files.

As a result, you don't just get a bunch of schemas for validation, which is difficult to manage, but something like a
framework(!) that will be targeted to the specifics of your project, especially when there are dozens or even hundreds
of CSV files and rules. It will be much easier to achieve consistency. Very often it's quite important.

[Database preset](schema-examples/preset_database.yml)

CLICK to see source code

```yml
name: Presets for database columns
description: This schema contains basic rules for database user data.

columns:
  - name: id
    description: Unique identifier, usually used to denote a primary key in databases.
    example: 12345
    extra:
      custom_key: custom value
    rules:
      not_empty: true
      is_trimmed: true
      is_int: true
      num_min: 1
    aggregate_rules:
      is_unique: true
      sorted: [ asc, numeric ]

  - name: status
    description: Status in database
    example: active
    rules:
      not_empty: true
      allow_values: [ active, inactive, pending, deleted ]
```

[User data preset](schema-examples/preset_users.yml)

CLICK to see source code

```yml
name: Common presets for user data
description: >
  This schema contains common presets for user data.
  It can be used as a base for other schemas.

filename_pattern: /users-.*\.csv$/i

csv:
  delimiter: ';'

columns:
  - name: login
    description: User's login name
    example: johndoe
    rules:
      not_empty: true
      is_trimmed: true
      is_lowercase: true
      is_slug: true
      length_min: 3
      length_max: 20
      is_alnum: true
    aggregate_rules:
      is_unique: true

  - name: password
    description: User's password
    example: '9RfzENKD'
    rules:
      not_empty: true
      is_trimmed: true
      is_password_safe_chars: true
      password_strength_min: 7
      contains_none: [ "password", "123456", "qwerty", " " ]
      charset: UTF-8
      length_min: 6
      length_max: 20

  - name: full_name
    description: User's full name
    example: 'John Doe Smith'
    rules:
      not_empty: true
      is_trimmed: true
      charset: UTF-8
      contains: " "
      word_count_min: 2
      word_count_max: 8
      is_capitalize: true
    aggregate_rules:
      is_unique: true

  - name: email
    description: User's email address
    example: user@example.com
    rules:
      not_empty: true
      is_trimmed: true
      is_email: true
      is_lowercase: true
    aggregate_rules:
      is_unique: true

  - name: birthday
    description: Validates the user's birthday.
    example: '1990-01-01'
    rules:
      not_empty: true # The birthday field must not be empty.
      is_trimmed: true # Trims the value before validation.
      date_format: Y-m-d # Checks if the date matches the YYYY-MM-DD format.
      is_date: true # Validates if the value is a valid date.
      date_age_greater: 0 # Ensures the date is in the past.
      date_age_less: 150 # Ensures the user is not older than 150 years.
      date_max: now # Ensures the date is not in the future.

  - name: phone_number
    description: User's phone number in the US
    example: '+1 650 253 00 00'
    rules:
      not_empty: true
      is_trimmed: true
      starts_with: '+1'
      phone: US

  - name: balance
    description: User's balance in USD
    example: '100.00'
    rules:
      not_empty: true
      is_trimmed: true
      is_float: true
      num_min: 0.00
      num_max: 1000000000.00 # 1 billion is the max amount in our system.
      precision: 2

  - name: short_description
    description: A brief description of the item
    example: 'Lorem ipsum dolor sit amet'
    rules:
      not_empty: true
      contains: " "
      length_max: 255
      is_trimmed: true
```

[Usage of presets](schema-examples/preset_usage.yml). Under the hood, this short and clear Yaml is expanded to roughly the following.
As you can see, it simplifies your work a lot.

```yml
name: Schema uses presets and adds new columns + specific rules.
description: This schema uses presets. Also, it demonstrates how to override preset values.

presets: # Include any other schemas and define an alias for each.
  users: ./preset_users.yml # Include the schema with common user data.
  db: ./preset_database.yml # Include the schema with basic database columns.

csv:
  preset: users # Take the CSV settings from the preset.
  enclosure: '|' # Overridden enclosure only for this schema.

columns:
  # Grab only the needed columns from the presets, in a specific order.
  - preset: db/id
  - preset: db/status
  - preset: users/login
  - preset: users/email
  - preset: users/full_name
  - preset: users/birthday
  - preset: users/phone_number # Rename the column: "phone_number" => "phone".
    name: phone
  - preset: users/password # Overridden value to force a strong password.
    rules: { length_min: 10 }
  - name: admin_note # New column, specific only to this schema.
    description: Admin note
    rules:
      not_empty: true
      length_min: 1
      length_max: 10
    aggregate_rules: # In practice this will be a rare case, but the opportunity is there.
      preset: db/id # Take only the aggregate rules from the preset.
      is_unique: true # Added a new, specific aggregate rule.
```

CLICK to see what it looks like in memory.

```yml
# Schema file is "./schema-examples/preset_usage.yml"
name: 'Schema uses presets and adds new columns + specific rules.'
description: 'This schema uses presets. Also, it demonstrates how to override preset values.'
presets:
  users: ./schema-examples/preset_users.yml
  db: ./schema-examples/preset_database.yml
csv:
  delimiter: ;
  enclosure: '|'
columns:
  - name: id
    description: 'Unique identifier, usually used to denote a primary key in databases.'
    example: 12345
    extra:
      custom_key: 'custom value'
    rules:
      not_empty: true
      is_trimmed: true
      is_int: true
      num_min: 1
    aggregate_rules:
      is_unique: true
      sorted:
        - asc
        - numeric

  - name: status
    description: 'Status in database'
    example: active
    rules:
      not_empty: true
      allow_values:
        - active
        - inactive
        - pending
        - deleted

  - name: login
    description: "User's login name"
    example: johndoe
    rules:
      not_empty: true
      length_min: 3
      length_max: 20
      is_trimmed: true
      is_lowercase: true
      is_slug: true
      is_alnum: true
    aggregate_rules:
      is_unique: true

  - name: email
    description: "User's email address"
    example: user@example.com
    rules:
      not_empty: true
      is_trimmed: true
      is_lowercase: true
      is_email: true
    aggregate_rules:
      is_unique: true

  - name: full_name
    description: "User's full name"
    example: 'John Doe Smith'
    rules:
      not_empty: true
      is_trimmed: true
      is_capitalize: true
      word_count_min: 2
      word_count_max: 8
      contains: ' '
      charset: UTF-8
    aggregate_rules:
      is_unique: true

  - name: birthday
    description: "Validates the user's birthday."
    example: '1990-01-01'
    rules:
      not_empty: true
      is_trimmed: true
      is_date: true
      date_max: now
      date_format: Y-m-d
      date_age_greater: 0
      date_age_less: 150

  - name: phone
    description: "User's phone number in the US"
    example: '+1 650 253 00 00'
    rules:
      not_empty: true
      is_trimmed: true
      starts_with: '+1'
      phone: US

  - name: password
    description: "User's password"
    example: 9RfzENKD
    rules:
      not_empty: true
      length_min: 10
      length_max: 20
      is_trimmed: true
      contains_none:
        - password
        - '123456'
        - qwerty
        - ' '
      password_strength_min: 7
      is_password_safe_chars: true
      charset: UTF-8

  - name: admin_note
    description: 'Admin note'
    rules:
      not_empty: true
      length_min: 1
      length_max: 10
    aggregate_rules:
      is_unique: true
      sorted:
        - asc
        - numeric
```

As a result, readability and maintainability become dramatically easier. You can easily add new rules, change existing ones, etc.

### Complete example with all available syntax

CLICK to see available syntax.

```yml
name: Complite list of preset features
description: This schema contains all the features of the presets.

presets:
# The basepath for the preset is `.` (current directory of the current schema file).
# Define alias "db" for schema in `./preset_database.yml`.
db: preset_database.yml # Or `db: ./preset_database.yml`. It's up to you.

# For example, you can use a relative path.
users: ./../schema-examples/preset_users.yml

# Or you can use an absolute path.
# db: /full/path/preset_database.yml

filename_pattern: { preset: users } # Take the filename pattern from the preset.
structural_rules: { preset: users } # Take the global rules from the preset.
csv: { preset: users } # Take the CSV settings from the preset.

columns:
# Use name of column from the preset.
# "db" is alias. "id" is column `name` in `preset_database.yml`.
- preset: 'db/id'

# Use column index. "db" is alias. "0" is column index in `preset_database.yml`.
- preset: 'db/0'
- preset: 'db/0:'

# Use column index and column name. It useful if column name is not unique.
- preset: 'db/0:id'

# Use only `rules` of "status" column from the preset.
- name: My column
rules:
preset: 'db/status'

# Override only `aggregate_rules` from the preset.
# Use only `aggregate_rules` of "id" column from the preset.
# We strictly take only the very first column (index = 0).
- name: My column
aggregate_rules:
preset: 'db/0:id'

# Combo!!! If you're a risk-taker or have a high level of inner zen. :)
# Creating a column from three other columns.
# In fact, it will merge all three at once with key replacement.
- name: Crazy combo!
    description: > # Just good advice.
I like to take risks, too.
Be careful. Use your power wisely.
    example: ~ # Ignore the inherited "example" value. Set it to `null`.
preset: 'users/login'
rules:
preset: 'users/email'
      not_empty: true # Override the rule from the preset.
aggregate_rules:
preset: 'db/0'
```

**Note:** All provided YAML examples pass built-in validation, yet they may not make practical sense.
These are intended solely for demonstration and to illustrate potential configurations and features.
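
When combining presets like this, it can help to inspect the final merged schema. A minimal sketch using the `debug-schema` command documented below (the schema path is a placeholder):

```shell
# Show the internal representation of the schema, presets included.
./csv-blueprint debug-schema -s ./schema.yml

# Hide default values for a more compact dump.
./csv-blueprint debug-schema -s ./schema.yml --hide-defaults
```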

## Parallel processing

The `--parallel` option is available for speeding up the validation of CSV files by utilizing more CPU resources
effectively.

### Key Points

- **Experimental Feature:** This feature is currently experimental and requires further debugging and testing. Although
  it performs well in synthetic autotests and benchmarks, more practical use cases are needed to validate its stability.
- **Use Case:** This option is beneficial if you are processing dozens of CSV files, with each file taking 1 second or
more to process.
- **Default Behavior:** If you use `--parallel` without specifying a value, it defaults to using the maximum number of
available CPU cores.
- **Thread Pool Size:** You can set a specific number of threads for the pool; for example, `--parallel=10` sets the
  thread pool size to 10 (see the examples after this list). It doesn't make much sense to specify more threads than the
  number of logical cores in your CPU; doing so only slows things down due to the system overhead of handling
  multithreading.
- **Disabling Parallelism:** Using `--parallel=1` disables parallel processing, which is the default setting if the
option is not specified.
- **Implementation:** The feature relies on the `ext-parallel` PHP extension, which enables the creation of lightweight
threads rather than processes. This extension is already included in our Docker image. Ensure that you have
the `ext-parallel` extension installed if you are not using our Docker image. This extension is crucial for the
operation of the parallel processing feature. The application always runs in single-threaded mode if the extension is
not installed.
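
For illustration, a few possible invocations (the file paths are placeholders):

```shell
# Use the maximum number of available CPU cores.
./csv-blueprint validate-csv -c ./data/*.csv -s ./schema.yml --parallel

# Use a fixed pool of 4 threads.
./csv-blueprint validate-csv -c ./data/*.csv -s ./schema.yml --parallel=4

# Explicitly single-threaded (the same as omitting the option).
./csv-blueprint validate-csv -c ./data/*.csv -s ./schema.yml --parallel=1
```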

## Complete CLI help message

This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its
CLI. The tool offers a comprehensive set of options to cater to various needs and scenarios, ensuring flexibility and
efficiency in CSV file validation and manipulation.

For detailed information on each command and option, refer to
the [JBZoo/Cli documentation](https://github.com/JBZoo/Cli). This resource provides insights into the functionality and
application of the CLI commands, helping users make the most out of the tool's capabilities.

`./csv-blueprint validate-csv --help`


```txt
Description:
Validate CSV file(s) by schema(s).

Usage:
validate-csv [options]
validate:csv

Options:
-c, --csv=CSV Specify the path(s) to the CSV files you want to validate.
This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
(multiple values allowed)
-s, --schema=SCHEMA Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
Examples: p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
(multiple values allowed)
-S, --skip-schema[=SKIP-SCHEMA] Skips schema validation for quicker checks when the schema's correctness is certain.
Use any non-empty value or "yes" to activate
[default: "no"]
-a, --apply-all[=APPLY-ALL] Apply all schemas (also without `filename_pattern`) to all CSV files found as global rules.
Available options:
- auto: If no glob pattern (*) is used for --schema, the schema is applied to all found CSV files.
- yes|y|1: Apply all schemas to all CSV files, Schemas without `filename_pattern` are applied as a global rule.
- no|n|0: Apply only schemas with not empty `filename_pattern` and match the CSV files.
Note. If specify the option `--apply-all` without value, it will be treated as "yes".
[default: "auto"]
-Q, --quick[=QUICK] Stops the validation process upon encountering the first error,
accelerating the check but limiting error visibility.
Returns a non-zero exit code if any error is detected.
Enable by setting to any non-empty value or "yes".
[default: "no"]
-r, --report=REPORT Determines the report's output format.
Available options: text, table, github, gitlab, teamcity, junit
[default: "table"]
--dump-schema Dumps the schema of the CSV file if you want to see the final schema after inheritance.
--debug Intended solely for debugging and advanced profiling purposes.
Activating this option provides detailed process insights,
useful for troubleshooting and performance analysis.
--parallel[=PARALLEL] EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
You can specify the number of threads.
If you do not specify a value, the number of threads will be equal to the number of CPU cores.
By default, the process is launched in a single-threaded mode. [default: "1"]
--no-progress Disable progress bar animation for logs. It will be used only for text output format.
--mute-errors Mute any sort of errors. So exit code will be always "0" (if it's possible).
It has major priority then --non-zero-on-error. It's on your own risk!
--stdout-only For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
--non-zero-on-error None-zero exit code on any StdErr message.
--timestamp Show timestamp at the beginning of each message.It will be used only for text output format.
--profile Display timing and memory usage information.
--output-mode=OUTPUT-MODE Output format. Available options:
text - Default text output format, userfriendly and easy to read.
cron - Shortcut for crontab. It's basically focused on human-readable logs output.
It's combination of --timestamp --profile --stdout-only --no-progress -vv.
logstash - Logstash output format, for integration with ELK stack.
[default: "text"]
--cron Alias for --output-mode=cron. Deprecated!
-h, --help Display help for the given command. When no command is given display help for the list command
--silent Do not output any message
-q, --quiet Only errors are displayed. All other output is suppressed
-V, --version Display this application version
--ansi|--no-ansi Force (or disable --no-ansi) ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

`./csv-blueprint validate-schema --help`


```txt
Description:
Validate syntax in schema file(s).

Usage:
validate-schema [options]

Options:
-s, --schema=SCHEMA Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
Examples: /full/path/name.yml; p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
(multiple values allowed)
-Q, --quick[=QUICK] Stops the validation process upon encountering the first error,
accelerating the check but limiting error visibility.
Returns a non-zero exit code if any error is detected.
Enable by setting to any non-empty value or "yes".
[default: "no"]
-r, --report=REPORT Determines the report's output format.
Available options: text, table, github, gitlab, teamcity, junit
[default: "table"]
--dump-schema Dumps the schema of the CSV file if you want to see the final schema after inheritance.
--debug Intended solely for debugging and advanced profiling purposes.
Activating this option provides detailed process insights,
useful for troubleshooting and performance analysis.
--parallel[=PARALLEL] EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
You can specify the number of threads.
If you do not specify a value, the number of threads will be equal to the number of CPU cores.
By default, the process is launched in a single-threaded mode. [default: "1"]
--no-progress Disable progress bar animation for logs. It will be used only for text output format.
--mute-errors Mute any sort of errors. So exit code will be always "0" (if it's possible).
It has major priority then --non-zero-on-error. It's on your own risk!
--stdout-only For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
--non-zero-on-error None-zero exit code on any StdErr message.
--timestamp Show timestamp at the beginning of each message.It will be used only for text output format.
--profile Display timing and memory usage information.
--output-mode=OUTPUT-MODE Output format. Available options:
text - Default text output format, userfriendly and easy to read.
cron - Shortcut for crontab. It's basically focused on human-readable logs output.
It's combination of --timestamp --profile --stdout-only --no-progress -vv.
logstash - Logstash output format, for integration with ELK stack.
[default: "text"]
--cron Alias for --output-mode=cron. Deprecated!
-h, --help Display help for the given command. When no command is given display help for the list command
--silent Do not output any message
-q, --quiet Only errors are displayed. All other output is suppressed
-V, --version Display this application version
--ansi|--no-ansi Force (or disable --no-ansi) ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

`./csv-blueprint debug-schema --help`


```txt
Description:
Show the internal representation of the schema taking into account presets.

Usage:
debug-schema [options]

Options:
-s, --schema=SCHEMA Specify the path to a schema file, supporting YAML, JSON, or PHP formats.
Examples: /full/path/name.yml; p/file.yml
-d, --hide-defaults Hide default values in the output.
--no-progress Disable progress bar animation for logs. It will be used only for text output format.
--mute-errors Mute any sort of errors. So exit code will be always "0" (if it's possible).
It has major priority then --non-zero-on-error. It's on your own risk!
--stdout-only For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
--non-zero-on-error None-zero exit code on any StdErr message.
--timestamp Show timestamp at the beginning of each message.It will be used only for text output format.
--profile Display timing and memory usage information.
--output-mode=OUTPUT-MODE Output format. Available options:
text - Default text output format, userfriendly and easy to read.
cron - Shortcut for crontab. It's basically focused on human-readable logs output.
It's combination of --timestamp --profile --stdout-only --no-progress -vv.
logstash - Logstash output format, for integration with ELK stack.
[default: "text"]
--cron Alias for --output-mode=cron. Deprecated!
-h, --help Display help for the given command. When no command is given display help for the list command
--silent Do not output any message
-q, --quiet Only errors are displayed. All other output is suppressed
-V, --version Display this application version
--ansi|--no-ansi Force (or disable --no-ansi) ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

`./csv-blueprint create-schema --help`

It's in beta and a work in progress (see the sketch after the help message below).


```txt
Description:
Analyze CSV files and suggest a schema based on the data found.

Usage:
create-schema [options]

Options:
-c, --csv=CSV Specify the path(s) to the CSV files you want to analyze.
This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
(multiple values allowed)
-H, --header[=HEADER] Force the presence of a header row in the CSV files. [default: "yes"]
-L, --lines[=LINES] The number of lines to read when detecting parameters. Minimum is 1. [default: 10000]
-C, --check-syntax[=CHECK-SYNTAX] Check the syntax of the suggested schema. [default: "yes"]
-r, --report=REPORT Determines the report's output format.
Available options: text, table, github, gitlab, teamcity, junit
[default: "table"]
--dump-schema Dumps the schema of the CSV file if you want to see the final schema after inheritance.
--debug Intended solely for debugging and advanced profiling purposes.
Activating this option provides detailed process insights,
useful for troubleshooting and performance analysis.
--parallel[=PARALLEL] EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
You can specify the number of threads.
If you do not specify a value, the number of threads will be equal to the number of CPU cores.
By default, the process is launched in a single-threaded mode. [default: "1"]
--no-progress Disable progress bar animation for logs. It will be used only for text output format.
--mute-errors Mute any sort of errors. So exit code will be always "0" (if it's possible).
It has major priority then --non-zero-on-error. It's on your own risk!
--stdout-only For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
--non-zero-on-error None-zero exit code on any StdErr message.
--timestamp Show timestamp at the beginning of each message.It will be used only for text output format.
--profile Display timing and memory usage information.
--output-mode=OUTPUT-MODE Output format. Available options:
text - Default text output format, userfriendly and easy to read.
cron - Shortcut for crontab. It's basically focused on human-readable logs output.
It's combination of --timestamp --profile --stdout-only --no-progress -vv.
logstash - Logstash output format, for integration with ELK stack.
[default: "text"]
--cron Alias for --output-mode=cron. Deprecated!
-h, --help Display help for the given command. When no command is given display help for the list command
--silent Do not output any message
-q, --quiet Only errors are displayed. All other output is suppressed
-V, --version Display this application version
--ansi|--no-ansi Force (or disable --no-ansi) ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```
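
As a rough sketch of the beta workflow (the file names are placeholders, and redirecting the suggested schema into a file is an assumption about where the output lands):

```shell
# Analyze the first 50,000 lines and save the suggested schema as a draft.
./csv-blueprint create-schema --csv=./data/users.csv --lines=50000 > users.draft.yml
```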

## Report examples

The validation process culminates in a human-readable report detailing any errors identified within the CSV file. While
the default report format is a table, the tool supports various output formats, including text, GitHub, GitLab,
TeamCity, JUnit, among others, to best suit your project's needs and your personal or team preferences.
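
For example, to switch the report format via the `-r`/`--report` option documented above (the paths are placeholders):

```shell
# Emit a JUnit report instead of the default table.
./csv-blueprint validate-csv -c ./data/users.csv -s ./schema.yml -r junit
```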

### GitHub Action format

To see user-friendly error outputs in your pull requests (PRs), specify `report: github`. This
utilizes [annotations](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-a-warning-message)
to highlight bugs directly within the GitHub interface at the PR level. This feature allows errors to be displayed in
the exact location within the CSV file, right in the diff of your Pull Requests. For a practical example,
view [this live demo PR](https://github.com/jbzoo/csv-blueprint-demo/pull/1/files).
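
A minimal workflow step sketch; the input names are assumed to mirror the CLI's `--csv`, `--schema`, and `--report` options, and the paths are placeholders:

```yml
- name: Validate CSV files
  uses: jbzoo/csv-blueprint@master
  with:
    csv: ./data/**/*.csv
    schema: ./schemas/*.yml
    report: github
```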

![GitHub Actions - PR](.github/assets/github-actions-pr.png)

Example in the GitHub Actions terminal:

![GitHub Actions - Terminal](.github/assets/github-actions-termintal.png)

### Text format

An optional `text` format with highlighted keywords for quick navigation.

![Report - Text](.github/assets/output-text.png)

### Table format

When using the `table` format (default), the output is organized in a clear, easily interpretable table that lists all
discovered errors. This format is ideal for quick reviews and sharing with team members for further action.

![Table format](.github/assets/output-table.png)

Notes:

- The report format for GitHub Actions is `table` by default.
- The tool uses [JBZoo/CI-Report-Converter](https://github.com/JBZoo/CI-Report-Converter) as an SDK to convert reports to
  different formats, so you can easily integrate it with any CI system.

## Benchmarks

Understanding the performance of this tool is crucial, but it's important to note that its efficiency is influenced by
several key factors:

- **File Size:** The dimensions of the CSV file, both in terms of rows and columns, directly impact processing time.
Performance scales linearly with file size and is dependent on the capabilities of your hardware, such as CPU and SSD
speed.
- **Number of Rules:** More validation rules per column mean more iterations for processing. Each rule operates
independently, so the total time and memory consumption are cumulative across all rules.
- **Rule Intensity:** While most validation rules are optimized for speed and low memory usage, some,
like `interquartile_mean`, can be significantly slower. For instance, `interquartile_mean` might process around 4,000
lines per second, whereas other rules can handle upwards of 50 million lines per second.

However, to gain a general understanding of performance, refer to the table below.

- All tests were conducted on a dataset comprising `2 million lines` plus an additional line for the header.
- These results are derived from the most current version, as verified by tests run
using [GitHub Actions](https://github.com/jbzoo/csv-blueprint/actions/workflows/benchmark.yml) ([See workflow.yml](.github/workflows/benchmark.yml)).
The link provides access to a variety of builds, which are essential for different testing scenarios and experiments.
The most representative data can be found under `Docker (latest, XX)`.
- Developer mode was activated for these tests, using the flags `-vvv --debug --profile`.
- Testing environment included the latest Ubuntu + Docker. For more information about the GitHub Actions (GA) hardware
used, please [see details about GA hardware](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-private-repositories).
- The primary performance indicator is the processing speed, measured in lines per second. Note that speeds are
presented in thousands of lines per second (`100K` equals `100,000 lines per second`).
- Peak RAM usage throughout the duration of each test case serves as an additional performance metric.

Profiles:

- **[Quickest:](tests/Benchmarks/bench_0_quickest_combo.yml)** Focuses on the fastest rules, either cell or aggregation, providing a baseline.
- **[Minimum:](tests/Benchmarks/bench_1_mini_combo.yml)** Uses a set of normal performance rules, with two instances of each.
- **[Realistic:](tests/Benchmarks/bench_2_realistic_combo.yml)** Represents a mix of rules likely encountered in typical use cases.
- **[All Aggregations:](tests/Benchmarks/bench_3_all_agg.yml)** Tests all aggregation rules simultaneously, illustrating maximum load.

Divisions:

- **Cell Rules:** Tests only individual cell validation rules.
- **Agg Rules:** Focuses solely on column-wide aggregation rules.
- **Cell + Agg:** Combines cell and aggregation rules for comprehensive validation.
- **Peak Memory:** Indicates the maximum RAM usage, particularly relevant in scenarios with aggregation rules.

**Note:** The `Peak Memory` metric is primarily of interest when aggregation rules are used, as non-aggregation
scenarios typically require no more than 2-4 megabytes of memory, regardless of file size or rule count.

These benchmarks offer a snapshot of the tool's capabilities across a range of scenarios, helping you gauge its
suitability for your specific CSV validation needs.

| File / Profile              | Metric      | Quickest       | Minimum        | Realistic      | All aggregations |
|-----------------------------|-------------|----------------|----------------|----------------|------------------|
| Columns: 1<br>Size: ~8 MB   | Cell rules  | 786K, 2.5 sec  | 386K, 5.2 sec  | 189K, 10.6 sec | 184K, 10.9 sec   |
|                             | Agg rules   | 1187K, 1.7 sec | 1096K, 1.8 sec | 667K, 3.0 sec  | 96K, 20.8 sec    |
|                             | Cell + Agg  | 762K, 2.6 sec  | 373K, 5.4 sec  | 167K, 12.0 sec | 63K, 31.7 sec    |
|                             | Peak Memory | 52 MB          | 68 MB          | 208 MB         | 272 MB           |
| Columns: 5<br>Size: 64 MB   | Cell rules  | 545K, 3.7 sec  | 319K, 6.3 sec  | 174K, 11.5 sec | 168K, 11.9 sec   |
|                             | Agg rules   | 714K, 2.8 sec  | 675K, 3.0 sec  | 486K, 4.1 sec  | 96K, 20.8 sec    |
|                             | Cell + Agg  | 538K, 3.7 sec  | 308K, 6.5 sec  | 154K, 13.0 sec | 61K, 32.8 sec    |
|                             | Peak Memory | 52 MB          | 68 MB          | 208 MB         | 272 MB           |
| Columns: 10<br>Size: 220 MB | Cell rules  | 311K, 6.4 sec  | 221K, 9.0 sec  | 137K, 14.6 sec | 135K, 14.8 sec   |
|                             | Agg rules   | 362K, 5.5 sec  | 354K, 5.6 sec  | 294K, 6.8 sec  | 96K, 20.8 sec    |
|                             | Cell + Agg  | 307K, 6.5 sec  | 215K, 9.3 sec  | 125K, 16.0 sec | 56K, 35.7 sec    |
|                             | Peak Memory | 52 MB          | 68 MB          | 208 MB         | 272 MB           |
| Columns: 20<br>Size: 1.2 GB | Cell rules  | 103K, 19.4 sec | 91K, 22.0 sec  | 72K, 27.8 sec  | 71K, 28.2 sec    |
|                             | Agg rules   | 108K, 18.5 sec | 107K, 18.7 sec | 101K, 19.8 sec | 96K, 20.8 sec    |
|                             | Cell + Agg  | 102K, 19.6 sec | 89K, 22.5 sec  | 69K, 29.0 sec  | 41K, 48.8 sec    |
|                             | Peak Memory | 52 MB          | 68 MB          | 208 MB         | 272 MB           |

**Additional Benchmark Insights:**

When running the same validation tests on different hardware configurations, the performance of the tool can vary
significantly. Notably, testing on a **MacBook 14" M2 Max (2023)** yields results that are approximately twice as fast
as those observed on the GitHub Actions hardware. This indicates the tool's exceptional performance on modern, high-spec
devices.

Conversely, tests conducted on a **MacBook Pro (2019) with an Intel 2.4 GHz processor** align closely with the GitHub
Actions results, suggesting that the benchmark table provided reflects an average performance level for typical
engineering hardware.

### Brief conclusions

- **Cell Rules**: These rules are highly CPU-intensive but require minimal RAM, typically around 1-2 MB at peak. The
more cell rules applied to a column, the longer the validation process takes due to the additional actions performed
on each value.

- **Aggregation Rules**: These rules operate at incredible speeds, processing anywhere from 10 million to billions of
rows per second. However, they are significantly more RAM-intensive. Interestingly, adding over 100 different
aggregation rules does not substantially increase memory consumption.

- **PHP Array Functions**: Not all PHP array functions can operate by reference (`&$var`). Whether or not a dataset in a
column can be manipulated in this way is highly dependent on the specific algorithm used. For example, a 20 MB dataset
might be duplicated during processing, leading to a peak memory usage of 40 MB. Consequently, optimization techniques
that rely on passing data by reference are often ineffective.

- **Practical Implications**: If processing a 1 GB file within 30-60 seconds is acceptable, and if there is 200-500 MB
of RAM available, there may be little need to overly concern oneself with these performance considerations.

- **Memory Management**: Throughout testing, no memory leaks were observed.

### Examples of CSV Files

The CSV files utilized for benchmark testing are described below. These files were initially generated
using [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) to create the first 2000 lines. Subsequently, they were
replicated [1000 times within themselves](tests/Benchmarks/create-csv.sh), allowing for the creation of significantly
large random files in a matter of seconds.
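
The self-replication trick can be approximated as follows (a sketch with placeholder file names, not the project's actual script):

```shell
# Grow a small seed CSV into a large one by appending its own body repeatedly.
head -n 1 seed.csv > big.csv       # keep the header exactly once
tail -n +2 seed.csv > body.tmp     # extract the data rows without the header
for _ in $(seq 1 1000); do
  cat body.tmp >> big.csv          # append the seed's data rows 1000 times
done
rm body.tmp
```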

A key principle observed in these files is that as the number of columns increases, the length of the values within
these columns also tends to increase, following a pattern akin to exponential growth.

Columns: 1, Size: 8.48 MB

```csv
id
1
2
```

Columns: 5, Size: 64 MB

```csv
id,bool_int,bool_str,number,float
1,0,false,289566,864360.14285714
2,1,true,366276,444761.71428571
```

Columns: 10, Size: 220 MB

```csv
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4
1,1,true,779914,1101964.2857143,2011-02-04,"2000-03-02 00:33:57",erdman.net,[email protected],32.51.181.238
2,0,true,405408,695839.42857143,1971-01-29,"1988-08-12 21:25:27",bode.com,[email protected],76.79.155.73
```

Columns: 20, Size: 1.2 GB

```csv
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,postcode,latitude,longitude,ip6,sentence_tiny,sentence_small,sentence_medium,sentence_huge
1,1,false,884798,1078489.5714286,2006-02-09,"2015-12-07 22:59:06",gerhold.com,[email protected],173.231.203.134,5a2b6f01-0bac-35b2-bef1-5be7bb3c2d78,"776 Moises Coves Apt. 531; Port Rylan, DC 80810",10794,-69.908375,136.780034,78cb:75d9:4dd:8248:f190:9f3c:b0e:9afc,"Ea iusto non.","Qui sapiente qui ut nihil sit.","Modi et voluptate blanditiis aliquid iure eveniet voluptas facilis ipsum omnis velit.","Minima in molestiae nam ullam voluptatem sapiente corporis sunt in ut aut alias exercitationem incidunt fugiat doloribus laudantium ducimus iusto nemo assumenda non ratione neque labore voluptatem."
2,0,false,267823,408705.14285714,1985-07-19,"1996-11-18 08:21:44",keebler.net,[email protected],73.197.210.145,29e076ab-a769-3a1f-abd4-2bc73ab17c99,"909 Sabryna Island Apt. 815; West Matteoside, CO 54360-7141",80948,7.908256,123.666864,bf3b:abab:3dcb:c335:b1a:b5d6:60e9:107e,"Aut dolor distinctio quasi.","Alias sit ut perferendis quod at dolores.","Molestiae est eos dolore deserunt atque temporibus.","Quisquam velit aut saepe temporibus officia labore quam numquam eveniet velit aliquid aut autem quis voluptatem in ut iste sunt omnis iure laudantium aspernatur tenetur nemo consequatur aliquid sint nostrum aut nostrum."
```

### Run benchmark locally

Make sure you have PHP 8.2+ and Docker installed.

```shell
# Clone the latest version
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint

# Download dependencies and build the tool.
make build # Needed to build the benchmark tool. See the `./tests/Benchmarks` folder.
make build-phar-file # Optional. Only if you want to test it.
make docker-build # Recommended. The local tag is "jbzoo/csv-blueprint:local".

# Create random CSV files with 5 columns (max: 20).
BENCH_COLS=5 make bench-create-csv

# Run the benchmark against the most recently created CSV file.
BENCH_COLS=5 make bench-docker # Recommended
BENCH_COLS=5 make bench-phar
BENCH_COLS=5 make bench-php

# It's a shortcut that combines CSV file creation and Docker run.
# By default BENCH_COLS=10
make bench
```

## Disadvantages?

The perception that PHP is inherently slow is a common misconception. However, with the right optimization strategies,
PHP can perform exceptionally well. For evidence, refer to the article [Processing One Billion CSV Rows in PHP!](https://dev.to/realflowcontrol/processing-one-billion-rows-in-php-3eg0),
which demonstrates that PHP can process, aggregate, and calculate data from CSV files at approximately **15 million
lines per second**! While not all optimizations are currently implemented, the performance is already quite impressive.

- Yes, it's acknowledged that this tool might not be the fastest available, but it's also far from the slowest. For more
details, see the link provided above.
- Yes, the tool is built with PHP—not Python, Go, or PySpark—which may not be the first choice for such tasks.
- Yes, it functions like a standalone binary. The recommendation is simply to use it without overthinking its internal
workings.
- Yes, it's recognized that this cannot be used as a Python SDK within a pipeline.

However, for the majority of scenarios, these are not deal-breakers. The utility effectively addresses the challenge of
validating CSV files in continuous integration (CI) environments. 👍

This utility is designed for immediate use without necessitating a deep understanding of its inner mechanics. It adheres
to rigorous testing standards, including strict typing, approximately seven linters and static analyzers at the highest
rule level. Furthermore, every pull request is subjected to around ten different checks on GitHub Actions, spanning a
matrix of PHP versions and modes, ensuring robustness. The extensive coverage and precautions are due to the
unpredictability of usage conditions, embodying the spirit of the Open Source community.

In summary, the tool is developed with the highest standards of modern PHP practices, ensuring it performs as expected.

## Coming soon

These are random ideas and plans, with no promises or deadlines. Feel free to [help me](#contributing)!


- **Batch processing**
  - If the `--csv` option is not specified, read from STDIN, to support pipelines on Unix-like systems.

- **Validation**
  - Multi `filename_pattern`. Support a list of regexes.
  - Multiple values in one cell.
  - Custom cell rule as a callback. Useful when you have a complex rule that can't be described in the schema file.
  - Custom aggregate rule as a callback. Useful when you have a complex rule that can't be described in the schema file.
  - Configurable keyword for null/empty values. By default, it's an empty string, but you may want to use `null`, `nil`,
    `none`, `empty`, etc. Overridable at the column level.
  - Handle empty files, files with only a header row or a single line of data, and a single column without a header.
  - If the `--schema` option is not specified, validate only the most basic things (like "is it a CSV file?").
- Complex rules (like "if field `A` is not empty, then field `B` should be not empty too").
- Extending with custom rules and custom report formats. Plugins?
  - Input encoding detection + `BOM` handling (currently experimental). It works, but not very accurately; UTF-8 is the best choice for now.

- **Performance and optimization**
  - Using [vectors](https://www.php.net/manual/en/class.ds-vector.php) instead of arrays to optimize memory usage and access speed.
- Multithreading support for parallel validation of CSV by columns.

- **Mock data generation**
- Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
- Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.
- [ReverseRegex](https://github.com/enso-media/ReverseRegex) to generate text from regex.

- **Reporting**
- More report formats (like JSON, XML, etc). Any ideas?
- Gitlab and JUnit reports must be as one structure. It's not so easy to implement. But it's a good idea.
- Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to
see all errors in one place. Especially for GitLab and JUnit reports.

- **Misc**
- Rewrite in Go/Rust. It's a good idea to have a standalone binary with the same functionality.
  - Install via brew on macOS.
- Install via apt on Ubuntu.
- Use it as PHP SDK. Examples in Readme.
- Warnings about deprecated options and features.
  - Add option `--recommendation` to show a list of recommended rules for the schema, or potential issues in the CSV file or schema.
    It's useful when you are not sure which rules to use.
- Add option `--error=[level]` to show only errors with a specific level. It's useful when you have a lot of warnings and you want to see only errors.
- More examples and documentation.

PS. [There is a file](tests/schemas/todo.yml) with my ideas and imagination. It's not a valid schema file, just a draft.
I'm not sure if I will implement all of them, but I will try to do my best.

## Contributing

If you have any ideas or suggestions, feel free to open an issue or create a pull request.

```sh
# Fork the repo and build project
git clone git@github.com:jbzoo/csv-blueprint.git ./jbzoo-csv-blueprint
cd ./jbzoo-csv-blueprint
make build

# Make your local changes

# Autofix code style
make test-phpcsfixer-fix test-phpcs

# Run all tests and check code style
make test
make codestyle

# Create your pull request and check all tests in CI (GitHub Actions)
# ???
# Profit!
```

## License

[MIT License](LICENSE): It's like free pizza - enjoy it, share it, just don't sell it as your own. And remember, no warranty for stomach aches! 😅

## See also

- [Cli](https://github.com/JBZoo/Cli) - Framework helps create complex CLI apps and provides new tools for Symfony/Console.
- [CI-Report-Converter](https://github.com/JBZoo/CI-Report-Converter) - It converts different error reporting standards for popular CI systems.
- [Composer-Diff](https://github.com/JBZoo/Composer-Diff) - See what packages have changed after `composer update`.
- [Composer-Graph](https://github.com/JBZoo/Composer-Graph) - Dependency graph visualization of `composer.json` based on [Mermaid JS](https://mermaid.js.org/).
- [Mermaid-PHP](https://github.com/JBZoo/Mermaid-PHP) - Generate diagrams and flowcharts with the help of the
[mermaid](https://mermaid.js.org/) script language.
- [Utils](https://github.com/JBZoo/Utils) - Collection of useful PHP functions, mini-classes, and snippets for every day.
- [Image](https://github.com/JBZoo/Image) - Package provides object-oriented way to manipulate with images as simple as possible.
- [Data](https://github.com/JBZoo/Data) - Extended implementation of ArrayObject. Use Yml/PHP/JSON/INI files as config. Forget about arrays.
- [Retry](https://github.com/JBZoo/Retry) - Tiny PHP library providing retry/backoff functionality with strategies and jitter.

Just an interesting fact:

I've achieved a personal milestone. The [initial release](https://github.com/jbzoo/csv-blueprint/releases/tag/0.1) of
the project was crafted from the ground up in approximately 3 days, interspersed with regular breaks to care for a
4-month-old baby. Reflecting on the first commit and the earliest git tag, it's clear that this was accomplished over a
weekend, utilizing spare moments on my personal laptop.