https://github.com/jlpdeveloper/go-file-parsing

Last synced: 3 months ago
JSON representation
Host: GitHub
URL: https://github.com/jlpdeveloper/go-file-parsing
Owner: jlpdeveloper
License: mit
Created: 2025-05-29T01:08:01.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-19T02:12:46.000Z (about 1 year ago)
Last Synced: 2025-06-19T03:26:13.318Z (about 1 year ago)
Language: Go
Size: 67.4 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # GO File Parsing Experiments

This application is an experiment in using Go's high concurrency to parse file data and store in a Redis-compatible cache (Valkey). It demonstrates efficient techniques for processing large CSV files using Go's concurrency features.

> [!IMPORTANT]

> This application has no real world use, it is meant to be an experiment and possibly a model for how 

> Go can be used to efficiently read large files.

## Project Overview

This project demonstrates:

- Concurrent CSV file parsing using goroutines

- Efficient validation of data rows against business rules

- Caching valid data in a Redis-compatible database (Valkey)

- Memory-efficient processing using object pools

- Error handling and reporting

## Project Structure

```

go-file-parsing/

├── cache/              # Cache abstraction and implementation

│   ├── cache.go        # Cache interface definition

│   ├── parser_cache.go # Valkey implementation of cache

│   └── cache_test.go   # Tests for cache functionality

├── config/             # Configuration handling

│   └── config.go       # Parser configuration

├── loan_info/          # Domain-specific validation logic

│   ├── loan_info.go    # Main validation rules

│   ├── *_validations.go # Specific validation implementations

│   └── *_test.go       # Tests for validations

├── utils/              # Utility functions

│   └── string_utils.go # String manipulation utilities

├── validator/          # Generic validation framework

│   ├── row_validator.go # Row validation logic

│   ├── column_utils.go  # Column processing utilities

│   └── map_pool.go     # Memory-efficient map pool

├── main.go             # Application entry point

├── config.json         # Parser configuration

├── dev.compose.yml     # Docker Compose for development

└── sample.csv          # Sample data file

```

## How It Works

1. The application reads a CSV file line by line

2. For each row, it:

   - Allocates a validator from a pool

   - Validates the row concurrently using multiple validation rules

   - Stores valid data in the cache

   - Collects and stores validation errors in the cache

3. After processing, reports statistics on the run

The application uses Go's concurrency primitives (goroutines, channels, wait groups, and errgroup) to process rows efficiently.

## Results

Achieving optimal performance in this highly concurrent workload requires a careful balance between the number 

of row validators and cache writers. Setting these pool sizes too high or too low directly impacts both throughput 

and resource utilization. An excess of cache or error writers can reduce processing time, 

but at the cost of higher memory usage. Too few writers can dramatically slow down the process, 

even if memory usage is minimized. Similarly, increasing the number of row validators speeds up validation, 

but if writing can’t keep pace, system resources may be strained without further gains in throughput.

Below is a summary of experiments exploring these tradeoffs. The table is reordered to highlight how adjusting pool sizes affects performance and resources:

| Row Validators | Cache Writers | Error Writers | Time (s) | Avg per 10,000 rows | Memory Used | Notes                  |

|----------------|--------------|--------------|----------|---------------------|-------------|------------------------|

| 1,000          | 10,000       | 10,000       | 8.2      | 34ms                | 218MiB      | Others ran from GoLand |

| 1,000          | 10,000       | 10,000       | 18.0     | 75ms                | 81MiB       | In Docker Compose      |

| 10             | 10,000       | 10,000       | 11.7     | 48ms                | 70MiB       |                        |

| 10             | 500          | 500          | 11.9     | 49ms                | 57MiB       |                        |

| 10             | 10           | 10           | 159      | 677ms               | 39MiB       |                        |

**Interpretation:**

- **High pool sizes** (validators/writers in the thousands) drastically reduce processing time, but increase memory usage significantly.

- **Moderate pool sizes** offer reasonable performance with a balance of memory footprint and throughput.

- **Small pool sizes** minimize memory usage, but at a substantial cost to processing time.

Fine-tuning these parameters is essential: 

too few writers create a processing backlog, while overly large pools can exhaust memory. 

The ideal configuration depends on hardware and workload needs, 

but should always find a balance that maintains high throughput within memory constraints.

## Dependencies

- Go 1.24 or later

- [valkey-io/valkey-go](https://github.com/valkey-io/valkey-go) v1.0.60 - Redis-compatible client library

- [golang.org/x/sync](https://pkg.go.dev/golang.org/x/sync) v0.15.0 - Additional synchronization primitives

- [Valkey](https://valkey.io/) - Redis-compatible database (via Docker)

## Setup and Installation

### Prerequisites

- Go 1.24 or later (for local development)

- Docker and Docker Compose (for running with Docker)

### Option 1: Running with Docker

1. Clone the repository:

   ```bash

   git clone https://github.com/yourusername/go-file-parsing.git

   cd go-file-parsing

   ```

2. Build and run the application with Docker Compose:

   ```bash

   docker-compose up -d

   ```

   This will start both the Valkey container and the application container.

3. To use the sample file instead of the large dataset, modify the `docker-compose.yml` file:

   ```yaml

   app:

     # ... other settings ...

     command: ["sample.csv"]

   ```

4. To view logs:

   ```bash

   docker-compose logs -f app

   ```

### Option 2: Running Locally

1. Clone the repository:

   ```bash

   git clone https://github.com/yourusername/go-file-parsing.git

   cd go-file-parsing

   ```

2. Start the Valkey container:

   ```bash

   docker-compose -f dev.compose.yml up -d

   ```

3. Set the environment variable for Valkey:

   ```bash

   # For Windows PowerShell

   $env:VALKEY_URLS = "localhost:6379"

   # For Linux/macOS

   export VALKEY_URLS="localhost:6379"

   ```

4. Run the application:

   ```bash

   go run main.go

   ```

## Usage

The application is configured via the `config.json` file:

```json

{

  "HasHeader": true,

  "Delimiter": ",",

  "ExpectedColumns": 156

}

```

- `HasHeader`: Set to true if the CSV file has a header row

- `Delimiter`: The character used to separate columns

- `ExpectedColumns`: The expected number of columns in each row

To use your own CSV file, modify the `parseFile` function call in `main.go`:

```

// In main.go

parseFile("your-file.csv", cacheClient)

```

## Performance Considerations

- The application uses a pool of validators to limit memory usage

- Each row is processed concurrently, with validation rules applied in parallel

- The map pool pattern is used to reduce garbage collection pressure

- Buffer sizes are configurable to balance memory usage and performance

## Reuse considerations

If you wish to reuse this project, here are some considerations to help you adapt it to your needs:

### Extracting Validations as a Separate Package

The validation framework in this project is designed to be modular and reusable. You could extract the validation logic into its own package or even a separate library:

1. **Core Validation Framework**: The `validator` package contains the core validation framework, including:

   - `RowValidator` interface and `CsvRowValidator` implementation

   - `ColValidator` function type for individual column validations

   - Map pooling for efficient memory usage

2. **Domain-Specific Validations**: The `loan_info` package contains domain-specific validations that could be moved to a separate package:

   - Validation functions like `hasValidLoanAmount`, `hasValidInterestRate`, etc.

   - Error definitions specific to loan data validation

   - The validator registration in `loan_info.go`

### Adding Different File-Based Validators

To add support for different file types or data domains:

1. **Create a New Domain Package**: Similar to the `loan_info` package, create a new package for your domain:

   ```

   your-domain/

   ├── domain.go             # Register validators and create validator pool

   ├── errors.go             # Define domain-specific errors

   ├── validations.go        # Implement domain-specific validation functions

   └── row_shape_validations.go # Basic structure validations

   ```

2. **Implement Validation Functions**: Create functions that follow the `ColValidator` signature:

   ```

   // Example validation function

   func yourValidationFunction(ctx *validator.RowValidatorContext, cols []string) (map[string]string, error) {

       // Validation logic here

       // If validation passes, optionally return data to cache

       result := ctx.GetMap()

       result["key"] = "value"

       return result, nil // or return nil, yourError

   }

   ```

3. **Register Validators**: Create a slice of validators in your domain package:

   ```

   // Example validator registration

   var validators = []validator.ColValidator{

       isValidSize,

       yourValidationFunction1,

       yourValidationFunction2,

       // ...

   }

   ```

4. **Create Validator Pool**: Implement a function to create a pool of validators:

   ```

   // Example validator pool creation

   func NewRowValidatorPool(conf *config.ParserConfig, cache cache.DistributedCache, poolSize int) chan validator.CsvRowValidator {

       pool := make(chan validator.CsvRowValidator, poolSize)

       for i := 0; i < poolSize; i++ {

           pool <- validator.New(conf, cache, validators)

       }

       return pool

   }

   ```

5. **Update Main Application**: Modify `main.go` to use your new validator pool:

   ```

   // Example usage in main.go

   pool := your_domain.NewRowValidatorPool(&conf, cacheClient, 100)

   ```

### Best Practices for Extension

1. **Keep Validators Focused**: Each validator function should validate one specific aspect of the data.

2. **Use Error Constants**: Define error constants in an `errors.go` file for consistent error messages.

3. **Reuse Maps**: Always use the map pool (`ctx.GetMap()`) to get maps for returning data.

4. **Concurrent Safety**: Ensure your validators are safe for concurrent use.

5. **Testing**: Write comprehensive tests for each validator function.

6. **Configuration**: Use the configuration system to make your validators configurable.

7. **Documentation**: Document your validators and their expected behavior.

By following these guidelines, you can extend this project to handle different types of file parsing and validation while maintaining its performance and memory efficiency.

## Data

The data I'm using for this experiment is from [kaggle](https://www.kaggle.com/datasets/wordsforthewise/lending-club?resource=download)

The data has the following columns:

| Field Name                   | Description/Value    |

|------------------------------|----------------------|

| id                           |                      |

| member_id                    |                      |

| loan_amnt                    |                      |

| funded_amnt                  |                      |

| funded_amnt_inv              |                      |

| term                         |                      |

| int_rate                     |                      |

| installment                  |                      |

| grade                        |                      |

| sub_grade                    |                      |

| emp_title                    |                      |

| emp_length                   |                      |

| home_ownership               |                      |

| annual_inc                   |                      |

| verification_status          |                      |

| issue_d                      |                      |

| loan_status                  |                      |

| pymnt_plan                   |                      |

| url                          |                      |

| desc                         |                      |

| purpose                      |                      |

| title                        |                      |

| zip_code                     |                      |

| addr_state                   |                      |

| dti                          |                      |

| delinq_2yrs                  |                      |

| earliest_cr_line             |                      |

| fico_range_low               |                      |

| fico_range_high              |                      |

| inq_last_6mths               |                      |

| mths_since_last_delinq       |                      |

| mths_since_last_record       |                      |

| open_acc                     |                      |

| pub_rec                      |                      |

| revol_bal                    |                      |

| revol_util                   |                      |

| total_acc                    |                      |

| initial_list_status          |                      |

| out_prncp                    |                      |

| out_prncp_inv                |                      |

| total_pymnt                  |                      |

| total_pymnt_inv              |                      |

| total_rec_prncp              |                      |

| total_rec_int                |                      |

| total_rec_late_fee           |                      |

| recoveries                   |                      |

| collection_recovery_fee      |                      |

| last_pymnt_d                 |                      |

| last_pymnt_amnt              |                      |

| next_pymnt_d                 |                      |

| last_credit_pull_d           |                      |

| last_fico_range_high         |                      |

| last_fico_range_low          |                      |

| collections_12_mths_ex_med   |                      |

| mths_since_last_major_derog  |                      |

| policy_code                  |                      |

| application_type             |                      |

| annual_inc_joint             |                      |

| dti_joint                    |                      |

| verification_status_joint    |                      |

| acc_now_delinq               |                      |

| tot_coll_amt                 |                      |

| tot_cur_bal                  |                      |

| open_acc_6m                  |                      |

| open_act_il                  |                      |

| open_il_12m                  |                      |

| open_il_24m                  |                      |

| mths_since_rcnt_il           |                      |

| total_bal_il                 |                      |

| il_util                      |                      |

| open_rv_12m                  |                      |

| open_rv_24m                  |                      |

| max_bal_bc                   |                      |

| all_util                     |                      |

| total_rev_hi_lim             |                      |

| inq_fi                       |                      |

| total_cu_tl                  |                      |

| inq_last_12m                 |                      |

| acc_open_past_24mths         |                      |

| avg_cur_bal                  |                      |

| bc_open_to_buy               |                      |

| bc_util                      |                      |

| chargeoff_within_12_mths     |                      |

| delinq_amnt                  |                      |

| mo_sin_old_il_acct           |                      |

| mo_sin_old_rev_tl_op         |                      |

| mo_sin_rcnt_rev_tl_op        |                      |

| mo_sin_rcnt_tl               |                      |

| mort_acc                     |                      |

| mths_since_recent_bc         |                      |

| mths_since_recent_bc_dlq     |                      |

| mths_since_recent_inq        |                      |

| mths_since_recent_revol_delinq|                     |

| num_accts_ever_120_pd        |                      |

| num_actv_bc_tl               |                      |

| num_actv_rev_tl              |                      |

| num_bc_sats                  |                      |

| num_bc_tl                    |                      |

| num_il_tl                    |                      |

| num_op_rev_tl                |                      |

| num_rev_accts                |                      |

| num_rev_tl_bal_gt_0          |                      |

| num_sats                     |                      |

| num_tl_120dpd_2m             |                      |

| num_tl_30dpd                 |                      |

| num_tl_90g_dpd_24m           |                      |

| num_tl_op_past_12m           |                      |

| pct_tl_nvr_dlq               |                      |

| percent_bc_gt_75             |                      |

| pub_rec_bankruptcies         |                      |

| tax_liens                    |                      |

| tot_hi_cred_lim              |                      |

| total_bal_ex_mort            |                      |

| total_bc_limit               |                      |

| total_il_high_credit_limit   |                      |

| revol_bal_joint              |                      |

| sec_app_fico_range_low       |                      |

| sec_app_fico_range_high      |                      |

| sec_app_earliest_cr_line     |                      |

| sec_app_inq_last_6mths       |                      |

| sec_app_mort_acc             |                      |

| sec_app_open_acc             |                      |

| sec_app_revol_util           |                      |

| sec_app_open_act_il          |                      |

| sec_app_num_rev_accts        |                      |

| sec_app_chargeoff_within_12_mths|                   |

| sec_app_collections_12_mths_ex_med|                |

| sec_app_mths_since_last_major_derog|               |

| hardship_flag                |                      |

| hardship_type                |                      |

| hardship_reason              |                      |

| hardship_status              |                      |

| deferral_term                |                      |

| hardship_amount              |                      |

| hardship_start_date          |                      |

| hardship_end_date            |                      |

| payment_plan_start_date      |                      |

| hardship_length              |                      |

| hardship_dpd                 |                      |

| hardship_loan_status         |                      |

| orig_projected_additional_accrued_interest|         |

| hardship_payoff_balance_amount|                     |

| hardship_last_payment_amount |                      |

| disbursement_method          |                      |

| debt_settlement_flag         |                      |

| debt_settlement_flag_date    |                      |

| settlement_status            |                      |

| settlement_date              |                      |

| settlement_amount            |                      |

| settlement_percentage        |                      |

| settlement_term              |                      |

A sample file has been generated using ChatGPT.

## Rules

The below rules were generated as ways to determine if data that is being parsed in is "good" or "bad." They were generated

using ChatGPT to analyze the columns and give recommendations on rules to add complexity to the parsing

### ✅ Data Validation Rules

1. **Valid Loan and Funding** 

    - `loan_amnt` > 0 and `funded_amnt` == `funded_amnt_inv`.

2. **Reasonable Interest Rate** 

    - `int_rate` between 5% and 35%.

3. **Valid Grade/Subgrade** 

    - `grade` in [A–G], `sub_grade` matches pattern like `B3`.

4. **Valid Term** 

   - `term` is between 12 and 72 months

5. **Has Employment Info**

    - Non-empty `emp_title` and `emp_length` is not null.

6. **Low DTI and Home Ownership**

    - `dti` < 20, `home_ownership` in [MORTGAGE, OWN], and `annual_inc` > 40,000.

7. **Established Credit History**

    - `earliest_cr_line` not null and is > 10 years ago.

8. **Healthy FICO Score**

    - `fico_range_low` >= 660 and `fico_range_high` <= 850.

9. **Has Sufficient Accounts**

   - `total_acc` >= 5 and `open_acc` >= 2.

10. **Valid Income**

    - `annual_inc` > 30,000.

## Additional Data to Store

- `avg_cur_bal`

- `application_type`

  - `annual_inc_joint` if type: `Joint App`

- `tot_coll_amt`

- `acc_now_delinq`

## Contributing

This project is an experiment and demonstration. Feel free to fork it and adapt it to your needs. If you have suggestions for improvements, please open an issue or submit a pull request.

## License

This project is licensed under the terms found in the LICENSE file in the root of this repository.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jlpdeveloper/go-file-parsing

Awesome Lists containing this project

README