
# Go CSV Importer: Multi-threaded Fast CSV to Database Tool

## Overview
`Go CSV Importer` is a powerful, multi-threaded command-line tool designed to import large CSV files into databases quickly and efficiently. It supports various database types, offers customizable import modes, and is optimized for performance.

### Key Features
- **Support for Large CSV Files**: Handles millions of rows with ease.
- **Multi-Database Compatibility**: Works with SQLite, MySQL, PostgreSQL, and Firebird.
- **Configurable Import Modes**:
  - Transactional or non-transactional imports.
  - Batch SQL inserts or row-by-row operations.
  - Single or multiple database connections.

---

## Installation

Install the latest version directly from Go:

```bash
go install github.com/olbrichattila/gocsvimporter/cmd/csvimporter@latest
```
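
`go install` places the `csvimporter` binary in `$(go env GOPATH)/bin` (or `$GOBIN` if set), so make sure that directory is on your `PATH` before running the commands below.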

---

## How It Works

1. **Analyze the CSV**: Determines each column's data type and maximum length (a simplified sketch of this step follows the list).
2. **Prepare the Database**: Automatically creates the necessary table, dropping it if it already exists.
3. **Import Data**: Optimizes the process based on database type and chosen parameters.
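
The analysis in step 1 is what dominates start-up time on large files. As a rough illustration of the idea, and only a hedged sketch rather than the importer's actual code, type and length detection amounts to a single pass over the file:

```go
// A hedged sketch of the analysis pass, using only encoding/csv from the
// standard library. The real importer's heuristics may differ.
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strconv"
)

type field struct {
	Name   string
	Type   string // "INT" or "VARCHAR" in this sketch
	Length int
}

func analyze(path string, delimiter rune) ([]field, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.Comma = delimiter

	header, err := r.Read()
	if err != nil {
		return nil, err
	}

	// Start optimistic: every column is an INT until a non-numeric value appears.
	fields := make([]field, len(header))
	for i, name := range header {
		fields[i] = field{Name: name, Type: "INT"}
	}

	// Single pass over the data: widen lengths, demote INT to VARCHAR.
	for {
		row, err := r.Read()
		if err != nil {
			break // io.EOF (or a parse error) ends the scan in this sketch
		}
		for i, v := range row {
			if len(v) > fields[i].Length {
				fields[i].Length = len(v)
			}
			if fields[i].Type == "INT" {
				if _, convErr := strconv.Atoi(v); convErr != nil {
					fields[i].Type = "VARCHAR"
				}
			}
		}
	}
	return fields, nil
}

func main() {
	fields, err := analyze("data.csv", ';')
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, f := range fields {
		fmt.Printf("%s %s(%d)\n", f.Name, f.Type, f.Length)
	}
}
```

The real tool also deduplicates field names and can save the result for reuse (see the header management feature below); the sketch omits that for brevity.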

### Usage
Run the tool with:

```bash
csvimporter <csv file> <table name> [delimiter]
```

#### Parameters:
1. **CSV File**: Path to the CSV file.
2. **Table Name**: Target database table name.
3. **Delimiter** *(Optional)*: CSV delimiter (default: `,`).

#### Example:
```bash
csvimporter data.csv vehicles ";"
```
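
The delimiter is quoted so the shell does not interpret `;` as a command separator.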

## Field Header Management for Faster Imports
The CSV importer command-line tool includes a feature for handling CSV headers efficiently: the analyzed field definitions can be saved once and reused across imports.
### Feature Overview

When importing CSV files, the tool automatically:
1. **Analyzes the CSV content:** Deduplicates field names, determines proper field types, and calculates field sizes based on the data.
2. **Optimizes imports:** Speeds up processing for repeated file formats by allowing you to save and reuse header definitions.

This analysis can take time, especially with large files. If you're importing the same file format multiple times, you can now save the calculated headers for reuse, bypassing recalculation for subsequent imports.

### How It Works

#### Save Field Definitions:
Use the `-sh=` parameter to save the calculated field definitions to a JSON file. You can optionally edit this JSON file to fine-tune your header configuration.
```bash
csvimporter <csv file> <table name> [delimiter] -sh=customerFieldDefinition
```
This will generate a JSON file with the field definitions.

#### Load Field Definitions:
Use the `-lh=` parameter to load pre-saved field definitions, skipping the recalculation step.
```bash
csvimporter <csv file> <table name> [delimiter] -lh=customerFieldDefinition
```
When this parameter is provided, the import process starts immediately, but progress percentage updates will not be displayed.
#### Example JSON Structure
Here’s an example of the auto-saved JSON file generated by the tool:
```json
[
  {
    "name": "index",
    "Type": "INT",
    "Length": 7
  },
  {
    "name": "customer_id",
    "Type": "VARCHAR",
    "Length": 15
  },
  {
    "name": "first_name",
    "Type": "VARCHAR",
    "Length": 11
  },
  {
    "name": "last_name",
    "Type": "VARCHAR",
    "Length": 11
  },
  {
    "name": "company",
    "Type": "VARCHAR",
    "Length": 38
  },
  {
    "name": "city",
    "Type": "VARCHAR",
    "Length": 24
  },
  {
    "name": "country",
    "Type": "VARCHAR",
    "Length": 51
  },
  {
    "name": "phone_a1",
    "Type": "VARCHAR",
    "Length": 22
  },
  {
    "name": "phone_a2",
    "Type": "VARCHAR",
    "Length": 22
  },
  {
    "name": "email",
    "Type": "VARCHAR",
    "Length": 44
  },
  {
    "name": "subscription_date",
    "Type": "VARCHAR",
    "Length": 10
  },
  {
    "name": "website",
    "Type": "VARCHAR",
    "Length": 40
  }
]
```
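
Each entry of the saved file maps onto a small struct, which is handy if you edit the definitions by hand. The sketch below is an illustration only; the struct and the output file name are assumptions, not the importer's internal types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// FieldDefinition mirrors one entry of the saved header file shown above.
// Note the mixed-case keys: "name" is lower case, "Type" and "Length" are not.
type FieldDefinition struct {
	Name   string `json:"name"`
	Type   string `json:"Type"`
	Length int    `json:"Length"`
}

func loadDefinitions(path string) ([]FieldDefinition, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var defs []FieldDefinition
	if err := json.Unmarshal(data, &defs); err != nil {
		return nil, err
	}
	return defs, nil
}

func main() {
	// File name assumed for illustration.
	defs, err := loadDefinitions("customerFieldDefinition.json")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, d := range defs {
		fmt.Printf("%s %s(%d)\n", d.Name, d.Type, d.Length)
	}
}
```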
### Benefits
- **Time-Saving:** Skip repetitive calculations for consistent file formats.
- **Customizable:** Modify the saved JSON to align with specific import requirements.
- **Improved Efficiency:** Ideal for workflows requiring frequent CSV imports.

### Notes
- The `-lh` parameter disables the progress percentage display.
- Ensure the JSON file accurately reflects your data structure for seamless imports.

> By leveraging this feature, the CSV importer command line tool becomes even more efficient for managing recurring import tasks and large datasets.

## Supported Databases

### SQLite
```env
DB_CONNECTION=sqlite
DB_DATABASE=./database/database.sqlite
```

### MySQL
```env
DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=mydb
DB_USERNAME=myuser
DB_PASSWORD=mypassword
```

### PostgreSQL
```env
DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=postgres
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_SSLMODE=disable
```

### Firebird
```env
DB_CONNECTION=firebird
DB_HOST=127.0.0.1
DB_PORT=3050
DB_DATABASE=/path/to/database.fdb
DB_USERNAME=SYSDBA
DB_PASSWORD=masterkey
```
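
These variables are the usual ingredients of a driver connection string. As an illustration of how they are typically consumed (not necessarily how the importer builds its connections internally), the MySQL settings compose into a DSN in the common `go-sql-driver/mysql` format:

```go
package main

import (
	"fmt"
	"os"
)

// buildMySQLDSN composes a DSN in the go-sql-driver/mysql format
// (user:password@tcp(host:port)/dbname) from the variables documented above.
// Illustration only.
func buildMySQLDSN() string {
	return fmt.Sprintf("%s:%s@tcp(%s:%s)/%s",
		os.Getenv("DB_USERNAME"),
		os.Getenv("DB_PASSWORD"),
		os.Getenv("DB_HOST"),
		os.Getenv("DB_PORT"),
		os.Getenv("DB_DATABASE"),
	)
}

func main() {
	fmt.Println(buildMySQLDSN())
}
```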

---

## Configuration

Create a `.env.csvimporter` file in the application's directory or export environment variables:

```env
BATCH_SIZE=500 # Rows per batch (default: 100)
MAX_CONNECTION_COUNT=25 # Maximum connections (default: 10)
BATCH_INSERT=on # Enable/disable batch insert
MULTIPLE_CONNECTIONS=on # Enable/disable multi-threading
TRANSACTIONAL=off # Enable/disable transactions
```

*Note: Unsupported options for certain databases (e.g., SQLite) are ignored.*
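
To make these settings concrete, the sketch below shows the general technique they control: rows are grouped into `BATCH_SIZE` chunks, each chunk becomes one multi-row `INSERT` (`BATCH_INSERT=on`), and several workers execute chunks concurrently (`MULTIPLE_CONNECTIONS=on`, capped at `MAX_CONNECTION_COUNT`). This is an illustration of the approach, not the importer's actual source:

```go
// Package importsketch illustrates batched, concurrent inserts; it is not
// part of gocsvimporter.
package importsketch

import (
	"database/sql"
	"fmt"
	"strings"
	"sync"
)

// insertBatch writes one chunk of rows as a single multi-row INSERT
// (conceptually what BATCH_INSERT=on enables). "?" placeholders are the
// MySQL/SQLite style; PostgreSQL would need $1, $2, ... instead.
func insertBatch(db *sql.DB, table string, cols []string, rows [][]interface{}) error {
	rowPlaceholder := "(" + strings.TrimSuffix(strings.Repeat("?,", len(cols)), ",") + ")"
	placeholders := make([]string, len(rows))
	args := make([]interface{}, 0, len(rows)*len(cols))
	for i, r := range rows {
		placeholders[i] = rowPlaceholder
		args = append(args, r...)
	}
	query := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		table, strings.Join(cols, ","), strings.Join(placeholders, ","))
	_, err := db.Exec(query, args...)
	return err
}

// importBatches fans chunks out to workerCount goroutines, mirroring
// MULTIPLE_CONNECTIONS=on; database/sql pools connections, so concurrent
// workers use separate connections up to the SetMaxOpenConns limit.
func importBatches(db *sql.DB, table string, cols []string, batches <-chan [][]interface{}, workerCount int) {
	db.SetMaxOpenConns(workerCount)
	var wg sync.WaitGroup
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				if err := insertBatch(db, table, cols, batch); err != nil {
					fmt.Println("batch failed:", err)
				}
			}
		}()
	}
	wg.Wait()
}
```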

---

## Performance

### Speed Test: 2 Million Rows
**System Configuration**:
- **OS**: Ubuntu Linux
- **Processor**: Intel® Core™ i7-3770S @ 3.10GHz
- **Storage**: SSD

| Database | Duration | Mode | Threads |
|-------------|------------|--------------------------|---------|
| **SQLite** | 52 seconds | Transactional, Batch SQL | 1 |
| **MySQL** | 65 seconds | Transactional, Multi-Threaded | 10 |
| **PostgreSQL** | 43 seconds | Transactional, Multi-Threaded | 10 |
| **Firebird**| 5m 42s | Transactional, Multi-Threaded | 10 |

---

## Makefile Targets

### Test Imports
```bash
make vehicles
make customers
```

### Switch Environments
```bash
make switch-sqlite
make switch-mysql
make switch-pgsql
make switch-firebird
```

---

## Local Testing with Docker
A `docker-compose` setup is provided for testing:

```bash
cd docker
docker-compose up -d
```

---

## Roadmap

### Planned Improvements
- **Distributed Import**: Split CSV files across multiple instances (pods/servers) for faster parallel imports.
- **Enhanced Configuration**: Support more advanced database-specific settings.

---

Start importing your CSV files faster and more efficiently today!

## About me:
- Learn more about me on my personal website. https://attilaolbrich.co.uk/menu/my-story
- Check out my latest blog post on my personal page. https://attilaolbrich.co.uk/blog/1/single