https://github.com/microprediction/csvsdataset
Dataset from multiple CSV files
- Host: GitHub
- URL: https://github.com/microprediction/csvsdataset
- Owner: microprediction
- License: mit
- Created: 2023-03-27T21:42:47.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-03T20:29:10.000Z (11 months ago)
- Last Synced: 2025-01-28T23:56:56.116Z (3 months ago)
- Language: Python
- Size: 1.6 GB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# csvsdataset
`csvsdataset` is a Python library designed to simplify the process of working with multiple CSV files as a single dataset. The primary functionality is provided by the `CsvsDataset` class in the `csvsdataset.py` module.
This was written by ChatGPT4, as mentioned [here](https://www.linkedin.com/posts/petercotton_chatgpt4-opensource-python-activity-7047184874163597312-JTr3?utm_source=share&utm_medium=member_desktop). Issues will be copied and pasted into a session. It is an experiment in semi-autonomous code maintenance.
## Installation
To install the `csvsdataset` library, simply run:
```bash
pip install csvsdataset
```
## Usage
```python
from csvsdataset.csvsdataset import CsvsDataset

# Initialize the CsvsDataset instance
dataset = CsvsDataset(folder_path="path/to/your/csv/folder",
                      file_pattern="*.csv",
                      x_columns=["column1", "column2"],
                      y_column="target_column")

# Iterate over the dataset
for x_data, y_data in dataset:
    # Your processing code here
    pass

# Access a specific item in the dataset
x_data, y_data = dataset[42]
```
### Memory frugality
Only data from a small number of CSV files is kept in memory at any time. The
rest is discarded on a least-recently-used (LRU) basis. This class is intended for use
when there are too many data files to load into memory conveniently all at once.
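
To illustrate the LRU idea described above, here is a minimal, standard-library-only sketch. It is not the code used inside `CsvsDataset`; the class `LruCsvCache`, its `max_files` parameter, and the `demo` function are hypothetical names introduced only for this example.

```python
import csv
import tempfile
from collections import OrderedDict
from pathlib import Path


class LruCsvCache:
    """Hypothetical helper: holds the rows of at most `max_files` CSV files."""

    def __init__(self, max_files=2):
        self.max_files = max_files
        # Maps file path -> list of row dicts, ordered from least to most recently used.
        self._cache = OrderedDict()

    def rows(self, path):
        key = str(path)
        if key in self._cache:
            # Cache hit: mark this file as most recently used.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Cache miss: read the whole file into memory.
        with open(key, newline="") as f:
            data = list(csv.DictReader(f))
        self._cache[key] = data
        if len(self._cache) > self.max_files:
            # Evict the least recently used file.
            self._cache.popitem(last=False)
        return data

    def cached_files(self):
        return list(self._cache)


def demo():
    """Create three small CSV files and show which ones stay cached."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            p = Path(tmp) / f"part{i}.csv"
            p.write_text(f"column1,target_column\n{i},{i * 10}\n")
            paths.append(p)
        cache = LruCsvCache(max_files=2)
        cache.rows(paths[0])
        cache.rows(paths[1])
        cache.rows(paths[0])  # part0 becomes the most recently used file
        cache.rows(paths[2])  # cache is full, so part1 is evicted
        return [Path(name).name for name in cache.cached_files()]
```

With `max_files=2`, calling `demo()` leaves `part0.csv` and `part2.csv` in the cache: `part1.csv` was the least recently used file when the third file was loaded, so it is the one discarded. A larger limit trades memory for fewer re-reads from disk.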