{"id":51417969,"url":"https://github.com/abhimehro/series_correction_project_updated","last_synced_at":"2026-07-04T22:00:25.364Z","repository":{"id":290655552,"uuid":"968894349","full_name":"abhimehro/series_correction_project_updated","owner":"abhimehro","description":"This project provides tools to automatically detect and correct discontinuities (such as jumps, gaps, or outliers) commonly found in time-series data from Seatek sensors. The goal is to produce cleaner, more reliable datasets for further analysis, suitable for integration with systems like NESST II.","archived":false,"fork":false,"pushed_at":"2026-07-04T14:00:11.000Z","size":1906,"stargazers_count":1,"open_issues_count":23,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-07-04T15:21:04.155Z","etag":null,"topics":["data-analysis","data-science","data-visualization","jupyter-notebook","python"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abhimehro.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-04-18T23:55:00.000Z","updated_at":"2026-07-03T17:51:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"30229930-d50d-4268-88d0-3f62b600a150","html_url":"https://github.com/abhimehro/series_correction_project_updated","commit_stats":null,"previous_names":["abhimehro/series_correction_project_updated"],"tags_count":1,"template":false,"template_fu
ll_name":null,"purl":"pkg:github/abhimehro/series_correction_project_updated","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhimehro%2Fseries_correction_project_updated","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhimehro%2Fseries_correction_project_updated/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhimehro%2Fseries_correction_project_updated/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhimehro%2Fseries_correction_project_updated/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abhimehro","download_url":"https://codeload.github.com/abhimehro/series_correction_project_updated/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhimehro%2Fseries_correction_project_updated/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35136712,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-04T02:00:05.987Z","response_time":113,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-science","data-visualization","jupyter-notebook","python"],"created_at":"2026-07-04T22:00:16.550Z","updated_at":"2026-07-04T22:00:25.349Z","avatar_url":"https://github.com/abhimehro.png","language":"Shell","funding_links":[
],"categories":[],"sub_categories":[],"readme":"# Series Correction Project (Seatek Sensor Data)\n\n[![CodeScene general](https://codescene.io/images/analyzed-by-codescene-badge.svg)](https://codescene.io/projects/80827)\n[![Code Coverage](https://codecov.io/gh/abhimehro/series_correction_project_updated/branch/main/graph/badge.svg)](https://codecov.io/gh/abhimehro/series_correction_project_updated) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n[![CodeScene Average Code Health](https://codescene.io/projects/80827/status-badges/average-code-health)](https://codescene.io/projects/80827)\n[![CodeScene Hotspot Code Health](https://codescene.io/projects/80827/status-badges/hotspot-code-health)](https://codescene.io/projects/80827)\n[![CodeScene System Mastery](https://codescene.io/projects/80827/status-badges/system-mastery)](https://codescene.io/projects/80827)\n[![CodeScene Missed Goals](https://codescene.io/projects/80827/status-badges/missed-goals)](https://codescene.io/projects/80827)\n\nThis project provides tools to automatically detect and correct discontinuities (such as jumps, gaps, or outliers) commonly found in time-series data from Seatek sensors. 
The goal is to produce cleaner, more reliable datasets for further analysis, suitable for integration with systems like NESST II.\n\n## Table of Contents\n\n- [Problem Statement](#problem-statement)\n- [Features](#features)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Configuration](#configuration)\n- [Data Format](#data-format)\n  - [Input Data Requirements](#input-data-requirements)\n  - [Sample Data](#sample-data)\n- [Methodology](#methodology)\n- [Project Structure](#project-structure)\n- [Testing](#testing)\n- [Contributing](#contributing)\n- [License](#license)\n- [Contact](#contact)\n- [Acknowledgements](#acknowledgements)\n- [Future Improvements](#future-improvements)\n- [Workflow for Efficient Batch Processing, QA, and Visualization Updates (2025)](#workflow-for-efficient-batch-processing-qa-and-visualization-updates-2025)\n\n## Problem Statement\n\nSeatek sensors, like many environmental or industrial sensors, can produce time-series data with various imperfections:\n\n- **Jumps/Shifts:** Sudden, persistent changes in the baseline value.\n- **Gaps:** Missing data points or periods.\n- **Outliers:** Spurious, isolated spikes or drops.\n- **Drift:** Slow, gradual changes unrelated to the measured phenomenon.\n\nManually identifying and correcting these discontinuities is time-consuming, subjective, and prone to errors. 
This project aims to automate this process, providing a consistent and efficient way to improve data quality.\n\n## Features\n\n- **Discontinuity Detection:** Implements algorithms (`processor.py`) to automatically identify potential jumps, gaps, and outliers in Seatek sensor time-series data.\n- **Discontinuity Correction:** Offers methods (`processor.py`) to correct detected discontinuities:\n  - Interpolation for gaps (e.g., 'time', 'linear').\n  - Offset correction for jumps.\n  - Replacement for outliers (e.g., 'median', 'mean', 'interpolate').\n- **Configurable Parameters:** Allows tuning of detection sensitivity (e.g., thresholds, window sizes) and selection of correction methods via configuration files (`config.json`).\n- **Batch Processing:** Capable of processing multiple data files (`S\u003cseries\u003e_Y\u003cindex\u003e.txt`) efficiently based on series, year range, and river mile criteria (`batch_correction.py`).\n- **Command-Line Interface:** Provides a CLI tool (`seatek-correction`) for easy execution (`series_correction_cli.py`).\n- **Reporting:** Generates a summary CSV file (`Batch_Processing_Summary.csv`) and detailed logs (`processing_log.txt`).\n- **Testing \u0026 CI:** Includes unit tests (`tests/`) and a GitHub Actions workflow (`.github/workflows/python-tests.yml`) for automated testing.\n\n## Installation\n\n**Prerequisites:**\n\n- Python \u003e= 3.8\n- `pip` (Python package installer)\n- `git` (for cloning the repository)\n- (Optional: `virtualenv` or `conda` for environment management)\n\n**Steps:**\n\n1. **Clone the repository:**\n\n   ```bash\n   git clone https://github.com/abhimehro/series_correction_project_updated.git\n   cd series_correction_project_updated\n   ```\n\n2. 
**Create and activate a virtual environment (Recommended):**\n\n   ```bash\n   # Using venv\n   python -m venv venv\n   source venv/bin/activate  # On Windows use `venv\\Scripts\\activate`\n\n   # Or using conda\n   # conda create --name seatek_corr python=3.9\n   # conda activate seatek_corr\n   ```\n\n3. **Install the package and its dependencies:**\n   This command uses `setup.py` to install the project, including the CLI tool and required libraries from `requirements.txt`.\n\n   ```bash\n   pip install -e .\n   ```\n\n   _(The `-e` flag installs the project in \"editable\" mode, meaning changes to the source code are reflected immediately without needing to reinstall.)_\n\n## Usage\n\nThe primary way to use the tool is via the `seatek-correction` command-line interface.\n\n**Command-Line Arguments:**\n\n```bash\nseatek-correction --help\nOutput:\nusage: seatek-correction [-h] [--series SERIES] --river-miles RIVER_MILES RIVER_MILES --years YEARS YEARS [--dry-run] [--config CONFIG_PATH] [--output OUTPUT_DIR] [--log LOG_FILE]\n\nRun series correction batch processing on sensor data.\n\noptions:\n  -h, --help            show this help message and exit\n  --series SERIES       Series number to process, or 'all' for all available series. (default: all)\n  --river-miles RIVER_MILES RIVER_MILES\n                        Upstream and downstream river mile markers (e.g., 54.0 53.0). (required)\n  --years YEARS YEARS   Start and end years of data to process (e.g., 1995 2014). (required)\n  --dry-run             If set, process data without saving output files.\n  --config CONFIG_PATH  Path to the configuration JSON file. 
(default: scripts/config.json)\n  --output OUTPUT_DIR   Directory to save output files (default: data/output/).\n  --log LOG_FILE        Path to the log file (default: processing_log.txt).\n```\n\n**Example: Process Series 26 data between river miles 53.0 and 54.0 for the years 1995 to 2014, using a custom configuration and saving output to ./corrected_output/.**\n\n```bash\n# This command processes data for a specific series (26) within the given river mile range and years.\n# It uses a custom config file and directs output to a specific folder.\nseatek-correction --series 26 --river-miles 54.0 53.0 --years 1995 2014 --config custom_config.json --output ./corrected_output/\n```\n\n**Example (Dry Run for All Series): Perform a dry run (no files saved) for all series associated with river miles 53.0-54.0 between 1995-1996, logging to a specific file.**\n\n```bash\n# This command simulates processing for all applicable series in the range without saving output files.\n# It's useful for checking which files would be processed and verifying logs.\nseatek-correction --series all --river-miles 54.0 53.0 --years 1995 1996 --dry-run --log dry_run.log\n```\n\n## Configuration\n\nThe processing behavior is controlled by a JSON configuration file (default: `scripts/config.json`). See `config.json` for the structure. 
Key sections include:\n\n- **series** (Optional): Maps series numbers to specific diagnostic or raw data file paths (less used now as batch processing relies on file patterns).\n- **defaults / processor_config**: Parameters passed to the `processor.py` module:\n  - `window_size`: Rolling window size (default: 5).\n  - `threshold`: Sensitivity threshold for jumps/outliers (default: 3.0).\n  - `gap_threshold_factor`: Multiplier for gap detection (default: 3.0).\n  - `gap_method`: Gap interpolation method ('time', 'linear', etc.).\n  - `outlier_method`: Outlier correction method ('median', 'mean', 'interpolate', 'remove').\n  - `time_col`: Name of the time column (default: \"Time (Seconds)\").\n  - `value_col`: Name of the value column (default: null for auto-detect).\n- **RAW_DATA_DIR** (Optional): Path to the directory containing input .txt files (defaults to `./data/`).\n- **RIVER_MILE_MAP_PATH** (Optional): Path to the JSON file mapping sensors to river miles (defaults to `scripts/river_mile_map.json`).\n\n## Data Format\n\n### Input Data Requirements\n\n- **Location**: Input files should reside in the directory specified by `RAW_DATA_DIR` in the config, or the `./data/` directory by default.\n- **Filename Pattern**: Files must follow the pattern `S\u003cseries\u003e_Y\u003cindex\u003e.txt`, where `\u003cseries\u003e` is the series number (e.g., 26) and `\u003cindex\u003e` is a zero-padded sequential index representing the year within the processing range (e.g., Y01 for the first year, Y02 for the second, etc.).\n- **File Format**: Text files (.txt).\n- **Content**: Space- or tab-delimited columns. Must contain at least a time column and one numeric sensor value column. Header rows are typically absent or should be handled by `pandas.read_csv` (e.g., using `comment='#'`). 
Data should be roughly chronological.\n\n**Example File (data/S26_Y01.txt)**\n\n```\n# Example data for Series 26, Year 1 (e.g., 1995)\n# Time(Seconds) Value\n100             23.5\n102             23.6\n104             23.5\n108             23.7  # Potential small gap here\n110             23.8\n112             50.0  # Potential outlier\n114             23.9\n116             30.1  # Potential jump start\n118             30.2\n120             30.0\n```\n\n### Sample Data\n\n**Status: Required**\n\nThis repository does not include sample `S\u003cseries\u003e_Y\u003cindex\u003e.txt` files due to data sensitivity or size. To run the processing and tests effectively, you must provide your own representative (anonymized if necessary) data files and place them in the `./data/` directory.\n\n**Action Required**: Create files like `data/S26_Y01.txt`, `data/S27_Y01.txt`, etc., using your Seatek data, ensuring they match the required filename pattern and content format described above.\n\n**Importance**: Without these files in the `./data/` directory, the batch processing script will not find any input, and end-to-end testing cannot be performed.\n\n## Methodology\n\nThe correction process involves sequentially detecting and correcting gaps, outliers, and jumps using statistical methods within rolling windows. 
For detailed steps and algorithms, please refer to:\n\n`docs/correction_methodology.md`\n\n## Project Structure\n\n```\nseries_correction_project_updated/\n│\n├── data/                     # Input data directory (MUST ADD SAMPLE S\u003cseries\u003e_Y\u003cindex\u003e.txt FILES HERE)\n│   └── S26_Y01.txt           # Example required input file format\n│\n├── output/                   # Default directory for corrected data and reports\n│   ├── S26_Y01_1995_CorrectedData.csv # Example output format\n│   └── Batch_Processing_Summary.csv   # Example summary output\n│\n├── scripts/                  # Source code package\n│   ├── __init__.py\n│   ├── batch_correction.py\n│   ├── loaders.py\n│   ├── processor.py\n│   ├── series_correction_cli.py\n│   ├── config.json           # Default configuration\n│   ├── river_mile_map.json   # Default mapping\n│   └── requirements.txt\n│\n├── tests/                    # Unit and integration tests\n│   ├── __init__.py\n│   └── test_batch_correction.py # Example test file\n│   # (Add other test files like test_processor.py here)\n│\n├── docs/                     # Documentation files\n│   ├── correction_methodology.md\n│   └── automation_setup.md\n│\n├── .github/                  # GitHub Actions Workflows\n│   └── workflows/\n│       └── python-tests.yml\n│\n├── .gitignore\n├── LICENSE\n├── README.md                 # This file\n└── setup.py                  # Installation script\n```\n\n## Testing\n\nThis project uses pytest for testing.\n\n1. Ensure development dependencies are installed (included in `pip install -e .` if requirements.txt includes them, or use a separate requirements-dev.txt).\n\n   ```bash\n   pip install pytest pytest-cov pytest-mock\n   ```\n\n2. Navigate to the project's root directory.\n\n3. Run tests:\n\n   ```bash\n   pytest\n   ```\n\n4. 
Run tests with coverage:\n\n   ```bash\n   pytest --cov=scripts tests/\n   ```\n\n(Note: Comprehensive testing, especially integration tests, requires the presence of sample data files in the `data/` directory.)\n\n## Contributing\n\nContributions are welcome! Please follow standard GitHub Fork \u0026 Pull Request workflows. Ensure code includes tests, passes linting/formatting checks, and updates documentation where necessary.\n\nIf CodeScene fails on a PR during review/salvage sessions, post:\n\n```bash\n/cs-agent skill:fix-code-health-degradations\n```\n\nThen continue salvage verification after that run finishes.\n\n## License\n\nThis project is licensed under the MIT License. See the LICENSE file for details.\n\n## Contact\n\nFor questions or support, please open an issue on the GitHub repository:\n\n\u003chttps://github.com/abhimehro/series_correction_project_updated/issues\u003e\n\nAlternatively, contact Abhi Mehrotra \u003cAbhiMhrtr@pm.me\u003e\n\n## Acknowledgements\n\n- **Libraries**: pandas, numpy, click\n- **Testing**: pytest\n- **Guidance**: Audit report and best practices.\n- **Support**: Baton Rouge Community College (BRCC), Louisiana State University (LSU), LSU Center for River Studies.\n\n## Future Improvements\n\nBased on the audit report and potential enhancements:\n\n- **Sample Data**: Add standardized, anonymized sample data files.\n- **Integration Tests**: Develop tests that run the full pipeline on sample data.\n- **Visualization**: Implement a visualization.py module (e.g., using Matplotlib/Seaborn) to plot raw vs. corrected data, highlighting changes. 
Add a --plot flag to the CLI.\n- **Data Validation**: Add more explicit data validation checks during loading (e.g., expected columns, value ranges).\n- **Configuration**: Refactor configuration loading, potentially using Pydantic or dataclasses for better structure and validation.\n- **Error Handling**: Enhance error reporting and recovery within the batch process.\n- **Documentation**: Generate API reference documentation (e.g., using Sphinx). Add more examples and tutorials (Jupyter Notebooks).\n- **Performance**: Profile and optimize processing for very large datasets if needed.\n- **Packaging**: Consider using pyproject.toml for modern packaging standards.\n- **Dependency Pinning**: Provide a pinned requirements.txt for reproducible environments.\n\n## Workflow for Efficient Batch Processing, QA, and Visualization Updates (2025)\n\n### 1. Add New Data Files\n\n- Place new raw `.txt` files in the `data/` directory.\n- Update `scripts/config.json` to include new series and files (e.g., for Series 28, 29, etc.).\n\n### 2. Run Batch Processing\n\n- From the project root, run:\n  ```bash\n  python scripts/manual_batch_run.py\n  ```\n- This processes all files listed in the config, corrects outliers/gaps/jumps, and writes output `.xlsx` files to `data/output/`.\n\n### 3. Export Comparison Sheets (for QA and Transparency)\n\n- To generate Excel files comparing raw and processed data (with outlier flags), run:\n  ```bash\n  python scripts/export_comparison_sheets.py\n  ```\n- This creates a `comparisons/` subfolder in `data/output/`, with one comparison Excel file per processed dataset.\n- Each file contains columns for raw value(s), processed value(s), and an `Outlier_Flag`.\n\n### 4. 
Update Excel Visualizations\n\n- **Recommended:** Point your Excel charts/tables directly to the processed `.xlsx` files in `data/output/`.\n  - This ensures your visualizations always reflect the latest, cleaned data.\n- **To compare raw and adjusted data visually:**\n  - Use the comparison Excel files in `data/output/comparisons/`.\n  - Add both raw and processed columns to your chart for before/after views.\n- **To update existing charts:**\n  - Overwrite the data range in your workbook with processed data, or use Excel's \"Select Data\" to re-link charts.\n\n### 5. For New Series or Years\n\n- Add new entries to `config.json`.\n- Repeat steps 2–4. No code changes needed.\n\n### 6. Troubleshooting \u0026 QA\n\n- Check logs and summary files for errors or warnings.\n- Use the comparison sheets to spot-check corrections and outlier handling.\n\n---\n\n## Additional Resources \u0026 Tips for Visualization Updates\n\n- **Excel Shortcuts:**\n  - Use \"Data \u003e Get Data \u003e From File\" to import processed `.xlsx` files.\n  - Use \"Select Data\" on a chart to quickly change its data source.\n  - Use filters or conditional formatting to highlight outliers using the `Outlier_Flag` column.\n- **Best Practices:**\n  - Always keep a backup of your original workbook.\n  - Document any manual changes for reproducibility.\n- **For LSU Center for River Studies:**\n  - Bring both the processed `.xlsx` files and the comparison files for transparency.\n  - If asked about corrections, show the `Outlier_Flag` and before/after columns.\n- **Further Automation:**\n  - If you need to automate chart updates or reporting, consider using Python libraries like `openpyxl` or `xlwings`.\n  - For advanced analytics or custom plots, Jupyter notebooks with `pandas` and `matplotlib` are recommended.\n\n---\n\n## Quick Reference\n\n- **Process all data:** `python scripts/manual_batch_run.py`\n- **Export comparison sheets:** `python scripts/export_comparison_sheets.py`\n- **Processed 
files:** `data/output/`\n- **Comparison files:** `data/output/comparisons/`\n- **Add new series:** Edit `scripts/config.json`\n\nIf you have any questions or need further scripts for automation or visualization, please reach out to the project maintainer.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhimehro%2Fseries_correction_project_updated","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhimehro%2Fseries_correction_project_updated","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhimehro%2Fseries_correction_project_updated/lists"}