{"id":29383424,"url":"https://github.com/olympus-terminal/data-processing","last_synced_at":"2026-05-16T21:02:22.974Z","repository":{"id":300154486,"uuid":"1004728383","full_name":"olympus-terminal/data-processing","owner":"olympus-terminal","description":"Data analysis and processing tools","archived":false,"fork":false,"pushed_at":"2025-06-20T05:23:26.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-20T06:23:04.678Z","etag":null,"topics":["automation","data-analysis","data-processing","data-science","etl","machine-learning","pdf-extraction","python","r","research","statistics","web-scraping"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/olympus-terminal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-19T05:05:21.000Z","updated_at":"2025-06-20T05:22:07.000Z","dependencies_parsed_at":"2025-06-20T06:33:13.672Z","dependency_job_id":null,"html_url":"https://github.com/olympus-terminal/data-processing","commit_stats":null,"previous_names":["olympus-terminal/data-processing"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/olympus-terminal/data-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olympus-terminal%2Fdata-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olympus-terminal%2Fdata-processing/tags","releases_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/olympus-terminal%2Fdata-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olympus-terminal%2Fdata-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/olympus-terminal","download_url":"https://codeload.github.com/olympus-terminal/data-processing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olympus-terminal%2Fdata-processing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264526961,"owners_count":23623194,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","data-analysis","data-processing","data-science","etl","machine-learning","pdf-extraction","python","r","research","statistics","web-scraping"],"created_at":"2025-07-10T04:01:48.327Z","updated_at":"2026-05-16T21:02:17.943Z","avatar_url":"https://github.com/olympus-terminal.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Processing Tools\n\n\u003e Modern data analysis and processing toolkit featuring Python scripts, R utilities, web scraping tools, and intelligent automation solutions.\n\n[![License](https://img.shields.io/github/license/olympus-terminal/data-processing)](LICENSE)\n[![GitHub stars](https://img.shields.io/github/stars/olympus-terminal/data-processing?style=social)](https://github.com/olympus-terminal/data-processing/stargazers)\n[![GitHub 
issues](https://img.shields.io/github/issues/olympus-terminal/data-processing)](https://github.com/olympus-terminal/data-processing/issues)\n[![GitHub last commit](https://img.shields.io/github/last-commit/olympus-terminal/data-processing)](https://github.com/olympus-terminal/data-processing/commits/main)\n[![Tools](https://img.shields.io/badge/tools-11-green.svg)](https://github.com/olympus-terminal/data-processing)\n[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org)\n[![R](https://img.shields.io/badge/R-4.0+-276DC3.svg)](https://www.r-project.org)\n\n## 📊 Overview\n\nA versatile collection of data processing tools designed for researchers, data scientists, and analysts. This toolkit bridges the gap between raw data and insights, offering solutions for common data manipulation tasks, web scraping, format conversion, and analysis workflows.\n\n### Core Strengths\n\n- **Multi-Language Support**: Leverage Python's versatility and R's statistical power\n- **Automation First**: Tools designed for batch processing and pipelines\n- **Format Agnostic**: Handle CSV, TSV, PDF, JSON, and custom formats\n- **Research-Oriented**: Citation analysis, data pivoting, and academic workflows\n- **Web-Aware**: Intelligent web scraping with respect for robots.txt\n\n## 📁 Repository Structure\n\n```\ndata-processing/\n├── python-tools/       # Python-based processing utilities\n├── r-scripts/         # R scripts for statistical analysis\n├── web-scraping/      # Web data extraction tools\n├── format-conversion/ # File format converters\n├── data-analysis/     # Analysis and aggregation tools\n└── ml_dir_config.txt  # Machine learning directory configuration\n```\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n```bash\n# Python requirements\npython --version  # 3.7 or higher\npip install pandas numpy requests beautifulsoup4 PyPDF2 matplotlib seaborn\n\n# R requirements\nR --version  # 4.0 or higher\n# Install R packages (in R console):\n# 
install.packages(c(\"dplyr\", \"reshape2\", \"tidyverse\", \"data.table\"))\n\n# System requirements\n# bash \u003e= 4.0\n# awk, sed, grep (standard Unix tools)\n```\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/olympus-terminal/data-processing.git\ncd data-processing\n\n# Install Python dependencies\npip install -r requirements.txt  # or manually:\npip install pandas numpy requests beautifulsoup4 PyPDF2 \\\n            matplotlib seaborn scipy scikit-learn\n\n# Make scripts executable\nfind . -name \"*.py\" -o -name \"*.sh\" | xargs chmod +x\n```\n\n## 🔧 Tool Showcase\n\n### Python Tools (`python-tools/`)\n\n#### Data Chunking and Processing\n```bash\n# Break large datasets into manageable chunks\npython break-to-100s.py huge_dataset.csv --chunk-size 100\n# Creates: chunk_001.csv, chunk_002.csv, etc.\n\n# Process timestamps and split samples\npython SampleSplitStamp.py experiment_data.csv \\\n    --timestamp-col \"date\" \\\n    --split-ratio 0.7\n```\n\n#### Intelligent Automation\n```bash\n# Two-agent automation system\npython twoagent-DRN-autogen-mod.py \\\n    --config automation_config.json \\\n    --mode \"collaborative\"\n```\n\n### R Scripts (`r-scripts/`)\n\n#### Advanced Pivot Tables\n```bash\n# Create complex pivot tables with multiple aggregations\nRscript pivot-table-maker.R \\\n    --input sales_data.csv \\\n    --rows \"product,region\" \\\n    --cols \"quarter\" \\\n    --values \"revenue,units\" \\\n    --fun \"sum,mean\"\n```\n\n#### Statistical Pivoting\n```bash\n# Quick pivoting for statistical analysis\nRscript pivot.r data.csv id_vars value_vars\n```\n\n### Web Scraping (`web-scraping/`)\n\n#### Academic PDF Scraper\n```bash\n# Scrape Y-chromosome research PDFs\npython web-scraper_Ychr-pdfs.py \\\n    --start-url \"https://research-site.com\" \\\n    --output-dir \"./pdfs/\" \\\n    --max-depth 3 \\\n    --respectful  # Follows robots.txt\n```\n\n### Format Conversion (`format-conversion/`)\n\n#### PDF to 
Text Extraction\n```bash\n# Extract text from PDFs with layout preservation\npython pdf_to_txt_argv.py research_paper.pdf \\\n    --output extracted_text.txt \\\n    --preserve-layout \\\n    --encoding utf-8\n```\n\n### Data Analysis (`data-analysis/`)\n\n#### Citation Analysis\n```bash\n# Analyze citations in academic texts\npython CountCitations.py manuscript.txt \\\n    --style \"APA\" \\\n    --output citation_report.csv\n```\n\n#### AWK-based Tallying\n```bash\n# Fast tallying of categorical data\n./tally-awk category_column data.txt \u003e frequency_table.txt\n```\n\n## 📊 Real-World Workflows\n\n### 1. Research Data Pipeline\n```bash\n# Step 1: Extract data from PDFs\nfor pdf in literature/*.pdf; do\n    python format-conversion/pdf_to_txt_argv.py \"$pdf\" \\\n        --output \"texts/$(basename \"$pdf\" .pdf).txt\"\ndone\n\n# Step 2: Count citations\npython data-analysis/CountCitations.py texts/*.txt \\\n    --output citations_summary.csv\n\n# Step 3: Create visualization\npython python-tools/visualize_citations.py citations_summary.csv\n```\n\n### 2. Large Dataset Processing\n```bash\n# Split large file\npython python-tools/break-to-100s.py massive_dataset.csv\n\n# Process each chunk in parallel\nfor chunk in chunk_*.csv; do\n    python process_chunk.py \"$chunk\" \u0026\ndone\nwait\n\n# Merge results\npython merge_results.py chunk_*.processed.csv \u003e final_results.csv\n```\n\n### 3. 
Web Data Collection\n```bash\n# Scrape data respectfully\npython web-scraping/web-scraper_Ychr-pdfs.py \\\n    --config scraping_config.json \\\n    --rate-limit 1  # 1 request per second\n\n# Process downloaded content\nfor pdf in downloads/*.pdf; do\n    python format-conversion/pdf_to_txt_argv.py \"$pdf\"\ndone\n\n# Analyze extracted text\npython analyze_papers.py downloads/*.txt\n```\n\n## 🎯 Advanced Features\n\n### Machine Learning Integration\n\nThe `ml_dir_config.txt` file configures directories for ML workflows:\n```\ndata/\n├── raw/           # Original datasets\n├── processed/     # Cleaned data\n├── features/      # Feature engineering\n├── models/        # Trained models\n└── predictions/   # Model outputs\n```\n\n### Parallel Processing\n\nMany tools support parallel execution:\n```bash\n# Set parallel jobs\nexport PARALLEL_JOBS=4\n\n# Use GNU parallel for batch processing\nparallel -j 4 python process_file.py {} ::: *.csv\n```\n\n### Memory-Efficient Processing\n\n```python\n# Process large files in chunks\nimport pandas as pd\n\nfor i, chunk in enumerate(pd.read_csv('huge_file.csv', chunksize=10000)):\n    # process_chunk is a placeholder for your own per-chunk function\n    result = process_chunk(chunk)\n    # Write the header once, then append without the index column\n    result.to_csv('output.csv', mode='w' if i == 0 else 'a',\n                  header=(i == 0), index=False)\n```\n\n## 🔧 Configuration\n\n### Environment Setup\n```bash\n# Create virtual environment\npython -m venv data-proc-env\nsource data-proc-env/bin/activate  # Linux/Mac\n# or\ndata-proc-env\\Scripts\\activate  # Windows\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n### Tool Configuration\n\nMost tools support configuration files:\n```json\n{\n  \"processing\": {\n    \"chunk_size\": 1000,\n    \"parallel_jobs\": 4,\n    \"memory_limit\": \"2GB\"\n  },\n  \"output\": {\n    \"format\": \"csv\",\n    \"compression\": \"gzip\",\n    \"encoding\": \"utf-8\"\n  }\n}\n```\n\n## 🤝 Contributing\n\nWe welcome contributions! 
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n### Areas of Interest\n- Machine learning pipelines\n- Real-time data processing\n- Additional format converters\n- Statistical analysis tools\n- Visualization utilities\n\n## 📈 Performance Optimization\n\n### Tips for Large Datasets\n\n1. **Use chunking**: Process data in manageable pieces\n2. **Leverage multiprocessing**: Utilize all CPU cores\n3. **Profile memory usage**: Monitor with `memory_profiler`\n4. **Consider formats**: Parquet/HDF5 for large datasets\n\n### Benchmarks\n\n| Operation | Dataset Size | Time | Memory |\n|-----------|-------------|------|--------|\n| PDF extraction | 1000 pages | 45s | 200MB |\n| Citation counting | 100 papers | 12s | 50MB |\n| Pivot table (R) | 1M rows | 8s | 500MB |\n| Web scraping | 100 pages | 2m | 100MB |\n\n## 🐛 Troubleshooting\n\n### Common Issues\n\n**Import Errors**\n```bash\n# Check installed packages\npip list | grep package_name\n\n# Reinstall if needed\npip install --upgrade package_name\n```\n\n**Memory Errors**\n```python\n# Use chunking for large files\nimport pandas as pd\n\nchunk_size = 10000\nfor chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):\n    process(chunk)  # process is a placeholder for your own function\n```\n\n**Encoding Issues**\n```python\n# Specify encoding explicitly\nwith open('file.txt', 'r', encoding='utf-8', errors='ignore') as f:\n    content = f.read()\n```\n\n## 📚 Documentation\n\n- [Python Tools Guide](docs/python-tools.md)\n- [R Scripts Manual](docs/r-scripts.md)\n- [Web Scraping Ethics](docs/scraping-ethics.md)\n- [Performance Tuning](docs/performance.md)\n\n## 📄 License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n\n## 🌟 Acknowledgments\n\n- Python Software Foundation\n- R Core Team\n- Open source data science community\n- Contributors and testers\n\n## 📮 Contact\n\n- Issues: [GitHub Issues](https://github.com/olympus-terminal/data-processing/issues)\n- Discussions: [GitHub Discussions](https://github.com/olympus-terminal/data-processing/discussions)\n- Author: 
[@olympus-terminal](https://github.com/olympus-terminal)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folympus-terminal%2Fdata-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Folympus-terminal%2Fdata-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folympus-terminal%2Fdata-processing/lists"}