{"id":29726808,"url":"https://github.com/lmlk-seal/arrowshelf","last_synced_at":"2026-05-01T21:35:21.980Z","repository":{"id":300290442,"uuid":"1005799799","full_name":"LMLK-seal/ArrowShelf","owner":"LMLK-seal","description":"A lightning-fast, zero-copy, cross-process data store for Python using Apache Arrow.","archived":false,"fork":false,"pushed_at":"2025-07-02T19:34:55.000Z","size":92,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-03T17:48:21.893Z","etag":null,"topics":["apache-arrow","big-data","cross-process","data-processing","data-science","data-store","distributed-computing","high-performance","ipc","lightning-fast","memory-mapping","multiprocessing","numpy","pandas","parallel-computing","performance-optimization","python","scientific-computing","shared-memory","zero-copy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LMLK-seal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-20T20:59:17.000Z","updated_at":"2025-07-02T19:34:58.000Z","dependencies_parsed_at":"2025-06-20T22:06:33.726Z","dependency_job_id":"eea7f173-4fc1-4383-b0c5-e610feae20dc","html_url":"https://github.com/LMLK-seal/ArrowShelf","commit_stats":null,"previous_names":["lmlk-seal/arrowshelf"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LMLK-seal/ArrowShelf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMLK-seal%2FArrowShelf","tags_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMLK-seal%2FArrowShelf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMLK-seal%2FArrowShelf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMLK-seal%2FArrowShelf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LMLK-seal","download_url":"https://codeload.github.com/LMLK-seal/ArrowShelf/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LMLK-seal%2FArrowShelf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32513955,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-30T13:12:12.517Z","status":"online","status_checked_at":"2026-05-01T02:00:05.856Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-arrow","big-data","cross-process","data-processing","data-science","data-store","distributed-computing","high-performance","ipc","lightning-fast","memory-mapping","multiprocessing","numpy","pandas","parallel-computing","performance-optimization","python","scientific-computing","shared-memory","zero-copy"],"created_at":"2025-07-25T00:21:53.602Z","updated_at":"2026-05-01T21:35:21.954Z","avatar_url":"https://github.com/LMLK-seal.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🏹 ArrowShelf\n\n**High-Performance 
Shared Memory Data Exchange for Python**\n\n[![Python](https://img.shields.io/badge/Python-3.7%2B-blue?logo=python\u0026logoColor=white)](https://python.org)\n[![Apache Arrow](https://img.shields.io/badge/Apache%20Arrow-Powered-orange?logo=apache\u0026logoColor=white)](https://arrow.apache.org)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Performance](https://img.shields.io/badge/Performance-⚡%20Ultra%20Fast-yellow)](https://github.com/LMLK-seal/ArrowShelf)\n\nArrowShelf is a cutting-edge Python library that enables **lightning-fast shared memory data exchange** between processes using Apache Arrow's columnar format. Perfect for high-performance computing, machine learning pipelines, and distributed data processing.\n\n## 🌟 Key Features\n\n- **🚀 Zero-Copy Operations**: Direct memory access without serialization overhead\n- **🔧 Process-Safe**: Thread and multiprocess safe data sharing\n- **📊 Columnar Efficiency**: Optimized for analytical workloads with Apache Arrow\n- **🎯 FAISS Integration**: Built-in support for approximate nearest neighbor search\n- **🔄 Automatic Cleanup**: Smart memory management with reference counting\n- **🛡️ Production Ready**: Robust error handling and connection management\n\n## 📦 Installation\n\n```bash\npip install arrowshelf\n```\n\nFor FAISS integration (optional):\n```bash\npip install faiss-cpu  # or faiss-gpu for GPU support\n```\n\n### 🚀 Starting the ArrowShelf Server\n\nArrowShelf requires a server daemon to manage shared memory. Start it before running your applications:\n\n```bash\n# Start the ArrowShelf server\narrowshelf-server\n\n# Or run in background (Linux/Mac)\narrowshelf-server \u0026\n\n# Windows background (using PowerShell)\nStart-Process arrowshelf-server -WindowStyle Hidden\n```\n\nThe server will run on `localhost:50051` by default.\n\n## 🚀 Quick Start\n\n### Setting Up ArrowShelf\n\n1. **Start the server** (required):\n```bash\narrowshelf-server\n```\n\n2. 
**Basic Usage** (in a separate terminal/process):\n\n```python\nimport arrowshelf\nimport pandas as pd\nimport numpy as np\n\n# Create sample data\ndf = pd.DataFrame({\n    'x': np.random.rand(10000),\n    'y': np.random.rand(10000),\n    'z': np.random.rand(10000)\n})\n\n# Store in shared memory\nkey = arrowshelf.put(df)\n\n# Access from any process\nretrieved_df = arrowshelf.get(key)\nprint(f\"Retrieved {len(retrieved_df)} rows\")\n\n# Cleanup\narrowshelf.delete(key)\n```\n\n### Advanced Zero-Copy Access\n\n```python\nimport arrowshelf\nimport numpy as np\n\n# Store data\nkey = arrowshelf.put(df)\n\n# Get Arrow table for zero-copy operations\ntable = arrowshelf.get_arrow(key)\nx_column = table.column(\"x\").chunk(0).to_numpy(zero_copy_only=True)\n\n# Direct NumPy operations without copying\nresult = np.mean(x_column)\n```\n\n## 🎯 Real-World Example: Parallel Nearest Neighbor Search\n\nThis example demonstrates how ArrowShelf enables efficient parallel processing with FAISS for approximate nearest neighbor search.\n\n### Prerequisites\n\n1. **Start ArrowShelf server**:\n```bash\narrowshelf-server\n```\n\n2. 
**Install dependencies**:\n```bash\npip install arrowshelf faiss-cpu pandas numpy\n```\n\n### Complete Example\n\n```python\nimport multiprocessing as mp\nimport pandas as pd\nimport numpy as np\nimport time\nimport arrowshelf\nimport faiss\nfrom multiprocessing.pool import ThreadPool\n\ndef worker_faiss_search(task_data):\n    \"\"\"Worker function for parallel FAISS nearest neighbor search\"\"\"\n    key, start_index, end_index = task_data\n    \n    # Zero-copy access to shared data\n    table = arrowshelf.get_arrow(key).combine_chunks()\n    x = table.column(\"x\").chunk(0).to_numpy(zero_copy_only=True)\n    y = table.column(\"y\").chunk(0).to_numpy(zero_copy_only=True)\n    z = table.column(\"z\").chunk(0).to_numpy(zero_copy_only=True)\n    \n    # Stack coordinates for FAISS (this step copies into a float32 array)\n    all_points = np.stack([x, y, z], axis=1).astype(np.float32)\n    query_chunk = all_points[start_index:end_index]\n    \n    # Configure FAISS IVF index\n    d = 3  # 3D points\n    nlist = 100  # Voronoi cells\n    quantizer = faiss.IndexFlatL2(d)\n    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)\n    \n    # Train and populate index\n    index.train(all_points)\n    index.add(all_points)\n    index.nprobe = 10  # Search cells\n    \n    # Perform approximate k-NN search; k=11 because each query point also matches itself.\n    # search() returns (squared L2 distances, indices), in that order.\n    distances, _ = index.search(query_chunk, 11)\n    avg_distance = np.mean(np.sqrt(distances[:, 1:]))  # Drop the self-match in column 0\n    \n    return avg_distance\n\ndef parallel_nearest_neighbor_demo():\n    \"\"\"Demonstrate parallel processing with ArrowShelf + FAISS\"\"\"\n    \n    # Check ArrowShelf connection\n    try:\n        arrowshelf.list_keys()\n        print(\"✅ ArrowShelf server connection OK\")\n    except arrowshelf.ConnectionError:\n        print(\"❌ ERROR: ArrowShelf server not running!\")\n        print(\"Please start the server first: arrowshelf-server\")\n        
return\n    \n    # Generate sample 3D points\n    num_points = 100_000\n    num_cores = 6\n    \n    print(f\"🔍 Running parallel k-NN search on {num_points:,} 3D points\")\n    \n    # Create dataset\n    df = pd.DataFrame(np.random.rand(num_points, 3), columns=['x', 'y', 'z'])\n    print(f\"📊 Dataset size: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB\")\n    \n    # Store in ArrowShelf\n    key = arrowshelf.put(df)\n    \n    # Create tasks for parallel processing\n    chunk_size = num_points // num_cores\n    tasks = [\n        # The last chunk runs to num_points so the remainder is not dropped\n        (key, i * chunk_size, num_points if i == num_cores - 1 else (i + 1) * chunk_size)\n        for i in range(num_cores)\n    ]\n    \n    # Execute parallel search\n    print(f\"⚡ Processing with {num_cores} cores...\")\n    start_time = time.perf_counter()\n    \n    with ThreadPool(processes=num_cores) as pool:\n        results = pool.map(worker_faiss_search, tasks)\n    \n    duration = time.perf_counter() - start_time\n    avg_distance = np.mean(results)\n    \n    # Results\n    print(f\"✅ Average 10-NN distance: {avg_distance:.6f}\")\n    print(f\"🚀 Processing time: {duration:.4f} seconds\")\n    print(f\"🔥 Throughput: {num_points/duration:,.0f} points/second\")\n    \n    # Cleanup\n    arrowshelf.delete(key)\n    print(\"🧹 Cleanup completed\")\n\nif __name__ == \"__main__\":\n    mp.set_start_method('spawn', force=True)\n    parallel_nearest_neighbor_demo()\n```\n\n### Running the Example\n\n1. **Terminal 1** - Start the server:\n```bash\narrowshelf-server\n```\n\n2. **Terminal 2** - Run the example:\n```bash\npython nearest_neighbor_demo.py\n```\n\n**Expected Output:**\n```\n✅ ArrowShelf server connection OK\n🔍 Running parallel k-NN search on 100,000 3D points\n📊 Dataset size: 2.29 MB\n⚡ Processing with 6 cores...\n✅ Average 10-NN distance: (varies with the random data)\n🚀 Processing time: 1.0017 seconds\n🔥 Throughput: 99,830 points/second\n🧹 Cleanup completed\n```\n\n# How ArrowShelf Helps with Large Datasets\n\n## The Process Flow\n\n1. 
**Load Once, Use Many Times**: Your large dataset is loaded into memory once and placed on the ArrowShelf\n2. **Zero-Copy Access**: Multiple worker processes access the same data instantly without copying\n3. **Memory Efficient**: Instead of having 8 copies of your data for 8 cores, you have just 1 shared copy\n4. **Fast Parallel Processing**: Workers can immediately start computing instead of waiting for data transfer\n\n## Real-World Scenarios Where This Shines\n\n### Scenario 1: Machine Learning Feature Engineering\n\n```python\n# You have a 5GB customer dataset\ncustomer_data = pd.read_csv(\"customer_behavior_5gb.csv\")\n\n# Put it on the shelf once\ndata_key = arrowshelf.put(customer_data)\n\n# Now run multiple feature engineering tasks in parallel:\n# - Calculate RFM scores\n# - Generate time-based features  \n# - Compute behavioral clusters\n# - Create recommendation features\n\n# Each task accesses the same 5GB instantly, no copying!\n```\n\n### Scenario 2: Financial Risk Analysis\n\n```python\n# Load 10 million stock price records\nstock_data = pd.read_parquet(\"stock_prices_10m_rows.parquet\")\ndata_key = arrowshelf.put(stock_data)\n\n# Run parallel risk calculations:\n# - VaR calculations for different portfolios\n# - Correlation analysis across sectors\n# - Volatility modeling\n# - Stress testing scenarios\n\n# Traditional approach: Each task waits 30+ seconds for data copying\n# ArrowShelf approach: Each task starts immediately\n```\n\n### Scenario 3: Geospatial Analysis\n\n```python\n# Load millions of GPS coordinates\nlocation_data = pd.read_csv(\"gps_coordinates_50m_points.csv\")\ndata_key = arrowshelf.put(location_data)\n\n# Parallel geospatial tasks:\n# - Find nearest neighbors for different regions\n# - Calculate clustering patterns\n# - Identify hotspots and anomalies\n# - Generate heatmaps for different time periods\n```\n\n## Key Benefits\n\n1. 
**Memory Efficiency**: Instead of 6 copies of your 3D points (one per core), you have 1 shared copy\n2. **Instant Access**: Each worker gets the data via `arrowshelf.get_arrow(key)` instantly\n3. **Zero-Copy Operations**: The `.to_numpy(zero_copy_only=True)` means no data copying at all\n4. **Scalable**: Works whether you have 100K points or 100M points\n\n## The Traditional Problem vs ArrowShelf Solution\n\n### Traditional Multiprocessing (Pickle)\n\n```\nMain Process: Load 5GB dataset\n├── Send 5GB copy to Worker 1 (30 seconds)\n├── Send 5GB copy to Worker 2 (30 seconds)  \n├── Send 5GB copy to Worker 3 (30 seconds)\n└── Send 5GB copy to Worker 4 (30 seconds)\nTotal data transfer: 120 seconds + computation time\n```\n\n### ArrowShelf Approach\n\n```\nMain Process: Load 5GB dataset → Put on shelf (2 seconds)\n├── Worker 1: Get instant reference (0.001 seconds)\n├── Worker 2: Get instant reference (0.001 seconds)\n├── Worker 3: Get instant reference (0.001 seconds)\n└── Worker 4: Get instant reference (0.001 seconds)\nTotal data transfer: 2 seconds + computation time\n```\n\n## Perfect Use Cases\n\n1. **Data Science Notebooks**: When you're iteratively running different analyses on the same large dataset\n2. **ETL Pipelines**: When multiple transformation steps need access to the same source data\n3. **Machine Learning**: When training multiple models or doing hyperparameter tuning on the same dataset\n4. **Scientific Computing**: When running simulations that need shared reference data\n5. **Real-time Analytics**: When multiple dashboards need to query the same large dataset\n\n## The Key Insight\n\nArrowShelf eliminates the \"data tax\" - the time penalty you normally pay for having multiple processes work with large datasets. Instead of spending most of your time copying data, you spend it actually computing results.\n\n\n## 🚀 Project Evolution\n\nArrowShelf has evolved from a simple data sharing concept to a high-performance computing powerhouse. 
Here's the journey of optimization:\n\n### Performance Evolution Timeline\n\n| Benchmark | Architecture | Algorithm | Time | Improvement vs. Baseline |\n|-----------|-------------|-----------|------|-------------|\n| **Pickle + Brute Force** | Slow Data Transfer | Brute Force O(n²) | 16.7s | *Baseline* |\n| **ArrowShelf + Brute Force** | Fast Data Transfer | Brute Force O(n²) | 14.5s | **13% faster** |\n| **ArrowShelf + FAISS IndexFlatL2** | Fast Data Transfer | Optimized Exact Search | 1.84s | **89% faster** |\n| **ArrowShelf + FAISS IndexIVFFlat** | Fast Data Transfer | Approximate Search | **1.00s** | **94% faster** |\n\n### 📊 Performance Visualization\n\n```\nTraditional Pickle Approach   ████████████████████████████████████ 16.7s\nArrowShelf + Brute Force      ███████████████████████████████     14.5s\nArrowShelf + FAISS Exact      ████                               1.84s\nArrowShelf + FAISS Approx     ██                                 1.00s ⚡\n```\n\n### 🎯 Key Milestones\n\n- **🏗️ Phase 1: Foundation** - Basic shared memory with Apache Arrow\n- **⚡ Phase 2: Optimization** - Zero-copy operations and efficient data transfer  \n- **🔍 Phase 3: Intelligence** - FAISS integration for similarity search\n- **🚀 Phase 4: Approximation** - IVF indexing for ultimate performance\n\nThe evolution demonstrates a **16.7x performance improvement** from traditional pickle-based approaches to our current FAISS-optimized implementation.\n\n## 📈 Performance\n\nArrowShelf delivers exceptional performance for data-intensive applications:\n\n### FAISS Integration Benchmark\n- **Dataset**: 100,000 3D points (2.29 MB)\n- **Operation**: Approximate 10-NN search with IVF index\n- **Hardware**: 6-core parallel processing\n- **Result**: **1.00 seconds** processing time\n- **Throughput**: **~100K points/second**\n\n### Key Performance Benefits\n- **Zero-Copy Access**: Direct memory mapping eliminates serialization overhead\n- **Columnar Storage**: Optimized for analytical operations and vectorized 
computations\n- **Parallel Processing**: Efficient multi-core scaling with shared memory\n- **Memory Efficiency**: Reference counting prevents memory leaks\n\n## 🔧 API Reference\n\n### Core Functions\n\n```python\n# Store data in shared memory\nkey = arrowshelf.put(data)\n\n# Retrieve data as pandas DataFrame\ndf = arrowshelf.get(key)\n\n# Retrieve data as Arrow Table (zero-copy)\ntable = arrowshelf.get_arrow(key)\n\n# List all stored keys\nkeys = arrowshelf.list_keys()\n\n# Delete data from shared memory\narrowshelf.delete(key)\n\n# Close connection\narrowshelf.close()\n```\n\n### Advanced Operations\n\n```python\n# Batch operations\narrowshelf.delete_all()  # Clear all data\n\n# Connection management\narrowshelf.is_connected()  # Check connection status\n\n# Memory statistics\narrowshelf.memory_usage()  # Get usage statistics\n```\n\n## 🛠️ Use Cases\n\n### 🤖 Machine Learning\n- **Feature Engineering**: Share preprocessed datasets across training processes\n- **Model Serving**: Cache model predictions and intermediate results\n- **Hyperparameter Tuning**: Efficient data sharing in parallel optimization\n\n### 📊 Data Analytics\n- **ETL Pipelines**: Zero-copy data transformations\n- **Distributed Computing**: Shared memory for map-reduce operations\n- **Real-time Analytics**: High-throughput data processing\n\n### 🔬 Scientific Computing\n- **Numerical Simulations**: Share large arrays between simulation processes\n- **Image Processing**: Efficient pixel data sharing\n- **Geospatial Analysis**: Fast coordinate and geometry operations\n\n## 🏗️ Architecture\n\n```\n┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐\n│   Process A     │    │   ArrowShelf    │    │   Process B     │\n│                 │    │     Server      │    │                 │\n│  put(data) ────────▶│                 │◀──────── get(key)   │\n│                 │    │  Apache Arrow   │    │                 │\n│                 │    │ Shared Memory   │    │                 
│\n└─────────────────┘    └─────────────────┘    └─────────────────┘\n```\n\n## 🤝 Contributing\n\nWe welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n## 📋 Requirements\n\n- Python 3.7+\n- Apache Arrow\n- pandas\n- numpy\n\nOptional dependencies:\n- FAISS (for nearest neighbor search)\n- multiprocessing support\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- Built on [Apache Arrow](https://arrow.apache.org/) columnar memory format\n- Optimized for [FAISS](https://github.com/facebookresearch/faiss) similarity search\n- Inspired by modern high-performance computing needs\n\n## 📞 Support\n\n- 📧 **Issues**: [GitHub Issues](https://github.com/LMLK-seal/ArrowShelf/issues)\n- 💬 **Discussions**: [GitHub Discussions](https://github.com/LMLK-seal/ArrowShelf/discussions)\n- 📖 **Documentation**: [Wiki](https://github.com/LMLK-seal/ArrowShelf/wiki)\n\n---\n\n**⭐ Star this repository if ArrowShelf helps accelerate your data processing workflows!**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flmlk-seal%2Farrowshelf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flmlk-seal%2Farrowshelf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flmlk-seal%2Farrowshelf/lists"}