{"id":31883747,"url":"https://github.com/kalfasyan/filoma","last_synced_at":"2026-04-05T17:01:27.557Z","repository":{"id":315818741,"uuid":"1014461005","full_name":"kalfasyan/filoma","owner":"kalfasyan","description":"profiling files, directories, image data","archived":false,"fork":false,"pushed_at":"2026-04-05T10:20:23.000Z","size":30191,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-05T11:22:41.984Z","etag":null,"topics":["data-analysis","profiler","validation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kalfasyan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-05T19:15:47.000Z","updated_at":"2026-04-05T10:20:28.000Z","dependencies_parsed_at":"2025-09-21T00:21:57.741Z","dependency_job_id":null,"html_url":"https://github.com/kalfasyan/filoma","commit_stats":null,"previous_names":["kalfasyan/filoma"],"tags_count":57,"template":false,"template_full_name":null,"purl":"pkg:github/kalfasyan/filoma","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalfasyan%2Ffiloma","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalfasyan%2Ffiloma/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalfasyan%2Ffiloma/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalfasyan
%2Ffiloma/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kalfasyan","download_url":"https://codeload.github.com/kalfasyan/filoma/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalfasyan%2Ffiloma/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31442924,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T15:22:31.103Z","status":"ssl_error","status_checked_at":"2026-04-05T15:22:00.205Z","response_time":75,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","profiler","validation"],"created_at":"2025-10-13T03:27:22.837Z","updated_at":"2026-04-05T17:01:27.550Z","avatar_url":"https://github.com/kalfasyan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"docs/assets/images/logo.png\" alt=\"filoma logo\" width=\"260\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.python.org/pypi/filoma\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/filoma.svg\" alt=\"PyPI version\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.python.org/pypi/filoma\"\u003e\u003cimg src=\"https://img.shields.io/badge/python-3.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue\" alt=\"Python 
versions\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/kalfasyan/filoma/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-CC--BY--4.0-lightgrey\" alt=\"License\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/astral-sh/ruff\"\u003e\u003cimg src=\"https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json\" alt=\"Ruff\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/kalfasyan/filoma/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/kalfasyan/filoma/actions/workflows/ci.yml/badge.svg\" alt=\"Actions status\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://filoma.readthedocs.io/en/latest/\"\u003e\u003cimg src=\"https://readthedocs.org/projects/filoma/badge/?version=latest\" alt=\"Documentation Status\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eFast, multi-backend file/directory profiling and data preparation.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ccode\u003epip install filoma\u003c/code\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"docs/getting-started/installation.md\"\u003eInstallation\u003c/a\u003e •\n  \u003ca href=\"https://filoma.readthedocs.io/en/latest/\"\u003eDocumentation\u003c/a\u003e •\n  \u003ca href=\"docs/guides/brain.md\"\u003eAgentic Analysis\u003c/a\u003e •\n  \u003ca href=\"docs/guides/cli.md\"\u003eInteractive CLI\u003c/a\u003e •\n  \u003ca href=\"docs/getting-started/quickstart.md\"\u003eQuickstart\u003c/a\u003e •\n  \u003ca href=\"docs/tutorials/cookbook.md\"\u003eCookbook\u003c/a\u003e •\n  \u003ca href=\"https://github.com/kalfasyan/filoma/blob/main/notebooks/roboflow_demo.ipynb\"\u003eRoboflow Demo\u003c/a\u003e •\n  \u003ca href=\"https://github.com/kalfasyan/filoma\"\u003eSource Code\u003c/a\u003e\n\u003c/p\u003e\n\n\u003e 📖 **New to Filoma?** Check out the 
[**Cookbook**](docs/tutorials/cookbook.md) for practical, copy-paste recipes for common tasks!\n\n---\n\n`filoma` helps you analyze file directory trees, inspect file metadata, and prepare your data for exploration. It can achieve this blazingly fast using the best available backend (Rust, [`fd`](https://github.com/sharkdp/fd), or pure Python) ⚡🍃\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"docs/assets/images/filoma_ad.png\" alt=\"Filoma Package Overview\" width=\"400\"\u003e\n\u003c/p\u003e\n\n## Key Features\n\n- **🚀 High-Performance Backends**: Automatic selection of Rust, `fd`, or Python for the best performance.\n- **📈 DataFrame Integration**: Convert scan results to [Polars](https://github.com/pola-rs/polars) (or [pandas](https://github.com/pandas-dev/pandas)) DataFrames for powerful analysis.\n- **📊 Rich Directory Analysis**: Get detailed statistics on file counts, extensions, sizes, and more.\n- **🔍 Smart File Search**: Use regex and glob patterns to find files with `FdFinder`.\n- **🖼️ File/Image Profiling**: Extract metadata and statistics from various file formats.\n- **🛡️ Dataset Integrity \u0026 Quality**: Unified integrity checking for snapshots, manifests, and automated quality scans (corruption, duplicates, leakage, class balance). [📖 **Data Integrity Guide →**](docs/guides/data-integrity.md)\n- **🧠 Agentic Analysis**: Natural language interface for file discovery, deduplication, and metadata inspection. [📖 **Brain Guide →**](docs/guides/brain.md)\n- **🖥️ Interactive CLI**: Beautiful terminal interface for filesystem exploration and DataFrame analysis. [📖 **CLI Documentation →**](docs/guides/cli.md)\n- **🌐 MCP Server**: Expose all 21 filesystem tools to any MCP-compatible AI assistant (Claude Desktop, Cline, Cursor, etc.). 
[📖 **MCP Configuration →**](docs/guides/brain.md#mcp-server-configuration)\n\n\u003e **🎯 Local AI in 10 seconds:** `curl -sL https://raw.githubusercontent.com/kalfasyan/filoma/main/scripts/install.sh | sh` → Use with [Goose](https://github.com/block/goose) + [Ollama](https://ollama.com) for fully local filesystem analysis. [Learn more →](docs/guides/brain.md#goose--ollama-local--private---recommended)\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"docs/assets/images/filoma_graph.jpg\" alt=\"Filoma Package Overview\" width=\"800\"\u003e\n\u003c/p\u003e\n\n---\n\n## ⚡ Quick Start\n\n`filoma` provides a unified API for filesystem analysis.\n\n### End-to-End Example: Folder → DataFrame → Insights\n\nThis is the core Filoma workflow in one place: scan a folder, build a rich dataframe, filter it, and extract quick insights.\n\n```python\nimport filoma as flm\n\ndataset = \"notebooks/Weeds-3\"\n\n# 1) Fast scan + high-level summary\nanalysis = flm.probe(dataset)\nanalysis.print_summary()\n\n# 2) Build an enriched dataframe (paths, extension, sizes, ownership, timestamps, etc.)\ndf = flm.probe_to_df(dataset, enrich=True)\n\n# 3) Narrow to image files and inspect distribution\nimages = df.filter_by_extension([\"jpg\", \"png\"])\nprint(images.extension_counts())\nprint(images.directory_counts().head(3))\n\n# 4) Get the largest files quickly\nlargest = images.sort(\"size_bytes\", descending=True).head(5)\nprint(largest.select([\"path\", \"size_bytes\"]))\n```\n\nThis flow is typically the fastest way to move from raw folder structure to actionable dataset insight.\n\n### 1. 
File \u0026 Image Profiling\n\nExtract rich metadata and statistics from any file or image.\n\n```python\nimport filoma as flm\n\n# Profile any file\ninfo = flm.probe_file(\"README.md\")\nprint(info)\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e📄 See Metadata Output\u003c/b\u003e\u003c/summary\u003e\n\n```text\nFilo(\n    path=PosixPath('README.md'),\n    size=12237,\n    mode_str='-rw-rw-r--',\n    owner='user',\n    modified=datetime.datetime(2025, 12, 30, 22, 45, 53),\n    is_file=True,\n    ...\n)\n```\n\u003c/details\u003e\n\nFor images, `probe_image` automatically extracts shapes, types, and pixel statistics.\n\n### 2. Directory Analysis\n\nScan entire directory trees in milliseconds. `filoma` automatically picks the fastest available backend (Rust → `fd` → Python).\n\n```python\n# Analyze a directory\nanalysis = flm.probe('.')\n\n# Print high-level summary\nanalysis.print_summary()\n```\n\n\u003cdetails open\u003e\n\u003csummary\u003e\u003cb\u003e📂 See Directory Summary Table\u003c/b\u003e\u003c/summary\u003e\n\n```text\n Directory Analysis: /project (🦀 Rust (Parallel)) - 0.60s\n┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓\n┃ Metric                   ┃ Value                ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩\n│ Total Files              │ 57,225               │\n│ Total Folders            │ 3,427                │\n│ Total Size               │ 2,084.90 MB          │\n│ Average Files per Folder │ 16.70                │\n│ Maximum Depth            │ 14                   │\n│ Empty Folders            │ 103                  │\n│ Analysis Time            │ 0.60s                │\n│ Processing Speed         │ 102,114 items/sec    │\n└──────────────────────────┴──────────────────────┘\n```\n\n\u003c/details\u003e\n\n```python\n# Or get a detailed report with extensions and folder stats\nanalysis.print_report()\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e📊 See Detailed Directory 
Report\u003c/b\u003e\u003c/summary\u003e\n\n```text\n          File Extensions\n┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Extension  ┃ Count  ┃ Percentage ┃\n┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩\n│ .py        │ 240    │ 12.8%      │\n│ .jpg       │ 1,204  │ 64.2%      │\n│ .json      │ 431    │ 23.0%      │\n│ .svg       │ 28,674 │ 50.1%      │\n└────────────┴────────┴────────────┘\n\n          Common Folder Names\n┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ Folder Name   ┃ Occurrences ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ src           │ 1           │\n│ tests         │ 1           │\n│ docs          │ 1           │\n│ notebooks     │ 1           │\n└───────────────┴─────────────┘\n\n          Empty Folders (3 found)\n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ Path                                       ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ /project/data/raw/empty_set_A              │\n│ /project/logs/old/unused                   │\n│ /project/temp/scratch                      │\n└────────────────────────────────────────────┘\n```\n\n\u003c/details\u003e\n\n### 3. 
DataFrame Analysis\n\nConvert scan results to Polars DataFrames for advanced analysis.\n\n```python\n# Scan and get an enriched filoma.DataFrame (Polars)\ndf = flm.probe_to_df('src', enrich=True)\n\n# Perform operations\ndf.filter_by_extension([\".py\", \".rs\"])\ndf.directory_counts()\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e📊 See Enriched DataFrame Output\u003c/b\u003e\u003c/summary\u003e\n\n```text\nfiloma.DataFrame with 2 rows\nshape: (2, 18)\n┌───────────────────┬───────┬────────┬───────────────┬───┬─────────┬───────┬────────┬────────┐\n│ path              ┆ depth ┆ parent ┆ name          ┆ … ┆ inode   ┆ nlink ┆ sha256 ┆ xattrs │\n│ ---               ┆ ---   ┆ ---    ┆ ---           ┆   ┆ ---     ┆ ---   ┆ ---    ┆ ---    │\n│ str               ┆ i64   ┆ str    ┆ str           ┆   ┆ i64     ┆ i64   ┆ str    ┆ str    │\n╞═══════════════════╪═══════╪════════╪═══════════════╪═══╪═════════╪═══════╪════════╪════════╡\n│ src/async_scan.rs ┆ 1     ┆ src    ┆ async_scan.rs ┆ … ┆ 7601121 ┆ 1     ┆ null   ┆ {}     │\n│ src/filoma        ┆ 1     ┆ src    ┆ filoma        ┆ … ┆ 7603126 ┆ 8     ┆ null   ┆ {}     │\n└───────────────────┴───────┴────────┴───────────────┴───┴─────────┴───────┴────────┴────────┘\n\n✨ Enriched columns added: parent, name, stem, suffix, size_bytes, modified_time,\n   created_time, is_file, is_dir, owner, group, mode_str, inode, nlink, sha256, xattrs, depth\n```\n\n\u003c/details\u003e\n\n- **Seamless Pandas Integration**: Just use `df.pandas` for instant conversion.\n- **Lazy Loading**: `import filoma` is cheap; heavy dependencies load only when needed.\n\n### 4. 
Specialized DataFrame Operations\n\nFiloma's `DataFrame` extends Polars with filesystem-specific operations for quick filtering and summarization.\n\n```python\n# Filter by extensions\ndf.filter_by_extension([\".py\", \".rs\"])\n\n# Quick frequency analysis\ndf.extension_counts()\ndf.directory_counts()\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e🔍 See Operation Examples\u003c/b\u003e\u003c/summary\u003e\n\n**`filter_by_extension([\".py\", \".rs\"])`**\n\n```text\nshape: (3, 1)\n┌─────────────────────┐\n│ path                │\n│ ---                 │\n│ str                 │\n╞═════════════════════╡\n│ src/async_scan.rs   │\n│ src/lib.rs          │\n│ src/filoma/dedup.py │\n└─────────────────────┘\n```\n\n**`extension_counts()`** — groups files by extension and returns counts.\n\n```text\nshape: (3, 2)\n┌────────────┬─────┐\n│ extension  ┆ len │\n│ ---        ┆ --- │\n│ str        ┆ u32 │\n╞════════════╪═════╡\n│ .py        ┆ 240 │\n│ .jpg       ┆ 124 │\n│ .json      ┆ 43  │\n└────────────┴─────┘\n```\n\n**`directory_counts()`** — summarizes file distribution across parent directories.\n\n```text\nshape: (3, 2)\n┌────────────┬─────┐\n│ parent_dir ┆ len │\n│ ---        ┆ --- │\n│ str        ┆ u32 │\n╞════════════╪═════╡\n│ src/filoma ┆ 12  │\n│ tests      ┆ 8   │\n│ docs       ┆ 5   │\n└────────────┴─────┘\n```\n\n\u003c/details\u003e\n\n---\n\n## 🗂️ Advanced Topics\n\n### Dataset Convenience Class\nUse the `Dataset` class for orchestration of snapshotting, profiling, integrity checks, and AI interactions:\n\n```python\nimport filoma as flm\n\nds = flm.Dataset(\"./my_data\")\n\n# Snapshot, Quality Scan, and Deduplication\nds.snap(mode=\"deep\")\nds.run_quality_scan()\nds.dedup()\n\n# Get an enriched DataFrame of the dataset\ndf = ds.to_dataframe()\nprint(df.extension_counts())\n\n# Agentic interaction with this specific dataset\nds.get_brain().run(\"Is there any class imbalance in my dataset?\")\n```\n\n### Dataset Integrity \u0026 Quality\nFiloma 
provides a comprehensive suite for dataset validation (corruption, leaks, balance) and manifest integrity:\n\n```python\nfrom filoma.core.verifier import DatasetVerifier\nverifier = DatasetVerifier(\"./data\")\nverifier.run_all()\nverifier.print_summary()\n```\n\n### Deduplication\nFind duplicate files, images (perceptual hash), or text files.\n\n```bash\n# Standard find\nfiloma dedup /path/to/dataset\n\n# Cross-directory find\nfiloma dedup train/ valid/ --cross-dir\n```\n\n### Agentic Analysis\nConnect a \"brain\" to your filesystem for natural language interaction:\n\n```python\nfrom filoma.brain import get_agent\n\nagent = get_agent()\nawait agent.run(\"Create a dataframe from notebooks/Weeds-3 with enrichment\")\nawait agent.run(\"Filter by extension: jpg, png\")\nawait agent.run(\"Summarize dataframe and show top directories\")\nawait agent.run(\"Sort dataframe by size descending and show top 5\")\n```\n\nOr use the interactive chat CLI:\n\n```bash\nfiloma brain chat\n# Then ask:\n# - create a dataframe from notebooks/Weeds-3\n# - filter by extension jpg,png\n# - summarize dataframe\n# - export dataframe to weeds_images.csv\n```\n\n#### Advanced Workflow Orchestration\nFiloma Brain now includes advanced orchestrator tools for enterprise-grade dataset analysis:\n\n```bash\n# Run advanced workflow examples\nmake brain-advanced\n\n# Or in code:\nawait agent.run(\"Run a corrupted file audit on /path/to/dataset\")\nawait agent.run(\"Generate a dataset hygiene report for /path/to/dataset\")\nawait agent.run(\"Assess the migration readiness of /path/to/dataset\")\n```\n\nThese tools provide structured, deterministic reports with detailed findings, recommendations, and confidence scores.\n\n### Interactive CLI\n```bash\nfiloma brain chat\n```\n\n[📖 **Browse all guides →**](docs/guides/index.md)\n\n---\n\n## 📊 Performance \u0026 Benchmarks\n\nNeed to compare backend performance? 
Check out the comprehensive [**Benchmarks Guide**](docs/reference/benchmarks.md)!\n\n**Local SSD** (1M files):\n- 🦀 **Rust**: 7.3s (136K files/sec)\n- ⚡ **Async**: 11.5s (87K files/sec)\n- 🐍 **Python**: 35.5s (28K files/sec)\n\n**Network Storage** (200K files, cold cache):\n- 🦀 **Rust**: 2.3s (86K files/sec)\n- ⚡ **Async**: 2.8s (70K files/sec)\n- 🐍 **Python**: 15.1s (13K files/sec)\n\n```bash\npython benchmarks/benchmark.py --path /your/directory -n 3 --backend profiling\n```\n\n---\n\n## License\n\nThis work is licensed under a [Creative Commons Attribution 4.0 International License][cc-by].\n\n[![CC BY 4.0][cc-by-image]][cc-by]\n\n[cc-by]: http://creativecommons.org/licenses/by/4.0/\n[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png\n\n---\n\n## Contributing\n\nContributions welcome! Please check the [issues](https://github.com/kalfasyan/filoma/issues) for planned features and bug reports.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkalfasyan%2Ffiloma","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkalfasyan%2Ffiloma","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkalfasyan%2Ffiloma/lists"}