{"id":25132906,"url":"https://github.com/helviojunior/filecrawler","last_synced_at":"2025-04-08T04:16:49.030Z","repository":{"id":147107798,"uuid":"614551642","full_name":"helviojunior/filecrawler","owner":"helviojunior","description":"File Crawler index files and search hard-coded credentials","archived":false,"fork":false,"pushed_at":"2025-02-08T14:47:31.000Z","size":27647,"stargazers_count":33,"open_issues_count":0,"forks_count":9,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T04:05:40.986Z","etag":null,"topics":["crawler","crawling-python","elasticsearch","leaks","leaks-scanner"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/FileCrawler/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/helviojunior.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-15T20:24:51.000Z","updated_at":"2025-03-26T13:16:12.000Z","dependencies_parsed_at":"2024-06-24T23:41:28.243Z","dependency_job_id":"3312210b-c5be-4c01-bb6f-2e07f4b305cc","html_url":"https://github.com/helviojunior/filecrawler","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/helviojunior%2Ffilecrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/helviojunior%2Ffilecrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/helviojunior%2Ffilecrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/helviojunior%2Ffilecrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/helviojunior","download_url":"https://codeload.github.com/helviojunior/filecrawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247773721,"owners_count":20993639,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling-python","elasticsearch","leaks","leaks-scanner"],"created_at":"2025-02-08T15:18:58.718Z","updated_at":"2025-04-08T04:16:49.008Z","avatar_url":"https://github.com/helviojunior.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# File Crawler\r\n\r\n[![Build](https://github.com/helviojunior/filecrawler/actions/workflows/build_and_publish.yml/badge.svg)](https://github.com/helviojunior/filecrawler/actions/workflows/build_and_publish.yml)\r\n[![Build](https://github.com/helviojunior/filecrawler/actions/workflows/build_and_test.yml/badge.svg)](https://github.com/helviojunior/filecrawler/actions/workflows/build_and_test.yml)\r\n[![Downloads](https://pepy.tech/badge/filecrawler/month)](https://pepy.tech/project/filecrawler)\r\n[![Supported Versions](https://img.shields.io/pypi/pyversions/filecrawler.svg)](https://pypi.org/project/filecrawler)\r\n[![Contributors](https://img.shields.io/github/contributors/helviojunior/filecrawler.svg)](https://github.com/helviojunior/filecrawler/graphs/contributors)\r\n[![PyPI version](https://img.shields.io/pypi/v/filecrawler.svg)](https://pypi.org/project/filecrawler/)\r\n[![License: 
GPL-3.0](https://img.shields.io/pypi/l/filecrawler.svg)](https://github.com/helviojunior/filecrawler/blob/main/LICENSE)\r\n\r\nFileCrawler officially supports Python 3.8+.\r\n\r\n## Main features\r\n\r\n* [x] List all file contents\r\n* [x] Index file contents in Elasticsearch\r\n* [x] Run OCR on several file types (with tika lib)\r\n* [x] Look for hard-coded credentials\r\n* [x] Much more...\r\n\r\n### Parsers:\r\n* [x] PDF files\r\n* [X] Microsoft Office files (Word, Excel etc)\r\n* [X] X509 Certificate files\r\n* [X] Image files (Jpg, Png, Gif etc)\r\n* [X] Java packages (Jar and war)\r\n* [X] Disassemble APK files with APKTool\r\n* [X] Compressed files (zip, tar, gzip etc)\r\n* [X] SQLite3 database\r\n* [X] Containers (Docker images saved as tar.gz)\r\n* [X] E-mail (*.eml files) header, body and attachments\r\n\r\n### Indexers:\r\n* [x] Elasticsearch\r\n* [x] Stand-alone local files\r\n\r\n### Extractors:\r\n* [X] AWS credentials\r\n* [X] GitHub and GitLab credentials\r\n* [X] URL credentials\r\n* [X] Authorization header credentials\r\n\r\n### Alert:\r\n* [x] Send found credentials via Telegram\r\n\r\n## IntelX Parser\r\n\r\nFor several reasons, I moved the IntelX-specific rules to a new tool called IntelParser, available at https://github.com/helviojunior/intelparser/\r\n\r\n## Sample outputs\r\n\r\nIn addition, File Crawler saves images of the leaked credentials it finds in the **~/.filecrawler/** directory, like the images below\r\n\r\n![Example 001](https://raw.githubusercontent.com/helviojunior/filecrawler/main/images/example_001.png)\r\n\r\n![Example 002](https://raw.githubusercontent.com/helviojunior/filecrawler/main/images/example_002.png)\r\n\r\n![Example 003](https://raw.githubusercontent.com/helviojunior/filecrawler/main/images/example_003.png)\r\n\r\n![Example 004](https://raw.githubusercontent.com/helviojunior/filecrawler/main/images/example_004.png)\r\n\r\n## Installing\r\n\r\n### Dependencies\r\n\r\n```bash\r\napt install default-jre 
default-jdk libmagic-dev git\r\n```\r\n\r\n### Installing FileCrawler\r\n\r\nInstalling the latest release\r\n\r\n```bash\r\npip install -U filecrawler\r\n```\r\n\r\nInstalling the development package\r\n\r\n```bash\r\npip install -i https://test.pypi.org/simple/ FileCrawler\r\n```\r\n\r\n## Running\r\n\r\n### Config file\r\n\r\nCreate a sample config file with default parameters\r\n\r\n```bash\r\nfilecrawler --create-config -v\r\n```\r\n\r\nEdit the configuration file **config.yml** with your desired parameters\r\n\r\n**Note:** You must adjust the Elasticsearch URL parameter before continuing\r\n\r\n### Run\r\n\r\n```bash\r\n# Integrate with ELK\r\nfilecrawler --index-name filecrawler --path /mnt/client_files -T 30 -v --elastic\r\n\r\n# Just save leaks locally\r\nfilecrawler --index-name filecrawler --path /mnt/client_files -T 30 -v --local -o /home/out_test\r\n```\r\n\r\n## Help\r\n\r\n```bash\r\n$ filecrawler -h\r\n\r\nFile Crawler v0.1.3 by Helvio Junior\r\nFile Crawler index files and search hard-coded credentials.\r\nhttps://github.com/helviojunior/filecrawler\r\n    \r\nusage: \r\n    filecrawler module [flags]\r\n\r\nAvailable Integration Modules:\r\n  --elastic                  Integrate to elasticsearch\r\n  --local                    Save leaks locally\r\n\r\nGlobal Flags:\r\n  --index-name [index name]  Crawler name\r\n  --path [folder path]       Folder path to be indexed\r\n  --config [config file]     Configuration file. (default: ./fileindex.yml)\r\n  --db [sqlite file]         Filename to save status of indexed files. (default: ~/.filecrawler/{index_name}/indexer.db)\r\n  -T [tasks]                 number of connects in parallel (per host, default: 16)\r\n  --create-config            Create config sample\r\n  --clear-session            Clear old file status and reindex all files\r\n  -h, --help                 show help message and exit\r\n  -v                         Specify verbosity level (default: 0). 
Example: -v, -vv, -vvv\r\n\r\nUse \"filecrawler [module] --help\" for more information about a command.\r\n\r\n```\r\n\r\n# How-to install ELK from scratch\r\n\r\n[Installing Elasticsearch](https://github.com/helviojunior/filecrawler/blob/main/INSTALL_ELK.md)\r\n\r\n# Docker Support\r\n\r\n## Build filecrawler only:\r\n\r\n```bash\r\n$ docker build --no-cache -t \"filecrawler:client\" https://github.com/helviojunior/filecrawler.git#main\r\n```\r\n\r\nUsing Filecrawler's image:\r\n\r\nGo to the path to be indexed and run the commands below\r\n\r\n```bash\r\n$ mkdir -p $HOME/.filecrawler/\r\n$ docker run -v \"$HOME/.filecrawler/\":/u01/ -v \"$PWD\":/u02/ --rm -it \"filecrawler:client\" --create-config -v\r\n$ docker run -v \"$HOME/.filecrawler/\":/u01/ -v \"$PWD\":/u02/ --rm -it \"filecrawler:client\" --path /u02/ --no-db -T 30 -v --elastic --index-name filecrawler\r\n```\r\n\r\n\r\n## Build filecrawler + ELK image:\r\n\r\n```bash\r\n$ sysctl -w vm.max_map_count=262144\r\n$ docker build --no-cache -t \"filecrawler:latest\" -f Dockerfile.elk_server https://github.com/helviojunior/filecrawler.git#main\r\n```\r\n\r\nUsing Filecrawler's image:\r\n\r\nGo to the path to be indexed and run the commands below\r\n\r\n```bash\r\n$ mkdir -p $HOME/.filecrawler/\r\n$ docker run -p 443:443 -p 80:5601 -p 9200:9200 -v \"$HOME/.filecrawler/\":/u01/ -v \"$PWD\":/u02/ --rm -it \"filecrawler:latest\"\r\n\r\n#Inside of docker run\r\n$ filecrawler --create-config -v\r\n$ filecrawler --path /u02/ -T 30 -v --elastic --index-name filecrawler \r\n```\r\n\r\n## Using Docker with remote server using ssh forwarding\r\n```bash\r\n$ mkdir -p $HOME/.filecrawler/\r\n$ docker run -v \"$HOME/.ssh/\":/root/.ssh/ -v \"$HOME/.filecrawler/\":/u01/ -v \"$PWD\":/u02/ --rm -it --entrypoint /bin/bash \"filecrawler:client\"\r\n$ ssh -o StrictHostKeyChecking=no -Nf -L 127.0.0.1:9200:127.0.0.1:9200 user@server_ip\r\n$ filecrawler --create-config -v\r\n$ filecrawler --path /u02/ -T 30 --no-db -v --elastic 
--index-name filecrawler \r\n```\r\n\r\n\r\n# Credits\r\n\r\nThis project was inspired by:\r\n\r\n1. [FSCrawler](https://fscrawler.readthedocs.io/)\r\n2. [Gitleaks](https://gitleaks.io/)\r\n\r\n**Note:** Some parts of the code were ported from these two projects\r\n\r\n# To do\r\n\r\n[Check the TODO file](https://github.com/helviojunior/filecrawler/blob/main/TODO.md)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhelviojunior%2Ffilecrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhelviojunior%2Ffilecrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhelviojunior%2Ffilecrawler/lists"}