{"id":15009945,"url":"https://github.com/bandrel/ocyara","last_synced_at":"2025-04-09T17:52:58.978Z","repository":{"id":57447805,"uuid":"78153488","full_name":"bandrel/OCyara","owner":"bandrel","description":"Performs OCR on image files and scans them for matches to YARA rules","archived":false,"fork":false,"pushed_at":"2018-10-30T14:14:11.000Z","size":226,"stargazers_count":41,"open_issues_count":3,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-09T16:55:03.622Z","etag":null,"topics":["ocr","optical-character-recognition","python","python-3","tesseract","tesseract-ocr-api","yara","yara-rules"],"latest_commit_sha":null,"homepage":"https://pypi.python.org/pypi/OCyara/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bandrel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-01-05T22:30:59.000Z","updated_at":"2025-03-05T21:40:47.000Z","dependencies_parsed_at":"2022-09-16T22:23:02.662Z","dependency_job_id":null,"html_url":"https://github.com/bandrel/OCyara","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bandrel%2FOCyara","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bandrel%2FOCyara/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bandrel%2FOCyara/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bandrel%2FOCyara/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bandrel","download_url":"https://codeload.github.com/b
andrel/OCyara/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248083419,"owners_count":21045096,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","optical-character-recognition","python","python-3","tesseract","tesseract-ocr-api","yara","yara-rules"],"created_at":"2024-09-24T19:29:14.819Z","updated_at":"2025-04-09T17:52:58.959Z","avatar_url":"https://github.com/bandrel.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCyara\n[![Build Status](https://travis-ci.org/bandrel/OCyara.svg?branch=master)](https://travis-ci.org/bandrel/OCyara)\n\n[![PyPI version](https://badge.fury.io/py/OCyara.svg)](https://pypi.python.org/pypi/OCyara/)\n\nThe OCyara module performs OCR (Optical Character Recognition) on image\nfiles and scans them for matches to Yara rules.  OCyara also can process\nimages embedded in PDF files. For more information about Yara, visit\nhttps://virustotal.github.io/yara/.\n\n## Installation\n### Operating System Requirements\n\n- **Python 3.5+**\n- **Debian-based Linux distros** are currently the only supported\n  operating systems. Installation has only been tested on Kali\n  Rolling and Ubuntu 16.10. (Other Debian-based distros may work as\n  well, but may require manual compilation of Tesseract and/or Leptonica\n  to get support for all image types. GIF, and TIFF library support\n  seems to be troublesome with some Ubuntu LTS installations.)\n- **Tesseract OCR API**\n  To install Tesseract:\n\n  1. `apt-get update`\n  1. 
Install python3 header files: `apt-get install python3-dev`\n  2. Install Tesseract and its required libraries:\n     `apt-get install tesseract-ocr libtesseract-dev libleptonica-dev\n      libpng12-dev libjpeg62-dev libtiff5-dev zlib1g-dev`\n\n\n\n### Install Procedure\nThe easiest way to install OCyara is with pip:\n\n  1. Ensure all the Operating System Requirements listed above have been\n     met\n  2. Run `pip install cython` (it must be installed separately\n     because tesserocr currently lacks an \"install_requires\")\n  3. Run `pip install ocyara`\n\nAlong with OCyara, the following other packages will be automatically\ninstalled:\n - **cython** (\u003e=0.25.2) A compiler for writing C extensions for the\n   Python language. Used by the tesserocr python module.\n   https://pypi.python.org/pypi/Cython/\n - **tesserocr** (\u003e=2.1.3) A Python wrapper for the tesseract-ocr API\n   https://github.com/sirfz/tesserocr\n - **yara-python** (\u003e=3.5.0) The Python interface for YARA\n   https://github.com/VirusTotal/yara-python\n - **pillow** (\u003e=4.0.0) Python Imaging Library fork\n   https://github.com/python-pillow/Pillow\n - **tqdm** A fast, extensible progress bar for Python and CLI\n   https://github.com/tqdm/tqdm\n - **colorlog**\n   A colored formatter for the python logging module\n   http://pypi.python.org/pypi/colorlog\n\n\n## Usage\n\n\n### OCyara Class Usage Examples\n\n```python\n# Scan the current directory recursively for files that match rules in\n# \"rulefile.yara\"\n\nfrom ocyara import OCyara\n\nocy = OCyara('./', recursive=True)\nocy.run('rulefile.yara', file_magic=True)\nprint(ocy.list_matches())\n```\n\nReturns:\n```\nVisa tests/Example.pdf\nSSN tests/Example.pdf\nAmerican_Express tests/Example.pdf\nDiners_Club tests/Example.pdf\nJCB tests/Example.pdf\nDiscover tests/Example.pdf\ncredit_card tests/Example.pdf\nMasterCard tests/Example.pdf\ncard tests/Example.pdf\n```\n\nEach line printed has the rule that was 
matched and the file that\nmatched it.\n\n### CLI Usage Example\nOCyara is not primarily intended to be used from the command line, but\nbasic CLI capabilities have been implemented to allow for\neasily-approachable testing of the library's core functionality.\n\n```\nusage: ocyara.py [-h] YARA_RULES_FILE TARGET_FILE/S\n\npositional arguments:\n\n  YARA_RULES_FILE  Path of file containing yara rules\n  TARGET_FILE/S    Directory or file name of images to scan.\n\noptional arguments:\n  -h, --help       show this help message and exit\n```\n\n### OCyara Class Structure\n\n```\nclass OCyara(builtins.object)\n |  Performs OCR (Optical Character Recognition) on image files and scans for matches to Yara rules.\n |\n |  OCyara also can process images embedded in PDF files.\n |\n |  Methods defined here:\n |\n |  __call__(self)\n |      Default call which outputs the results with the same output standard as the regular yara program\n |\n |  __init__(self, path:str, recursive=False, worker_count=6, verbose=0) -\u003e None\n |      Create an OCyara object that can scan the specified directory or file and store the results.\n |\n |      Arguments:\n |          path -- File or directory to be processed\n |\n |      Keyword Arguments:\n |          recursive -- Whether the specified path should be recursively searched for images (default False)\n |          worker_count -- The number of worker processes that should be spawned when\n |                          run() is executed (default available CPU cores * 2)\n |          verbose -- An int() from 0-2 that sets the verbosity level.\n |                     0 is default, 1 is information and 2 is debug\n |\n |  join(self, showprogress=True)\n |\n |  list_matched_rules(self) -\u003e set\n |      Process the matchedfiles dictionary and return a list of rules that were matched.\n |\n |  list_matches(self, rules=None) -\u003e typing.Dict\n |      List matched files and their contexts (if available) in dictionary form.\n |\n |      
Keyword Arguments:\n |\n |          rules -- Accepts a string or list of strings indicating specific rules.\n |            Only matches pertaining to the specified rule/s will be returned. If no\n |            rules are specified, all matches will be returned.\n |\n |  run(self, yara_rule:str, auto_join=True, file_magic=False, save_context=False) -\u003e None\n |      Begin multithreaded processing of path files with the specified rule file.\n |\n |      Arguments:\n |          yara_rule -- A string file path of a Yara rule file\n |\n |      Keyword Arguments:\n |          auto_join -- If set to True, the main process will stall until all the\n |            worker processes have completed their work. If set to False, join()\n |            must be manually called following run() to ensure the queue is\n |            cleared and all workers have terminated.\n |\n |          show_progress -- Display a progress bar when join() is used.\n |\n |          file_magic -- If file_magic is enabled, ocyara will examine the contents\n |            of the target files to determine if they are an eligible image file\n |            type. For example, a JPEG file named 'picture.txt' will be processed by\n |            the OCR engine. file_magic uses the Linux \"file\" command.\n |\n |          save_context -- If True, when a file matches a yara rule, the returned\n |            results dictionary will also include the full ocr text of the matched\n |            file. 
This text can be further processed by the user if needed.\n |\n |  show_progress(self) -\u003e None\n |      Generate a progress bar based on the number of items remaining in queue.\n |\n |  ----------------------------------------------------------------------\n |  Static methods defined here:\n |\n |  check_file_type(path:str) -\u003e str\n |      Use the Linux \"file\" command to determine a file's type based on contents\n |      instead of file extension.\n |\n |      Arguments:\n |          path -- A string file path to be processed\n |\n |  ----------------------------------------------------------------------\n |  Data descriptors defined here:\n |\n |  __dict__\n |      dictionary for instance variables (if defined)\n |\n |  __weakref__\n |      list of weak references to the object (if defined)\n |\n |  yara_output\n |      Returns the same output format as the standard yara program:\n |      RuleName FileName, FileName\n |      RuleName FileName...\n |\n |      Where:\n |        RuleName is the name of the rule that was matched\n |        FileName is the name of the file in which the match was found\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbandrel%2Focyara","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbandrel%2Focyara","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbandrel%2Focyara/lists"}