{"id":20050017,"url":"https://github.com/hyperimpose/minutia","last_synced_at":"2025-07-08T08:33:42.612Z","repository":{"id":257486878,"uuid":"857794709","full_name":"hyperimpose/minutia","owner":"hyperimpose","description":"Summarizing the content of internet services","archived":false,"fork":false,"pushed_at":"2024-10-12T01:22:02.000Z","size":99,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-02T08:29:59.644Z","etag":null,"topics":["html","http","hyperimpose","hyperlink","minutia","parsing","python","url"],"latest_commit_sha":null,"homepage":"https://hyperimpose.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyperimpose.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-15T16:21:25.000Z","updated_at":"2024-10-12T01:22:06.000Z","dependencies_parsed_at":"2025-03-02T08:38:33.409Z","dependency_job_id":null,"html_url":"https://github.com/hyperimpose/minutia","commit_stats":null,"previous_names":["hyperimpose/minutia"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hyperimpose/minutia","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyperimpose%2Fminutia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyperimpose%2Fminutia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyperimpose%2Fminutia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyperimpose%2Fminutia/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyperimpose","download_url":"https://codeload.github.com/hyperimpose/minutia/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyperimpose%2Fminutia/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264231935,"owners_count":23576742,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html","http","hyperimpose","hyperlink","minutia","parsing","python","url"],"created_at":"2024-11-13T11:53:42.970Z","updated_at":"2025-07-08T08:33:42.588Z","avatar_url":"https://github.com/hyperimpose.png","language":"Python","readme":"#+OPTIONS: ^:nil\n\n* minutia\n\nminutia is a library for summarizing the content of various internet services.\nIt uses a URL based interface. Each URL is mapped to a dictionary that contains data about the resource linked,\nsuch as its title, its filesize and more.\n\n** Features\n\n- Supported protocols: HTTP\n- Customized extraction mechanisms for various services.\n- Calculates a TTL value that can be used for caching.\n- Supports multiple languages.\n- Built with Python async.\n- Includes tests that can be used to ensure that each service is parsed properly.\n\n- [media] Supports many content types: HTML, PDF, PNG, JPEG, MP3, ACC, FLAC and more.\n- [explicit] Detects explicit content using metadata and ML models.\n\n** Usage\n\nBefore this library can be used, it must be initialized. 
After initialization it can then be configured.\nWhen you are done using it, you should also terminate it.\n\nBelow is a simple example of how to use the library:\n#+BEGIN_SRC python\n  import asyncio\n  import minutia\n\n  async def main():\n      # Initialize\n      await minutia.init()\n\n      # Configure\n      minutia.set_lang(\"en-GB\")\n      minutia.set_max_filesize(6_553_500)  # 6.5 MB in bytes\n\n      # Use the library\n      status, data = await minutia.http.get(\"http://hyperimpose.org\")\n\n      # On success `status' is \"ok\" and `data' will be a dictionary like:\n      # {\n      #     \"@\": \"http:html\",\n      #     \"t\": \"This is the title of the page\"\n      # }\n\n      # Terminate\n      await minutia.terminate()\n\n  asyncio.run(main())\n#+END_SRC\n\n** Installation\n\nYou can install this library using pip:\n#+BEGIN_SRC sh\n  pip install \"git+https://github.com/hyperimpose/minutia.git@master\"\n#+END_SRC\n\n** Optional Dependencies\nSome of the features listed above are optional. You can tell pip which optional features you want installed\nby using the syntax ~[media,explicit]~ after the installation URL:\n\n#+BEGIN_SRC sh\n  pip install \"git+https://github.com/hyperimpose/minutia.git@master\"[media]\n#+END_SRC\n\nThe availability of those dependencies is checked once when minutia is imported.\nWhen an optional dependency is missing, the library logs a warning to inform you.\n\nTable of available optional features:\n|----------+-----------------------------------------------------|\n| Name     | Description                                         |\n|----------+-----------------------------------------------------|\n| media    | Extract extra information from various media files. |\n| explicit | See [[#explicit][Explicit]].                                       
|\n|----------+-----------------------------------------------------|\n\n*** Explicit\nThe library is able to detect explicit content using a Machine Learning (ML) model.\nTo run the ML model, TensorFlow and other dependencies need to be installed, which has the following caveats:\n- Increased installation size.\n- When the code is started for the first time it will need to download and cache the ML model.\n- *TensorFlow requires a CPU with AVX support*\n  - If this requirement is not met the program will crash and show this message: ~Illegal Instruction~.\n\nTo avoid those issues this feature is provided through a secondary service. The library can then be configured,\nat runtime, to connect to that service over a Unix socket.\n\nAn example workflow to use this feature:\n- Do NOT use the ~explicit~ flag when installing minutia as a library for your project.\n  - This keeps the size of the library relatively small.\n- Create a new venv, clone this repo and install the library with the ~explicit~ flag.\n  - Install with: ~pip install ./minutia[explicit]~\n  - Run ~python minutia explicit -u /tmp/your_path_here.sock~ to start the service.\n    - CAUTION: the file path given after the -u option will be replaced on startup.\n- After initializing the library in your code, set up the explicit client:\n  - Add this code: ~await minutia.setup_explicit_unix_socket(\"/tmp/your_path_here.sock\")~\n\n**** Video support\nThe explicit service can provide video support. Some extra runtime dependencies are needed for this to work.\nThe availability of those dependencies is checked once when the service is started.\nWhen a runtime dependency is missing, the service logs a warning to inform you.\n\nTable of extra runtime dependencies:\n|---------+---------------------------------------|\n| Name    | Description                           |\n|---------+---------------------------------------|\n| ffmpeg  | Enables explicit detection of videos. 
|\n| ffprobe | MUST be installed with ~ffmpeg~.      |\n|---------+---------------------------------------|\n\n** API\n\n*** minutia\n\n**** Initialization / Termination\n\n|-------------------+------------------------|\n| Callable          | Description            |\n|-------------------+------------------------|\n| async init()      | Initialize the library |\n| async terminate() | Terminate the library  |\n|-------------------+------------------------|\n\n**** Configuration\n\n|-----------------------------+------------------------------------------------------------+---------|\n| Callable                    | Description                                                | Default |\n|-----------------------------+------------------------------------------------------------+---------|\n| set_http_useragent(ua: str) | The useragent to use when making HTTP requests.            |         |\n| set_lang(lang: str)         | The default language to request content in. The value      | \"en\"    |\n|                             | is passed in HTTP headers such as Accept-Language.         |         |\n| set_max_filesize(i: int)    | The max number of bytes to download for deep inspection of | 14_600  |\n|                             | supported media files. Set to \u003c= 0 to disable the feature. |         |\n| set_max_htmlsize(i: int)    | The max bytes to download when parsing HTML pages.         | 14_600  |\n|-----------------------------+------------------------------------------------------------+---------|\n\n**** Setup\n\n|---------------------------------------------+-------------------------------------------------+---------|\n| Callable                                    | Description                                     | Default |\n|---------------------------------------------+-------------------------------------------------+---------|\n| async setup_explicit_unix_socket(path: str) | Path to the explicit service. 
When called a new | \"\"      |\n|                                             | client is started. \"\" disables the feature.     |         |\n|---------------------------------------------+-------------------------------------------------+---------|\n\n*** minutia.http\n\nThis module is used when working with HTTP/HTTPS links.\n\n|--------------------------------------+--------------------------------------------------------------|\n| Callable                             | Description                                                  |\n|--------------------------------------+--------------------------------------------------------------|\n| async get(link: str, lang: str = \"\") | Visit the link and return information about it. If `lang' is |\n|                                      | given then it will be used instead of the default lang set.  |\n|--------------------------------------+--------------------------------------------------------------|\n\n*** Logging\n\nminutia uses the ~logging~ module to log various events. Everything is logged under the ~minutia~\nlogger.\n\nWhen the library is imported it might log information about the availability of various features. If you want\nto capture those messages, you must configure logging in your application before importing minutia.\n\n** Developer Notes\n\nThe library has an extra installation option ~dev~ to be used during development. 
It is built using Flit.\n\nYou can set up a development environment with all the dependencies by running the following:\n#+BEGIN_SRC sh\n  python -m venv venv\n  source venv/bin/activate\n  pip install flit\n  flit install\n#+END_SRC\n\n*** Project Structure\nThis project provides both a library for use in other Python projects and standalone services to provide\nextra features to the library.\n\n#+BEGIN_SRC\n  /minutia            The library to import in other programs\n  /services/explicit  The explicit service\n  /__main__.py        Magic module to start the services from the CLI\n#+END_SRC\n\n** License\n\nminutia is licensed under the [[https://www.gnu.org/licenses/agpl-3.0.html][GNU Affero General Public License version 3 (AGPLv3)]].\n#+BEGIN_CENTER\n[[https://www.gnu.org/graphics/agplv3-with-text-162x68.png]]\n#+END_CENTER\n\nA copy of this license is included in the file [[../../COPYING][COPYING]].\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyperimpose%2Fminutia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyperimpose%2Fminutia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyperimpose%2Fminutia/lists"}