{"id":20196397,"url":"https://github.com/raul23/organize-ebooks","last_synced_at":"2025-04-10T10:43:26.633Z","repository":{"id":65407522,"uuid":"589673197","full_name":"raul23/organize-ebooks","owner":"raul23","description":"Automatically organize folders with potentially huge amounts of unorganized ebooks","archived":false,"fork":false,"pushed_at":"2024-04-09T19:12:56.000Z","size":1626,"stargazers_count":19,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-04-09T23:18:24.510Z","etag":null,"topics":["calibre","catdoc","djvu","djvulibre","docker","ebook","ebook-convert","epub","ghostscript","isbn","linux","macos","ocr","organizer","pdf","poppler","python","tesseract","textutil","ubuntu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raul23.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-01-16T17:04:34.000Z","updated_at":"2024-04-14T22:22:04.342Z","dependencies_parsed_at":"2024-04-14T22:22:03.321Z","dependency_job_id":"c3dfab8d-654c-4f22-9778-f12c3db471c3","html_url":"https://github.com/raul23/organize-ebooks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raul23%2Forganize-ebooks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raul23%2Forganize-ebooks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raul23%2Forganize-ebooks/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raul23%2Forganize-ebooks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raul23","download_url":"https://codeload.github.com/raul23/organize-ebooks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248200206,"owners_count":21063850,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["calibre","catdoc","djvu","djvulibre","docker","ebook","ebook-convert","epub","ghostscript","isbn","linux","macos","ocr","organizer","pdf","poppler","python","tesseract","textutil","ubuntu"],"created_at":"2024-11-14T04:23:51.515Z","updated_at":"2025-04-10T10:43:26.607Z","avatar_url":"https://github.com/raul23.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"===============\norganize-ebooks\n===============\nAutomatically organize folders with potentially huge amounts of unorganized ebooks. This is a Python port of `organize-ebooks.sh \u003chttps://github.com/na--/ebook-tools/blob/master/organize-ebooks.sh\u003e`_ \nfrom `ebook-tools \u003chttps://github.com/na--/ebook-tools\u003e`_ written in shell by `na-- \u003chttps://github.com/na--\u003e`_.\n\nThis is done by renaming the files with proper names and moving them to other folders. The new names are obtained based on the ISBNs\nfound in the ebook files. 
These ISBNs are extracted by using progressively more complex methods (from searching the filename to OCR-ing\nthe given file) depending on the user's specified options.\n\nAbout\n=====\n`organize_ebooks.py \u003c./organize_ebooks/scripts/organize_ebooks.py\u003e`_ automatically organizes folders with potentially huge amounts of unorganized \nebooks. This is done by renaming the files with proper names and moving them to other folders.\n\nThe new names are obtained based on the ISBNs found in the ebook files. These ISBNs are extracted by using progressively more complex methods (from \nsearching the filename to OCR-ing the given file) depending on the user's specified options (see `Basic command \u003c#basic-command\u003e`_).\n\nIt is a Python port of `organize-ebooks.sh \u003chttps://github.com/na--/ebook-tools/blob/master/organize-ebooks.sh\u003e`_ \nfrom `ebook-tools \u003chttps://github.com/na--/ebook-tools\u003e`_ written in shell by `na-- \u003chttps://github.com/na--\u003e`_.\n\n`:star:` Other related Python projects based on ``ebook-tools``:\n\n- `convert-to-txt \u003chttps://github.com/raul23/convert-to-txt\u003e`_: convert documents (pdf, djvu, epub, word) to txt\n- `find-isbns \u003chttps://github.com/raul23/find-isbns\u003e`_: find ISBNs in ebooks (pdf, djvu, epub) or in any string given as input to the script\n- `ocr \u003chttps://github.com/raul23/ocr\u003e`_: run OCR on documents (pdf, djvu, and images)\n- `split-ebooks-into-folders \u003chttps://github.com/raul23/split-ebooks-into-folders\u003e`_: split the supplied ebook files into \n  folders with consecutive names\n- `interactive-organizer \u003chttps://github.com/raul23/interactive-organizer\u003e`_: interactively and manually check the ebook files\n  that were organized by ``organize_ebooks``.\n  \nDependencies\n============\n`:warning:` \n\n   You can ignore this section and go straight to pulling the `Docker image \u003c#installing-with-docker-recommended\u003e`_ which contains all the \n   
required dependencies and the Python package ``organize_ebooks`` already installed. This section is more for showing how I set up my system\n   when porting the shell script `organize-ebooks.sh \u003chttps://github.com/na--/ebook-tools/blob/master/organize-ebooks.sh\u003e`_ et al. to Python.\n\nThis is the environment on which the Python package `organize_ebooks \u003c./organize_ebooks/\u003e`_ was developed and tested:\n\n* **Platform:** macOS\n* **Python**: version **3.7**\n* `p7zip \u003chttps://sourceforge.net/projects/p7zip/\u003e`_ for ISBN searching in ebooks that are in archives.\n* `Tesseract \u003chttps://github.com/tesseract-ocr/tesseract\u003e`_ for running OCR on books - version 4 gives \n  better results. \n  \n  `:warning:` OCR is a slow, resource-intensive process. Hence, by default only the first 7 and last 3 pages are OCR-ed through the option\n  ``--ocr-only-first-last-pages``. More info at `Script options \u003c#script-options\u003e`_.\n* `Ghostscript \u003chttps://www.ghostscript.com/\u003e`_: ``gs`` converts *pdf* to *png* (useful for OCR)\n* `textutil \u003chttps://ss64.com/osx/textutil.html\u003e`_ or `catdoc \u003chttp://www.wagner.pp.ru/~vitus/software/catdoc/\u003e`_: for converting *doc* to *txt*\n\n  **NOTE:** On macOS, you don't need ``catdoc`` since it has the built-in ``textutil``\n  command-line tool that converts any *txt*, *html*, *rtf*, \n  *rtfd*, *doc*, *docx*, *wordml*, *odt*, or *webarchive* file\n* `DjVuLibre \u003chttp://djvu.sourceforge.net/\u003e`_: \n\n  - it includes ``ddjvu`` for converting *djvu* to *tif* image (useful for OCR), and ``djvused`` to get the number of pages from a *djvu* document\n  - it includes ``djvutxt`` for converting *djvu* to *txt*\n  \n    `:warning:` \n  \n    - To access the *djvu* command line utilities and their documentation, you must set the shell variables ``PATH`` and ``MANPATH`` appropriately. 
\n      This can be achieved by invoking a convenient shell script hidden inside the application bundle::\n  \n       $ eval `/Applications/DjView.app/Contents/setpath.sh`\n   \n      **Ref.:** ReadMe from DjVuLibre\n    - You need to symlink ``djvutxt`` in ``/usr/local/bin`` (or add it to ``$PATH``)\n* `poppler \u003chttps://poppler.freedesktop.org/\u003e`_: \n\n  - it includes ``pdftotext`` for converting *pdf* to *txt*\n  - it includes ``pdfinfo`` to get the number of pages from a *pdf* document if `mdls (macOS) \u003chttps://ss64.com/osx/mdls.html\u003e`_ is not found.\n\n`:information_source:` *epub* is converted to *txt* by using ``unzip -c {input_file}``\n\n|\n\n**Optionally:**\n\n- `calibre \u003chttps://calibre-ebook.com/\u003e`_: \n\n  - Versions **2.84** and above are preferred because of their ability to manually specify from which\n    specific online source we want to fetch metadata. For earlier versions you have to set \n    ``ISBN_METADATA_FETCH_ORDER`` and ``ORGANIZE_WITHOUT_ISBN_SOURCES`` to empty strings.\n\n  - for fetching metadata from online sources\n  \n  - for getting an ebook's metadata with ``ebook-meta`` in order to search it for ISBNs\n\n  - for converting {*pdf*, *djvu*, *epub*, *msword*} to *txt* (for ISBN searching) by using calibre's \n    `ebook-convert \u003chttps://manual.calibre-ebook.com/generated/en/ebook-convert.html\u003e`_\n  \n    `:warning:` ``ebook-convert`` is slower than the other conversion tools (``textutil``, ``catdoc``, ``pdftotext``, ``djvutxt``)\n\n- **Optionally** `poppler \u003chttps://poppler.freedesktop.org/\u003e`_, `catdoc \u003chttp://www.wagner.pp.ru/~vitus/software/catdoc/\u003e`_ \n  and `DjVuLibre \u003chttp://djvu.sourceforge.net/\u003e`_ can be installed for **faster** conversion of ``.pdf``, ``.doc`` and ``.djvu`` files\n  respectively to ``.txt`` than with calibre.\n\n- **Optionally** the `Goodreads \u003chttps://www.mobileread.com/forums/showthread.php?t=130638\u003e`_ and \n  `WorldCat xISBN 
\u003chttps://github.com/na--/calibre-worldcat-xisbn-metadata-plugin\u003e`_ calibre plugins can be installed for better metadata fetching.\n\n|\n\n`:star:`\n\n  If you only install **calibre** among these dependencies, you can still have\n  a functioning program that will organize ebook collections: \n  \n  * fetching metadata from online sources will work: by `default \n    \u003chttps://manual.calibre-ebook.com/generated/en/fetch-ebook-metadata.html#\n    cmdoption-fetch-ebook-metadata-allowed-plugin\u003e`__\n    **calibre** comes with Amazon and Google sources among others\n  * conversion to *txt* will work: `calibre`'s own ``ebook-convert`` tool\n    will be used. However, accuracy and performance will be affected as explained \n    in the list of dependencies above.\n\nInstalling with Docker (Recommended) ⭐\n=======================================\nInstallation instructions\n-------------------------\n`:information_source:` \n\n  It is recommended to install the Python package `organize_ebooks \u003c./organize_ebooks/\u003e`_ with **Docker** because the Docker\n  container has all the many `dependencies \u003c#dependencies\u003e`_ already installed along with the Python package ``organize_ebooks``.\n\n1. Pull the Docker image from `hub.docker.com \u003chttps://hub.docker.com/repository/docker/raul23/organize/general\u003e`_:\n\n   .. code-block:: bash\n\n      docker pull raul23/organize:latest\n\n2. Run the Docker container:\n\n   .. code-block:: bash\n\n      docker run -it -v /host/input/folder:/unorganized-books raul23/organize:latest\n   \n   `:information_source:` \n   \n      - ``/host/input/folder`` is a directory within your OS that can contain all the ebooks to be organized and\n        is mounted as ``/unorganized-books`` within the Docker container.\n      - You can use the ``-v`` option multiple times to mount several host folders within the container, e.g.:\n        \n        .. 
code-block:: bash\n        \n           docker run -it -v /host/input/folder:/unorganized-books -v /host/output/folder:/output-folder raul23/organize:latest\n      - ``raul23/organize:latest`` is the name of the image upon which the Docker container will be created.\n\n3. Now that you are within the Docker container, you can run the Python script ``organize_ebooks`` with the desired `options \u003c#script-options\u003e`_::\n\n    user:~$ organize_ebooks /unorganized-books/\n   \n   `:information_source:` \n   \n       - This basic command instructs the script ``organize_ebooks`` to organize the ebooks within ``/unorganized-books/``\n         and to save the renamed ebooks within the working directory which is the default location of the ``-o`` option (output folder).\n       - When you log in as ``user`` (non-root) within the Docker container, your working directory is ``/ebook-tools``.\n\nContent of the Docker image\n---------------------------\n`:information_source:` \n \n - The layers of the Docker image can be checked in detail at the project's `Docker repo \n   \u003chttps://hub.docker.com/layers/raul23/organize/latest/images/sha256-b7a93954ff08a59a1539a45b8811d4740ca6d5fc87fc9e37d80f206fd456f55a?context=repo\u003e`_ where you can find the commands used in the Dockerfile for installing all the dependencies in the base OS (Ubuntu 18.04).\n - This Python-based Docker image is derived from the project `ebook-tools \u003chttps://github.com/na--/ebook-tools\u003e`_ (shell scripts\n   by `na-- \u003chttps://github.com/na--\u003e`_) which you can find at the `Docker Hub \u003chttps://hub.docker.com/r/ebooktools/scripts/tags\u003e`_. One of the main \n   differences is that the base OS is Ubuntu 18.04 for this image and Debian for ``ebook-tools``.\n\nThe `Docker image \u003chttps://hub.docker.com/repository/docker/raul23/organize/general\u003e`_ for this project contains the following components:\n\n1. Ubuntu 18.04: the base system of the Docker image\n2. 
All the `dependencies \u003c#dependencies\u003e`_ (required and optional) needed for supporting all the features (e.g. OCR, document \n   conversion to text) offered by the package ``organize_ebooks``:\n\n   - Python 3.6.9 along with ``setuptools`` and ``wheel``\n   - p7zip: ``7z``\n   - Tesseract\n   - Ghostscript: ``gs``\n   - ``catdoc``\n   - DjVuLibre: ``ddjvu``, ``djvused``, ``djvutxt``\n   - Poppler: ``pdftotext`` and ``pdfinfo``\n   - calibre: ``ebook-convert``, ``ebook-meta``, calibre's metadata plugins (including Goodreads and WorldCat xISBN)\n   \n     The Goodreads plugin (goodreads.zip) is from this forum post (by a calibre Developer) (2022-12-23): \n     `mobileread.com \u003chttps://www.mobileread.com/forums/showpost.php?p=4283801\u0026postcount=5\u003e`_\n   - ``unzip``\n3. The Python package ``organize_ebooks`` is installed. You can call the corresponding script with any of the `options \u003c#script-options\u003e`_::\n\n    user:~$ organize_ebooks /unorganized-books/\n4. The Python package `interactive_organizer \u003chttps://github.com/raul23/interactive-organizer\u003e`_ is installed. \n   You can call the corresponding script with any of the `options \u003chttps://github.com/raul23/interactive-organizer#script-s-list-of-options\u003e`_::\n\n    user:~$ interactive_organizer /uncertain/\n5. ``user``: a user named ``user`` is created with UID 1000. ``user`` doesn't have root privileges within the Docker container. Thus,\n   among other things, you can't install packages with ``apt-get install``.\n\nInstalling the development version\n==================================\nInstall\n-------\n`:warning:` \n\n   You can ignore this section and go straight to pulling the `Docker image \u003c#installing-with-docker-recommended\u003e`_ which contains all the \n   required dependencies and the Python package ``organize_ebooks`` already installed. 
This section is for installing the bleeding-edge\n   version of the Python package ``organize_ebooks`` after you have installed the many `dependencies \u003c#dependencies\u003e`_ yourself.\n\nAfter you have installed the `dependencies \u003c#dependencies\u003e`_, you can then install the development (bleeding-edge) \nversion of the package `organize_ebooks \u003c./organize_ebooks/\u003e`_:\n\n.. code-block:: bash\n \n   pip install git+https://github.com/raul23/organize-ebooks#egg=organize-ebooks\n \n**NOTE:** the development version has the latest features \n \n**Test installation**\n\n1. Test your installation by importing ``organize_ebooks`` and printing its\n   version:\n   \n   .. code-block:: bash\n\n      python -c \"import organize_ebooks; print(organize_ebooks.__version__)\"\n\n2. You can also test that you have access to the ``organize_ebooks.py`` script by\n   showing the program's version:\n\n   .. code-block:: bash\n\n      organize_ebooks --version\n\nUninstall\n---------\nTo uninstall the development version of the package `organize_ebooks \u003c./organize_ebooks/\u003e`_:\n\n.. 
code-block:: bash\n\n   pip uninstall organize_ebooks\n\nScript options\n==============\nList of options\n---------------\nTo display the script `organize_ebooks.py \u003c./organize_ebooks/scripts/organize_ebooks.py\u003e`_ list of options and their descriptions::\n\n  $ organize_ebooks -h\n\n  usage: organize_ebooks [OPTIONS] {folder_to_organize}\n\n  Automatically organize folders with potentially huge amounts of unorganized ebooks.\n  This is done by renaming the files with proper names and moving them to other folders.\n\n  This script is based on the great ebook-tools written in shell by na-- (See https://github.com/na--/ebook-tools).\n\n  General options:\n    -h, --help                                      Show this help message and exit.\n    -v, --version                                   Show program's version number and exit.\n    -q, --quiet                                     Enable quiet mode, i.e. nothing will be printed.\n    --verbose                                       Print various debugging information, e.g. print traceback when there is an exception.\n    -d, --dry-run                                   If this is enabled, no file rename/move/symlink/etc. operations will actually be executed.\n    -s, --symlink-only                              Instead of moving the ebook files, create symbolic links to them.\n    -k, --keep-metadata                             Do not delete the gathered metadata for the organized ebooks, instead save it in an \n                                                    accompanying file together with each renamed book. It is very useful for semi-automatic \n                                                    verification of the organized files for additional verification, indexing or processing at \n                                                    a later date.\n    -r, --reverse                                   If this is enabled, the files will be sorted in reverse (i.e. descending) order. 
By default, \n                                                    they are sorted in ascending order.\n    --log-level {debug,info,warning,error}          Set logging level. (default: info)\n    --log-format {console,only_msg,simple}          Set logging formatter. (default: only_msg)\n\n  Convert-to-txt options:\n    --djvu {djvutxt,ebook-convert}                  Set the conversion method for djvu documents. (default: djvutxt)\n    --epub {epubtxt,ebook-convert}                  Set the conversion method for epub documents. (default: epubtxt)\n    --msword {catdoc,textutil,ebook-convert}        Set the conversion method for msword documents. (default: textutil)\n    --pdf {pdftotext,ebook-convert}                 Set the conversion method for pdf documents. (default: pdftotext)\n\n  Options related to extracting ISBNS from files and finding metadata by ISBN:\n    --max-isbns NUMBER                              Maximum number of ISBNs to try when fetching metadata from online sources by ISBNs. (default: 5)\n    -i, --isbn-regex ISBN_REGEX                     This is the regular expression used to match ISBN-like numbers in the supplied books. (default:\n                                                    (?\u003c![0-9])(-?9-?7[789]-?)?((-?[0-9]-?){9}[0-9xX])(?![0-9]))\n    --isbn-blacklist-regex REGEX                    Any ISBNs that were matched by the ISBN_REGEX above and pass the ISBN validation algorithm are\n                                                    normalized and passed through this regular expression. Any ISBNs that successfully match against \n                                                    it are discarded. The idea is to ignore technically valid but probably wrong numbers like \n                                                    0123456789, 0000000000, 1111111111, etc.. 
(default: ^(0123456789|([0-9xX])\\2{9})$)\n    --isbn-direct-files REGEX                       This is a regular expression that is matched against the MIME type of the searched files. Matching \n                                                    files are searched directly for ISBNs, without converting or OCR-ing them to .txt first. \n                                                    (default: ^text/(plain|xml|html)$)\n    --isbn-ignored-files REGEX                      This is a regular expression that is matched against the MIME type of the searched files. Matching \n                                                    files are not searched for ISBNs beyond their filename. By default, it tries to ignore .gif and \n                                                    .svg images, audio, video and executable files and fonts. \n                                                    (default: ^(image/(gif|svg.+)|application/(x-shockwave-flash|CDFV2|vnd.ms-\n                                                    opentype|x-font-ttf|x-dosexec|vnd.ms-excel|x-java-applet)|audio/.+|video/.+)$)\n    --reorder-files LINES [LINES ...]               These options specify if and how we should reorder the ebook text before searching for ISBNs in \n                                                    it. By default, the first 400 lines of the text are searched as they are, then the last 50 are \n                                                    searched in reverse and finally the remainder in the middle. This reordering is done to improve \n                                                    the odds that the first found ISBNs in a book text actually belong to that book (ex. from the \n                                                    copyright section or the back cover), instead of being random ISBNs mentioned in the middle of the \n                                                    book. No part of the text is searched twice, even if these regions overlap. 
Set it\n                                                    to `False` to disable the functionality or `first_lines last_lines` to enable it with the \n                                                    specified values. (default: 400 50)\n    --irs, --isbn-return-separator SEPARATOR        This specifies the separator that will be used when returning any found ISBNs. (default: ' - ')\n    -m, ---metadata-fetch-order METADATA_SOURCE [METADATA_SOURCE ...]\n                                                    This option allows you to specify the online metadata sources and order in which the subcommands \n                                                    will try searching in them for books by their ISBN. The actual search is done by calibre's `fetch-\n                                                    ebook-metadata` command-line application, so any custom calibre metadata plugins can also be used. \n                                                    To see the currently available options, run `fetch-ebook-metadata --help` and check the \n                                                    description for the `--allowed-plugin` option. If you use Calibre versions that are older than \n                                                    2.84, it's required to manually set this option to an empty string. \n                                                    (default: ['Goodreads', 'Google', 'Amazon.com', 'ISBNDB', 'WorldCat xISBN', 'OZON.ru'])\n\n  OCR options:\n    --ocr, --ocr-enabled {always,true,false}        Whether to enable OCR for .pdf, .djvu and image files. It is disabled by default. (default: false)\n    --ocrop, --ocr-only-first-last-pages PAGES PAGES\n                                                    Value 'n m' instructs the script to convert only the first n and last m pages when OCR-ing ebooks. 
\n                                                    (default: 7 3)\n\n  Organize options:\n    --skip-archives                                 Skip all archives (e.g. zip, 7z) except epub files.\n    -c, --corruption-check {check_only,true,false}  `check_only`: do not organize or rename files, just check them for corruption (ex. zero-filled \n                                                    files, corrupt archives or broken .pdf files). `true`: check corruption and organize/rename files. \n                                                    `false`: skip corruption check. This option is useful with the `output-folder-corrupt` option.\n                                                    (default: true)\n    -t, --tested-archive-extensions REGEX           A regular expression that specifies which file extensions will be tested with `7z t` for \n                                                    corruption.\n                                                    (default: ^(7z|bz2|chm|arj|cab|gz|tgz|gzip|zip|rar|xz|tar|epub|docx|odt|ods|cbr|cbz|maff|iso)$)\n    --owi, --organize-without-isbn                  Specify whether the script will try to organize ebooks if there were no ISBN found in the book or \n                                                    if no metadata was found online with the retrieved ISBNs. If enabled, the script will first try to \n                                                    use calibre's `ebook-meta` command-line tool to extract the author and title metadata from the \n                                                    ebook file. The script will try searching the online metadata sources (`organize-without-isbn-\n                                                    sources`) by the extracted author \u0026 title and just by title. 
If there is no useful metadata or \n                                                    nothing is found online, the script will try to use the filename for searching.\n    --owis, --organize-without-isbn-sources METADATA_SOURCE [METADATA_SOURCE ...]\n                                                    This option allows you to specify the online metadata sources in which the script will try \n                                                    searching for books by non-ISBN metadata (i.e. author and title). The actual search is done by \n                                                    calibre's `fetch-ebook-metadata` command-line application, so any custom calibre metadata plugins \n                                                    can also be used. To see the currently available options, run `fetch-ebook-metadata --help` and \n                                                    check the description for the `--allowed-plugin` option. Because Calibre versions older than 2.84 \n                                                    don't support the `--allowed-plugin` option, if you want to use such an old Calibre\n                                                    version you should manually set `organize_without_isbn_sources` to an empty string. \n                                                    (default: ['Goodreads', 'Google', 'Amazon.com'])\n    -w, --without-isbn-ignore REGEX                 This is a regular expression that is matched against lowercase filenames. All files that do not \n                                                    contain ISBNs are matched against it and matching files are ignored by the script, even if \n                                                    `organize-without-isbn` is true. The default value is calibrated to match most periodicals \n                                                    (magazines, newspapers, etc.) so the script can ignore them. 
(default: complex default value, see \n                                                    the README)\n    --pamphlet-included-files REGEX                 This is a regular expression that is matched against lowercase filenames. All files that do not \n                                                    contain ISBNs and do not match `without-isbn-ignore` are matched against it and matching files are \n                                                    considered pamphlets by default. They are moved to `output_folder_pamphlets` if set, otherwise \n                                                    they are ignored. (default: \\.(png|jpg|jpeg|gif|bmp|svg|csv|pptx?)$)\n    --pamphlet-excluded-files REGEX                 This is a regular expression that is matched against lowercase filenames. If files do not contain \n                                                    ISBNs and match against it, they are NOT considered as pamphlets, even if they have a small size \n                                                    or number of pages. (default: \\.(chm|epub|cbr|cbz|mobi|lit|pdb)$)\n    --pamphlet-max-pdf-pages PAGES                  .pdf files that do not contain valid ISBNs and have a lower number pages than this are considered \n                                                    pamplets/non-ebook documents. (default: 50)\n    --pamphlet-max-filesize-kib SIZE                Other files that do not contain valid ISBNs and are below this size in KiBs are considered \n                                                    pamplets/non-ebook documents. (default: 250)\n\n  Input/Output options:\n    folder_to_organize                              Folder containing the ebook files that need to be organized.\n    -o, --output-folder PATH                        The folder where ebooks that were renamed based on the ISBN metadata will be moved to. 
(default:\n                                                    /Users/test/PycharmProjects/testing/organize/test_installation)\n    --ofu, --output-folder-uncertain PATH           If `organize-without-isbn` is enabled, this is the folder to which all ebooks that were renamed \n                                                    based on non-ISBN metadata will be moved to. (default: None)\n    --ofc, --output-folder-corrupt PATH             If specified, corrupt files will be moved to this folder. (default: None)\n    --ofp, --output-folder-pamphlets PATH           If specified, pamphlets will be moved to this folder. (default: None)\n    --oft, --output-filename-template TEMPLATE      This specifies how the filenames of the organized files will look. It is a bash string that is \n                                                    evaluated so it can be very flexible (and also potentially unsafe). \n                                                    (default: ${d[AUTHORS]// \u0026 /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -}\n                                                    ${d[PUBLISHED]:+ (${d[PUBLISHED]%-*})}${d[ISBN]:+[${d[ISBN]}]}.${d[EXT]})\n  --ome, --output-metadata-extension EXTENSION      If `keep-metadata` is enabled, this is the extension of the additional metadata file that is saved \n                                                    next to each newly renamed file. 
(default: meta)\n\nExplaining some of the options/arguments\n----------------------------------------\n- ``--keep-metadata``: as stated in its description above, the metadata files that are created alongside the renamed ebook files\n  are useful for the script `interactive_organizer \u003chttps://github.com/raul23/interactive-organizer\u003e`_ which uses them for\n  various post-processing tasks such as showing the differences between the old and new filenames.\n- ``--log-level``: if it is set to the logging level ``warning``, the terminal will only show the documents that were\n  skipped (e.g. the file is an image) or failed (e.g. corrupted file).\n- ``--max-isbns``: especially when organizing epub files (they can contain many files since they are archives), \n  many valid ISBNs can be found and thus the fetching of metadata from online sources might take longer than usual.\n  By limiting the number of ISBNs to check, the script can run faster by not being bogged down by testing lots of ISBNs. And usually it is\n  the first ISBN found that is the correct one since it appears in the very first pages of the document which is the most\n  likely place to find it (the script searches for ISBNs in the first pages, then at the end, and finally in the middle of the file).\n- ``--skip-archives``: by default all archives (e.g. 7z, zip) are searched for ISBNs and this means that they will be decompressed and\n  each extracted file will be recursively searched for ISBNs. 
Thus you can skip these archives (except epub documents) when\n  organizing your ebooks by using this flag.\n- ``--corruption-check``: the corruption check with ``pdfinfo`` can be very sensitive, flagging some PDF files as corrupted even though\n  they can be opened without problems::\n  \n   Syntax Error: Dictionary key must be a name object\n   Syntax Error: Couldn't find trailer dictionary\n   \n  Thus by setting this option to 'false', you can skip any corruption check (whether by ``pdfinfo`` or ``7z``). \n  By default, the corruption check is enabled. Also, if you set it to 'check_only', only the corruption check will be performed, i.e.\n  no organization or renaming of ebooks will be done.\n- The choices for ``--ocr`` are {always, true, false}\n  \n  - 'always': If the conversion to text was successful but no ISBNs were found, then OCR is run on the document. Also, if the\n    conversion failed (e.g. its content is empty or doesn't contain any text), then OCR is applied to the document.\n  - 'true': OCR is applied to the document only if the conversion to text failed.\n  - 'false': No OCR is applied after the conversion to text.\n- ``--owi, --organize-without-isbn``: if no ISBNs could be found within the document, the document can still be organized \n  based on its author and/or title or filename by calling calibre's ``fetch-ebook-metadata`` command-line application which \n  fetches metadata from online metadata sources (by default they are 'Goodreads', 'Google', 'Amazon.com').\n  \n  These ebooks are then saved under the user-specified uncertain folder (``--ofu, --output-folder-uncertain``).\n\nScript usage\n============\nBasic command\n-------------\nAt a bare minimum, the script ``organize_ebooks`` requires an input folder containing the ebooks to organize. Thus, the following is one of the\nmost basic commands you can provide to the script:\n\n.. 
code-block:: bash\n\n   organize_ebooks ~/ebooks/input_folder/\n \nThe ebooks in the input folder will be searched for ISBNs. The script tries to find ISBNs in the given ebook \nfile by using progressively more \"expensive\" tactics (as stated in `lib.sh \u003chttps://github.com/na--/ebook-tools/blob/master/lib.sh#L519\u003e`_ \nfrom `ebook-tools \u003chttps://github.com/na--/ebook-tools\u003e`_). \n\nThese are the steps, in order, that the ``organize_ebooks`` script follows when searching for ISBNs in a given ebook \n(as soon as ISBNs are found, the script returns them):\n\n1. The first location it tries to find ISBNs is the filename. \n2. Then it checks the contents directly if it is a text file. \n3. The next place that is searched for ISBNs is the file metadata by calling calibre's ``ebook-meta``. \n4. The file is decompressed with ``7z`` if it is an archive and the extracted files are recursively searched for ISBNs (epubs are excluded from this \n   step even though they are basically zipped HTML files as explained in `epub and archives \u003c#epub-and-archives\u003e`_).\n5. The file is converted to ``txt`` and its text content is searched for ISBNs.\n6. If OCR is enabled (through the ``--ocr`` option), the file is OCR-ed and the resultant text content is searched for ISBNs.\n\nUseful commands\n---------------\nOrganize ebooks with and without ISBNs: ``--owi``\n\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\n.. code-block:: bash\n\n   organize_ebooks ~/input_folder/ -o ~/output_folder/ --ofc ~/corrupt/ --ofu ~/uncertain/ --owi\n\n`:information_source:`\n\n - ``--ofu, --output-folder-uncertain``: this folder will contain any document that could be identified based on non-ISBN metadata (e.g. title) \n   from online sources (e.g. Goodreads). 
However this folder is only used along with the flag ``--owi`` (explained next).\n - ``--owi, --organize-without-isbn``: This flag instructs the script to fetch metadata from online sources in case no ISBN could be found in \n   an ebook. The filename or the author and/or title are used for fetching metadata about the book.\n\nAdd more information to the filename (e.g. publisher or languages): ``--oft``\n\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\nBy default (see the `--oft \u003c#list-of-options\u003e`_ option), this is the bash string used as a template when naming ebooks::\n\n ${d[AUTHORS]// \u0026 /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -}${d[PUBLISHED]:+ (${d[PUBLISHED]%-*})}${d[ISBN]:+ [${d[ISBN]}]}.${d[EXT]}\n\nFor example, it produces the following filenames::\n\n Cory Doctorow - Little Brother (2008) [9780007288427]\n Eric von Hippel - Democratizing Innovation (2005) [0262002744]\n Steve Jones - Almost Like a Whale - The Origin of Species Updated (2000) [9780385409858].html\n\nIf you want to add other data to the filenames such as the publisher and languages, here is how you can modify\nthis bash string:\n\n.. code-block:: bash\n\n   ${d[AUTHORS]// \u0026 /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -} (${d[PUBLISHER]:+${d[PUBLISHER]}}, ${d[PUBLISHED]:+${d[PUBLISHED]%%-*}})${d[ISBN]:+ [${d[ISBN]}]}${d[LANGUAGES]:+ [${d[LANGUAGES]}]}.${d[EXT]}\n\nHere is an example of a filename that is generated based on this modified bash string::\n\n Cory Doctorow - With a Little Help (CorDoc-Company, Limited, 2010) [9780557943050] [eng].epub\n\nThis is how you would call the script ``organize_ebooks`` with this modified string (``--oft, --output-filename-template`` option):\n\n.. 
code-block:: bash \n\n   organize_ebooks ~/input -o ~/output/ --oft '${d[AUTHORS]// \u0026 /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -} (${d[PUBLISHER]:+${d[PUBLISHER]}}, ${d[PUBLISHED]:+${d[PUBLISHED]%%-*}})${d[ISBN]:+ [${d[ISBN]}]}${d[LANGUAGES]:+ [${d[LANGUAGES]}]}.${d[EXT]}'\n\n`:warning:` When calling the Python script, it is important to surround the bash string with **single** quotes (not double quotes, otherwise the\nshell will expand the string right on the command line, which we don't want).\n\nExample: organize a collection of assorted documents\n====================================================\nThrough the script ``organize_ebooks.py``\n-----------------------------------------\nTo organize a collection of documents (ebooks, pamphlets) through the script ``organize_ebooks.py``:\n\n.. code-block:: bash\n\n   organize_ebooks ~/input_folder/ -o ~/output_folder/ --ofp ~/pamphlets/\n \n`:information_source:` Explaining the command\n\n- I only specify the input folder and two output folders and thus ignore corrupted files (``--ofc`` not used) and \n  ebooks without ISBNs (``--ofu`` and ``--owi`` not used). These ignored files will just be skipped.\n- Also, books made up of images will be skipped since OCR was not chosen (``--ocr`` is set to 'false' by default).\n\nThrough the Python API\n----------------------\nLet's say we have this folder containing assorted documents:\n\n.. image:: ./images/input_folder.png\n   :target: ./images/input_folder.png\n   :align: left\n   :alt: Example: documents to organize\n\n|\n\nTo organize this collection of documents (ebooks, pamphlets) through the Python API (i.e. ``organize_ebooks`` package): \n\n.. 
code-block:: python\n\n   from organize_ebooks.lib import organizer\n\n   retcode = organizer.organize('/Users/test/ebooks/input_folder/',\n                                output_folder='/Users/test/ebooks/output_folder',\n                                output_folder_corrupt='/Users/test/ebooks/corrupt/',\n                                output_folder_pamphlets='/Users/test/ebooks/pamphlets/',\n                                output_folder_uncertain='/Users/test/ebooks/uncertain/',\n                                organize_without_isbn=True,\n                                keep_metadata=True)\n\n`:information_source:` Explaining the parameters of the function ``organize()``\n\n- The first parameter to ``organize()`` is the input folder containing the documents to organize.\n- ``output_folder``: this is the folder where every ebook whose ISBNs could be retrieved will be saved and renamed with a proper name. \n  Thus the program is highly confident that these ebooks are correctly labeled based on the found ISBNs.\n- ``output_folder_corrupt``: any document that was checked (with ``pdfinfo``) and found to be corrupted will be saved in this folder.\n- ``output_folder_pamphlets``: this is the folder that will contain any documents without valid ISBNs (e.g. HTML pages) that satisfy certain \n  criteria for pamphlets (such as small size and low number of pages).\n- ``output_folder_uncertain``: this folder will contain any documents that could be identified based on non-ISBN metadata (e.g. title) \n  from online sources (e.g. Goodreads). However this folder is only used if the flag ``organize_without_isbn`` (explained next) \n  is set to True.\n- ``organize_without_isbn``: If True, metadata is fetched from online sources in case no ISBN could be found in an ebook.\n- ``keep_metadata``: If True, a metadata file will be saved alongside each renamed ebook in the output folder. 
Also, documents that were\n  identified as corrupted will be saved along with a metadata file that will contain info about the detected corruption.\n- If everything went well with the organization of documents, ``organize()`` will return 0 (success). Otherwise, ``retcode`` will be 1 (failure).\n\nSample output:\n\n.. image:: ./images/script_output.png\n   :target: ./images/script_output.png\n   :align: left\n   :alt: Example: output terminal\n\n|\n\nContents of the different folders after the organization:\n\n.. image:: ./images/output_folder2.png\n   :target: ./images/output_folder2.png\n   :align: left\n   :alt: Example: output folder\n\n|\n\n.. image:: ./images/pamphlets_and_uncertain.png\n   :target: ./images/pamphlets_and_uncertain.png\n   :align: left\n   :alt: Example: pamphlets and uncertain folders\n\n|\n\nBy default when using the API, the loggers are disabled. If you want to enable them, call the\nfunction ``setup_log()`` (with the desired log level in all caps) at the beginning of your code before \nthe function ``organize()``:\n\n.. code-block:: python\n\n\n   from organize_ebooks.lib import organizer, setup_log\n\n   setup_log(logging_level='INFO')\n   retcode = organizer.organize('/Users/test/ebooks/input_folder/',\n                                output_folder='/Users/test/ebooks/output_folder',\n                                output_folder_corrupt='/Users/test/ebooks/corrupt/',\n                                output_folder_pamphlets='/Users/test/ebooks/pamphlets/',\n                                output_folder_uncertain='/Users/test/ebooks/uncertain/',\n                                organize_without_isbn=True,\n                                keep_metadata=True)\n\nSample output:\n\n.. image:: ./images/script_output_debug.png\n   :target: ./images/script_output_debug.png\n   :align: left\n   :alt: Example: output terminal with debug messages\n\nNotes\n=====\n- Having multiple metadata sources can slow down the ebooks organization. 
\n\n  - By default, we have for ``metadata-fetch-order``:: \n  \n     ['Goodreads', 'Amazon.com', 'Google', 'ISBNDB', 'WorldCat xISBN', 'OZON.ru']\n  \n  - By default, we have for ``organize-without-isbn-sources``::\n     \n     ['Goodreads', 'Amazon.com', 'Google']\n  \n  I usually get results from ``Google`` and ``Goodreads``.\n\n- Books that are sometimes **skipped** for insufficient information from filename\\\\ISBN or wrong filename\\\\ISBN\n\n  - Solution manuals\n  - Obscure and/or non-English books\n  - Very old books without any ISBN\n  - A book with an invalid ISBN from the get-go: only found two such books so far (French math books)\n  - Books with an invalid ISBN because when converting them to text for extracting their ISBNs, an extra number was added to \n    the ISBN (and not at the end but in the middle of it) which made it invalid\n    \n    For the moment, I don't know what to do about this case\n  - Books whose ISBNs couldn't be extracted because the conversion to text (with or without OCR) was not clean, i.e.\n    it added extra characters (not necessarily numbers) such as '·' or '\\uf73' between the numbers of the ISBN which \"broke\" the regex\n    \n    Solution: I had to modify ``find_isbns()`` to take into account these annoying \"artifacts\" from the conversion procedure\n\n  Obviously, they are skipped if I didn't enable OCR with the option ``--ocr-enabled`` (by default it is set to 'false')\n- I was trying to build a docker image based on `ebooktools/scripts \u003chttps://hub.docker.com/r/ebooktools/scripts/tags\u003e`_ \n  which contains all the necessary dependencies (e.g. calibre, Tesseract) for a Debian system and I was going to add the Python\n  package `organize_ebooks \u003c./organize_ebooks/\u003e`_ . 
However, I couldn't build an image from the base \n  OS ``debian:sid-slim`` as specified in its `Dockerfile \u003chttps://github.com/na--/ebook-tools/blob/master/Dockerfile\u003e`_::\n\n   The following signatures couldn't be verified because the public key is not available: NO_PUBKEY\n\n  Thus, I created an image from scratch starting with ``ubuntu:18.04`` that I am trying to push to hub.docker.com but I am always\n  getting the error ``requested access to the resource is denied`` (see `solution \u003c#docker-error-requested-access-to-the-resource-is-denied\u003e`_). \n\n``epub`` and archives\n---------------------\nWhen searching for ISBNs, the Python script ``organize_ebooks`` doesn't decompress *epub* files with ``7z`` because it would be a very slow\noperation since ``7z`` decompresses archives and recursively scans the contents which can be many files within an *epub* file. \nThen you would have to search ISBNs for each of the extracted files which would increase the running time of the script.\n\nInstead, *epub* files are decompressed with ``unzip -c`` which extracts files to stdout/screen and then the output is redirected to\na temporary text file. This text file is then searched for ISBNs. 
Hence the searching for ISBNs is quicker when applying ``unzip``\nto *epub* files than with ``7z``.\n\nAlso, the reason for using ``unzip`` is to make the conversion of *epub* files to text quicker and more accurate than calibre's \n``ebook-convert``.\n\n`:information_source:` epubs are basically zipped HTML files\n\nConversion to text: files supported\n-----------------------------------\nThese are the files that are supported for conversion to *txt* and the corresponding conversion tools used:\n\n+---------------------+------------------------------+------------------------------+------------------------------+\n| Files supported     | Conversion tool #1           | Conversion tool #2           | Conversion tool #3           |\n+=====================+==============================+==============================+==============================+\n| *pdf*               | ``pdftotext``                | ``ebook-convert`` (calibre)  | -                            |\n+---------------------+------------------------------+------------------------------+------------------------------+\n| *djvu*              | ``djvutxt``                  | ``ebook-convert`` (calibre)  | -                            |\n+---------------------+------------------------------+------------------------------+------------------------------+\n| *epub*              | ``epubtxt``                  | ``ebook-convert`` (calibre)  | -                            |\n+---------------------+------------------------------+------------------------------+------------------------------+\n| *docx* (Word 2007)  | ``ebook-convert`` (calibre)  | -                            | -                            |\n+---------------------+------------------------------+------------------------------+------------------------------+\n| *doc* (Word 97)     | ``textutil`` (macOS)         | ``catdoc``                   | ``ebook-convert`` (calibre)  
|\n+---------------------+------------------------------+------------------------------+------------------------------+\n| *rtf*               | ``ebook-convert`` (calibre)  | -                            | -                            |\n+---------------------+------------------------------+------------------------------+------------------------------+\n\n`:information_source:` Some explanations about the table\n\n- ``epubtxt`` is a fancy way to say ``unzip``.\n- By default, ``ebook-convert`` (calibre) is always used as a last resort when other methods already exist since it is slower than\n  the other conversion tools.\n\nFor comparison, here are the times taken to fully convert a 154-page PDF document to *txt* with both supported conversion methods:\n\n- ``pdftotext``: 4.27s\n- ``ebook-convert`` (calibre): 80.91s \n\nDocker error: ``requested access to the resource is denied`` 😡\n---------------------------------------------------------------\n`:information_source:` If you are having trouble pushing your docker image to hub.docker.com with an old macOS, here is what worked for me\n\n  I was trying to push to hub.docker.com but I was getting the error ``requested access to the resource is denied``. \n\n  I tried everything that was suggested on various forums: checking that I \n  named my image and repo correctly, making sure I was logged in before pushing, making sure that I was not pushing to a private\n  repo or to docker.io/library/, making sure that my Docker client was running, and so on. \n\n  I was finally able to push the Docker image to hub.docker.com by installing Ubuntu 22.04 in a virtual machine since I was\n  finally convinced that my very old macOS wasn't compatible with Docker anymore 😞. Also my Docker version was way too old\n  and the latest Docker requires newer versions of macOS. 
The only ``docker`` operation I was not able to accomplish (as far as I know)\n  with my old macOS was ``docker push``.\n\n  👉 **SOLUTION:** if you have tried everything under the sun to fix the ``push`` problem but still couldn't solve it, then the \n  solution is to finally accept that your old macOS (or any other OS) is the cause and you should try Docker on a newer system. Since I didn't want to \n  install a newer version of macOS (I don't want to break my current programs and I don't think my system is able to support it), I opted for \n  installing Docker with Ubuntu 22.04 under a virtual machine.\n\n  What I found strange, though, was that on my old macOS, when I logged out from Docker, I got the following message::\n\n   Not logged in to https://index.docker.io/v1/\n\n  However on Ubuntu 22.04, this is what I get when I log out from Docker (and this is what I see from `other people \n  \u003chttps://jhooq.com/requested-access-to-resource-is-denied/\u003e`_ using Docker)::\n\n     Removing login credentials for https://index.docker.io/v1/\n     \n  Maybe on the old macOS I was not correctly authenticated (even though I got the message ``Login Succeeded``) and thus I couldn't do the ``docker push``.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraul23%2Forganize-ebooks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraul23%2Forganize-ebooks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraul23%2Forganize-ebooks/lists"}