{"id":13660959,"url":"https://github.com/c-w/gutenberg","last_synced_at":"2025-10-03T16:31:51.405Z","repository":{"id":19975681,"uuid":"23242753","full_name":"c-w/gutenberg","owner":"c-w","description":"A simple interface to the Project Gutenberg corpus.","archived":true,"fork":false,"pushed_at":"2023-01-12T09:18:23.000Z","size":6147,"stargazers_count":323,"open_issues_count":6,"forks_count":60,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-01-15T01:54:27.174Z","etag":null,"topics":["gutenberg-ebooks","gutenberg-metadata","python2","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/c-w.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-08-22T23:16:08.000Z","updated_at":"2025-01-04T04:35:34.000Z","dependencies_parsed_at":"2023-01-14T13:15:33.121Z","dependency_job_id":null,"html_url":"https://github.com/c-w/gutenberg","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c-w%2Fgutenberg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c-w%2Fgutenberg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c-w%2Fgutenberg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c-w%2Fgutenberg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/c-w","download_url":"https://codeload.github.com/c-w/gutenberg/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","rep
ositories_count":235156016,"owners_count":18944828,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gutenberg-ebooks","gutenberg-metadata","python2","python3"],"created_at":"2024-08-02T05:01:27.962Z","updated_at":"2025-10-03T16:31:44.012Z","avatar_url":"https://github.com/c-w.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":":Warning: All the maintainers of this project have moved on so this project isn't receiving support anymore. If you wish to revive the project, reach out to `me \u003chttps://justamouse.com/\u003e`_ and we'll make it happen.\n:Alternatives: Check out `gutenbergpy \u003chttps://github.com/raduangelescu/gutenbergpy\u003e`_.\n\n*********\nGutenberg\n*********\n\n.. image:: https://github.com/c-w/gutenberg/workflows/CI/badge.svg\n    :target: https://github.com/c-w/gutenberg/actions?query=workflow%3ACI\n\n.. image:: https://github.com/c-w/gutenberg/workflows/Daily/badge.svg\n    :target: https://github.com/c-w/gutenberg/actions?query=workflow%3Adaily\n\n.. image:: https://codecov.io/gh/c-w/gutenberg/branch/master/graph/badge.svg\n  :target: https://codecov.io/gh/c-w/gutenberg\n\n.. image:: https://img.shields.io/pypi/v/gutenberg.svg\n    :target: https://pypi.python.org/pypi/gutenberg/\n\n.. 
image:: https://img.shields.io/pypi/pyversions/gutenberg.svg\n    :target: https://pypi.python.org/pypi/gutenberg/\n\n\nOverview\n========\n\nThis package contains a variety of scripts to make working with the `Project\nGutenberg \u003chttp://www.gutenberg.org\u003e`_ body of public domain texts easier.\n\nThe functionality provided by this package includes:\n\n* Downloading texts from Project Gutenberg.\n* Cleaning the texts: removing all the crud, leaving just the text behind.\n* Making meta-data about the texts easily accessible.\n\nThe package has been tested with Python 3.7+.\n\nAn HTTP interface to this package exists too.\n`Try it out! \u003chttps://github.com/c-w/gutenberg-http\u003e`_\n\n\nInstallation\n============\n\nThis project is on `PyPI \u003chttps://pypi.python.org/pypi/Gutenberg\u003e`_, so I'd\nrecommend that you just install everything from there using your favourite\nPython package manager.\n\n.. sourcecode :: sh\n\n    pip install gutenberg\n\nIf you want to install from source or modify the package, you'll need to clone\nthis repository:\n\n.. sourcecode :: sh\n\n    git clone https://github.com/c-w/Gutenberg.git\n\nNow, you should probably install the dependencies for the package and verify\nyour checkout by running the tests.\n\n.. sourcecode :: sh\n\n    cd Gutenberg\n\n    virtualenv --no-site-packages virtualenv\n    source virtualenv/bin/activate\n    pip install -r requirements-dev.pip\n    pip install .\n\n    nose2\n\nAlternatively, you can also run the project via Docker:\n\n.. 
sourcecode :: sh\n\n    docker build -t gutenberg .\n\n    docker run -it -v /some/mount/path:/data gutenberg python\n\n\nPython 3\n--------\n\nThis package depends on BSD-DB and you will need to manually install it.\n\nIf getting BSD-DB to run on your platform is difficult, take a look at\n`gutenbergpy \u003chttps://github.com/raduangelescu/gutenbergpy\u003e`_ which only\ndepends on SQLite or MongoDB.\n\nLinux\n*****\n\nOn Linux, you can usually install BSD-DB using your distribution's package\nmanager. For example, on Ubuntu, you can use apt-get:\n\n.. sourcecode :: sh\n\n    sudo apt-get install libdb++-dev\n    export BERKELEYDB_DIR=/usr\n    pip install .\n\nMacOS\n*****\n\nOn Mac, you can install BSD-DB using `homebrew \u003chttps://homebrew.sh/\u003e`_:\n\n.. sourcecode :: sh\n\n    brew install berkeley-db4\n    pip install .\n\nWindows\n*******\n\nOn Windows, it's easiest to download a pre-compiled version of BSD-DB from\n`pythonlibs \u003chttp://www.lfd.uci.edu/~gohlke/pythonlibs/\u003e`_ which works great.\n\nFor example, if you have Python 3.5 on a 64-bit version of Windows, you\nshould download :code:`bsddb3-6.2.1-cp35-cp35m-win_amd64.whl`.\n\nAfter you download the wheel, install it and you're good to go:\n\n.. sourcecode :: bash\n\n    pip install bsddb3-6.2.1-cp35-cp35m-win_amd64.whl\n    pip install .\n\nLicense conflicts\n*****************\n\nWith its v6.x releases, BSD-DB switched to the `AGPL3 \u003chttps://www.gnu.org/licenses/agpl-3.0.en.html\u003e`_\nlicense which is stricter than this project's `Apache v2 \u003chttps://www.apache.org/licenses/LICENSE-2.0\u003e`_\nlicense. This means that unless you're happy to comply with the `terms \u003chttps://tldrlegal.com/license/gnu-affero-general-public-license-v3-(agpl-3.0)\u003e`_\nof the AGPL3 license, you'll have to install an earlier version of BSD-DB\n(anything between 4.8.30 and 5.x should be fine). 
If you are happy to use this\nproject under AGPL3 (or if you have a commercial license for BSD-DB), set the\nfollowing environment variable before attempting to install BSD-DB:\n\n.. sourcecode :: bash\n\n    YES_I_HAVE_THE_RIGHT_TO_USE_THIS_BERKELEY_DB_VERSION=1\n\n\nApache Jena Fuseki\n------------------\n\nAs an alternative to the BSD-DB backend, this package can also use `Apache Jena Fuseki \u003chttps://jena.apache.org/documentation/fuseki2/\u003e`_\nfor the metadata store. The Apache Jena Fuseki backend is activated by\nsetting the :code:`GUTENBERG_FUSEKI_URL` environment variable to the HTTP\nendpoint at which Fuseki is listening. If the Fuseki server has HTTP basic\nauthentication enabled, the username and password can be provided via the\n:code:`GUTENBERG_FUSEKI_USER` and :code:`GUTENBERG_FUSEKI_PASSWORD` environment\nvariables.\n\nFor local development, the Fuseki server can be run via Docker:\n\n.. sourcecode :: bash\n\n    docker run \\\n        --detach \\\n        --publish 3030:3030 \\\n        --env ADMIN_PASSWORD=some-password \\\n        --volume /some/mount/location:/fuseki \\\n        stain/jena-fuseki:3.6.0 \\\n        /jena-fuseki/fuseki-server --loc=/fuseki --update /ds\n\n    export GUTENBERG_FUSEKI_URL=http://localhost:3030/ds\n    export GUTENBERG_FUSEKI_USER=admin\n    export GUTENBERG_FUSEKI_PASSWORD=some-password\n\n\nUsage\n=====\n\nDownloading a text\n------------------\n\n.. sourcecode :: python\n\n    from gutenberg.acquire import load_etext\n    from gutenberg.cleanup import strip_headers\n\n    text = strip_headers(load_etext(2701)).strip()\n    print(text)  # prints 'MOBY DICK; OR THE WHALE\\n\\nBy Herman Melville ...'\n\n.. sourcecode :: sh\n\n    python -m gutenberg.acquire.text 2701 moby-raw.txt\n    python -m gutenberg.cleanup.strip_headers moby-raw.txt moby-clean.txt\n\n\nLooking up meta-data\n--------------------\n\nA bunch of meta-data about ebooks can be queried:\n\n.. 
sourcecode :: python\n\n    from gutenberg.query import get_etexts\n    from gutenberg.query import get_metadata\n\n    print(get_metadata('title', 2701))  # prints frozenset([u'Moby Dick; Or, The Whale'])\n    print(get_metadata('author', 2701)) # prints frozenset([u'Melville, Hermann'])\n\n    print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])\n    print(get_etexts('author', 'Melville, Hermann'))        # prints frozenset([2701, ...])\n\nYou can get a full list of the meta-data that can be queried by calling:\n\n.. sourcecode :: python\n\n    from gutenberg.query import list_supported_metadatas\n\n    print(list_supported_metadatas()) # prints (u'author', u'formaturi', u'language', ...)\n\nBefore you use one of the :code:`gutenberg.query` functions you must populate the\nlocal metadata cache. This one-off process will take quite a while to complete\n(18 hours on my machine) but once it is done, any subsequent calls to\n:code:`get_etexts` or :code:`get_metadata` will be *very* fast. If you fail to populate the\ncache, the calls will raise an exception.\n\nTo populate the cache:\n\n.. sourcecode :: python\n\n    from gutenberg.acquire import get_metadata_cache\n    cache = get_metadata_cache()\n    cache.populate()\n\n\nIf you need more fine-grained control over the cache (e.g. where it's stored or\nwhich backend is used), you can use the :code:`set_metadata_cache` function to switch\nout the backend of the cache before you populate it. For example, to use the\nSqlite cache backend instead of the default Sleepycat backend and store the\ncache at a custom location, you'd do the following:\n\n.. 
sourcecode :: python\n\n    from gutenberg.acquire import set_metadata_cache\n    from gutenberg.acquire.metadata import SqliteMetadataCache\n\n    cache = SqliteMetadataCache('/my/custom/location/cache.sqlite')\n    cache.populate()\n    set_metadata_cache(cache)\n\n\nLimitations\n===========\n\nThis project *deliberately* does not include any natural language processing\nfunctionality. Consuming and processing the text is the responsibility of the\nclient; this library merely focuses on offering a simple and easy to use\ninterface to the works in the Project Gutenberg corpus.  Any linguistic\nprocessing can easily be done client-side e.g. using the `TextBlob\n\u003chttp://textblob.readthedocs.org\u003e`_ library.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fc-w%2Fgutenberg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fc-w%2Fgutenberg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fc-w%2Fgutenberg/lists"}