{"id":20817529,"url":"https://github.com/greed2411/tokyo","last_synced_at":"2025-05-07T14:06:30.213Z","repository":{"id":125880840,"uuid":"270027219","full_name":"greed2411/tokyo","owner":"greed2411","description":"tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text  💪. ","archived":false,"fork":false,"pushed_at":"2020-06-13T03:50:30.000Z","size":20,"stargazers_count":18,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-07T14:06:19.380Z","etag":null,"topics":["apache-tika","clojure","document-processing","extension","extract-text","filetype","mime-types","ring","text-extraction","text-parser","text-parsing"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/greed2411.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-06T15:48:40.000Z","updated_at":"2023-08-28T11:59:51.000Z","dependencies_parsed_at":"2023-07-08T04:45:22.070Z","dependency_job_id":null,"html_url":"https://github.com/greed2411/tokyo","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greed2411%2Ftokyo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greed2411%2Ftokyo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greed2411%2Ftokyo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/greed2411%2Ftokyo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/greed2411","download_url":"https://codeload.github.com/greed2411/tokyo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252892503,"owners_count":21820648,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-tika","clojure","document-processing","extension","extract-text","filetype","mime-types","ring","text-extraction","text-parser","text-parsing"],"created_at":"2024-11-17T21:42:45.215Z","updated_at":"2025-05-07T14:06:30.206Z","avatar_url":"https://github.com/greed2411.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tokyo\n\n[![greed2411](https://circleci.com/gh/greed2411/tokyo.svg?style=svg)](https://app.circleci.com/pipelines/github/greed2411/tokyo?branch=master)\n\n\n\u003e When you hit rock-bottom, you still have a way to go until the abyss. - Tokyo, Netflix's \"Money Heist\" (La Casa De Papel)\n\n\u003cp align=\"center\"\u003e\n  \u003cbr\u003e\n  \u003cimg src=\"https://res.cloudinary.com/teepublic/image/private/s--RcVSHez1--/t_Preview/b_rgb:36538b,c_limit,f_jpg,h_630,q_90,w_630/v1569759975/production/designs/6137078_0.jpg\"/\u003e\n  \u003cbr\u003e\n  \u003cem\u003eimage belongs to teepublic\u003c/em\u003e\n  \u003cbr\u003e\n  \u003cbr\u003e\n\u003c/p\u003e\n\n\n\nWhen one is limited by the technology of the time, one resorts to Java APIs using Clojure.\n\nThis is my first attempt at Clojure: a REST API which, when a file is uploaded, 
identifies its `mime-type`, `extension` and `text` (if any is present inside the file) and returns that information as JSON.\nThis works for several types of files, including the ones which require OCR, thanks to Tesseract. See the complete [list](https://tika.apache.org/0.9/formats.html) of file formats supported by Tika.\n\nIt uses [ring](https://github.com/ring-clojure/ring) as the Clojure HTTP server abstraction, [jetty](https://www.eclipse.org/jetty/) as the actual HTTP server, and [pantomime](https://github.com/michaelklishin/pantomime) as a Clojure abstraction over [Apache Tika](https://tika.apache.org/). It can optionally be served behind [traefik](https://containo.us/traefik/) acting as a reverse proxy.\n\n\n## Installation\n\nTwo options:\n1. Download [openjdk-11](https://openjdk.java.net/) and install [lein](https://leiningen.org/), then run `lein uberjar`.\n2. Use the `Dockerfile` (Recommended)\n\n## Building\n\n1. You can obtain the `.jar` file from releases (if it's available).\n2. Else build the docker image using the `Dockerfile`.\n\n```\ndocker build ./ -t tokyo\ndocker run tokyo:latest\n```\n\nNote: the server defaults to running on port 80, because it has been exposed in the docker image.\nYou can change the port number by setting the environment variable `TOKYO_PORT` inside the `Dockerfile`, or in your shell when running the `.jar` file, to whichever port number you'd like.\n\nI've also added a `docker-compose.yml` which uses [traefik](https://containo.us/traefik/) as a reverse proxy. Use `docker-compose up`.\n\n## Usage\n\n1. The `/file` route. Make a `POST` request by uploading a file.\n    * The command line approach using [curl](https://curl.haxx.se/)\n\n\n    ```bash\n    curl -XPOST  \"http://localhost:80/file\" -F file=@/path/to/file/sample.doc\n\n    {\"mime-type\":\"application/msword\",\"ext\":\".bin\",\"text\":\"Lorem ipsum \\nLorem ipsum dolor sit amet, consectetur adipiscing elit. 
Nunc ac faucibus odio.\"}\n    ```\n\n    * The Python way using [requests](https://requests.readthedocs.io/en/master/)\n\n    ```python\n    \u003e\u003e\u003e import requests\n    \u003e\u003e\u003e import json\n\n    \u003e\u003e\u003e url = \"http://localhost:80/file\"\n    \u003e\u003e\u003e files = {\"file\": open(\"/path/to/file/sample.doc\", \"rb\")}  # binary mode, so the bytes are uploaded unchanged\n    \u003e\u003e\u003e response = requests.post(url, files=files)\n    \u003e\u003e\u003e json.loads(response.content)\n\n    {'mime-type': 'application/msword', 'ext': '.bin', 'text': 'Lorem ipsum \\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.'}\n    ```\n\n    The JSON response generally has the following form:\n    ```\n    :mime-type (string) - the mime-type of the file. e.g. application/msword, text/plain etc.\n    :ext       (string) - the extension of the file. e.g. .txt, .jpg etc.\n    :text      (string) - the text content of the file.\n    ```\n\nNote: The uploaded files are stored as temp files in `/tmp` and removed an hour later (assuming the JVM is still running for that hour or so).\n\n2. The `/` route. A `GET` request returns `Hello World` as plain text, to act as a ping.\n\nIf you go down the path of using `docker-compose`, the request changes to\n\n```bash\ncurl -XPOST  -H Host:tokyo.localhost http://localhost/file -F file=@/path/to/file/sample.doc\n\n{\"mime-type\":\"application/msword\",\"ext\":\".bin\",\"text\":\"Lorem ipsum \\nLorem ipsum dolor sit amet, consectetur adipiscing elit. 
Nunc ac faucibus odio.\"}\n```\n\nand\n\n```python\n\u003e\u003e\u003e response = requests.post(url, files=files, headers={\"Host\": \"tokyo.localhost\"})\n```\n\nwhere `tokyo.localhost` is the host name set in `docker-compose.yml`.\n\n### Why?\n\nI had to do this because neither Python's [filetype](https://github.com/h2non/filetype.py) (doesn't identify .doc, .docx, plain text) nor [textract](https://github.com/deanmalmgren/textract) (hacky way of extracting text, and one needs to know the extension before extracting) is as good as Tika. The Go version, [filetype](https://github.com/h2non/filetype), doesn't support a way to extract text. So I resorted to spiraling down the path of using Java's [Apache Tika](https://tika.apache.org/) through the Clojure [pantomime](https://github.com/michaelklishin/pantomime) library.\n\n\n## License\n\nCopyright © 2020 greed2411/tokyo\n\nThis program and the accompanying materials are made available under the\nterms of the Eclipse Public License 2.0 which is available at\nhttp://www.eclipse.org/legal/epl-2.0.\n\nThis Source Code may also be made available under the following Secondary\nLicenses when the conditions for such availability set forth in the Eclipse\nPublic License, v. 2.0 are satisfied: GNU General Public License as published by\nthe Free Software Foundation, either version 2 of the License, or (at your\noption) any later version, with the GNU Classpath Exception which is available\nat https://www.gnu.org/software/classpath/license.html.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreed2411%2Ftokyo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgreed2411%2Ftokyo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreed2411%2Ftokyo/lists"}