{"id":20372878,"url":"https://github.com/sidmishraw/autobot","last_synced_at":"2026-06-10T07:31:50.409Z","repository":{"id":86400473,"uuid":"101383479","full_name":"sidmishraw/autobot","owner":"sidmishraw","description":"PDF parsing and extraction utility using Apache Tika","archived":false,"fork":false,"pushed_at":"2017-09-08T06:09:48.000Z","size":50602,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-04T20:43:23.808Z","etag":null,"topics":["apache-tika","data-extraction","java","pdf-parsing","pdfbox"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sidmishraw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-25T08:30:02.000Z","updated_at":"2017-08-25T16:39:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"a32780b3-ae7f-4fcb-b4d5-7a6876f1cdad","html_url":"https://github.com/sidmishraw/autobot","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sidmishraw/autobot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidmishraw%2Fautobot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidmishraw%2Fautobot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidmishraw%2Fautobot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidmishraw%2Fautobot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/sidmishraw","download_url":"https://codeload.github.com/sidmishraw/autobot/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidmishraw%2Fautobot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34142638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-tika","data-extraction","java","pdf-parsing","pdfbox"],"created_at":"2024-11-15T01:15:22.976Z","updated_at":"2026-06-10T07:31:50.379Z","avatar_url":"https://github.com/sidmishraw.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# autobot - PDF parsing and extraction utility using Apache Tika\n\nAutobot parses IEEE Xplore PDF files using Apache Tika and extracts the title, authorString, and content of each document.\n\n\nPlease download the utility jar from the link below:\nhttps://github.com/sidmishraw/autobot/blob/master/build/libs/autobot-1.0.0.jar\n\nDescription:\n\nIt requires two inputs:\n\n1\u003e  Absolute file-path of a file named “conf.txt”\n\n\tThis file lists the file-paths of the input PDF documents, one per line.\n\nFor example:\n\n```conf.txt\npath-to-pdfs\\04403110.pdf\npath-to-pdfs\\04403128.pdf\npath-to-pdfs\\04403127.pdf\n```\n2\u003e Absolute 
file-path of the output directory.\n\n\n\nUsage:\n\njava -jar autobot-1.0.0.jar \"path-to-conf.txt\" \"path-to-output-directory\"\n\nFor example:\n```\njava -jar autobot-1.0.0.jar \"/Users/sidmishraw/Downloads/conf.txt\" \"/Users/sidmishraw/Downloads/outpdfs\"\n```\n\n\nCaveats:\n\n• It cannot extract the exact author names. Instead, it extracts the text from the author-name area and groups it into a field named “authorString”.\n\n```JSON\n{\n  \"title\": \"Incompleteness Errors in Ontology\",\n  \"authorString\": [\n    \"1 Muhammad Abdul Qadir, 2Muhammad Fahad, 3Syed Adnan Hussain Shah Muhammad Ali Jinnah University, Islamabad, Pakistan\",\n    \"1aqadir@jinnah.edu.pk, 2mhd.fahad@gmail.com, 3syedadnan@gmail.com\"\n  ],\n  \"content\": \"Abstract\\nOntology ev…\"\n}\n```\n\nAs the example shows, numbering in front of the names is still difficult to remove.\n\nSome PDF documents turn out well:\n\n```JSON\n{\n  \"title\": \"Privacy Preserving Collaborative Filtering using Data Obfuscation\",\n  \"authorString\": [\n    \"Rupa Parameswaran Georgia Institute of Technology\",\n    \"School of Electrical and Computer Engineering Atlanta, GA\",\n    \"rupa@ece.gatech.edu\",\n    \"Douglas M Blough Georgia Institute of Technology\",\n    \"School of Electrical and Computer Engineering Atlanta, GA\",\n    \"doug.blough@ece.gatech.edu\"\n  ],\n  \"content\": \"Abstract\\n…\"\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsidmishraw%2Fautobot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsidmishraw%2Fautobot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsidmishraw%2Fautobot/lists"}