{"id":21620741,"url":"https://github.com/miku/grobidclient","last_synced_at":"2025-04-11T09:13:40.450Z","repository":{"id":250940599,"uuid":"835651226","full_name":"miku/grobidclient","owner":"miku","description":"A Go (golang) client for GROBID. ","archived":false,"fork":false,"pushed_at":"2025-02-18T16:40:22.000Z","size":7883,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-25T06:33:14.291Z","etag":null,"topics":["cli","document-analysis","golang","grobid"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T09:06:36.000Z","updated_at":"2024-10-09T16:12:58.000Z","dependencies_parsed_at":"2024-11-25T00:01:09.270Z","dependency_job_id":null,"html_url":"https://github.com/miku/grobidclient","commit_stats":null,"previous_names":["miku/grobidclient"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fgrobidclient","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fgrobidclient/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fgrobidclient/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fgrobidclient/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miku","download_url":"https://codeload.github.com/miku/grobidclient/tar.
gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248365262,"owners_count":21091756,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","document-analysis","golang","grobid"],"created_at":"2024-11-24T23:12:37.697Z","updated_at":"2025-04-11T09:13:40.389Z","avatar_url":"https://github.com/miku.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# grobidclient\n\nA [Go](https://go.dev) client library and CLI for the\n[GROBID](https://github.com/kermitt2/grobid) document parsing service. To\ninstall the CLI:\n\n```\n$ go install github.com/miku/grobidclient/cmd/grobidcli@latest\n```\n\nThe CLI and library include functionality:\n\n* to run parsing on a single PDF file\n* to run parsing recursively on files in a directory\n* to convert TEI XML to a JSON format, akin to [grobid-tei-xml](https://pypi.org/project/grobid-tei-xml/) (Python, cf. 
[#41](https://github.com/kermitt2/grobid_client_python/issues/41))\n\n## Usage\n\nThe CLI gives access to the various services, returns parsed XML or JSON\nresults, and can process a complete directory of PDF files (in parallel).\n\n```shell\n\n░░      ░░░       ░░░░      ░░░       ░░░        ░░       ░░...\n▒  ▒▒▒▒▒▒▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒▒▒▒  ▒▒▒▒▒  ▒▒▒▒  ▒...\n▓  ▓▓▓   ▓▓       ▓▓▓  ▓▓▓▓  ▓▓       ▓▓▓▓▓▓  ▓▓▓▓▓  ▓▓▓▓  ▓...\n█  ████  ██  ███  ███  ████  ██  ████  █████  █████  ████  █...\n██      ███  ████  ███      ███       ███        ██       ██...\n\ngrobidcli | valid service (-s) names:\n\n  processFulltextDocument\n  processHeaderDocument\n  processReferences\n  processCitationList\n  processCitationPatentST36\n  processCitationPatentPDF\n\nNote: options passed to grobid API are prefixed with \"g-\", like \"g-ira\"\n\n  -H\tuse sha1 of file contents as the filename\n  -O string\n    \toutput directory to write parsed files to\n  -P\tdo a ping, then exit\n  -S string\n    \tserver URL (default \"http://localhost:8070\")\n  -T duration\n    \tclient timeout (default 1m0s)\n  -W string\n    \tpath to WARC file to extract PDFs and parse them (experimental)\n  -c string\n    \tpath to config file, often config.json\n  -d string\n    \tinput directory to scan for PDF, txt, or XML files\n  -debug\n    \tuse debug result writer, does not create any output files\n  -f string\n    \tsingle input file to process\n  -g-cc\n    \tgrobid: consolidate citations\n  -g-ch\n    \tgrobid: consolidate header\n  -g-force\n    \tgrobid: force reprocess\n  -g-gi\n    \tgrobid: generate ids\n  -g-ira\n    \tgrobid: include raw affiliations\n  -g-irc\n    \tgrobid: include raw citations\n  -g-ss\n    \tgrobid: segment sentences\n  -j\toutput json for a single file\n  -n int\n    \tnumber of concurrent workers (default 12)\n  -r int\n    \tmax retries (default 10)\n  -s string\n    \ta valid service name (default \"processFulltextDocument\")\n  -v\tbe verbose\n  -version\n    
\tshow version\n\nExamples:\n\nProcess a single PDF file and get back TEI-XML\n\n  $ grobidcli -S localhost:8070 -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf\n\nProcess a single PDF file and get back JSON\n\n  $ grobidcli -j -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf\n\nProcess a directory of PDF files\n\n  $ grobidcli -d fixtures\n```\n\nProcess a single PDF.\n\n```xml\n$ grobidcli -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf | xmllint --format - | head -10\n\u003c?xml version=\"1.0\" encoding=\"UTF-8\"?\u003e\n\u003cTEI xmlns=\"http://www.tei-c.org/ns/1.0\" xmlns:xsi=\"http://www.w3.org/2001/XML...\n        \u003cteiHeader xml:lang=\"en\"\u003e\n                \u003cfileDesc\u003e\n                        \u003ctitleStmt\u003e\n                                \u003ctitle level=\"a\" type=\"main\"\u003eSplit Sex Ratios ...\n                                \u003cfunder ref=\"#_ZXgvsGF\"\u003e\n                                        \u003corgName type=\"full\"\u003eBelgian National ...\n                                \u003c/funder\u003e\n                        \u003c/titleStmt\u003e\n\n...\n```\n\nProcess a single PDF and convert to JSON:\n\n```json\n$ grobidcli -j -S http://localhost:8070 -f testdata/pdf/1906.02444.pdf | jq .\n{\n  \"grobid_version\": \"0.8.0\",\n  \"grobid_ts\": \"2024-08-27T16:56+0000\",\n  \"header\": {\n    \"authors\": [\n      {\n        \"full_name\": \"Davor Kolar\",\n        \"given_name\": \"Davor\",\n        \"surname\": \"Kolar\",\n        \"email\": \"dkolar@fsb.hr\"\n      },\n      {\n        \"full_name\": \"Dragutin Lisjak\",\n        \"given_name\": \"Dragutin\",\n        \"surname\": \"Lisjak\",\n        \"email\": \"dlisjak@fsb.hr\"\n      },\n      {\n        \"full_name\": \"Michał Paj Ąk\",\n        \"given_name\": \"Michał\",\n        \"surname\": \"Paj Ąk\"\n      },\n      {\n        \"full_name\": \"Danijel Pavkovic\",\n        \"given_name\": \"Danijel\",\n        \"surname\": 
\"Pavkovic\",\n        \"email\": \"dpavkovic@fsb.hr\"\n      }\n    ],\n    \"date\": \"2019-06-06\",\n    \"doi\": \"10.1177/ToBeAssigned\",\n    \"arxiv_id\": \"1906.02444v1[cs.LG]\"\n  },\n  \"pdfmd5\": \"E04A100BC6A02EFBF791566D6CB62BC9\",\n  \"lang\": \"en\",\n  \"citations\": [\n    {\n      \"authors\": [\n        {\n          \"full_name\": \"O Abdeljaber\",\n          \"given_name\": \"O\",\n          \"surname\": \"Abdeljaber\"\n        },\n        {\n          \"full_name\": \"O Avci\",\n          \"given_name\": \"O\",\n          \"surname\": \"Avci\"\n        },\n        {\n          \"full_name\": \"S Kiranyaz\",\n          \"given_name\": \"S\",\n          \"surname\": \"Kiranyaz\"\n        },\n        {\n          \"full_name\": \"M Gabbouj\",\n          \"given_name\": \"M\",\n          \"surname\": \"Gabbouj\"\n        },\n        {\n          \"full_name\": \"D J Inman\",\n          \"given_name\": \"D\",\n          \"middle_name\": \"J\",\n          \"surname\": \"Inman\"\n        }\n      ],\n      \"id\": \"b0\",\n      \"date\": \"2017\",\n      \"title\": \"Real-time vibration-based stru...\",\n      \"journal\": \"J. 
Sound Vib\",\n      \"volume\": \"388\",\n      \"pages\": \"154-170\",\n      \"first_page\": \"154\",\n      \"last_page\": \"170\"\n    },\n    ...\n  ],\n  \"abstract\": \"Recent trends focusing on Industry 4.0 conce...\",\n  \"body\": \"Introduction Rotating machines in general consis...\"\n}\n```\n\nProcess PDF files in a directory in parallel.\n\n```shell\n$ grobidcli -d testdata/pdf\n2024/07/30 20:48:35 scanning testdata/pdf/\n2024/07/30 20:48:37 got result [200]: testdata/pdf/62-Article Text-140-1-10-20190621.pdf\n2024/07/30 20:48:39 got result [200]: testdata/pdf/062RoisinAronAmericanNaturalist03.pdf\n```\n\nBy default, the result for each PDF file is written to a separate file with the\n`grobid.tei.xml` extension.\n\n## Example library usage\n\nPackage documentation is on\n[pkg.go.dev](https://pkg.go.dev/github.com/miku/grobidclient). Example taken\nfrom the [grobidcli\ntool](https://github.com/miku/grobidclient/blob/main/cmd/grobidcli/main.go).\n\n```go\nimport (\n    ...\n    \"bytes\"\n    \"encoding/json\"\n    \"fmt\"\n    \"log\"\n    \"os\"\n    ...\n\n    \"github.com/miku/grobidclient\"\n    \"github.com/miku/grobidclient/tei\"\n)\n    ...\n    opts := \u0026grobidclient.Options{\n        GenerateIDs:            *generateIDs,\n        ConsolidateHeader:      *consolidateHeader,\n        ConsolidateCitations:   *consolidateCitations,\n        IncludeRawCitations:    *includeRawCitations,\n        IncluseRawAffiliations: *includeRawAffiliations,\n        TEICoordinates:         []string{\n            \"ref\",\n            \"figure\",\n            \"persName\",\n            \"formula\",\n            \"biblStruct\",\n        },\n        SegmentSentences:       *segmentSentences,\n        Force:                  *forceReprocess,\n        Verbose:                *verbose,\n        OutputDir:              *outputDir,\n        CreateHashSymlinks:     *createHashSymlinks,\n    }\n    switch {\n    case *inputFile != \"\":\n        result, err := grobid.ProcessPDF(\"my.pdf\",\n            
\"processFulltextDocument\", opts)\n        if err != nil {\n            log.Fatal(err)\n        }\n        switch {\n        case *jsonFormat:\n            doc, err := tei.ParseDocument(\n                bytes.NewReader(result.Body))\n            if err != nil {\n                log.Fatal(err)\n            }\n            enc := json.NewEncoder(os.Stdout)\n            if err := enc.Encode(doc); err != nil {\n                log.Fatal(err)\n            }\n        case result.StatusCode == 200:\n            fmt.Println(result.StringBody())\n        default:\n            log.Fatal(result)\n        }\n    ...\n```\n\n## Notes on server setup\n\n* [Production Grobid Server Configuration](https://github.com/kermitt2/grobid/issues/443#issuecomment-505208132)\n\n## TODO and IDEAS\n\n* [ ] allow to process WARC files\n* [ ] allow to group all output from one go into a single file (XML in JSON, really...)\n\nIt would be nice to be able to point to a WARC file and parse all found PDFs in\nthat WARC file.\n\n```shell\n$ grobidcli -W https://is.gd/Jpz7OH -o parsed.json\n```\n\n* [ ] try to cache processing; cache may be keyed on content hash\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fgrobidclient","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiku%2Fgrobidclient","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fgrobidclient/lists"}