{"id":24318530,"url":"https://github.com/rdmpage/citation-parsing","last_synced_at":"2025-09-27T03:31:29.910Z","repository":{"id":142280131,"uuid":"374633734","full_name":"rdmpage/citation-parsing","owner":"rdmpage","description":"Exploring citation parse using Conditional Random Fields (CRF)","archived":false,"fork":false,"pushed_at":"2023-01-21T12:13:31.000Z","size":84149,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2023-03-13T03:37:36.623Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rdmpage.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":"citation-parsing.bbprojectd/Scratchpad.txt","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-07T11:01:47.000Z","updated_at":"2022-07-28T21:47:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"95f8a7f7-2478-4bb5-aa92-1e2705769679","html_url":"https://github.com/rdmpage/citation-parsing","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fcitation-parsing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fcitation-parsing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fcitation-parsing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fcitation-parsing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/rdmpage","download_url":"https://codeload.github.com/rdmpage/citation-parsing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234376967,"owners_count":18822424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-17T14:38:22.971Z","updated_at":"2025-09-27T03:31:19.899Z","avatar_url":"https://github.com/rdmpage.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Citation Parsing\n\n## Introduction\n\nExploring citation parsing using Conditional Random Fields (CRF). Heavily influenced by [ParsCit](https://github.com/knmnyn/ParsCit) and [AnyStyle](https://anystyle.io). My main goal here is to get something simple working as a starting point for learning more about CRF. Nothing here is state of the art, for that see, e.g.:\n\n- [Synthetic vs. 
Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and Cora](https://arxiv.org/abs/2004.10410)\n- [GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing](http://ceur-ws.org/Vol-2563/aics_25.pdf)\n- [Neural-ParsCit](https://github.com/WING-NUS/Neural-ParsCit)\n\n\n## Data\n\n`editor.html` is a simple HTML editor inspired by [MarsEdit Live Source Preview](https://red-sweater.com/blog/3025/marsedit-live-source-preview) where you can edit XML and see a live preview.\n\n`data/core.xml` is the training data from AnyStyle (1510 references).\n\n`dict.php` uses the dictionary that comes with ParsCit.\n\n## CRF\n\nFor background see [Conditional random fields](https://en.wikipedia.org/wiki/Conditional_random_field). I use [CRF++: Yet Another CRF toolkit](http://taku910.github.io/crfpp/), which is also used in ParsCit.\n\n### Heroku C++\n\nTo get CRF++ to work on Heroku we need to compile the executable. For background on this see [How to run an executable on Heroku from node, works locally](https://stackoverflow.com/questions/39685489/how-to-run-an-executable-on-heroku-from-node-works-locally) and [C++ buildpack](https://elements.heroku.com/buildpacks/felkr/heroku-buildpack-cpp).\n\nI forked [felkr/heroku-buildpack-cpp](https://github.com/felkr/heroku-buildpack-cpp) and added it as a buildpack to my Heroku app (under the `Settings` tab). I put the source code for CRF++ into the root folder of the app (which makes things messy) then when the app is deployed CRF++ is compiled. Note that I tried simply logging in to the Heroku app:\n\n`heroku run bash -a citation-parser`\n\nand compiling the code. 
This failed with g++ errors:\n\n```\nconfigure: error: Your compiler is not powerful enough to compile CRF++.\n```\n\nIt turns out that g++ is [only available at build time](https://devcenter.heroku.com/articles/stack-packages), so to use g++ I need a buildpack.\n\nThe buildpack compiled the code, but when I logged into the shell the executable wouldn’t run:\n\n```\nheroku run bash -a citation-parser\n./crf_learn\n/app/.libs/crf_learn: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory\n```\n\nFor whatever reason the executable is looking for a shared library which doesn’t exist. To fix this I edited the buildpack [compile](https://github.com/rdmpage/heroku-buildpack-cpp/blob/master/bin/compile) script to set the `\"LDFLAGS=--static\" --disable-shared` flags for `configure`. This then compiled an executable that worked.\n\n#### Update\n\nI’ve now updated the [buildpack](https://github.com/rdmpage/heroku-buildpack-cpp) to use the `src` folder so that this repo is much tidier.\n\n### Apple Silicon\n\nCRF++ didn’t want to build using autotools, but it is available on Homebrew so we can just `brew install crf++` to get a working version which is installed in `/opt/homebrew/bin`.\n\n## Use\n\nTo train a model we need some data that has been marked up. 
I follow AnyStyle’s XML, e.g.:\n\n```\n\u003c?xml version=\"1.0\" encoding=\"UTF-8\"?\u003e\n\u003cdataset\u003e\n  \u003csequence\u003e\n    \u003cauthor\u003eHeidegger M.,\u003c/author\u003e\n    \u003cdate\u003e1927,\u003c/date\u003e\n    \u003ctitle\u003eÊtre et temps,\u003c/title\u003e\n    \u003ceditor\u003eGallimard, Ed.\u003c/editor\u003e\n    \u003cdate\u003e1986,\u003c/date\u003e\n    \u003clocation\u003eParis.\u003c/location\u003e\n  \u003c/sequence\u003e\n.\n.\n.\n\u003c/dataset\u003e\n```\n\nWe need to convert this into the format expected by CRF++, which is one token per line, followed by its features and then the tag indicating which part of the sequence the token belongs to.\n\n`php parse_train.php data/core.xml` parses the training XML and outputs a `.train` file with the features and tags. Having converted the training data, we now build the model using `crf_learn` in the CRF++ package:\n\n`crf_learn data/parsCit.template data/core.train core.model`\n\n```\n.\n.\n.\nDone!1065.81 s\n```\n\nNote the template file `data/parsCit.template`, which tells CRF++ how to process the features; see [Preparing feature templates](http://taku910.github.io/crfpp/#templ).\n\nTo use the model we need to take some data and convert it into the training format. `refs_to_train.php` reads a text file with one reference string per line and outputs XML with each line enclosed in a `\u003ctitle\u003e` tag. This file can then be processed as if it were training data.\n\n```\nphp refs_to_train.php refs.txt\n\nphp parse_train.php refs.src.xml\n```\n\nNow we use our model to process the data using `crf_test`. In this case `crf_test` takes the data (each reference tagged with `\u003ctitle\u003e`) and outputs the tags based on the model. These tags are the ones we use to extract the structured data. 
\n\n```\ncrf_test -m core.model refs.src.train \u003e out.train\n```\n\nWe then convert the output (in training format) to XML, which can in turn be converted to a “native” format (e.g., RIS for bibliographic data).\n\n```\nphp parse_results_to_xml.php out.train \u003e out.xml\n\nphp parse_results_to_native.php out.xml\n```\n\nI still need to think about how to post-process the tags, and how to handle cases like the one below, where a date appears within the title, so that with the initial model we end up with two dates and two titles:\n\n```\n\u003cauthor\u003eAguilar, C., K. Siu-Ting, and P. J. Venegas.\u003c/author\u003e\n\u003cdate\u003e2007.\u003c/date\u003e\n\u003ctitle\u003eThe rheophilous tadpole of Telmatobius atahualpai Wiens,\u003c/title\u003e\n\u003cdate\u003e1993\u003c/date\u003e\n\u003ctitle\u003e(Anura: Ceratophryidae).\u003c/title\u003e\n\u003cjournal\u003eSouth American Journal of Herpetology\u003c/journal\u003e\n\u003cvolume\u003e2:\u003c/volume\u003e\n\u003cpages\u003e165–174.\u003c/pages\u003e\n```\n\n## Generating additional data to use for testing or training\n\nTake a RIS file and output AnyStyle XML format:\n\n```\nphp ris_to_training.php nsp.ris \u003e nsp.xml\n```\n\nConvert to training format:\n\n```\nphp parse_train.php nsp.xml\n```\n\nAdd the output to `core.train` and then rebuild the model:\n\n`crf_learn data/parsCit.template core.train core.model`\n\nDo this with each new set of training data so that we build a better model (we hope).\n\n## Adding “fails” to training data\n\nAs above, we can add items that we know have failed to our training set. Just follow the steps:\n\n```\nphp parse_train.php fail.xml\n```\n\nAdd the output to `core.train` and then rebuild the model:\n\n`crf_learn data/parsCit.template core.train core.model`\n\nIt is a good idea to then rename `fail.xml`, as its contents are now in the model, and start a new, empty `fail.xml` to collect new failures. 
If we keep doing this iteratively the model should improve.\n\n\n## Testing\n\nTake some references marked up in XML and generate the training format:\n\n`php parse_train.php fail.xml`\n\nRun `crf_test` to get tags from the model:\n\n`crf_test -m core.model fail.train \u003e f.train`\n\nThe output from `crf_test` contains both the original tags and the ones from the model, so compare them:\n\n`php parse_results_to_test.php f.train`\n\n\n## Examples\n\n```\nHogg, H.R. (1896). Araneidae. In B. Spencer (ed.) Report of the Horn Expedition to Central Australia. Pt. 2. Zoology. pp. 309-356. Melville, Mullen and Slade, Melbourne.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdmpage%2Fcitation-parsing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frdmpage%2Fcitation-parsing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdmpage%2Fcitation-parsing/lists"}