{"id":18422460,"url":"https://github.com/sri-csl/safedocs-recognizer","last_synced_at":"2025-04-13T12:28:29.437Z","repository":{"id":130329129,"uuid":"422256132","full_name":"SRI-CSL/safedocs-recognizer","owner":"SRI-CSL","description":"DARPA SafeDocs TA1 software suite to bundle and orchestrate various format-aware tracing tools.","archived":false,"fork":false,"pushed_at":"2022-05-05T14:44:57.000Z","size":1405,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":16,"default_branch":"main","last_synced_at":"2024-12-24T18:29:20.429Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SRI-CSL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-28T15:24:34.000Z","updated_at":"2023-05-25T19:30:16.000Z","dependencies_parsed_at":"2023-05-01T06:00:37.982Z","dependency_job_id":null,"html_url":"https://github.com/SRI-CSL/safedocs-recognizer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SRI-CSL%2Fsafedocs-recognizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SRI-CSL%2Fsafedocs-recognizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SRI-CSL%2Fsafedocs-recognizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SRI-CSL%2Fsafedocs-recognizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/SRI-CSL","download_url":"https://codeload.github.com/SRI-CSL/safedocs-recognizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239101089,"owners_count":19581787,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T04:30:12.954Z","updated_at":"2025-02-16T07:25:26.150Z","avatar_url":"https://github.com/SRI-CSL.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# safedocs-recognizer\nDARPA SafeDocs TA1 software suite to bundle and orchestrate various format-aware tracing tools.\n\n## How to run\n\nFirst, copy (or symlink) the documents into the localdocs directory and create the document index.\n\n```\nsh build_index.sh\n```\n\nThen start the database, which stores the processing results.\n\n```\ndocker compose up\n```\n\nBuild the CLI tool\n\n```\ngo build\n```\n\nBuild the tooling\n\n```\nsh build-components.sh\n```\n\n### Examples\n\n#### Running tools without recognizer harness\n\n```\ndocker run --rm -i mr_file-features stdin \u003c pdf-sample.pdf\n```\n\n```\ndocker run --rm -i mr_qpdf_10.1.0 stdin \u003c pdf-sample.pdf\n```\n\n#### mupdf example within recognizer\n\nBaseline and non-baseline processing. For performance reasons, and to avoid multiple passes over 1 million files, the consensus component combines the bitcov and cfg tools.\n\n```\n./recognizer process --tag mr_mupdf_1.16.1 --subset evalThree --universe univA --baseline\n./recognizer process --tag mr_mupdf_1.16.1 --subset evalThree10kTest --universe 
univA\n./recognizer process --tag mr_file-features --subset evalThree --baseline\n```\n\nIntegrated components\n\nDerive the bitcov model\n```\n./recognizer bitcov --parser mupdf --universe univA\n./recognizer bitcov --parser mupdf --universe univB\n```\n\nMetrics comparing the 10k non-baseline files with models A and B\n```\n./recognizer bitcov-diff --model mupdf_univA_model.png --parser mupdf\n./recognizer bitcov-diff --model mupdf_univB_model.png --parser mupdf\n```\n\nDerive the flat-cfg model\n```\n./recognizer flat-cfg --parser mupdf --universe univA\n./recognizer flat-cfg --parser mupdf --universe univB\n```\n\nMetrics comparing the 10k non-baseline files with models A and B\n```\n./recognizer flat-cfg-diff --parser mupdf --model mupdf_univA_flat_cfg_model.txt\n./recognizer flat-cfg-diff --parser mupdf --model mupdf_univB_flat_cfg_model.txt\n```\n\n#### Misc\n\nHelper scripts\n\nExtract the PDF object that QPDF fails to parse\n```\ndocker run --rm -i mr_file-features stdin \u003c localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley | awk -v pdf_object=$(docker run --rm -i mr_qpdf stdin \u003c localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley | awk -f invalid_object.awk) -f extract_bytes.awk\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsri-csl%2Fsafedocs-recognizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsri-csl%2Fsafedocs-recognizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsri-csl%2Fsafedocs-recognizer/lists"}