{"id":18750505,"url":"https://github.com/os-climate/crrf-det","last_synced_at":"2025-04-12T23:32:16.847Z","repository":{"id":186240294,"uuid":"572812693","full_name":"os-climate/crrf-det","owner":"os-climate","description":"A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.","archived":false,"fork":false,"pushed_at":"2024-06-26T09:00:32.000Z","size":6956,"stargazers_count":5,"open_issues_count":1,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-26T18:04:21.057Z","etag":null,"topics":["annotation","data-extraction","layout-analysis","pdf","table-extraction"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/os-climate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-01T04:40:16.000Z","updated_at":"2024-02-23T05:30:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"a0d14d50-4e9c-417f-afee-d7a6fc6d7dee","html_url":"https://github.com/os-climate/crrf-det","commit_stats":null,"previous_names":["os-climate/crrf-det"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/os-climate%2Fcrrf-det","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/os-climate%2Fcrrf-det/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/os-climate%2Fcrrf-det/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/os-climate%2Fcrrf-det/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/os-climate","download_url":"https://codeload.github.com/os-climate/crrf-det/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248647257,"owners_count":21139081,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","data-extraction","layout-analysis","pdf","table-extraction"],"created_at":"2024-11-07T17:12:12.514Z","updated_at":"2025-04-12T23:32:11.837Z","avatar_url":"https://github.com/os-climate.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CRRF Data Extraction Toolkit\n\nA web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.\n\n\u003cimg src=\"docs/crrf-det-screen.jpg\" style=\"border: 1px solid #aaa\"/\u003e\n\n## Features\n\n**PDF Processing (t-pdf)**\n\n- 📄 PDF page content understanding using an image-based visual method that segments tables and text boxes\n- 🧪 Unit-test-controlled layout analysis results for quality assurance\n- 🚀 High-speed analysis: image processing written in NumPy + scikit-image, achieving 3 pages/sec per 1000 \u003ca href=\"https://www.geekbench.com\" target=\"_blank\"\u003eGeekbench score\u003c/a\u003e on a single core.\n- 🧬 Conversion from PDF files to structured JSON\n\n**Documents Management**\n- 📁 Manage a repository of folders of PDF files\n- 🔎 Search using keywords and phrases (ngram) inside the PDF documents, designed for 
numerical value extraction, with:\n    * \"double quoted phrases\"\n    * -excluded_words\n    * -\"excluded phrases\"\n- 🏷️ Manage a list of persisted search queries, known as \"filters\", for quick recall and batch execution. Associate a search query with a list of tags.\n- Fully asynchronous task processing, with a configurable number of parallel processes\n\n**Batch Processing, User and Annotation Projects**\n- 💼 Create batch processing projects to run a selection of \"filters\" against a selection of folders and documents, generating a collection of segments in JSON format for download.\n- 🏷️ Convert the segments into an annotation project\n- 📱 A mobile-browser-friendly infinite-scrolling web app for annotating small segments collected from the documents\n- 🧑‍💼 Invitation-based user registration system, with admin-accessible document management and user-accessible annotation\n\n## Developing\n\nClone the repo:\n\n    $ git clone git@github.com:os-climate/crrf-det.git\n\n**Frontend**\n\nWe use \u003ca href=\"https://vitejs.dev\" target=\"_blank\"\u003eVite\u003c/a\u003e as our frontend tooling for a \u003ca href=\"https://reactjs.org\" target=\"_blank\"\u003eReact\u003c/a\u003e-based frontend. To start the frontend, first install \u003ca href=\"https://nodejs.org/en/\" target=\"_blank\"\u003eNode.js\u003c/a\u003e in your local environment, and make sure the `npm` command is available. Then:\n\n    $ cd crrf-det/src/fe\n    $ npm install\n    $ npm run dev\n\nAfter dependency installation, this will launch the frontend server at \u003ca href=\"http://localhost:5173/\" target=\"_blank\"\u003ehttp://localhost:5173/\u003c/a\u003e. Note that the default setup in the repository assumes that you develop on `localhost`. 
For instructions on deploying the program to another host, consult the **Deployment** section.\n\n**Backend**\n\nWe use \u003ca href=\"https://www.docker.com\" target=\"_blank\"\u003eDocker\u003c/a\u003e as the backend development environment. To launch the backend, first install the respective Docker edition for your local environment. Then:\n\n    $ cd crrf-det\n    $ docker-compose build\n    $ docker-compose up\n\nThis will bring up a \u003ca href=\"https://sanic.dev/en/\" target=\"_blank\"\u003eSanic\u003c/a\u003e-based backend on port `8000`, with a \u003ca href=\"https://redis.io\" target=\"_blank\"\u003eRedis\u003c/a\u003e database on port `6379`. Additionally, it creates the `dev-data` folder (at the same level as `docker-compose.yml`) for persisted data. No information is persisted in the Redis database. It is used primarily for running and keeping track of asynchronous tasks.\n\nNote that our setup uses an x86_64 base image.\n\nYou need a first admin user to use any of the functionality. To create one, do it manually inside the container:\n\n    $ (sudo) docker exec -ti crrf-det-be-1 bash\n    # python\n    \u003e\u003e\u003e import data.user\n    \u003e\u003e\u003e data.user.add('admin', 'password', 0)\n    \u003e\u003e\u003e quit()\n\nVisit \u003ca href=\"http://localhost:5173/\" target=\"_blank\"\u003ehttp://localhost:5173/\u003c/a\u003e and log in as that user. Note that the argument `0` at the end of the call refers to the level of the user. You need level `0` (the highest) to access PDF documents and project functionality. Levels \u003e 0 can only access the annotation app.\n\n## Tests\n\nUnit tests currently cover only the PDF page layout analysis portion of the code, which is in Python. 
Once you have the development containers set up, you can go inside and run the tests:\n\n    $ (sudo) docker exec -ti crrf-det-be-1 bash\n    # python -m unittest\n\nThe tests guarantee only that the layout analysis code, including the portion that breaks columns and rows and eventually guesses the location of the table, works as intended.\n\n## Deployment\n\nWe have written a small script to build size-optimized Docker images for deployment. To build for deployment, first determine the target hostname and port (must be known due to CORS in the backend, and API endpoints in the frontend). Then:\n\n    $ cd crrf-det/deploy\n    $ ./build.sh //hostname:port\n    $ (sudo) docker save det-be-dist -o det-be-dist.tar\n    $ (sudo) docker save det-fe-dist -o det-fe-dist.tar\n\nNote that the `//hostname:port` is only used in building the frontend, by hard-coding the destination API endpoints into the code before compilation. To set up backend handling of CORS, you need to set the `HOST_FE_URL` variable in `docker-compose.yml`.\n\nOnce you have the two (frontend and backend) images (.tar), copy them to your host, and use the reference `docker-compose.yml` file in the `deploy` folder to set them up.\n\n\u003cstrong style=\"color:red\"\u003e!!! Security Consideration !!!\u003c/strong\u003e\n\nSome environment variables should be changed during the deployment:\n\n      - JWT_SECRET=crrf-det-jwt-SECRET!!!501015\n      - PASSWORD_SALT=crrf-det-salt-50-10-15\n      - URL_SIGN_SECRET=86c935bc079ba1fef55809e2f575426c\n\nThese variables control the encryption of relevant parts. **Using the example values as is opens up opportunities for an attacker to generate your authentication token.**\n\nFor `JWT_SECRET` and `PASSWORD_SALT`, any sufficiently long random strings will do. 
To generate `URL_SIGN_SECRET`, a safe way would be to do it inside the container:\n\n    $ (sudo) docker exec -ti crrf-det-be-1 bash\n    # python\n    \u003e\u003e\u003e import service.sign\n    \u003e\u003e\u003e service.sign.generate_key()\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fos-climate%2Fcrrf-det","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fos-climate%2Fcrrf-det","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fos-climate%2Fcrrf-det/lists"}