{"id":21032193,"url":"https://github.com/rami-sabbagh/ourmarks","last_synced_at":"2025-07-03T06:04:34.814Z","repository":{"id":39572879,"uuid":"376110788","full_name":"Rami-Sabbagh/OurMarks","owner":"Rami-Sabbagh","description":"A module for extracting exams marks from official PDFs, for the Faculty of Information Technology Engineering at Damascus University","archived":false,"fork":false,"pushed_at":"2023-03-01T01:02:01.000Z","size":5575,"stargazers_count":11,"open_issues_count":6,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-03T06:04:32.581Z","etag":null,"topics":["damascus","exams","fite","marks","npm-module","pdf","typescript","typescript-library","university"],"latest_commit_sha":null,"homepage":"https://rami-sabbagh.github.io/OurMarks/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rami-Sabbagh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-11T18:24:58.000Z","updated_at":"2023-09-16T17:48:35.000Z","dependencies_parsed_at":"2024-11-19T12:42:00.085Z","dependency_job_id":"7f3c5304-d3c7-4b91-bba0-bc0b6b363a80","html_url":"https://github.com/Rami-Sabbagh/OurMarks","commit_stats":{"total_commits":220,"total_committers":5,"mean_commits":44.0,"dds":0.4409090909090909,"last_synced_commit":"d63b3536e71a2a0610a00f6b33735756eaedb7b3"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/Rami-Sabbagh/OurMarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rami-Sabbag
h%2FOurMarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rami-Sabbagh%2FOurMarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rami-Sabbagh%2FOurMarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rami-Sabbagh%2FOurMarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rami-Sabbagh","download_url":"https://codeload.github.com/Rami-Sabbagh/OurMarks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rami-Sabbagh%2FOurMarks/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263271505,"owners_count":23440396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["damascus","exams","fite","marks","npm-module","pdf","typescript","typescript-library","university"],"created_at":"2024-11-19T12:41:31.486Z","updated_at":"2025-07-03T06:04:34.791Z","avatar_url":"https://github.com/Rami-Sabbagh.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# 
OurMarks\n\n[![Codecov](https://codecov.io/gh/Rami-Sabbagh/OurMarks/branch/main/graph/badge.svg?token=5VYTRPESUE)](https://codecov.io/gh/Rami-Sabbagh/OurMarks)\n[![CodeFactor](https://www.codefactor.io/repository/github/rami-sabbagh/ourmarks/badge/main)](https://www.codefactor.io/repository/github/rami-sabbagh/ourmarks/overview/main)\n[![Checks](https://badgen.net/github/checks/Rami-Sabbagh/OurMarks)](https://github.com/Rami-Sabbagh/OurMarks/actions)\n[![License](https://badgen.net/npm/license/ourmarks)](https://github.com/Rami-Sabbagh/OurMarks/blob/main/LICENSE)\n[![NPM](https://badgen.net/npm/v/ourmarks)][ourmarks npm]\n[![Minzipped Size](https://badgen.net/bundlephobia/minzip/ourmarks)](https://bundlephobia.com/package/ourmarks)\n[![Tree Shaking](https://badgen.net/bundlephobia/tree-shaking/ourmarks)][ourmarks bundlephobia]\n[![Twitter](https://badgen.net/twitter/follow/rami_sab07)](https://twitter.com/rami_sab07)\n\nA module for extracting exam marks from official PDFs, for the Faculty of Information Technology Engineering at Damascus University.\n\n![A marks document parsed using the module and viewed using Google Sheets](https://github.com/Rami-Sabbagh/OurMarks/raw/f4e6353c17316638af21de1c0802c5700a42be67/images/overview.png)\n\n## Introduction\n\nStudents' exam marks at the Faculty of Information Technology Engineering at Damascus University are published as PDF documents of Excel tables.\n\nThe PDF documents don't allow the exam marks to be used in Excel sheets and other programs, because they're made only to be displayed.\n\nThat's why the OurMarks module was created: it extracts the mark records from the PDF documents into structured data items that can be exported as CSV tables and used for any computational purpose.\n\nThis opens the opportunity for:\n\n- Building a structured database of the students' marks.\n- Creating applications for displaying the marks.\n- Doing statistical data analysis on the marks.\n- Building profiles for 
students.\n- And much more...\n\n## Features\n\n- Top-level API made simple for direct use\n- Written in TypeScript, so type definitions and IDE auto-completion (in VS Code and other IDEs) are available\n- Well documented and available on [npm][ourmarks npm]\n- Supports [Node.js] and the browser\n- Introduces no side effects\n\n## Example\n\n### Node.js (TypeScript)\n\n```ts\nimport * as fs from 'fs';\nimport * as path from 'path';\n\nimport { getDocument } from 'pdfjs-dist/legacy/build/pdf';\nimport { extractMarksFromDocument } from 'ourmarks';\n\n// Read the document's data\nconst TARGET_DOCUMENT = path.resolve(__dirname, './documents/1617010032_programming 3 -2-f1-2021.pdf');\nconst documentData = fs.readFileSync(TARGET_DOCUMENT);\n\n// Parse the marks\nasync function main() {\n    const document = await getDocument(documentData).promise;\n    const marksRecords = await extractMarksFromDocument(document);\n    document.destroy();\n\n    console.log(marksRecords);\n}\n\n// Run the asynchronous function\nmain().catch(console.error);\n```\n\n## Getting Started\n\n### Installation\n\n```bash\nnpm install ourmarks pdfjs-dist\n```\n\nor\n\n```bash\nyarn add ourmarks pdfjs-dist\n```\n\n### Basic Usage\n\nThe module provides two top-level asynchronous functions for extracting marks from PDF documents.\n\nThe document is expected to be loaded using PDF.js first, which is very simple:\n\n```ts\nimport { getDocument } from 'pdfjs-dist';\n\n// Inside your main asynchronous function\nasync function main() {\n    const document = await getDocument(rawPDFBinaryData).promise;\n\n    // ...\n\n    // Don't forget to destroy the document in order to free the allocated resources.\n    document.destroy();\n}\n\n// Run the asynchronous function\nmain().catch(console.error);\n```\n\n\u003e On Node.js you have to import `pdfjs-dist/legacy/build/pdf` instead, for compatibility reasons.\n\n\u003e `rawPDFBinaryData` can be a Node.js `Buffer` object, a URL to the document, a 
`Uint8Array`, or one of the other options supported by [PDF.js].\n\nThen the whole document can be processed at once using `extractMarksFromDocument`:\n\n```ts\nimport { extractMarksFromDocument } from 'ourmarks';\n\n// Inside the main() function defined earlier:\nconst marksRecords = await extractMarksFromDocument(document);\n```\n\nOr it can be processed page by page using `extractMarksFromPage`:\n\n```ts\nimport { extractMarksFromPage, MarkRecord } from 'ourmarks';\n\nconst wholeRecords: MarkRecord[] = [];\n\n// Inside the main() function defined earlier:\nfor (let i = 1; i \u003c= document.numPages; i++) {\n    const page = await document.getPage(i);\n    const pageRecords = await extractMarksFromPage(page);\n\n    wholeRecords.push(...pageRecords);\n}\n```\n\n## API Documentation\n\nIn addition to the top-level `extractMarksFromDocument` and `extractMarksFromPage` functions, there are several lower-level functions for advanced users.\n\nIt's completely unnecessary to use them, but if you want to explore how the module works internally, you can check the [API documentation][apidocs] and read the 'How it works' section below.\n\n## How it works\n\nThe marks extractor works through a list of 7 steps:\n\n### Step 01: Load the document for parsing\n\nThe PDF document is loaded using the `PDF.js` library so it can be parsed.\n\nOnce the document has been loaded, it's possible to load each of its pages.\n\n### Step 02: Load each page in the document\n\nEach page in the document is loaded.\n\nOnce a page is loaded, it's possible to read its content for processing.\n\n### Step 03: Get the text items of each page\n\nFor each page, a list of all the text items in it is created.\n\nEach text item has the following data structure:\n\n| Field Name    | Type                    | Description                                                                    
|\n|---------------|-------------------------|--------------------------------------------------------------------------------|\n| string        | `string`                | The content of the item                                                        |\n| direction     | `'ttb'` `'ltr'` `'rtl'` | The direction of the item's content                                            |\n| width         | `number`                | The width of the item, in document units                                       |\n| height        | `number`                | The height of the item, in document units                                      |\n| transform     | `number[]`              | The 3x3 transformation matrix of the item, with only 6 values stored           |\n| transform`[0]` | `number`                | The (0,0) value in the item's transformation matrix, represents **scale x**    |\n| transform`[1]` | `number`                | The (1,0) value in the item's transformation matrix, represents **skew**       |\n| transform`[2]` | `number`                | The (0,1) value in the item's transformation matrix, represents **skew**       |\n| transform`[3]` | `number`                | The (1,1) value in the item's transformation matrix, represents **scale y**    |\n| transform`[4]` | `number`                | The (0,2) value in the item's transformation matrix, represents **translate x** |\n| transform`[5]` | `number`                | The (1,2) value in the item's transformation matrix, represents **translate y** |\n\n### Step 04: Filter and simplify the text items\n\nWith the text items stored in a list, the loaded PDF document can be discarded safely as it's no longer needed.\n\nThe following are filtered out of the items list:\n\n- Items with `ttb` direction, since we're only interested in English and Arabic items.\n- Items with non-zero `transform[1]` and `transform[2]`, since we're not interested in rotated or skewed items.\n- Items with empty `''` content.\n- Items with zero `transform[4]` or 
`transform[5]`, as they are invisible/invalid.\n\nThen each item is mapped into a more simplified data structure:\n\n\u003e Each item is determined as Arabic if it has `rtl` direction\n\n| Field Name | Type               | Description                                            |\n|------------|--------------------|--------------------------------------------------------|\n| value      | `string`           | The content of the simplified item                     |\n| arabic     | `'true'` `'false'` | Whether the item contains any Arabic characters or not |\n| x          | `number`           | The X coordinate of the item, equal to `transform[4]`  |\n| y          | `number`           | The Y coordinate of the item, equal to `transform[5]`  |\n| width      | `number`           | The width of the item                                  |\n| height     | `number`           | The height of the item                                 |\n\n### Step 05: Merge close text items\n\n![The original Arabic items](https://github.com/Rami-Sabbagh/OurMarks/raw/f4e6353c17316638af21de1c0802c5700a42be67/images/items_highlighted.png)\n\n\u003e **Update (2022-09-21):** Newer versions of PDF.js no longer produce this issue!\n\n\u003e **As of OurMarks 3.0.0 this step is disabled by default but is still available behind an option.**\n\nIt was found that Arabic content is stored as an independent text item for each character.\n\nSo the characters have to be merged back into proper items.\n\n![The Arabic items after merging](https://github.com/Rami-Sabbagh/OurMarks/raw/f4e6353c17316638af21de1c0802c5700a42be67/images/items_merged.png)\n\nA simple algorithm was created to solve that. Here's an overview:\n\n\u003e Please note that coordinates in the PDF documents are relative to the bottom-left corner.\n\n1. Sort the list of items in **ascending** order, first by their Y coordinates, then by their X coordinates.\n2. 
For each range in the list with the same Y coordinate:\n    - Iterate over the row's items in left-to-right order:\n        1. Check if the current item should be merged with the previous one:\n            - They should match in height.\n            - Neither of the items should be protected.\n                - An item is considered protected if it's a 5-digit number (a student ID).\n            - Define `errorTolerance = currentItem.height / 10`.\n            - The condition `currentItem.x \u003c= previousItem.x + previousItem.width + errorTolerance` should be met.\n        2. If the items should be merged, do that by:\n            - Concatenating their content in the correct direction.\n            - Setting the merged item to be Arabic if either of the items was Arabic.\n            - Calculating the new width of the item using `currentItem.x + currentItem.width - previousItem.x`.\n\n\u003e Please note that the previous item is the item on the left, and the next item is the item on the right.\nThat's because of how the items list was sorted.\n\n### Step 06: Shape the items into a table structure\n\nThe text items can now be shaped into a table structure, which is a 2-dimensional list of the items.\n\nThe first dimension is for the rows, and the second dimension is for the cells.\n\n1. Sort the list of items in **descending** order, first by their Y coordinates, then by their X coordinates.\n    - Items should be considered to have the same Y coordinate if their Y projections intersect:\n        - `itemA.y \u003e= itemB.y \u0026\u0026 itemA.y \u003c= itemB.y + itemB.height` or `itemB.y \u003e= itemA.y \u0026\u0026 itemB.y \u003c= itemA.y + itemA.height`\n2. 
Each range in the list with the same Y coordinate is a row, with its items (considered as cells) sorted from left to right.\n\n### Step 07: Extract marks records from the table\n\nNow that the simplified text items have been stored in a table structure,\nit's possible to iterate over its rows and extract mark records.\n\nMark records have the following data structure:\n\n| Field Name        | Type              | Description                                                                    |\n|-------------------|-------------------|--------------------------------------------------------------------------------|\n| studentId         | `number`          | The exam ID of the student, a 5-digit number                                   |\n| studentName       | `string` / `null` | The full name of the student, may contain the father's name in some situations |\n| studentFatherName | `string` / `null` | The name of the student's father when not included in the full name            |\n| practicalMark     | `number` / `null` | The practical mark of the exam, usually out of 20 or 30                        |\n| theoreticalMark   | `number` / `null` | The theoretical mark of the exam, usually out of 80 or 70                      |\n| examMark          | `number` / `null` | The total mark of the exam, should be out of 100                               |\n\n\u003e All fields except `studentId` can be `null`, because they might be missing from the table or malformed.\n\nThe mark records are extracted using the following algorithm:\n\n- For each row in the table that starts with a 5-digit number:\n    1. The `studentId` is that item.\n    2. Create a list for storing marks.\n    3. Iterate over the rest of the items in the row:\n        1. Skip items longer than 255 characters.\n        2. The item is considered a _mark_ if it's a number of 1 to 3 digits, and not detected as Arabic.\n        3. 
If the item is Arabic and the marks list is empty:\n            - If `studentName` is `null`, then set it to the item.\n            - Otherwise, if `studentFatherName` is `null`, then set it to the item.\n        4. If the item is considered a _mark_, and the marks list has fewer than 3 items, then push the item to the list.\n    4. If there is 1 item in the marks list, then assign it to `examMark`.\n    5. Otherwise, if there are 3 items in the marks list, then assign them to `practicalMark`, `theoreticalMark` and `examMark`, in this order.\n\n[PDF.js]: https://mozilla.github.io/pdf.js/\n[Node.js]: https://nodejs.org/en/\n[apidocs]: https://rami-sabbagh.github.io/OurMarks/\n[ourmarks npm]: https://www.npmjs.com/package/ourmarks\n[ourmarks bundlephobia]: https://bundlephobia.com/package/ourmarks\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frami-sabbagh%2Fourmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frami-sabbagh%2Fourmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frami-sabbagh%2Fourmarks/lists"}