{"id":18628573,"url":"https://github.com/sr-murthy/doc_classifier","last_synced_at":"2025-11-04T02:30:32.028Z","repository":{"id":92607664,"uuid":"235341126","full_name":"sr-murthy/doc_classifier","owner":"sr-murthy","description":"A simple experimental document classification tool based on a domain-dependent, keywords-based document class map and a keyword frequency score","archived":false,"fork":false,"pushed_at":"2024-04-09T18:13:50.000Z","size":76,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-27T06:43:56.288Z","etag":null,"topics":["document-classification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sr-murthy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-21T12:48:12.000Z","updated_at":"2024-04-12T13:01:48.000Z","dependencies_parsed_at":"2024-06-11T20:41:12.024Z","dependency_job_id":"8b0ab649-edd4-4c19-9698-5e018adefa9e","html_url":"https://github.com/sr-murthy/doc_classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sr-murthy%2Fdoc_classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sr-murthy%2Fdoc_classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sr-murthy%2Fdoc_classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sr-
murthy%2Fdoc_classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sr-murthy","download_url":"https://codeload.github.com/sr-murthy/doc_classifier/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239425338,"owners_count":19636346,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-classification"],"created_at":"2024-11-07T04:48:19.582Z","updated_at":"2025-02-18T06:44:24.617Z","avatar_url":"https://github.com/sr-murthy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Document Classifier\n===================\n\nDocument classification tool based on a domain-dependent, keywords-based document class map and a simple keyword frequency score.\n\nCurrently only financial statements (in CSV format) can be classified. 
A keyword-based class map for a given document type (stored as a JSON file in `static`) is used to compute a frequency score for keywords occurring in a user-specified list of columns in the CSV document.\n\nScoring \u0026 Classification Functions\n----------------------------------\n\nThe scoring function, for a given CSV document and a list of keywords, is given by\n\n$$\nscore(W, C) = \\frac{\\sum_{w_i \\in W} \\sum_{c_{j,k} \\in C} I_{i,j,k}}{r \\cdot m \\cdot n}\n$$\n\nwhere $W$ is the set of $r$ keywords $w_1,\\ldots,w_r$ to search for, $C$ is the user-defined set of $n$ columns $C_1,\\ldots,C_n$ in which to perform the keyword search - the $j$-th column containing $m$ strings $c_{j,1},\\ldots,c_{j,m}$ - and $I_{i,j,k}$ is an indicator function for the presence of keyword $w_i$ in each of the $m \\cdot n$ column entries $c_{j,k}$, given by\n\n$$\nI_{i,j,k} = \\begin{cases}\n            1, \\hskip{3em} w_i \\in c_{j,k} \\\\\n            0, \\hskip{3em} w_i \\notin c_{j,k}\n            \\end{cases}\n$$\n\nNote: the score is guaranteed to lie between 0 and 1 (inclusive), as the frequency count (the numerator in the scoring function) is at most $r \\cdot m \\cdot n$.\n\nGiven a CSV document $\\mathcal{D}$ of type $\\mathcal{T}(\\mathcal{D})$, with a set $L$ of $t$ classes $L_1,\\ldots,L_t$ defined in its class keywords map (a JSON file with keys being the class names/IDs $L_1,\\ldots,L_t$ and values being the lists of keywords $W(L_1),\\ldots,W(L_t)$ associated with the classes), and $C$ being the user-defined set of $n$ columns $C_1,\\ldots,C_n$ in which to perform the keywords search, the classification function is given by\n\n$$\nclassify(\\mathcal{D}, C) = argmax_{L_i \\in L} \\hskip{1em} score(W(L_i), C)\n$$\n\n\nUsage\n-----\n\nHere's a simple example of classifying a sample income statement with only a single keywords-search column.\n\n    [/path/to/doc_classifier/src]$ ./classify.py -t 'financial statements' -f ../sample_data/income_statement/microsoft.csv --verbose\n\n    Classification: income\n    
Keywords score map: {\n        \"income\": 0.03428571428571429,\n        \"cash flow\": 0.013333333333333334,\n        \"balance sheet\": 0.0\n    }\n\nThis is an example of classifying a sample income statement with multiple keywords-search columns.\n\n    [/path/to/doc_classifier/src]$ ./classify.py -t 'financial statements' -f ../sample_data/income_statement/microsoft2.csv  -c 'line item 1, line item 2' --verbose\n\n    Classification: income\n    Keywords score map: {\n        \"income\": 0.025714285714285714,\n        \"cash flow\": 0.006666666666666667,\n        \"balance sheet\": 0.0\n    }\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsr-murthy%2Fdoc_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsr-murthy%2Fdoc_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsr-murthy%2Fdoc_classifier/lists"}