**[English](./README.md)** | [中文](./README_tw.md)

# DocClassifier

<p align="left">
   <a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
   <a href=""><img src="https://img.shields.io/badge/python-3.10+-aff.svg"></a>
   <a href="https://github.com/DocsaidLab/DocClassifier/releases"><img src="https://img.shields.io/github/v/release/DocsaidLab/DocClassifier?color=ffa"></a>
   <a href="https://pypi.org/project/docclassifier_docsaid/"><img src="https://img.shields.io/pypi/v/docclassifier_docsaid.svg"></a>
   <a href="https://pypi.org/project/docclassifier_docsaid/"><img
src="https://img.shields.io/pypi/dm/docclassifier_docsaid?color=9cf"></a>
</p>

## Introduction

<div align="center">
   <img src="https://github.com/DocsaidLab/DocClassifier/raw/main/docs/title.jpg?raw=true" width="800">
</div>

DocClassifier is a document image classification system based on metric learning. It was motivated by the difficulty traditional classifiers have in keeping up with a rapidly growing number of document types, many of which are loosely defined. It adopts the PartialFC feature-learning architecture and integrates techniques such as CosFace and ArcFace, so the model can classify accurately without relying on a large set of predefined categories. By expanding the training data with ImageNet-1K and incorporating the CLIP model, we improved performance as well as the model's adaptability and scalability. The model is trained with PyTorch, can be converted to the ONNX format for deployment across platforms, and runs inference with ONNXRuntime. In our tests, the model achieved over 90% accuracy with fast inference, and new document types can be added quickly, which meets the needs of most application scenarios.

## Documentation

Because this project's usage instructions and configuration options are extensive, this README only summarizes the "Model Design" section.

For more details, please refer to the [**DocClassifier Documents**](https://docsaid.org/en/docs/docclassifier/).

## Installation

### via PyPI

1. Install the package from PyPI:

   ```bash
   pip install docclassifier-docsaid
   ```

2. Verify the installation:

   ```bash
   python -c "import docclassifier; print(docclassifier.__version__)"
   ```

3. If a version number is printed, the installation was successful.

### via Git Clone

1. Clone this repository:

   ```bash
   git clone https://github.com/DocsaidLab/DocClassifier.git
   ```

2. Install the `wheel` package:

   ```bash
   pip install wheel
   ```

3. Build the wheel file:

   ```bash
   cd DocClassifier
   python setup.py bdist_wheel
   ```

4. Install the built wheel file:

   ```bash
   pip install dist/docclassifier_docsaid-*-py3-none-any.whl
   ```

## Inference

> [!TIP]
> The model is downloaded automatically: if the program detects that the model is missing, it connects to our server and downloads it.

Here is a simple example:

```python
import cv2
from skimage import io
from docclassifier import DocClassifier

# Load the test image (skimage returns RGB) and convert it to BGR channel order.
img = io.imread('https://github.com/DocsaidLab/DocClassifier/blob/main/docs/test_driver.jpg?raw=true')
img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)

model = DocClassifier()

# The instance is callable and returns the best-matching document type and its score.
most_similar, max_score = model(img)
print(f'most_similar: {most_similar}, max_score: {max_score:.4f}')
# >>> most_similar: None, max_score: 0.0000
```

By default, this example returns `None` and `0.0000` because the input image differs significantly from the default registration data.
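
To make the role of the threshold concrete, here is a minimal, dependency-free sketch of threshold-based matching against registered documents. It assumes cosine similarity between embeddings and assumes that a below-threshold result is reported as `(None, 0.0)`, which is consistent with the outputs in this section (the same image scores 0.6116 once the threshold is lowered to 0.6). The function names, the `registry` dictionary, and the toy two-dimensional vectors are hypothetical and are not part of the `docclassifier` API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_document(query_embedding, registry, threshold):
    """Return (label, score) for the most similar registered document,
    or (None, 0.0) when even the best score is below `threshold`.

    registry: hypothetical dict mapping a document label to its embedding.
    """
    best_label, best_score = None, -1.0
    for label, embedding in registry.items():
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, 0.0
    return best_label, best_score
```

For example, with a registry `{"card": [1.0, 0.0], "license": [0.6, 0.8]}`, the query `[0.8, 0.6]` scores about 0.96 against `"license"`: a threshold of 0.9 returns that match, while a threshold of 0.98 returns `(None, 0.0)`.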

Here, the similarity between the image and the registration data falls below the default threshold, so no match is reported.

In this case, you may consider lowering the `threshold` parameter:

```python
# A lower threshold accepts matches with lower similarity scores.
model = DocClassifier(
    threshold=0.6
)

# Re-run the inference
most_similar, max_score = model(img)
print(f'most_similar: {most_similar}, max_score: {max_score:.4f}')
# >>> most_similar: Taiwan driver's license front, max_score: 0.6116
```

> [!TIP]
> `DocClassifier` implements `__call__`, so you can call the instance directly to run inference.

## Model Design

Arriving at a comprehensive model took multiple rounds of adjustment and redesign.

### First Generation Model

![arch_1.jpg](https://github.com/DocsaidLab/DocClassifier/raw/main/docs/arch1.jpg?raw=true)

The first-generation model, our earliest version, has a basic architecture with four parts:

1. **Feature Extraction**

   ![pp-lcnet.jpg](https://github.com/DocsaidLab/DocClassifier/raw/main/docs/lcnet_arch.jpg?raw=true)

   This stage converts images into vectors, using [**PP-LCNet**](https://arxiv.org/abs/2109.15099) as the feature extractor.

   The input is a 128 x 128 RGB image, and the feature extractor outputs a 256-dimensional vector.

2. **CosFace**

   [![cosface.jpg](https://github.com/DocsaidLab/DocClassifier/raw/main/docs/cosface.jpg?raw=true)](https://arxiv.org/pdf/1801.09414.pdf)

   To test whether metric learning works for this task, we trained directly with [**CosFace**](https://arxiv.org/abs/1801.09414) instead of a traditional classifier. CosFace normalizes both the features and the class weights and subtracts a margin from the target-class cosine inside the softmax loss, which pushes different classes further apart during training.
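
For a single sample with ground-truth class $y$, CosFace computes $L = -\log \frac{e^{s(\cos\theta_y - m)}}{e^{s(\cos\theta_y - m)} + \sum_{j \neq y} e^{s\cos\theta_j}}$, where $\theta_j$ is the angle between the normalized feature and the normalized weight vector of class $j$. The sketch below implements that formula in plain Python for one sample. The scale `s = 30` and margin `m = 0.35` are illustrative defaults, not the values used in DocClassifier, and the function names are hypothetical; a real training loop would operate on batched PyTorch tensors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosface_loss(embedding, class_weights, label, s=30.0, m=0.35):
    """CosFace (large-margin cosine) loss for one sample.

    embedding:     feature vector produced by the backbone
    class_weights: one weight vector (class center) per class
    label:         index of the ground-truth class
    s, m:          scale and additive cosine margin (illustrative values)
    """
    x = l2_normalize(embedding)
    # Cosine between the normalized feature and every normalized class center.
    cosines = [sum(a * b for a, b in zip(x, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale all logits.
    logits = [s * (c - m) if j == label else s * c for j, c in enumerate(cosines)]
    # Softmax cross-entropy, computed with the log-sum-exp trick for stability.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]
```

Setting `m=0.0` recovers a plain normalized softmax loss. With the margin, the target class must beat the other classes by at least `m` in cosine before the loss becomes small, which is what pushes classes apart.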

3. **Dataset**

   To train the model, we wrote a simple web crawler to collect document images.

   We gathered approximately 650 different documents, mostly credit cards from major banks.

   The dataset is available here: [**UniquePool**](https://github.com/DocsaidLab/DocClassifier/tree/main/data/unique_pool).

4. **Training**

   We trained the model with PyTorch, treating each image as a separate class so that the model would learn to identify subtle differences between documents. Because there was only one original image per class, this approach required data augmentation.

   We used [**Albumentations**](https://github.com/albumentations-team/albumentations) for data augmentation to enlarge the dataset.

---

The first-generation model validated our concept but revealed problems in practical use:

1. **Stability**

   The model was unstable and sensitive to environmental changes, and document distortions introduced during alignment degraded performance significantly.

2. **Performance**

   The model struggled with similar-looking documents, indicating that it had not learned discriminative features and could not tell different documents apart.

Our conclusion: **The model was overfitting!**

### Second Generation Model

![arch_2.jpg](https://github.com/DocsaidLab/DocClassifier/raw/main/docs/arch2.jpg?raw=true)

The second-generation model introduced several improvements:

1. **More Data**

   We expanded the dataset with MIT's [**Indoor Scene Recognition**](https://web.mit.edu/torralba/www/indoor.html) dataset, adding 15,620 images of 67 indoor scene categories.

2. **Using PartialFC**

   ![partialfc.jpg](https://github.com/DocsaidLab/DocClassifier/raw/main/docs/pfc_arch.jpg?raw=true)

   As the number of classes increased, the classification head grew very large. [**PartialFC**](https://arxiv.org/abs/2203.15565) addresses this by computing the softmax-based loss over only a sampled subset of the class centers at each training step.
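
The sampling step can be illustrated with a short sketch: every class that appears in the current batch (the positive class centers) is always kept, and the remaining budget, `sample_rate` times the total number of classes, is filled with randomly chosen negative classes. This is a simplified, single-process reading of the PartialFC idea (the original work also shards the class centers across GPUs), and the function name and signature are hypothetical.

```python
import random


def sample_class_centers(batch_labels, num_classes, sample_rate=0.1, rng=None):
    """Choose the class centers used in one training step (PartialFC-style).

    Classes present in the batch (positives) are always kept; the rest of the
    budget, sample_rate * num_classes, is filled with random negative classes.
    Returns the sorted sampled class ids and the batch labels remapped to
    positions within that list.
    """
    rng = rng or random.Random()
    positives = set(batch_labels)
    num_sample = max(len(positives), int(sample_rate * num_classes))
    negatives = [c for c in range(num_classes) if c not in positives]
    chosen = rng.sample(negatives, num_sample - len(positives))
    sampled = sorted(positives | set(chosen))
    position = {class_id: i for i, class_id in enumerate(sampled)}
    remapped = [position[y] for y in batch_labels]
    return sampled, remapped
```

Only the rows of the classification weight matrix listed in `sampled` enter that step's softmax-based loss (for example, the CosFace loss), and the batch labels are looked up through `remapped`, so the cost of the head scales with the sample rate rather than with the full number of classes.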

   The PartialFC authors showed that sampling only 10% of the classes when training with a Softmax-based loss retains accuracy.

3. **More Data Augmentation**

   To combat overfitting, we treated each transformed version of an image (rotations, flips, and crops) as a separate class, which expanded the dataset to (15,620 + 650) x 24 = 390,480 images (classes).

4. **Switching to ImageNet-1K**

   We replaced **Indoor Scene Recognition** with [**ImageNet-1K**](https://www.image-net.org/), which provides 1,281,167 images across 1,000 classes. The much greater data diversity solved the overfitting problem.

### Third Generation Model

![arch_3.jpg](https://github.com/DocsaidLab/DocClassifier/raw/main/docs/arch3.jpg?raw=true)

To make the model more stable, we integrated new techniques:

1. **CLIP**

   Inspired by OpenAI's [**CLIP**](https://arxiv.org/abs/2103.00020), we aligned our model's features with CLIP's more robust feature vectors.

   The process:

   1. Keep the second-generation architecture.
   2. Extract image features with both our CNN backbone and the CLIP-Image branch.
   3. Compute the KL divergence (KLD) loss between the two feature vectors.
   4. Add the KLD loss to the original loss function, keeping the CLIP-Image branch parameters frozen.

   This approach significantly improved stability and raised performance on the validation set by nearly 5%.
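
The four CLIP alignment steps can be sketched without any framework. The text does not specify how the two feature vectors are compared, so this sketch makes its own assumptions: both vectors have the same dimension, each is turned into a probability distribution with a softmax, the loss is KL(P_clip || P_backbone), and it is added to the metric-learning loss with a hypothetical weight `kld_weight`. All function and parameter names are illustrative.

```python
import math


def softmax(values, temperature=1.0):
    """Turn a feature vector into a probability distribution."""
    scaled = [v / temperature for v in values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kld_alignment_loss(backbone_feat, clip_feat, temperature=1.0):
    """KL(P_clip || P_backbone): how far the backbone's distribution is from CLIP's."""
    if len(backbone_feat) != len(clip_feat):
        raise ValueError("feature vectors must have the same dimension")
    p = softmax(clip_feat, temperature)      # target from the frozen CLIP-Image branch
    q = softmax(backbone_feat, temperature)  # prediction from the trainable CNN backbone
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def total_loss(metric_loss, backbone_feat, clip_feat, kld_weight=1.0):
    """Original metric-learning loss plus the weighted KLD alignment term."""
    return metric_loss + kld_weight * kld_alignment_loss(backbone_feat, clip_feat)
```

In a PyTorch training loop, the CLIP features would be computed with the CLIP parameters frozen (for example, `requires_grad_(False)` plus `torch.no_grad()`), so gradients from the KLD term update only the backbone. Note that `torch.nn.functional.kl_div` expects log-probabilities as its input and, by default, probabilities as its target.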

2. **Stacking Normalization Layers**

   After experimenting with different normalization layers, we found that following each Linear layer with LayerNorm and then BatchNorm gave the best results, improving performance by about 5%.

   ```python
   import torch.nn as nn

   # Part of the model class (hence `self`): in_dim_flatten is the size of the
   # flattened input features and embed_dim is the embedding size.
   self.embed_feats = nn.Sequential(
       nn.Linear(in_dim_flatten, embed_dim, bias=False),
       nn.LayerNorm(embed_dim),
       nn.BatchNorm1d(embed_dim),
       nn.Linear(embed_dim, embed_dim, bias=False),
       nn.LayerNorm(embed_dim),
       nn.BatchNorm1d(embed_dim),
   )
   ```

## Conclusion

The third-generation model achieved significant improvements in stability and performance and delivered satisfactory results in practical applications.

We consider the goals of this project phase met and hope our findings will benefit others.

## Citation

We are grateful for the prior work that greatly contributed to our research.

If you find our work helpful, please cite our repository:

```bibtex
@misc{lin2024docclassifier,
  author = {Kun-Hsiang Lin and Ze Yuan},
  title = {DocClassifier},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/DocsaidLab/DocClassifier},
  note = {GitHub repository}
}
```