https://github.com/emrys-hong/sap-internship-dhl
This is one of the production pipelines I built during my internship at SAP: extracting structured information from DHL address tags.
- Host: GitHub
- URL: https://github.com/emrys-hong/sap-internship-dhl
- Owner: Emrys-Hong
- Created: 2018-08-17T10:21:48.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T02:22:39.000Z (over 2 years ago)
- Last Synced: 2025-01-15T08:19:23.374Z (4 months ago)
- Language: Python
- Size: 5.56 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 27
Metadata Files:
- Readme: README.md
README
# DHL-pipline-contextual-method
## Dependencies
Install the dependencies in `requirements.txt`. The pipeline was built against CUDA 8.0 and TensorFlow 1.3.0.

## Download trained files
`frozen_inference_graph.pb` and `labelmap.pbtxt` are needed by `object_detection.py`; both can be found in the `trained_models` folder. The classification models and the datasets `pvsh` and `4_kind_pre` are also in `trained_models`. I used the `pvsh` dataset for classification and got an accuracy of 88%; the path needs to be specified in `classification.py`.
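For reference, frozen graphs like this are typically loaded in TensorFlow 1.x as follows. This is a minimal sketch, not the repo's own code: the tensor names follow the TF Object Detection API export convention, and the paths are assumptions.

```python
import tensorflow as tf

# Assumed path: point this at the file in `trained_models`.
PATH_TO_GRAPH = "trained_models/frozen_inference_graph.pb"

# Load the frozen detection graph exported by the TF Object Detection API.
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_GRAPH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=detection_graph) as sess:
    # Standard tensor names used by Object Detection API exports.
    image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
    boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
    scores = detection_graph.get_tensor_by_name("detection_scores:0")
    classes = detection_graph.get_tensor_by_name("detection_classes:0")
    # `image_np` would be an HxWx3 uint8 numpy array of the parcel photo:
    # out = sess.run([boxes, scores, classes],
    #                feed_dict={image_tensor: image_np[None, ...]})
```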
The embeddings and related files needed by `contextual.py` are also in `trained_models`. Download `sequence_tagging_thailand` from the `trained_models` folder and put `model/contextual.py` inside `sequence_tagging_thailand` to run prediction.
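If you want to call the tagger directly, the upstream sequence_tagging repo exposes roughly this interface (a sketch based on that repo's `evaluate.py`; module paths and config fields may differ in the `sequence_tagging_thailand` copy):

```python
from model.config import Config
from model.ner_model import NERModel

# Build the model and restore the trained weights; paths come from the
# repo's Config (adjust them for sequence_tagging_thailand).
config = Config()
model = NERModel(config)
model.build()
model.restore_session(config.dir_model)

# Predict tags for one OCR'd line, tokenized into words.
words = "ขอนแก่น 40000".split()
preds = model.predict(words)
print(list(zip(words, preds)))
```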
## Procedure
The stages, in order (an end-to-end sketch follows this list):
- `model/object_detection.py` input: raw image from DHL; output: four coordinates for the address and barcode regions
- `preprocess/crop.py` input: coordinates from object detection; output: cropped images of the address and barcode
- `preprocess/deskew.py` input: image; output: deskewed image
- `preprocess/tesseract.py` input: image; output: text after OCR
- `model/classification.py` input: deskewed image; output: a binary value indicating whether the address is printed
- `model/contextual.py` input: text from `tesseract.py`; output: the final structured data
- `test.py` tests the full pipeline.
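Chained together, the stages look roughly like this. All function names here are hypothetical stand-ins for what each script does; in the repo each stage is a standalone script.

```python
# Hypothetical imports mirroring the scripts listed above.
from model.object_detection import detect_regions
from preprocess.crop import crop
from preprocess.deskew import deskew
from preprocess.tesseract import ocr
from model.classification import is_printed
from model.contextual import extract_fields

def run_pipeline(image):
    """Turn a raw DHL parcel photo into structured address fields."""
    address_box, barcode_box = detect_regions(image)  # 1. locate regions
    address_img = crop(image, address_box)            # 2. cut them out
    address_img = deskew(address_img)                 # 3. straighten
    if not is_printed(address_img):                   # 4. skip handwriting
        return None
    text = ocr(address_img)                           # 5. Tesseract OCR
    return extract_fields(text)                       # 6. contextual NER
```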
## How to generate those files if they do not work
`model/contextual.py` uses [this GitHub repo](https://github.com/guillaumegenthial/sequence_tagging) to produce its results; that model achieves state-of-the-art results on the CoNLL NER task. I downloaded the fastText embeddings for Thai (for testing on English, I used GloVe embeddings). To produce the test data, use `parcel_data.xls` and `generate_contextual_data.ipynb` in the `extra_file` folder.
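For reference, the upstream repo trims the full embedding file down to the task vocabulary before training. A minimal sketch of that step, assuming a GloVe/fastText-style text file and a hypothetical `vocab.txt` with one word per line:

```python
import numpy as np

def trim_embeddings(vocab_path, embeddings_path, out_path, dim=300):
    """Keep only the vectors for in-vocabulary words and save as .npz."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {w.strip(): i for i, w in enumerate(f)}
    vectors = np.zeros((len(vocab), dim))
    with open(embeddings_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:   # skip fastText's header line
                continue
            word = parts[0]
            if word in vocab:
                vectors[vocab[word]] = np.asarray(parts[1:], dtype=np.float32)
    np.savez_compressed(out_path, embeddings=vectors)
```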
## Download files
Download the `results` folder (trained models) and the `data` folder (training data) and put them in the root directory.

## Results
For the entity linking code, run:
```CUDA_VISIBLE_DEVICES=1 python test_entity_linking.py test_images/63.png```
Predictions:
```
barcode is: TDPSHO720171
preprovince ขอนแก่น
prepostcode 40000
prename อเมืองจขชอนนคน
pre_address ส.จุฑามาศ. เลิน ห้อง 610
หอสทธิลักษณ์ 473/โหมุ27 ด.ติลฯ
```
Results:
```
Barcode: TDPSH07201711223
Province: ขอนแก่น
Zipcode: 40000
Name: จุฑามาศ เมลิน ห้อง610
State: ขอนแก่น
Address: หอสุทธิลักษณ์ 473/1 ม.27 ตำบลศิลา
```
For the contextual NER code, run:
```CUDA_VISIBLE_DEVICES=1 python test_contextual.py test_images/63.png```
For the newest combined code, run:
```CUDA_VISIBLE_DEVICES=1 python test_contextual.py test_images/63.png```

## Running time and memory taken
| Setup | CPU memory | GPU memory | Running time per picture |
|---|---|---|---|
| CPU only | 725.6 MB | n/a | 22.2 s |
| GPU and CPU | 2566 MB | 19700 MB | 12.9 s |
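Numbers like these can be reproduced with a small harness. A sketch using `psutil` (an extra dependency) for resident memory and `nvidia-smi` for GPU memory; `run_pipeline` is the hypothetical entry point from the sketch above:

```python
import subprocess
import time

import psutil

process = psutil.Process()  # the current Python process

start = time.time()
# result = run_pipeline(image)  # hypothetical entry point
elapsed = time.time() - start

rss_mb = process.memory_info().rss / (1024 ** 2)
print(f"cpu memory: {rss_mb:.1f}MB; running time per picture: {elapsed:.1f}s")

# GPU memory (if a GPU is used), queried via nvidia-smi:
gpu_mb = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
).decode().strip()
print(f"GPU memory: {gpu_mb}MB")
```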