https://github.com/eurohlam/grocery-expenses-pipeline
command-line pipeline that parses grocery receipts and converts them into key-value table
- Host: GitHub
- URL: https://github.com/eurohlam/grocery-expenses-pipeline
- Owner: eurohlam
- License: gpl-3.0
- Created: 2025-02-05T00:59:03.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-18T00:07:41.000Z (4 months ago)
- Last Synced: 2025-02-18T01:22:26.520Z (4 months ago)
- Topics: java, ocr, opencv, tesseract
- Language: Java
- Homepage:
- Size: 7.46 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# grocery-expenses-pipeline
The idea of this project is to create a simple command-line pipeline that parses grocery receipts and converts them into
a key-value table that can be represented as a hash table or as JSON.
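
As a rough illustration of the key-value idea, the sketch below turns OCR text into an insertion-ordered map and renders it as JSON. The class name `ReceiptTableSketch`, the line pattern, and the sample receipt text are assumptions made for this example, not the parsing rules the project actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ReceiptTableSketch {

    // Assumed rule: an item line ends with a price such as "3.49" or "3,49".
    private static final Pattern ITEM_LINE =
            Pattern.compile("^(.+?)\\s+\\$?(\\d+[.,]\\d{2})$");

    /** Turns raw OCR text into an insertion-ordered item -> price table. */
    static Map<String, String> parse(String ocrText) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String line : ocrText.split("\\R")) {
            Matcher m = ITEM_LINE.matcher(line.trim());
            if (m.matches()) {
                // Normalise a decimal comma to a decimal point.
                table.put(m.group(1).trim(), m.group(2).replace(',', '.'));
            }
        }
        return table;
    }

    /** Renders the table as a flat JSON object, escaping quotes and backslashes. */
    static String toJson(Map<String, String> table) {
        return table.entrySet().stream()
                .map(e -> "\"" + escape(e.getKey()) + "\": \"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(", ", "{", "}"));
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Made-up receipt text; lines without a trailing price are skipped.
        String ocrText = "PAK N SAVE\nMILK 2L 3.49\nBANANAS 2,10\nTOTAL 5.59\n";
        System.out.println(toJson(parse(ocrText)));
    }
}
```

A `LinkedHashMap` keeps the items in receipt order, which a plain `HashMap` or `Hashtable` would not guarantee.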

The project is implemented in Java 17 and uses:
- [tesseract](https://github.com/tesseract-ocr/) as the OCR (optical character recognition) engine
- [opencv4](https://opencv-java-tutorials.readthedocs.io/en/latest/) to preprocess the receipt image and improve it for OCR.
Only [bilateralFilter](https://docs.opencv.org/4.x/dc/dd3/tutorial_gausian_median_blur_bilateral_filter.html) is used at the moment.

> [!NOTE]
> There are plenty of similar projects on GitHub. Most of them are implemented in Python, a few in Java. ChatGPT can parse
> receipts very well. However, it uses the same tesseract and pretty straightforward parsing rules behind the scenes. Also,
> the free version of ChatGPT is limited to processing 2-3 images per day.
> So I decided to write my own implementation in my favourite language, Java.

## How to run tesseract via tess4j on macOS
Install tesseract via MacPorts

    sudo port install tesseract

In your Java application:

Add a VM argument

    -Djna.library.path=/opt/local/var/macports/software/tesseract/tesseract-5.4.1_2.darwin_22.x86_64/opt/local/lib

Or add an environment variable

    DYLD_LIBRARY_PATH=/opt/local/var/macports/software/tesseract/tesseract-5.4.1_2.darwin_22.x86_64/opt/local/lib

The path to tessdata should be

    /opt/local/share/tessdata/

Maven dependency

- groupId: `net.sourceforge.tess4j`
- artifactId: `tess4j`
- version: `5.12.0`
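
Before calling tess4j it can help to confirm that the tessdata directory really holds language files. The following is a minimal sketch using only the JDK; `TessdataCheck` and its methods are hypothetical helpers, and it relies only on Tesseract's `<lang>.traineddata` file naming.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TessdataCheck {

    private static final String SUFFIX = ".traineddata";

    /** Returns "eng" for "eng.traineddata"; empty for any other file name. */
    static Optional<String> languageOf(String fileName) {
        return fileName.endsWith(SUFFIX) && fileName.length() > SUFFIX.length()
                ? Optional.of(fileName.substring(0, fileName.length() - SUFFIX.length()))
                : Optional.empty();
    }

    /** Lists the language codes found in the directory, or nothing if it does not exist. */
    static List<String> availableLanguages(Path tessdataDir) {
        if (!Files.isDirectory(tessdataDir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files.map(p -> languageOf(p.getFileName().toString()))
                    .flatMap(Optional::stream)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the MacPorts tessdata location mentioned in this README.
        Path dir = Path.of(args.length > 0 ? args[0] : "/opt/local/share/tessdata/");
        System.out.println("jna.library.path = " + System.getProperty("jna.library.path"));
        System.out.println("languages in " + dir + ": " + availableLanguages(dir));
    }
}
```

An empty list usually means the path is wrong or no language data has been installed for tesseract.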
## How to run opencv4 via java
Install opencv4 with Java support via MacPorts

    sudo port install opencv4 +java

Check the installation

    sudo port contents opencv4 | grep java

If everything is fine, it should show:

    Port opencv4 @4.9.0_4+java contains:
    /opt/local/libexec/opencv4/java/jar/opencv-490.jar
    /opt/local/libexec/opencv4/java/jni/libopencv_java490.dylib

In your Java application:

Add a VM argument

    -Djava.library.path=/opt/local/libexec/opencv4/java/jni/

Your code must execute the following line before it uses OpenCV's Imgcodecs

    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

Maven dependency

- groupId: `org.openpnp`
- artifactId: `opencv`
- version: `4.9.0-0`
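
If `System.loadLibrary` fails with an `UnsatisfiedLinkError`, a quick lookup can show whether the native library is actually on `java.library.path`. This sketch is JDK-only; `NativeLibLookup` is a hypothetical helper, and the default name `opencv_java490` is taken from the MacPorts listing above and will differ for other OpenCV versions.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Optional;

class NativeLibLookup {

    /** Finds the first entry on the search path that contains the platform-specific library file. */
    static Optional<Path> locate(String libraryName, String searchPath) {
        // Maps "opencv_java490" to the platform file name, e.g. libopencv_java490.dylib on macOS.
        String fileName = System.mapLibraryName(libraryName);
        return Arrays.stream(searchPath.split(File.pathSeparator))
                .filter(dir -> !dir.isBlank())
                .map(dir -> Path.of(dir, fileName))
                .filter(Files::isRegularFile)
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed default name; pass another one as the first argument if needed.
        String name = args.length > 0 ? args[0] : "opencv_java490";
        String searchPath = System.getProperty("java.library.path", "");
        System.out.println(locate(name, searchPath)
                .map(p -> "found " + p)
                .orElse("not found on java.library.path: " + searchPath));
    }
}
```

Run it with the same `-Djava.library.path=...` argument as the application so that both see the same search path.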
## How to run the pipeline
Build the jar

    mvn clean package

Run java

    java \
      -Djna.library.path=/opt/local/var/macports/software/tesseract/tesseract-5.4.1_2.darwin_22.x86_64/opt/local/lib \
      -Djava.library.path=/opt/local/libexec/opencv4/java/jni/ \
      -jar target/grocery-expenses-pipeline-1.0-SNAPSHOT.jar \
      -file src/test/resources/samples/IMG_5213.jpeg \
      -restEndpoint http://localhost:8080 \
      -storeName PackNSave

where
* `-file` - the path to the image to be processed. This argument is mandatory.
* `-restEndpoint` - a REST endpoint to which the recognition and parsing result is sent as JSON. This argument is optional.
* `-storeName` - the name of the store that issued the receipt. Defaults to "Undefined".
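
The argument handling described above could be sketched roughly as follows. `PipelineArgs` is a hypothetical record, not the project's actual class, and sending the result as an HTTP POST with a JSON content type is an assumption, since this README only says the result is sent as JSON.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

record PipelineArgs(String file, Optional<URI> restEndpoint, String storeName) {

    /** Parses "-name value" pairs; -file is required, the other two are optional. */
    static PipelineArgs parse(String... args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            opts.put(args[i], args[i + 1]);
        }
        String file = opts.get("-file");
        if (file == null) {
            throw new IllegalArgumentException("-file is mandatory");
        }
        return new PipelineArgs(
                file,
                Optional.ofNullable(opts.get("-restEndpoint")).map(URI::create),
                opts.getOrDefault("-storeName", "Undefined"));
    }

    /** Builds (but does not send) a JSON POST; empty when no endpoint was given. */
    Optional<HttpRequest> toRequest(String json) {
        return restEndpoint.map(uri -> HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build());
    }

    public static void main(String[] args) {
        PipelineArgs parsed = parse(args);
        System.out.println(parsed);
        parsed.toRequest("{}")
                .ifPresent(r -> System.out.println(r.method() + " " + r.uri()));
    }
}
```

Modelling the endpoint as an `Optional` keeps the "no endpoint given" case explicit, so the caller can simply skip the HTTP step.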