Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kolia1985/kolia1985
Mykola Melnyk profile
- Host: GitHub
- URL: https://github.com/kolia1985/kolia1985
- Owner: mykolamelnykml
- Created: 2024-11-19T11:57:30.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-12-13T06:34:46.000Z (2 months ago)
- Last Synced: 2024-12-13T07:28:12.443Z (2 months ago)
- Topics: data-engineering, data-science, spark
- Homepage: https://stabrise.com
- Size: 11.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Greetings!
My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in transforming *complex business ideas into scalable, secure, and efficient AI-driven products*. My expertise spans several areas, which lets me *deliver cutting-edge, top-tier AI solutions* that drive business growth and improve efficiency.
### Key Areas of My Specialization:
- **Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR):** 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using technologies such as Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.
- **Big Data Processing with Apache Spark:** 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals and Spark Structured Streaming. Creator of and contributor to the open-source spark-pdf data source project, written in Scala, which extends Spark's capabilities.
- **Data De-identification & Anonymization:** Expert in anonymizing sensitive data in text, images, PDFs, and DICOM files. I use NLP, OCR, and computer vision to remove or mask personal information, supporting privacy, security, and compliance with GDPR and HIPAA standards.
- **Healthcare, Pharma, MedTech, BioTech Expertise:** Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats such as DICOM and experience delivering solutions tailored to the needs of these industries.
### Top 5 Reasons to Work With Me
- End-to-End Expertise
- Complex Problem-Solving Ability
- Timely Delivery
- Transparent Communication
- Scalable Solutions
### Professional Skills
- **Programming Languages:** Python, Scala
- **Data Science & Machine Learning:** NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)
- **LLMs and Related Tools:** OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio
- **Big Data & Distributed Systems:** Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks
- **Cloud Computing & Infrastructure:** Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana
- **Databases:** PostgreSQL, MongoDB, Redis, DynamoDB
- **CRMs:** HubSpot, Zoho CRM
### Availability
Committed to long-term collaborations. Available full-time for your next project.
## My Projects
## Spark PDF DataSource
---
**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)
**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)
**Quick Start Jupyter Notebook**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)
---
The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
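As a rough illustration of how such a data source is typically used from PySpark, the sketch below wraps the standard `DataFrameReader` chain (`format`, `option`, `load`). The format name `"pdf"` and the option names and values (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions that should be checked against the spark-pdf README; the helper `read_pdf` and the constant `DEFAULT_PDF_OPTIONS` are hypothetical names introduced only for this example.

```python
from typing import Any, Dict, Optional

# Assumed option names and values; verify them against the spark-pdf README.
DEFAULT_PDF_OPTIONS: Dict[str, str] = {
    "imageType": "BINARY",      # assumed: how rendered page images are encoded
    "resolution": "200",        # assumed: rendering resolution in DPI
    "pagePerPartition": "2",    # assumed: number of pages per Spark partition
    "reader": "pdfBox",         # assumed: backend used to parse the PDF
}


def read_pdf(spark: Any, path: str, options: Optional[Dict[str, str]] = None) -> Any:
    """Load PDF files at `path` through the (assumed) "pdf" data source.

    `spark` is expected to be a SparkSession started with the spark-pdf
    package on its classpath. Only standard DataFrameReader methods are used.
    """
    merged = dict(DEFAULT_PDF_OPTIONS if options is None else options)
    reader = spark.read.format("pdf")  # assumed short name of the data source
    for key, value in merged.items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a configured session, a call such as `df = read_pdf(spark, "path/to/docs/*.pdf")` would return a DataFrame. Because Spark evaluates lazily, no pages are processed until an action such as `df.show()` runs.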
### Key features:
- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Handle large files, up to 10k pages
- Handle scanned PDF files by calling OCR
- No need to install Tesseract OCR; it is included in the package
## ScaleDP
---
**Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)
**Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)
**Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)
---
ScaleDP is an open-source library for processing documents with Apache Spark.
### Key features:
- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results
## GitHub
[![Mykola's GitHub stats](https://github-readme-stats-sigma-five.vercel.app/api?username=mykolamelnykml&include_all_commits=true&count_private=true&show_icons=true)](https://github.com/mykolamelnykml)