https://github.com/kolia1985/kolia1985

Mykola Melnyk profile

data-engineering data-science spark



Host: GitHub
URL: https://github.com/kolia1985/kolia1985
Owner: mykolamelnykml
Created: 2024-11-19T11:57:30.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-19T11:32:48.000Z (over 1 year ago)
Last Synced: 2025-03-19T12:31:03.004Z (over 1 year ago)
Topics: data-engineering, data-science, spark
Homepage: https://stabrise.com
Size: 21.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


README

Greetings! 👋

My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in transforming *complex business ideas into scalable, secure, and efficient AI-driven products*. My expertise spans several areas, which enables me to *deliver cutting-edge, top-tier AI solutions* that drive business growth and improve efficiency.

### Key Areas of My Specialization:

📄 **Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR):** 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using technologies such as Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.

⚡ **Big Data Processing with Apache Spark:** 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals and Spark Structured Streaming. Creator of and contributor to spark-pdf, an open-source data source written in Scala that lets Spark read PDF files.

🔒 **Data De-identification & Anonymization:** Expert in anonymizing sensitive data from text, images, PDFs, and DICOM files. I ensure privacy, security, and compliance with GDPR and HIPAA standards using NLP, OCR, and computer vision to remove or mask personal information, safeguarding data confidentiality.

🧬 **Healthcare, Pharma, MedTech, BioTech Expertise:** Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats like DICOM and a track record of delivering solutions tailored to the needs of these industries.

### Top 5 Reasons to Work With Me
✅ End-to-End Expertise

✅ Complex Problem-Solving Ability

✅ Timely Delivery

✅ Transparent Communication

✅ Scalable Solutions

### Professional Skills

🛠️ **Programming Languages:** Python, Scala

📊 **Data Science & Machine Learning:** NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionalization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)

💡 **LLMs and Related Tools:** OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio

⚡ **Big Data & Distributed Systems:** Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks

🚀 **Cloud Computing & Infrastructure:** Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana

⚙️ **Databases:** PostgreSQL, MongoDB, Redis, DynamoDB

💼 **CRMs:** HubSpot, Zoho CRM

### Availability
Committed to long-term collaborations. Available full-time for your next project.

## My Projects

## Spark PDF DataSource

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)

**Quick Start Jupyter Notebook**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

### Key features:

- Reads PDF documents into a Spark DataFrame
- Reads PDF files lazily, one page at a time
- Supports big files, up to 10k pages
- Supports scanned PDF files by running OCR on them
- No need to install Tesseract OCR; it's included in the package
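
The lazy per-page reading and the OCR path for scanned files can be illustrated with a minimal, library-free Python sketch. This is not the spark-pdf API: `Page`, `iter_pages`, `extract_text`, `fake_load_page`, and `fake_ocr` are hypothetical names invented for the example, and the "document" is simulated in memory. It only shows the general idea that a page is loaded when it is requested, and that OCR is called only for pages without a text layer.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Hypothetical page record; text=None models a scanned page with no text layer."""
    number: int
    text: Optional[str]
    image: bytes = b""


def iter_pages(load_page: Callable[[int], Page], page_count: int) -> Iterator[Page]:
    """Yield pages one at a time; a page is loaded only when the consumer asks for it."""
    for number in range(page_count):
        yield load_page(number)


def extract_text(page: Page, ocr: Callable[[bytes], str]) -> str:
    """Use the embedded text layer when present, otherwise fall back to OCR."""
    if page.text is not None:
        return page.text
    return ocr(page.image)


# --- Simulated document (assumption: stands in for a real PDF reader) ---
loaded: List[int] = []  # records which pages were actually loaded


def fake_load_page(number: int) -> Page:
    loaded.append(number)
    if number % 2 == 0:
        return Page(number, f"text of page {number}")
    return Page(number, None, image=f"scan-{number}".encode())


def fake_ocr(image: bytes) -> str:
    # Stand-in for a real OCR engine.
    return "ocr:" + image.decode()


# A 10k-page document is declared, but only the first two pages are consumed.
pages = iter_pages(fake_load_page, page_count=10_000)
first_two = [extract_text(next(pages), fake_ocr) for _ in range(2)]
```

Because `iter_pages` is a generator, consuming two pages loads exactly two pages, which is what keeps memory use bounded for very large files in this kind of design.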

## ScaleDP

---

**Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)

**Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)

---

ScaleDP is an open-source library for processing documents using Apache Spark.

### Key features:

- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results
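
The features above form a chain of stages: load, extract text, run NER, visualize. The sketch below shows that shape in plain Python. It does not use ScaleDP or Spark, and every name in it (`run_pipeline`, `text_stage`, `ner_stage`, `visualize_stage`) is hypothetical. Text extraction is replaced by a byte decode, and a single regular expression for e-mail addresses stands in for a real NER model.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order; every stage returns a new, enriched document."""
    for stage in stages:
        doc = stage(doc)
    return doc


def text_stage(doc: Document) -> Document:
    # Stand-in for text extraction / OCR: just decode the raw bytes.
    return {**doc, "text": doc["content"].decode("utf-8")}


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def ner_stage(doc: Document) -> Document:
    # Toy "NER": tag e-mail addresses with their character offsets.
    entities = [
        {"label": "EMAIL", "start": m.start(), "end": m.end(), "text": m.group()}
        for m in EMAIL.finditer(doc["text"])
    ]
    return {**doc, "entities": entities}


def visualize_stage(doc: Document) -> Document:
    # Render entities inline as [text|LABEL] so the result can be inspected.
    text, parts, pos = doc["text"], [], 0
    for ent in doc["entities"]:
        parts.append(text[pos:ent["start"]])
        parts.append(f"[{ent['text']}|{ent['label']}]")
        pos = ent["end"]
    parts.append(text[pos:])
    return {**doc, "rendered": "".join(parts)}


result = run_pipeline(
    {"content": b"Contact jane@example.com for details."},
    [text_stage, ner_stage, visualize_stage],
)
```

Keeping each stage a function from document to document means stages can be reordered or swapped, for example replacing the toy regex with a real model, without touching the rest of the chain.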

## GitHub

[![Mykola's GitHub stats](https://github-readme-stats-sigma-five.vercel.app/api?username=mykolamelnykml&include_all_commits=true&count_private=true&show_icons=true)](https://github.com/mykolamelnykml)
