Mykola Melnyk profile
https://github.com/kolia1985/kolia1985

data-engineering data-science spark


Greetings! 👋

My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in transforming *complex business ideas into scalable, secure, and efficient AI-driven products*. My expertise spans several areas, enabling me to *deliver cutting-edge, top-tier AI solutions* that drive business growth and improve efficiency.

### Key Areas of My Specialization:

📄 **Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR):** 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.

⚡ **Big Data Processing with Apache Spark:** 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals and Spark Structured Streaming. Creator of and contributor to the open-source spark-pdf data source project, written in Scala, which extends Spark's capabilities.

🔒 **Data De-identification & Anonymization:** Expert in anonymizing sensitive data in text, images, PDFs, and DICOM files. I ensure privacy, security, and compliance with GDPR and HIPAA by using NLP, OCR, and computer vision to remove or mask personal information, safeguarding data confidentiality.

🧬 **Healthcare, Pharma, MedTech, BioTech Expertise:** Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats like DICOM and experience delivering solutions tailored to the needs of these industries.

### Top 5 Reasons to Work With Me

✅ End-to-End Expertise

✅ Complex Problem-Solving Ability

✅ Timely Delivery

✅ Transparent Communication

✅ Scalable Solutions

### Professional Skills

🛠️ **Programming Languages:** Python, Scala

📊 **Data Science & Machine Learning:** NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionalization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)

💡 **LLMs and Related Tools:** OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio

⚡ **Big Data & Distributed Systems:** Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks

🚀 **Cloud Computing & Infrastructure:** Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana

⚙️ **Databases:** PostgreSQL, MongoDB, Redis, DynamoDB

💼 **CRMs:** HubSpot, Zoho CRM

### Availability
Committed to long-term collaborations. Available full-time for your next project.

## My Projects

## Spark PDF DataSource

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)

**Quick Start Jupyter Notebook**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

The project provides a custom data source for Apache Spark that reads PDF files into a Spark DataFrame.

### Key features:

- Reads PDF documents into a Spark DataFrame
- Reads PDF files lazily, page by page
- Supports large files, up to 10k pages
- Supports scanned PDF files (via OCR)
- No need to install Tesseract OCR; it's included in the package
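
As a rough illustration of how such a data source is typically used from PySpark, here is a minimal sketch. The format name `"pdf"` and the option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions based on the project's documentation and may differ between versions; the helpers `pdf_reader_options` and `read_pdfs` are hypothetical names introduced only for this example. Check the project README before relying on any of them.

```python
from typing import Any, Dict


def pdf_reader_options(resolution: int = 200,
                       pages_per_partition: int = 2,
                       reader: str = "pdfBox") -> Dict[str, str]:
    """Build string-valued reader options (option names are assumptions)."""
    return {
        "imageType": "BINARY",                         # assumed option name
        "resolution": str(resolution),                 # assumed: render DPI
        "pagePerPartition": str(pages_per_partition),  # assumed option name
        "reader": reader,                              # assumed option name
    }


def read_pdfs(spark: Any, path: str, **kwargs: Any) -> Any:
    """Load PDF files through the (assumed) "pdf" data source.

    `spark` is an existing SparkSession that already has the spark-pdf
    package on its classpath. Only standard DataFrameReader calls
    (format, option, load) are used here.
    """
    reader = spark.read.format("pdf")
    for key, value in pdf_reader_options(**kwargs).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a real session this would be called as `df = read_pdfs(spark, "path/to/pdfs/*.pdf")` followed by an action such as `df.show()`. Keeping the options in a plain dictionary makes them easy to inspect or override without touching the reader chain.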

## ScaleDP

---

**Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)

**Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)

---

ScaleDP is an open-source library for processing documents with Apache Spark.

### Key features:

- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results

## GitHub

[![Mykola's GitHub stats](https://github-readme-stats-sigma-five.vercel.app/api?username=mykolamelnykml&include_all_commits=true&count_private=true&show_icons=true)](https://github.com/mykolamelnykml)