https://github.com/priyam-hub/inside-medium


![Cover Page](images/Resized.png)

# 🤖 **Inside-Medium: The Right Article, at the Right Time**

*Discover trending, relevant reads instantly with AI-powered article matching!*

[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

[Features](#features) • [Installation](#installation) • [Documentation](#documentation) • [Usage](#usage) • [Contributing](#contributing)

---

## 🌟 Overview

**Inside-Medium** is an AI-powered content recommendation engine that helps readers find relevant, high-quality Medium articles based on their interests or on articles they select. It applies Natural Language Processing (NLP) in three stages: article text is vectorized with TF-IDF, Non-negative Matrix Factorization (NMF) extracts latent topics and encodes each article as a topic vector, and cosine similarity between those vectors determines which articles are recommended.
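The three stages described above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual code: the four toy texts are made up, and the parameter choices (`n_components=2`, `init="nndsvda"`, `max_iter=500`) are assumptions picked to suit such a tiny corpus.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for article bodies (made up for illustration only).
articles = [
    "neural networks learn representations from training data",
    "deep learning models need large training data and neural networks",
    "a sourdough bread recipe with flour water and salt",
    "baking bread at home needs flour water salt and patience",
]

# 1) TF-IDF turns each article into a sparse vector of term weights.
tfidf = TfidfVectorizer(stop_words="english")
term_matrix = tfidf.fit_transform(articles)

# 2) NMF factorizes the term matrix into non-negative topic weights,
#    giving one short topic vector per article.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
topic_matrix = nmf.fit_transform(term_matrix)

# 3) Cosine similarity between topic vectors scores every article pair.
similarity = cosine_similarity(topic_matrix)

print(similarity.round(2))
```

Row `i` of `similarity` holds the scores of article `i` against every other article, so ranking a row in descending order (and skipping the article itself) yields its recommendations.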

---

## 📚 Dataset - Medium Articles Dataset

📎 **Source**: [Medium Articles Dataset – Kaggle](https://www.kaggle.com/datasets/dorianlazar/medium-articles-dataset/data)

The **Medium Articles Dataset** is a curated collection of publicly available articles published on Medium.com. It contains both **textual content and engagement metadata**, making it ideal for tasks like recommendation systems, NLP, and content analysis.

#### Dataset Highlights:

* **Total Records**: \~8,000 articles
* **Key Columns**:
  * `title`: Title of the article
  * `subtitle`: Subtitle or secondary heading
  * `author`: Author of the article
  * `date`: Publication date
  * `claps`: Number of claps (engagement metric)
  * `reading_time`: Estimated reading time (in minutes)
  * `publication`: Name of the publication (if any)
  * `url`: Link to the original article
  * `article`: Full textual content of the article
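As a rough sketch of how the columns listed above might be validated on load, the snippet below parses an in-memory CSV with the standard library. The two sample rows are invented placeholders, `load_articles` is a hypothetical helper, and treating `claps` and `reading_time` as plain integers is an assumption; the real file may format them differently.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "title", "subtitle", "author", "date", "claps",
    "reading_time", "publication", "url", "article",
]

# Invented placeholder rows, not taken from the real dataset.
SAMPLE_CSV = """title,subtitle,author,date,claps,reading_time,publication,url,article
Example Title A,Example subtitle,Author A,2020-01-01,120,5,Example Publication,https://example.com/a,Body text of article A.
Example Title B,,Author B,2020-01-02,45,3,,https://example.com/b,Body text of article B.
"""


def load_articles(handle):
    """Read article rows and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Assumption: both engagement fields are stored as whole numbers.
        row["claps"] = int(row["claps"])
        row["reading_time"] = int(row["reading_time"])
        rows.append(row)
    return rows


rows = load_articles(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["title"], rows[0]["claps"])
```

Checking the header before processing keeps a schema mismatch from surfacing later as a confusing `KeyError` deep inside the pipeline.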

#### ✅ Why This Dataset?

* Great for **topic modeling**, **text classification**, and **recommendation systems**
* Contains real-world engagement signals (`claps`) to enrich the model
* Useful for building **AI-driven content discovery platforms** like Inside-Medium

> 📌 **Dataset Link**: [https://www.kaggle.com/datasets/dorianlazar/medium-articles-dataset/data](https://www.kaggle.com/datasets/dorianlazar/medium-articles-dataset/data)

---

## 🚀 Features of *Inside-Medium*

* 🔍 **Content-Based Article Recommendation**
Recommends articles similar to a user's query based on textual content and latent topic features.

* 📈 **Similarity Scoring**
Calculates cosine similarity between articles to identify the most relevant ones.

* 📑 **Interactive Query Support**
Users can input any article title to retrieve a list of the most similar articles.

* 🧼 **Modular, Clean Codebase**
Structured using classes for vectorization, normalization, and similarity search with full docstrings and logging.

* 📦 **Reproducible Pipeline**
Complete workflow from raw data to recommendations, easy to extend or integrate into other systems.

* 🧾 **Logging and Error Handling**
Built-in logging for debugging and tracking progress/errors in each module.

* 📂 **Scalable Design**
Easy to adapt for larger datasets or additional features like user profiling or collaborative filtering.
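To make the title-based query and similarity scoring features concrete, here is a small dependency-free sketch. The function names, the titles, and the two-dimensional topic vectors are all hypothetical; in the real pipeline the vectors would come from the NMF step rather than being written by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(title, topic_vectors, top_k=3):
    """Return up to top_k (title, score) pairs most similar to the given title."""
    if title not in topic_vectors:
        raise KeyError(f"unknown article title: {title!r}")
    query = topic_vectors[title]
    scored = [
        (other, cosine(query, vector))
        for other, vector in topic_vectors.items()
        if other != title  # never recommend the query article itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-written topic vectors, for illustration only.
topic_vectors = {
    "Intro to Neural Networks": [0.9, 0.1],
    "Deep Learning Basics": [0.8, 0.2],
    "Sourdough at Home": [0.05, 0.95],
}

for other, score in recommend("Intro to Neural Networks", topic_vectors, top_k=2):
    print(f"{score:.3f}  {other}")
```

Raising `KeyError` for an unknown title gives a calling layer, such as a web handler, a clear signal to turn into a user-friendly error message.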

---

## 📰 Published Article

This project is accompanied by a published Medium article:

🔗 Read the article here:
[Inside Medium’s Recommendation Engine: How It Knows What You’ll Love](https://medium.com/@priyampal/inside-mediums-recommendation-engine-how-it-knows-what-you-ll-love-982cc295e0de)

---

## 🛠️ Installation

#### Step - 1: Repository Cloning

```bash
# Clone the repository
git clone https://github.com/priyam-hub/Inside-Medium.git

# Navigate into the directory
cd Inside-Medium
```

#### Step - 2: Environment Setup and Dependency Installation

```bash
# Run env_setup.sh
bash env_setup.sh

# Select 1 to create Python Environment
# Select 2 to create Conda Environment

# Python Version - 3.10

# Install the project as a local package (editable mode)
pip install -e .
```

#### Step - 3: Create a Kaggle API Token

- Log in to your Kaggle account.
- Download an API token from Kaggle Account Settings → Create New Token.
- Manually place your `kaggle.json` (downloaded from https://www.kaggle.com/settings) into this location:

```plaintext
C:\Users\<username>\.kaggle\kaggle.json
```

#### Step - 4: Create a `.env` File in the Root Directory with Your Credentials (or Rename `.sample_env` to `.env`)

```bash
KAGGLE_USERNAME = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
KAGGLE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```
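How the project reads this file is not shown here, so the following is only a sketch of one way to do it with the standard library. `parse_env` and `export_kaggle_credentials` are hypothetical helpers and the sample values are placeholders. The official Kaggle client reads its credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which is why `KAGGLE_API_KEY` is mapped to `KAGGLE_KEY` below.

```python
import os
from typing import Dict, MutableMapping


def parse_env(text: str) -> Dict[str, str]:
    """Parse `KEY = "value"` lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def export_kaggle_credentials(
    values: Dict[str, str],
    environ: MutableMapping[str, str] = os.environ,
) -> None:
    """Copy the credentials into the variable names the Kaggle client reads."""
    environ["KAGGLE_USERNAME"] = values["KAGGLE_USERNAME"]
    environ["KAGGLE_KEY"] = values["KAGGLE_API_KEY"]


# Placeholder values; a plain dict stands in for os.environ in this demo.
sample_env = 'KAGGLE_USERNAME = "demo_user"\nKAGGLE_API_KEY = "demo_key"\n'
target: Dict[str, str] = {}
export_kaggle_credentials(parse_env(sample_env), target)
print(sorted(target))
```

Passing the target mapping explicitly keeps the helper easy to test; calling it without the second argument writes to the real process environment.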

#### Step - 5: Run the Full Pipeline

```bash
# Run the Main Python Script
python main.py
```

#### Step - 6: Run the Flask Server (Upcoming)

```bash
# Run the Web App using Flask Server
python web/app.py
```

**Note** - After starting the server, open the local URL it prints in your browser to interact with the Inside-Medium recommendation engine.

---

## 🧰 Technology Stack

**Python** – Core programming language used to build the recommendation pipeline, data processing, and backend logic.
🔗 [Install Python](https://www.python.org/downloads/)

**Pandas & NumPy** – Used for efficient data manipulation, cleaning, and numerical operations.
🔗 [Pandas Documentation](https://pandas.pydata.org/docs/) | [NumPy Documentation](https://numpy.org/doc/)

**Scikit-learn** – Used for feature extraction (TF-IDF), dimensionality reduction (NMF), and similarity computation.
🔗 [Scikit-learn Documentation](https://scikit-learn.org/stable/)

**Flask** – Lightweight Python web framework used to serve the recommendation engine as an API or simple web app.
🔗 [Flask Installation](https://flask.palletsprojects.com/en/latest/installation/)

**Logging** – Python’s built-in `logging` module used for tracking system operations and debugging.
🔗 [Logging Documentation](https://docs.python.org/3/library/logging.html)

**Kaggle API** – Used to automatically fetch and manage the Medium Articles dataset.
🔗 [Kaggle API Setup Guide](https://github.com/Kaggle/kaggle-api)
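For the `logging` entry in the stack above, a shared logger factory could look roughly like this. It is a sketch only: `get_logger`, the logger name, and the format string are assumptions, not the contents of the project's `logger/logger.py`.

```python
import logging


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with a single console handler attached."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate output on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_logger("inside_medium.demo")
logger.info("Vectorization started")
```

Guarding on `logger.handlers` matters because `logging.getLogger` returns the same object for the same name, so a second call would otherwise attach a second handler and print every message twice.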

---

## 📁 Project Structure

```plaintext
Inside-Medium/
├── .env                                    # Store the Kaggle Username and API Key
├── .gitignore                              # Ignoring files for Git
├── env_setup.sh                            # Package installation configuration
├── folder_structure.py                     # Contains the Project Folder Structure
├── LICENCE                                 # MIT License
├── main.py                                 # Full Pipeline of the Project
├── README.md                               # Project documentation
├── requirements.txt                        # Python dependencies
├── setup.py                                # Create the Project as Python Package
├── config/                                 # Configuration files
│   ├── __init__.py
│   └── config.py                           # All Configuration Variables of Pipeline
├── data/                                   # Data Directory
│   ├── images/                             # Medium Article Images Directory
│   ├── medium_normalized_data.csv          # Normalized Data of the Medium Articles
│   ├── medium_processed_data.csv           # Processed Data of the Medium Articles
│   └── medium_raw_data.csv                 # Raw Data of the Medium Articles
├── logger/                                 # Logger Setup Directory
│   └── logger.py                           # Format of the Logger Setup of the Project
├── notebooks/                              # Jupyter notebooks for experimentation
│   └── Recommendation_System.ipynb         # Experimented Recommendation Engine in Jupyter Notebook
├── results/                                # Directory to Store the results of the Project
│   └── eda_results/                        # Directory to Store the EDA Results
├── src/                                    # Source code
│   ├── data_preprocessor/                  # Data Preprocessor Directory
│   │   ├── __init__.py
│   │   └── data_preprocessor.py            # Python file to process the raw data
│   ├── exploratory_data_analysis/          # EDA Directory
│   │   ├── __init__.py
│   │   └── exploratory_data_analyzer.py    # Python file to perform EDA
│   ├── normalizer/                         # Text Normalizing Directory
│   │   ├── __init__.py
│   │   └── nmf_normalizer.py               # Python file to normalize the preprocessed data
│   ├── recommendation_engine/              # Recommendation Engine Directory
│   │   ├── __init__.py
│   │   └── similarity_finder.py            # Python file to perform similarity search
│   ├── vectorizer/                         # Vectorizer Directory
│   │   ├── __init__.py
│   │   └── tfidf_vectorizer.py             # Python file to perform TF-IDF vectorization
│   └── utils/                              # Utility Functions Directory
│       ├── __init__.py
│       ├── data_loader.py                  # Load and Save Data from Local
│       ├── download_dataset.py             # Download the Data from Kaggle
│       └── save_plot.py                    # Save the Plot in Specified Path
└── web/
    ├── __init__.py
    ├── static/
    │   ├── styles.css                      # Styling of the Web Page
    │   └── script.js                       # JavaScript File
    ├── templates/
    │   └── index.html                      # Default Web Page
    └── app.py                              # To run the Flask server

```

---

## 🔮 Future Work Roadmap

The *Inside-Medium* project can be extended significantly to offer a more personalized and intelligent content recommendation system. Here's a proposed roadmap structured in **three development phases**, each with an estimated time frame.

---

### 🚀 **Phase 1: UI & API Integration (1–2 Weeks)**

**Objective:** Transform the backend logic into a user-accessible application.

* Build a clean and responsive frontend using **HTML/CSS/JS** for user interaction.
* Deploy the article recommender as a **Flask API**, allowing input of article titles and displaying similar content.
* Enable users to upload custom datasets (CSV) for analysis and recommendations.
* Add a search bar, loading indicators, and user-friendly error messages.

### 🧠 **Phase 2: Personalization & Topic Modeling (2–3 Weeks)**

**Objective:** Enhance the intelligence of the recommender.

* Introduce **user profiles** to track reading history and provide personalized recommendations.
* Apply **LDA or BERTopic** for better topic clustering and diversity in suggestions.
* Integrate **claps, reading time, and tags** more deeply into the similarity scoring system.
* Include a feedback mechanism for rating recommended articles.
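One possible way to fold engagement into the similarity score, as the Phase 2 list proposes, is a weighted blend of cosine similarity and log-scaled claps. This is a hypothetical sketch: `blended_score`, the default `weight` of 0.8, and the log scaling are illustrative assumptions, not a method the project has adopted.

```python
import math


def blended_score(similarity: float, claps: int, max_claps: int,
                  weight: float = 0.8) -> float:
    """Blend content similarity with a popularity term in the range [0, 1].

    `weight` is the share given to similarity; the rest goes to claps.
    Claps are log-scaled so a few viral articles do not dominate.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if max_claps > 0:
        popularity = math.log1p(claps) / math.log1p(max_claps)
    else:
        popularity = 0.0
    return weight * similarity + (1.0 - weight) * popularity


# Two candidates with equal similarity: the more-clapped one ranks higher.
print(blended_score(0.70, claps=900, max_claps=1000))
print(blended_score(0.70, claps=10, max_claps=1000))
```

Keeping `weight` close to 1 preserves the content-based character of the recommender while letting claps act mainly as a tie-breaker.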

### 🧠 **Phase 3: Embedding Models & LLM Integration (3–4 Weeks)**

**Objective:** Upgrade the recommendation engine with deep learning and language models.

* Replace TF-IDF + NMF with **sentence embeddings** using `SentenceTransformers` or Hugging Face models.
* Use **vector databases (e.g., Qdrant, FAISS)** for faster and smarter similarity search.
* Integrate with **LLMs (e.g., OpenAI, LLaMA via LangChain)** to enable query-based article retrieval using natural language.
* Package the app into a **Docker container** and deploy to the cloud for scalability.

---

## 📜 License

This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for more details.

**Made by Priyam Pal - AI and Data Science Engineer**

[↑ Back to Top]