https://github.com/quantum-software-development/1-datamining_main_repository

Data mining, with a focus on unsupervised learning methods (clustering, PCA, dictionary learning, anomaly detection) applied to real-world projects for third-sector organizations. Results are shared publicly in open repositories and on community platforms.
Host: GitHub
URL: https://github.com/quantum-software-development/1-datamining_main_repository
Owner: Quantum-Software-Development
License: mit
Created: 2025-08-12T19:11:16.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-09-17T15:48:58.000Z (10 months ago)
Last Synced: 2025-09-17T17:54:54.182Z (10 months ago)
Language: Jupyter Notebook
Homepage: https://github.com/Quantum-Software-Development/specialized-consulting-data-mining
Size: 13.3 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

          


**\[[🇧🇷 Português](README.pt_BR.md)\] \[**[🇺🇸 English](README.md)**\]**





# 1- [Data Mining]() / [Main Repository]()





[**Institution:**]() Pontifical Catholic University of São Paulo (PUC-SP)  

[**School:**]() Faculty of Interdisciplinary Studies  

[**Program:**]() Humanistic AI and Data Science

[**Semester:**]() 2nd Semester 2025  

Professor:  [***Professor Doctor in Mathematics Daniel Rodrigues da Silva***](https://www.linkedin.com/in/daniel-rodrigues-048654a5/)





#### 
 [![Sponsor Quantum Software Development](https://img.shields.io/badge/Sponsor-Quantum%20Software%20Development-brightgreen?logo=GitHub)](https://github.com/sponsors/Quantum-Software-Development)





#






> [!IMPORTANT]
>
> ⚠️ Heads Up
>
> * Projects and deliverables may be made [publicly available]() whenever possible.
> * The course emphasizes [**practical, hands-on experience**]() with real datasets to simulate professional consulting scenarios in the fields of **Data Analysis and Data Mining** for partner organizations and institutions affiliated with the university.
> * All activities comply with the [**academic and ethical guidelines of PUC-SP**]().
> * Any content not authorized for public disclosure will remain [**confidential**]() and securely stored in [private repositories]().





#







##### 🎶 Prelude Suite no.1 (J. S. Bach) - [Sound Design Remix]()

https://github.com/user-attachments/assets/4ccd316b-74a1-4bae-9bc7-1c705be80498

####  📺 For better resolution, watch the video on [YouTube.](https://youtu.be/_ytC6S4oDbM)







> [!TIP]
>
> * #### **If you’d like to explore the full Statistics materials from the 1st year (not only the review), you can visit the complete repository** [**here**](https://github.com/FabianaCampanari/PracticalStats-PUCSP-2024).







## Table of Contents




1. [Course Overview](#course-overview)

   - I - [class 1 - Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_1-Introduction)

   - II - [class_2 - Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)

   - III - [class_3 - Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_3%20-%20Stats%20Review)

   - IV - [Data Cleaning by Zahra Amini](https://github.com/Quantum-Software-Development/1-DataMining_Main_Repository/blob/cb4075948c0ae9f90ead385d620147daf0641f7c/Data%20Cleaning%20by%20Zahra%20Amini%20.pdf)

2. [Objectives](#objectives)

3. [Syllabus](#syllabus)

4. [Weekly Schedule](#weekly-schedule)

5. [Tools and Technologies](#tools-and-technologies)

6. [Installation and Setup](#installation-and-setup)

7. [Assessment](#assessment)

8. [Bibliography](#bibliography)

   - [Basic Bibliography](#basic-bibliography)

   - [Complementary Bibliography](#complementary-bibliography)

9. [Notes](#notes)





##  [Course Overview]()




This course introduces [**data mining techniques**]() with a focus on [**unsupervised learning methods**](), including:

- Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)

- Principal Component Analysis (PCA)

- Dictionary Learning

- Novelty and outlier detection

Students will work on [**practical projects**]() inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in **open repositories** and made available to the broader community, schools, libraries, and non-profits.





## [Objectives]()

Enable students to **plan, conduct, and complete a research project** applying key **data mining concepts, algorithms, and methodologies**.





## [Syllabus]()




- Fundamentals of Data Mining

- Data cleaning and preparation

- Predictive analysis

- Clustering methods (K-Means, Affinity Propagation, Mean-Shift)

- Principal Component Analysis (PCA)

- Dictionary Learning

- Novelty and outlier detection

- Application of concepts to real-world consulting scenarios






##  [Weekly Schedule]()




| [Week]() | [Repos]() | [Methodology]() | [Tools]() |
|------|-------|-------------|-------|
| 1 | [Course introduction](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/d737ff164c6b4d6e580d5ba6e95c54ac604f7ea4/class_1-Introduction) | Active methodology | – |
| 2 | [Statistical Review - Stats Measures - Mean - Median - Mode - Variance](https://github.com/Quantum-Software-Development/2-DataMining_Statistical_Measures) | Active methodology | Python |
| 3 | [Statistical Review - Variation Measures and Standard Deviation](https://github.com/Quantum-Software-Development/3-DataMining_VariationMeasures_Standard-Deviation) | Active methodology | Python |
| 4 | [Data Mining - Concepts - Exploratory Analysis](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) | Active methodology | Python - R |
| 5 | [Data Cleaning - Preparation - Anomalies (Outliers)](https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier) | Active methodology | Python |
| 6 | [Data Mining - Pre Processing](https://github.com/Quantum-Software-Development/6-DataMining_Pre-Processing) | Active methodology | Python |
| 7 | [Regression Techniques with Data Integration](https://github.com/Quantum-Software-Development/7-DataMining-Regression-Techniques-Data-Integration) | Active methodology | Python |
| 8 | [Predictive K-Means Clustering Data and Figures Analysis](https://github.com/Quantum-Software-Development/8-DataMining-KMeans-Non-Hierarchical-Clustering) | Active methodology | Python |
| 9 | [* Project 1 – K-Means Clustering Repository Presentation](https://github.com/Quantum-Software-Development/9-DataMining_Project_1_K-Means_Clustering_Presentation) | Active methodology | Python |
| 10 | [Clustering Mean Shift](https://github.com/Quantum-Software-Development/10-DataMining_MeanShift) | Active methodology | Python |
| 11 | [Affinity Propagation](https://github.com/Quantum-Software-Development/11-DataMining_Affinity_Propagation_Algorithm) | Active methodology | Python |
| 12 | [* Project 2 – Clustering Algorithms Exploration and Comparison - K-Means - Mean Shift - Affinity Propagation](https://github.com/Quantum-Software-Development/12-DataMining_Project_2_-Clustering_Comparison_KMeans_MeanShift-_AffinityPropagation) | Active methodology | Python |
| 13 | [Principal Component Analysis (PCA) and Isolation Forest Algorithms](https://github.com/Quantum-Software-Development/13-DataMining_PCA_IsolationForest-Guide) | Active methodology | Python |
| 14 | [DBSCAN and Spectral Clustering](https://github.com/Quantum-Software-Development/14-DataMining_DBSCAN_and_Spectral-Clustering) | Active methodology | Python |
| 15 | [* Project 3 – Clustering Algorithms Exploration and Comparison - K-Means - Mean Shift - DBSCAN](https://github.com/Quantum-Software-Development/15-DataMining_Project_3_-Clustering_Comparison_KMeans_MeanShift_DBSCAN) | Active methodology | Python |
| 16 | [Dictionary-Based Feature Grouping for LLM/AI Pipelines](https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups) | Active methodology | Python |
| 17 | **P2 Exam** | Written (Individual) | – |
| 18 | **P3 Exam & Grade Closure** | Written (Individual) | – |
| 19 | Final grade submission | – | – |





##  [Tools and Technologies]()




- **Programming Language:** Python  

- **Libraries:** NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn  

- **Environment:** Jupyter Notebook or other Python IDEs





##  Installation and Setup




Follow these steps to set up your local environment for the course projects:




[1](). **Clone the repository**

```

git clone https://github.com//.git

cd 

```




[2](). **Create a virtual environment** (recommended)

```
python -m venv venv
# Activate it with ONE of the following, depending on your system:
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows
```




[3](). **Install dependencies**

Make sure `pip` is updated:

```

pip install --upgrade pip

```

Then install the required packages:

```

pip install -r requirements.txt

```

*(If `requirements.txt` is not provided, install manually:)*  

```

pip install numpy pandas scikit-learn matplotlib seaborn jupyter

```
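Once the packages are installed, a quick import check can confirm that the environment is usable before opening the notebooks. The sketch below is only an illustration and not part of the course materials: the helper name `check_packages` is hypothetical, and the list assumes the standard import names of the libraries listed under Tools and Technologies (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names of the libraries listed under "Tools and Technologies".
# Note that scikit-learn is imported as "sklearn".
COURSE_PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(COURSE_PACKAGES).items():
        print(f"{name:<12} {'OK' if found else 'MISSING'}")
```

Any package reported as `MISSING` can be installed with `pip install` inside the activated virtual environment.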




[4](). **Run Jupyter Notebook**

   

```

jupyter notebook

```




[5](). **Open course notebooks** and start practicing.





##  I - [Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_1-Introduction)




| Exam | Date | Format | Weight |
|------|------|--------|--------|
| **P1** | 01/10/2025 | Written – Individual | Arithmetic mean |
| **P2** | 19/11/2025 | Written – Individual | Arithmetic mean |
| **P3** | Substitution exam | Written – Individual | Replaces lowest score |




[**Final Grade:**]() Arithmetic mean of assessments.
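As a rough illustration of the grading rule, the sketch below computes the arithmetic mean of P1 and P2 and lets an optional P3 take the place of the lowest score. The function name `final_grade` and the sample scores are hypothetical. The table does not say whether P3 still replaces the lowest score when P3 is lower, so the code follows the literal reading and always substitutes it.

```python
from statistics import mean
from typing import Optional


def final_grade(p1: float, p2: float, p3: Optional[float] = None) -> float:
    """Arithmetic mean of P1 and P2; if P3 is taken, it replaces the lowest score."""
    scores = [p1, p2]
    if p3 is not None:
        # Literal reading of the table: P3 takes the place of the lowest score.
        scores.remove(min(scores))
        scores.append(p3)
    return mean(scores)


print(final_grade(6.0, 8.0))       # mean of P1 and P2
print(final_grade(4.0, 8.0, 7.0))  # P3 replaces the 4.0
```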





## II - [class_2- Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)




☞ [Access Booklet](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/81e2951f73c87cf7c4396a36d48be92384b7b720/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Book%20-%20Introd%20to%20Data%20Mining%20With%20Python.pdf)




## [Example 1]()




The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.




[Data]():

```

20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,

92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,

55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38

```





### [Step 1](): Determine Range and Number of Classes

- Minimum value: 2

- Maximum value: 120

- Number of classes ($k$): 8 (given)





### [Step 2](): Calculate Class Width





$$
\huge
w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = \left\lceil 14.75 \right\rceil = 15
$$
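The same calculation can be sketched in Python with `math.ceil`. The helper name `class_width` is hypothetical. This minimal version does not handle the case where the quotient is already a whole number, in which a wider class may be needed so that the last interval still covers the maximum.

```python
import math


def class_width(minimum: float, maximum: float, k: int) -> int:
    """Range divided by the number of classes, rounded up to a whole number."""
    return math.ceil((maximum - minimum) / k)


print(class_width(2, 120, 8))  # 15
```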





### [Step 3](): Construct Class Intervals (from minimum value)

| Class Interval | Explanation |
| :-- | :-- |
| 2 - 16 | Starts from minimum 2 |
| 17 - 31 | 16 + 1 to 31 |
| 32 - 46 | Next range |
| 47 - 61 | Next range |
| 62 - 76 | Next range |
| 77 - 91 | Next range |
| 92 - 106 | Next range |
| 107 - 121 | Covers maximum 120 |
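The intervals in this table can be generated programmatically. The sketch below assumes integer-valued data and inclusive limits, so each class starts one unit after the previous upper limit. The function name `class_intervals` is hypothetical.

```python
from typing import List, Tuple


def class_intervals(start: int, width: int, k: int) -> List[Tuple[int, int]]:
    """Return k inclusive (lower, upper) limits for integer-valued data."""
    return [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]


for lower, upper in class_intervals(start=2, width=15, k=8):
    print(f"{lower} - {upper}")
```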




### [Step 4](): Frequency Distribution Table




| Class Interval | Frequency |
| :--: | :--: |
| 2 - 16 | 5 |
| 17 - 31 | 14 |
| 32 - 46 | 8 |
| 47 - 61 | 11 |
| 62 - 76 | 4 |
| 77 - 91 | 7 |
| 92 - 106 | 6 |
| 107 - 121 | 5 |

### [Step 5](): Calculate Midpoints for Each Class

$$
\Huge
x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}
$$

| Class Interval | Midpoint ($x_i$) |
| :-- | :-- |
| 2 - 16 | 9 |
| 17 - 31 | 24 |
| 32 - 46 | 39 |
| 47 - 61 | 54 |
| 62 - 76 | 69 |
| 77 - 91 | 84 |
| 92 - 106 | 99 |
| 107 - 121 | 114 |

### [Step 6](): Calculate Mean Using Frequency and Midpoints

### [Mean](): ($\bar{x}$) is calculated by:

$$
\Huge
\bar{x} = \frac{\sum f_i x_i}{\sum f_i}
$$

### [Where](): $f_i$ = frequency, $x_i$ = [Midpoint]().

### [Calculate each product]():

| Class Interval | $f_i$ | $x_i$ | $f_i \times x_i$ |
| :-- | :-- | :-- | :-- |
| 2 - 16 | 5 | 9 | 45 |
| 17 - 31 | 14 | 24 | 336 |
| 32 - 46 | 8 | 39 | 312 |
| 47 - 61 | 11 | 54 | 594 |
| 62 - 76 | 4 | 69 | 276 |
| 77 - 91 | 7 | 84 | 588 |
| 92 - 106 | 6 | 99 | 594 |
| 107 - 121 | 5 | 114 | 570 |

### [Sum frequencies](): $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5$ = [60]() (the sample size)

### [Sum of products](): $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570$ = [3315]()

### [Calculate mean]():

$$
\huge
\bar{x} = \frac{3315}{60} = 55.25
$$
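The following sketch recomputes Steps 3 to 6 directly from the raw sample, which is a convenient way to cross-check hand-built tables. It uses only the Python standard library. The variable names are illustrative, and the class width and number of classes are taken from Steps 1 and 2. The last line prints the mean of the raw data for comparison, since the grouped mean is only an approximation of it.

```python
from statistics import mean

# Sample from Example 1: minutes watched by 60 cable TV users.
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

K, WIDTH = 8, 15  # number of classes and class width from Steps 1-2
start = min(data)

# Step 3: inclusive class limits, e.g. (2, 16), (17, 31), ...
intervals = [(start + i * WIDTH, start + (i + 1) * WIDTH - 1) for i in range(K)]

# Step 4: count how many observations fall inside each class.
frequencies = [sum(lower <= x <= upper for x in data) for lower, upper in intervals]

# Step 5: midpoint of each class.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

# Step 6: grouped mean = sum(f_i * x_i) / sum(f_i).
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)

for (lower, upper), f, m in zip(intervals, frequencies, midpoints):
    print(f"{lower:>3} - {upper:<3}  f = {f:>2}  midpoint = {m:g}")
print("n =", sum(frequencies))
print("grouped mean =", grouped_mean)
print("mean of raw data =", round(mean(data), 3))
```

Checking that the frequencies add up to the sample size is a quick guard against counting errors in the table.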





### [Step 7](): Histogram, Bar Plot and Time Series Frequency Distribution Over Time

- Construct a histogram, bar plot, and time series with class intervals on the x-axis and frequencies on the y-axis.

- Each bar height corresponds to the frequency of the class.




☞ [Access Code](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Code/DataMining_1.ipynb)

☞ [Access Dataset](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/01b6e27e588c3b830561385f14bd0d246f55049d/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Banks%20Dataset/banco.csv)

☞ [Access Plots](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Plots)





### [Frequency Analysis and Time Series Visualization]()

This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.




###  [1](). Install and Import Libraries

```python

# Import required libraries

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

```




###  [2](). Load Dataset

```python
# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('your_dataset.csv', sep=';')

# Select only the "day" column
df1 = df['day']
```




###  [3](). Calculate Frequencies

```python

# Calculate absolute frequency (ascending order)

freq_abs = pd.Series(df1).value_counts(ascending=True)

# Calculate relative frequency (normalized, 3 decimal places)

freq_rel = pd.Series(df1).value_counts(normalize=True).round(3)

# Create a DataFrame with both measures

df_freq = pd.DataFrame({

    'Absolute Frequency': freq_abs,

    'Relative Frequency': freq_rel

})

# Display the frequency table

display(df_freq)

```




###  [4]().  Histogram (Dark Theme)

```python
# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)

# Customize labels and ticks (white text so it stays readable on black)
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()
```











###  [5](). Bar Plot (Dark Theme)

```python
# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)

# Customize labels and ticks (white text so it stays readable on black)
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')

# Show plot
plt.show()
```











###  [6](). Time Series Preparation

```python

# Inspect available columns

print(df.columns)

# Create a new DataFrame for time series analysis

df_time_series = df[['day', 'month']].copy()

# Add dummy year (if year column is missing)

df_time_series['year'] = 2022

# Convert to strings for concatenation

df_time_series['day'] = df_time_series['day'].astype(str)

df_time_series['year'] = df_time_series['year'].astype(str)

# Create "date" column in dd-MMM-yyyy format

df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']

df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')

# Set "date" as index

df_time_series = df_time_series.set_index('date')

# Count occurrences per day

daily_counts = df_time_series.groupby(df_time_series.index).size()

# Display first rows

display(daily_counts.head())

```
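The date construction in this step can also be illustrated without pandas. The sketch below uses only the standard library and a few made-up `(day, month)` pairs. It assumes, as the `%d-%b-%Y` format implies, that months are stored as abbreviated English names such as `may`, and that Python runs under an English or C locale. The helper name `to_date` is hypothetical.

```python
from collections import Counter
from datetime import datetime

DUMMY_YEAR = 2022  # same placeholder year as in the step above

# Hypothetical (day, month) pairs standing in for the dataset's columns.
rows = [(5, "may"), (5, "may"), (6, "may"), (12, "jun")]


def to_date(day, month, year=DUMMY_YEAR):
    """Build a date from a day number and an abbreviated month name."""
    return datetime.strptime(f"{day}-{month}-{year}", "%d-%b-%Y").date()


# Count occurrences per calendar day, mirroring groupby(...).size().
daily_counts = Counter(to_date(day, month) for day, month in rows)

for day_date, count in sorted(daily_counts.items()):
    print(day_date.isoformat(), count)
```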




###  [7](). Time Series Plot (Dark Theme)

```python

# Set plot style

plt.style.use('seaborn-v0_8-darkgrid')

fig, ax = plt.subplots(figsize=(16, 6))

fig.patch.set_facecolor('black')

ax.set_facecolor('black')

# Plot time series

plt.plot(daily_counts, color='turquoise')

# Customize labels and ticks

plt.title("Frequency Distribution Over Time", color='white')

plt.xlabel("Date", color='white')

plt.ylabel("Frequency", color='white')

plt.tick_params(axis='x', colors='white')

plt.tick_params(axis='y', colors='white')

# Show plot

plt.show()

```











### [Summary]()

Dummy year: 2022 was used when the year column was missing.

Visualizations: histogram, bar plot, and time series chart.





## III - [class_3- Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_3%20-%20Stats%20Review)




> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/2_3-DataMining_Statistical_Review)  Class_3





## IV - [class_4- Data Mining - Concepts - Exploratory Analysis]()




> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis)  Class_4





## V - [class_5- Data Cleaning - Preparation - Anomalies(Outliers)]()




> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier)  Class_5





## VI - [class_6- Data Mining - Pre Processing]()




> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/6-DataMining_Pre-Processing)  Class_6





## VII - [class_7- Normalization](https://github.com/Quantum-Software-Development/1-DataMining_Main_Repository/tree/b555158e64f626fda67229fcc80bff665090c876/class_7-Normalization_Code)




> [!TIP]
>
> [Access]()  Class_7
>
> ⚠️ Coming Soon





## VIII - [class_8 - KMeans_NonHierarchical_Clustering](https://github.com/Quantum-Software-Development/1-DataMining_Main_Repository/tree/cd4d463e1745f2778db4d69e7faade4bfbc00c05/class_8-KMeans_NonHierarchical_Clustering)




> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/1-DataMining_Main_Repository/tree/cd4d463e1745f2778db4d69e7faade4bfbc00c05/class_8-KMeans_NonHierarchical_Clustering)   Class_8 - KMeans_NonHierarchical_Clustering
>
> ⚠️ Coming Soon





## IX - [class_8 - KMeans_NonHierarchical_Clustering](https://github.com/Quantum-Software-Development/1-DataMining_Main_Repository/tree/cd4d463e1745f2778db4d69e7faade4bfbc00c05/class_8-KMeans_NonHierarchical_Clustering)

> [!TIP]
>
> [Access]()  Class_8
>
> ⚠️ Coming Soon

































## [Bibliography]()

[1](). **Castro, L. N. & Ferrari, D. G.** (2016). *Introduction to Data Mining: Basic Concepts, Algorithms, and Applications*. Saraiva.

[2](). **Ferreira, A. C. P. L. et al.** (2024). *Artificial Intelligence – A Machine Learning Approach*. 2nd Ed. LTC.

[3](). **Larson & Farber** (2015). *Applied Statistics*. Pearson.




### [Complementary Bibliography]()

- THOMAS, C. *Data Mining*. IntechOpen, 2018.  

- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. *Automated Machine Learning: Methods, Systems, Challenges*. Springer Nature, 2019.  

- NETTO, A.; MACIEL, F. *Python para Data Science e Machine Learning Descomplicado*. Alta Books, 2021.  

- RUSSELL, S. J.; NORVIG, P. *Artificial Intelligence: A Modern Approach*. GEN LTC, 2022.  

- SUD, K.; ERDOGMUS, P.; KADRY, S. *Introduction to Data Science and Machine Learning*. IntechOpen, 2020.





      





## 💌 [Let the data flow... Ping Me !](mailto:fabicampanari@proton.me)





#### 
  🛸๋ My Contacts [Hub](https://linktr.ee/fabianacampanari)




### 
 






  ────────────── 🔭⋆ ──────────────

 ➣➢➤ Back to Top 

#

###### 
 Copyright 2025 Quantum Software Development. Code released under the [MIT License.](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE)