https://github.com/quantum-software-development/1-main_datamining_repository

data mining, focusing on unsupervised learning methods (clustering, PCA, dictionary learning, anomaly detection) applied to real-world projects for third-sector organizations. Results are shared publicly in open repositories and community platforms.


**\[[🇧🇷 Português](README.pt_BR.md)\] \[**[🇺🇸 English](README.md)**\]**



#

1- [Data Mining]() / [Main Repository]()



[**Institution:**]() Pontifical Catholic University of São Paulo (PUC-SP)
[**School:**]() Faculty of Interdisciplinary Studies
[**Program:**]() Humanistic AI and Data Science
[**Semester:**]() 2nd Semester 2025
Professor: [***Professor Doctor in Mathematics Daniel Rodrigues da Silva***](https://www.linkedin.com/in/daniel-rodrigues-048654a5/)



####

[![Sponsor Quantum Software Development](https://img.shields.io/badge/Sponsor-Quantum%20Software%20Development-brightgreen?logo=GitHub)](https://github.com/sponsors/Quantum-Software-Development)



#




> [!IMPORTANT]
>
> ⚠️ Heads Up
>
> * Projects and deliverables may be made [publicly available]() whenever possible.
> * The course emphasizes [**practical, hands-on experience**]() with real datasets to simulate professional consulting scenarios in the fields of **Data Analysis and Data Mining** for partner organizations and institutions affiliated with the university.
> * All activities comply with the [**academic and ethical guidelines of PUC-SP**]().
> * Any content not authorized for public disclosure will remain [**confidential**]() and securely stored in [private repositories]().
>



#





##### 🎶 Prelude Suite no.1 (J. S. Bach) - [Sound Design Remix]()

https://github.com/user-attachments/assets/4ccd316b-74a1-4bae-9bc7-1c705be80498

#### 📺 For better resolution, watch the video on [YouTube.](https://youtu.be/_ytC6S4oDbM)



> [!TIP]
>
> This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
>
> Access Data Mining [Main Repository](https://github.com/Quantum-Software-Development/1-Main_DataMining_Repository)
>
> If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository [here](https://github.com/FabianaCampanari/PracticalStats-PUCSP-2024).
>
>





## Table of Contents


1. [Course Overview](#course-overview)
- I - [class_1 - Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_1-Introduction)
- II - [class_2 - Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)
- III - [class_3 - Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_3%20-%20Stats%20Review)
2. [Objectives](#objectives)
3. [Syllabus](#syllabus)
4. [Weekly Schedule](#weekly-schedule)
5. [Tools and Technologies](#tools-and-technologies)
6. [Installation and Setup](#installation-and-setup)
7. [Assessment](#assessment)
8. [Bibliography](#bibliography)
- [Basic Bibliography](#basic-bibliography)
- [Complementary Bibliography](#complementary-bibliography)
9. [Notes](#notes)



## [Course Overview]()


This course introduces [**data mining techniques**]() with a focus on [**unsupervised learning methods**](), including:

- Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection

Students will work on [**practical projects**]() inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in **open repositories** and made available to the broader community, schools, libraries, and non-profits.



## [Objectives]()

Enable students to **plan, conduct, and complete a research project** applying key **data mining concepts, algorithms, and methodologies**.



## [Syllabus]()


- Fundamentals of Data Mining
- Data cleaning and preparation
- Predictive analysis
- Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
- Application of concepts to real-world consulting scenarios



## [Weekly Schedule]()


| [Week]() | [Repos]() | [Methodology]() | [Tools]() |
|------|-------|-------------|-------|
| 1 | [Course introduction](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/d737ff164c6b4d6e580d5ba6e95c54ac604f7ea4/class_1-Introduction) | Active methodology | – |
| 2–3 | [Statistical Review](https://github.com/Quantum-Software-Development/2-3-DataMining_Statistical_Review) | Active methodology | Python |
| 4 | [Fundamentals of Data Mining](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) | Active methodology | Python |
| 5–6 | [Data cleaning and preparation](https://github.com/Quantum-Software-Development/5-DataMining) | Active methodology | Python |
| 7 | Predictive analysis | Active methodology | Python |
| 8, 10 | Clustering techniques | Active methodology | Python |
| 9 | **P1 Exam** | Written (Individual) | – |
| 11 | K-Means algorithm | Active methodology | Python |
| 12 | Affinity Propagation | Active methodology | Python |
| 13 | Mean-Shift algorithm | Active methodology | Python |
| 14 | Principal Component Analysis (PCA) | Active methodology | Python |
| 15 | Dictionary Learning | Active methodology | Python |
| 16 | **P2 Exam** | Written (Individual) | – |
| 17 | **P3 Exam & Grade Closure** | Written (Individual) | – |
| 18 | Final grade submission | – | – |



## [Tools and Technologies]()


- **Programming Language:** Python
- **Libraries:** NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
- **Environment:** Jupyter Notebook or other Python IDEs



## Installation and Setup


Follow these steps to set up your local environment for the course projects:


[1](). **Clone the repository**

```
git clone https://github.com//.git
cd
```


[2](). **Create a virtual environment** (recommended)

```
python -m venv venv
source venv/bin/activate    # Mac/Linux
venv\Scripts\activate       # Windows
```


[3](). **Install dependencies**
Make sure `pip` is updated:
```
pip install --upgrade pip
```
Then install the required packages:
```
pip install -r requirements.txt
```
*(If `requirements.txt` is not provided, install manually:)*
```
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
```


[4](). **Run Jupyter Notebook**

```
jupyter notebook
```


[5](). **Open course notebooks** and start practicing.
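
Before opening the notebooks, it can help to confirm that the libraries listed above are actually installed in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the manual `pip install` command.

```python
from importlib import metadata

# Distribution names as used in the manual pip install command
PACKAGES = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn", "jupyter"]


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added with `pip install` inside the activated virtual environment.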



## I - [Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_1-Introduction)


| Exam | Date | Format | Weight |
|------|------|--------|--------|
| **P1** | 01/10/2025 | Written – Individual | Arithmetic mean |
| **P2** | 19/11/2025 | Written – Individual | Arithmetic mean |
| **P3** | Substitution exam | Written – Individual | Replaces lowest score |


[**Final Grade:**]() Arithmetic mean of assessments.
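
The grading rule can be sketched in a few lines of Python. This is only an illustration: the function name `final_grade` is hypothetical, and it assumes that the substitution exam P3 replaces the lower of P1 and P2 only when doing so raises the mean, a detail the table does not spell out.

```python
from typing import Optional


def final_grade(p1: float, p2: float, p3: Optional[float] = None) -> float:
    """Arithmetic mean of P1 and P2, with an optional substitution exam P3."""
    scores = [p1, p2]
    if p3 is not None:
        lowest = min(scores)
        # Assumption: P3 replaces the lowest score only when it is higher
        if p3 > lowest:
            scores[scores.index(lowest)] = p3
    return sum(scores) / len(scores)


print(final_grade(6.0, 8.0))       # no substitution exam taken -> 7.0
print(final_grade(4.0, 8.0, 7.0))  # P3 replaces the 4.0 -> 7.5
```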



## II - [class_2- Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)


☞ [Access Booklet](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/81e2951f73c87cf7c4396a36d48be92384b7b720/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Book%20-%20Introd%20to%20Data%20Mining%20With%20Python.pdf)


## [Example 1]()


The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.


[Data]():

```
20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38
```



### [Step 1](): Determine Range and Number of Classes

- Minimum value: 2
- Maximum value: 120
- Number of classes ($k$): 8 (given)



### [Step 2](): Calculate Class Width



$$
\huge
w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = 15
$$
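
Steps 1 and 2 can be reproduced directly from the sample. The following is a minimal sketch; the variable names are illustrative and not taken from the course notebook.

```python
import math

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

k = 8                               # number of classes (given)
data_range = max(data) - min(data)  # 120 - 2 = 118
width = math.ceil(data_range / k)   # ceil(14.75) = 15

print(min(data), max(data), data_range, width)
```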



### [Step 3](): Construct Class Intervals (from minimum value)

| Class Interval | Explanation |
| :-- | :-- |
| 2 - 16 | Starts from minimum 2 |
| 17 - 31 | 16 + 1 to 31 |
| 32 - 46 | Next range |
| 47 - 61 | Next range |
| 62 - 76 | Next range |
| 77 - 91 | Next range |
| 92 - 106 | Next range |
| 107 - 121 | Covers maximum 120 |
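
The intervals in the table follow a simple rule: each class covers `width` consecutive integers, starting at the minimum. A short sketch of that rule, with illustrative variable names, is:

```python
minimum, k, width = 2, 8, 15  # values from Steps 1 and 2

# Inclusive integer limits: class i runs from minimum + i*width
# to one unit below the start of the next class
intervals = [
    (minimum + i * width, minimum + (i + 1) * width - 1)
    for i in range(k)
]

for lower, upper in intervals:
    print(f"{lower} - {upper}")
```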


### [Step 4](): Frequency Distribution Table


| Class Interval | Frequency |
| :--: | :--: |
| 2 - 16 | 5 |
| 17 - 31 | 14 |
| 32 - 46 | 8 |
| 47 - 61 | 11 |
| 62 - 76 | 4 |
| 77 - 91 | 7 |
| 92 - 106 | 6 |
| 107 - 121 | 5 |



### [Step 5](): Calculate Midpoints for Each Class


$$
\Huge
x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}
$$



| Class Interval | Midpoint ($x_i$) |
| :-- | :-- |
| 2 - 16 | 9 |
| 17 - 31 | 24 |
| 32 - 46 | 39 |
| 47 - 61 | 54 |
| 62 - 76 | 69 |
| 77 - 91 | 84 |
| 92 - 106 | 99 |
| 107 - 121 | 114 |



### [Step 6](): Calculate Mean Using Frequency and Midpoints


### [Mean](): ($\bar{x}$) is calculated by:



$$
\Huge
\bar{x} = \frac{\sum f_i x_i}{\sum f_i}
$$



### [Where](): $f_i$ = frequency, $x_i$ = [Midpoint]().


### [Calculate each product]():


| Class Interval | $f_i$ | $x_i$ | $f_i \times x_i$ |
| :-- | :-- | :-- | :-- |
| 2 - 16 | 5 | 9 | 45 |
| 17 - 31 | 14 | 24 | 336 |
| 32 - 46 | 8 | 39 | 312 |
| 47 - 61 | 11 | 54 | 594 |
| 62 - 76 | 4 | 69 | 276 |
| 77 - 91 | 7 | 84 | 588 |
| 92 - 106 | 6 | 99 | 594 |
| 107 - 121 | 5 | 114 | 570 |


### [Sum frequencies](): $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5$ = [60]()

### [Sum of products](): $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570$ = [3315]()


### [Calculate mean]():



$$
\huge
\bar{x} = \frac{3315}{60} = 55.25
$$
$$
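
A quick way to check Steps 4 to 6 is to recompute them from the raw sample. The sketch below counts how many observations fall in each class, derives the midpoints, and applies the grouped-mean formula; variable names are illustrative. A useful sanity check is that the class frequencies add up to the sample size.

```python
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

minimum, k, width = min(data), 8, 15
intervals = [(minimum + i * width, minimum + (i + 1) * width - 1) for i in range(k)]

# Absolute frequency of each class (both limits are inclusive)
freqs = [sum(lower <= x <= upper for x in data) for lower, upper in intervals]

# Midpoint of each class
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

total = sum(freqs)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / total

for (lower, upper), f, m in zip(intervals, freqs, midpoints):
    print(f"{lower:>3} - {upper:<3}  f={f:<2}  x={m:<5}  f*x={f * m}")
print("n =", total, " grouped mean =", grouped_mean)
```

Because the grouped mean replaces every observation with its class midpoint, it is an approximation of the mean of the raw values rather than an exact match.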



### [Step 7](): Histogram, Bar Plot and Time Series Frequency Distribution Over Time

- Construct a histogram, bar plot and Time Series with class intervals on the x-axis and frequencies on the y-axis.
- Each bar height corresponds to the frequency of the class.
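
For Example 1 specifically, the histogram can be drawn with Matplotlib by passing the class limits as bin edges. This is a minimal sketch and not the course notebook: the output file name `example1_histogram.png` is hypothetical, and the `Agg` backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside Jupyter
import matplotlib.pyplot as plt

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# Bin edges for the 8 classes: [2, 17), [17, 32), ..., [107, 122]
edges = [2 + 15 * i for i in range(9)]

fig, ax = plt.subplots(figsize=(10, 4))
counts, _, _ = ax.hist(data, bins=edges, color="turquoise", edgecolor="black")

# Label each bar with its inclusive class interval
centers = [(a + b) / 2 for a, b in zip(edges[:-1], edges[1:])]
ax.set_xticks(centers)
ax.set_xticklabels([f"{a} - {b - 1}" for a, b in zip(edges[:-1], edges[1:])])
ax.set_xlabel("Minutes watched (class interval)")
ax.set_ylabel("Frequency")
ax.set_title("Example 1: frequency distribution")

fig.savefig("example1_histogram.png")
```

Since all observations are integers, the half-open bins used by `hist` contain exactly the same values as the inclusive intervals in the table.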


☞ [Access Code](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Code/DataMining_1.ipynb)

☞ [Access Dataset](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/01b6e27e588c3b830561385f14bd0d246f55049d/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Banks%20Dataset/banco.csv)

☞ [Access Plots](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Plots)



### [Frequency Analysis and Time Series Visualization]()

This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.


### [1](). Install and Import Libraries

```python
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```


### [2](). Load Dataset

```python
# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('choose your dataset', sep=';')

# Select only the "day" column
df1 = df['day']
```


### [3](). Calculate Frequencies

```python
# Calculate absolute frequency (ascending order)
freq_abs = pd.Series(df1).value_counts(ascending=True)

# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = pd.Series(df1).value_counts(normalize=True).round(3)

# Create a DataFrame with both measures
df_freq = pd.DataFrame({
    'Absolute Frequency': freq_abs,
    'Relative Frequency': freq_rel
})

# Display the frequency table
display(df_freq)
```
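
To see what the frequency table looks like without loading the CSV, the same idea can be tried on a few made-up values. The `days` Series below is a hypothetical stand-in for `df['day']`; sorting by index at the end is an optional choice that orders the rows by value.

```python
import pandas as pd

# Hypothetical stand-in for df['day']
days = pd.Series([5, 5, 12, 12, 12, 20], name="day")

freq_abs = days.value_counts()
freq_rel = days.value_counts(normalize=True).round(3)

# Both Series are indexed by the distinct values,
# so pandas aligns them row by row when building the DataFrame
df_freq = pd.DataFrame({
    "Absolute Frequency": freq_abs,
    "Relative Frequency": freq_rel,
}).sort_index()

print(df_freq)
```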


### [4](). Histogram (Dark Theme)

```python
# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)

# Customize labels and ticks
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()
```





### [5](). Bar Plot (Dark Theme)

```python
# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)

# Customize labels and ticks
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')

# Show plot
plt.show()
```





### [6](). Time Series Preparation

```python
# Inspect available columns
print(df.columns)

# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()

# Add dummy year (if year column is missing)
df_time_series['year'] = 2022

# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)

# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')

# Set "date" as index
df_time_series = df_time_series.set_index('date')

# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()

# Display first rows
display(daily_counts.head())
```
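
The date construction is the step most likely to fail, so it may help to try it on a few rows first. The sketch below uses a small hypothetical DataFrame with the same `day` and `month` columns and assumes English month abbreviations, which is what the `%b` directive expects under the default locale.

```python
import pandas as pd

# Hypothetical rows standing in for the dataset's "day" and "month" columns
df = pd.DataFrame({"day": [5, 5, 12, 3], "month": ["may", "may", "may", "jun"]})

# Build strings such as "5-may-2022"; 2022 is the same dummy year used above
date_text = df["day"].astype(str) + "-" + df["month"] + "-2022"
dates = pd.to_datetime(date_text, format="%d-%b-%Y")

# Number of records per calendar day, in chronological order
daily_counts = dates.value_counts().sort_index()
print(daily_counts)
```

If a day and month combination does not exist in the dummy year (for example 29 February), `pd.to_datetime` raises an error, so the dummy year should be chosen with the data in mind.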


### [7](). Time Series Plot (Dark Theme)

```python
# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot time series
plt.plot(daily_counts, color='turquoise')

# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()
```





### [Summary]()

- **Dummy year:** 2022 was used when the year column was missing.
- **Visualizations:** histogram, bar plot, and time series chart.



## III - [class_3 - Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_3%20-%20Stats%20Review)


> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/2_3-DataMining_Statistical_Review) Class_3
>



## IV - [class_4]()


> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) Class_4
>



## V - [class_5- XXXXXXX]()


> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/5-DataMining) Class_5
>



## VI - [class_6- XXXXXX]()


> [!TIP]
>
> [Access]() Class_6
>


















## [Bibliography]()


### [Basic Bibliography]()

- CASTRO, L. N. *Introdução a mineração de dados: conceitos básicos, algoritmos e aplicações*. Saraiva, 2016.
- PIRIM, H. *Recent Applications in Data Clustering*. IntechOpen, 2018.
- SEN, J. *Machine Learning: Artificial Intelligence*. IntechOpen, 2021.


### [Complementary Bibliography]()

- THOMAS, C. *Data Mining*. IntechOpen, 2018.
- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. *Automated Machine Learning: Methods, Systems, Challenges*. Springer Nature, 2019.
- NETTO, A.; MACIEL, F. *Python para Data Science e Machine Learning Descomplicado*. Alta Books, 2021.
- RUSSELL, S. J.; NORVIG, P. *Artificial Intelligence: A Modern Approach*. GEN LTC, 2022.
- SUD, K.; ERDOGMUS, P.; KADRY, S. *Introduction to Data Science and Machine Learning*. IntechOpen, 2020.



## 💌 [Let the data flow... Ping Me !](mailto:fabicampanari@proton.me)



####

🛸๋ My Contacts [Hub](https://linktr.ee/fabianacampanari)


###




────────────── 🔭⋆ ──────────────

➣➢➤ Back to Top

#

######

Copyright 2025 Quantum Software Development. Code released under the [MIT License.](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE)