https://github.com/quantum-software-development/1-main_datamining_repository
Data mining, focusing on unsupervised learning methods (clustering, PCA, dictionary learning, anomaly detection) applied to real-world projects for third-sector organizations. Results are shared publicly in open repositories and community platforms.
- Host: GitHub
- URL: https://github.com/quantum-software-development/1-main_datamining_repository
- Owner: Quantum-Software-Development
- License: MIT
- Created: 2025-08-12T19:11:16.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-08-31T03:15:02.000Z (5 months ago)
- Last Synced: 2025-08-31T03:32:11.556Z (5 months ago)
- Language: Jupyter Notebook
- Homepage: https://github.com/Quantum-Software-Development/specialized-consulting-data-mining
- Size: 13.2 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
**\[[🇧🇷 Português](README.pt_BR.md)\] \[**[🇺🇸 English](README.md)**\]**
#
1- [Data Mining]() / [Main Repository]()
[**Institution:**]() Pontifical Catholic University of São Paulo (PUC-SP)
[**School:**]() Faculty of Interdisciplinary Studies
[**Program:**]() Humanistic AI and Data Science
[**Semester:**]() 2nd Semester 2025
Professor: [***Professor Doctor in Mathematics Daniel Rodrigues da Silva***](https://www.linkedin.com/in/daniel-rodrigues-048654a5/)
####
[](https://github.com/sponsors/Quantum-Software-Development)
#
> [!IMPORTANT]
>
> ⚠️ Heads Up
>
> * Projects and deliverables may be made [publicly available]() whenever possible.
> * The course emphasizes [**practical, hands-on experience**]() with real datasets to simulate professional consulting scenarios in the fields of **Data Analysis and Data Mining** for partner organizations and institutions affiliated with the university.
> * All activities comply with the [**academic and ethical guidelines of PUC-SP**]().
> * Any content not authorized for public disclosure will remain [**confidential**]() and securely stored in [private repositories]().
>
#
##### 🎶 Prelude Suite no.1 (J. S. Bach) - [Sound Design Remix]()
https://github.com/user-attachments/assets/4ccd316b-74a1-4bae-9bc7-1c705be80498
#### 📺 For better resolution, watch the video on [YouTube.](https://youtu.be/_ytC6S4oDbM)
> [!TIP]
>
> This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
>
> Access Data Mining [Main Repository](https://github.com/Quantum-Software-Development/1-Main_DataMining_Repository)
>
> If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository [here](https://github.com/FabianaCampanari/PracticalStats-PUCSP-2024).
>
>
## Table of Contents
1. [Course Overview](#course-overview)
- I - [class_1 - Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_1-Introduction)
- II - [class_2 - Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)
- III - [class_3 - Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_3%20-%20Stats%20Review)
2. [Objectives](#objectives)
3. [Syllabus](#syllabus)
4. [Weekly Schedule](#weekly-schedule)
5. [Tools and Technologies](#tools-and-technologies)
6. [Installation and Setup](#installation-and-setup)
7. [Assessment](#assessment)
8. [Bibliography](#bibliography)
- [Basic Bibliography](#basic-bibliography)
- [Complementary Bibliography](#complementary-bibliography)
9. [Notes](#notes)
## [Course Overview]()
This course introduces [**data mining techniques**]() with a focus on [**unsupervised learning methods**](), including:
- Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
Students will work on [**practical projects**]() inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in **open repositories** and made available to the broader community, schools, libraries, and non-profits.
## [Objectives]()
Enable students to **plan, conduct, and complete a research project** applying key **data mining concepts, algorithms, and methodologies**.
## [Syllabus]()
- Fundamentals of Data Mining
- Data cleaning and preparation
- Predictive analysis
- Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
- Application of concepts to real-world consulting scenarios
## [Weekly Schedule]()
| [Week]() | [Repos]() | [Methodology]() | [Tools]() |
|------|-------|-------------|-------|
| 1 | [Course introduction](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/d737ff164c6b4d6e580d5ba6e95c54ac604f7ea4/class_1-Introduction) | Active methodology | – |
| 2–3 | [Statistical Review](https://github.com/Quantum-Software-Development/2-3-DataMining_Statistical_Review) | Active methodology | Python |
| 4 | [Fundamentals of Data Mining](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) | Active methodology | Python |
| 5–6 | [Data cleaning and preparation](https://github.com/Quantum-Software-Development/5-DataMining) | Active methodology | Python |
| 7 | Predictive analysis | Active methodology | Python |
| 8, 10 | Clustering techniques | Active methodology | Python |
| 9 | **P1 Exam** | Written (Individual) | – |
| 11 | K-Means algorithm | Active methodology | Python |
| 12 | Affinity Propagation | Active methodology | Python |
| 13 | Mean-Shift algorithm | Active methodology | Python |
| 14 | Principal Component Analysis (PCA) | Active methodology | Python |
| 15 | Dictionary Learning | Active methodology | Python |
| 16 | **P2 Exam** | Written (Individual) | – |
| 17 | **P3 Exam & Grade Closure** | Written (Individual) | – |
| 18 | Final grade submission | – | – |
## [Tools and Technologies]()
- **Programming Language:** Python
- **Libraries:** NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
- **Environment:** Jupyter Notebook or other Python IDEs
## Installation and Setup
Follow these steps to set up your local environment for the course projects:
[1](). **Clone the repository**
```
git clone https://github.com//.git
cd
```
[2](). **Create a virtual environment** (recommended)
```
python -m venv venv
# Mac/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
```
[3](). **Install dependencies**
Make sure `pip` is updated:
```
pip install --upgrade pip
```
Then install the required packages:
```
pip install -r requirements.txt
```
*(If `requirements.txt` is not provided, install manually:)*
```
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
```
[4](). **Run Jupyter Notebook**
```
jupyter notebook
```
[5](). **Open course notebooks** and start practicing.
## I - [Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_1-Introduction)
| Exam | Date | Format | Weight |
|------|------|--------|--------|
| **P1** | 01/10/2025 | Written – Individual | Arithmetic mean |
| **P2** | 19/11/2025 | Written – Individual | Arithmetic mean |
| **P3** | Substitution exam | Written – Individual | Replaces lowest score |
[**Final Grade:**]() Arithmetic mean of assessments.
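The grading rule in the table can be sketched in a few lines of Python. This is only an illustration: the function name `final_grade` is hypothetical, and the sketch assumes that P3 replaces the lowest of P1 and P2 only when it is higher, a detail the table does not state.

```python
from statistics import mean
from typing import Optional


def final_grade(p1: float, p2: float, p3: Optional[float] = None) -> float:
    """Arithmetic mean of P1 and P2; an optional P3 replaces the lowest score.

    Assumption (not stated in the course rules): P3 is only used when it is
    higher than the lowest score, so it can never lower the final grade.
    """
    scores = [p1, p2]
    if p3 is not None and p3 > min(scores):
        # Swap the lowest score for the substitution exam
        scores[scores.index(min(scores))] = p3
    return mean(scores)


print(final_grade(6.0, 8.0))       # 7.0
print(final_grade(4.0, 8.0, 7.0))  # 7.5 (P3 replaces the 4.0)
```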
## II - [class_2- Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)
☞ [Access Booklet](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/81e2951f73c87cf7c4396a36d48be92384b7b720/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Book%20-%20Introd%20to%20Data%20Mining%20With%20Python.pdf)
## [Example 1]()
The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.
[Data]():
```
20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38
```
### [Step 1](): Determine Range and Number of Classes
- Minimum value: 2
- Maximum value: 120
- Number of classes ($k$): 8 (given)
### [Step 2](): Calculate Class Width
$$
\huge
w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = 15
$$
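The same calculation as a minimal Python sketch; the helper name `class_width` is chosen here for illustration and does not come from the course notebooks.

```python
import math


def class_width(minimum: float, maximum: float, k: int) -> int:
    """Width of each class: the range divided by k, rounded up."""
    return math.ceil((maximum - minimum) / k)


w = class_width(2, 120, 8)
print(w)  # 15, because 118 / 8 = 14.75 is rounded up
```

Rounding up (instead of to the nearest integer) guarantees that 8 classes of this width cover the whole range from 2 to 120.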
### [Step 3](): Construct Class Intervals (from minimum value)
| Class Interval | Explanation |
| :-- | :-- |
| 2 - 16 | Starts from minimum 2 |
| 17 - 31 | 16 + 1 to 31 |
| 32 - 46 | Next range |
| 47 - 61 | Next range |
| 62 - 76 | Next range |
| 77 - 91 | Next range |
| 92 - 106 | Next range |
| 107 - 121 | Covers maximum 120 |
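The intervals above can also be generated programmatically. The sketch below assumes integer data and inclusive limits, as in the table; `class_intervals` is a hypothetical helper name.

```python
def class_intervals(minimum, width, k):
    """Return k inclusive (lower, upper) class limits for integer data."""
    return [
        (minimum + i * width, minimum + (i + 1) * width - 1)
        for i in range(k)
    ]


intervals = class_intervals(2, 15, 8)
print(intervals[0], intervals[-1])  # (2, 16) (107, 121)
```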
### [Step 4](): Frequency Distribution Table
| Class Interval | Frequency |
| :--: | :--: |
| 2 - 16 | 5 |
| 17 - 31 | 14 |
| 32 - 46 | 8 |
| 47 - 61 | 11 |
| 62 - 76 | 4 |
| 77 - 91 | 7 |
| 92 - 106 | 6 |
| 107 - 121 | 5 |
### [Step 5](): Calculate Midpoints for Each Class
$$
\Huge
x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}
$$
| Class Interval | Midpoint ($x_i$) |
| :-- | :-- |
| 2 - 16 | 9 |
| 17 - 31 | 24 |
| 32 - 46 | 39 |
| 47 - 61 | 54 |
| 62 - 76 | 69 |
| 77 - 91 | 84 |
| 92 - 106 | 99 |
| 107 - 121 | 114 |
### [Step 6](): Calculate Mean Using Frequency and Midpoints
### [Mean](): ($\bar{x}$) is calculated by:
$$
\Huge
\bar{x} = \frac{\sum f_i x_i}{\sum f_i}
$$
### [Where](): $f_i$ = frequency, $x_i$ = [Midpoint]().
### [Calculate each product]():
| Class Interval | $f_i$ | $x_i$ | $f_i \times x_i$ |
| :-- | :-- | :-- | :-- |
| 2 - 16 | 5 | 9 | 45 |
| 17 - 31 | 14 | 24 | 336 |
| 32 - 46 | 8 | 39 | 312 |
| 47 - 61 | 11 | 54 | 594 |
| 62 - 76 | 4 | 69 | 276 |
| 77 - 91 | 7 | 84 | 588 |
| 92 - 106 | 6 | 99 | 594 |
| 107 - 121 | 5 | 114 | 570 |
### [Sum frequencies](): $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5$ = [60]()
### [Sum of products](): $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570$ = [3315]()
### [Calculate mean]():
$$
\huge
\bar{x} = \frac{3315}{60} = 55.25
$$
$$
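Steps 4 to 6 can be recomputed directly from the raw data, which is a convenient way to check a hand-built table. The sketch below uses only the Python standard library and assumes the same 8 inclusive classes of width 15 starting at the minimum; the variable names are illustrative.

```python
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

k, width, start = 8, 15, min(data)

# Inclusive class limits: 2-16, 17-31, ..., 107-121
limits = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]

# f_i: number of observations that fall inside each class
freqs = [sum(lo <= x <= hi for x in data) for lo, hi in limits]

# x_i: class midpoints
midpoints = [(lo + hi) / 2 for lo, hi in limits]

# Every observation must land in exactly one class
assert sum(freqs) == len(data)

# Grouped mean: sum(f_i * x_i) / sum(f_i)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

for (lo, hi), f, m in zip(limits, freqs, midpoints):
    print(f"{lo:>3} - {hi:<3} | f = {f:>2} | x = {m:>5.1f} | f*x = {f * m:>6.1f}")
print("n =", sum(freqs), "| grouped mean =", grouped_mean)
```

The grouped mean is an approximation, because every observation is replaced by the midpoint of its class; comparing it with `sum(data) / len(data)` shows how much detail the grouping loses.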
### [Step 7](): Histogram, Bar Plot, and Time Series of Frequencies
- Construct a histogram and a bar plot with class intervals on the x-axis and frequencies on the y-axis; each bar height corresponds to the frequency of its class.
- For the time series, plot dates on the x-axis and the number of records per date on the y-axis.
☞ [Access Code](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Code/DataMining_1.ipynb)
☞ [Access Dataset](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/01b6e27e588c3b830561385f14bd0d246f55049d/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Banks%20Dataset/banco.csv)
☞ [Access Plots](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Plots)
### [Frequency Analysis and Time Series Visualization]()
This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.
### [1](). Install and Import Libraries
```python
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
### [2](). Load Dataset
```python
# Load CSV file (semicolon-separated)
df = pd.read_csv('your_dataset.csv', sep=';')  # placeholder: replace with the path to your CSV file
# Select only the "day" column
df1 = df['day']
```
### [3](). Calculate Frequencies
```python
# Calculate absolute frequency (ascending order)
freq_abs = pd.Series(df1).value_counts(ascending=True)
# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = pd.Series(df1).value_counts(normalize=True).round(3)
# Create a DataFrame with both measures
df_freq = pd.DataFrame({
'Absolute Frequency': freq_abs,
'Relative Frequency': freq_rel
})
# Display the frequency table
display(df_freq)
```
### [4](). Histogram (Dark Theme)
```python
# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')      # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
```

### [5](). Bar Plot (Dark Theme)
```python
# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')      # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')
# Show plot
plt.show()
```

### [6](). Time Series Preparation
```python
# Inspect available columns
print(df.columns)
# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()
# Add dummy year (if year column is missing)
df_time_series['year'] = 2022
# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)
# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')
# Set "date" as index
df_time_series = df_time_series.set_index('date')
# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()
# Display first rows
display(daily_counts.head())
```
### [7](). Time Series Plot (Dark Theme)
```python
# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot time series
plt.plot(daily_counts, color='turquoise')
# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
```

### [Summary]()
- **Dummy year:** 2022 is used when the dataset has no year column.
- **Visualizations:** histogram, bar plot, and time series chart.
## III - [class_3 - Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_3%20-%20Stats%20Review)
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/2_3-DataMining_Statistical_Review) Class_3
>
## IV - [class_4]()
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) Class_4
>
## V - [class_5- XXXXXXX]()
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/5-DataMining) Class_5
>
## VI - [class_6- XXXXXX]()
> [!TIP]
>
> [Access]() Class_6
>
## [Bibliography]()
### [Basic Bibliography]()
- CASTRO, L. N. *Introdução a mineração de dados: conceitos básicos, algoritmos e aplicações*. Saraiva, 2016.
- PIRIM, H. *Recent Applications in Data Clustering*. IntechOpen, 2018.
- SEN, J. *Machine Learning: Artificial Intelligence*. IntechOpen, 2021.
### [Complementary Bibliography]()
- THOMAS, C. *Data Mining*. IntechOpen, 2018.
- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. *Automated Machine Learning: Methods, Systems, Challenges*. Springer Nature, 2019.
- NETTO, A.; MACIEL, F. *Python para Data Science e Machine Learning Descomplicado*. Alta Books, 2021.
- RUSSELL, S. J.; NORVIG, P. *Artificial Intelligence: A Modern Approach*. GEN LTC, 2022.
- SUD, K.; ERDOGMUS, P.; KADRY, S. *Introduction to Data Science and Machine Learning*. IntechOpen, 2020.
## 💌 [Let the data flow... Ping Me !](mailto:fabicampanari@proton.me)
####
🛸๋ My Contacts [Hub](https://linktr.ee/fabianacampanari)
###

────────────── 🔭⋆ ──────────────
➣➢➤ Back to Top
#
######
Copyright 2025 Quantum Software Development. Code released under the [MIT License](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE).