https://github.com/quantum-software-development/1-datamining_main_repository
data mining, focusing on unsupervised learning methods (clustering, PCA, dictionary learning, anomaly detection) applied to real-world projects for third-sector organizations. Results are shared publicly in open repositories and community platforms.
- Host: GitHub
- URL: https://github.com/quantum-software-development/1-datamining_main_repository
- Owner: Quantum-Software-Development
- License: mit
- Created: 2025-08-12T19:11:16.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-09-17T15:48:58.000Z (4 months ago)
- Last Synced: 2025-09-17T17:54:54.182Z (4 months ago)
- Language: Jupyter Notebook
- Homepage: https://github.com/Quantum-Software-Development/specialized-consulting-data-mining
- Size: 13.3 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
**\[[🇧🇷 Português](README.pt_BR.md)\] \[**[🇺🇸 English](README.md)**\]**
#
1- [Data Mining]() / [Main Repository]()
[**Institution:**]() Pontifical Catholic University of São Paulo (PUC-SP)
[**School:**]() Faculty of Interdisciplinary Studies
[**Program:**]() Humanistic AI and Data Science
[**Semester:**]() 2nd Semester 2025
Professor: [***Professor Doctor in Mathematics Daniel Rodrigues da Silva***](https://www.linkedin.com/in/daniel-rodrigues-048654a5/)
####
[](https://github.com/sponsors/Quantum-Software-Development)
#
> [!IMPORTANT]
>
> ⚠️ Heads Up
>
> * Projects and deliverables may be made [publicly available]() whenever possible.
> * The course emphasizes [**practical, hands-on experience**]() with real datasets to simulate professional consulting scenarios in the fields of **Data Analysis and Data Mining** for partner organizations and institutions affiliated with the university.
> * All activities comply with the [**academic and ethical guidelines of PUC-SP**]().
> * Any content not authorized for public disclosure will remain [**confidential**]() and securely stored in [private repositories]().
>
#
##### 🎶 Prelude Suite no.1 (J. S. Bach) - [Sound Design Remix]()
https://github.com/user-attachments/assets/4ccd316b-74a1-4bae-9bc7-1c705be80498
#### 📺 For better resolution, watch the video on [YouTube.](https://youtu.be/_ytC6S4oDbM)
> [!TIP]
>
> This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
>
> ### **Access Data Mining [Main Repository](https://github.com/Quantum-Software-Development/1-Main_DataMining_Repository)**
>
> If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository [here](https://github.com/FabianaCampanari/PracticalStats-PUCSP-2024).
>
>
## Table of Contents
1. [Course Overview](#course-overview)
- I - [class_1 - Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_1-Introduction)
- II - [class_2 - Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)
- III - [class_3 - Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a98512aa9dc2525446a3ffb236d06cbfb16d1f43/class_3%20-%20Stats%20Review)
2. [Objectives](#objectives)
3. [Syllabus](#syllabus)
4. [Weekly Schedule](#weekly-schedule)
5. [Tools and Technologies](#tools-and-technologies)
6. [Installation and Setup](#installation-and-setup)
7. [Assessment](#assessment)
8. [Bibliography](#bibliography)
- [Basic Bibliography](#basic-bibliography)
- [Complementary Bibliography](#complementary-bibliography)
9. [Notes](#notes)
## [Course Overview]()
This course introduces [**data mining techniques**]() with a focus on [**unsupervised learning methods**](), including:
- Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
Students will work on [**practical projects**]() inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in **open repositories** and made available to the broader community, schools, libraries, and non-profits.
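As a rough illustration of how the three clustering algorithms listed above differ in use, the sketch below runs each of them with scikit-learn on synthetic 2-D data. The data, the cluster centers, and every parameter value (`n_init`, `quantile`, `random_state`) are assumptions chosen for the example, not course material.

```python
# Sketch only: synthetic 2-D points stand in for a real project dataset.
from sklearn.cluster import AffinityPropagation, KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups (illustrative, not course data).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# K-Means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Affinity Propagation picks the number of clusters itself via message passing.
ap_labels = AffinityPropagation(random_state=42).fit_predict(X)

# Mean-Shift looks for density modes; the bandwidth is estimated from the data.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=42)
ms_labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

for name, labels in [
    ("K-Means", kmeans_labels),
    ("Affinity Propagation", ap_labels),
    ("Mean-Shift", ms_labels),
]:
    print(f"{name}: {len(set(labels))} clusters")
```

The main practical difference shown here is that only K-Means requires the cluster count as an input; the other two infer it, so their results depend on the parameters chosen.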
## [Objectives]()
Enable students to **plan, conduct, and complete a research project** applying key **data mining concepts, algorithms, and methodologies**.
## [Syllabus]()
- Fundamentals of Data Mining
- Data cleaning and preparation
- Predictive analysis
- Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
- Application of concepts to real-world consulting scenarios
## [Weekly Schedule]()
| [Week]() | [Repos]() | [Methodology]() | [Tools]() |
|------|-------|-------------|-------|
| 1 | [Course introduction](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/d737ff164c6b4d6e580d5ba6e95c54ac604f7ea4/class_1-Introduction) | Active methodology | - |
| 2 | [Statistical Review - Stats Measures - Mean - Median - Mode - Variance](https://github.com/Quantum-Software-Development/2-DataMining_Statistical_Measures) | Active methodology | Python |
| 3 | [Statistical Review - Variation Measures and Standard Deviation](https://github.com/Quantum-Software-Development/3-DataMining_VariationMeasures_Standard-Deviation) | Active methodology | Python |
| 4 | [Data Mining - Concepts - Exploratory Analysis](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) | Active methodology | Python - R |
| 5 | [Data Cleaning - Preparation - Anomalies (Outliers)](https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier) | Active methodology | Python |
| 6 | [Data Mining - Pre Processing](https://github.com/Quantum-Software-Development/6-DataMining_Pre-Processing) | Active methodology | Python |
| 7 | [Regression Techniques with Data Integration](https://github.com/Quantum-Software-Development/7-DataMining-Regression-Techniques-Data-Integration) | Active methodology | Python |
| 8 | [Predictive K-Means Clustering Data and Figures Analysis](https://github.com/Quantum-Software-Development/8-DataMining-KMeans-Non-Hierarchical-Clustering) | Active methodology | Python |
| 9 | Clustering techniques | Active methodology | Python |
| 10 | Clustering techniques | Active methodology | Python |
| 11 | **P1 Exam** | Written (Individual) | - |
| 12 | K-Means algorithm | Active methodology | Python |
| 13 | Affinity Propagation | Active methodology | Python |
| 14 | Mean-Shift algorithm | Active methodology | Python |
| 15 | Principal Component Analysis (PCA) | Active methodology | Python |
| 16 | Dictionary Learning | Active methodology | Python |
| 17 | **P2 Exam** | Written (Individual) | - |
| 18 | **P3 Exam & Grade Closure** | Written (Individual) | - |
| 19 | Final grade submission | - | - |
## [Tools and Technologies]()
- **Programming Language:** Python
- **Libraries:** NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
- **Environment:** Jupyter Notebook or other Python IDEs
## Installation and Setup
Follow these steps to set up your local environment for the course projects:
[1](). **Clone the repository**
```
git clone https://github.com//.git
cd
```
[2](). **Create a virtual environment** (recommended)
```
python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows
```
[3](). **Install dependencies**
Make sure `pip` is updated:
```
pip install --upgrade pip
```
Then install the required packages:
```
pip install -r requirements.txt
```
*(If `requirements.txt` is not provided, install manually:)*
```
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
```
[4](). **Run Jupyter Notebook**
```
jupyter notebook
```
[5](). **Open course notebooks** and start practicing.
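Before opening the notebooks, it can help to confirm that the packages from the install command above are visible to the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the manual install command.
PACKAGES = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn", "jupyter"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```

Run it with the virtual environment activated; any line reporting `NOT INSTALLED` points to a package that still needs `pip install`.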
## I - [Introduction and Assessment](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_1-Introduction)
| Exam | Date | Format | Weight |
|------|------|--------|--------|
| **P1** | 01/10/2025 | Written - Individual | Arithmetic mean |
| **P2** | 19/11/2025 | Written - Individual | Arithmetic mean |
| **P3** | Substitution exam | Written - Individual | Replaces lowest score |
[**Final Grade:**]() Arithmetic mean of assessments.
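The grading rule can be sketched as a small function. Two points are assumptions rather than stated policy: that P3, once taken, replaces the lower of P1 and P2 unconditionally, and that the scores in the usage lines are on a 0 to 10 scale. The function name `final_grade` is hypothetical.

```python
def final_grade(p1, p2, p3=None):
    """Arithmetic mean of P1 and P2; P3, if taken, replaces the lower score.

    Assumption: the substitution is unconditional once P3 is taken.
    """
    scores = [p1, p2]
    if p3 is not None:
        # Swap out the lowest of the two regular exam scores.
        scores[scores.index(min(scores))] = p3
    return sum(scores) / len(scores)


print(final_grade(7.0, 5.0))       # 6.0
print(final_grade(7.0, 5.0, 8.0))  # 7.5
```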
## II - [class_2- Introduction - Data Mining With Python](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_2%20-%20Introduction%20-%20Data%20Mining%20With%20Python)
[Access Booklet](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/81e2951f73c87cf7c4396a36d48be92384b7b720/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Book%20-%20Introd%20to%20Data%20Mining%20With%20Python.pdf)
## [Example 1]()
The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.
[Data]():
```
20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38
```
### [Step 1](): Determine Range and Number of Classes
- Minimum value: 2
- Maximum value: 120
- Number of classes ($k$): 8 (given)
### [Step 2](): Calculate Class Width
$$
\huge
w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = 15
$$
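The same calculation in Python, as a quick check of the rounding step (the variable names are illustrative):

```python
import math

minimum, maximum, k = 2, 120, 8        # values from Step 1

data_range = maximum - minimum         # 118
width = math.ceil(data_range / k)      # ceil(14.75) = 15

print(data_range, width)               # 118 15
```

Rounding up, rather than to the nearest integer, guarantees that the 8 classes together cover the maximum value.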
### [Step 3](): Construct Class Intervals (from minimum value)
| Class Interval | Explanation |
| :-- | :-- |
| 2 - 16 | Starts from minimum 2 |
| 17 - 31 | 16 + 1 to 31 |
| 32 - 46 | Next range |
| 47 - 61 | Next range |
| 62 - 76 | Next range |
| 77 - 91 | Next range |
| 92 - 106 | Next range |
| 107 - 121 | Covers maximum 120 |
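The interval limits above follow a simple pattern that can be generated programmatically. This is a minimal sketch assuming integer data and inclusive limits, as in the table:

```python
minimum, width, k = 2, 15, 8   # from Steps 1 and 2

# Inclusive integer limits: each class spans `width` consecutive values.
intervals = [
    (minimum + i * width, minimum + (i + 1) * width - 1)
    for i in range(k)
]

for lower, upper in intervals:
    print(f"{lower} - {upper}")
```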
### [Step 4](): Frequency Distribution Table
| Class Interval | Frequency |
| :--: | :--: |
| 2 - 16 | 5 |
| 17 - 31 | 14 |
| 32 - 46 | 8 |
| 47 - 61 | 11 |
| 62 - 76 | 4 |
| 77 - 91 | 7 |
| 92 - 106 | 6 |
| 107 - 121 | 5 |
### [Step 5](): Calculate Midpoints for Each Class
$$
\Huge
x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}
$$
| Class Interval | Midpoint ($x_i$) |
| :-- | :-- |
| 2 - 16 | 9 |
| 17 - 31 | 24 |
| 32 - 46 | 39 |
| 47 - 61 | 54 |
| 62 - 76 | 69 |
| 77 - 91 | 84 |
| 92 - 106 | 99 |
| 107 - 121 | 114 |
### [Step 6](): Calculate Mean Using Frequency and Midpoints
### [Mean](): ($\bar{x}$) is calculated by:
$$
\Huge
\bar{x} = \frac{\sum f_i x_i}{\sum f_i}
$$
### [Where](): $f_i$ = frequency, $x_i$ = [Midpoint]().
### [Calculate each product]():
| Class Interval | $f_i$ | $x_i$ | $f_i \times x_i$ |
| :-- | :-- | :-- | :-- |
| 2 - 16 | 5 | 9 | 45 |
| 17 - 31 | 14 | 24 | 336 |
| 32 - 46 | 8 | 39 | 312 |
| 47 - 61 | 11 | 54 | 594 |
| 62 - 76 | 4 | 69 | 276 |
| 77 - 91 | 7 | 84 | 588 |
| 92 - 106 | 6 | 99 | 594 |
| 107 - 121 | 5 | 114 | 570 |
### [Sum frequencies](): $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5$ = [60]()
### [Sum of products](): $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570$ = [3315]()
### [Calculate mean]():
$$
\huge
\bar{x} = \frac{3315}{60} = 55.25
$$
$$
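Steps 4 to 6 can also be recomputed directly from the raw sample, which serves as a check on the hand-built tables: the frequencies should add up to the sample size of 60. The sketch below uses only the standard library; the variable names are illustrative.

```python
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

minimum, width, k = 2, 15, 8
intervals = [(minimum + i * width, minimum + (i + 1) * width - 1) for i in range(k)]

# Step 4: count the observations that fall in each inclusive interval.
frequencies = [sum(lower <= x <= upper for x in data) for lower, upper in intervals]

# Step 5: midpoint of each class.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

# Step 6: grouped mean = sum(f_i * x_i) / sum(f_i).
n = sum(frequencies)
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

for (lower, upper), f, x in zip(intervals, frequencies, midpoints):
    print(f"{lower:>3} - {upper:<3}  f = {f:<2}  x = {x}")
print("n =", n, "(sample size:", len(data), ")")
print("grouped mean =", grouped_mean)
print("raw mean     =", round(sum(data) / len(data), 3))
```

The grouped mean is an approximation: it treats every observation as if it sat at its class midpoint, so a small gap from the raw mean is expected.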
### [Step 7](): Histogram, Bar Plot and Time Series Frequency Distribution Over Time
- Construct a histogram, bar plot and Time Series with class intervals on the x-axis and frequencies on the y-axis.
- Each bar height corresponds to the frequency of the class.
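For the Example 1 sample itself, a histogram with the eight classes can be drawn by passing explicit bin edges to matplotlib. This is a sketch under the assumption that the data are integers, so the half-open bin [2, 17) matches the class 2 - 16; the axis labels, title, and colors are illustrative choices.

```python
import matplotlib.pyplot as plt

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# Nine edges (2, 17, ..., 122) define the eight classes of width 15.
edges = [2 + 15 * i for i in range(9)]

fig, ax = plt.subplots(figsize=(10, 4))
counts, _, _ = ax.hist(data, bins=edges, color="turquoise", edgecolor="black")

ax.set_xticks(edges)
ax.set_xlabel("Minutes watched")
ax.set_ylabel("Frequency")
ax.set_title("Cable TV viewing time (8 classes)")
plt.show()
```

`counts` holds the bar heights, so it can be compared against the frequency table from Step 4.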
[Access Code](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Code/DataMining_1.ipynb)
[Access Dataset](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/blob/01b6e27e588c3b830561385f14bd0d246f55049d/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Banks%20Dataset/banco.csv)
[Access Plots](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/a61b0572e5bca4d6f06b0187722f8ef97214c0a4/class_1-%20Introduction%20-%20Data%20Mining%20With%20Python/Plots)
### [Frequency Analysis and Time Series Visualization]()
This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.
### [1](). Install and Import Libraries
```python
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
### [2](). Load Dataset
```python
# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('path/to/your_dataset.csv', sep=';')
# Select only the "day" column
df1 = df['day']
```
### [3](). Calculate Frequencies
```python
# Calculate absolute frequency (ascending order)
freq_abs = df1.value_counts(ascending=True)
# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = df1.value_counts(normalize=True).round(3)
# Create a DataFrame with both measures (rows are aligned on the value index)
df_freq = pd.DataFrame({
    'Absolute Frequency': freq_abs,
    'Relative Frequency': freq_rel
})
# Display the frequency table (display() is available in Jupyter/IPython)
display(df_freq)
```
### [4](). Histogram (Dark Theme)
```python
# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')     # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
```
### [5](). Bar Plot (Dark Theme)
```python
# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')     # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')
# Show plot
plt.show()
```
### [6](). Time Series Preparation
```python
# Inspect available columns
print(df.columns)
# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()
# Add dummy year (if year column is missing)
df_time_series['year'] = 2022
# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)
# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')
# Set "date" as index
df_time_series = df_time_series.set_index('date')
# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()
# Display first rows
display(daily_counts.head())
```
### [7](). Time Series Plot (Dark Theme)
```python
# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot time series
plt.plot(daily_counts, color='turquoise')
# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
```
### [Summary]()
- Dummy year: 2022 was used because the dataset has no year column.
- Visualizations: histogram, bar plot, and time series chart.
## III - [class_3- Stats Review](https://github.com/Quantum-Software-Development/specialized-consulting-data-mining/tree/86d9d9fbc56efdd0b8e377955c1c7abf8879b775/class_3%20-%20Stats%20Review)
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/2_3-DataMining_Statistical_Review) Class_3
>
## IV - [class_4- Data Mining - Concepts - Exploratory Analysis]()
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/4-DataMining_Concepts_ExploratoryAnalysis) Class_4
>
## V - [class_5- Data Cleaning - Preparation - Anomalies(Outliers)]()
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier) Class_5
>
## VI - [class_6- Data Mining - Pre Processing]()
> [!TIP]
>
> [Access](https://github.com/Quantum-Software-Development/6-DataMining_Pre-Processing) Class_6
>
## VII - [class_7- XXXXXX]()
> [!TIP]
>
> [Access]() Class_7
>
> ⚠️ Coming Soon
>
## [Bibliography]()
[1](). **Castro, L. N. & Ferrari, D. G.** (2016). *Introduction to Data Mining: Basic Concepts, Algorithms, and Applications*. Saraiva.
[2](). **Ferreira, A. C. P. L. et al.** (2024). *Artificial Intelligence - A Machine Learning Approach*. 2nd Ed. LTC.
[3](). **Larson & Farber** (2015). *Applied Statistics*. Pearson.
### [Complementary Bibliography]()
- THOMAS, C. *Data Mining*. IntechOpen, 2018.
- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. *Automated Machine Learning: Methods, Systems, Challenges*. Springer Nature, 2019.
- NETTO, A.; MACIEL, F. *Python para Data Science e Machine Learning Descomplicado*. Alta Books, 2021.
- RUSSELL, S. J.; NORVIG, P. *Artificial Intelligence: A Modern Approach*. GEN LTC, 2022.
- SUD, K.; ERDOGMUS, P.; KADRY, S. *Introduction to Data Science and Machine Learning*. IntechOpen, 2020.
## [Let the data flow... Ping Me !](mailto:fabicampanari@proton.me)
####
My Contacts [Hub](https://linktr.ee/fabianacampanari)
###
#
######
Copyright 2025 Quantum Software Development. Code released under the [MIT License.](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE)