https://github.com/quantum-software-development/2_3-datamining_statistical_review

This repository contains materials and examples for the Data Mining with Python Class 2 and 3 course, focusing on fundamental statistical concepts and data analysis techniques essential for data mining applications.
Host: GitHub
URL: https://github.com/quantum-software-development/2_3-datamining_statistical_review
Owner: Quantum-Software-Development
License: mit
Created: 2025-08-13T16:11:33.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-08-31T02:39:55.000Z (5 months ago)
Last Synced: 2025-08-31T04:11:40.404Z (5 months ago)
Language: Jupyter Notebook
Homepage:
Size: 4.45 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

          



**\[[🇧🇷 Português](README.pt_BR.md)\] \[**[🇺🇸 English](README.md)**\]**





# 
  2_3- [Data Mining]() / [Statistic Review]()





[**Institution:**]() Pontifical Catholic University of São Paulo (PUC-SP)  

[**School:**]() Faculty of Interdisciplinary Studies  

[**Program:**]() Humanistic AI and Data Science

[**Semester:**]() 2nd Semester 2025  

Professor:  [***Professor Doctor in Mathematics Daniel Rodrigues da Silva***](https://www.linkedin.com/in/daniel-rodrigues-048654a5/)





#### 
 [![Sponsor Quantum Software Development](https://img.shields.io/badge/Sponsor-Quantum%20Software%20Development-brightgreen?logo=GitHub)](https://github.com/sponsors/Quantum-Software-Development)





#






> [!IMPORTANT]

> 

> ⚠️ Heads Up

>

> * Projects and deliverables may be made [publicly available]() whenever possible.

> * The course emphasizes [**practical, hands-on experience**]() with real datasets to simulate professional consulting scenarios in the fields of **Data Analysis and Data Mining** for partner organizations and institutions affiliated with the university.

> * All activities comply with the [**academic and ethical guidelines of PUC-SP**]().

> * Any content not authorized for public disclosure will remain [**confidential**]() and securely stored in [private repositories]().  

>





#







##### 🎶 Prelude Suite no.1 (J. S. Bach) - [Sound Design Remix]()

https://github.com/user-attachments/assets/4ccd316b-74a1-4bae-9bc7-1c705be80498

####  📺 For better resolution, watch the video on [YouTube.](https://youtu.be/_ytC6S4oDbM)





> [!TIP]

> 

>  This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

>

>  Access Data Mining [Main Repository](https://github.com/Quantum-Software-Development/1-Main_DataMining_Repository)

> 

>  If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository [here](https://github.com/FabianaCampanari/PracticalStats-PUCSP-2024).

>

>





##  [Overview]()




This repository contains materials and examples for the **Introduction to Data Mining with Python Class 1** course, focusing on fundamental statistical concepts and data analysis techniques essential for data mining applications.




## Repository Structure

```

├── data/                 # Sample datasets

├── notebooks/           # Jupyter notebooks with examples

├── scripts/             # Python scripts for analysis

├── images/              # Generated plots and visualizations

└── docs/                # Additional documentation

```





## Getting Started

### Prerequisites:

- Python 3.7+

- Required libraries: pandas, numpy, matplotlib, seaborn, scikit-learn




### Installation:

```bash

pip install pandas numpy matplotlib seaborn scikit-learn jupyter

```




### Quick Start:




```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data: minutes online for 50 internet users
data = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
        41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
        18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]

# Create a histogram with 7 classes
plt.figure(figsize=(10, 6))
plt.hist(data, bins=7, edgecolor='black')
plt.title('Internet Usage Distribution')
plt.xlabel('Minutes Online')
plt.ylabel('Frequency')
plt.show()

# Calculate statistics
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
# np.std uses the population formula by default (ddof=0);
# pass ddof=1 for the sample standard deviation
print(f"Standard Deviation: {np.std(data):.2f}")
```





## Key Learning Outcomes

After completing this course, students will be able to:

1. **Construct and interpret frequency distributions** from raw data

2. **Create various types of histograms** and understand their relationship to frequency distributions

3. **Identify and handle outliers** in datasets

4. **Analyze distribution shapes** and their implications

5. **Calculate and interpret central tendency measures**

6. **Apply statistical concepts** to data mining problems

7. **Use Python tools** for statistical analysis and visualization




## Important Notes

- **Outliers require careful consideration** - they may represent valuable insights or data quality issues

- **Histogram bins should be chosen thoughtfully** - too few may hide patterns, too many may create noise

- **Frequency distributions are fundamental** to understanding data structure before applying advanced data mining techniques

- **Visual analysis complements numerical statistics** for comprehensive data understanding




*This material is part of the Introduction to Data Mining with Python course, focusing on fundamental statistical concepts essential for effective data analysis and mining.*




## [Class_1 Content]()

### Syllabus (Ementa)

- **Descriptive Statistics Review**

- **Data Mining Concepts**

- **Exploratory Data Analysis**

- **Predictive Analysis**

- **Clustering**

- **Association Rules**




### [Assessment Criteria]()

- Minimum 75% attendance required

- Final grade ≥ 5.0

- Formula: MF = (N₁ + N₂)/2, where Nᵢ = (Pᵢ + Aᵢ)/2

  - Pᵢ = Project grade for semester i

  - Aᵢ = Activity/exam grade for semester i
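As a quick illustration of the grading formula above, the sketch below computes MF from project and activity grades. The function name `final_grade` and the sample grades are assumptions made for this example, not values from the course.

```python
def final_grade(projects, activities):
    """Return MF = (N1 + N2) / 2, where Ni = (Pi + Ai) / 2."""
    # One partial grade per period: the average of project and activity grades
    partials = [(p + a) / 2 for p, a in zip(projects, activities)]
    return sum(partials) / len(partials)

# Hypothetical grades: P1 = 7.0, A1 = 6.0, P2 = 8.0, A2 = 5.0
mf = final_grade([7.0, 8.0], [6.0, 5.0])
print(f"MF = {mf:.1f}, meets the 5.0 minimum: {mf >= 5.0}")
```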




## [Key Topics Covered]()




### [1](). Frequency Distribution

A **frequency distribution** is a table that shows classes or intervals of data with a count of the number of entries in each class. It's fundamental for understanding data patterns and is the foundation for creating histograms.


 

#### [Components]():

- **Class limits**: Lower and upper boundaries of each class

- **Class size**: The width of each class interval

- **Frequency (f)**: Number of data entries in each class

- **Relative frequency**: Proportion of data in each class (f/n)

- **Cumulative frequency**: Sum of frequencies up to a given class


 

#### Construction Steps:

1. Decide the number of classes (typically 5-20)

2. Calculate class size: (max - min) / number of classes, rounded up to the next convenient number

3. Determine class limits

4. Count frequencies for each class

5. Calculate additional measures (relative, cumulative frequencies)
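The construction steps above can be sketched in plain Python. This is a minimal illustration, assuming whole-number data and a class size rounded up to the next whole number (one common convention); the helper name `frequency_table` and its dictionary layout are hypothetical. The sample is the internet-usage data from the Quick Start.

```python
import math
from itertools import accumulate

# Minutes online for 50 users (same sample as the Quick Start example)
data = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
        41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
        18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]

def frequency_table(data, num_classes):
    """Build a frequency table for whole-number data (illustrative helper)."""
    low, high = min(data), max(data)
    # Smallest whole-number class size that lets num_classes classes cover the range
    size = math.floor((high - low) / num_classes) + 1
    lowers = [low + i * size for i in range(num_classes)]
    # Count the entries that fall between each lower and upper class limit
    freqs = [sum(1 for x in data if lo <= x <= lo + size - 1) for lo in lowers]
    n = len(data)
    return [
        {"limits": (lo, lo + size - 1), "f": f, "relative": f / n, "cumulative": cf}
        for lo, f, cf in zip(lowers, freqs, accumulate(freqs))
    ]

table = frequency_table(data, num_classes=7)
for row in table:
    lo, hi = row["limits"]
    print(f"{lo:>2} - {hi:<2}  f={row['f']:>2}  "
          f"rel={row['relative']:.2f}  cum={row['cumulative']}")
```

With 7 classes this should reproduce the class limits and frequencies used in the worked example further down (7 – 18, 19 – 30, and so on).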



 

### 2. Histograms and Their Relationship to Frequency Distributions

**Histograms are directly related to frequency distributions** - they are the graphical representation of frequency distribution tables.


 

#### Key Characteristics:

- **Bar chart** representing frequency distribution

- **Horizontal axis**: Quantitative data values (class boundaries)

- **Vertical axis**: Frequencies or relative frequencies

- **Consecutive bars must touch** (unlike regular bar charts)

- **Class boundaries**: Numbers that separate classes without gaps


 

#### Types of Histograms:

1. **Frequency Histogram**: Shows absolute frequencies

2. **Relative Frequency Histogram**: Shows proportions/percentages

3. **Frequency Polygon**: Line graph emphasizing continuous change
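To make the link between a frequency table and its histogram concrete, the sketch below derives the class boundaries, the relative frequencies, and the midpoints a frequency polygon would use. The class limits and frequencies are taken from the internet-usage example later in this README; the variable names and the text-based bars are illustrative assumptions.

```python
# (lower limit, upper limit) for each class, with the matching frequencies
limits = [(7, 18), (19, 30), (31, 42), (43, 54), (55, 66), (67, 78), (79, 90)]
freqs = [6, 10, 13, 8, 5, 6, 2]

# Class boundaries sit halfway between one upper limit and the next lower limit,
# so adjacent bars share an edge and touch
gap = limits[1][0] - limits[0][1]  # 1 for whole-number data
boundaries = [lo - gap / 2 for lo, _ in limits] + [limits[-1][1] + gap / 2]

n = sum(freqs)
relative = [f / n for f in freqs]                  # bar heights of a relative frequency histogram
midpoints = [(lo + hi) / 2 for lo, hi in limits]   # x-coordinates of a frequency polygon

# Text rendering: one row per class, one '#' per data entry
for (lo, hi), f, r in zip(limits, freqs, relative):
    print(f"{lo - gap / 2:5.1f} to {hi + gap / 2:5.1f} | {'#' * f} ({r:.0%})")
```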



 

### 3. Outliers in Histograms  

**Outliers are, by definition, rare values** that lie far from the rest of the data. They can represent various phenomena:




#### What Outliers May Indicate:

- **Data entry errors** (typing mistakes)

- **Measurement errors**

- **Fraudulent activities**

- **Genuine extreme values**

- **Equipment malfunctions**




#### Impact on Histograms:

- **Compress the main data into a few bars** (the axis stretches to include the outlier, leaving a sparse representation)

- **Create gaps** in the distribution

- **Skew the overall pattern**

- **Affect central tendency measures**

- **May require special handling** in analysis




#### Outlier Detection in Histograms:

- Visible as **isolated bars** far from main distribution

- **Large gaps** between bars

- **Extremely tall or short bars** at distribution extremes

- **Asymmetric patterns** in otherwise normal distributions
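The "large gaps between bars" cue can be checked numerically. The sketch below is one simple way to do it, assuming equal-width classes and a `start` value no greater than the smallest entry; the helper names and the sample data are hypothetical.

```python
from collections import Counter

def class_counts(data, start, width):
    """Count entries per equal-width class, beginning at `start`."""
    counts = Counter(int((x - start) // width) for x in data)
    # Include every class up to the last occupied one, empty or not
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

def empty_classes(counts):
    """Indexes of empty classes, which show up as gaps in the histogram."""
    return [i for i, c in enumerate(counts) if c == 0]

# Hypothetical sample with one suspicious value (95)
sample = [12, 15, 14, 18, 21, 19, 16, 95]
counts = class_counts(sample, start=10, width=10)
print("Frequencies:", counts)
print("Empty classes:", empty_classes(counts))
```

A long run of empty classes before an isolated bar is only a hint that the value deserves a closer look, not proof that it is an error.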





### 4. Distribution Shapes

Understanding distribution shapes helps identify data characteristics:

#### Symmetric Distribution:

- Mean ≈ Median ≈ Mode

- Bell-shaped or uniform patterns

- Equal spread on both sides




#### Left-Skewed (Negatively Skewed):

- Mean < Median < Mode

- Tail extends to the left

- Few extremely low values




#### Right-Skewed (Positively Skewed):

- Mode < Median < Mean

- Tail extends to the right

- Few extremely high values




#### Uniform Distribution:

- All classes have equal frequencies

- Rectangular shape in histogram
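The mean and median relationships listed above suggest a rough check for skew direction. The sketch below is a rule-of-thumb heuristic only, not a formal skewness measure; the function name `skew_hint` and the samples are assumptions for illustration. Real data rarely gives an exact tie, so a tolerance would be needed in practice.

```python
from statistics import mean, median

def skew_hint(data):
    """Guess the skew direction by comparing mean and median (rule of thumb)."""
    m, med = mean(data), median(data)
    if m > med:
        return "right-skewed"   # a few high values pull the mean up
    if m < med:
        return "left-skewed"    # a few low values pull the mean down
    return "roughly symmetric"

# Hypothetical samples
print(skew_hint([1, 2, 2, 3, 3, 3, 4, 20]))
print(skew_hint([1, 17, 18, 18, 19, 19, 19, 20]))
print(skew_hint([1, 2, 3, 4, 5]))
```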





### 5. Central Tendency Measures

#### Mean (μ or x̄):

- Sum of all values divided by count

- Most affected by outliers

- Uses all data points

#### Median:

- Middle value when data is ordered

- Less affected by outliers

- Robust measure

#### Mode:

- Most frequently occurring value

- May not exist or may be multiple

- Good for categorical data
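A small sketch of how the three measures react to a single extreme value, using Python's standard `statistics` module. The sample values are hypothetical, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

values = [10, 12, 12, 13, 14]      # hypothetical sample
with_outlier = values + [100]      # same sample plus one extreme value

for label, sample in [("original", values), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean={mean(sample):.2f}  "
          f"median={median(sample)}  mode={multimode(sample)}")
```

The mean moves substantially once the extreme value is added, while the median shifts only slightly and the mode does not change.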





### 6. Practical Applications

#### Data Mining Context:

- **Pattern Recognition**: Identifying data distributions

- **Anomaly Detection**: Finding outliers

- **Data Quality Assessment**: Checking for errors

- **Feature Engineering**: Understanding variable distributions

- **Model Selection**: Choosing appropriate algorithms based on data distribution





#### Python Implementation Examples:




```python
import matplotlib.pyplot as plt

# Create frequency distribution
def create_frequency_distribution(data, num_classes=7):
    """Return (boundaries, frequencies) for equal-width classes."""
    min_val, max_val = min(data), max(data)
    class_size = (max_val - min_val) / num_classes

    # Define class boundaries
    boundaries = [min_val + i * class_size for i in range(num_classes + 1)]

    # Count frequencies
    frequencies = []
    for i in range(num_classes):
        lower, upper = boundaries[i], boundaries[i + 1]
        if i == num_classes - 1:
            # The last class has no upper cut-off, so the maximum value is counted
            count = sum(1 for x in data if x >= lower)
        else:
            count = sum(1 for x in data if lower <= x < upper)
        frequencies.append(count)

    return boundaries, frequencies

# Create histogram
def plot_histogram(data, title="Frequency Distribution", bins=7):
    plt.figure(figsize=(10, 6))
    plt.hist(data, bins=bins, edgecolor='black', alpha=0.7)
    plt.title(title)
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()
```






## [Example 1]() - Finding the Mean of a Frequency Distribution

### [Step-by-Step Instructions]()

### [In Words & In Symbols]()

| In Words | In Symbols |
| :-- | :-- |
| 1. Find the midpoint of each class. | $x = \frac{\text{lower limit} + \text{upper limit}}{2}$ |
| 2. Multiply each midpoint by its class frequency and sum the results. | $\sum (x \cdot f)$ |
| 3. Find the sum of all frequencies. | $n = \sum f$ |
| 4. Calculate the mean by dividing the sum from step 2 by the sum from step 3. | $\bar{x} = \frac{\sum (x \cdot f)}{n}$ |





### [Example](): Finding the Mean of a Frequency Distribution

Use the frequency distribution below to approximate the average number of minutes that a sample of internet users spent connected in their last session.




| Class | Midpoint ($x$) | Frequency ($f$) |
| :-- | :--: | :--: |
| 7 – 18 | 12.5 | 6 |
| 19 – 30 | 24.5 | 10 |
| 31 – 42 | 36.5 | 13 |
| 43 – 54 | 48.5 | 8 |
| 55 – 66 | 60.5 | 5 |
| 67 – 78 | 72.5 | 6 |
| 79 – 90 | 84.5 | 2 |





### [Let's compute the products and their sum]():

| Class | Midpoint ($x$) | Frequency ($f$) | $x \cdot f$ |
| :-- | :-- | :-- | :-- |
| 7 – 18 | 12.5 | 6 | 75.0 |
| 19 – 30 | 24.5 | 10 | 245.0 |
| 31 – 42 | 36.5 | 13 | 474.5 |
| 43 – 54 | 48.5 | 8 | 388.0 |
| 55 – 66 | 60.5 | 5 | 302.5 |
| 67 – 78 | 72.5 | 6 | 435.0 |
| 79 – 90 | 84.5 | 2 | 169.0 |
| **Total** |  | **50** | **2089.0** |





### [Therefore, the mean is]():





$$

\Huge

\bar{x} = \frac{\sum (x \cdot f)}{n} = \frac{2089}{50} \approx 41.8 \text{ minutes}

$$
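The same calculation can be reproduced in a few lines of Python. This is a minimal sketch using the class limits and frequencies from the table above; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class in the table
classes = [(7, 18, 6), (19, 30, 10), (31, 42, 13), (43, 54, 8),
           (55, 66, 5), (67, 78, 6), (79, 90, 2)]

# Step 1: midpoint of each class
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
# Step 2: sum of midpoint times frequency
total = sum(x * f for x, (_, _, f) in zip(midpoints, classes))
# Step 3: sum of all frequencies
n = sum(f for _, _, f in classes)
# Step 4: grouped mean
grouped_mean = total / n

print(f"sum(x*f) = {total}, n = {n}, mean = {grouped_mean:.1f} minutes")
```

Because each class is represented by its midpoint, the result is an approximation of the mean of the raw data rather than its exact value.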










### [Shapes of Distributions]()

### Symmetrical Distribution

- A vertical line can be drawn at the middle of the graph, and the halves are nearly identical.









### [Uniform Distribution]() (Rectangular)

- All entries have equal or nearly equal frequencies.

- The distribution is symmetric.





### [Left-Skewed Distribution]() (Negatively Skewed)

- The "tail" of the graph extends more to the left.

- The mean is to the left of the median.




### [Right-Skewed Distribution]() (Positively Skewed)

- The "tail" of the graph extends more to the right.

- The mean is to the right of the median.




### [Finding the Weighted Mean]()

Sometimes, the mean is calculated considering different "weights" for each value.





## [Example 2]()




### [A student's grade is determined based on 5 sources]():

- 50% for the average of exams

- 15% for the midterm exam

- 20% for the final exam

- 10% for computer lab work

- 5% for homework




### [Suppose your grades are]():

- Exam average: 86

- Midterm: 96

- Final Exam: 82

- Lab: 98

- Homework: 100





### [Weighted Mean Calculation Table]()

| Source | Grade ($x$) | Weight ($w$) | $x \cdot w$ |
| :-- | :--: | :--: | :--: |
| Exam Average | 86 | 0.50 | 43.0 |
| Midterm | 96 | 0.15 | 14.4 |
| Final Exam | 82 | 0.20 | 16.4 |
| Lab | 98 | 0.10 | 9.8 |
| Homework | 100 | 0.05 | 5.0 |
| **Sum** |  | **1** | **88.6** |





$$

\Huge

\bar{x} = \frac{\sum (x \cdot w)}{\sum w} = \frac{88.6}{1} = 88.6

$$
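As a cross-check, the weighted mean can be computed directly from the grades and weights listed above. This is a minimal sketch; the dictionary layout and variable names are assumptions made for the example.

```python
# source -> (grade, weight), as given in the example
grades = {
    "Exam average": (86, 0.50),
    "Midterm": (96, 0.15),
    "Final exam": (82, 0.20),
    "Lab": (98, 0.10),
    "Homework": (100, 0.05),
}

weighted_sum = sum(x * w for x, w in grades.values())
total_weight = sum(w for _, w in grades.values())
# Dividing by the total weight keeps the formula valid even if weights do not sum to 1
weighted_mean = weighted_sum / total_weight

print(f"Weighted mean: {weighted_mean:.1f}")
```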









So, the student did [**not**]() get an [A (minimum required is 90)]().





## [Mean of Grouped Data]()

The mean of a frequency distribution is calculated as:





$$

\Huge

\bar{x} = \frac{\sum (x \cdot f)}{n}

$$









Here [x]() is the class midpoint, [f]() is the frequency of the class, and [n]() is the sum of all frequencies (n = Σf).





## [Bibliography]()

[1](). **Castro, L. N. & Ferrari, D. G.** (2016). *Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações*. Saraiva.

[2](). **Ferreira, A. C. P. L. et al.** (2024). *Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina*. 2nd Ed. LTC.

[3](). **Larson & Farber** (2015). *Estatística Aplicada*. Pearson.









## 💌 [Let the data flow... Ping Me !](mailto:fabicampanari@proton.me)





#### 
  🛸๋ My Contacts [Hub](https://linktr.ee/fabianacampanari)




### 
 






  ────────────── 🔭⋆ ──────────────

 ➣➢➤ Back to Top 

#

###### 
 Copyright 2025 Quantum Software Development. Code released under the [MIT License](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE).