{"id":30884257,"url":"https://github.com/quantum-software-development/5-datamining_datacleaning_preparation_anomalies_outlier","last_synced_at":"2026-02-14T15:03:05.652Z","repository":{"id":312496628,"uuid":"1047694532","full_name":"Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier","owner":"Quantum-Software-Development","description":"👩🏻‍🚀 5-Data Mining  - Data Cleaning, Preparation and Detection of Anomalies (Outlier Detectio","archived":false,"fork":false,"pushed_at":"2025-11-11T20:41:39.000Z","size":8660,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-11T22:24:19.634Z","etag":null,"topics":["accuracy-metrics","data-cleaning-and-preprocessing","data-exploratory","fraud-detection","logistic-regression","random-forest","test-model"],"latest_commit_sha":null,"homepage":"https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Quantum-Software-Development.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"Quantum-Software-Development","Custom":"https://github.com/sponsors/Quantum-Software-Development/card"}},"created_at":"2025-08-31T02:19:39.000Z","updated_at":"2025-11-11T20:41:42.000Z","dependencies_parsed_at":"2025-09-03T14:38:03.407Z","dependency_job_id":null,"html_url":"https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier","commit_stats":null,"previous_names":["quantum-software-development/5-datamining","quantum-software-development/5-datamining_datacleaning_preparation_anomalies_outlier"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2F5-DataMining_DataCleaning_Preparation_Anomalies_Outlier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2F5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2F5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Developmen
t%2F5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Quantum-Software-Development","download_url":"https://codeload.github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantum-Software-Development%2F5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29375599,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accuracy-metrics","data-cleaning-and-preprocessing","data-exploratory","fraud-detection","logistic-regression","random-forest","test-model"],"created_at":"2025-09-08T10:10:45.915Z","updated_at":"2026-02-12T18:02:08.517Z","avatar_url":"https://github.com/Quantum-Software-Development.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/Quantum-Software-Development","https://github.com/sponsors/Quantum-Software-Development/card"],"categories":[],"sub_categories":[],"readme":"\n\u003cbr\u003e\n\n**\\[[🇧🇷 Português](README.pt_BR.md)\\] \\[**[🇺🇸 
English](README.md)**\\]**\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n# 5- [Data Mining]() / Data Cleaning, Preparation and Detection of Anomalies (Outlier Detection)\n\n\n\u003c!-- ======================================= Start DEFAULT HEADER ===========================================  --\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n[**Institution:**]() Pontifical Catholic University of São Paulo (PUC-SP)  \n[**School:**]() Faculty of Interdisciplinary Studies  \n[**Program:**]() Humanistic AI and Data Science\n[**Semester:**]() 2nd Semester 2025  \nProfessor:  [***Professor Doctor in Mathematics Daniel Rodrigues da Silva***](https://www.linkedin.com/in/daniel-rodrigues-048654a5/)\n\n\u003cbr\u003e\u003cbr\u003e\n\n#### \u003cp align=\"center\"\u003e [![Sponsor Quantum Software Development](https://img.shields.io/badge/Sponsor-Quantum%20Software%20Development-brightgreen?logo=GitHub)](https://github.com/sponsors/Quantum-Software-Development)\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\u003c!--Confidentiality statement --\u003e\n\n#\n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n\u003e [!IMPORTANT]\n\u003e \n\u003e ⚠️ Heads Up\n\u003e\n\u003e * Projects and deliverables may be made [publicly available]() whenever possible.\n\u003e * The course emphasizes [**practical, hands-on experience**]() with real datasets to simulate professional consulting scenarios in the fields of **Data Analysis and Data Mining** for partner organizations and institutions affiliated with the university.\n\u003e * All activities comply with the [**academic and ethical guidelines of PUC-SP**]().\n\u003e * Any content not authorized for public disclosure will remain [**confidential**]() and securely stored in [private repositories]().  
\n\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n#\n\n\u003c!--END--\u003e\n\n\n\n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n\n\n\u003c!-- PUC HEADER GIF\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/user-attachments/assets/0d6324da-9468-455e-b8d1-2cce8bb63b06\" /\u003e\n--\u003e\n\n\n\u003c!-- video presentation --\u003e\n\n\n##### 🎶 Prelude Suite no.1 (J. S. Bach) - [Sound Design Remix]()\n\nhttps://github.com/user-attachments/assets/4ccd316b-74a1-4bae-9bc7-1c705be80498\n\n####  📺 For better resolution, watch the video on [YouTube.](https://youtu.be/_ytC6S4oDbM)\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\u003e [!TIP]\n\u003e \n\u003e  This repository is part of the Data Mining course in the undergraduate program Humanities, AI and Data Science at PUC-SP.\n\u003e\n\u003e   ### ☞ **Access Data Mining [Main Repository](https://github.com/Quantum-Software-Development/1-Main_DataMining_Repository)**\n\u003e \n\n\u003c!-- =======================================END DEFAULT HEADER ===========================================  --\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n\nThis repository addresses fundamental concepts and methodologies in Data Mining, with an emphasis on **data cleaning, preparation**, and the **identification of anomalies and outliers**. The material is grounded in a comprehensive reference document that integrates theoretical foundations with practical applications, including Python-based implementations for the treatment of heterogeneous and noisy datasets.\n\nIt constitutes a structured starting point for the systematic study and application of Data Mining techniques, particularly those related to data preprocessing, anomaly and outlier detection, and validation. 
The repository also provides contextualized examples and executable Python code to support empirical exploration and reproducibility.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Table of Contents\n\n- [Introduction](#introduction)\n- [Dataset for Study](#dataset-for-study)\n- [Pandas Functions for Data Exploration](#pandas-functions-for-data-exploration)\n- [Key Concepts](#key-concepts)\n  - Anomaly\n  - Outlier\n  - Anomaly Detection\n  - Fraud Detection\n- [Tips for Efficient and Effective Analysis](#tips-for-efficient-and-effective-analysis)\n- [Statistical and Practical Significance](#statistical-and-practical-significance)\n- [Characteristics and Understanding of Data](#characteristics-and-understanding-of-data)\n- [Parsimony Principle in Model Selection](#parsimony-principle-in-model-selection)\n- [Error Checking and Validation](#error-checking-and-validation)\n- [Learning Paradigms](#learning-paradigms)\n- [Applications](#applications)\n- [Sentiment Analysis in Social Networks](#sentiment-analysis-in-social-networks)\n- [Credit Card Fraud Detection](#credit-card-fraud-detection)\n- [Non-Technical Losses in Electrical Energy](#non-technical-losses-in-electrical-energy)\n- [Energy Load Segmentation](#energy-load-segmentation)\n- [Steel Process Modeling](#steel-process-modeling)\n- [Data Cleaning by Zahra Amini](https://github.com/Quantum-Software-Development/1-DataMining_Main_Repository/blob/cb4075948c0ae9f90ead385d620147daf0641f7c/Data%20Cleaning%20by%20Zahra%20Amini%20.pdf)\n- [Objectives](#objectives)\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Introduction\n\nThe exponential growth of data generation necessitates intelligent techniques such as **Data Mining** to extract valuable knowledge from raw data. 
This process involves cleaning, preparing, mining, and validating data to enable effective decision-making.\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Dataset for Study\n\nWe use a small, publicly available **dirty dataset** that exhibits missing values, duplicates, and inconsistencies to demonstrate data cleaning and anomaly detection.\n\nExample dataset: [Titanic dataset](https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv)  \nThis dataset contains missing values and requires preprocessing.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Pandas Functions for Data Exploration\n\n\u003cbr\u003e\n\n- `dataframe.describe()`  \n  Returns a statistical summary of the numeric columns by default: count, mean, std, min, quartiles, and max.\n\n\u003cbr\u003e\n\n- `dataframe.info()`  \n  Prints the number of non-null entries and the data type of each column.\n\n\n\u003cbr\u003e\n\nExample usage:\n\n\u003cbr\u003e\n\n```Python\nimport pandas as pd\n\ndf = pd.read_csv('titanic.csv')\nprint(df.describe())\ndf.info()  # info() prints its summary directly and returns None\n```\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n## Key Concepts\n\n### Anomaly / Outlier\nAnomalies or outliers are data points that deviate significantly from the majority and may indicate errors, rare events, or fraud.\n\n\n\u003cbr\u003e\n\n\n### Anomaly Detection\nTechniques to identify such unusual data points, including statistical, proximity-based, and machine learning methods.\n\n\n\u003cbr\u003e\n\n### Fraud Detection\nIdentifying fraudulent transactions or activities that typically manifest as anomalies in data.\n\n\n\u003cbr\u003e\n\n\n## Tips for Efficient and Effective Analysis\n\n- **Significance of Mining**  \n  - *Statistical significance*: Confidence in results, supported by properly prepared datasets.  
\n  - *Practical significance*: Real-world applicability of insights.\n\n\u003cbr\u003e\n\n\n- **Data Characteristics Influence Results**  \n  The properties of the dataset affect analysis outcomes significantly.\n\n\n\u003cbr\u003e\n\n\n- **Know Your Data**  \n  Preliminary exploration and descriptive statistics help understand data distributions.\n\n\u003cbr\u003e\n\n\n- **Parsimony Principle**  \n  Choose models that balance complexity and interpretability.\n\n\u003cbr\u003e\n\n\n- **Error Verification \u0026 Model Performance**  \n  Check prediction errors, rule significance, and algorithm performance rigorously.\n\n\n\u003cbr\u003e\n\n- **Validation of Results**  \n  Compare multiple methods; assess generalization capacity; combine techniques; involve domain experts to validate findings.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Formulas and Concepts\n\n### Interquartile Range (IQR) rule for outliers:\n\n\u003cbr\u003e\n\n$\\Huge IQR = Q_3 - Q_1$\n\n\u003cbr\u003e\u003cbr\u003e\n\n```latex\n\\Huge IQR = Q_3 - Q_1\n```\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n$\\Huge \\text{Outlier if } x \u003c Q_1 - 1.5 \\times IQR \\text{ or } x \u003e Q_3 + 1.5 \\times IQR$\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n```latex\n\\Huge \\text{Outlier if } x \u003c Q_1 - 1.5 \\times IQR \\text{ or } x \u003e Q_3 + 1.5 \\times IQR\n```\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n### Z-Score for detecting outliers:\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n$\\Huge Z = \\frac{x - \\mu}{\\sigma}$\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n```latex\n\\Huge Z = \\frac{x - \\mu}{\\sigma}\n```\n\n\n\u003cbr\u003e\n\n### [Where](): $x$ is a data point, $\\mu$ is the mean, and $\\sigma$ is the standard deviation.\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Learning Paradigms\n\n\u003cbr\u003e\n\n| Paradigm               | Description                                                            | Example Algorithms                          
|\n|-----------------------|------------------------------------------------------------------------|---------------------------------------------|\n| Supervised Learning    | Training with labeled data; learns mapping from inputs to outputs      | Decision Trees, Random Forest, SVM           |\n| Unsupervised Learning  | Training with unlabeled data; discovers patterns or groups             | K-Means Clustering, DBSCAN, PCA              |\n| Lazy Learning          | Defers generalization until a query is made                            | K-Nearest Neighbors (KNN)                     |\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n### Example: Decision Tree\nA model is trained by partitioning data based on attribute splits optimizing a criterion like information gain.\n\n\u003cbr\u003e\n\n### Example: K-Nearest Neighbors (KNN)\nClassifies new data by looking at the 'k' closest known examples (lazy learning).\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Applications\n\nExtensive use of data mining techniques in:\n\n- Credit analysis and prediction\n- Fraud detection\n- Financial market prediction\n- Customer relationship management\n- Corporate bankruptcy prediction\n- Energy sector\n- Education, logistics, supply chain management\n- Environment, social networks, ecommerce\n\n\u003cbr\u003e\u003cbr\u003e\n\n## Sentiment Analysis in Social Networks\n\nClassifying texts based on expressed sentiments (positive, negative, neutral) to measure public opinion, marketing effectiveness, and product feedback.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n## Non-Technical Losses in Electrical Energy\n\n- **Technical losses**: Intrinsic to electrical systems.\n- **Commercial losses**: Errors, unmeasured consumption, fraud.\n\nData mining supports identifying irregularities and optimizing inspections.\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Energy Load Segmentation\n\nUse clustering to segment typical daily electricity consumption patterns to improve demand 
prediction.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## Steel Process Modeling\n\nData mining to predict chemical composition and optimize production processes in the steel industry.\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n## Credit Card Fraud Detection\n\nFraud categories:\n- **Application Fraud**: Using fake personal info to obtain cards.\n- **Behavioral Fraud**: Unauthorized use of a genuine cardholder's data.\n\nFraud mitigation includes prevention (security measures) and detection (rapid identification of suspicious transactions).\n\n\n\n\n## Python Example: [Titanic - Exploratory Data Analysis]() \n\n\u003cbr\u003e\n\n This code walks through loading data, exploratory analysis, cleaning, outlier detection, normalization, modeling, and validation.\n\n\u003cbr\u003e\n\n\n\u003e [!TIP]\n\u003e\n\u003e [Access Code](https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/blob/fb8df7943f37ca911b41d0c83de18ecad7434f74/titanic_exploratory_analysis/titanic_exploratory_analysis%20.ipynb): Titanic - Exploratory Data Analysis\n\u003e\n\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\u003c!-- =======================================START TITANIC Code =========================================== \n\n```python\nimport pandas as pd\nimport numpy as np\nfrom sklearn.ensemble import IsolationForest\n\n# Load dataset\n\ndf = pd.read_csv('titanic.csv')\n\n# Statistical overview\n\nprint(df.describe())\ndf.info()  # info() prints its summary directly and returns None\n\n# Handling missing values\n\ndf.fillna(df.median(numeric_only=True), inplace=True)  # Impute missing numeric data\n\n# Detecting outliers using Isolation Forest\n\niso_forest = IsolationForest(contamination=0.1)\noutliers = iso_forest.fit_predict(df.select_dtypes(include=[np.number]))\ndf['outlier'] = outliers\n\n# Mark outliers (-1) and normal points (1)\n\nprint(df['outlier'].value_counts())\nprint(df[df['outlier'] == -1])\n\n```\n\n\u003c!-- =======================================TITANIC END  
===========================================  --\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\u003c!-- =======================================START Fraud Detction CODE ===========================================--\u003e\n\n\n# Python Example: [Fraud Detection with Mini Data]()\n\n\u003cbr\u003e\n\n\nBelow is the structured fraud detection code, organized cell by cell. It includes explanations about the dataset, along with additional techniques such as **SMOTE for class balancing**, **Random Forest hyperparameter tuning**, and **model accuracy testing**.\n\nThe evaluation covers key performance metrics, including:\n\n* **Accuracy**\n* **Precision**\n* **Recall**\n* **F1-Score**\n* **ROC-AUC**\n\n**Fraud Detection with Random Forest \u0026 Logistic Regression**\n\n\n\u003cbr\u003e\n\n\u003e [!TIP]\n\u003e\n\u003e [Access Code](https://github.com/Quantum-Software-Development/5-DataMining_DataCleaning_Preparation_Anomalies_Outlier/blob/a33fdc3801ff33ff9a0030c8e735558d374e9b7e/Fraud_Detection_RandonForrest_Logistic_Regression__MiniData/Code/Fraud_Detection_RandonForrest_Logistic_Regression__MiniData.ipynb): Fraud Detection with Random Forest \u0026 Logistic Regression\n\u003e\n\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n##  [Cell 1]() -  Data loading and Initial Understanding\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n```python\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\n# 1. 
Load a smaller dataset (e.g., Iris dataset for binary classification - e.g., Versicolor vs Virginica)\n# Carregar um conjunto de dados menor (por exemplo, conjunto de dados Iris para classificação binária - por exemplo, Versicolor vs Virginica)\nfrom sklearn.datasets import load_iris\niris = load_iris()\ndf = pd.DataFrame(data=iris.data, columns=iris.feature_names)\ndf['target'] = iris.target\n\n# For binary classification, let's use only two classes (e.g., 1 and 2)\n# Para classificação binária, vamos usar apenas duas classes (por exemplo, 1 e 2)\n# .copy() makes df_binary an independent DataFrame, avoiding SettingWithCopyWarning on the assignment below\ndf_binary = df[df['target'].isin([1, 2])].copy()\ndf_binary['target'] = df_binary['target'].replace({1: 0, 2: 1}) # Rename classes to 0 and 1\n# Renomear classes para 0 e 1\n\n# 2. Display the first few rows of the loaded DataFrame.\n# Exibir as primeiras linhas do DataFrame carregado.\nprint(\"First 5 rows of the dataset:\")\n# Primeiras 5 linhas do conjunto de dados:\ndisplay(df_binary.head())\n\n# 3. Display concise information about the DataFrame.\n# Exibir informações concisas sobre o DataFrame.\nprint(\"\\nDataset Info:\")\n# Informações do conjunto de dados:\ndf_binary.info()\n\n# 4. Calculate and display the distribution of the target variable.\n# Calcular e exibir a distribuição da variável alvo.\nprint(\"\\nClass Distribution:\")\n# Distribuição de Classes:\ndisplay(df_binary['target'].value_counts())\n\n# 5. 
Set up matplotlib for dark mode plotting.\n# Configurar matplotlib para plotagem em modo escuro.\nplt.style.use('dark_background')\n\n# Set text color to white for better visibility in dark mode\n# Definir a cor do texto para branco para melhor visibilidade no modo escuro\nplt.rcParams['text.color'] = 'white'\nplt.rcParams['axes.labelcolor'] = 'white'\nplt.rcParams['xtick.color'] = 'white'\nplt.rcParams['ytick.color'] = 'white'\nplt.rcParams['axes.edgecolor'] = 'white'\nplt.rcParams['figure.facecolor'] = '#2b2b2b' # Dark background for figure\nplt.rcParams['axes.facecolor'] = '#2b2b2b' # Dark background for axes\n\n# 6. Define a turquoise color palette.\n# Definir uma paleta de cores turquesa.\nturquoise_palette = ['#40E0D0', '#48D1CC', '#00CED1', '#5F9EA0', '#008B8B']\n```\n\n\u003cbr\u003e\u003cbr\u003e\n\n##  [Cell 2]() - Exploratory Data Analysis (EDA) \n\n\u003cbr\u003e\n\nThis cell gives a first look at the dataset's main characteristics.\nIt plots histograms, per-class box plots, a pair plot, and a correlation heatmap of the features, and saves each figure to `plot_dir`, laying the groundwork for more detailed analysis or modeling.\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n```python\nimport os\n\n# Define the directory for saving plots\nplot_dir = '/content/plots'\nif not os.path.exists(plot_dir):\n    os.makedirs(plot_dir)\n\n# 1. Create histograms for each feature in df_binary with dual-language titles and labels.\n# 1. Criar histogramas para cada característica em df_binary com títulos e rótulos em dois idiomas.\nprint(\"Feature Distributions (Histograms):\")\n# Distribuições das Características (Histogramas):\n# Use only the first color from the palette for histograms\ndf_binary.hist(figsize=(12, 10), color=turquoise_palette[0], bins=15)\nplt.suptitle('Feature Distributions / Distribuições das Características', y=1.02, fontsize=16)\nplt.tight_layout()\nplt.savefig(f'{plot_dir}/feature_histograms.png') # Save before show(); saving afterwards can write a blank figure\nplt.show()\n\n# 2. 
Generate box plots for each feature, comparing distributions across target classes with dual-language titles and labels.\n# 2. Gerar box plots para cada característica, comparando as distribuições entre as classes alvo com títulos e rótulos em dois idiomas.\nprint(\"\\nFeature Distributions by Target Class (Box Plots):\")\n# Distribuições das Características por Classe Alvo (Box Plots):\nfig, axes = plt.subplots(2, 2, figsize=(14, 10))\naxes = axes.flatten()\nfor i, col in enumerate(df_binary.columns[:-1]):\n    # Removed palette argument from boxplot as it's not used with hue and causes a warning\n    sns.boxplot(x='target', y=col, data=df_binary, ax=axes[i])\n    axes[i].set_title(f'{col} Distribution by Target Class / Distribuição de {col} por Classe Alvo')\n    axes[i].set_xlabel('Target Class / Classe Alvo')\n    axes[i].set_ylabel(col)\nplt.tight_layout()\nplt.savefig(f'{plot_dir}/feature_box_plots.png') # Save before show(); saving afterwards can write a blank figure\nplt.show()\n\n# 3. Create a pair plot of the features in df_binary, colored by the 'target' variable, with a dual-language title.\n# 3. Criar um pair plot das características em df_binary, colorido pela variável 'target', com um título em dois idiomas.\nprint(\"\\nPair Plot of Features by Target Class:\")\n# Pair Plot das Características por Classe Alvo:\n# Use only the first two colors from the palette for the two classes\nsns.pairplot(df_binary, hue='target', palette=turquoise_palette[:2], diag_kind='kde')\nplt.suptitle('Pair Plot of Features by Target Class / Pair Plot das Características por Classe Alvo', y=1.02, fontsize=16)\nplt.savefig(f'{plot_dir}/feature_pair_plot.png') # Save before show(); saving afterwards can write a blank figure\nplt.show()\n\n# 4. Calculate and display the correlation matrix for the features in df_binary and visualize it with a heatmap and dual-language titles and labels.\n# 4. 
Calcular e exibir a matriz de correlação para as características em df_binary e visualizá-la com um heatmap e títulos e rótulos em dois idiomas.\nprint(\"\\nCorrelation Matrix:\")\n# Matriz de Correlação:\ncorrelation_matrix_binary = df_binary.corr()\nplt.figure(figsize=(10, 8))\nsns.heatmap(correlation_matrix_binary, annot=True, cmap='viridis', fmt=\".2f\", linewidths=.5)\nplt.title('Correlation Matrix / Matriz de Correlação', fontsize=16)\nplt.xticks(rotation=45, ha='right')\nplt.yticks(rotation=0)\nplt.tight_layout()\nplt.savefig(f'{plot_dir}/correlation_matrix_heatmap.png') # Save before show(); saving afterwards can write a blank figure\nplt.show()\n```\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n### 1- Feature Distributions (Histograms):\n\n\u003cbr\u003e\n\n\u003cimg width=\"1358\" height=\"1188\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/c41a65c8-7725-4b9a-90ad-865c69aefadc\" /\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n### 2- sepal length (cm) Distributions\n\n\u003cbr\u003e\n\n\u003cimg width=\"1740\" height=\"1160\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/d34f548a-a84b-4200-91cf-9a0f8bb09bfb\" /\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n### 3- Pair Plot of Features by Target Class\n\n\u003cbr\u003e\n\n\u003cimg width=\"1230\" height=\"1198\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/26e95e05-6873-438c-8d76-1aa6ce277a7d\" /\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n### 4- Correlation Matrix\n\n\u003cbr\u003e\n\n\u003cimg width=\"1064\" height=\"914\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/0fac7a0a-37aa-4319-a2c6-6189ce46b46f\" /\u003e\n\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n##  [Cell 3]() - Data preparation\n\n\u003cbr\u003e\n\n```python\n# 1. Check for missing values in the df_binary DataFrame and print the count for each column.\n# 1. 
Verificar valores ausentes no DataFrame df_binary e imprimir a contagem para cada coluna.\nprint(\"Checking for missing values / Verificando valores ausentes:\")\nprint(df_binary.isnull().sum())\n\n# 2. If missing values are found, handle them appropriately for numerical data (e.g., imputation with the mean or median).\n# Based on the previous df_binary.info() output, there are no missing values.\n# Com base na saída anterior de df_binary.info(), não há valores ausentes.\n# No action needed for missing values in this case.\n# Nenhuma ação necessária para valores ausentes neste caso.\n\n# 3. Separate the features (X) and the target variable (y) from the df_binary DataFrame.\n# 3. Separar as características (X) e a variável alvo (y) do DataFrame df_binary.\nX = df_binary.drop('target', axis=1)\ny = df_binary['target']\nprint(\"\\nFeatures (X) and Target (y) separated. / Características (X) e Alvo (y) separados.\")\n\n# 4. Scale the numerical features using StandardScaler.\n# Fit the scaler only on the training data to prevent data leakage.\n# 4. Escalar as características numéricas usando StandardScaler.\n# Ajustar o scaler apenas nos dados de treinamento para evitar vazamento de dados.\nfrom sklearn.preprocessing import StandardScaler\n\nscaler = StandardScaler()\n\n# 5. Split the data into training and testing sets.\n# 5. 
Dividir os dados em conjuntos de treinamento e teste.\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Using stratify for balanced classes\n# Usando stratify para classes balanceadas\n\n# Fit and transform the scaler on the training data\n# Ajustar e transformar o scaler nos dados de treinamento\nX_train_scaled = scaler.fit_transform(X_train)\n\n# Transform the test data using the fitted scaler\n# Transformar os dados de teste usando o scaler ajustado\nX_test_scaled = scaler.transform(X_test)\n\nprint(\"\\nData split into training and testing sets (80/20). / Dados divididos em conjuntos de treinamento e teste (80/20).\")\nprint(\"Features scaled using StandardScaler. / Características escaladas usando StandardScaler.\")\nprint(f\"X_train shape: {X_train_scaled.shape}, X_test shape: {X_test_scaled.shape}\")\nprint(f\"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}\")\n```\n\n\u003cbr\u003e\u003cbr\u003e\n\n##  [Cell 4]() Handle class imbalance\n\n\u003cbr\u003e\n\n```python\n# 1. 
Check the class distribution of the training set (y_train) to confirm if class imbalance exists.\n# Print the value counts with a dual-language explanation.\nprint(\"Class distribution in the training set (y_train):\")\n# Distribuição de classes no conjunto de treinamento (y_train):\ndisplay(y_train.value_counts())\n```\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n\n=======================================Still Surfing this Repo 🏄 =========================================== \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n\n\n\u003c!-- ========================== [Bibliographr ====================  --\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## [Bibliography]()\n\n[1](). **Castro, L. N. \u0026 Ferrari, D. G.** (2016). *Introduction to Data Mining: Basic Concepts, Algorithms, and Applications*. Saraiva.\n\n[2](). **Ferreira, A. C. P. L. et al.** (2024). *Artificial Intelligence – A Machine Learning Approach*. 2nd Ed. LTC.\n\n[3](). **Larson \u0026 Farber** (2015). *Applied Statistics*. Pearson.\n\n\u003cbr\u003e\u003cbr\u003e\n\n      \n\u003c!-- ======================================= Bibliography Portugues ===========================================  --\u003e\n\n\u003c!--\n\n## [Bibliography]()\n\n\n[1](). **Castro, L. N. \u0026 Ferrari, D. G.** (2016). *Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações*. Saraiva.\n\n[2](). **Ferreira, A. C. P. L. et al.** (2024). *Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina*. 2nd Ed. LTC.\n\n[3](). **Larson \u0026 Farber** (2015). *Estatística Aplicada*. 
Pearson.\n\n\n\u003cbr\u003e\u003cbr\u003e\n--\u003e\n\n\u003c!-- ======================================= Start Footer ===========================================  --\u003e\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n## 💌 [Let the data flow... Ping Me !](mailto:fabicampanari@proton.me)\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n#### \u003cp align=\"center\"\u003e  🛸๋ My Contacts [Hub](https://linktr.ee/fabianacampanari)\n\n\n\u003cbr\u003e\n\n### \u003cp align=\"center\"\u003e \u003cimg src=\"https://github.com/user-attachments/assets/517fc573-7607-4c5d-82a7-38383cc0537d\" /\u003e\n\n\n\n\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n\u003cp align=\"center\"\u003e  ────────────── 🔭⋆ ──────────────\n\n\n\u003cp align=\"center\"\u003e ➣➢➤ \u003ca href=\"#top\"\u003eBack to Top \u003c/a\u003e\n\n\u003c!--\n\u003cp align=\"center\"\u003e  ────────────── ✦ ──────────────\n--\u003e\n\n\n\n\u003c!-- Programmers and artists are the only professionals whose hobby is their profession.\"\n\n\" I love people who are committed to transforming the world \"\n\n\" I'm big fan of those who are making waves in the world! \"\n\n##### \u003cp align=\"center\"\u003e( Rafael Lain ) \u003c/p\u003e   --\u003e\n\n#\n\n###### \u003cp align=\"center\"\u003e Copyright 2025 Quantum Software Development. Code released under the [MIT License.](https://github.com/Quantum-Software-Development/Math/blob/3bf8270ca09d3848f2bf22f9ac89368e52a2fb66/LICENSE)\n\n\n\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantum-software-development%2F5-datamining_datacleaning_preparation_anomalies_outlier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquantum-software-development%2F5-datamining_datacleaning_preparation_anomalies_outlier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantum-software-development%2F5-datamining_datacleaning_preparation_anomalies_outlier/lists"}