{"id":22814250,"url":"https://github.com/devinterview-io/scikit-learn-interview-questions","last_synced_at":"2025-04-13T23:22:07.486Z","repository":{"id":216162512,"uuid":"740618597","full_name":"Devinterview-io/scikit-learn-interview-questions","owner":"Devinterview-io","description":"🟣 Scikit-Learn interview questions and answers to help you prepare for your next machine learning and data science interview in 2024.","archived":false,"fork":false,"pushed_at":"2024-01-08T18:01:05.000Z","size":13,"stargazers_count":14,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-27T13:39:43.682Z","etag":null,"topics":["ai-interview-questions","coding-interview-questions","coding-interviews","data-science","data-science-interview","data-science-interview-questions","data-scientist-interview","interview-practice","interview-preparation","machine-learning","machine-learning-and-data-science","machine-learning-interview","machine-learning-interview-questions","scikit-learn","scikit-learn-interview-questions","scikit-learn-questions","scikit-learn-tech-interview","software-engineer-interview","technical-interview-questions"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Devinterview-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-08T18:00:11.000Z","updated_at":"2025-03-05T08:42:56.000Z","dependencies_parsed_at":"2024-01-08T19:30:46.730Z","dependency_job_id":"2ba3c1e2-666b-4164-9e55-9d602083331d","html_url":"https://github.com/Devinterview-io/scikit-learn-interview-questio
ns","commit_stats":null,"previous_names":["devinterview-io/scikit-learn-interview-questions"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fscikit-learn-interview-questions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fscikit-learn-interview-questions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fscikit-learn-interview-questions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fscikit-learn-interview-questions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Devinterview-io","download_url":"https://codeload.github.com/Devinterview-io/scikit-learn-interview-questions/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248795126,"owners_count":21162716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-interview-questions","coding-interview-questions","coding-interviews","data-science","data-science-interview","data-science-interview-questions","data-scientist-interview","interview-practice","interview-preparation","machine-learning","machine-learning-and-data-science","machine-learning-interview","machine-learning-interview-questions","scikit-learn","scikit-learn-interview-questions","scikit-learn-questions","scikit-learn-tech-interview","software-engineer-interview","technical-interview-questions"],"created_at":"2024-12-12T13:07:52.17
9Z","updated_at":"2025-04-13T23:22:07.442Z","avatar_url":"https://github.com/Devinterview-io.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Top 50 Scikit-Learn Interview Questions\n\n\u003cdiv\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://devinterview.io/questions/machine-learning-and-data-science/\"\u003e\n\u003cimg src=\"https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/github-blog-img%2Fmachine-learning-and-data-science-github-img.jpg?alt=media\u0026token=c511359d-cb91-4157-9465-a8e75a0242fe\" alt=\"machine-learning-and-data-science\" width=\"100%\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\n#### You can also find all 50 answers here 👉 [Devinterview.io - Scikit-Learn](https://devinterview.io/questions/machine-learning-and-data-science/scikit-learn-interview-questions)\n\n\u003cbr\u003e\n\n## 1. What is _Scikit-Learn_, and why is it popular in the field of _Machine Learning_?\n\n**Scikit-Learn**, an open-source Python library, is a leading solution for machine learning tasks. 
Its simplicity, versatility, and consistent performance across different ML methods and datasets have earned it tremendous popularity.\n\n### Key Features\n\n- **Straightforward Interface**: Intuitive API design simplifies the implementation of various ML tasks, ranging from data preprocessing to model evaluation.\n\n- **Model Selection and Automation**: Scikit-Learn provides techniques for extensive hyperparameter optimization and model evaluation, reducing the burden on developers in these areas.\n\n- **Consistent Model Objects**: All models and techniques in Scikit-Learn are implemented as unified Python objects, ensuring a standardized approach.\n\n- **Robustness and Flexibility**: Many algorithms and models in Scikit-Learn come with adaptive features, catering to diverse requirements.\n\n- **Versatile Tools**: Apart from standard supervised and unsupervised models, Scikit-Learn offers utilities for feature selection and pipeline construction, allowing for seamless integration of multiple methods.\n\n### Model Consistency\n\nScikit-Learn maintains a **consistent model interface** adaptable to a plethora of use-cases. 
This structure sculpts model-training and prediction procedures into recognizable patterns.\n\n  - **Three Basic Techniques**: Users uniformly use `fit()` for model training, `predict()` for data inference, and `score()` for performance evaluation, simplifying interaction with distinct models.\n\n### Versatility and Go-To Algorithms\n\nScikit-Learn presents an extensive suite of algorithms, especially catering to fundamental ML tasks.\n\n- **Supervised Learning**: Scikit-Learn houses methods for everything from linear and tree-based models to support vector machines and neural networks.\n\n- **Unsupervised Learning**: Clustering and dimensionality reduction are seamlessly achieved using the library's tools.\n\n- **Hyperparameter Tuning**: Feature-rich options for grid search and randomized search streamline the process.\n\n- **Feature Selection**: Employ varied selection techniques to isolate meaningful predictors.\n\u003cbr\u003e\n\n## 2. Explain the design principles behind _Scikit-Learn's API_.\n\n**Scikit-Learn** aims to provide a consistent and user-friendly interface for various machine learning tasks. Its API design is grounded in several key principles to ensure clarity, modularity, and versatility.\n\n### Core API Principles\n\n- **Consistency**: The API adheres to a consistent design pattern across all its modules.\n  \n- **Non-Redundancy**: It avoids redundancy by drawing on general routines for common tasks. This keeps the API concise and unified across different algorithms.\n\n### Data Representation\n\n- **Data as Rectangular Arrays**: Scikit-Learn algorithms expect input data to be stored in a two-dimensional array or a matrix-like object. 
This ensures **data is homogeneous** and can be accessed efficiently using NumPy.\n\n- **Encoded Targets**: Classifiers accept integer or string class labels directly and encode them internally; categorical input features, by contrast, must be converted to numbers before they reach most estimators.\n\n### Model Fitting and Predictions\n\n- **Fit then Transform**: The API separates learning from data (`fit`) from applying what was learned (`transform`). When several transformations are involved, pipelines keep the sequence consistent and reusable.\n\n- **Fitted State in Transformers**: Preprocessing operations like feature scaling and imputation learn statistics during `fit` and reuse them in every later `transform` call. Calling `fit` or `fit_transform` again discards the previous state and learns it anew.\n\n- **Predict Method**: After fitting, models use the `predict` method to produce predictions or labels.\n\n### Unsupervised Learning\n\n- **transform Method**: Many unsupervised estimators have a `transform` method that maps inputs to a new representation, for example extracted features or distances to cluster centers, as a step distinct from initial fitting.\n\n### Composability and Provenance\n\n- **Make Predictions with Immutable Parts**: A model's prediction phase depends only on its parameters. 
Calling `predict` never modifies the **fitted state**, ensuring consistency.\n\n- **Pipelines for Chaining Steps**: Pipelines harmonize data processing and modeling stages, providing a single interface for both.\n\n- **Feature and Model Names**: For **interpretability**, Scikit-Learn uses string identifiers for model and feature names.\n\n  Example: In text classification, a feature may be \"wordcount\" or \"tf_idf\" instead of the raw text itself.\n\n### Model Evaluation\n\n- **Separation of Concerns**: A distinct set of classes and functions is dedicated to model selection and evaluation, like `GridSearchCV` or `cross_val_score`.\n\n### Task-Specific Estimators\n\nScikit-Learn features specialized estimators for distinct tasks:\n\n- **Classifier**: For binary or multi-class classification tasks.\n- **Regressor**: For continuous target variables in regression problems.\n- **Clusterer**: For unsupervised clustering.\n- **Transformer**: For data transformation, like dimensionality reduction or feature selection.\n\nThis categorization makes it simple to pinpoint the right estimator for a given task.\n\n### The Golden Rules of the Scikit-Learn API\n\n1. **Know the Estimator You Are Using**: There are various supported tasks, but different estimators can't be coerced to accommodate tasks outside their primary wheelhouse.\n\n2. **Be Mindful of Your Data**: Preprocess your data consistently and according to the estimator's requirements using data transformers and pipelines.\n\n3. **Respect the Training-Evaluation Distinction**: Training on one dataset and evaluating on another isn't merely an option; it's a careful protocol that exposes overfitting instead of hiding it.\n\n4. **Use Clear, Descriptive Feature and Model Identifiers**: Knowing what was used where can sometimes be just as important as knowing the numeric result of a prediction or transformation.\n\n5. 
**Remember the Task at Hand**: Always keep in mind the specifics of your problem (classification versus regression, supervised versus unsupervised) so you can pick the best tool for the job.\n\u003cbr\u003e\n\n## 3. How do you handle _missing values_ in a dataset using _Scikit-Learn_?\n\nFor handling **missing values** in a dataset, scikit-learn provides several tools and techniques. These include:\n\n### Imputation\n\nImputation replaces missing values with substitutes. Scikit-learn's `SimpleImputer` offers the first two groups of strategies below, while the separate `KNNImputer` class covers the third:\n\n- **Mean, Median, Most Frequent**: Fills in with the mean, median, or mode of the non-missing values in the column.\n- **Constant**: Assigns a fixed value to all missing entries.\n- **KNN**: Uses the k-Nearest Neighbors algorithm (`KNNImputer`) to determine an appropriate value based on other instances' known feature values.\n\nHere is the Python code:\n\n```python\nfrom sklearn.impute import SimpleImputer\nimport numpy as np\n\n# Example data\nX = np.array([[1, 2], [np.nan, 3], [7, 6]])\n\n# SimpleImputer uses strategy='mean' by default\nimp_mean = SimpleImputer()\nX_mean = imp_mean.fit_transform(X)\n\nprint(X_mean)  # Result: [[1. 2.], [4. 3.], [7. 6.]]\n```\n\n### K-Means and Missing Values\n\nSome estimators, **K-Means** among them, cannot handle missing values on their own. In that case, first fill the gaps with one of the strategies provided by `SimpleImputer`, then fit `KMeans` on the imputed data.\n\u003cbr\u003e\n\n## 4. Describe the role of _transformers_ and _estimators_ in _Scikit-Learn_.\n\n**Scikit-Learn** employs two primary components for machine learning: **transformers** and **estimators**.\n\n### Transformers\n\n**Transformers** are objects that map data into a new format, usually for feature extraction, scaling, or dimensionality reduction. 
They perform this transformation using the `.transform()` method.\n\nSome common transformers include the `MinMaxScaler` for feature scaling, `PCA` for dimensionality reduction, and `CountVectorizer` for text preprocessing.\n\n#### Example: MinMaxScaler\n\nHere is the Python code:\n\n```python\nfrom sklearn.preprocessing import MinMaxScaler\n\n# Creating the scaler object\nscaler = MinMaxScaler()\n\n# Fitting the data and transforming it\ndata_transformed = scaler.fit_transform(original_data)\n```\n\nIn this example, we fit the transformer on the original data and then transform that data into a new format.\n  \n### Estimators\n\n**Estimators** represent models that learn from data, making predictions or influencing other algorithms. The principal methods used by estimators are `.fit()` to learn from the data and `.predict()` to make predictions on new data.\n\nOne example of an estimator is the `RandomForestClassifier`, which is a machine learning model used for classification tasks.\n\n#### Example: RandomForestClassifier\n\nHere is the Python code:\n\n```python\nfrom sklearn.ensemble import RandomForestClassifier\n\n# Creating the classifier object\nclf = RandomForestClassifier()\n\n# Fitting the classifier on training data\nclf.fit(X_train, y_train)\n\n# Making predictions on the test set\ny_pred = clf.predict(X_test)\n```\n\nIn this example, `X_train` and `y_train` represent the input features and output labels of the training set, respectively. The classifier is trained using these datasets. After training, it can be used to make predictions on new, unseen data represented by `X_test`.\n\u003cbr\u003e\n\n## 5. What is the typical workflow for building a _predictive model_ using _Scikit-Learn_?\n\nWhen using **Scikit-Learn** for building predictive models, you'll typically follow these seven steps in a **methodical workflow**:\n\n### Scikit-Learn Workflow Steps\n\n1. **Acquiring** the Data: This step involves obtaining your data from a variety of sources.\n2. 
**Preprocessing** the Data: Data preprocessing includes tasks such as cleaning, transforming, and splitting the data.\n3. **Defining** the Model: This step involves choosing the type of model that best fits your data and problem.\n4. **Training** the Model: Here, the model is fitted to the training data.\n5. **Evaluating** the Model: The model's performance is assessed using testing data or cross-validation techniques.\n6. **Fine-Tuning** the Model: Various methods, such as hyperparameter tuning, can improve the model's performance.\n7. **Deploying** the Model: The trained and validated model is put to use for making predictions.\n\n### Code Example: Workflow Steps\n\nHere is the Python code:\n\n```python\n# Step 1: Acquire the Data\nimport pandas as pd\nfrom sklearn.datasets import load_iris\n\n# Load the Iris dataset\niris = load_iris()\nX, y = iris.data, iris.target\ndf = pd.DataFrame(data=iris.data, columns=iris.feature_names)\n\n# Step 2: Preprocess the Data\nfrom sklearn.model_selection import train_test_split\n# Split the data into training and testing sets\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Step 3: Define the Model\nfrom sklearn.tree import DecisionTreeClassifier\n# Initialize the model\nmodel = DecisionTreeClassifier()\n\n# Step 4: Train the Model\n# Fit the model to the training data\nmodel.fit(X_train, y_train)\n\n# Step 5: Evaluate the Model\nfrom sklearn.metrics import accuracy_score\n# Make predictions\ny_pred = model.predict(X_test)\n# Assess accuracy\naccuracy = accuracy_score(y_test, y_pred)\nprint(f\"Model Accuracy: {accuracy:.2f}\")\n\n# Step 6: Fine-Tune the Model\nfrom sklearn.model_selection import GridSearchCV\n# Define the parameter grid to search\nparam_grid = {'max_depth': [3, 4, 5]}\n# Initialize the grid search\ngrid_search = GridSearchCV(model, param_grid, cv=5)\n# Conduct the grid search\ngrid_search.fit(X_train, y_train)\n# Get the best parameters\nbest_params = 
grid_search.best_params_\nprint(f\"Best Parameters: {best_params}\")\n\n# GridSearchCV (refit=True by default) has already refit the best estimator on the full training set\nbest_model = grid_search.best_estimator_\n\n# Step 7: Deploy the Model\n# Use the deployed model to make predictions\nnew_data = [[5.1, 3.5, 1.4, 0.2], [6.2, 3.4, 5.4, 2.3]]\npredictions = best_model.predict(new_data)\nprint(f\"Predicted Classes: {predictions}\")\n```\n\u003cbr\u003e\n\n## 6. How can you _scale features_ in a dataset using _Scikit-Learn_?\n\n**Feature scaling** is a crucial step for many machine learning algorithms. It involves transforming numerical features to a \"standard\" scale, often leading to better model performance. **Scikit-Learn** offers convenient methods for feature scaling.\n\n### Methods for Feature Scaling\n\n1. **Min-Max Scaling**: Rescales data to a specific range ($[0, 1]$ by default) using the formula:\n\n   $$X_{\\text{scaled}} = \\frac{X - X_{\\text{min}}}{X_{\\text{max}} - X_{\\text{min}}}$$\n\n   ```python\n   from sklearn.preprocessing import MinMaxScaler\n   min_max_scaler = MinMaxScaler()\n   X_minmax = min_max_scaler.fit_transform(X)\n   ```\n\n2. **Standardization**: Rescales the data to have a mean of $0$ and a standard deviation of $1$ using the formula:\n\n   $$X_{\\text{standardized}} = \\frac{X - \\mu}{\\sigma}$$\n\n   ```python\n   from sklearn.preprocessing import StandardScaler\n   std_scaler = StandardScaler()\n   X_std = std_scaler.fit_transform(X)\n   ```\n\n3. **Robust Scaling**: Centers data on the median and scales it by the interquartile range (IQR), making it robust to outliers.\n\n   $$X_{\\text{robust}} = \\frac{X - \\text{median}(X)}{Q_3(X) - Q_1(X)}$$\n\n   ```python\n   from sklearn.preprocessing import RobustScaler\n   robust_scaler = RobustScaler()\n   X_robust = robust_scaler.fit_transform(X)\n   ```\n\u003cbr\u003e\n\n## 7. 
Explain the concept of a _pipeline_ in _Scikit-Learn_.\n\nA **pipeline** in Scikit-Learn chains a sequence of data transformations and a final model into a single object, so that fitting and predicting run through every step in order.\n\n### Core Components\n\n1. **Pre-Processors**: These perform any necessary data transformations, such as imputation of missing values, feature scaling, and feature selection.\n\n2. **Estimators**: These represent any model or algorithm for learning from data. The final step is typically a classifier or a regressor.\n\n### Benefits of Using Pipelines\n\n- **Streamlined Code**: Piping together several data processing steps makes the code look cleaner and easier to understand.\n- **Reduced Data Leakage**: During cross-validation, a pipeline fits its transformers on the training portion of each split only and then applies them to the held-out portion, which avoids leaking test information into preprocessing.\n- **Cross-Validation Integration**: Pipelines are supported within cross-validation and grid search, enabling fine-tuning of the entire workflow at once.\n\n### Code Example: Pipelining in Scikit-Learn\n\nHere is the Python code:\n\n```python\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n# Fake or dummy data for illustration (two samples per class).\nX = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]\ny = [0, 1, 0, 1]\n\n# Define pipeline components\nimputer = SimpleImputer(strategy='mean')\nscaler = MinMaxScaler()\nclassifier = RandomForestClassifier()\n\n# Construct the pipeline\npipeline = make_pipeline(imputer, scaler, classifier)\n\n# Perform cross-validation with the pipeline\n# cv cannot exceed the number of samples in the smallest class, so use 2 folds here\nscores = cross_val_score(pipeline, X, y, cv=2)\n```\n\nIn this example, the pipeline consolidates three essential steps:\n\n1. **Data Imputation**: Use mean to fill missing or NaN values.\n2. **Data Scaling**: Use Min-Max scaling.\n3. 
**Model Building and Training**: Fit a `RandomForestClassifier`.\n\nOnce the pipeline is set up, training or predicting is a one-step process, like so:\n\n```python\npipeline.fit(X_train, y_train)  # Train the pipeline.\npredicted = pipeline.predict(X_test)  # Use the pipeline to make predictions.\n```\n\u003cbr\u003e\n\n## 8. What are some of the main categories of _algorithms_ included in _Scikit-Learn_?\n\n**Scikit-Learn** provides a diverse array of algorithms. Here are the main categories for supervised and unsupervised learning.\n\n### Supervised Learning Algorithms\n\n#### Regression\n\n- **Linear Regression**: Establishes linear relationships between features and target.\n- **Ridge, Lasso and ElasticNet**: Linear models with L2, L1, and combined L1/L2 regularization, respectively.\n\n#### Classification\n\n- **Decision Trees \u0026 Random Forest**: Uses tree structures for decision-making.\n- **SVM (Support Vector Machine)**: Separates data into classes using a maximum-margin hyperplane.\n- **K-Nearest Neighbors (K-NN)**: Classifies based on the majority label among the k nearest neighbors.\n\n#### Ensembles\n\n- **AdaBoost, Gradient Boosting**: Combines multiple weak learners to form a strong model.\n\n#### Neural Networks\n\n- **Multi-layer Perceptron**: A type of feedforward neural network.\n\n### Unsupervised Learning Algorithms\n\n#### Clustering\n\n- **K-Means**: Divides data into k clusters based on centroids.\n- **Hierarchical \u0026 DBSCAN**: DBSCAN infers the number of clusters from data density, and hierarchical clustering can cut its tree at a distance threshold instead of a fixed cluster count.\n\n#### Dimensionality Reduction\n\n- **PCA (Principal Component Analysis)**: Reduces feature dimensionality based on variance.\n- **LDA (Linear Discriminant Analysis)**: Reduces dimensions while maintaining class separability. Because it needs class labels, it is a supervised technique, listed here for its role in dimensionality reduction.\n\n#### Outlier Detection\n\n- **One Class SVM**: Identifies observations that deviate from the majority.\n\n#### Decomposition and Feature Selection\n\n- **FastICA, NMF, VarianceThreshold**: `FastICA` and `NMF` decompose signals into components; `VarianceThreshold` removes low-variance features.\n\u003cbr\u003e\n\n## 9. 
How do you encode _categorical variables_ using _Scikit-Learn_?\n\nIn **Scikit-Learn**, you can use various techniques to encode **categorical variables**.\n\n### Categorical Encoding Techniques\n\n- **OrdinalEncoder**: Maps each category to an integer. Works well when the categories have an inherent order, which can be passed explicitly through the `categories` parameter.\n\n- **OneHotEncoder**: Creates one **binary** column per category to avoid assuming any ordinal relationship. Ideal for nominal (unordered) categories.\n\n- **LabelBinarizer**: One-hot encodes labels in a one-vs-all fashion. It is intended for the target `y` rather than for feature columns, and it handles multi-class labels as well as binary ones.\n\n### Example: Using `OneHotEncoder`\n\nHere is the Python code:\n\n```python\nfrom sklearn.preprocessing import OneHotEncoder\nimport pandas as pd\n\n# Example data\ndata = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})\n\n# Initializing and fitting OneHotEncoder\nencoder = OneHotEncoder()\nencoded_data = encoder.fit_transform(data[['Color']])\n\n# Converting to DataFrame for visibility\nencoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['Color']))\n\n# Displaying encoded DataFrame\nprint(encoded_df)\n```\n\n### Example: Using `LabelBinarizer`\n\nHere is the Python code:\n\n```python\nfrom sklearn.preprocessing import LabelBinarizer\nimport pandas as pd\n\n# Example data\ndata = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})\n\n# Initializing and fitting LabelBinarizer\nbinarizer = LabelBinarizer()\nencoded_data = binarizer.fit_transform(data['Color'])\n\n# Converting to DataFrame for visibility\nencoded_df = pd.DataFrame(encoded_data, columns=binarizer.classes_)\n\n# Displaying encoded DataFrame\nprint(encoded_df)\n```\n\u003cbr\u003e\n\n## 10. 
What are the strategies provided by _Scikit-Learn_ to handle _imbalanced datasets_?\n\n**Imbalanced datasets** pose a challenge in machine learning because the frequency of different classes is disproportionate, often leading to biased models.\n\n### Techniques to Handle Imbalance\n\n#### Weighted Loss Function\n\nBy assigning different weights to classes, you can make the model prioritize the minority class. For instance, in a binary classification problem with an imbalanced dataset, you can use `class_weight` in classifiers like `LogisticRegression` or `SVC`.\n\nExample with `LogisticRegression`:\n\n```python\nfrom sklearn.linear_model import LogisticRegression\n\n# Set class_weight to 'balanced' or a custom weight\nclf = LogisticRegression(class_weight='balanced')\n```\n\n#### Resampling\n\n**Oversampling** involves replicating examples in the minority class, while **undersampling** reduces the number of examples in the majority class. This achieves a better balance for training.\n\nScikit-Learn itself offers only the general-purpose `sklearn.utils.resample` helper; dedicated samplers come from third-party libraries like `imbalanced-learn`.\n\nExample using `imbalanced-learn`:\n\n```python\nfrom imblearn.over_sampling import RandomOverSampler\n\nover_sampler = RandomOverSampler()\nX_train_resampled, y_train_resampled = over_sampler.fit_resample(X_train, y_train)\n```\n\n#### Focused Model Evaluation\n\nAccuracy can be misleading on imbalanced datasets. These metrics are often more informative:\n\n- The **Area Under the Receiver Operating Characteristic Curve** (AUC-ROC), which measures how well the model ranks positives above negatives.\n- **Precision-Recall** metrics, which focus on the performance of the minority class.\n\nIn Scikit-Learn, you can use `roc_auc_score` and `average_precision_score` for these metrics.\n\n### Key Considerations\n\n- **Resampling** can introduce bias or overfitting. Resample only the training data and validate models carefully on untouched data.\n- **Weighted Loss Functions** are an easy way to address imbalance but may not always be sufficient. 
Balanced weights are a good starting point, but your problem might require custom weights.\n\u003cbr\u003e\n\n## 11. How do you split a dataset into _training and testing sets_ using _Scikit-Learn_?\n\n**Train-Test Split** is a fundamental step in machine learning model development for evaluating **model performance**.\n\nScikit-Learn, through its `model_selection` module, provides the `train_test_split` function for this task:\n\n### Code Example: Train-Test Split\n\nHere is the Python code:\n\n```python\nfrom sklearn.model_selection import train_test_split\nimport numpy as np\n\n# Data\nX, y = np.arange(10).reshape((5, 2)), range(5)\n\n# Split with a test size ratio (commonly 80-20 or 70-30)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\n# Or give an absolute number of samples for the test set\n# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2)\n```\n\u003cbr\u003e\n\n## 12. Describe the use of `ColumnTransformer` in _Scikit-Learn_.\n\nThe **ColumnTransformer** utility in `Scikit-Learn` allows for independent preprocessing of different feature types or subsets (columns) of the input data.\n\n### Key Use Cases\n\n- **Multi-Modal Feature Processing**: For datasets where features are of different types (e.g., text, numerical, categorical), `ColumnTransformer` is particularly useful.\n- **Pipelining for Specific Features**: The tool applies specific transformers to chosen subsets of the feature space, allowing for focused pre-processing.\n- **Simplifying Transformation Pipelines**: When there are multiple features and multiple steps in the data transformation process, `ColumnTransformer` can help manage the complexity.\n\n### Core Components and Concepts\n\n- **Transformers**: These translate data from its original format to a format suitable for ML models.\n- **Transformations**: These are the operations or `Callables` that the transformers perform on the input data.\n- 
**Feature Groups**: The data features are divided into groups or subsets, and each group is associated with its own transformation process, defined by a transformer. These feature groups correspond to the columns of the input dataset.\n\n### Code Example: ColumnTransformer\n\nHere is how to use `ColumnTransformer` with multiple pre-processing steps, each step tailored to a specific subset of columns:\n\n```python\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import Normalizer, StandardScaler, OneHotEncoder\nfrom sklearn.impute import SimpleImputer\n\n# Defining the ColumnTransformer\npreprocessor = ColumnTransformer(\n    transformers=[\n        ('num', StandardScaler(), ['numerical_feature_1', 'numerical_feature_2']),\n        ('num2', Normalizer(), ['numerical_feature_3']),\n        ('cat', OneHotEncoder(), ['categorical_feature_1', 'categorical_feature_2']),\n        ('drop_col', 'drop', ['column_to_drop']),\n        ('fill_unk', SimpleImputer(strategy='constant', fill_value='Unknown'), ['categorical_feature_with_nan']),\n        ('default', 'passthrough', ['remaining_col_1'])  # Keep this column unchanged\n    ]\n)\n\n# Applying the ColumnTransformer (`data` is a DataFrame holding the columns named above)\ntransformed_data = preprocessor.fit_transform(data)\n```\n\nIn the example above:\n\n- Columns `numerical_feature_1` and `numerical_feature_2` undergo z-score standardization.\n- `numerical_feature_3` is rescaled by `Normalizer`, which works per sample (row) rather than per column.\n- We use one-hot encoding for `categorical_feature_1` and `categorical_feature_2`.\n- We drop `column_to_drop`.\n- For `categorical_feature_with_nan`, we replace `NaN` values with a constant ('Unknown').\n- `remaining_col_1` is passed through without any transformation (`'passthrough'`). Columns not listed in any transformer are dropped, because the `remainder` parameter defaults to `'drop'`; set `remainder='passthrough'` to keep them.\n\u003cbr\u003e\n\n## 13. 
What _preprocessing steps_ would you take before inputting data into a _machine learning algorithm_?\n\nBefore feeding data into a machine learning algorithm, it is crucial to **pre-process** it. This involves several steps:\n\n### Data Preprocessing Steps\n\n1. **Handling Missing Data**: Remove, impute, or flag missing values.\n2. **Handling Categorical Data**: Convert categorical data to numerical form.\n3. **Scaling and Normalization**: Rescale numerical data to a similar range.\n4. **Splitting Data for Training and Testing**: Split the dataset to evaluate model performance. In code, split before fitting any transformer so that test data does not leak into the learned statistics.\n5. **Feature Engineering**: Generate new features or transform existing ones for better model performance.\n\n### Scikit-Learn Tools for Data Preprocessing\n\n1. **SimpleImputer**: Fills missing values.\n2. **OneHotEncoder**: Encodes categorical data as one-hot vectors.\n3. **StandardScaler**: Standardizes numerical data to have zero mean and unit variance.\n4. **MinMaxScaler**: Rescales numerical data to a specific range.\n5. **train_test_split**: Divides data into training and testing sets.\n6. **PolynomialFeatures**: Generates polynomial features.\n\n### Code Example: Data Preprocessing\n\nHere is the Python code:\n\n```python\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder, MinMaxScaler\nfrom sklearn.model_selection import train_test_split\n\n# Data\nX = ...  # Features (a pandas DataFrame)\ny = ...  # Target\nnum_cols = ...  # Names of the numerical columns\ncat_cols = ...  # Names of the categorical columns\n\n# 1. Train-Test Split first, so the test set never influences the fitted statistics\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\n# 2. Impute and scale numerical columns; impute and one-hot encode categorical columns\npreprocessor = ColumnTransformer([\n    ('num', make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler()), num_cols),\n    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown='ignore')), cat_cols),\n])\n\n# 3. Fit on the training set only, then apply the same transformation to the test set\nX_train_prepared = preprocessor.fit_transform(X_train)\nX_test_prepared = preprocessor.transform(X_test)\n```\n\u003cbr\u003e\n\n## 14. 
Explain how `Imputer` works in _Scikit-Learn_ for dealing with _missing data_.\n\nThe original **Imputer** class lived in `sklearn.preprocessing` and was removed in scikit-learn 0.22. Its replacement, **SimpleImputer** in `sklearn.impute`, works the same way and is what current code should use.\n\n### Core Functionality\n\nUsing one of several strategies, `SimpleImputer` takes in your feature matrix and replaces missing values with appropriate data.\n\nThe process can be summarized as follows:\n\n1. **Fit**: The imputer estimates the fill statistic for each feature from the training data. This is done using the `fit` method.\n2. **Transform**: The missing values in the training data are then replaced with the learned statistics. This is accomplished using the `transform` method.\n3. **Transform new data**: After fitting, the imputer can replace missing values in new data in a consistent fashion. Use `fit_transform` on the training data, which combines the `fit` and `transform` operations, and plain `transform` on new data so that the statistics learned from the training set are reused.\n\n### Core Methods\n\n- **fit(X)**: Learns the required statistics from the training data.\n- **transform(X)**: Uses the learned statistics to replace missing data points in the dataset (self-contained operation, does not modify the imputer itself).\n- **fit_transform(X)**: Combines the training and transformation processes for convenience.\n- **statistics_**: After fitting, the `statistics_` attribute holds the fill value learned for each feature.\n\n### Common Strategies for Imputation\n\n- **Mean**: Substitutes missing values with the mean of the feature.\n- **Median**: Replaces missing entries with the median of the feature.\n- **Most Frequent**: Uses the mode of the feature for imputation.\n- **Constant**: Allows you to specify a constant value for filling in missing data.\n\n### Code Example: Using an Imputer\n\nHere is the scikit-learn imputer code:\n\n```python\nimport numpy as np\nfrom sklearn.impute import SimpleImputer\n\n# Sample data with missing values\nX = np.array([[1, 2], [np.nan, 3], [7, 
6]])\n\n# Define the imputer\nimputer = SimpleImputer(missing_values=np.nan, strategy='mean')\n\n# Fit and transform the data\nX_imputed = imputer.fit_transform(X)\n\n# View the imputed data: the missing entry becomes the column mean, (1 + 7) / 2 = 4\nprint(X_imputed)\n```\n\u003cbr\u003e\n\n## 15. How do you _normalize_ or _standardize_ data with _Scikit-Learn_?\n\nWhen preparing data for a machine learning model, it's often crucial to **normalize** or **standardize** features so that those with large numeric ranges do not dominate the others. Scikit-Learn provides two primary tools for this: `MinMaxScaler` for normalization and `StandardScaler` for standardization.\n\n### Normalization and Min-Max Scaling\n\n*Normalization* rescales each feature to a fixed range, 0 to 1 by default.\n\nThe example code demonstrates how to normalize a feature matrix using Scikit-Learn's `MinMaxScaler`.\n\n```python\nfrom sklearn.preprocessing import MinMaxScaler\n\n# Feature matrix H: two features on very different scales\n# Column 0 ranges from 45 to 100000\n# Column 1 (height in meters) ranges from 1.60 to 1.95\nH = [[50, 1.90],\n     [80000, 1.95],\n     [45, 1.60],\n     [100000, 1.65]]\n\nscaler = MinMaxScaler()\nH_scaled = scaler.fit_transform(H)\n\nprint(H_scaled)\n```\n\nIn the output, each column's minimum maps to 0 and its maximum maps to 1, with the remaining values placed proportionally in between.\n\n### Z-Score Standardization\n\n*Standardization* transforms each feature to have a **mean** of 0 and a **standard deviation** of 1 by computing `z = (x - mean) / std` per column. `StandardScaler` uses the population standard deviation (dividing by `n`, not `n - 1`).\n\nHere is the Python code to implement Z-Score Standardization using the `StandardScaler` in Scikit-Learn:\n\n```python\nfrom sklearn.preprocessing import StandardScaler\n\n# Feature matrix M: columns are age (years) and height (meters)\n# Column means for this sample are 60 and 1.70\nM = [[40, 1.60],   # below the mean in both columns\n     [120, 1.95],  # largest value in both columns\n     [20, 1.50],   # smallest value in both columns\n     [60, 1.75]]   # age at the mean, height slightly above it\n\nscaler = StandardScaler()\nM_scaled = scaler.fit_transform(M)\n\nprint(M_scaled)\n```\n\u003cbr\u003e\n\n\n\n#### Explore all 50 answers here 👉 [Devinterview.io 
- Scikit-Learn](https://devinterview.io/questions/machine-learning-and-data-science/scikit-learn-interview-questions)\n\n\u003cbr\u003e\n\n\u003ca href=\"https://devinterview.io/questions/machine-learning-and-data-science/\"\u003e\n\u003cimg src=\"https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/github-blog-img%2Fmachine-learning-and-data-science-github-img.jpg?alt=media\u0026token=c511359d-cb91-4157-9465-a8e75a0242fe\" alt=\"machine-learning-and-data-science\" width=\"100%\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevinterview-io%2Fscikit-learn-interview-questions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevinterview-io%2Fscikit-learn-interview-questions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevinterview-io%2Fscikit-learn-interview-questions/lists"}