https://github.com/eric15342335/week1-linearly-sparse-code
[Kaggle Competition] Linear regression using Elastic Net with Mean Squared Error + Sparsity penalty term
custom-metrics elastic-net kaggle linear-regression mean-square-error optuna
- Host: GitHub
- URL: https://github.com/eric15342335/week1-linearly-sparse-code
- Owner: eric15342335
- Created: 2025-12-15T19:48:54.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-15T19:54:09.000Z (6 months ago)
- Last Synced: 2026-03-24T14:09:05.440Z (2 months ago)
- Topics: custom-metrics, elastic-net, kaggle, linear-regression, mean-square-error, optuna
- Language: Python
- Homepage: https://www.kaggle.com/competitions/week-1-linearly-sparse
- Size: 2.89 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Week-1: Linearly Sparse
This is the first in a series of challenges hosted under the Winter In Data Science initiative (Project ID: 62).
## Week-1: Linearly Sparse - Overview
Welcome to Linearly Sparse!
In this competition, your task is to build a predictive model that achieves strong accuracy on a target variable. The core idea is simple: A good model should be both accurate and efficient. Modern ML workflows often involve hundreds of input features, but real-world constraints frequently favor models that are compact, interpretable, and deployable. This competition evaluates both prediction quality and feature usage, rewarding solutions that achieve an effective balance.
### 🎯 Your Objective
You are provided with:
- A training dataset containing input features and corresponding target values
- A test dataset containing only input features
- A submission format where you will provide:
  - A vector of model weights
  - Predictions for the test set

Your goal is to minimize the score metric described in the Evaluation section. The evaluation metric combines two aspects into a single score. Models that rely on many features may achieve high accuracy but will receive a penalty.
Models that are sparse but inaccurate will also score poorly.
The best solutions find the right balance between the two.
## What You Submit
Each submission contains:
- Your model's weight vector
- Your predictions for the test set

The evaluation metric interprets this structure and computes a combined score based on:
- Mean squared error (MSE) of your predictions
- A complexity penalty based on how many features your model uses

Lower scores are better.
## Get Started
Head over to the Data, Evaluation, and Submission tabs to understand the dataset structure, scoring mechanism, and submission requirements.
Good luck, and may the most efficient model win!
## Description
In many real-world machine learning tasks, large feature sets make models powerful, but also harder to interpret, more expensive to deploy, and more prone to overfitting.
This competition challenges you to design a model that is both accurate and efficient, using only the features that truly matter. You are provided with a tabular dataset consisting of:
- A training set with input features and target values
- A test set with input features only
- A custom scoring function that considers:
  - How precisely you predict the target
  - How many features your model relies on

Your objective is to discover a model that makes strong predictions while minimizing unnecessary complexity. Models that use many features may perform well on accuracy alone, but they will receive a penalty during evaluation.
Your task is to find the sweet spot between predictive performance and feature economy.
### 🧩 What Makes This Competition Unique?
Unlike standard regression challenges where only prediction accuracy matters, this competition rewards solutions that:
- Identify and leverage the most important features
- Avoid depending on the full feature set
- Balance accuracy with interpretability and simplicity
This setup mimics real-world constraints where computational budgets, latency requirements, or domain knowledge push practitioners toward more compact models.
## Evaluation
This competition evaluates two aspects of your model:
- How accurately it predicts the target values for the test set, and
- How efficiently it uses the available features.

To capture this trade-off, we use a custom scoring function that combines:
- Mean Squared Error (MSE) on the test predictions
- A sparsity penalty based on how many features your model uses
The final score is computed as:
where:
- $f$ = number of features your model actually uses (the number of non-zero coefficients in the coefficient vector)
- $m$ = total number of features
- $\alpha$ and $p$ = penalty parameters (both are set to 2)
- Lower scores are better
In other words:
Two models can have similar accuracy, but the one using fewer features will achieve a better score. In general, you would want to have a weight vector that is sparse (i.e. contains a lot of zeros).
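
The trade-off described above can be sketched in plain Python. The scoring formula itself does not appear in this README, so the additive combination used here (MSE plus `alpha * (f / m) ** p`) is an assumption based on the repository description ("Mean Squared Error + Sparsity penalty term"). The function name `sparsity_score` is hypothetical, and the metric on the competition's Evaluation tab remains the authority.

```python
def sparsity_score(y_true, y_pred, weights, alpha=2.0, p=2.0):
    """Hypothetical sketch of the combined score (lower is better).

    ASSUMPTION: the two terms are added, i.e. MSE + alpha * (f / m) ** p.
    The official metric may combine them differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Mean squared error of the predictions.
    mse = sum((t - q) ** 2 for t, q in zip(y_true, y_pred)) / len(y_true)
    m = len(weights)                        # total number of features
    f = sum(1 for w in weights if w != 0)   # non-zero coefficients
    return mse + alpha * (f / m) ** p


# Two models with identical predictions but different sparsity.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
dense_weights = [0.5, -0.2, 0.1, 0.3]    # uses 4 of 4 features
sparse_weights = [0.5, 0.0, 0.0, 0.3]    # uses 2 of 4 features

dense_score = sparsity_score(y_true, y_pred, dense_weights)
sparse_score = sparsity_score(y_true, y_pred, sparse_weights)
print(sparse_score < dense_score)
```

Under this assumed form, only exact zeros reduce the penalty; a coefficient such as `1e-12` still counts as a used feature.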
## Submission Format
To make submission creation easier and to avoid formatting mistakes, we provide a `create_submission` utility.
You only need to supply:
- Your weight vector (length = 200)
- Your predictions on the test dataset (length = 500)
- A filename for the submission CSV

The helper function will automatically generate a properly formatted `filename.csv` file that you can upload. See `create_submission.txt` for the helper function.
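
Before calling the provided helper, it may be worth checking the two inputs against the lengths stated above. The sketch below is not part of `create_submission`; the function name `check_submission_inputs` is hypothetical, and the requirement that all values be finite is an assumption rather than a documented rule.

```python
import math

N_FEATURES = 200    # weight vector length stated above
N_TEST_ROWS = 500   # number of test predictions stated above


def check_submission_inputs(weights, predictions):
    """Hypothetical pre-flight check; not the competition's helper."""
    if len(weights) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} weights, got {len(weights)}")
    if len(predictions) != N_TEST_ROWS:
        raise ValueError(
            f"expected {N_TEST_ROWS} predictions, got {len(predictions)}"
        )
    # ASSUMPTION: NaN or infinite values would not be accepted.
    for name, values in (("weights", weights), ("predictions", predictions)):
        if not all(math.isfinite(v) for v in values):
            raise ValueError(f"{name} contains NaN or infinite values")
    return True


ok = check_submission_inputs([0.0] * N_FEATURES, [1.0] * N_TEST_ROWS)
```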
## Dataset Description
The dataset for this competition consists of a training set and a test set, both containing 200 numerical features. Your goal is to learn from the training data and generate predictions for the unseen test samples.
### Training Data (`train.csv`)
The training file contains:
- 1000 samples (rows)
- 200 numerical features (`x1` to `x200`)
- 1 target variable (`y`)

Each row represents a single observation with 200 input variables and a corresponding output value. You will use this data to train your model and compute the weight vector.

Columns in `train.csv`:

| Column Name | Description |
| --- | --- |
| `y` | Target variable to be predicted |
| `x1` to `x200` | Numeric input features |
### Test Data (`test.csv`)
The test set contains:
- 500 samples (rows)
- The same 200 numerical features as the training set
- No target variable

Your model should generate a prediction for each of these 500 samples.

Columns in `test.csv`:

| Column Name | Description |
| --- | --- |
| `x1` to `x200` | Numeric input features (same format as training data) |
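
As a minimal sketch of reading files with the column layout described above, the loader below uses only the standard library. The function name `load_table` and its parameters are hypothetical, and it assumes the CSV header uses exactly the names `y` and `x1` to `x200`.

```python
import csv


def load_table(path, n_features=200, has_target=True):
    """Read feature columns x1..x{n_features}, plus y when has_target is True.

    Returns (X, y); y is None when the file has no target column.
    """
    feature_names = [f"x{i}" for i in range(1, n_features + 1)]
    X, y = [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Select features by name so the column order in the file is irrelevant.
            X.append([float(row[name]) for name in feature_names])
            if has_target:
                y.append(float(row["y"]))
    return X, (y if has_target else None)
```

With the competition files in the working directory, `load_table("train.csv")` would return the features and targets, and `load_table("test.csv", has_target=False)` would return the test features with `None` as the target.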
## Notes
- Feature meanings are not explicitly provided; part of the challenge is determining which features are important.
- The training and test sets share the same feature structure.
- You will submit a weight vector (based on these 200 features) and predictions for the 500 test rows.

The data has been prepared to encourage thoughtful feature selection and model design. Use the training data to learn effective patterns, and then apply your model to the test set to produce your final predictions.
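
One simple way to get a weight vector with many exact zeros is to zero out coefficients whose magnitude falls below a threshold and then predict with what remains. The sketch below illustrates that idea only; it is not the method used in this repository (whose topics mention Elastic Net). The function names, the threshold value, and the default zero intercept are all assumptions.

```python
def hard_threshold(weights, threshold):
    """Set every coefficient with |w| < threshold to exactly 0.0."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def predict_linear(rows, weights, intercept=0.0):
    """Linear prediction for each row: intercept + sum_j w_j * x_j."""
    return [
        intercept + sum(w * x for w, x in zip(weights, row))
        for row in rows
    ]


# Toy example with 4 features instead of 200.
weights = [1.5, 0.004, -0.8, -0.0002]
sparse_weights = hard_threshold(weights, threshold=0.01)

rows = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
predictions = predict_linear(rows, sparse_weights)
```

Thresholding changes the predictions, so the threshold is best chosen by checking accuracy on held-out training rows rather than fixed in advance.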