https://github.com/eric15342335/week1-linearly-sparse-code

[Kaggle Competition] Linear regression using Elastic Net with Mean Squared Error + Sparsity penalty term

custom-metrics elastic-net kaggle linear-regression mean-square-error optuna

# Week-1: Linearly Sparse

This is the first in a series of challenges hosted under the Winter In Data Science Initiative (Project ID: 62).

## Week-1: Linearly Sparse - Overview

Welcome to Linearly Sparse!
In this competition, your task is to build a predictive model that achieves strong accuracy on a target variable. The core idea is simple: a good model should be both accurate and efficient. Modern ML workflows often involve hundreds of input features, but real-world constraints frequently favor models that are compact, interpretable, and deployable. This competition evaluates both prediction quality and feature usage, rewarding solutions that balance the two.

### 🎯 Your Objective

You are provided with:

- A training dataset containing input features and corresponding target values
- A test dataset containing only input features
- A submission format where you will provide:
  - A vector of model weights
  - Predictions for the test set

Your goal is to minimize the score metric described in the Evaluation section. The metric combines two aspects into a single score: models that rely on many features may achieve high accuracy but will receive a penalty, and models that are sparse but inaccurate will also score poorly.

The best solutions find the right balance between the two.

## What You Submit

Each submission contains:

- Your model's weight vector
- Your predictions for the test set

The evaluation metric interprets this structure and computes a combined score based on:

- Mean squared error (MSE) of your predictions
- A complexity penalty based on how many features your model uses

Lower scores are better.

## Get Started

Head over to the Data, Evaluation, and Submission tabs to understand the dataset structure, scoring mechanism, and submission requirements.
Good luck, and may the most efficient model win!

## Description

In many real-world machine learning tasks, large feature sets make models powerful, but also harder to interpret, more expensive to deploy, and more prone to overfitting.
This competition challenges you to design a model that is both accurate and efficient, using only the features that truly matter. You are provided with a tabular dataset consisting of:

- A training set with input features and target values
- A test set with input features only
- A custom scoring function that considers:
  - How precisely you predict the target
  - How many features your model relies on

Your objective is to discover a model that makes strong predictions while minimizing unnecessary complexity. Models that use many features may perform well on accuracy alone, but they will receive a penalty during evaluation.

Your task is to find the sweet spot between predictive performance and feature economy.

### 🧩 What Makes This Competition Unique?

Unlike standard regression challenges where only prediction accuracy matters, this competition rewards solutions that:

- Identify and leverage the most important features
- Avoid depending on the full feature set
- Balance accuracy with interpretability and simplicity

This setup mimics real-world constraints where computational budgets, latency requirements, or domain knowledge push practitioners toward more compact models.

## Evaluation

This competition evaluates two aspects of your model:

- How accurately it predicts the target values for the test set, and
- How efficiently it uses the available features.

To capture this trade-off, we use a custom scoring function that combines:

- Mean Squared Error (MSE) on the test predictions
- A sparsity penalty based on how many features your model uses

The final score is computed as:

where:

- $f$ = number of features your model actually uses (the number of non-zero coefficients in the coefficient vector)
- $m$ = total number of features
- $\alpha$ and $p$ = penalty parameters (both set to 2)

Lower scores are better.

In other words:

Two models can have similar accuracy, but the one using fewer features will achieve a better score. In general, you would want to have a weight vector that is sparse (i.e. contains a lot of zeros).
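One way to read the definitions above is as an additive penalty on the fraction of features used, i.e. score = MSE + α · (f / m)^p. This form is an assumption suggested by the "MSE + sparsity penalty" wording, not a confirmed formula, and the function name `sparse_score` is hypothetical. A minimal sketch under that assumption:

```python
import numpy as np

def sparse_score(y_true, y_pred, weights, alpha=2.0, p=2.0):
    """MSE plus a sparsity penalty.

    ASSUMPTION: the penalty is additive, score = MSE + alpha * (f / m) ** p.
    The exact competition formula may differ.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)

    mse = np.mean((y_true - y_pred) ** 2)
    f = np.count_nonzero(weights)  # features actually used (non-zero coefficients)
    m = weights.size               # total number of features
    return mse + alpha * (f / m) ** p
```

Under this assumed form with m = 200 and α = p = 2, a model with perfect predictions would score 2 · (10/200)² = 0.005 when it uses 10 features and 2.0 when it uses all 200.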

## Submission Format

To make submission creation easier and to avoid formatting mistakes, we provide a `create_submission` utility.
You only need to supply:

- Your weight vector (length = 200)
- Your predictions on the test dataset (length = 500)
- A filename for the submission CSV

The helper function will automatically generate a properly formatted `filename.csv` file that you can upload. See `create_submission.txt` for the helper function.
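Since the signature of `create_submission` is not shown here, the sketch below does not call it. It only illustrates a sanity check you might run on the two arrays before passing them to the helper. The function name `check_submission_inputs` is hypothetical; the lengths 200 and 500 come from the list above.

```python
import numpy as np

N_FEATURES = 200   # required length of the weight vector
N_TEST_ROWS = 500  # required number of test predictions

def check_submission_inputs(weights, predictions):
    """Return (weights, predictions) as 1-D float arrays, or raise ValueError."""
    weights = np.asarray(weights, dtype=float).ravel()
    predictions = np.asarray(predictions, dtype=float).ravel()

    if weights.shape != (N_FEATURES,):
        raise ValueError(f"expected {N_FEATURES} weights, got {weights.size}")
    if predictions.shape != (N_TEST_ROWS,):
        raise ValueError(f"expected {N_TEST_ROWS} predictions, got {predictions.size}")
    if not (np.isfinite(weights).all() and np.isfinite(predictions).all()):
        raise ValueError("weights and predictions must be finite numbers")
    return weights, predictions
```

Running a check like this first can catch a transposed or truncated array before it ends up in the CSV.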

## Dataset Description

The dataset for this competition consists of a training set and a test set, both containing 200 numerical features. Your goal is to learn from the training data and generate predictions for the unseen test samples.

### 🧠 Training Data (`train.csv`)

The training file contains:

- 1000 samples (rows)
- 200 numerical features (`x1` to `x200`)
- 1 target variable (`y`)

Each row represents a single observation with 200 input variables and a corresponding output value. You will use this data to train your model and compute the weight vector.

Columns in `train.csv`:

| Column Name | Description |
| --- | --- |
| `y` | Target variable to be predicted |
| `x1–x200` | Numeric input features |
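As an illustration of how a sparse weight vector might be computed from the training data, here is a minimal elastic-net fit by proximal gradient descent. It is a sketch under assumptions, not necessarily the method used in this repository: the features and target are assumed to be already loaded as NumPy arrays, no intercept is fitted, and the names `soft_threshold` and `fit_elastic_net` and the default hyperparameters are hypothetical. Note that `alpha` here is the regularization strength, not the penalty parameter of the competition metric.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries within [-t, t] become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.9, n_iter=1000):
    """Minimize (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by proximal gradient descent.

    ASSUMPTION: linear model without an intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)

    # Lipschitz constant of the gradient of the smooth part (squared loss + ridge term).
    lipschitz = np.linalg.norm(X, 2) ** 2 / n + l2

    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n + l2 * w
        # Gradient step on the smooth part, then the L1 proximal step (soft-thresholding).
        w = soft_threshold(w - grad / lipschitz, l1 / lipschitz)
    return w
```

The soft-thresholding step is what produces exact zeros in the weight vector. Larger `alpha` or `l1_ratio` values generally yield fewer non-zero weights at the cost of more shrinkage on the ones that remain.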

### Test Data (`test.csv`)

The test set contains:

- 500 samples (rows)
- The same 200 numerical features as the training set
- No target variable

Your model should generate a prediction for each of these 500 samples.

Columns in `test.csv`:

| Column Name | Description |
| --- | --- |
| `x1–x200` | Numeric input features (same format as training data) |
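Because the metric counts non-zero coefficients, a weight such as `1e-12` would still count as a used feature. The sketch below zeroes out negligible weights and then produces predictions from the pruned vector. It assumes a linear model without an intercept; the function name `prune_and_predict` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def prune_and_predict(X_test, weights, tol=1e-8):
    """Zero out weights with |w| <= tol, then predict with the pruned vector.

    Returns (pruned_weights, predictions, n_used).
    """
    X_test = np.asarray(X_test, dtype=float)
    pruned = np.asarray(weights, dtype=float).copy()
    pruned[np.abs(pruned) <= tol] = 0.0  # tiny values would still count as used features
    predictions = X_test @ pruned        # ASSUMPTION: linear model, no intercept
    n_used = int(np.count_nonzero(pruned))
    return pruned, predictions, n_used
```

Predicting from the pruned vector keeps the submitted weights and the submitted predictions consistent with each other.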

## 📌 Notes

- Feature meanings are not explicitly provided; part of the challenge is determining which features are important.
- The training and test sets share the same feature structure.
- You will submit a weight vector (based on these 200 features) and predictions for the 500 test rows.
- The data has been prepared to encourage thoughtful feature selection and model design.
- Use the training data to learn effective patterns, and then apply your model to the test set to produce your final predictions.