https://github.com/post2web/santander



{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### This project is used for class 2018-0507 MSDS 7335 Machine Learning at SMU\n",
"\n",
"- hosted on https://github.com/post2web/santander\n",
"- by Ivelin Angelov"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vector Representation of sparse data\n",
"\n",
"Santander Group's dataset is a good example of a challenge with a very sparse, high-dimensional dataset: before we can do supervised machine learning, we have to represent the data in a denser, lower-dimensional form. The competition has two datasets: \n",
"- train.csv with dimensions 4459x4993: 4459 observations with an ID column, 1 target variable, and 4991 features.\n",
"- test.csv with dimensions 49342x4992: an ID column and the same 4991 features. It has more than ten times as many observations as the training data and no target variable, which is what the competition asks us to predict.\n",
"\n",
"All feature names are anonymized with random strings, and no information is given about what they mean. From exploration, we concluded that they represent amounts of money in different categories of a customer's transactions. The targets (labels) in the training dataset represent the value of transactions for each potential customer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data exploration"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train Dataset shape (4459, 4993)\n",
"Test Dataset shape (49342, 4992)\n"
]
}
],
"source": [
"train = pd.read_csv('data/train.csv')\n",
"test = pd.read_csv('data/test.csv')\n",
"print('Train Dataset shape', train.shape)\n",
"print('Test Dataset shape', test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Head of Train Dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID target 48df886f9 0deb4b6a8 34b15f335 a8cb14b00 \\\n",
"0 000d6aaf2 38000000.0 0.0 0 0.0 0 \n",
"1 000fbd867 600000.0 0.0 0 0.0 0 \n",
"\n",
" 2f0771a37 30347e683 d08d1fbe3 6ee66e115 ... 3ecc09859 \\\n",
"0 0 0 0 0 ... 0.0 \n",
"1 0 0 0 0 ... 0.0 \n",
"\n",
" 9281abeea 8675bec0b 3a13ed79a f677d4d13 71b203550 137efaa80 \\\n",
"0 0.0 0.0 0 0 0 0 \n",
"1 0.0 0.0 0 0 0 0 \n",
"\n",
" fb36b89d9 7e293fbaf 9fc776466 \n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"\n",
"[2 rows x 4993 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Head of Test Dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ID 48df886f9 0deb4b6a8 34b15f335 a8cb14b00 2f0771a37 \\\n",
"0 000137c73 0.0 0.0 0.0 0.0 0.0 \n",
"1 00021489f 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" 30347e683 d08d1fbe3 6ee66e115 20aa07010 ... 3ecc09859 \\\n",
"0 0.0 0.0 0.0 0.0 ... 0.0 \n",
"1 0.0 0.0 0.0 0.0 ... 0.0 \n",
"\n",
" 9281abeea 8675bec0b 3a13ed79a f677d4d13 71b203550 137efaa80 \\\n",
"0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" fb36b89d9 7e293fbaf 9fc776466 \n",
"0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 \n",
"\n",
"[2 rows x 4992 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Distribution of the labels"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The labels have minimum of 30000 and maximum of 40000000\n"
]
},
{
"data": {
"text/plain": [
"[Figure: box plot of train['target']]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"train['target'].plot(kind='box')\n",
"print('The labels have minimum of %d and maximum of %d' % (train['target'].min(), train['target'].max()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features exploration\n",
"\n",
"Most of the values in both datasets are zeros, which most probably represent missing amounts in a given category."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"features = train.iloc[:,2:].astype(np.int32)#.append(test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Percent missing values"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"96.85413111171313 percent of the features are zeros (missing values)\n"
]
}
],
"source": [
"percent_zeros = (features==0).sum(axis=1).mean() / features.shape[1] * 100\n",
"print(percent_zeros, 'percent of the features are zeros (missing values)')"
]
},
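As a small sanity check on the computation above: the mean number of zeros per row, divided by the number of columns, equals the overall share of zero cells. The toy frame below is invented for illustration; only the `features==0` idiom comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real feature matrix.
toy = pd.DataFrame({
    "a": [0, 0, 5, 0],
    "b": [0, 3, 0, 0],
    "c": [0, 0, 0, 0],
})

# Notebook-style computation: mean zeros per row / number of columns.
per_row = (toy == 0).sum(axis=1).mean() / toy.shape[1] * 100

# Equivalent direct computation: share of zero cells in the whole matrix.
overall = (toy.values == 0).mean() * 100

print(per_row, overall)
```

The direct form is shorter, while the row-wise form makes it easy to also inspect how sparse each individual observation is.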
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distribution\n",
"\n",
"The distribution shows mostly zeros, two negative values, and some transaction amount values."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The labels have minimum of -2147483648 and maximum of 960000000\n"
]
},
{
"data": {
"text/plain": [
"[Figure: box plot of the nonzero feature values]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"values_vector = features.values.flatten()\n",
"values_vector = values_vector[values_vector!=0]\n",
"pd.DataFrame(values_vector).plot(kind='box')\n",
"print('The labels have minimum of %d and maximum of %d' % (values_vector.min(), values_vector.max()))"
]
},
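The minimum printed above, -2147483648, is exactly the smallest value an `int32` can hold. That hints the two negative values may be overflow artifacts of the `astype(np.int32)` cast rather than real amounts. The sketch below, on made-up numbers, checks the `int32` range before casting and falls back to `int64`; the helper name `safe_int_cast` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def safe_int_cast(frame):
    """Cast to int32 only if every value fits; otherwise use int64."""
    info = np.iinfo(np.int32)
    values = frame.values
    fits = values.min() >= info.min and values.max() <= info.max
    return frame.astype(np.int32 if fits else np.int64)

# Hypothetical amounts; 3e9 is larger than the int32 maximum (2147483647).
toy = pd.DataFrame({"x": [0.0, 1200.0, 3e9], "y": [0.0, 0.0, 50.0]})

casted = safe_int_cast(toy)
print(np.iinfo(np.int32).min)  # smallest representable int32
print(casted.dtypes.unique(), casted["x"].max())
```

Whether the original data really contains amounts above the `int32` maximum is not confirmed by the notebook; the check is cheap enough to run either way.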
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Distribution of the positive values"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The labels have minimum of 52 and maximum of 960000000\n"
]
},
{
"data": {
"text/plain": [
"[Figure: box plot of the positive feature values]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"values_vector = features.values.flatten()\n",
"values_vector = values_vector[values_vector>0]\n",
"pd.DataFrame(values_vector).plot(kind='box')\n",
"print('The labels have minimum of %d and maximum of %d' % (values_vector.min(), values_vector.max()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Selection\n",
"\n",
"The top 1000 features are selected by ranking them with a random forest model; 1000 is a hyperparameter that could be optimized. This step could also be skipped or run after the preprocessing step.\n",
"\n",
"![](images/3.png)"
]
},
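The notebook does not show the selection code, so the following is only a minimal sketch of ranking features with a random forest. It assumes scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_`; the synthetic data, `top_k = 5` (standing in for 1000), and the forest settings are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Hypothetical stand-in for the feature matrix: 200 rows, 20 columns.
X = rng.rand(200, 20)
# Only column 3 drives the target in this toy setup.
y = 3.0 * X[:, 3] + 0.1 * rng.rand(200)

top_k = 5  # plays the role of the 1000 in the text; a tunable hyperparameter

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Rank columns by importance, highest first, and keep the top_k of them.
top_idx = np.argsort(forest.feature_importances_)[::-1][:top_k]
X_selected = X[:, top_idx]

print(top_idx, X_selected.shape)
```

Treating `top_k` as a hyperparameter means it would normally be chosen by cross-validating the downstream model, not fixed in advance.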
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Features preprocessing\n",
"\n",
"To make the distribution less skewed, I first set the negative values (only two) to zero and then apply log(x + 1) to all features."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"[Figure: box plot of the log-transformed feature values]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"preprocessed_features = features.copy()\n",
"preprocessed_features[preprocessed_features.values<0] = 0.\n",
"preprocessed_features = np.log(preprocessed_features+1)\n",
"values_vector = preprocessed_features.values.flatten()\n",
"assert preprocessed_features.min().min() >= 0.\n",
"pd.Series(values_vector).plot(kind='box')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To center the data around zero, I first create a flag feature for every column that marks its missing values (zeros), and then subtract the median of the non-zero values from the non-zero entries."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
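],
"source": [
"# Flag the zeros (treated as missing values) before centering.\n",
"zero_mask = preprocessed_features == 0.\n",
"zero_flags = zero_mask.copy()\n",
"zero_flags.columns = [c+'_flag' for c in zero_flags.columns]\n",
"\n",
"# Center only the non-zero entries on the median of the non-zero values.\n",
"# The mask keeps the original column names so it aligns with the features.\n",
"values_vector = preprocessed_features.values.flatten()\n",
"data_center = np.median(values_vector[values_vector>0])\n",
"preprocessed_features[~zero_mask] = preprocessed_features[~zero_mask] - data_center\n",
"values_vector = preprocessed_features.values.flatten()\n",
"pd.DataFrame(values_vector).plot(kind='box')\n",
"\n",
"# Append the flag columns to the centered features.\n",
"preprocessed_features = pd.concat([preprocessed_features, zero_flags], axis=1)"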
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dimensionality reduction\n",
"\n",
"The data is now approximately normally distributed, but because I add a \"missing flag\" column for every feature, the dataset becomes even wider and sparser. \n",
"\n",
"Two forms of dimensionality reduction are applied and compared: PCA and Autoencoder. LDA is not considered here because it is a supervised method and can only be fitted on labeled data. The unlabeled testing dataset is more than ten times larger than the training dataset, so including it in the dimensionality reduction process is of great value. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Measure of information loss\n",
"\n",
"To get an unsupervised performance measure, I compare the reconstruction MSE of PCA and the Autoencoder. The reconstruction MSE of the Autoencoder doesn't seem to be affected by the number of dimensions: the values appear to fit in 2 dimensions about as well as they do in 128 dimensions. This is a red flag for the Autoencoder, and it could mean that, even after a lot of hyperparameter searching, I didn't achieve good training.\n",
"![](images/1.png)"
]
},
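{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reconstruction MSE used for this comparison can be sketched as follows for PCA. This is a minimal illustration and not the code from the project notebooks: the helper `pca_reconstruction_mse`, the synthetic data, and the dimension counts are assumptions made for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"\n",
"def pca_reconstruction_mse(train, valid, n_components):\n",
"    # Fit PCA on the training rows only (hypothetical helper).\n",
"    pca = PCA(n_components=n_components)\n",
"    pca.fit(train)\n",
"    # Encode the validation rows, decode them, and compare with the input.\n",
"    reconstructed = pca.inverse_transform(pca.transform(valid))\n",
"    return float(np.mean((valid - reconstructed) ** 2))\n",
"\n",
"\n",
"# Synthetic stand-in data (assumption): 200 rows with 20 features.\n",
"rng = np.random.RandomState(0)\n",
"data = rng.normal(size=(200, 20))\n",
"train, valid = data[:150], data[150:]\n",
"\n",
"# Reconstruction error for a few illustrative dimension counts.\n",
"mse_by_dim = {k: pca_reconstruction_mse(train, valid, k) for k in (2, 8, 20)}\n",
"```\n",
"\n",
"Because the leading PCA components span nested subspaces, this error can only shrink as components are added. A comparable downward trend would normally be expected from a well-trained Autoencoder as well."
]
},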
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recreating the analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download raw data files\n",
"\n",
"Log in to Kaggle and download the data files from the \"Santander Value Prediction Challenge\": [train.csv](https://www.kaggle.com/c/santander-value-prediction-challenge/download/train.csv), [test.csv](https://www.kaggle.com/c/santander-value-prediction-challenge/download/test.csv). Place the files in the `data` folder at the root of the project. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## All steps of the ML pipeline are implemented in Jupyter notebooks. \n",
"\n",
"Every notebook contains extra details and hyperparameters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocess\n",
"\n",
"- This process is executed in a preprocessing notebook: `preprocess.ipynb`\n",
"\n",
"It applies the preprocessing logic and saves the result to `data/train_transformed.h5` and `data/test_transformed.h5`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Importance\n",
"\n",
"- This process is executed in the notebook `feature_importance.ipynb`\n",
"\n",
"It ranks the features and saves the ranking to `data/importance.h5`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dimensionality reduction\n",
"\n",
"PCA and Autoencoder are applied to the training dataset. Both the training and the testing datasets are used to fit (\"train\") the dimensionality reduction models. A validation subset is used to measure unsupervised performance and for early stopping.\n",
"\n",
"- PCA is implemented with scikit-learn in `run_pca.ipynb`. After executing the notebook, the folder `pca_result` will be populated with files containing the encoded training dataset for different numbers of dimensions and the validation reconstructions for unsupervised comparison.\n",
"- The Autoencoder is implemented with TensorFlow in `run_ed.ipynb`, `ed.py`, and `export_ed.py`. After executing the notebook, the folder `ed_model_dirs` will be populated with files containing the encoded training dataset for different numbers of dimensions, the validation reconstructions for unsupervised comparison, and checkpoints for the TensorFlow encoder-decoder models. The execution of the Autoencoder takes multiple hours with a fast GPU.\n",
"\n",
"#### Dimensionality reduction unsupervised result\n",
"![](images/2.png)"
]
},
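{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of fitting the reduction on both datasets while holding out a validation subset can be sketched as below. This is a hedged illustration, not the contents of `run_pca.ipynb`: the function `fit_pca_on_all`, its parameters, and the synthetic arrays are assumptions made for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"def fit_pca_on_all(train, test, n_components, valid_size=0.1, seed=0):\n",
"    # PCA needs no targets, so labeled and unlabeled rows can be stacked.\n",
"    combined = np.vstack([train, test])\n",
"    # Hold out a validation subset for unsupervised comparison.\n",
"    fit_rows, valid_rows = train_test_split(\n",
"        combined, test_size=valid_size, random_state=seed)\n",
"    pca = PCA(n_components=n_components).fit(fit_rows)\n",
"    return pca, valid_rows\n",
"\n",
"\n",
"# Synthetic stand-in data (assumption): the test set is ten times larger.\n",
"rng = np.random.RandomState(0)\n",
"train = rng.normal(size=(40, 10))\n",
"test = rng.normal(size=(400, 10))\n",
"\n",
"pca, valid_rows = fit_pca_on_all(train, test, n_components=4)\n",
"# Only the training rows are encoded for the supervised step.\n",
"encoded_train = pca.transform(train)\n",
"```\n",
"\n",
"The held-out `valid_rows` are the rows on which a reconstruction error would be measured."
]
},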
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Supervised training\n",
"\n",
"After the two dimensionality reduction methods are executed, for every method and every number of dimensions we can apply three different supervised regression methods: Linear Regression, Random Forest, and Neural Network.\n",
"\n",
"- Linear Regression and Random Forest are implemented with scikit-learn in the notebook `ll_rf.ipynb`. \n",
"- The Neural Network is implemented with TensorFlow in `nn.ipynb`, `nn.py` and `nn_export.py`. The execution of the Neural Network takes multiple hours with a fast GPU."
]
},
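{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the supervised step for the two scikit-learn models, the encoded features can be split, fitted, and scored on held-out rows. This is not the code from `ll_rf.ipynb`: the helper `score_regressors`, the hyperparameters, the use of plain MSE as the score, and the synthetic data are all assumptions made for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"def score_regressors(encoded, target, seed=0):\n",
"    # Hold out a quarter of the rows to score each model (hypothetical helper).\n",
"    x_fit, x_val, y_fit, y_val = train_test_split(\n",
"        encoded, target, test_size=0.25, random_state=seed)\n",
"    models = {\n",
"        'linear_regression': LinearRegression(),\n",
"        'random_forest': RandomForestRegressor(n_estimators=50, random_state=seed),\n",
"    }\n",
"    scores = {}\n",
"    for name, model in models.items():\n",
"        model.fit(x_fit, y_fit)\n",
"        scores[name] = mean_squared_error(y_val, model.predict(x_val))\n",
"    return scores\n",
"\n",
"\n",
"# Synthetic stand-in for an 8-dimensional encoding with a linear target.\n",
"rng = np.random.RandomState(0)\n",
"encoded = rng.normal(size=(300, 8))\n",
"target = encoded @ rng.normal(size=8) + 0.1 * rng.normal(size=300)\n",
"\n",
"scores = score_regressors(encoded, target)\n",
"```\n",
"\n",
"In the project, the same kind of loop would be repeated for every reduction method and every number of dimensions."
]
},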
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Linear Regression result\n",
"\n",
"![](images/lr.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Random Forest Regression result\n",
"\n",
"![](images/rf.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Neural Network Regression result\n",
"\n",
"![](images/nn.png)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}