{"id":18658075,"url":"https://github.com/irfanchahyadi/ml-notes","last_synced_at":"2025-07-11T03:33:34.180Z","repository":{"id":129578122,"uuid":"193005488","full_name":"irfanchahyadi/ML-Notes","owner":"irfanchahyadi","description":"Complete personal notes for performing Data Analysis, Preprocessing, and Training ML model.","archived":false,"fork":false,"pushed_at":"2020-05-17T08:22:36.000Z","size":81,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-04-11T21:22:54.827Z","etag":null,"topics":["data-analysis","machine-learning","plotting","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/irfanchahyadi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-21T00:48:20.000Z","updated_at":"2021-05-27T11:28:59.000Z","dependencies_parsed_at":"2023-06-12T07:45:20.389Z","dependency_job_id":null,"html_url":"https://github.com/irfanchahyadi/ML-Notes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/irfanchahyadi/ML-Notes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/irfanchahyadi%2FML-Notes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/irfanchahyadi%2FML-Notes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/irfanchahyadi%2FML-Notes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/irfanchahyadi%2FML-Notes/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/irfanchahyadi","download_url":"https://codeload.github.com/irfanchahyadi/ML-Notes/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/irfanchahyadi%2FML-Notes/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264721557,"owners_count":23653953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","machine-learning","plotting","python"],"created_at":"2024-11-07T07:31:24.713Z","updated_at":"2025-07-11T03:33:34.144Z","avatar_url":"https://github.com/irfanchahyadi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ML-Notes\nComplete personal notes for performing Data Analysis, Preprocessing, and Training ML model. For easy guideline and quick copy paste snippet to real work. 
Fit on one page and constantly updated.\n## Table of contents\n- [Preparation](#Preparation)\n\t- [Importer](#Importer)\n\t- [Input Output](#Input-Output)\n\t\t- From Other Source : [Flat File](#Flat-File), [SQL](#SQL), [AWS Athena](#AWS-Athena), [GSpread](#GSpread)\n\t\t- Scraping : [BeautifulSoup](#BeautifulSoup), [Scrapy](#Scrapy)\n- [Exploratory Data Analysis](#Exploratory-Data-Analysis)\n\t- [Indexing](#Indexing)\n\t- [Describe](#Describe)\n\t- [Aggregate](#Aggregate)\n\t- [Plotting](#Plotting)\n\t\t- Relational : [Scatter](#Scatter-plot), [Line](#Line-Plot), [Joint](#Joint-Plot), [Pair](#Pair-Plot), [Regression](#Regression-Plot)\n\t\t- Distribution : [Pie](#Pie-Plot), [Histogram](#Histogram-Plot), [Bar](#Bar-Plot), [Strip](#Strip-Plot), [Swarm](#Swarm-Plot), [Box](#Box-Plot), [Violin](#Violin-Plot), [Categorical](#Categorical-Plot)\n\t\t- Other : [Heat Map](#Heat-Map)\n\t\t- [Properties](#Properties)\n- [Preprocessing](#Preprocessing)\n\t- [Feature Engineering](#Feature-Engineering)\n\t- [Missing Value](#Missing-Value)\n\t- [Categorical Feature](#Categorical-Feature)\n\t- [Transform](#Transform)\n\t- [Scaling and Normalize](#Scaling-and-Normalize)\n- [Training Model](#Training-Model)\n\t- [Feature Selection](#Feature-Selection)\n\t- [Cross Validation](#Cross-Validation)\n\t- [Train Model](#Train-Model)\n\t- [Evaluation](#Evaluation)\n\t- [Hyperparameter Tuning](#Hyperparameter-Tuning)\n\t- [Pipeline](#Pipeline)\n- [Neural Network](#Neural-Network)\n\t- Tensor Flow : \n\t- Keras Model : \n\t\t- [Build Model](#Build-Keras-Model)\n\t\t- [Create Callback](#Create-Keras-Callback)\n\t\t- [Train Model](#Train-Keras-Model)\n\t\t- [Evaluate Model](#Evaluate-Keras-Model)\n\t- PyTorch : \n- [Miscellaneous](#Miscellaneous)\n\t- [Basic Python](#Basic-Python)\n\t- [Regex Cheatsheet](#Regex-Cheatsheet)\n\t- [Datetime Cheatsheet](#Datetime-Cheatsheet)\n\t- [CSS Selector Cheatsheet](#CSS-Selector-Cheatsheet)\n\t- [Matplotlib Cheatsheet](#Matplotlib-Cheatsheet)\n\n## 
Preparation\n### Importer\n```python\n# Most used\nimport numpy as np                      # numerical analysis and matrix computation\nimport pandas as pd                     # data manipulation and analysis on tabular data\nimport matplotlib.pyplot as plt         # plotting data\nimport seaborn as sns                   # data visualization based on matplotlib\n\n# Connection to data\nimport pymysql                          # connect to mysql database\nimport pyodbc                           # connect to sql server database\nimport pyathena                         # connect to aws athena\nimport gspread                          # connect to google sheets\nfrom sqlalchemy import create_engine    # database engine for pd.read_sql / df.to_sql\nfrom oauth2client.service_account import ServiceAccountCredentials   # google auth\nfrom gspread_dataframe import get_as_dataframe, set_with_dataframe   # library i/o directly from df\n\n# Scikit-learn\nfrom sklearn.impute import SimpleImputer   # replaces preprocessing.Imputer, removed in scikit-learn 0.22\nfrom sklearn.preprocessing import scale, StandardScaler\nfrom sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV\nfrom sklearn.metrics import mean_squared_error, classification_report, confusion_matrix, roc_curve, roc_auc_score\nfrom sklearn.pipeline import Pipeline\n\n# Scikit-learn Model\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LogisticRegression\nfrom sklearn.svm import SVC\n\n# Other Tools\n# Jupyter magics: keep comments on their own line, a trailing comment is read as part of the magic argument\n# reload the dotenv extension, then load variables from .env\n%reload_ext dotenv\n%dotenv\nimport os                               # os interface, directory, path\nimport glob                             # find file on directory with wildcard\nimport pickle                           # save/load object on python into/from binary file\nimport re                               # find regex pattern on string\nimport scipy                            # scientific computing\nimport statsmodels.api as sm   
         # statistic lib for python\nimport requests                         # http for human\nfrom bs4 import BeautifulSoup           # tool for scrape web page\n```\n### Input Output\nCreate DataFrame from list / dict.\n```python\ndata = [{'name': 'A', 'height': 172, 'weight': 78},\n        {'name': 'B', 'height': 168, 'weight': 75},\n        {'name': 'C', 'height': 183, 'weight': 81},\n        {'name': 'D', 'height': 175, 'weight': 77}]\npd.DataFrame(data)                                     # from list of dict\n\ndata = {'name': ['A', 'B', 'C', 'D'],\n        'height': [172, 168, 183, 175],\n        'weight': [78, 75, 81, 77]}\npd.DataFrame(data)                                     # from dict\n\ndata = [('A', 172, 78),\n        ('B', 168, 75),\n        ('C', 183, 81),\n        ('D', 175, 77)]\ncolumns = ['name', 'height', 'weight']\npd.DataFrame(data, columns=columns)                    # from records\n```\nGenerate random data.\n```python\nX = np.random.randn(100, 3)                              # 100 x 3 random std normal dist array\nX = np.random.normal(1, 2, size=(100, 3))                # 100 x 3 random normal with mean 1 and stddev 2\n\nfrom sklearn.datasets import make_regression, make_classification, make_blobs\n# generate 100 row data for regression with 10 feature but only 5 informative\nX, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=0.0, random_state=42)\n\n# generate 100 row data for classification with 10 feature but only 5 informative with 3 classes\nX, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_classes=3, random_state=42)\n\n# generate 100 row data for clustering with 10 feature with 3 cluster\nX, y = make_blobs(n_samples=100, n_features=10, centers=3, cluster_std=1.0, random_state=42)\n```\nLoad sample data.\n```python\nfrom sklearn.datasets import load_boston, load_digits, load_iris\nd = load_boston()                                          # load data dict 'like' of numpy.ndarray\ndf 
= pd.DataFrame(d.data, columns=d.feature_names)         # create dataframe with column name\ndf['TargetCol'] = d.target                                 # add TargetCol column\n```\n#### Flat File\n```python\ndf = pd.read_csv('data.csv', sep=',', index_col='col1', na_values='-', parse_dates=True)\ndf = pd.read_excel('data.xlsx', sheet_name='Sheet1', usecols='A,C,E:F')\n```\n#### SQL\n```python\ncon = pymysql.connect(user=user, password=pwd, database=db, host=host)       # mysql\ncon = pyodbc.connect('DRIVER={ODBC Driver 11 for SQL Server};'\n                     'SERVER=server_name;DATABASE=db_name;UID=username;PWD=password')       # sql server\ncon = create_engine('mysql+pymysql://username:password@host/database')       # needs: from sqlalchemy import create_engine\nquery = 'select * from employee where name = %(name)s'\ndf = pd.read_sql(query, con, params={'name': 'value'})                       # query or table name\ndf.to_sql(table_name, con, index=False, if_exists='replace')                 # if_exists = fail/replace/append\n```\n#### AWS Athena\n```python\nconn = pyathena.connect(aws_access_key_id=id, aws_secret_access_key=secret,\n                        s3_staging_dir=stgdir, region_name=region)\nquery = 'select * from employee'\ndf = pd.read_sql(query, conn)\n```\n#### GSpread\n```python\nscope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']\ncreds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)\nclient = gspread.authorize(creds)\nsheet = client.open('FileNameOnGDrive').get_worksheet(0)\ndf = get_as_dataframe(sheet, usecols=list(range(10)))       # use additional gspread_dataframe lib\ndata = sheet.get_all_values()\nheader = data.pop(0)\ndf = pd.DataFrame(data, columns=header)                     # only use gspread\n```\n#### BeautifulSoup\n```python\nHEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}\nres = requests.get(url, 
headers=HEADERS)                 # request url, with user agent on headers\nsoup = BeautifulSoup(res.content, 'html.parser')         # create soup object (from bs4 import BeautifulSoup)\nrows = soup.select('div.product')                        # list of all matches, see CSS selector cheatsheet in misc\ntext = soup.select_one('div.product \u003e a').text           # get text of first matching link\nhref = soup.select_one('div.product \u003e a')['href']        # get href attribute of first matching link\n```\n#### Scrapy\n```python\n# Shell commands (run in a terminal, not in Python):\n#   scrapy startproject project_name           # create new project\n#   cd project_name\n#   scrapy genspider spider_name url           # generate new spider\n#   scrapy crawl spider_name                   # run spider\n#   scrapy crawl spider_name -o result.csv     # run spider, save output as csv\n#   scrapy shell url                           # testing shell to specific url\n\n# Spider example 1 :\nimport scrapy\n\nclass QuotesSpider(scrapy.Spider):\n    name = 'quotes'\n    start_urls = ['http://quotes.toscrape.com']\n\n    def parse(self, response):\n        self.log('I just visited: ' + response.url)\n\n        # List of quotes\n        for quote in response.css('div.quote'):\n            item = {'author_name': quote.css('small.author::text').extract_first(),\n                    'text': quote.css('span.text::text').extract_first(),\n                    'tags': quote.css('a.tag::text').extract()}\n            yield item\n\n        # Follow pagination link\n        next_link = response.css('li.next \u003e a::attr(href)').extract_first()\n        if next_link:\n            full_next_link = response.urljoin(next_link)\n            yield scrapy.Request(url=full_next_link, callback=self.parse)\n```\n## Exploratory Data Analysis\n### Indexing\n```python\ndf.col1                                  # return series col1, easy way\ndf['col1']                               # return series col1, robust way\ndf[['col1', 'col2']]                     # return dataframe consisting of col1 and col2\ndf.loc[5:10, ['col1','col2']]            # return dataframe from row 5:10 column col1 and 
col2\ndf.iloc[5:10, 3:5]                       # return dataframe from row 5:10 column 3:5\ndf.head()                                # return first 5 rows, df.tail() return last 5 rows\ndf[df.col1 == 'abc']                     # filter by comparison, use ==, !=, \u003e, \u003c, \u003e=, \u003c=\ndf[(df.col1 == 'abc') \u0026 (df.col2 \u003e 50)]  # conditional filter, use \u0026(and), |(or), ~(not), ^(xor), .any(), .all()\ndf[df.col1.isna()]                       # filter by col1 is na\ndf[df.col1.isnull()]                     # filter by col1 is null, otherwise use .notnull()\ndf[df.col1.isin(['a','b'])]              # filter by col1 is in list\ndf[df.col1.between(70, 80)]              # filter by col1 value between 2 values\ndf.filter(regex = 'pattern')             # filter by regex pattern, see misc\nfor idx, row in df.iterrows():           # iterate dataframe by rows\n    print(row['col1'])                   # return index and series of row\n\npd.options.display.max_rows = len(df)    # default 60 rows\n```\n### Describe\n```python\ndf.shape                           # number of rows and cols\ndf.columns                         # columns dataframe\ndf.index                           # index dataframe\ndf.T                               # transpose dataframe\ndf.info()                          # info number of rows and cols, dtype each col, memory size\ndf.describe(include='all')         # statistical descriptive: unique, mean, std, min, max, quartile\ndf.skew()                          # degree of symetrical, 0 symmetry, + righthand longer, - lefthand longer\ndf.kurt()                          # degree of peakedness, 0 normal dist, + too peaked, - almost flat\ndf.corr()                          # correlation matrix\ndf.isnull().sum()                  # count null value each column, df.isnull() = df.isna()\ndf.col1.unique()                   # return unique value of col1\ndf.nunique()                       # unique value each column\ndf.sample(10)                  
    # return random sample 10 rows\ndf['col1'].value_counts(normalize=True)      # relative frequency of each value\ndf.sort_values(['col1'], ascending=True)     # sort by col1 ascending, .sort_index() for index\ndf.drop_duplicates(subset='col1', keep='first', inplace=True)     # drop duplicate based on subset\n```\n### Aggregate\n```python\ndf.sum()                           # use sum, count, median, min, mean, var, std, nunique, quantile([0.25,0.75])\ndf.groupby(['col1']).size()        # number of rows in each col1 group\ndf.groupby(df.col1).TargetCol.agg(['mean', 'count'])      # multi aggregate function on 1 column\ndf.groupby('col1').agg({'col2': 'count', 'col3': 'mean'}) # multi aggregate function on multi columns\ndf.pivot(index='col1', columns='col2', values='col3')     # reshape to pivot, error when duplicate\ndf.pivot_table(index='col1', columns='col2', values='col3', aggfunc='sum')     # pivot table, like excel\nflat = pd.DataFrame(df.to_records())            # flatten multiindex dataframe\n```\n### Plotting\n#### Scatter plot\n```python\nplt.scatter(x, y, c=c, s=s)\n# x, y, c, s array like object, c (color) can be color format string, s (size) can be scalar\n# pass c and s by keyword: the positional order of plt.scatter is (x, y, s, c)\n# also df.plot.scatter(x='col1', y='col2', c='col3', s='col4') or\n# sns.scatterplot(x='col1', y='col2', hue='col3', size='col4', style='col5', data=df)\n```\n![scatterplot](https://seaborn.pydata.org/_images/seaborn-scatterplot-13.png)\n#### Line Plot\n```python\nplt.plot(x, y, 'ro--')\n# x and y array like object, 'ro--' means red circle marker with dashed line (see matplotlib cheatsheet below)\n# also written as plt.plot(x, y, color='r', marker='o', linestyle='--'), you can also use df.plot() or\n# sns.lineplot(x='col1', y='col2', hue='col3', size='col4', data=df)\n```\n![lineplot](https://matplotlib.org/_images/sphx_glr_set_and_get_001.png)\n#### Joint Plot\n```python\nsns.jointplot(x='col1', y='col2', data=df, kind='reg')     # kind = scatter/reg/resid/kde/hex\n# joins two marginal distribution plots with the chosen kind of 
plot\n```\n![jointplot](https://seaborn.pydata.org/_images/seaborn-jointplot-2.png)\n#### Pair Plot\n```python\nsns.pairplot(df, x_vars=['col1'], y_vars=['col2'], hue='col3', kind='scatter', diag_kind='auto')\n# grid of pairwise plots, x_vars/y_vars select columns, kind = scatter/reg, diag_kind = hist/kde\n```\n![pairplot](https://seaborn.pydata.org/_images/seaborn-pairplot-2.png)\n#### Regression Plot\n```python\nsns.regplot(x='col1', y='col2', data=df, ci=95, order=1)\n# scatter plot + regression fit, ci (confidence interval 0-100), order (polynomial order)\n```\n![regplot](https://seaborn.pydata.org/_images/seaborn-regplot-1.png)\n#### Pie Plot\n```python\nplt.pie(x, labels=labels, explode=explode, autopct='%1.1f%%')\n# x, labels, explode array like (pass labels and explode by keyword), also df.plot.pie(y='col1'), labels taken from index\n```\n![pieplot](https://matplotlib.org/_images/sphx_glr_pie_features_001.png)\n#### Histogram Plot\n```python\nplt.hist(x, bins=50, density=False, cumulative=False)\n# x array like, density (probability density), cumulative (cumulative counts)\n# also df['col1'].plot.hist() or sns.histplot(x) (sns.distplot is deprecated since seaborn 0.11)\n```\n![histplot](https://matplotlib.org/_images/sphx_glr_pyplot_text_001.png)\n#### Bar Plot\n```python\nplt.bar(x, y)     # or plt.barh(x, y)\n# x array like, also df.plot.bar(x='col1', y=['col2','col3'], stacked=True, subplots=False)\n# sns.countplot(x='col1', hue='col3', data=df)     # counts rows per category, so pass x or y, not both\n# sns.barplot(x='col1', y='col2', hue='col3', data=df, orient='v')\n```\n![barplot](https://matplotlib.org/_images/sphx_glr_barchart_001.png)\n#### Strip Plot\n```python\nsns.stripplot(x='col1', y='col2', hue='col3', data=df, jitter=True, dodge=False, orient='v')\n# for small data, jitter=True keeps points from being drawn on top of each other\n```\n![stripplot](https://seaborn.pydata.org/_images/seaborn-stripplot-4.png)\n#### Swarm Plot\n```python\nsns.swarmplot(x='col1', y='col2', hue='col3', data=df, dodge=False, orient='v')\n# for small data, clearer than stripplot, dodge=True makes each category in hue 
separable\n```\n![swarmplot](https://seaborn.pydata.org/_images/seaborn-swarmplot-4.png)\n#### Box Plot\n```python\nsns.boxplot(x='col1', y='col2', hue='col3', data=df, dodge=False, orient='v')\n# for large data, include median, Q1 \u0026 Q3, IQR (Q3-Q1), min (Q1-1.5*IQR), max (Q3+1.5*IQR) and outliers\n```\n![boxplot](https://seaborn.pydata.org/_images/seaborn-boxplot-2.png)\n#### Violin Plot\n```python\nsns.violinplot(x='col1', y='col2', hue='col3', data=df, dodge=False, orient='v')\n# kernel density plot (KDE) for visualize clearly distribution of data\n```\n![violinplot](https://seaborn.pydata.org/_images/seaborn-violinplot-4.png)\n#### Categorical Plot\n```python\nsns.catplot(x='col1', y='col2', hue='col3', data=df, row='col4', col='col5', col_wrap=4, \nkind='strip', sharex=True, sharey=True, orient='v')\n# categorical plot with facetgrid options \n```\n![catplot](https://seaborn.pydata.org/_images/seaborn-catplot-5.png)\n#### Heat Map\n```python\nsns.heatmap(df.corr(), annot=True, fmt='.2g', annot_kws={'size': 8}, square=True, cmap=plt.cm.Reds)\n# useful for plot correlation, annot (write value data), fmt (format value), square, cmap (color map)\n# other option use df.corr().style.background_gradient(cmap='coolwarm').set_precision(2)\n```\n![heatmap](https://seaborn.pydata.org/_images/seaborn-heatmap-1.png)\n#### Properties\n```python\nplt.figure(figsize=(15,8))\nfig, ax = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(15,4))    # subplots, access with ax[0,1]\nplt.title('title')          # or ax.set_title\nplt.xlabel('foo')           # or plt.ylabel, ax.set_xlabel, ax.set_ylabel\nplt.xticks(x, labels)       # x and labels list, or ax.set_xticks\nplt.xticks(rotation=90)     # rotate xticks\nplt.xlim(0, 100)            # or ylim, ax.set_xlim, ax.set_ylim\nplt.legend(loc='best')      # or ax.legend, loc = upper/lower/right/left/center/upper right\nplt.rcParams['figure.figsize'] = (16, 10)      # setting default figsize\nplt.style.use('classic')    # 
find all styles in plt.style.available\n\ng = sns.FacetGrid(df, row='col1', col='col2', hue='col3')     # comparable subplot row by col1 col by col2\ng.map(plt.hist, 'col4', bins=50)                              # with histogram count col4\ng.map(plt.scatter, 'col4', 'col5')                            # or with scatter plot col4 and col5\n```\n## Preprocessing\n### Feature Engineering\nBasic Operation\n```python\ndf['new_col'] = df.col1 / 1000                       # create new column\ndf = df.drop('col1', axis=1)                         # drop column\ndf = df.drop(df[df.col1 == 'abc'].index)             # drop rows where col1 equals 'abc'\ndf.col1 = df.col1.astype(str)                        # convert column to string, also use 'category', 'int32'\ndf.col1 = pd.to_numeric(df.col1, errors='coerce')    # convert column to numeric, invalid values become NaN\ndf.col1 = pd.to_datetime(df.col1, errors='coerce')   # convert column to datetime, invalid values become NaT\ndf.col1 = pd.Categorical(df.col1, categories=['A','B','C'], ordered=True)     # convert column to category\n\ndf.col1.str[:2]                     # access string method/properties\ndf.col1.dt.strftime('%d/%m/%Y')     # access datetime method/properties\ndf.values                           # convert dataframe to numpy array\n```\nMap, Apply, Applymap\n```python\n# Map (Series only)\nd = {1: 'one', 2: 'two', 3: 'three'}\ndf.col1.map(d)\n\n# Apply (Series)\ndf.col1.apply(func)          # element wise\n\n# Apply (Dataframe)\nsum_col1, sum_col2 = df[['col1', 'col2']].apply(sum)                 # axis=0 (default): func receives each column\ndf['new_col'] = df.apply(lambda x: x['col1'] + x['col2'], axis=1)    # axis=1: func receives each row\n\n# Applymap (Dataframe only)\ndf.applymap(func)            # element wise (renamed to DataFrame.map in pandas 2.1)\n\n# Numpy Vectorize\nnp.vectorize(func)(A)        # element wise\n\n```\n\n### Merge Data\n```python\n# Append (DataFrame.append was removed in pandas 2.0, prefer pd.concat)\ndf1.append(df2, ignore_index=True)                # stacked vertical with reset index\n# Concat\npd.concat([df1, df2, df3], ignore_index=True)     # stacked vertical with reset index, 
axis=0\n\n# Join\n# Merge\ndf1.merge(df2, on='key_col')\npd.merge(df1, df2, on='key_col', how='inner')     # how: left, right, outer, inner\npd.merge(df1, df2, left_on='lkey_col', right_on='rkey_col')\n```\n\n### Missing Value\n```python\nfrom sklearn.impute import SimpleImputer     # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22\n\ndf = df.fillna(0)                                                # fill with a constant, or propagate with df.ffill() / df.bfill()\nimp = SimpleImputer(missing_values=np.nan, strategy='mean')      # strategy = mean/median/most_frequent/constant\nimp.fit(X)\nX = imp.transform(X)             # perform imputing, or use .fit_transform\n```\n### Categorical Feature\n```python\nfrom sklearn.preprocessing import LabelBinarizer, LabelEncoder, OrdinalEncoder, OneHotEncoder\n\ndf = pd.get_dummies(df, columns=['col1'], prefix='col1')\n\nenc = LabelBinarizer()                          # label with value 0 or 1\nenc = LabelEncoder()                            # label with value 0 to n-1\nenc = OrdinalEncoder()                          # label with value 0 to n-1, multi column\nenc = OneHotEncoder(handle_unknown='error')     # create n binary dummy columns, handle_unknown = error/ignore\n\nenc.fit(X)\nX = enc.transform(X)             # perform encoding, or use .fit_transform\nX = enc.inverse_transform(X)     # decode back to original\n\n\n```\n### Transform\n```python\n\n\n```\n\n### Scaling and Normalize\n```python\nfrom sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer\n\nscaler = StandardScaler()                              # scale data to mean 0 and stddev 1\nscaler = MinMaxScaler(feature_range=(0, 1))            # scale data to 0 to 1 (can be set as -1 to 1)\nscaler = RobustScaler(quantile_range=(25.0, 75.0))     # scale with median and IQR, robust to outliers\nscaler = Normalizer(norm='l2')                         # normalize each sample (row) to unit norm\n\nscaler.fit(X)\nX = scaler.transform(X)             # perform scaling, or use .fit_transform\nX = scaler.inverse_transform(X)     # scale back to original (not available for Normalizer)\n```\n\n## Training Model\n### Feature Selection\n### Cross Validation\n```python\n# Train test split\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n\n# Train test split with 
Numpy\nnp.random.shuffle(data)\nborder = round(data.shape[0] * 0.7)\ntrain, test = np.split(data, [border])\n\n# Cross Validation\ncv = cross_val_score(model, X, y, cv=5, scoring='r2')     # (Stratified) KFold CV, return array of scores\n```\n### Train Model\n```python\n# Classification model\nclf = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')\n# C inverse regularization, smaller C stronger reg, l1 penalty needs solver liblinear or saga\n\nclf = KNeighborsClassifier(n_neighbors=3)\n# n_neighbors number of neighbors\n\nclf = DecisionTreeClassifier(max_depth=3, criterion='entropy')\n# criterion = gini/entropy, other params: min_samples_split, min_samples_leaf, max_features\n```\n### Evaluation\n```python\naccuracy = model.score(X_test, y_test)     # accuracy for classifiers, R^2 for regressors\n```\n### Hyperparameter Tuning\n```python\n# Grid Search CV: Try every parameter combination\nparams = {'C': [1, 0.5, 0.1, 0.05, 0.01]}\ngrid_cv = GridSearchCV(model, params, cv=5, scoring='r2')\ngrid_cv.fit(X, y)     # print grid_cv.best_params_ and grid_cv.best_score_\n\n# Randomized Search CV: Try n_iter random parameter samples drawn from the given distributions\nimport scipy.stats\nparams = {'C': scipy.stats.uniform(0.01, 1)}     # C drawn uniformly from [0.01, 1.01]\nrandomsearch_cv = RandomizedSearchCV(model, params, cv=5, scoring='r2')\nrandomsearch_cv.fit(X, y)     # print randomsearch_cv.best_params_ and randomsearch_cv.best_score_\n```\n### Pipeline\n\n## Neural Network\n### Keras Model\n### Build Keras Model\n```python\n# assumes: from tensorflow.keras.models import Sequential; from tensorflow.keras.layers import Dense\n# Model\nmodel = Sequential()\n# First Layer\nmodel.add(Dense(20, activation='relu', input_shape=(X_train.shape[1],)))\n\n# Hidden Layer\nmodel.add(Dense(20, activation='relu'))\n\n# Output Layer\nmodel.add(Dense(CLASS, activation='softmax'))\n\n# Compile Model\nmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n# loss = categorical_crossentropy, mean_squared_error\n# optimizer = sgd, adam\n\n# Model Summary\nmodel.summary()\n```\n### Create Keras Callback\n```python\ncallback = [ReduceLROnPlateau(patience=5),     # reduce learning rate when metrics stop improving\n     
       EarlyStopping(patience=5),         # stop training when metrics stop improving\n            ModelCheckpoint(filepath='best_model.h5', save_best_only=True)]     # keep only the best model seen so far\n```\n### Train Keras Model\n```python\nhistory = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), callbacks=callback)\n# history.history returns a dict with the value of every metric at each epoch\n```\n### Evaluate Keras Model\n```python\npred = model.predict(X_val).argmax(axis=1)                            # multiple prediction\npred = model.predict(X_train[0, :].reshape(1, col_shape)).argmax()    # single prediction (first sample)\nscore = round(model.evaluate(X_train, y_train)[1]*100, 2)             # metric score\n```\n## Miscellaneous\n### Basic Python\n#### Basic data type\nMath\n```python\nnp.ceil(5.7)        # return 6.0\nnp.floor(5.7)       # return 5.0\nnp.round(5.7)       # return 6.0\nround(5.7)          # return 6\ns.min()             # minimum of array/series s, or use min(s)\ns.max()             # maximum of array/series s, or use max(s)\n```\nString\n```python\ns.isalnum()            # return True if alphabetic or numeric\ns.isnumeric()          # return True if numeric\ns.isalpha()            # return True if alphabetic\n'abc' in s             # return True if 'abc' found in s, otherwise use not in\ns.find('abc')          # return index where substring 'abc' found in s, -1 if not found\n'abcd: {}'.format(x)   # replace {} with value of x, use {:,} for thousand separator, {:.2%} for percent with 2 decimals\ns.strip()              # return s without leading and trailing whitespace, also use lstrip() or rstrip()\n'1,2,3'.split(',')     # return list of ['1', '2', '3']\n','.join(['a', 'b', 'c'])  # join string, return 'a,b,c'\n```\nList, Tuple \u0026 Set\n```python\ns = [1,2,3,4]      # or use list((1,2,3,4))\n1 in s             # return True if 1 found in s\ns1 + s2            # concatenate s1 and s2\ns[1]               # second item in s\ns[1:5]             # slice s from index 1 up to (not including) 5 (4 elements)\ns[5:1:-1]          # slice s backwards from index 5 down to 2\ns.index(3)         # return index of 3 in s\ns.append(3)        # 
append 3 to end of s\ns.reverse()        # reverse s in place, returns None, or use list(reversed(s))\ns.sort()           # sort s asc in place, returns None, or use sorted(s)\ns = (1,2,3,4)      # create tuple, immutable list, or use tuple([1,2,3,4])\ns = {1,2,3,4}      # create set, unique items, or use set([1,2,3,4])\nlist(map(lambda x: x*2, s))       # map list\nlist(filter(lambda x: x\u003e2, s))    # filter list\nfor idx, val in enumerate(s):     # iterate over list with index\n    print(idx, val)\n```\nDictionary\n```python\nd = {'a': 1, 'b': 2, 'c': 3}   # or use dict(a=1,b=2,c=3)\nd.keys()                       # return all keys, or use list(d)\nd.values()                     # return all values\n'a' in d                       # return True if 'a' is key in d\nd['a']                         # get value with key='a'\ndel d['a']                     # remove item with key='a'\nfor key, val in d.items():     # iterate over dict\n    print(key, val)\n```\nNumpy.Array\n```python\na = np.array([[1,2],[3,4]])     # create array 2x2\n\n```\n#### Pickle\n```python\ntry:                                           # check if there is pickle file\n    with open('data.pickle', 'rb') as f:       # open pickle file\n        data = pickle.load(f)                  # load data from pickle\n\nexcept FileNotFoundError:                      # if pickle file not found\n    data = some_process()                      # run some process to get data\n    with open('data.pickle', 'wb') as f:       # create new blank pickle file\n        pickle.dump(data, f)                   # save data to pickle\n```\n\n#### Others\n```python\nos.listdir()          # get all filenames in current directory\n```\n### Regex Cheatsheet\n```\n# Cheatsheet:\n\\d      # any digit\n\\D      # anything but digit\n\\w      # any word character (letter, digit, underscore)\n\\W      # anything but word character\n.       # any character except newline\n\\b      # word boundary\n\\.      # dot character\n+       # match 1 or more, use +? 
for non-greedy\n*       # match 0 or more, use *? for non-greedy\n?       # match 0 or 1\n^       # start string\n$       # end string\n()      # capturing group, use (?:) for non-capturing group\n{}      # expect 1-3, ex \\d{3} expect 3 digit number, \\d{1,3} expect 1-3 digit number\n[]      # range, ex [A-Z] all capital letter, [abcd] letter a, b, c, d, [^abc] letter NOT a, b, c\n|       # either or, ex \\d{1,4}|[A-z]{4,10} expect 1-4 digit number or 4-10 character word\n\\n      # newline\n\\s      # space\n\\t      # tab\n\\e      # escape\n\\r      # return\n\n# Sample:\n'Chapter\\s\\d{1,4}[\\,\\.]?\\d{0,1}'        # get: Chapter 1, Chapter 23, Chapter 649, Chapter 120.5, Chapter 120,5\n'[a-zA-Z0-9.]+@[a-zA-Z0-9-]+\\.\\w+'      # get email address\n\n# Usage:\nre.findall(r\"\\w+ly\", text)                      # return ['carefully', 'quickly']\nphone = re.sub(\"[a-zA-Z '()+-]\", '', phone)     # substitute character with ''\n\n```\n### Datetime Cheatsheet\n```python\n# Cheatsheet:\n%Y   # Year 4 digit, ex: 1992, 2008, 2014\n%y   # Year 2 digit, ex: 92, 08, 14\n%m   # Month 2 digit, ex: 01, 02, ..., 12\n%b   # Month abbreviation, ex: Jan, Feb, ..., Dec\n%B   # Month full, ex: January, February, ..., December\n%d   # Day 2 digit, ex: 01, 02, ..., 31\n%w   # Weekday 1 digit, ex: 0 (Sunday), 1, ..., 6\n%a   # Weekday abbreviation, ex: Sun, Mon, ..., Sat\n%A   # Weekday full, ex: Sunday, Monday, ..., Saturday\n%H   # Hour (24), ex: 01, 02, ..., 23, 00\n%I   # Hour (12), ex: 01, 02, ..., 11, 12\n%M   # Minute, ex: 00, 01, ..., 59\n%S   # Second, ex: 00, 01, ..., 59\n%f   # Microsecond, ex: 000000, 000001, ..., 999999\n%c   # Local datetime representation, ex: Tue Aug 16 21:30:00 1988\n%x   # Local date representation, ex: 08/16/88\n%X   # Local time representation, ex: 21:30:00\n\n# Sample:\n'%d-%m-%Y %H:%M:%S'     # Personal preferred datetime, ex: 16-08-1988 21:30:00\n\n# Usage:\ncur_date = datetime.datetime.now()     # return current 
datetime\ncur_date.strftime('%d/%m/%Y')          # convert datetime object to string\ndatetime.datetime.strptime('16/08/1988', '%d/%m/%Y')    # convert string to datetime object\ndatetime.datetime.fromtimestamp(1575278092)             # convert unix timestamp to datetime\n\n```\n### CSS Selector Cheatsheet\n```python\n# Cheatsheet:\ndiv                # div\n#abc               # id abc\n.abc               # class abc\ndiv.abc            # div with class abc\ndiv.abc.def        # div with class both abc and def\ndiv a              # a inside div\ndiv \u003e a            # a directly inside div\ndiv + p            # p immediately after div\ndiv ~ p            # p after div\na[target=_blank]   # a with attribute target=\"_blank\"\ndiv[style*=\"border:1px\"]     # div with style containing \"border:1px\"\ndiv[style^=\"border:1px\"]     # div with style beginning with \"border:1px\"\ndiv[style$=\"border:1px\"]     # div with style ending with \"border:1px\"\n\n# Usage:\nlinks = soup.select('div.abc \u003e a')\nfor link in links:\n    print(link['href'])\n```\n### Matplotlib Cheatsheet\n```python\n# Line Styles Cheatsheet:\n-    # solid line\n--   # dashed line\n-.   # dash-dot line\n:    # dotted line\n\n# Marker Styles Cheatsheet:\n.   
# point\no   # circle\n^   # triangle up, also v, \u003c, \u003e\ns   # square\np   # pentagon\n*   # star\n+   # plus\nx   # x\n|   # v line\n-   # h line\n\n# Color Styles Cheatsheet:\nb   # blue\ng   # green\nr   # red\nc   # cyan\nm   # magenta\ny   # yellow\nk   # black\nw   # white\n\n# Cmap Cheatsheet:\nviridis, plasma, Reds, cool, hot, coolwarm, hsv, Pastel1, Pastel2, Paired, Set1, Set2, Set3\nplt.colormaps()     # return all possible cmap\n\n# Usage:\nplt.plot(x, y, 'go--')\nplt.plot(x, y, color='g', marker='o', linestyle='--')\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Firfanchahyadi%2Fml-notes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Firfanchahyadi%2Fml-notes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Firfanchahyadi%2Fml-notes/lists"}