{"id":13737352,"url":"https://github.com/30lm32/ml-projects","last_synced_at":"2025-05-08T13:33:51.570Z","repository":{"id":201732755,"uuid":"108330767","full_name":"30lm32/ml-projects","owner":"30lm32","description":"ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python","archived":false,"fork":false,"pushed_at":"2020-12-15T10:54:16.000Z","size":68974,"stargazers_count":268,"open_issues_count":0,"forks_count":110,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-01-08T05:01:41.801Z","etag":null,"topics":["ab-testing","deep-learning","docker","gensim","geolocation","imbalanced-data","kdtree","keras","lstm-neural-networks","machine-learning","mlflow","nlp","random-forest","spam-classification","svm","tensorboard","tensorflow","text-classification","timeseries-analysis","word2vec"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/30lm32.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-10-25T21:59:37.000Z","updated_at":"2025-01-07T05:27:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"9f1a11ee-e8da-47ef-82df-b68cb21c0679","html_url":"https://github.com/30lm32/ml-projects","commit_stats":null,"previous_names":["30lm32/ml-projects","erdiolmezogullari/ml-projects"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/30lm32%2Fml-projects","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/30lm32%2Fml-projects/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/30lm32%2Fml-projects/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/30lm32%2Fml-projects/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/30lm32","download_url":"https://codeload.github.com/30lm32/ml-projects/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253077658,"owners_count":21850361,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ab-testing","deep-learning","docker","gensim","geolocation","imbalanced-data","kdtree","keras","lstm-neural-networks","machine-learning","mlflow","nlp","random-forest","spam-classification","svm","tensorboard","tensorflow","text-classification","timeseries-analysis","word2vec"],"created_at":"2024-08-03T03:01:43.425Z","updated_at":"2025-05-08T13:33:51.562Z","avatar_url":"https://github.com/30lm32.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"\n\n![Image](https://cdn-images-1.medium.com/max/1600/1*60gs-SFYyooZZBxatuoNJw.jpeg)\n\n### Introduction\n---\n\nIn this passionated self paced collection repository, you will find many Machine Learning, Data Mining and Data Engineering challenges that I have been tackling, so far. 
Throughout this guideline you will walk through the details of projects and repositories.\n\nI hope that you would enjoy while checking out those repositories related to ML, Data Mining and Data Engineering on the table, below.\n\nYou may reach me whenever you want to get further information about projects.\n\n\n|__Problem__|__Methods__|__Libs__|__Repo__|\n|-|-|-|-|\n|[Conversion of Landing Page](#ab-testing-to-distinguish-impact-of-version-of-landing-page-on-user)|`A\\B Testing`, `Z test` |`pandas`, `statsmodel`|[Click](https://github.com/erdiolmezogullari/ml-ab-testing)|\n|[Integration of Fashion MNIST (CNN) Model into Tensorboard and MLflow](#fashion-mnist-with-tensorboard-and-mlflow)|`CNN`, `Deep Learning` |`Keras`, `MLflow`, `Pandas`, `Sklearn`|[Click](https://github.com/erdiolmezogullari/ml-fmnist-mlflow-tensorboard)|\n|[Dockerize an Apache Flink Application through Docker](#dockerize-an-apache-flink-application)| `Apache Flink Table \u0026 SQL` |`Apache Flink Table \u0026 SQL`, `Docker`, `Docker-Compose`|[Click](https://github.com/erdiolmezogullari/de-flink-sql-as-a-docker)|\n|[Crawler as a Service](#crawler-as-a-service)| Searching (`DFS`, `BFS`) |`GO`, `Neo4j`, `Redis`, `Docker`, `Docker-Compose`|[Click](https://github.com/erdiolmezogullari/de-crawler-as-a-service)|\n|[Prediction Skip Action on Music Dataset](#prediction-skip-action)|`LightGBM`, `Linear Reg`, `Logistic Reg.`|`Sklearn`, `LightGBM`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/ml-prediction-skip-action)|\n|[Hairstyle Classification](#hairstyle-classification)|`LightGBM`, `TF-IDF` |`Sklearn`, `LightGBM`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/ml-hairstyle-classification)|\n|[Time Series Analysis by SARIMAX](#time-series-analysis-by-sarimax)|`ARIMA`, `SARIMAX` |`statsmodels`, `pandas`, `sklearn`, `seaborn`|[Click](https://github.com/erdiolmezogullari/ml-time-series-analysis-sarimax)|\n|[Multi-language and Multi-label Classification Problem on 
Fashion Dataset](#multi-language-and-multi-label-classification-problem-on-fashion-dataset)|`LightGBM`, `TF-IDF` |`Sklearn`, `LightGBM`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/multi-label-classification)|\n|[Which one does it catch whole* SPAM SMS?](#which-one-does-it-catch-whole-spam-sms)|`Naive Bayesian`, `SVM`, `Random Forest Classifier`, `Deep Learning - LSTM`, `Word2Vec`|`Sklearn`, `Keras`, `Gensim`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/ml-spam-sms-classification)|\n|[Which novel do I belong To?](#which-novel-do-i-belong-to)|`Deep Learning - LSTM`, `Word2Vec`|`Sklearn`, `Keras`, `Gensim`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/ml-deep-learning-keras-novel)|\n|[Why do customers choose and book specific vehicles?](#why-do-customers-choose-and-book-specific-vehicles)|`Random Forest Classifier`|`Sklearn`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/ml-imbalanced-car-booking-data)|\n|[Forecasting impact of promos (promo1, promo2) on sales in Germany, Austria, and France](#forecasting-impact-of-promos-promo1-promo2-on-sales-in-germany-austria-and-france)|`Random Forest Regressor`, `ARIMA`, `SARIMAX`|`statsmodels`, `pandas`, `sklearn`, `seaborn`|[Click](https://github.com/erdiolmezogullari/ml-time-series-analysis-on-sales-data)|\n|[Deploying a Machine Learning model as a Service in a Docker container : MLasS](#deploying-machine-learning-model-as-a-service-in-a-docker-container--mlass)|`Random Forest Classifier`|`Flask`, `Docker`, `Redis`, `Sklearn`|[Click](https://github.com/erdiolmezogullari/ml-dockerized-microservice)|\n|[Random Forest Classification Tutorial in PySpark](#random-forest-classification-pyspark)| `Random Forest Classifier`|`Spark (PySpark)`, `Sklearn`, `Pandas`, `Seaborn`|[Click](https://github.com/erdiolmezogullari/ml-random-forest-pyspark)|\n|[Spatial data enrichment: Join two geolocation datasets by using 
Kdtree](#spatial-data-enrichment-join-two-geolocation-datasets-by-using-kdtree)|`Kd-tree`|`cKDTree`|[Click](https://github.com/erdiolmezogullari/ml-join-spatial-data)|\n|[Implementation of K-Means Algorithm from scratch in Java](#implementation-of-k-means-algorithm-from-scratch-in-java)|`K-Means`|`Java SDK`|[Click](https://github.com/erdiolmezogullari/ml-k-means)|\n|[Forecasting AWS Spot Price by using Adaboosting on Rapidminer](#forecasting-aws-spot-price-by-using-adaboosting-on-rapidminer)|`Adaboost Classifier`, `Decision Tree`|`Rapidminer`|[Click](https://github.com/erdiolmezogullari/ml-forecasting-aws-spot-price)|\n\nPlease, scroll down to see the details of projects comprehensively and visit their repository.\n\n### A/B Testing to Distinguish Impact of Version of Landing Page on User\n\n![](https://camo.githubusercontent.com/b6b4a987351274b68f606b1904cba146654ec7f1/68747470733a2f2f666f7875746563682e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31302f41422d6465706c6f796d656e742e706e67)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Conversion`|Retail|`A\\B Testing`, `Z test`|`pandas`, `statsmodel`|https://github.com/erdiolmezogullari/ml-ab-testing|\n\nIn this project, A/B testing was performed on Udacity's Course dataset. It consists of 5 columns, `\u003cuser_id, timestamp, group, landing_page, converted\u003e`. In A/B testing,  we used 3 columns of out of them, `group, landing_page, and converted`.\n\n We once simulated some experiments N times with respect to the conversion rates (`control, treatment`) already obtained over dataset. After got the further idea about dataset with this simulation, we supposed a null hypothesis and an alternative thesis. 
To support the alternative hypothesis, we calculated the critical z-score by using the `Z test` method with respect to alpha (0.05), and then we checked beta and power with respect to the effect size of the experiment.\n\nPlease note that you may check out [`ab_test.md`](https://github.com/erdiolmezogullari/ml-ab-testing/blob/master/ab_test.md) for further information about hypothesis testing and A/B testing, with some illustrative images.\n\n### Fashion MNIST with Tensorboard and Mlflow\n---\n![Image](https://miro.medium.com/max/571/1*evP6ekF_aPAxMzSL3LZmAg.png)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Prediction`| Fashion MNIST |`CNN`, `Deep Learning` |`Keras`, `MLflow`, `Pandas`, `Sklearn`| https://github.com/erdiolmezogullari/ml-fmnist-mlflow-tensorboard|\n\nIn this project, we used docker container technologies to create an ML platform from scratch.\nIt consists of four different docker containers (mlflow, notebook, postgres, tensorboard) that are defined in `docker-compose.yml`.\n\nThe details of the containers can be found under the `./platform` directory.\nEach container service has its own dockerfile in the corresponding directory (mlflow, notebook, postgres, tensorboard) under the platform directory.\n\n### Dockerize an Apache Flink Application\n---\n![Image](https://i.ytimg.com/vi/ej4juSB6MKs/hqdefault.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Implementation`| Click Stream Dataset | `Apache Flink Table \u0026 SQL` |`Apache Flink Table \u0026 SQL`, `Docker`, `Docker-Compose`| https://github.com/erdiolmezogullari/de-flink-sql-as-a-docker|\n\n\nIn this project, we used docker container technologies to launch a Flink cluster and a Flink App separately from scratch. The Flink Cluster (Platform) consists of two different docker containers (jobmanager, taskmanager) that are defined in docker-compose.flink.yml. 
Flink Application consists of one docker container that already using a dockerfile (./app-flink-base/Dockerfile) and a shell script (./app-flink-base/run.sh) to submit jar file to cluster in docker-compose-app-flink.yml.\n\n\n### Crawler as a Service\n---\n![Image](https://22570l2e793j2oo9c81ug2nh-wpengine.netdna-ssl.com/wp-content/uploads/2014/06/web-spider-cropped.png)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Implementation`| N/A | Searching (`BFS`, `DFS`) |`GO`, `Neo4j`, `Redis`, `Docker`, `Docker-Compose`| https://github.com/erdiolmezogullari/de-crawler-as-a-service|\n\nIn this project, a simple crawler service was implemented from scratch, and integrated into `Redis` and `Neo4j` NoSQL systems by using `Docker` and `Docker-compose`.\nThe crawler service is crawling the first target URL, and then, visiting the rest of URLs in the fetched HTML documents, respectively and recursively.\nWhile crawling a HTML documents corresponding to URLs, it could refer to 1 out of 2 different searching algorithms (`BFS, DFS`).\nThose searching algorithms were boosted by `go routines` in `GO` in order to speed up crawling service.\n\nDuring crawling, there is a possibility that a bunch of go routines that would be created may fetch and process the same HTML documents at the same time.\nIn this case, the crawler may create inconsistent data. Thus, `Redis` Key-Value NoSQL system was preferred using in this project to solve that problem and build a robust and consistent system.\n\nEach URL may referring to either the other different URL or itself in a HTML document. 
That relationship between two URLs can be called a Link.\nA simple way to represent the crawled Links and URLs is a graph data structure.\nThus, the `Neo4j` graph NoSQL database was used to represent and visualize the graph that consists of URLs and Links.\nDuring crawling, the crawling service either creates a new node for each URL and a new link for each URL pair, or updates existing nodes and links in `Neo4j` by using [`Cypher`](https://neo4j.com/developer/cypher-query-language/) queries.\n\n\n### Prediction Skip Action\n---\n![Image](https://raw.githubusercontent.com/erdiolmezogullari/ml-prediction-skip-action/2c3d0dcef096a475c6bf214c71cab23a22fd6bf8/img/waiting_time.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Prediction`|Music Dataset|`LightGBM`, `Linear Reg`, `Logistic Reg.`|`Sklearn`, `LightGBM`, `Pandas`, `Seaborn`| https://github.com/erdiolmezogullari/ml-prediction-skip-action|\n\nIn this project, we need to predict the probability of a skip action made by listeners who are listening to music. We don't have any class already labelled by anyone, so we need to create a target label that could solve the problem. A continuous target variable should therefore be picked as the target feature. Among the features we created, `per_listen (percentage of listen)` is the most suitable for that problem since it directly reflects the skipping action. If we pick it as the target feature, the problem becomes a scoring/probability problem, because it is the ratio of listening time, which ranges between 0 and 1.\n\nIf we want to convert that problem to a classification problem, we can determine a threshold for the skipping action as a rule of thumb. `per_listen` denotes what percentage of the track was listened to by the listener. So, our threshold could be 25%, 50%, even 51%, and so on. 
However, before making a decision, we can check the Complementary Cumulative Distribution Function (CCDF) of `per_listen`. It gives an idea of a reasonable threshold. According to the CCDF plot, 65% of the instances have a per_listen value greater than 0.5. Therefore, 0.5 is reasonable; however, being more realistic, a value below 0.5, around 0.25, would be more suitable for identifying a skipping action.\n\n### Hairstyle Classification\n---\n![Image](https://howng.com/wp-content/uploads/2016/10/traditional-hairstyles-e1477039899416.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Classification`|Hairstyle Dataset|`LightGBM`, `TF-IDF` |`Sklearn`, `LightGBM`, `Pandas`, `Seaborn`| https://github.com/erdiolmezogullari/ml-hairstyle-classification|\n\nIn this project, the dataset contains a sample of 10000 images mined from Instagram \nand clustered based on the hairstyle they showcase.  \n \nThe variable `cluster` represents the hairstyle cluster that the image has been assigned to by \nthe visual recognition algorithm. \n \nEach row contains the variable `url`, which is the link to the image, and the number of likes \ntogether with the `comments` per image.  The `user_id` is the unique id of the Instagram account \nfrom which the post comes and the variable `id` is the unique identifier associated with the post \nitself.\n\nEach post contains the date (`date_unix`) in unix format when the image was posted on \nInstagram, and additionally the date has been converted to different formats (`date_week` -\u003e non-iso number of the week, `date_month` -\u003e the month, `date_formated` -\u003e full date dd/mm/YY) partly \nfor use in prior analyses. Feel free to convert that variable in a way that suits your analysis. \n \nAdditionally, an `influencer_flag` label was added to each of the images that have more than \n500 likes, flagging them as influencer posts.  
\n\n### Time Series Analysis by SARIMAX\n---\n![Image](https://c1.sfdcstatic.com/content/dam/blogs/ca/Blog%20Posts/sales-forecasting-header.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Time Series Analysis`|Working Statistics|`ARIMA`, `SARIMAX` |`statsmodels`, `pandas`, `sklearn`, `seaborn`|https://github.com/erdiolmezogullari/ml-time-series-analysis-sarimax|\n\nIn this project, we use a time series analysis technique to decompose our data into 3 components, as listed below:\n\n    1-Trend (T)\n    2-Seasonality (S)\n    3-Residual (R)\n\nWe first need a stationary dataset before performing Time Series Analysis (TSA), because prediction is easier over a stationary dataset, whose mean and variance stay roughly constant over time. So, we need to delve into the raw dataset by applying some EDA techniques to expose valuable insights related to trend and seasonality, if they can be observed in EDA. After we complete the data analysis stage, we need to pick the best available techniques (e.g. ARIMA, SARIMAX) to apply to the dataset according to the knowledge gained in EDA.\n\nIn the EDA stage, we will apply a bunch of techniques, such as box plots and rolling statistics (mean, std) by time-based features (year, month, day, weekday and quarter), to roughly identify 2 of the 3 time series components (trend, seasonality) in specific plots. 
Those plots will give reasonable feedback for TSA before starting it.\n\nIn the TSA stage, we will build different models for the non-seasonal and seasonal approaches by using ARIMA and SARIMAX in the statsmodels package, respectively.\n\nSince the most challenging part of TSA is finding the optimum parameters (p,d,q) and (P,D,Q,S) of those techniques, we will refer to the Autocorrelation (ACF) and Partial Autocorrelation (PACF) functions to find significant time correlations that indicate either an Autoregression Model (AR) or a Moving Average Model (MA), or Seasonal Autoregression (SAR) and Seasonal Moving Average (SMA).\n\n### Multi-language and Multi-label Classification Problem on Fashion Dataset\n---\n![Image](http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/attributes.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Classification`|Fashion Dataset|`LightGBM`, `TF-IDF` |`Sklearn`, `LightGBM`, `Pandas`, `Seaborn`|https://github.com/erdiolmezogullari/multi-label-classification|\n\n\nIn this project, the dataset was collected from different fashion web sites. It consists of the 7 fields below.\n\n* `id`: A unique product identifier\n* `name`: The title of the product, as displayed on our website\n* `description`: The description of the product\n* `price`: The price of the product\n* `shop`: The shop from which you can buy this product\n* `brand`: The product brand\n* `labels`: The category labels that apply to this product\n\nThe text features (name, description) are in different languages, such as English, German and Russian. 
The target feature is multi-label (60 categories); the labels were tagged differently according to the corresponding category on each fashion web site.\n\n### Which one does it catch whole* SPAM SMS?\n---\n![Image](https://appliedmachinelearning.files.wordpress.com/2017/01/spam-filter.png)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`NLP`|Text|`Naive Bayesian`, `SVM`, `Random Forest Classifier`, `Deep Learning - LSTM`, `Word2Vec`|`Sklearn`, `Keras`, `Gensim`, `Pandas`, `Seaborn`|https://github.com/erdiolmezogullari/ml-spam-sms-classification|\n\nIn this project, we applied supervised learning (classification) algorithms and deep learning (LSTM).\n\nWe used a public [SMS Spam dataset](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), which is not an entirely clean dataset. The data consists of two columns (features): context and class. The column context refers to the SMS text. The column class takes a value of either `spam` or `ham` for the related SMS context.\n\nBefore applying any supervised learning methods, we applied a bunch of data cleansing operations to get rid of messy and dirty data, since the dataset has some broken and messy context.\n\nAfter obtaining the cleaned dataset, we created tokens and lemmas of the SMS corpus separately by using [Spacy](https://spacy.io/), and then we generated [bag-of-word](https://en.wikipedia.org/wiki/Bag-of-words_model) and [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) representations of the SMS corpus, respectively. 
In addition to these data transformations, we also performed [SVD](https://en.wikipedia.org/wiki/Singular-value_decomposition), [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) to reduce the dimension of the dataset.\n\nTo manage data transformation in the training and testing phases effectively and avoid [data leakage](https://www.kaggle.com/wiki/Leakage), we used Sklearn's [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class. So, we added each data transformation step (e.g. `bag-of-word`, `TF-IDF`, `SVC`) and classifier (e.g. `Naive Bayesian`, `SVM`, `Random Forest Classifier`) into an instance of class `Pipeline`.\n\nAfter applying those supervised learning methods, we also performed deep learning.\nThe deep learning architecture we used is based on [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory). To apply the LSTM approach in [Keras  (Tensorflow)](https://keras.io/), we needed to create an embedding matrix of our corpus. So, we used [Gensim's Word2Vec](https://radimrehurek.com/gensim/) approach to obtain the embedding matrix, rather than TF-IDF.\n\nAfter running each classifier, we plotted its [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to compare which one is the best classifier for filtering SPAM SMS.\n\n### Which novel do I belong To?\n---\n![Image](https://github.com/erdiolmezogullari/ml-deep-learning-keras-novel/blob/master/cover.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`NLP`|Text|`Deep Learning - LSTM`, `Word2Vec`|`Sklearn`, `Keras`, `Gensim`, `Pandas`, `Seaborn`|https://github.com/erdiolmezogullari/ml-deep-learning-keras-novel|\n\nThis project is a text classification problem that we tackled with a `Deep Learning (LSTM)` model, which classifies arbitrary paragraphs collected randomly from the 12 different novels listed below: \n\n    1. 
alice_in_wonderland\n    2. dracula\n    3. dubliners\n    4. great_expectations\n    5. hard_times\n    6. huckleberry_finn\n    7. les_miserable\n    8. moby_dick\n    9. oliver_twist\n    10. peter_pan\n    11. talw_of_two_cities\n    12. tom_sawyer\n\nIn other words, you can think of those novels as the target classes of our dataset.\nTo distinguish the actual class of a paragraph, the latent semantics among paragraphs play an important role. Therefore, we used `Deep Learning (LSTM)` on top of `Keras (Tensorflow)` after creating an embedding matrix with `Gensim's word2vec`.\n\nIf the sentences in a paragraph share latent semantics, \nwe assume that similar paragraphs were most likely collected from the same source (novel).\n\n### Why do customers choose and book specific vehicles?\n---\n![Image](https://cabrentalmysore.com/wp-content/uploads/2018/11/Booking-Car-reservation-online.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Imbalanced Data`|Car Booking|`Random Forest Classifier`|`Sklearn`, `Pandas`, `Seaborn`|https://github.com/erdiolmezogullari/ml-imbalanced-car-booking-data|\n\nIn this project, we built a machine learning model that answers the question of what the customer preference is on a car booking dataset.\n\nWe explored the dataset by using `Seaborn`, then transformed existing features and derived new ones as necessary.\n\nIn addition, the dataset is `imbalanced`. It means that the target variable's distribution is skewed. To overcome that challenge, a few different techniques (e.g. `over/under re-sampling techniques`) and intuitive approaches already exist. 
We also try to solve that problem using resampling techniques.\n\n### Forecasting impact of promos (promo1, promo2) on sales in Germany, Austria, and France\n---\n![Image](https://cdn-images-1.medium.com/max/1600/1*QHB8AhRSDDKpCV1WU1xFag.png)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Forecasting - Timeseries`|Sales|`Random Forest Regressor`|`statsmodels`, `pandas`, `sklearn`, `seaborn`|https://github.com/erdiolmezogullari/ml-time-series-analysis-on-sales-data|\n\nIn this project, we need to perform time series analysis to gain new insight about promos. Stores offer two types of promos, radio and TV, corresponding to promo1 and promo2, to increase their sales across Germany, Austria, and France. However, they don't know which promo is effective enough to do so. So, the impact of promos on their sales plays an important role in their preference.\n\nTo define a sound promo strategy, we first need to analyze the data in terms of the impact of promos. Since the data is a time series, we first used `time series decomposition`. After we decomposed the `observed` data into `trend`, `seasonal`, and `residual` components, we exposed the impact of promos clearly enough to decide which promo is better in each country.\n\nIn addition, we used `Random Forest Regression` in this forecasting problem to support our decision. 
\n\n### Deploying Machine Learning model as a Service in a Docker container : MLasS\n---\n![Image](https://i.ytimg.com/vi/AODHFqKBJRs/maxresdefault.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`ML Service`|Randomly Generated|`Random Forest Classifier`|`Flask`, `Docker`, `Redis`, `Sklearn`|https://github.com/erdiolmezogullari/ml-dockerized-microservice|\n\nIn this project, an `ML based micro-service` was developed on top of `REST` and `Docker` after building a machine learning model with `Random Forest`.\n\nWe used `docker-compose` to launch the micro services below.\n\n    1.Jupyter Notebook,\n    2.Restful Comm. (Flask),\n    3.Redis\n\nOnce the three containers are created, our MLasS is ready.\n\n### Random Forest Classification (PySpark)\n---\n![Image](https://www.kdnuggets.com/images/apache-spark-python-scala-605.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`PySpark`|Randomly Generated|`Random Forest Classifier`|`Spark (PySpark)`, `Sklearn`, `Pandas`, `Seaborn`| https://github.com/erdiolmezogullari/ml-random-forest-pyspark|\n\nIn this project, you can find a bunch of sample code showing how to use Spark's MLlib (Random Forest Classifier) and Pipeline via PySpark.\n\n### Spatial data enrichment: Join two geolocation datasets by using Kdtree\n---\n![Image](https://gistbok.ucgis.org/sites/default/files/DM66-Fig7.png)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Data Enrichment`|Spatial|`Kd-tree`|`cKDTree`|https://github.com/erdiolmezogullari/ml-join-spatial-data|\n\nIn this project, we built an efficient script that finds the closest airport to a given user based on the user's geolocation and the geolocation of the airport.\n\nTo perform that data enrichment, we used the `Kd-tree` algorithm.\n\n### Implementation of K-Means Algorithm from scratch in 
Java\n---\n![Image](https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/19344/versions/1/screenshot.jpg)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Implementation`|Statistics of Countries|`K-Means`|`Java SDK`| https://github.com/erdiolmezogullari/ml-k-means|\n\nIn this project, the K-Means clustering algorithm was implemented in Java from scratch.\nDataset: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means#Input_data\n\n### Forecasting AWS Spot Price by using Adaboosting on Rapidminer\n---\n![Image](https://image.slidesharecdn.com/leveragingelasticweb-scalecomputingwithaws-150326210749-conversion-gate01/95/leveraging-elastic-web-scale-computing-with-aws-5-638.jpg?cb=1463633063)\n\n|__Problem__|__Data__|__Methods__|__Libs__|__Link__|\n|-|-|-|-|-|\n|`Forecasting, Timeseries Analysis`|AWS EC2 Spot Price|`Adaboost Classifier`, `Decision Tree`|`Rapidminer`|https://github.com/erdiolmezogullari/ml-forecasting-aws-spot-price|\n\nIn this project, we will use public data that was collected by third parties and released through some specific websites. Since our data is mainly related to Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2), it consists of several different fields. EC2 is a kind of virtual machine in AWS’s cloud.\nA virtual machine can be created on demand, on either a private or public cloud over AWS, whenever you need it. Before creating it, a new virtual machine can be picked with respect to different specs and configurations in terms of CPU, RAM, storage, and network bandwidth limit. EC2 machines are also separated and managed by AWS in different geographical regions (US East, US West, EU, Asia Pacific, South America) and zones to increase the availability of virtual machines across the world. 
AWS has different instance segments, classified by AWS with respect to system specs for different goals (micro instance, general purpose, compute optimized, storage optimized, GPU instance, memory optimized). Payment options are dedicated, on-demand and spot instance. Since they incur different costs for a customer’s operation, customers may prefer different kinds of virtual machines according to their goals and budgets. In general, a spot instance is cheaper than the other options. However, a spot instance may be interrupted if the market price exceeds our max bid.\nIn our research, we will focus on spot instance payment. Our aim in this project is to select the correct AWS instance from the Spot Instance Market according to the requirements of the customer. We plan to apply a Decision Tree to streaming data to make decisions on the fly. It may be implemented as an incremental version of a decision tree since the data changes continuously.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F30lm32%2Fml-projects","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F30lm32%2Fml-projects","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F30lm32%2Fml-projects/lists"}