{"id":18941142,"url":"https://github.com/gaussalgo/mlp_2017_workshop","last_synced_at":"2025-04-15T20:31:48.250Z","repository":{"id":130589124,"uuid":"87938431","full_name":"gaussalgo/MLP_2017_workshop","owner":"gaussalgo","description":"Machine Learning Prague 2017 - Advanced data analysis on Hadoop clusters workshop details. ","archived":false,"fork":false,"pushed_at":"2017-04-24T06:44:12.000Z","size":407,"stargazers_count":0,"open_issues_count":0,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-29T02:04:29.883Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gaussalgo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-11T13:38:31.000Z","updated_at":"2019-10-23T15:17:42.000Z","dependencies_parsed_at":"2023-03-13T11:15:12.892Z","dependency_job_id":null,"html_url":"https://github.com/gaussalgo/MLP_2017_workshop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaussalgo%2FMLP_2017_workshop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaussalgo%2FMLP_2017_workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaussalgo%2FMLP_2017_workshop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaussalgo%2FMLP_2017_workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/gaussalgo","download_url":"https://codeload.github.com/gaussalgo/MLP_2017_workshop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249147975,"owners_count":21220454,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T12:26:15.580Z","updated_at":"2025-04-15T20:31:48.237Z","avatar_url":"https://github.com/gaussalgo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MLP\\_2017\\_workshop\n\n## Introduction\n\nThis is the accompanying website for the MLPrague 2017 workshop *Advanced data analysis on Hadoop clusters*.\nSpecifically, it provides the source code for the machine learning part.\nA description of the data used can also be found below.\n\nThe source code is written for [Spark](http://spark.apache.org/).\n\n## Problem Statement\n\nThe practical part of machine learning is divided into two parts:\n1) Community detection in telecommunication networks\n2) Churn prediction in the telecommunication industry\n\nAs the churn prediction part relies on the results of community detection, it is necessary to run the code described in the Community Detection section first.\nThen the churn prediction part can be executed by running the *main.py* script.\n\n## Community Detection\n\nGiven the phone call records, the task is to find communities in a network created from these phone calls.\nCustomers are the vertices of this network, and edges link customers who called each other.\n\nThe presented solution creates a graph from one month of call records.\nOnly 
customers with at least 10 calls are linked together.\nThe [Label Propagation Algorithm](https://en.wikipedia.org/wiki/Label_Propagation_Algorithm) is used for community detection.\n\nThe Scala source code for Spark can be found in the phase\\_0\\_community\\_detection/ directory.\nThe Scala script assumes that the mlp\\_sampled\\_cdr\\_records.parquet data is available.\n\nThe script creates two new data files: lpa\\_20160301\\_20160401.parquet and lpa\\_20160401\\_20160501.parquet.\n\n## Churn Prediction\n\nIn this part, the task is to predict customers who are likely to churn.\nAll source code for this part is written in Python and is assumed to be run by PySpark.\nThe resulting machine learning model uses features extracted from one month and predicts potential churners for the next month.\nFor example, it takes phone call records from March and predicts which customers are likely to churn in April.\nFeatures are built from the input data described below.\n\nThis part is divided into three phases:\n1) Data preparation - creates various features from the input data\n2) Data preprocessing - imputes and transforms features; it also adds some new derived features\n3) Classification - trains a classification model on a training dataset and applies it to a test dataset\n\nEvaluation of the model is performed outside of those phases for the sake of detailed illustration.\n\n## Other Information\n\nThe *scripts* directory contains various Python scripts for data exploration.\nThe *scripts/move_data.py* script illustrates how to save Parquet data from a remote AWS S3 repository to a local repository.\n\n\n## Input Data Description\n\n*mlp\\_sampled\\_cdr\\_records.parquet* - phone call records from two months\n\n* record\\_type: string - type of voice record\n* date\\_key: string - date of the call\n* duration: integer - duration of the call in seconds\n* frommsisdn\\_prefix: string - operator prefix\n* frommsisdn: long - home operator number (either receiving or calling - according to the record 
type)\n* tomsisdn\\_prefix: string - operator prefix\n* tomsisdn: long - number of the second customer (may or may not belong to the home operator)\n\n\n*mlp\\_sampled\\_ebr\\_base\\_20160401.parquet*, *mlp\\_sampled\\_ebr\\_base\\_20160501.parquet* - information about home operator customers\n\n* msisdn: long - number of the customer\n* customer\\_type: string - either private or business\n* commitment\\_from\\_key: string - date of the commitment start\n* commitment\\_to\\_key: string - date of the commitment end\n* rateplan\\_group: long - name of the rateplan group\n* rateplan\\_name: long - name of the rateplan\n\n*mlp_sampled_ebr_churners_20151201_20160630.parquet* - list of churned customers from two months\n\n* msisdn: long - number of the customer\n* date\\_key: string - date of the churn\n\n## Description of Features\n\nNOTE: 'callcenters' are numbers behaving like callcenters - i.e. they call a huge number of phone numbers.\nWe select the top 12 such 'callcenters' from the data.\n\n* churned - binary label attribute\n* msisdn\n* customer\\_type\n* rateplan\\_group\n* rateplan\\_name\n* committed - whether the customer is committed at this point\n* committed\\_days - for how long the customer is committed\n* commitment\\_remaining - how many days until the end of the commitment\n* callcenter\\_calls\\_count - count of phone calls with so-called 'callcenters'\n* callcenter\\_calls\\_duration - total duration of phone calls with so-called 'callcenters'\n* cc\\_cnt\\_X1 - count of phone calls with callcenter X1, where X1 is the number of the callcenter\n* cc\\_dur\\_X1 - duration of phone calls with callcenter X1, where X1 is the number of the callcenter\n* cc\\_avg\\_X1 - average duration of phone calls with callcenter X1, where X1 is the number of the callcenter\n* cc\\_std\\_X1 - standard deviation of the duration of phone calls with callcenter X1, where X1 is the number of the callcenter\n* com\\_degree - vertex degree in the graph used for community 
detection\n* com\\_degree\\_total - vertex degree within the community\n* com\\_count\\_in\\_group - number of vertices in the same community\n* com\\_degree\\_in\\_group - sum of degrees in the vertex's community\n* com\\_score - score computed as degree / degree\\_in\\_group\n* com\\_group\\_leader - boolean; whether the vertex has the maximal score within the group\n* com\\_group\\_follower - boolean; whether the vertex has the minimal score within the group\n* com\\_churned\\_cnt - how many customers from the community churned\n* com\\_leader\\_churned\\_cnt - how many customer leaders from the community churned\n\nThe rest of the features represent various characteristics of phone calls.\nThe duration of calls is always expressed in seconds.\nMore specifically, \"dur\" represents duration, \"cnt\" count, \"avg\" average, \"std\" standard deviation.\nCalls may be to people belonging to the same operator (\"\\_t\\_\") or a different operator (\"\\_not\\_t\\_\"), or the operator may not be distinguished (\"all\").\nMoreover, there may be a distinction between incoming and outgoing calls.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgaussalgo%2Fmlp_2017_workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgaussalgo%2Fmlp_2017_workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgaussalgo%2Fmlp_2017_workshop/lists"}