{"id":15632821,"url":"https://github.com/karolzak/support-tickets-classification","last_synced_at":"2025-10-04T06:30:22.545Z","repository":{"id":29287607,"uuid":"121242847","full_name":"karolzak/support-tickets-classification","owner":"karolzak","description":"This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en","archived":false,"fork":false,"pushed_at":"2022-06-21T21:14:47.000Z","size":3923,"stargazers_count":168,"open_issues_count":10,"forks_count":91,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-08T04:18:48.734Z","etag":null,"topics":["ai","artificial-intelligence","azure","azure-app-service","azure-machine-learning","azure-web-app-service","azure-webapp","classification","classifier","machine-learning","ml","model","numpy","pandas","python","text-analysis","text-classification","text-mining","text-processing","web-service"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/karolzak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-12T12:15:32.000Z","updated_at":"2025-02-17T05:18:49.000Z","dependencies_parsed_at":"2022-09-11T08:40:40.353Z","dependency_job_id":null,"html_url":"https://github.com/karolzak/support-tickets-classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/karolzak/support
-tickets-classification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karolzak%2Fsupport-tickets-classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karolzak%2Fsupport-tickets-classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karolzak%2Fsupport-tickets-classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karolzak%2Fsupport-tickets-classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/karolzak","download_url":"https://codeload.github.com/karolzak/support-tickets-classification/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karolzak%2Fsupport-tickets-classification/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263588296,"owners_count":23484896,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","azure","azure-app-service","azure-machine-learning","azure-web-app-service","azure-webapp","classification","classifier","machine-learning","ml","model","numpy","pandas","python","text-analysis","text-classification","text-mining","text-processing","web-service"],"created_at":"2024-10-03T10:45:24.453Z","updated_at":"2025-10-04T06:30:17.521Z","avatar_url":"https://github.com/karolzak.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Table of contents\n1. [Project description](#1-project-description)  \n2. 
[Results and learnings](#2-results-and-learnings)  \n    2.1. [Main challenge and initial assumptions](#21-main-challenge-and-initial-assumptions)  \n    2.2. [Dataset](#22-dataset)  \n    2.3. [Training and evaluation results](#23-training-and-evaluation-results)    \n    2.4. [Model deployment and usage](#24-model-deployment-and-usage)\n3. [Run the example](#3-run-the-example)  \n    3.1. [Prerequisites](#31-prerequisites)  \n    3.2. [Train and evaluate the model](#32-train-and-evaluate-the-model)  \n    3.3. [Deploy web service](#33-deploy-web-service)\n4. [Code highlights](#4-code-highlights)  \n\n\u003cbr\u003e\n\n# 1. Project description \n[[back to the top]](#table-of-contents)\n\nThis case study shows how to create a model for **text analysis and classification** and deploy it as a **web service in Azure cloud** in order to automatically **classify support tickets**.\u003cbr\u003e\nThis project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with [Endava](http://endava.com/en).\u003cbr\u003e\nOur combined team tried 3 different approaches to tackle this challenge, using:\n- [Azure Machine Learning Studio](https://studio.azureml.net/) - drag-and-drop machine learning tools\n- [Microsoft Cognitive Toolkit (CNTK)](https://github.com/Microsoft/CNTK) - deep neural networks framework\n- [Azure Machine Learning service](https://docs.microsoft.com/azure/machine-learning/service) - with Python and classic machine learning algorithms\n\n\n\n#### What you will find inside:     #### \n- How to clean and prepare text data and featurize it to make it usable in machine learning scenarios\n- How to strip the data of any sensitive information and also anonymize/encrypt it\n- How to create a classification model using Python modules like: [sklearn](http://scikit-learn.org/stable/), [nltk](https://www.nltk.org/), [matplotlib](https://matplotlib.org/), [pandas](https://pandas.pydata.org/)\n- How to create a web service with 
a trained model and deploy it to Azure\n- How to leverage [Azure Machine Learning Service](https://azure.microsoft.com/en-us/services/machine-learning-services/) to make working on ML projects easier, faster and much more efficient\n\n\n#### The team: ####\n- [Karol Żak](https://twitter.com/karolzak13) ([GitHub](https://github.com/karolzak)) - Software Development Engineer, Microsoft\n- [Filip Glavota](https://twitter.com/fglavota) - Software Development Engineer, Microsoft\n- [Ionut Mironica](https://www.linkedin.com/in/ionut-mironica-06b35a2a/) ([GitHub](https://github.com/imironica)) - Senior Developer, Endava \n- [Bogdan Dinu](https://www.linkedin.com/in/bogdanvdinu) - Senior Development Consultant, Endava\n- [Bogdan Marin](https://www.linkedin.com/in/bogdanmmarin) ([GitHub](https://github.com/bogdanm-marin)) - Senior Developer, Endava \n- [Florin Vinca](https://www.linkedin.com/in/vinca-florin-442ba229/) - Senior Developer, Endava\n- Ioana Raducanu - BI Analyst Developer, Endava\n- [Andreea Tipau](https://www.linkedin.com/in/andreea-tipau-309aa1124/) - Developer, Endava\n\n![](docs/endava_team.jpg)\n\n\n\u003cbr\u003e\n\n# 2. Results and learnings\n[[back to the top]](#table-of-contents)\n\n***Disclaimer:***\n*This POC and all the learnings you can find below are an outcome of close cooperation between Microsoft and [Endava](http://endava.com/en). Our combined team spent a total of 3 days on the challenge of automatic support ticket classification.*\n\n\n## 2.1. Main challenge and initial assumptions ##\n[[back to the top]](#table-of-contents)\n\n- The main challenge we tried to solve was to create a model for automatic support ticket classification for Endava's helpdesk solution. 
As Endava stated: currently helpdesk operators waste a lot of time evaluating tickets and trying to assign values to properties like: `ticket_type, urgency, impact, category, etc.` for each submitted ticket\n- The dataset we used is Endava's internal data imported from their helpdesk system. We were able to collect around 50k classified support tickets with original messages from users and already assigned labels\n- In our POC we focused only on tickets submitted in the form of an email, similar to the one below:\n![](docs/sample_email.jpg)\n\n\u003cbr\u003e\n\n## 2.2. Dataset ##    \n[[back to the top]](#table-of-contents)\n\n- For the sake of this repository, the data has been stripped of any sensitive information and anonymized (encrypted). In the original solution we worked on the full dataset without any encryption. You can download the anonymized dataset from [here](https://privdatastorage.blob.core.windows.net/github/support-tickets-classification/datasets/all_tickets.csv?sp=r\u0026st=2021-06-07T14:36:30Z\u0026se=2022-12-30T23:36:30Z\u0026spr=https\u0026sv=2020-02-10\u0026sr=b\u0026sig=Za0%2Fgbe%2FanVblbcYsCdQS5zTS5%2B17QKESzlbEXPp2KE%3D).\n\n- Example of anonymized and preprocessed data from [AML Workbench](https://docs.microsoft.com/en-us/azure/machine-learning/preview/quickstart-installation) view:  \n![](docs/sample_data.jpg)\n\n  \u003e[!Important]\n  \u003e[Azure Machine Learning service](https://docs.microsoft.com/azure/machine-learning/service) no longer supports the deprecated Workbench tool.\n\n- Below you can see a sample data transformation flow we used while preparing our dataset:  \n![](docs/data_steps.jpg)\n\n- After evaluating the data we quickly discovered that the distribution of values for most of the columns we wanted to classify is strongly unbalanced, with some of the unique values represented by as few as 1-2 samples. 
There are [multiple techniques](https://shiring.github.io/machine_learning/2017/04/02/unbalanced) to deal with this kind of issue, but due to the limited time available for this POC we were not able to test them in action.   \n\n- Distribution of values for each column:  \n\n    ticket_type   |  business_service\n    :-------------------------:|:-------------------------:\n    ![](docs/value_count_ticket_type.jpg) | ![](docs/value_count_business_service.jpg) \n\n    impact   |  urgency \n    :-------------------------:|:-------------------------:\n    ![](docs/value_count_impact.jpg) | ![](docs/value_count_urgency.jpg) \n\n    category   |  sub_category1\n    :-------------------------:|:-------------------------:\n    ![](docs/value_count_category.jpg) | ![](docs/value_count_sub_category1.jpg)\n\n    sub_category2   |  \n    :-------------------------:|\n    ![](docs/value_count_sub_category2.jpg) |\n\n\n\u003cbr\u003e\n\n## 2.3. Training and evaluation results ##\n[[back to the top]](#table-of-contents)\n\nIn order to train our models, we used [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-services/) to run training jobs with different parameters, compare the results and pick the one with the best values:\n\n![](docs/workbench_runs_1.jpg)\n\nTo train models we tested 2 different algorithms: [SVM](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) and [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes). In both cases the results were pretty similar, but for some of the models Naive Bayes performed much better (especially after tuning [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter)), so at some point we decided to work with NB only.    
\n\nBelow you can find some of the results of models we trained to predict different properties:\n\n- ### **`ticket_type`** ###    \n    We started by predicting the least unbalanced (and, from Endava's business point of view, most important) property, which is `ticket_type`. After training the model and finding the best hyperparameters using [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) (which improved precision and recall by around 4%), we were able to achieve some really good results, which you can see below:\n\n    | confusion matrix for `ticket_type` | metrics for `ticket_type` |\n    :-------------------------:|:-------------------------:\n    ![](docs/score_ticket_type.png) | ![](docs/score_ticket_type_acc.png)\n\n- ### **`business_service`** ###\n\n    The `business_service` property is one of the unbalanced properties, with a very low number of samples per class for most values.    \n    We started by running the training on a subset of our dataset from which we removed `business_service` values that were represented by fewer than 100 samples.    \n    Unfortunately that didn't help much and we still had a lot of classes that were not recognized at all. So we continued to increase the minimum required number of samples per class until we started to see some meaningful results:\n\n    \n    | confusion matrix for `business_service` | metrics for `business_service` |\n    :-------------------------:|:-------------------------:\n    ![](docs/score_business_service_min1000.png) | ![](docs/score_business_service_min1000_acc.png)\n\n- ### **`category`, `impact` and `urgency`** ###\n\n    To predict `category`, `impact` and `urgency` we took the same approach as with the `business_service` property, but the results looked even worse. It's obvious that such a level of imbalance within the data makes it impossible to create a model with any meaningful results.    
\n    If you looked only at the mean/average value of `precision` and `recall`, you could wrongly assume that the results are quite good, but if you check the values of `support` for each class it becomes clear that, because one class covers 70-90% of our data, the results are completely skewed:\n    \n    | confusion matrix for `category` | metrics for `category` |\n    :-------------------------:|:-------------------------:\n    ![](docs/score_category_min100.png) | ![](docs/score_category_min100_acc.png)\n    \n    | confusion matrix for `impact` | metrics for `impact` |\n    :-------------------------:|:-------------------------:\n    ![](docs/score_impact.png) | ![](docs/score_impact_acc.png)\n    \n    | confusion matrix for `urgency` | metrics for `urgency` |\n    :-------------------------:|:-------------------------:\n    ![](docs/score_urgency.png) | ![](docs/score_urgency_acc.png)\n\n\n\u003cbr\u003e\n\n## 2.4. Model deployment and usage ##\n[[back to the top]](#table-of-contents)\n\nThe final model will be used in the form of a web service running on Azure, which is why we prepared a sample RESTful web service written in Python using the [Flask module](http://flask.pocoo.org/). This web service makes use of our trained model and provides an API which accepts an email body (text) and returns the predicted properties.\n\nYou can find a running web service hosted on [Azure Web Apps](https://docs.microsoft.com/en-us/azure/app-service/app-service-web-overview) here: https://endavaclassifiertest1.azurewebsites.net/.    \nThe project we based our service on, including the code and all the deployment scripts, can be found here: [karolzak/CNTK-Python-Web-Service-on-Azure](https://github.com/karolzak/CNTK-Python-Web-Service-on-Azure).\n\n*Sample request and response in Postman:*\n![Demo](docs/postman_1.jpg)\n\n\u003cbr\u003e\n\n# 3. Run the example\n## 3.1. 
Prerequisites\n[[back to the top]](#table-of-contents)\n\n\n- **Download the content of this repo**\n\n    You can either clone this repo or just download it and unzip it to a folder\n\n- **Set up Python environment**\n\n    In order to run scripts from this repo you should have a proper Python environment set up. If you don't want to set it up locally you can use one of the [Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) images (both on [Linux](https://azuremarketplace.microsoft.com/marketplace/apps/microsoft-ads.linux-data-science-vm-ubuntu) and [Windows](https://azuremarketplace.microsoft.com/marketplace/apps/microsoft-ads.windows-data-science-vm)) on Azure. All of them come with the most popular data science and machine learning tools and frameworks already preinstalled and ready for you.\n\n- **Install dependencies**\n\n    Make sure to install all the dependencies for this project. You can easily do it by using the [requirements.txt](requirements.txt) file and running this command:\n\n    ```cmd\n    pip install -r requirements.txt\n    ```\n    Please report issues if you find any errors or missing modules, thanks!\n\n- **Download Endava support tickets dataset (all_tickets.csv)**\n\n    You can download the dataset from [here](https://privdatastorage.blob.core.windows.net/github/support-tickets-classification/datasets/all_tickets.csv?sp=r\u0026st=2021-06-07T14:36:30Z\u0026se=2022-12-30T23:36:30Z\u0026spr=https\u0026sv=2020-02-10\u0026sr=b\u0026sig=Za0%2Fgbe%2FanVblbcYsCdQS5zTS5%2B17QKESzlbEXPp2KE%3D) or by executing the [1_download_dataset.py](1_download_dataset.py) script. If you decide to download it manually, just make sure to put it under:\n    ```\n    project\n    └───datasets\n        └───all_tickets.csv\n    ```\n    Endava's support tickets dataset is already cleaned and stripped of any unnecessary words and characters. 
You can check some of the preprocessing operations that were used in the [0_preprocess_data.py](0_preprocess_data.py) script.\n\n    \n## 3.2. Train and evaluate the model\n[[back to the top]](#table-of-contents)\n\nTo train the model you need to run the [2_train_and_eval_model.py](2_train_and_eval_model.py) script. There are some parameters you can play around with - check out the [code highlights section](#4-code-highlights) for more info.\n\n## 3.3. Deploy web service\n[[back to the top]](#table-of-contents)\n\nInside the [webservice](webservice) folder you can find scripts to set up a Python-based RESTful web service (made with the Flask module).\n\nDeeper in that folder you can also find the [download_models.py](webservice/models/download_models.py) script, which can be used to download some already trained models that will be used by the web service app.\n\nIn order to deploy it to an environment like [Azure App Service](https://azure.microsoft.com/en-us/services/app-service/) you can check [this GitHub repo](https://github.com/karolzak/CNTK-Python-Web-Service-on-Azure) for some inspiration.\n\n\u003cbr\u003e\n\n# 4. Code highlights\n[[back to the top]](#table-of-contents)\n\n- [0_preprocess_data.py](0_preprocess_data.py) - collection of scripts we used for data preprocessing and anonymization. 
It is included in case you are interested in the steps we followed with our data\n\n    In order to clean our data we removed:\n    - headers and footers\n    - email metadata (like: from, to, cc, date, etc.)\n    - email addresses, phone numbers and URLs\n    - image references\n    - blacklisted words (Endava's sensitive information)\n    - non-English words - a few percent of the emails contained Romanian text\n    - all numerical values\n    - all non-alphabetic characters\n    - whitespace\n\n    In order to anonymize the dataset for publishing purposes we used [sklearn.preprocessing.LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html):\n    \n    ```Python    \n    from sklearn import preprocessing\n\n\n    def encryptSingleColumn(data):\n        # LabelEncoder maps each distinct value to an integer ID\n        # (label encoding rather than cryptographic encryption)\n        le = preprocessing.LabelEncoder()\n        le.fit(data)\n        return le.transform(data)\n\n\n    def encryptColumnsCollection(data, columnsToEncrypt):\n        for column in columnsToEncrypt:\n            data[column] = encryptSingleColumn(data[column])\n        return data\n    ```\n\n- [1_download_dataset.py](1_download_dataset.py) - simple script used to download our dataset (already preprocessed, cleaned and anonymized)\n\n- [2_train_and_eval_model.py](2_train_and_eval_model.py) - \n\n    - `column_to_predict` variable is used to determine which column should be used for classification\n\n        ```Python                \n        column_to_predict = \"ticket_type\"\n        # Supported columns:\n        # ticket_type\n        # business_service\n        # category\n        # impact\n        # urgency\n        # sub_category1\n        # sub_category2\n        ```\n\n    - You can play around with some variables in order to improve accuracy:\n    \n        ```Python\n        classifier = \"NB\"  # Supported algorithms: \"SVM\" and \"NB\"\n        use_grid_search = False  # grid search is used to find hyperparameters. 
Searching for hyperparameters is time consuming\n        remove_stop_words = True  # removes stop words from processed text\n        stop_words_lang = 'english'  # used with 'remove_stop_words' and defines language of stop words collection\n        use_stemming = False  # word stemming using nltk\n        fit_prior = True  # if use_stemming == True then it should be set to False ?? double check\n        min_data_per_class = 1  # minimum number of samples required for each class. Classes with fewer samples will be excluded from the dataset. Default value is 1\n        ```\n    \n    - Loading the dataset into a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) object:\n\n        ```Python                \n            \n        # dfTickets = package.run('AllTickets.dprep', dataflow_idx=0) \n        \n        # loading dataset from csv\n        dfTickets = pd.read_csv(\n            './datasets/all_tickets.csv',\n            dtype=str\n        )  \n        ```\n        \n    - Splitting the dataset using [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):\n\n        ```Python      \n        # Split dataset into training and testing data\n        train_data, test_data, train_labels, test_labels = train_test_split(\n            data, labelData, test_size=0.2\n        )  # split data to train/test sets with 80:20 ratio\n        ```\n        \n    - You can use one of 3 different count vectorizers for feature extraction from text:\n\n        ```Python                \n        # Extracting features from text\n        # Count vectorizer\n        if remove_stop_words:\n            count_vect = CountVectorizer(stop_words=stop_words_lang)\n        elif use_stemming:\n            count_vect = StemmedCountVectorizer(stop_words=stop_words_lang)\n        else:\n            count_vect = CountVectorizer()\n        ```\n        \n    - Creating a 
[Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to organize the transforms and the final estimator, and fitting it to the training data:\n\n        ```Python                \n        text_clf = Pipeline([\n            ('vect', count_vect),\n            ('tfidf', TfidfTransformer()),\n            ('clf', MultinomialNB(fit_prior=fit_prior))\n        ])\n        text_clf = text_clf.fit(train_data, train_labels)\n        ```\n\n    - [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) can be used to search for the best possible set of parameters for the learning algorithm:\n\n        ```Python                \n        if use_grid_search:\n            # Grid Search\n            # Here, we are creating a list of parameters for which we would like to do performance tuning.\n            # Each parameter name starts with the name of a pipeline step (the arbitrary names we gave above).\n            # E.g. 
vect__ngram_range; here we try unigrams only and unigrams plus bigrams, and keep whichever scores best.\n            \n            # NB parameters\n            parameters = {\n                'vect__ngram_range': [(1, 1), (1, 2)],\n                'tfidf__use_idf': (True, False),\n                'clf__alpha': (1e-2, 1e-3)\n            }\n\n            # SVM parameters\n            # parameters = {\n            #    'vect__max_df': (0.5, 0.75, 1.0),\n            #    'vect__max_features': (None, 5000, 10000, 50000),\n            #    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams\n            #    'tfidf__use_idf': (True, False),\n            #    'tfidf__norm': ('l1', 'l2'),\n            #    'clf__alpha': (0.00001, 0.000001),\n            #    'clf__penalty': ('l2', 'elasticnet'),\n            #    'clf__max_iter': (10, 50, 80),  # named 'clf__n_iter' in scikit-learn \u003c 0.19\n            # }\n\n            # Next, we create an instance of the grid search by passing the classifier, parameters\n            # and n_jobs=-1, which tells it to use all available CPU cores.\n            gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)\n            gs_clf = gs_clf.fit(train_data, train_labels)\n\n            # Print the best mean cross-validation score and the matching parameters\n            print(gs_clf.best_score_)\n            print(gs_clf.best_params_)\n        ```\n        \n    - Predicting labels for the test set, evaluating the accuracy of the model (with and without GridSearch) and printing out a simple [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html):\n\n        ```Python                \n        print(\"Evaluating model\")\n        # Score and evaluate model on test data using model without hyperparameter tuning\n        predicted = text_clf.predict(test_data)\n        prediction_acc = np.mean(predicted == test_labels)\n        print(\"Confusion matrix without GridSearch:\")\n        print(metrics.confusion_matrix(test_labels, predicted))\n        print(\"Mean without GridSearch: 
\" + str(prediction_acc))\n\n        # Score and evaluate model on test data using model WITH hyperparameter tuning\n        if use_grid_search:\n            predicted = gs_clf.predict(test_data)\n            prediction_acc = np.mean(predicted == test_labels)\n            print(\"Confusion matrix with GridSearch:\")\n            print(metrics.confusion_matrix(test_labels, predicted))\n            print(\"Mean with GridSearch: \" + str(prediction_acc))\n        ```\n        \n    - Plotting the confusion matrix using `heatmap` from the [seaborn](https://seaborn.pydata.org/generated/seaborn.heatmap.html) module:\n\n        ```Python                \n        # Plotting confusion matrix with 'seaborn' module\n        # Use below line only with Jupyter Notebook\n        # %matplotlib inline\n        import seaborn as sns\n        from sklearn.metrics import confusion_matrix\n        import matplotlib.pyplot as plt\n        mat = confusion_matrix(test_labels, predicted)\n        plt.figure(figsize=(4, 4))\n        sns.set()\n        # mat is transposed so that rows are predicted labels and columns are true labels\n        sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,\n                    xticklabels=np.unique(test_labels),\n                    yticklabels=np.unique(test_labels))\n        plt.xlabel('true label')\n        plt.ylabel('predicted label')\n        # Save confusion matrix\n        # plt.savefig(os.path.join('.', 'outputs', 'confusion_matrix.png'))\n        plt.show()\n        ```\n\n        The resulting confusion matrix should look similar to this:    \n        ![](docs/score_category_min100.png)        \n        \n    - Printing out the [classification report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html):\n\n        ```Python                \n        # Printing classification report\n        from sklearn.metrics import classification_report\n        print(classification_report(test_labels, predicted,\n                   
                 target_names=np.unique(test_labels)))\n        ```\n        \n        The resulting classification report should look similar to this:   \n        ![](docs/score_category_min100_acc.png)\n    \n    - Serializing trained models using the [pickle](https://docs.python.org/3/library/pickle.html#module-pickle) module:\n\n        ```Python                \n        # Save the trained model to the ./outputs folder\n        # (the grid search model if it was used, the plain pipeline otherwise)\n        model_to_save = gs_clf if use_grid_search else text_clf\n        model_path = os.path.join('.', 'outputs', column_to_predict + \".model\")\n        with open(model_path, 'wb') as model_file:\n            pickle.dump(model_to_save, model_file)\n        ```\n\n- [webservice.py](webservice/webservice.py) - \n\n    - Loading pretrained models from serialized files using pickle:\n\n        ```Python\n        # Load the serialized model and close the file afterwards\n        with open(os.path.join(__location__, \"ticket_type.model\"), \"rb\") as model_file:\n            model_ticket_type = pickle.load(model_file)\n        ```\n\n    - Extracting the `description` text from the request's JSON, passing it to the model to get a prediction and returning the result as JSON:\n\n        ```Python        \n        @app.route('/endava/api/v1.0/tickettype', methods=['POST'])\n        def tickettype():\n            ts = time.gmtime()\n            logging.info(\"Request received - %s\" % time.strftime(\"%Y-%m-%d %H:%M:%S\", ts))\n            print(request)\n            print(request.json)\n            if not request.json or 'description' not in request.json:\n                abort(400)\n            description = request.json['description']\n            print(description)\n\n            predicted = model_ticket_type.predict([description])\n            print(\"Predicted: \" + str(predicted))\n\n            ts = 
time.gmtime()\n            logging.info(\"Request sent to evaluation - %s\" % time.strftime(\"%Y-%m-%d %H:%M:%S\", ts))\n            return jsonify({\"ticket_type\": predicted[0]})\n        ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkarolzak%2Fsupport-tickets-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkarolzak%2Fsupport-tickets-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkarolzak%2Fsupport-tickets-classification/lists"}