{"id":20625384,"url":"https://github.com/vadniks/akabigdata","last_synced_at":"2025-09-23T23:11:54.205Z","repository":{"id":194079697,"uuid":"690009375","full_name":"vadniks/akaBigData","owner":"vadniks","description":"Technologies and tools for big data analysis","archived":false,"fork":false,"pushed_at":"2023-12-10T10:42:48.000Z","size":1994,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-06-23T18:07:49.502Z","etag":null,"topics":["applied-mathematics","association-rule-learning","classification","clustering","data-analysis","data-visualization","ensemble-learning","machine-learning-algorithms","python3","statistics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vadniks.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-09-11T11:08:01.000Z","updated_at":"2023-12-10T10:50:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"51eb54bf-d5ca-48c0-84e5-ac260e7d39c4","html_url":"https://github.com/vadniks/akaBigData","commit_stats":null,"previous_names":["vadniks/akabigdata"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/vadniks/akaBigData","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vadniks%2FakaBigData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vadniks%2FakaBigData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vadniks%2FakaBigData/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vadniks%2FakaBigData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vadniks","download_url":"https://codeload.github.com/vadniks/akaBigData/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vadniks%2FakaBigData/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267621581,"owners_count":24116900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["applied-mathematics","association-rule-learning","classification","clustering","data-analysis","data-visualization","ensemble-learning","machine-learning-algorithms","python3","statistics"],"created_at":"2024-11-16T13:09:26.692Z","updated_at":"2025-09-23T23:11:49.175Z","avatar_url":"https://github.com/vadniks.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Technologies and tools for big data analysis\nThese are the practices from some AI-related course from my university which was taught by \nour department of applied mathematics.\n\n__This source code can be freely used as well as the other materials even without mentioning the original author - \n*you can safely write them off!*__\\\n[Original 
repository](https://github.com/vadniks/akaBigData)\n\n_# noinspection MachineTranslation, Math_\n\n---\n\n# Datasets\n\nFree datasets from [Gapminder](https://www.gapminder.org) and [Kaggle](https://www.kaggle.com) were used; \nthe other datasets were provided with the tasks, so their original sources are unknown.\n\n---\n\n# Contents\n\n* [Practice 1 - Getting started with Python language](#practice-1---getting-started-with-python-language)\n* [Practice 2 - Various data visualization libraries](#practice-2---various-data-visualization-libraries)\n* [Practice 3 - Various methods of statistical research](#practice-3---various-methods-of-statistical-research)\n* [Practice 4 - Methods for calculating correlation and linear regression, conducting analysis of variance](#practice-4---methods-for-calculating-correlation-and-linear-regression-conducting-analysis-of-variance)\n* [Practice 5 - Applying machine learning algorithms to solve classification problems](#practice-5---applying-machine-learning-algorithms-to-solve-classification-problems)\n* [Practice 6 - Applying machine learning algorithms to solve clusterization problems](#practice-6---applying-machine-learning-algorithms-to-solve-clusterization-problems)\n* [Practice 7 - Ensemble learning methods](#practice-7---ensemble-learning-methods)\n* [Practice 8 - Teaching methods based on association rules](#practice-8---teaching-methods-based-on-association-rules)\n\n---\n\n## Practice 1 - Getting started with Python language\n\n### Task 1\nInstall Python - _Seriously!_\n\n### Task 2\n\nWrite a program that calculates the area of a figure\nwhose parameters are supplied as input. The supported figures are:\ntriangle, rectangle, circle. 
The result is a dictionary in which\nthe key is the name of the figure and the value is its area.\n\n`P1T2`\n\n__Output__\n```\nt2 {'triangle': 1.0, 'rectangle': 12.0, 'circle':\n78.53981633974483}\n```\n\n### Task 3\n\nWrite a program that takes two numbers as input and\nthe operation that needs to be applied to them. The following operations\nmust be implemented: +, -, /, //, abs (absolute value), pow or ** (exponentiation).\n\n`P1T3`\n\n__Output__\n```\nt3 3.0 2.0 1.0\n```\n\n### Task 4\n\nWrite a program that reads numbers from the console (one\nper line) until the sum of the entered numbers is equal to 0 and\nthen displays the sum of the squares of all the numbers read.\n\n`P1T4`\n\n__Output__\n```\nt4:\n1\n2\n-3\nt4 14\n```\n\n### Task 5\n\nWrite a program that prints a sequence\nof numbers of length N in which each number k is repeated k times.\nA non-negative integer N is passed as input. For example, if\nN = 7, then the program should print 1 2 2 3 3 3 4. To print list elements\nseparated by spaces, use print(*list).\n\n`P1T5`\n\n__Output__\n```\nt5 [1, 2, 2, 3, 3, 3, 4]\n```\n\n### Task 6\n\nGiven two lists: A = [1, 2, 3, 4, 2, 1, 3, 4, 5, 6, 5, 4, 3, 2] B =\n['a', 'b', 'c', 'c', 'c', 'b', 'a', 'c', 'a', 'a', 'b', 'c', 'b', 'a']. Create a dictionary\nin which the keys are the distinct elements of list B, and the value for each key\nis the sum of all elements of list A that stand at the same positions as\nthat letter in list B. Example program result: {‘a’ : 10, ‘b’ : 15, ‘c’\n: 6}.\n\n`P1T6`\n\n__Output__\n```\nt6 {'a': 17, 'b': 11, 'c': 17}\n```\n\n### Task 7-12\n\nTasks seven to twelve were combined into one\ndue to their small size. 7. Load the California housing data\nusing the sklearn library. 8. Use the info() method. 9.\nFind out if there are missing values using isna().sum(). 10. 
Display\nrecords where the average age of houses in the area is more than 50 years and the population is more than\n2500 people, using the loc indexer. 11. Find out the maximum and minimum\nmedian house price values. 12. Using the apply() method, print\nthe name of each feature and its mean value.\n\n`P1T7_12`\n\n__Output__\n```\nt7:\n        MedInc  HouseAge  AveRooms  ...  AveOccup  Latitude  Longitude\n0      8.3252      41.0  6.984127  ...  2.555556     37.88    -122.23\n1      8.3014      21.0  6.238137  ...  2.109842     37.86    -122.22\n2      7.2574      52.0  8.288136  ...  2.802260     37.85    -122.24\n3      5.6431      52.0  5.817352  ...  2.547945     37.85    -122.25\n4      3.8462      52.0  6.281853  ...  2.181467     37.85    -122.25\n...       ...       ...       ...  ...       ...       ...        ...\n20635  1.5603      25.0  5.045455  ...  2.560606     39.48    -121.09\n20636  2.5568      18.0  6.114035  ...  3.122807     39.49    -121.21\n20637  1.7000      17.0  5.205543  ...  2.325635     39.43    -121.22\n20638  1.8672      18.0  5.329513  ...  2.123209     39.43    -121.32\n20639  2.3886      16.0  5.254717  ...  
2.616981     39.37    -121.24\n\n[20640 rows x 8 columns]\n\nt8:\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nRangeIndex: 20640 entries, 0 to 20639\nData columns (total 8 columns):\n #   Column      Non-Null Count  Dtype\n---  ------      --------------  -----\n 0   MedInc      20640 non-null  float64\n 1   HouseAge    20640 non-null  float64\n 2   AveRooms    20640 non-null  float64\n 3   AveBedrms   20640 non-null  float64\n 4   Population  20640 non-null  float64\n 5   AveOccup    20640 non-null  float64\n 6   Latitude    20640 non-null  float64\n 7   Longitude   20640 non-null  float64\ndtypes: float64(8)\nmemory usage: 1.3 MB\n\nt9:\nMedInc        0\nHouseAge      0\nAveRooms      0\nAveBedrms     0\nPopulation    0\nAveOccup      0\nLatitude      0\nLongitude     0\ndtype: int64\n\nt10:\n       MedInc  HouseAge  AveRooms  ...    AveOccup  Latitude  Longitude\n460    1.4012      52.0  3.105714  ...    9.534286     37.87    -122.26\n4131   3.5349      52.0  4.646119  ...    5.910959     34.13    -118.20\n4440   2.6806      52.0  4.806283  ...    4.007853     34.08    -118.21\n5986   1.8750      52.0  4.500000  ...   21.333333     34.10    -117.71\n7369   3.1901      52.0  4.730942  ...    4.182735     33.97    -118.21\n8227   2.3305      52.0  3.488860  ...    3.955439     33.78    -118.20\n13034  6.1359      52.0  8.275862  ...  230.172414     38.69    -121.15\n15634  1.8295      52.0  2.628169  ...    4.164789     37.80    -122.41\n15652  0.9000      52.0  2.237474  ...    2.237474     37.80    -122.41\n15657  2.5166      52.0  2.839075  ...    1.621520     37.79    -122.41\n15659  1.7240      52.0  2.278566  ...    1.780142     37.79    -122.41\n15795  2.5755      52.0  3.402576  ...    2.108696     37.77    -122.42\n15868  2.8135      52.0  4.584329  ...    
3.966799     37.76    -122.41\n\n[13 rows x 8 columns]\n\nt11:\n15.0001 0.4999\n\nt12:\nMedInc           3.870671\nHouseAge        28.639486\nAveRooms         5.429000\nAveBedrms        1.096675\nPopulation    1425.476744\nAveOccup         3.070655\nLatitude        35.631861\nLongitude     -119.569704\ndtype: float64\n```\n\n---\n\n## Practice 2 - Various data visualization libraries\n\n### Task 1\n\nFind and download multidimensional data (with a large number of features - columns) \nusing the pandas library. Describe the data found in the report.\n\n### Task 2\n\nDisplay information about the data using the .info(), .head() methods.\nCheck data for empty values. If present, remove row data or interpolate missing\nvalues. If necessary, additionally pre-process the data for further work with it.\n\n`t1-t2`\n\n__Output__\n```\nt2:\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nRangeIndex: 197 entries, 0 to 196\nData columns (total 9 columns):\n #   Column             Non-Null Count  Dtype\n---  ------             --------------  -----\n 0   income             195 non-null    float64\n 1   life_exp           195 non-null    float64\n 2   population         195 non-null    float64\n 3   year               197 non-null    int64\n 4   country            197 non-null    object\n 5   four_regions       193 non-null    object\n 6   six_regions        193 non-null    object\n 7   eight_regions      193 non-null    object\n 8   world_bank_region  193 non-null    object\ndtypes: float64(3), int64(1), object(5)\nmemory usage: 14.0+ KB\nt2:\n     income  life_exp  ...       eight_regions           world_bank_region\n0   1910.0      61.0  ...           asia_west                  South Asia\n1  11100.0      78.1  ...         europe_east       Europe \u0026 Central Asia\n2  11100.0      74.7  ...        africa_north  Middle East \u0026 North Africa\n3  46900.0      81.9  ...         europe_west       Europe \u0026 Central Asia\n4   7680.0      60.8  ...  
africa_sub_saharan          Sub-Saharan Africa\n\n[5 rows x 9 columns]\n```\n\n### Task 3\n\nPlot a bar chart (.bar) using the graph_objs module from the Plotly library with the following parameters:\n1. On the X-axis indicate the date or name, on the Y-axis indicate the quantitative indicator.\n2. Make the column take on a color depending on the value of the indicator (marker=dict(color=attribute, coloraxis=\"coloraxis\")).\n3. Make sure that the borders of each column are highlighted with a black line with a thickness of 2.\n4. Display the chart title, centered at the top, with text size 20.\n5. Add labels for the X and Y axes with a text size of 16. For the x-axis, rotate the labels so that they are read at an angle of 315.\n6. Make the text size of the axis labels equal to 14.\n7. Place the graph across the entire width of the work area and set the height to 700 pixels.\n8. Add a grid to the graph, make its color 'ivory' and thickness equal to 2. (You can do this when setting the axes using gridwidth=2, gridcolor='ivory').\n9. Remove extra padding along the edges.\n\n`t3`\n\n__Output__\\\n![](images/p2_1.png)\n\n### Task 4\n\nCreate a pie chart (go.Pie) using the data and design style from the previous graph. \nMake sure that the boundaries of each share are highlighted with a black line with a \nthickness of 2 and the categories of the pie chart are readable (for example, combine \nsome objects).\n\n`t4`\n\n__Output__\\\n![](images/p2_2.png)\n\n### Task 5\n\nConstruct linear graphs, take one of the parameters and determine the relationship \nbetween several other (from 2 to 5) indicators using the matplotlib library. Draw a \nconclusion. Make a graph with lines and markers, line color 'crimson', point color \n'white', point border color 'black', point border thickness 2. Add a grid to the \ngraph, make its color 'mistyrose' and width equal to 2. 
(You can do this when \nsetting the axes using linewidth=2, color='mistyrose').\n\n`t5`\n\n__Output__\\\n![](images/p2_3.png)\n![](images/p2_4.png)\n\n__Conclusion__\\\nThe first graph shows that the higher the income, the longer the life expectancy.\nThe second graph shows that the majority of income is concentrated in the vast \nminority of people, and also that most people have incomes that do not exceed $20,000.\n\n### Task 6\n\nVisualize multidimensional data using t-SNE. It is necessary to use the MNIST or \nfashion MNIST data set (you can also use other ready-made data sets where you can \nobserve the division of objects into clusters). Consider the visualization results \nfor different perplexity values.\n\n`t6`\n\n__Output__\n```\nt6:    label  1x1  1x2  1x3  1x4  1x5  ...  28x23  28x24  28x25  28x26  28x27  28x28\n0      5    0    0    0    0    0  ...      0      0      0      0      0      0\n1      0    0    0    0    0    0  ...      0      0      0      0      0      0\n2      4    0    0    0    0    0  ...      0      0      0      0      0      0\n3      1    0    0    0    0    0  ...      0      0      0      0      0      0\n4      9    0    0    0    0    0  ...      0      0      0      0      0      0\n\n[5 rows x 785 columns]\nt6: Elapsed time: 1.5147051811218262 seconds\n```\n![](images/p2_5.png)\n![](images/p2_6.png)\n![](images/p2_7.png)\n\n__Conclusion__\\\nFrom the resulting graphs it follows that the higher the perplexity value, the larger the \nclusters become. Perplexity (a variable parameter) describes the expected density around \neach point. Low values focus the algorithm on fewer neighbors, high values reduce the \nnumber of more densely packed groups.\n\n### Task 7\n\nVisualize multidimensional data using UMAP with different n_neighbors and min_dist \nparameters. 
Calculate the running time of the algorithm using the time library and \ncompare it with the running time of t-SNE.\n\n`t7`\n\n__Output__\n```\nt7: Elapsed time: 1.9380676746368408 seconds\n```\n![](images/p2_8.png)\n![](images/p2_9.png)\n![](images/p2_10.png)\n![](images/p2_11.png)\n![](images/p2_12.png)\n![](images/p2_13.png)\n\n__Conclusion__\\\nBased on the obtained graphs, we can draw the following conclusion: Small values of \nthe n_neighbors parameter mean that the algorithm is limited to a small neighborhood \naround each point - it tries to capture the local structure of the data. Large ones \nretain the global structure, but lose details. The min_dist parameter determines the \nminimum distance at which points can be located in the new space. Low values define \nthe division of data into clusters, while high values define the structure of the data \nas a whole. Despite the fact that theoretically the UMAP method should be faster than \nthe TSNE method, practical measurements have shown the opposite, although the difference \nis less than a second. The likely reason is that only the first 1000 data items are used, \nso both methods are fast, but if you increase the number of data items, the UMAP method \nis faster.\n\n---\n\n## Practice 3 - Various methods of statistical research\n\n### Task 1\n\nLoad data from file\n\n### Task 2\n\nUse the describe() method to view statistics on the data. 
Draw conclusions.\n\n`t1-t2`\n\n__Output__\n```\nt2:\n   age     sex     bmi  children smoker     region      charges\n0   19  female  27.900         0    yes  southwest  16884.92400\n1   18    male  33.770         1     no  southeast   1725.55230\n2   28    male  33.000         3     no  southeast   4449.46200\n3   33    male  22.705         0     no  northwest  21984.47061\n4   32    male  28.880         0     no  northwest   3866.85520\n\n               age          bmi     children       charges\ncount  1338.000000  1338.000000  1338.000000   1338.000000\nmean     39.207025    30.663397     1.094918  13270.422265\nstd      14.049960     6.098187     1.205493  12110.011237\nmin      18.000000    15.960000     0.000000   1121.873900\n25%      27.000000    26.296250     0.000000   4740.287150\n50%      39.000000    30.400000     1.000000   9382.033000\n75%      51.000000    34.693750     2.000000  16639.912515\nmax      64.000000    53.130000     5.000000  63770.428010\n--------------------------------------------------\n```\n\n__Conclusion__\\\nYou can see a count of all the attributes of the dataset, and also that age in the dataset goes \nfrom 18 to 64, bmi (body mass index) from 15.9 to 53, children (number of children) from 0 to 5, \ncharges (expenses) from 1121.9 up to 63770. Average value (mean) of each attribute: 39 for age, \n30.6 for bmi, 1 for children, 13270 for charges. Standard deviation is an estimate for a sample \nthat allows you to evaluate how much the data changes relative to their average: 14 for age, 6 \nfor bmi, 1 for children, 12110 for charges. Each subsequent quarter increases (25%, 50%, 75%, \n100%), the charges attribute increases more strongly. The count attribute is the same everywhere.\n\n### Task 3\n\nConstruct histograms for numerical indicators. 
Draw conclusions.\n\n`t3`\n\n__Output__\\\n![](images/p3_1.png)\n\n__Conclusion__\\\nThe x-axis indicates the values of the variable, and the y-axis indicates how often the value of \nthis variable occurs in a certain interval. The interval length was chosen to be 15. From left to \nright, top to bottom, you can see how often the value of the variable appears. Charges values close \nto zero appear more often. The most common age value is close to zero, while the rest are evenly \ndistributed. The bmi values have a normal distribution. The most common value of children is zero; \nthe larger the number, the less repeated it is.\n\n### Task 4\n\nFind measures of central tendency and measures of dispersion for body mass index (bmi) and charges \n(charges). Display results as text and in histograms (3 vertical lines). Add a legend to graphs. \nDraw conclusions.\n\n`t4`\n\n__Output__\n```\nt4:\nMean BMI = 30.663397\nMode BMI:  32.3\nMedian BMI = 30.400000\n\nMean Charges = 13270.422265\nMode Charges:  1639.5631\nMedian Charges = 9382.033000\n\nStandard Deviation of charges:  12110.011236694001\nRange of charges:  62648.554110000005\nQuarter range of charges using numpy:  11879.80148\nQuarter range of charges with scipy:  11879.80148\n\nStandard Deviation of bmi:  6.098186911679014\nRange of bmi:  37.17\nQuarter range of bmi using numpy:  8.384999999999998\nQuarter range of bmi with scipy:  8.384999999999998\n--------------------------------------------------\n```\n![](images/p3_2.png)\n\n\n__Conclusion__\\\nThe bmi graph shows that the values in it have a normal distribution. According to the charges graph, \nfrom left to right the values are repeated less. You can also see from the bmi graph that mode and \nmedian are close to each other - the mean and central values are almost the same. The most common \nvalue is mode. 
In charges, the mode lies at the far left of the graph, and the values vary much more, so the mean differs \nnoticeably from the median. The measures of dispersion are the standard deviation, the range (max - min) and the \ninterquartile range (the difference between the 3rd and 1st quartiles, printed as “Quarter range” in the output). \nThe range of the charges attribute is very large.\n\n### Task 5\n\nConstruct a box plot for each numerical indicator. The names of the graphs must correspond to the names \nof the features. Draw conclusions.\n\n`t5`\n\n__Output__\\\n![](images/p3_3.png)\n![](images/p3_4.png)\n\n__Conclusion__\\\nEach graph shows the distribution of one feature: age, bmi, charges and children. The “box” spans the 1st to 3rd \nquartiles (the middle 50% of the values), and the orange line inside it is the median. In bmi and charges, the points \nbeyond the “whiskers” are outliers: rare values that differ strongly from the rest and are too extreme to \ncharacterize the feature. Outliers appear only in the bmi and charges attributes, and all of them lie above the \nupper whisker; the other features have none. In charges, the outliers are particularly numerous.\n\n### Task 6\n\nUsing the charges or bmi attribute, check whether the central limit theorem holds. Use different sample \nsizes n. Number of samples = 300. Display the result in the form of histograms. Find the standard deviation \nand mean for the resulting distributions. 
Draw conclusions.\n\n`t6`\n\n__Output__\n```\nt6:\nMean of  n=1    12198.327287   Std of  n=1    10855.966798\nMean of  n=10    13357.920196   Std of  n=10    4062.840514\nMean of  n=50    13118.524016   Std of  n=50    1833.114858\nMean of  n=100    13378.232319   Std of  n=100    1197.308316\nMean of  n=150    13309.708941   Std of  n=150    800.090288\nMean of  n=200    13260.895946   Std of  n=200    735.038002\n\nStandard Deviation:  12110.011236694001\nRange:  62648.554110000005\nQuarter range using numpy:  11879.80148\nQuarter range with scipy:  11879.80148\n--------------------------------------------------\n```\n\n![](images/p3_5.png)\n![](images/p3_6.png)\n\n__Conclusion__\\\nThe sample means behave as the central limit theorem predicts for every sample size except n = 1, where each \n“mean” is just a single value from the dataset. Here n is the sample size. The larger n is, the closer the \nhistogram of the means is to a normal distribution. The means are all close to 13 thousand (about 12.2 thousand \nfor n = 1). The larger the sample size, the smaller the standard deviation of the means and the closer the \ngraph is to the shape of a normal distribution.\n\n### Task 7\n\nConstruct 95% and 99% confidence intervals for the mean expenditure and mean BMI.\n\n`t7`\n\n__Output__\n```\nt7:\n90% confidence interval for Charges:  (12725.864762144516, 13814.979768137997)\n95% confidence interval for Charges:  (12621.54197822916, 13919.302552053354)\n90% confidence interval for BMI:  (30.389176352638128, 30.93761736933497)\n95% confidence interval for BMI:  (30.336642971534822, 30.990150750438275)\n--------------------------------------------------\n```\n\n### Task 8\n\nCheck the distribution of the following characteristics for normality: body mass index, expenses. Formulate the null \nand alternative hypotheses. For each characteristic, use the KS test and q-q plot. 
Draw conclusions based on the \nobtained p-values.\n\n`t8`\n\n__Output__\n```\nt8:\nKstestResult(statistic=0.02613962682509635, pvalue=0.31453976932347394, statistic_location=28.975, statistic_sign=1)\nKstestResult(statistic=0.18846204110424236, pvalue=4.39305730768502e-42, statistic_location=13470.86, statistic_sign=1)\n--------------------------------------------------\n```\n![](images/p3_7.png)\n![](images/p3_8.png)\n\n__Conclusion__\\\nThe task is to test a null hypothesis against an alternative one. Hypotheses are always about difference. Null \nhypothesis: there is no significant difference between the distribution of the sample values (charges, for example) \nand the ideal normal distribution. Alternative hypothesis: a significant difference between them exists. The decision \nis based on the p-value (pvalue): if it is less than 0.05, the null hypothesis is rejected in favour of the \nalternative; otherwise the null hypothesis is not rejected. The q-q plot compares the sample values with the values \nof an ideal normal distribution: the x-axis shows the standard normal distribution, and the y-axis shows the \ndistribution of the sample under study. If the points follow the straight line exactly, the sample is normally \ndistributed. If the points lie above the line, the values are higher than a normal distribution would give, and \nvice versa. Conclusions from the graphs: the bmi graph shows a fairly normal distribution, but the charges graph is \nvery different from the normal distribution. For the bmi characteristic, the null hypothesis was not rejected, and \nfor the charges characteristic it was rejected in favour of the alternative. 
The \nessence of the KS test is to assess the significance of the differences between two samples, as in the previous test \n(q-q). Here too, hypotheses are selected based on the pvalue. For the bmi feature, the pvalue is higher than 0.05, \nwhich means we need to accept the null hypothesis, since the sample has a normal distribution. The charges attribute \nhas a much smaller pvalue, which means the null hypothesis is rejected since the sample does not have a normal \ndistribution.\n\n### Task 9\n\nLoad data from file\n\n`t9`\n\n__Output__\n```\nt9:\n          dateRep  day  month  year  cases  deaths countriesAndTerritories  \\\n0      14/12/2020   14     12  2020    746       6             Afghanistan\n1      13/12/2020   13     12  2020    298       9             Afghanistan\n2      12/12/2020   12     12  2020    113      11             Afghanistan\n3      12/12/2020   12     12  2020    113      11             Afghanistan\n4      11/12/2020   11     12  2020     63      10             Afghanistan\n...           ...  ...    ...   ...    ...     ...                     ...\n61899  25/03/2020   25      3  2020      0       0                Zimbabwe\n61900  24/03/2020   24      3  2020      0       1                Zimbabwe\n61901  23/03/2020   23      3  2020      0       0                Zimbabwe\n61902  22/03/2020   22      3  2020      1       0                Zimbabwe\n61903  21/03/2020   21      3  2020      1       0                Zimbabwe\n\n      geoId countryterritoryCode  popData2019 continentExp  \\\n0        AF                  AFG   38041757.0         Asia\n1        AF                  AFG   38041757.0         Asia\n2        AF                  AFG   38041757.0         Asia\n3        AF                  AFG   38041757.0         Asia\n4        AF                  AFG   38041757.0         Asia\n...     ...                  ...          ...          
...\n61899    ZW                  ZWE   14645473.0       Africa\n61900    ZW                  ZWE   14645473.0       Africa\n61901    ZW                  ZWE   14645473.0       Africa\n61902    ZW                  ZWE   14645473.0       Africa\n61903    ZW                  ZWE   14645473.0       Africa\n\n       Cumulative_number_for_14_days_of_COVID-19_cases_per_100000\n0                                               9.013779\n1                                               7.052776\n2                                               6.868768\n3                                               6.868768\n4                                               7.134266\n...                                                  ...\n61899                                                NaN\n61900                                                NaN\n61901                                                NaN\n61902                                                NaN\n61903                                                NaN\n\n[61904 rows x 12 columns]\n\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nRangeIndex: 61904 entries, 0 to 61903\nData columns (total 12 columns):\n #   Column                                             Non-Null Count  Dtype\n---  ------                                             --------------  -----\n 0   dateRep                                            61904 non-null  object\n 1   day                                                61904 non-null  int64\n 2   month                                              61904 non-null  int64\n 3   year                                               61904 non-null  int64\n 4   cases                                              61904 non-null  int64\n 5   deaths                                             61904 non-null  int64\n 6   countriesAndTerritories                            61904 non-null  object\n 7   geoId                                              61629 non-null  object\n 8   countryterritoryCode                 
              61781 non-null  object\n 9   popData2019                                        61781 non-null  float64\n 10  continentExp                                       61904 non-null  object\n 11  Cumulative_number_for_14_days_of_COVID-19_cases_per_100000\n                                                        59025 non-null  float64\ndtypes: float64(2), int64(5), object(5)\nmemory usage: 5.7+ MB\n--------------------------------------------------\n```\n\n### Task 10\n\nCheck the data for missing values. Display the number of missing values as a percentage. Remove the two features that \nhave the most missing values. For the remaining features, process gaps: for a categorical feature, use filling with \nthe default value (for example, “other”), for a numeric feature, use filling with the median value. Show that there \nare no more gaps in the data.\n\n`t10`\n\n__Output__\n```\nt10:\n dateRep : 0.0%\n day : 0.0%\n month : 0.0%\n year : 0.0%\n cases : 0.0%\n deaths : 0.0%\n countriesAndTerritories : 0.0%\n geoId : 0.4%\n countryterritoryCode : 0.2%\n popData2019 : 0.2%\n continentExp : 0.0%\n Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 : 4.7%\n\n          dateRep  day  month  year  cases  deaths countriesAndTerritories  \\\n0      14/12/2020   14     12  2020    746       6             Afghanistan\n1      13/12/2020   13     12  2020    298       9             Afghanistan\n2      12/12/2020   12     12  2020    113      11             Afghanistan\n3      12/12/2020   12     12  2020    113      11             Afghanistan\n4      11/12/2020   11     12  2020     63      10             Afghanistan\n...           ...  ...    ...   ...    ...     ...                     
...\n61899  25/03/2020   25      3  2020      0       0                Zimbabwe\n61900  24/03/2020   24      3  2020      0       1                Zimbabwe\n61901  23/03/2020   23      3  2020      0       0                Zimbabwe\n61902  22/03/2020   22      3  2020      1       0                Zimbabwe\n61903  21/03/2020   21      3  2020      1       0                Zimbabwe\n\n      countryterritoryCode  popData2019 continentExp\n0                      AFG   38041757.0         Asia\n1                      AFG   38041757.0         Asia\n2                      AFG   38041757.0         Asia\n3                      AFG   38041757.0         Asia\n4                      AFG   38041757.0         Asia\n...                    ...          ...          ...\n61899                  ZWE   14645473.0       Africa\n61900                  ZWE   14645473.0       Africa\n61901                  ZWE   14645473.0       Africa\n61902                  ZWE   14645473.0       Africa\n61903                  ZWE   14645473.0       Africa\n\n[61904 rows x 10 columns]\n\n dateRep : 0.0%\n day : 0.0%\n month : 0.0%\n year : 0.0%\n cases : 0.0%\n deaths : 0.0%\n countriesAndTerritories : 0.0%\n countryterritoryCode : 0.0%\n popData2019 : 0.0%\n continentExp : 0.0%\n--------------------------------------------------\n```\n\n### Task 11\n\nView statistics on data using describe(). Draw conclusions about which features contain outliers. 
See for which \ncountries the number of deaths per day exceeded 3000 and how many such days there were.\n\n`t11`\n\n__Output__\n```\nt11:\n                day         month          year          cases        deaths  \\\ncount  61904.000000  61904.000000  61904.000000   61904.000000  61904.000000\nmean      15.629232      7.067104   2019.998918    1155.079026     26.053987\nstd        8.841624      2.954816      0.032881    6779.010824    131.222948\nmin        1.000000      1.000000   2019.000000   -8261.000000  -1918.000000\n25%        8.000000      5.000000   2020.000000       0.000000      0.000000\n50%       15.000000      7.000000   2020.000000      15.000000      0.000000\n75%       23.000000     10.000000   2020.000000     273.000000      4.000000\nmax       31.000000     12.000000   2020.000000  234633.000000   4928.000000\n\n        popData2019\ncount  6.190400e+04\nmean   4.091909e+07\nstd    1.529798e+08\nmin    8.150000e+02\n25%    1.324820e+06\n50%    7.169456e+06\n75%    2.851583e+07\nmax    1.433784e+09\n\n0        False\n1        False\n2        False\n3        False\n4        False\n         ...\n61899    False\n61900    False\n61901    False\n61902    False\n61903    False\nName: deaths, Length: 61904, dtype: bool\n\nThere are 11 days where deaths \u003e= 3000\n\n          dateRep  day  month  year   cases  deaths   countriesAndTerritories  \\\n2118   02/10/2020    2     10  2020   14001    3351                 Argentina\n16908  07/09/2020    7      9  2020   -8261    3800                   Ecuador\n37038  09/10/2020    9     10  2020    4936    3013                    Mexico\n44888  14/08/2020   14      8  2020    9441    3935                      Peru\n44909  24/07/2020   24      7  2020    4546    3887                      Peru\n59007  12/12/2020   12     12  2020  234633    3343  United_States_of_America\n59009  10/12/2020   10     12  2020  220025    3124  United_States_of_America\n59016  03/12/2020    3     12  2020  203311    3190  
United_States_of_America\n59239  24/04/2020   24      4  2020   26543    3179  United_States_of_America\n59245  18/04/2020   18      4  2020   30833    3770  United_States_of_America\n59247  16/04/2020   16      4  2020   30148    4928  United_States_of_America\n\n      countryterritoryCode  popData2019 continentExp\n2118                   ARG   44780675.0      America\n16908                  ECU   17373657.0      America\n37038                  MEX  127575529.0      America\n44888                  PER   32510462.0      America\n44909                  PER   32510462.0      America\n59007                  USA  329064917.0      America\n59009                  USA  329064917.0      America\n59016                  USA  329064917.0      America\n59239                  USA  329064917.0      America\n59245                  USA  329064917.0      America\n59247                  USA  329064917.0      America\n--------------------------------------------------\n```\n\n![](images/p3_9.png)\n\n__Conclusion__\\\nOutliers are present in the cases and deaths features: describe() shows negative minima (-8261 and -1918), which are \nimpossible for daily counts, and maxima far above the 75th percentile. The overall plot also shows outliers \nin the year and popData2019 features, with popData2019 having the most. There were 11 days \non which the number of deaths exceeded 3000. They were recorded in Argentina, \nEcuador, Mexico, Peru and the United States of America.\n\n### Task 12\n\nFind duplicate rows in the data. 
Remove duplicates.\n\n`t12`\n\n__Output__\n```\nt12:\n          dateRep  day  month  year  cases  deaths countriesAndTerritories  \\\n3      12/12/2020   12     12  2020    113      11             Afghanistan\n218    12/05/2020   12      5  2020    285       2             Afghanistan\n48010  29/05/2020   29      5  2020      0       0             Saint_Lucia\n48073  28/03/2020   28      3  2020      0       0             Saint_Lucia\n\n      countryterritoryCode  popData2019 continentExp\n3                      AFG   38041757.0         Asia\n218                    AFG   38041757.0         Asia\n48010                  LCA     182795.0      America\n48073                  LCA     182795.0      America\n\n          dateRep  day  month  year  cases  deaths countriesAndTerritories  \\\n0      14/12/2020   14     12  2020    746       6             Afghanistan\n1      13/12/2020   13     12  2020    298       9             Afghanistan\n2      12/12/2020   12     12  2020    113      11             Afghanistan\n4      11/12/2020   11     12  2020     63      10             Afghanistan\n5      10/12/2020   10     12  2020    202      16             Afghanistan\n...           ...  ...    ...   ...    ...     ...                     ...\n61899  25/03/2020   25      3  2020      0       0                Zimbabwe\n61900  24/03/2020   24      3  2020      0       1                Zimbabwe\n61901  23/03/2020   23      3  2020      0       0                Zimbabwe\n61902  22/03/2020   22      3  2020      1       0                Zimbabwe\n61903  21/03/2020   21      3  2020      1       0                Zimbabwe\n\n      countryterritoryCode  popData2019 continentExp\n0                      AFG   38041757.0         Asia\n1                      AFG   38041757.0         Asia\n2                      AFG   38041757.0         Asia\n4                      AFG   38041757.0         Asia\n5                      AFG   38041757.0         Asia\n...                    ...          ...       
   ...\n61899                  ZWE   14645473.0       Africa\n61900                  ZWE   14645473.0       Africa\n61901                  ZWE   14645473.0       Africa\n61902                  ZWE   14645473.0       Africa\n61903                  ZWE   14645473.0       Africa\n\n[61900 rows x 10 columns]\n\nEmpty DataFrame\nColumns: [dateRep, day, month, year, cases, deaths, countriesAndTerritories, countryterritoryCode, popData2019, continentExp]\nIndex: []\n--------------------------------------------------\n```\n\n### Task 13\n\nLoad data from the file “bmi.csv”. Take two samples from it. One sample is the body mass index of people from the \nnorthwest region, the second sample is the body mass index of people from the southwest region. Compare the means of \nthese samples using Student's t-test. First check the samples for normality (Shapiro-Wilk test) and homogeneity of \nvariance (Bartlett test).\n\n`t13`\n\n__Output__\n```\nt13:\n      bmi     region\n0  22.705  northwest\n1  28.880  northwest\n2  27.740  northwest\n3  25.840  northwest\n4  28.025  northwest\n\n        bmi     region\n0    22.705  northwest\n1    28.880  northwest\n2    27.740  northwest\n3    25.840  northwest\n4    28.025  northwest\n..      ...        ...\n320  26.315  northwest\n321  31.065  northwest\n322  25.935  northwest\n323  30.970  northwest\n324  29.070  northwest\n\n[325 rows x 2 columns]\n\n      bmi     region\n325  27.9  southwest\n326  34.4  southwest\n327  24.6  southwest\n328  40.3  southwest\n329  35.3  southwest\n..    ...        
...\n645  20.6  southwest\n646  38.6  southwest\n647  33.4  southwest\n648  44.7  southwest\n649  25.8  southwest\n\n[325 rows x 2 columns]\n\nThe variance of both data groups: 26.305165492071005 32.29731162130177\n\nTtestResult(statistic=-3.2844171500398582, pvalue=0.001076958496307695, df=648.0)\n\n(-3.2844171500398667, 0.0010769584963076643, 648.0)\n\nShapiroResult(statistic=0.9954646825790405, pvalue=0.4655335247516632)\n ShapiroResult(statistic=0.9949268698692322, pvalue=0.3629520535469055)\n\nBartlettResult(statistic=3.4000745256459286, pvalue=0.06519347353581818)\n--------------------------------------------------\n```\n\n__Conclusion__\\\nNull hypothesis: there is no significant difference between the mean bmi values of the northwest and southwest \nregions; the alternative is that there is a difference. Since the t-test p-value (0.001) is less than 0.05, the null hypothesis is \nrejected: the mean bmi values of the two regions differ significantly. Normality: in both \nShapiro-Wilk tests the p-value is above 0.05, so we cannot reject the null hypothesis that bmi in each region is \nnormally distributed. Homogeneity: Bartlett's test checks whether the variances of the two samples are equal. Its null hypothesis is that the samples \ncome from populations with the same variance; the alternative hypothesis is the \nopposite. Since 0.065 (Bartlett) \u003e 0.05, we do not reject the null hypothesis: the variances of the samples can be treated as equal, \nso the equal-variance assumption of Student's t-test holds.\n\n### Task 14\n\nA die was rolled 600 times and the following results were obtained (see Listing 13). Use the Chi-square test to \ncheck whether the resulting distribution is uniform. 
Use the scipy.stats.chisquare() function.\n\n`t14`\n\n__Output__\n```\nt14:\n   N  Observed  Expected\n0  1        97       100\n1  2        98       100\n2  3       109       100\n3  4        95       100\n4  5        97       100\n5  6       104       100\n\nPower_divergenceResult(statistic=1.44, pvalue=0.9198882077437889)\n--------------------------------------------------\n```\n\n__Conclusion__\\\nThe null hypothesis is that the outcomes of the rolls are uniformly distributed. Since the p-value 0.92 \u003e 0.05, we do not reject \nthe null hypothesis: the observed counts are consistent with a uniform distribution.\n\n### Task 15\n\nUse the Chi-square test to test whether the variables are dependent. Create a dataframe using the following code \n(see Listing 14). Use the scipy.stats.chi2_contingency() function. Does marital status affect employment?\n\n`t15`\n\n__Output__\n```\nt15:\n                        Married  Civil marriage  Isn't in relationships\nFull working day             89              80                      35\nPart-time employment         17              22                      44\nTemporary doesn't work       11              20                      35\nOn the household             43              35                       6\nRetired                      22               6                       8\n\n1.7291616900960234e-21\n--------------------------------------------------\n```\n\n__Conclusion__\\\nThe null hypothesis is that marital status and employment are independent; the alternative hypothesis is that there is a \nsignificant relationship between them. Since the p-value is very small (1.7e-21 \u003c 0.05), we reject the null hypothesis in favor of the \nalternative: marital status and employment are related.\n\n---\n\n## Practice 4 - Methods for calculating correlation and linear regression, conducting analysis of variance\n\n### Task 1\n\nDefine two vectors representing the number of cars parked during 5 working days at the business center in the \nstreet parking lot and in the underground garage. 
\n1. Find and interpret the correlation between the variables “Street” and “Garage” (calculate the Pearson correlation).\n2. Construct a scatter plot for the above variables.\n\n`t1`\n\n__Output__\n```\nt1:\n[[ 1. -1.]\n [-1.  1.]]\n-0.9999999999999998\n--------------------------------------------------\n```\n![](images/p4_1.png)\n\n__Conclusion__\\\nBoth the correlation matrix and the separately computed correlation coefficient show that the variables are \nrelated: the coefficient is almost -1, which means there is a strong negative linear correlation.\n\n### Task 2\n\nFind and download data. Display, preprocess and describe the features.\n1. Construct a correlation matrix for one target variable. Determine the most correlated variable and continue working with it in the next step.\n2. Implement regression manually, display slope, intercept and MSE.\n3. Visualize the regression on a graph.\n\n`t2`\n\n__Output__\n```\nt2:\n    species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0    Adelie  Torgersen              39.1             18.7              181.0\n1    Adelie  Torgersen              39.5             17.4              186.0\n2    Adelie  Torgersen              40.3             18.0              195.0\n3    Adelie  Torgersen               NaN              NaN                NaN\n4    Adelie  Torgersen              36.7             19.3              193.0\n..      ...        ...               ...              ...                
...\n339  Gentoo     Biscoe               NaN              NaN                NaN\n340  Gentoo     Biscoe              46.8             14.3              215.0\n341  Gentoo     Biscoe              50.4             15.7              222.0\n342  Gentoo     Biscoe              45.2             14.8              212.0\n343  Gentoo     Biscoe              49.9             16.1              213.0\n\n     body_mass_g     sex\n0         3750.0    MALE\n1         3800.0  FEMALE\n2         3250.0  FEMALE\n3            NaN     NaN\n4         3450.0  FEMALE\n..           ...     ...\n339          NaN     NaN\n340       4850.0  FEMALE\n341       5750.0    MALE\n342       5200.0  FEMALE\n343       5400.0    MALE\n\n[344 rows x 7 columns]\n\n     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0          0       0              39.1             18.7              181.0\n1          0       0              39.5             17.4              186.0\n2          0       0              40.3             18.0              195.0\n3          0       0               NaN              NaN                NaN\n4          0       0              36.7             19.3              193.0\n..       ...     ...               ...              ...                ...\n339        2       1               NaN              NaN                NaN\n340        2       1              46.8             14.3              215.0\n341        2       1              50.4             15.7              222.0\n342        2       1              45.2             14.8              212.0\n343        2       1              49.9             16.1              213.0\n\n     body_mass_g  sex\n0         3750.0    0\n1         3800.0    1\n2         3250.0    1\n3            NaN   -1\n4         3450.0    1\n..           ...  
...\n339          NaN   -1\n340       4850.0    1\n341       5750.0    0\n342       5200.0    1\n343       5400.0    0\n\n[344 rows x 7 columns]\n\n     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0        0.0     0.0          0.254545         0.666667           0.152542\n1        0.0     0.0          0.269091         0.511905           0.237288\n2        0.0     0.0          0.298182         0.583333           0.389831\n3        0.0     0.0          0.167273         0.738095           0.355932\n4        0.0     0.0          0.261818         0.892857           0.305085\n..       ...     ...               ...              ...                ...\n337      1.0     0.5          0.549091         0.071429           0.711864\n338      1.0     0.5          0.534545         0.142857           0.728814\n339      1.0     0.5          0.665455         0.309524           0.847458\n340      1.0     0.5          0.476364         0.202381           0.677966\n341      1.0     0.5          0.647273         0.357143           0.694915\n\n     body_mass_g       sex\n0       0.291667  0.333333\n1       0.305556  0.666667\n2       0.152778  0.666667\n3       0.208333  0.666667\n4       0.263889  0.333333\n..           ...       ...\n337     0.618056  0.666667\n338     0.597222  0.666667\n339     0.847222  0.333333\n340     0.694444  0.666667\n341     0.750000  0.333333\n\n[342 rows x 7 columns]\n\n     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0        0.0     0.0          0.254545         0.666667           0.152542\n1        0.0     0.0          0.269091         0.511905           0.237288\n2        0.0     0.0          0.298182         0.583333           0.389831\n3        0.0     0.0          0.167273         0.738095           0.355932\n4        0.0     0.0          0.261818         0.892857           0.305085\n..       ...     ...               ...              ...                
...\n337      1.0     0.5          0.549091         0.071429           0.711864\n338      1.0     0.5          0.534545         0.142857           0.728814\n339      1.0     0.5          0.665455         0.309524           0.847458\n340      1.0     0.5          0.476364         0.202381           0.677966\n341      1.0     0.5          0.647273         0.357143           0.694915\n\n     body_mass_g       sex\n0       0.291667  0.333333\n1       0.305556  0.666667\n2       0.152778  0.666667\n3       0.208333  0.666667\n4       0.263889  0.333333\n..           ...       ...\n337     0.618056  0.666667\n338     0.597222  0.666667\n339     0.847222  0.333333\n340     0.694444  0.666667\n341     0.750000  0.333333\n\n[342 rows x 7 columns]\n\n[[1.         0.75049112]\n [0.75049112 1.        ]]\n\n0.7504911189081507\n\n                   species  island  culmen_length_mm  culmen_depth_mm  \\\nspecies              1.000   0.005             0.731           -0.744\nisland               0.005   1.000             0.223            0.180\nculmen_length_mm     0.731   0.223             1.000           -0.235\nculmen_depth_mm     -0.744   0.180            -0.235            1.000\nflipper_length_mm    0.854  -0.145             0.656           -0.584\nbody_mass_g          0.750  -0.189             0.595           -0.472\nsex                  0.012   0.047            -0.269           -0.323\n\n                   flipper_length_mm  body_mass_g    sex\nspecies                        0.854        0.750  0.012\nisland                        -0.145       -0.189  0.047\nculmen_length_mm               0.656        0.595 -0.269\nculmen_depth_mm               -0.584       -0.472 -0.323\nflipper_length_mm              1.000        0.871 -0.197\nbody_mass_g                    0.871        1.000 -0.347\nsex                           -0.197       -0.347  1.000\n[0.93208977] 
0.10126324431123246\n\n0.013649956706809902\n\n--------------------------------------------------\n```\n![](images/p4_2.png)\n![](images/p4_3.png)\n\n__Conclusion__\n1. An arbitrary pair of variables was selected first; its correlation matrix gave a coefficient of \n0.7504911189081507, a fairly strong positive correlation. Then, from the full correlation table, the \npair of variables with the highest correlation was identified: body_mass_g and flipper_length_mm with a \ncoefficient of 0.871.\n2. A linear regression was fitted (the best-fit line, which summarizes how the \npoints are related). The slope (model1.coef_) is 0.93208977 and the intercept (model1.intercept_) \nis 0.10126324431123246; on the normalized axes this slope corresponds to an angle of about 43 degrees to the \nhorizontal axis. Finally, the MSE (mean squared error, a metric that shows how far the \nforecasts deviate from the actual values on average) was found to be 0.013649956706809902. \nThe closer the value is to zero, the better the model, so at this value (about 0.01 on features scaled to the 0 to 1 range) the model fits the data reasonably well.\n3. A regression graph was constructed; it shows a positive relationship: as one variable increases, so does the other.\n\n### Task 3\n\nLoad the data from 'insurance.csv'. Display and preprocess it. List the unique regions.\n1. Perform a one-way ANOVA test to test the effect of region on body mass index (BMI) using the first method, through the SciPy library.\n2. Perform a one-way ANOVA test to test the effect of region on body mass index (BMI) using the second method, the anova_lm() function from the statsmodels library.\n3. Using Student's t-test, go through all pairs of regions. Apply the Bonferroni correction. Draw conclusions.\n4. Perform Tukey's post-hoc tests and plot the graph.\n5. 
Run a two-way ANOVA test to test the effect of region and gender on body mass index (BMI) using the anova_lm() function from the statsmodels library.\n6. Perform Tukey's post-hoc tests and plot the graph.\n\n`t3`\n\n__Output__\n```\nt3:\n      age     sex     bmi  children smoker     region      charges\n0      19  female  27.900         0    yes  southwest  16884.92400\n1      18    male  33.770         1     no  southeast   1725.55230\n2      28    male  33.000         3     no  southeast   4449.46200\n3      33    male  22.705         0     no  northwest  21984.47061\n4      32    male  28.880         0     no  northwest   3866.85520\n...   ...     ...     ...       ...    ...        ...          ...\n1333   50    male  30.970         3     no  northwest  10600.54830\n1334   18  female  31.920         0     no  northeast   2205.98080\n1335   18  female  36.850         0     no  southeast   1629.83350\n1336   21  female  25.800         0     no  southwest   2007.94500\n1337   61  female  29.070         0    yes  northwest  29141.36030\n\n[1338 rows x 7 columns]\n\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nRangeIndex: 1338 entries, 0 to 1337\nData columns (total 7 columns):\n #   Column    Non-Null Count  Dtype\n---  ------    --------------  -----\n 0   age       1338 non-null   int64\n 1   sex       1338 non-null   object\n 2   bmi       1338 non-null   float64\n 3   children  1338 non-null   int64\n 4   smoker    1338 non-null   object\n 5   region    1338 non-null   object\n 6   charges   1338 non-null   float64\ndtypes: float64(2), int64(2), object(3)\nmemory usage: 73.3+ KB\n\n\n age : 0.0%\n sex : 0.0%\n bmi : 0.0%\n children : 0.0%\n smoker : 0.0%\n region : 0.0%\n charges : 0.0%\n\n\n['southwest' 'southeast' 'northwest' 'northeast']\n\nF_onewayResult(statistic=39.49505720170283, pvalue=1.881838913929143e-24)\n\n                sum_sq      df          F        PR(\u003eF)\nregion     4055.880631     3.0  39.495057  1.881839e-24\nResidual  
45664.319755  1334.0        NaN           NaN\n\nnortheast northwest\nTtestResult(statistic=-0.060307727183293185, pvalue=0.951929170821864, df=647.0)\n5.711575024931184\nnortheast southeast\nTtestResult(statistic=-8.790905562598699, pvalue=1.186014937424813e-17, df=686.0)\n7.116089624548878e-17\nnortheast southwest\nTtestResult(statistic=-3.1169000930045923, pvalue=0.0019086161671573072, df=647.0)\n0.011451697002943843\nnorthwest southeast\nTtestResult(statistic=-9.25649013552548, pvalue=2.643571405230106e-19, df=687.0)\n1.5861428431380637e-18\nnorthwest southwest\nTtestResult(statistic=-3.2844171500398582, pvalue=0.001076958496307695, df=648.0)\n0.006461750977846171\nsoutheast southwest\nTtestResult(statistic=5.908373821545118, pvalue=5.4374009639680636e-09, df=687.0)\n3.2624405783808385e-08\n\n\n30.66339686098655\n30.4\n\n\n   Multiple Comparison of Means - Tukey HSD, FWER=0.05\n==========================================================\n  group1    group2  meandiff p-adj   lower   upper  reject\n----------------------------------------------------------\nnortheast northwest   0.0263 0.9999 -1.1552  1.2078  False\nnortheast southeast   4.1825    0.0   3.033   5.332   True\nnortheast southwest   1.4231 0.0107  0.2416  2.6046   True\nnorthwest southeast   4.1562    0.0  3.0077  5.3047   True\nnorthwest southwest   1.3968 0.0127  0.2162  2.5774   True\nsoutheast southwest  -2.7594    0.0 -3.9079 -1.6108   True\n----------------------------------------------------------\n\n                df        sum_sq      mean_sq          F        PR(\u003eF)\nregion         3.0   4055.880631  1351.960210  39.602259  1.636858e-24\nsex            1.0     86.007035    86.007035   2.519359  1.126940e-01\nregion:sex     3.0    174.157808    58.052603   1.700504  1.650655e-01\nResidual    1330.0  45404.154911    34.138462        NaN           NaN\n\n         Multiple Comparison of Means - Tukey HSD, FWER=0.05\n======================================================================\n  
   group1          group2     meandiff p-adj   lower   upper  reject\n----------------------------------------------------------------------\nnortheastfemale   northeastmale  -0.2998 0.9998 -2.2706  1.6711  False\nnortheastfemale northwestfemale  -0.0464    1.0 -2.0142  1.9215  False\nnortheastfemale   northwestmale  -0.2042    1.0 -2.1811  1.7728  False\nnortheastfemale southeastfemale   3.3469    0.0    1.41  5.2839   True\nnortheastfemale   southeastmale   4.6657    0.0  2.7634   6.568   True\nnortheastfemale southwestfemale   0.7362 0.9497 -1.2377    2.71  False\nnortheastfemale   southwestmale   1.8051 0.1007 -0.1657   3.776  False\n  northeastmale northwestfemale   0.2534 0.9999 -1.7083  2.2152  False\n  northeastmale   northwestmale   0.0956    1.0 -1.8752  2.0665  False\n  northeastmale southeastfemale   3.6467    0.0  1.7159  5.5775   True\n  northeastmale   southeastmale   4.9655    0.0  3.0695  6.8614   True\n  northeastmale southwestfemale    1.036 0.7515 -0.9318  3.0037  False\n  northeastmale   southwestmale   2.1049 0.0258  0.1402  4.0697   True\nnorthwestfemale   northwestmale  -0.1578    1.0 -2.1257    1.81  False\nnorthwestfemale southeastfemale   3.3933    0.0  1.4656   5.321   True\nnorthwestfemale   southeastmale    4.712    0.0  2.8192  6.6049   True\nnorthwestfemale southwestfemale   0.7825 0.9294 -1.1822  2.7473  False\nnorthwestfemale   southwestmale   1.8515 0.0806 -0.1103  3.8132  False\n  northwestmale southeastfemale   3.5511    0.0  1.6141  5.4881   True\n  northwestmale   southeastmale   4.8698    0.0  2.9676  6.7721   True\n  northwestmale southwestfemale   0.9403 0.8354 -1.0335  2.9142  False\n  northwestmale   southwestmale   2.0093  0.042  0.0385  3.9801   True\nsoutheastfemale   southeastmale   1.3187 0.3823  -0.542  3.1795  False\nsoutheastfemale southwestfemale  -2.6108 0.0011 -4.5446 -0.6769   True\nsoutheastfemale   southwestmale  -1.5418 0.2304 -3.4726   0.389  False\n  southeastmale southwestfemale  -3.9295    0.0 -5.8286 
-2.0304   True\n  southeastmale   southwestmale  -2.8606 0.0001 -4.7565 -0.9646   True\nsouthwestfemale   southwestmale    1.069 0.7201 -0.8988  3.0367  False\n----------------------------------------------------------------------\n```\n![](images/p4_4.png)\n![](images/p4_5.png)\n\n__Conclusion__\\\nThere are 4 unique regions in total: southwest, southeast, northwest, northeast.\n1. The one-way ANOVA test (analysis of variance, a statistical procedure for comparing the mean values \nof a variable across two or more independent groups) gives a p-value of 1.881838913929143e-24, which is less \nthan 0.05, so the region feature has a statistically significant effect on the bmi (body mass index) feature.\n2. The result of this test (PR(\u003eF) is the p-value) coincides with the result of the previous test. Unlike the previous \ntest, it does not require splitting the data into 4 regions beforehand.\n3. Student's t-test was calculated for every pair of regions, together with \nthe Bonferroni correction (the simplest and best-known way to control the family-wise error rate). \nOnly the northeast northwest pair has a p-value above 0.05, so for it we do not reject the null \nhypothesis that the mean bmi values of the two regions are equal. The Bonferroni-corrected p-value is \ncalculated as the Student's t-test p-value multiplied by the number of pairs (6 here).\n4. Tukey's post-hoc test (post-hoc tests show which pairwise differences are responsible for a significant overall effect; \nTukey's is the most popular of them) shows that for all pairs except the first \n(northeast northwest) the null hypothesis should be rejected (reject column) in favor of the alternative hypothesis: \nthere is a significant difference. 
The graph also shows that the intervals of southwest and southeast do not overlap (there is a \nsignificant difference), while those of northwest and northeast (p-adj equals 0.9999 \u003e 0.05) do overlap (an overlap is visible when one horizontal \nline lies directly under another), so there is no significant difference between them. The larger \nthe overlap, the smaller the difference. The red line on the graph is the average bmi value. Meandiff is the \ndifference between the mean values of each pair. Black dots are group means. The horizontal black lines are about the \nsame length because the groups are of similar size. Lower and upper are the bounds of the confidence interval for the mean difference. P-adj is the p-value adjusted \nfor multiple comparisons; values shown as 0.0 are very small ones rounded to four decimal places.\n5. The test result shows that only the region factor has a significant effect on body mass index (bmi): \nits p-value (PR(\u003eF)) of 1.636858e-24 is less than 0.05. Neither the gender factor (p = 0.113) nor the interaction between the 2 \nfactors (region and gender, p = 0.165) has a significant effect on bmi. Two-way ANOVA (one more factor is added) estimates the separate effects of \nfactors a and b and also checks their interaction, that is, whether the effect of one factor depends on the level of the other.\n6. First, a combined feature of region and gender is created, and its effect on \nbody mass index (bmi) is examined. Tukey's test finds all the unique pairs and compares them. On the graph, the last 4 elements have \nfew differences between each other, but the first 4 have significant differences both between themselves and from \nthe rest.\n\n---\n\n## Practice 5 - Applying machine learning algorithms to solve classification problems\n\n### Task 1\n\nFind data for classification. 
Pre-process the data if necessary.\n\n`t1`\n\n__Output__\n```\nt1:\n    species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0    Adelie  Torgersen              39.1             18.7              181.0\n1    Adelie  Torgersen              39.5             17.4              186.0\n2    Adelie  Torgersen              40.3             18.0              195.0\n3    Adelie  Torgersen               NaN              NaN                NaN\n4    Adelie  Torgersen              36.7             19.3              193.0\n..      ...        ...               ...              ...                ...\n339  Gentoo     Biscoe               NaN              NaN                NaN\n340  Gentoo     Biscoe              46.8             14.3              215.0\n341  Gentoo     Biscoe              50.4             15.7              222.0\n342  Gentoo     Biscoe              45.2             14.8              212.0\n343  Gentoo     Biscoe              49.9             16.1              213.0\n\n     body_mass_g     sex\n0         3750.0    MALE\n1         3800.0  FEMALE\n2         3250.0  FEMALE\n3            NaN     NaN\n4         3450.0  FEMALE\n..           ...     ...\n339          NaN     NaN\n340       4850.0  FEMALE\n341       5750.0    MALE\n342       5200.0  FEMALE\n343       5400.0    MALE\n\n[344 rows x 7 columns]\n\n    species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0    Adelie  Torgersen              39.1             18.7              181.0\n1    Adelie  Torgersen              39.5             17.4              186.0\n2    Adelie  Torgersen              40.3             18.0              195.0\n4    Adelie  Torgersen              36.7             19.3              193.0\n5    Adelie  Torgersen              39.3             20.6              190.0\n..      ...        ...               ...              ...                
...\n338  Gentoo     Biscoe              47.2             13.7              214.0\n340  Gentoo     Biscoe              46.8             14.3              215.0\n341  Gentoo     Biscoe              50.4             15.7              222.0\n342  Gentoo     Biscoe              45.2             14.8              212.0\n343  Gentoo     Biscoe              49.9             16.1              213.0\n\n     body_mass_g     sex\n0         3750.0    MALE\n1         3800.0  FEMALE\n2         3250.0  FEMALE\n4         3450.0  FEMALE\n5         3650.0    MALE\n..           ...     ...\n338       4925.0  FEMALE\n340       4850.0  FEMALE\n341       5750.0    MALE\n342       5200.0  FEMALE\n343       5400.0    MALE\n\n[334 rows x 7 columns]\n\n     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0          0       0              39.1             18.7              181.0\n1          0       0              39.5             17.4              186.0\n2          0       0              40.3             18.0              195.0\n4          0       0              36.7             19.3              193.0\n5          0       0              39.3             20.6              190.0\n..       ...     ...               ...              ...                ...\n338        2       1              47.2             13.7              214.0\n340        2       1              46.8             14.3              215.0\n341        2       1              50.4             15.7              222.0\n342        2       1              45.2             14.8              212.0\n343        2       1              49.9             16.1              213.0\n\n     body_mass_g  sex\n0         3750.0    0\n1         3800.0    1\n2         3250.0    1\n4         3450.0    1\n5         3650.0    0\n..           ...  
...\n338       4925.0    1\n340       4850.0    1\n341       5750.0    0\n342       5200.0    1\n343       5400.0    0\n\n[334 rows x 7 columns]\n\n--------------------------------------------------\n```\n\n### Task 2\n\nDraw a histogram that shows the balance of classes. Draw conclusions.\n\n`t2`\n\n__Output__\n```\nt2:\n species\n0    146\n1    146\n2    146\nName: count, dtype: int64\n--------------------------------------------------\n```\n![](images/p5_1.png)\n\n__Conclusion__\\\nThe data was divided into 3 classes (one per unique penguin species, the species column). Before balancing, class 0 had the \nlargest number of elements, class 1 had the smallest, and class 2 was in between. \nThe classes were unbalanced; to correct this, similar values were added to the first and \nsecond classes to bring them up to the size of the largest class and so balance all classes. This method is called \noversampling. There are 2 more methods: undersampling, which reduces every class to the size of the smallest one, and synthetic \ndata, which adds generated data (for example, using neural networks).\n\n### Task 3\n\nDivide the sample into training and test. Training to train the model, test to check its quality.\n\n`t3`\n\n__Output__\n```\nt3:\nSize of Predictor Train set (350, 6)\n Size of Predictor Test set (88, 6)\n Size of Target Train set (350,)\n Size of Target Test set (88,)\n--------------------------------------------------\n```\n\n__Conclusion__\\\nX_train.shape is the size of the training set features, x_test.shape is the size of the test set features, y_train.shape is the size of the \ntraining set target, and y_test.shape is the size of the test set target. Predictors are the columns/features; the target is the \ngoal, that is, what the machine must learn to find (here the species, which is removed from X). X is the data without the species column. The test \nsample (20% of the data) differs from the training sample (80% of the data) only in quantity. 
Based on train, the \nmachine searches for connections; test is used to check the identified connections.\n\n### Task 4\n\nApply classification algorithms: logistic regression, SVM, KNN. Construct a confusion (error) matrix based on the results of the \nmodels (use confusion_matrix from sklearn.metrics).\n\n`t4`\n\n__Output__\n```\nt4:\nPrediction values:\n [2 1 0 2 1 0 2 1 2 1 1 0 1 2 0 1 2 1 0 1 2 1 0 1 2 0 1 0 2 0 0 1 2 0 1 2 1\n 0 2 0 0 1 0 2 2 0 2 0 2 1 0 0 2 2 1 1 2 1 1 2 1 1 1 0 2 0 0 1 0 1 1 1 1 2\n 2 0 1 2 1 2 2 0 1 1 0 1 0 1]\nTarget values:\n [2 1 0 2 1 0 2 1 2 1 1 0 1 2 0 1 2 1 0 1 2 1 0 1 2 0 1 0 2 0 0 1 2 0 1 2 1\n 0 2 0 0 1 0 2 2 0 2 0 2 1 0 0 2 2 1 1 2 1 1 2 1 1 1 0 2 0 0 1 0 1 1 1 1 2\n 0 0 1 2 1 2 2 0 1 1 0 1 0 1]\n\n0.9872727272727272\n0.9866666666666667\n\nGridSearchCV(cv=6, estimator=SVC(),\n             param_grid={'kernel': ('linear', 'rbf', 'poly', 'sigmoid')})\nlinear\n\n0.8430742255990649\n\nKNeighborsClassifier(n_neighbors=3)\n--------------------------------------------------\n```\n![](images/p5_2.png)\n![](images/p5_3.png)\n![](images/p5_4.png)\n\n__Conclusion__\\\nThe logistic regression method classified the data almost perfectly; the algorithm made only \none error. The SVM method showed an ideal result (several runs were carried out and the result was the same in all of \nthem). The KNN method showed average results overall and the worst of the three.\nThe model is better when the classes have the same number of elements. 
Macro_avg is the arithmetic average of a metric \nacross classes (used when there is an imbalance, since it shows how accurately a small class is predicted). Precision is the \nshare of objects assigned to a class that really belong to it; recall is the share of objects of a class that the model \nfound (unlike precision, it is measured against the true members of the class); both should be close to one. F1-score is a general metric \nthat combines precision and recall (their harmonic mean). Support shows how many objects of each class are in the test set. \nFor example, macro_avg for recall takes the recall values of the classes, adds them and divides them by 3.\nLogistic regression is a model of the dependence of the class probability on one or more other variables (factors, regressors, \nindependent variables) through a linear function of those variables.\nSVM (support vector machine) uses the known answers to draw a line (hyperplane) that separates the classes \nand then predicts the class of a point from the side of the line it falls on. The line is placed using the distances \nfrom the line to the nearest points (the support vectors), so that these distances are as large as possible.\nThere are 4 kernel options: linear (simple multiplication), radial basis function (exponent), polynomial (a * b + c)^d, \nand sigmoid (hyperbolic tangent). GridSearch is a method that finds better parameters for a model: it runs through \nall the listed parameter values and selects the best one. grid_search_svm.best_estimator_.best_model.kernel shows the best kernel \n(in this case linear). SVM is better than logistic regression and better than KNN.\nKNN is the simplest classification algorithm: it takes a point, finds which other points it \nis closest to (its neighbors), and their class becomes the class of this point. 
It also uses the known answers but does not draw a boundary between classes. It is the \nleast accurate of the three.\n\n### Task 5\n\nCompare classification results using accuracy, precision, recall and f1-measure (you can use classification_report \nfrom sklearn.metrics). Draw conclusions.\n\n`t5`\n\n__Output__\n```\nt5:\n              precision    recall  f1-score   support\n\n           0       1.00      0.96      0.98        28\n           1       1.00      1.00      1.00        35\n           2       0.96      1.00      0.98        25\n\n    accuracy                           0.99        88\n   macro avg       0.99      0.99      0.99        88\nweighted avg       0.99      0.99      0.99        88\n\n\n              precision    recall  f1-score   support\n\n           0       1.00      1.00      1.00        28\n           1       1.00      1.00      1.00        35\n           2       1.00      1.00      1.00        25\n\n    accuracy                           1.00        88\n   macro avg       1.00      1.00      1.00        88\nweighted avg       1.00      1.00      1.00        88\n\n\n              precision    recall  f1-score   support\n\n           0       0.64      0.82      0.72        22\n           1       0.86      0.81      0.83        37\n           2       0.96      0.83      0.89        29\n\n    accuracy                           0.82        88\n   macro avg       0.82      0.82      0.81        88\nweighted avg       0.84      0.82      0.82        88\n\n\n--------------------------------------------------\n```\n\n__Conclusion__\\\nThe best classification method is SVM, and the worst is KNN. This is described in more detail in the conclusion of task 4.\nIn the first two reports (logistic regression and SVM), almost every metric equals one, which indicates \nhigh-quality class prediction: these models work great. 
The KNN method showed an accuracy of around 82%, which \nis also good, but if data balancing had not been carried out, its result would have been worse - during testing, an \naccuracy of around 50% was shown.\n\n---\n\n## Practice 6 - Applying machine learning algorithms to solve clusterization problems\n\n### Task 1\n\nFind data for clustering. If the features in the data have very different scales, then the data must first be normalized.\n\n`t1`\n\n__Output__\n```\n    species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0    Adelie  Torgersen              39.1             18.7              181.0\n1    Adelie  Torgersen              39.5             17.4              186.0\n2    Adelie  Torgersen              40.3             18.0              195.0\n3    Adelie  Torgersen               NaN              NaN                NaN\n4    Adelie  Torgersen              36.7             19.3              193.0\n..      ...        ...               ...              ...                ...\n339  Gentoo     Biscoe               NaN              NaN                NaN\n340  Gentoo     Biscoe              46.8             14.3              215.0\n341  Gentoo     Biscoe              50.4             15.7              222.0\n342  Gentoo     Biscoe              45.2             14.8              212.0\n343  Gentoo     Biscoe              49.9             16.1              213.0\n\n     body_mass_g     sex\n0         3750.0    MALE\n1         3800.0  FEMALE\n2         3250.0  FEMALE\n3            NaN     NaN\n4         3450.0  FEMALE\n..           ...     
...\n339          NaN     NaN\n340       4850.0  FEMALE\n341       5750.0    MALE\n342       5200.0  FEMALE\n343       5400.0    MALE\n\n[344 rows x 7 columns]\n     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0          0       0              39.1             18.7              181.0\n1          0       0              39.5             17.4              186.0\n2          0       0              40.3             18.0              195.0\n4          0       0              36.7             19.3              193.0\n5          0       0              39.3             20.6              190.0\n..       ...     ...               ...              ...                ...\n338        2       1              47.2             13.7              214.0\n340        2       1              46.8             14.3              215.0\n341        2       1              50.4             15.7              222.0\n342        2       1              45.2             14.8              212.0\n343        2       1              49.9             16.1              213.0\n\n     body_mass_g  sex\n0         3750.0    0\n1         3800.0    1\n2         3250.0    1\n4         3450.0    1\n5         3650.0    0\n..           ...  ...\n338       4925.0    1\n340       4850.0    1\n341       5750.0    0\n342       5200.0    1\n343       5400.0    0\n\n[334 rows x 7 columns]\n```\n\n### Task 2\n\nPerform data clustering using the k-means algorithm. 
Use the elbow rule and silhouette coefficient to find the \noptimal number of clusters.\n\n`t2`\n\n__Output__\n```\n[[1.74809160e+00 1.03816794e+00 4.71885496e+01 1.57244275e+01\n  2.14786260e+02 5.06660305e+03 3.96946565e-01]\n [3.89162562e-01 1.34975369e+00 4.19330049e+01 1.80871921e+01\n  1.92128079e+02 3.65566502e+03 5.66502463e-01]]\ncluster\n1    203\n0    131\nName: count, dtype: int64\n```\n![](images/p6_1.png)\n![](images/p6_2.png)\n![](images/p6_3.png)\n\n__Conclusion__\\\nThe task is to understand from the graphs how many clusters to define and whether the \ndataset can be cleanly divided into clusters at all. We select the number of clusters \nthrough graphical analysis and then assign each point to a cluster.\nThe first graph (score1) is the value of the cost function. Applying the elbow rule, we look for the break point of \nthe line, where the steep decrease slows down. In this case the very first change in slope on \nthe left is at the value 3 on the X-axis, so the number of clusters is 3. The elbow method is not \naccurate, so another method is also used. In the second graph, the silhouette coefficient is plotted; we look \nfor its maximum, and the corresponding value on the X axis is the optimal number of clusters (in our case 2). This method is more \naccurate but not ideal, since there are also ways to find the number of clusters automatically.\nThe k-means method is the most commonly used clustering algorithm. The algorithm takes random points as the initial \ncluster centers (centroids) and assigns every other point to the centroid closest to it, so each centroid gets a \nset of closest points. Each centroid then moves to the center of its set, and everything repeats. 
Cluster_centers are the \ncentroids, and value_counts shows how many elements are in each cluster.\nWe create a Kmeans model, fit the model to the data and get the centroids; each point belongs to the cluster of the \ncentroid it is closest to. The algorithm requires specifying the number of \nclusters. We create a cluster column by adding labels_ to the dataset; these are the numbers of the cluster (centroid) \nof each point, and they are needed to determine the color of each cluster. The coordinates in the three-dimensional graphs are \nthe predictors (features) considered most important; this choice is subjective.\n\n### Task 3\n\nPerform data clustering using a hierarchical clustering algorithm.\n\n`t3`\n\n__Output__\\\n![](images/p6_4.png)\n\n__Conclusion__\\\nAgglomerative hierarchical clustering starts with every point as its own small cluster and then repeatedly merges the closest \nclusters, moving from smaller clusters to larger ones (this merging is what agglomerative \nmeans). This method is no better than k-means. In this method, clusters are nested within each other and \nform a tree structure. Hierarchical clustering is used to determine relationships between objects. Trees are better \nthan logistic regression. Here we also choose the number of clusters equal to 2.\n\n### Task 4\n\nPerform data clustering using the DBSCAN algorithm.\n\n`t4`\n\n__Output__\n```\n22\ncluster\n-1     195\n 0      12\n 8      10\n 6       9\n 3       9\n 5       8\n 12      7\n 1       7\n 4       6\n 2       6\n 9       6\n 13      6\n 14      6\n 15      6\n 16      6\n 7       5\n 11      5\n 10      5\n 18      5\n 17      5\n 20      5\n 19      5\nName: count, dtype: int64\n```\n![](images/p6_5.png)\n\n__Conclusion__\\\nDBSCAN (Density-based spatial clustering of applications with noise) is based on density. 
The algorithm groups \ntogether points that are closely spaced (high density) and marks points that lie in areas of low density as \noutliers (which are ignored; they get the label -1). The algorithm itself determines the number of clusters (22 labels in this case, including the outlier label -1). \nThe expected number was 3 clusters, one per penguin species (species field).\nThe eps parameter (you can change these parameters) is the maximum distance between two points at which they are still \nconsidered neighbors; in our case, if the distance is \nless than 11, then these 2 points are neighbors. The samples parameter is the minimum number of neighbors \na point must have to be considered a core point of a cluster.\n\n### Task 5\n\nVisualize clustered data using t-SNE or UMAP if necessary. If the data is three-dimensional, then a three-dimensional \nscatter plot can be used.\n\n`t5`\n\n__Output__\\\n![](images/p6_6.png)\n\n__Conclusion__\\\nThis visualization was done using the t-SNE method; the rest were visualized with 3D scatter plots. In principle, there \nis no need to use t-SNE or UMAP to visualize the data, since the visualization has already been done in another way. \nThis method works better than all the previous ones. t-SNE and UMAP are essentially the same thing: both depend on the distance \nbetween the points.\n\n---\n\n## Practice 7 - Ensemble learning methods\n\n### Task 1\n\nFind data for a classification task or for a regression task.\n\n`t1`\n\n__Output__\n```\n    species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0    Adelie  Torgersen              39.1             18.7              181.0\n1    Adelie  Torgersen              39.5             17.4              186.0\n2    Adelie  Torgersen              40.3             18.0              195.0\n3    Adelie  Torgersen               NaN              NaN                NaN\n4    Adelie  Torgersen              36.7             19.3              193.0\n..      ...        ...               ...              ...                
...\n339  Gentoo     Biscoe               NaN              NaN                NaN\n340  Gentoo     Biscoe              46.8             14.3              215.0\n341  Gentoo     Biscoe              50.4             15.7              222.0\n342  Gentoo     Biscoe              45.2             14.8              212.0\n343  Gentoo     Biscoe              49.9             16.1              213.0\n\n     body_mass_g     sex\n0         3750.0    MALE\n1         3800.0  FEMALE\n2         3250.0  FEMALE\n3            NaN     NaN\n4         3450.0  FEMALE\n..           ...     ...\n339          NaN     NaN\n340       4850.0  FEMALE\n341       5750.0    MALE\n342       5200.0  FEMALE\n343       5400.0    MALE\n\n[344 rows x 7 columns]\n     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n0          0       0              39.1             18.7              181.0\n1          0       0              39.5             17.4              186.0\n2          0       0              40.3             18.0              195.0\n4          0       0              36.7             19.3              193.0\n5          0       0              39.3             20.6              190.0\n..       ...     ...               ...              ...                ...\n338        2       1              47.2             13.7              214.0\n340        2       1              46.8             14.3              215.0\n341        2       1              50.4             15.7              222.0\n342        2       1              45.2             14.8              212.0\n343        2       1              49.9             16.1              213.0\n\n     body_mass_g  sex\n0         3750.0    0\n1         3800.0    1\n2         3250.0    1\n4         3450.0    1\n5         3650.0    0\n..           ...  
...\n338       4925.0    1\n340       4850.0    1\n341       5750.0    0\n342       5200.0    1\n343       5400.0    0\n\n[334 rows x 7 columns]\n\n Size of Predictor Train set (267, 6)\n Size of Predictor Test set (67, 6)\n Size of Target Train set (267,)\n Size of Target Test set (67,)\n```\n\n### Task 2\n\nImplement bagging.\n\n`t2`\n\n__Output__\n```\nElapsed time: 0.13482189178466797 seconds\nF1 metric for training set 0.9955112808599441\nF1 metric for test set 0.98200460347353\n\nElapsed time: 5.649371862411499 seconds\nF1 metric for training set 0.9910658307210031\nF1 metric for test set 0.98200460347353\n```\n\n__Conclusion__\\\nFirst, we create a regular tree model (RandomForest is the most frequently used) without setting any parameters to \nsearch over. With this we check whether trees work on our data in principle: if the tree \nshows an accuracy of less than 60%, then we need to use a neural network. We get 2 results: f1 for training shows how well \nthe tree has trained (not the most important result); f1 for test shows how well the tree works on new data (the result \nmay be worse, but in this case it is close, ~0.98). Tree parameters: max_depth is the maximum depth of the tree, \nmin_samples_split is the minimum number of examples needed to split an internal node. Regression tries to find a linear \nrelationship (x increases, which means y also increases), while the tree checks many possible relationships, each split being a \ntest on the feature values, so trees are better than regression.\nWe set intervals in params_grid and create grid_search_cv, which lays out all the combinations, runs them all and returns the most \naccurate one. Estimator is the model to be used; scoring is the metric we want to optimize (f1_macro is the F1 score averaged \nover the classes); cv is the number of cross-validation runs for each parameter combination. Best_model is best_estimator_, the model with the best parameters, with which \nthe answers are predicted. 
Average=macro in f1_score means that the F1 values of the classes are averaged with equal weight.\nBagging is used here together with tuning for trees. Strictly, bagging (bootstrap aggregating) trains many trees on random \nsamples of the data and combines their answers, which RandomForest already does; the tuning is done using grid_search \n(all options are laid out and the best is selected). We set the intervals not by eye but according to the standard values \nindicated in the documentation and in the manual. Next, the best parameters are selected after running all possible \ncombinations. The prediction result on the test set may deteriorate after tuning due to the randomness factor of the \ntree. Tuning should be used when the prediction result is 60% or less. The model sees the answers for the train set, but not \nfor the test set, which is why the test result is more important. If after applying tuning the result has not changed, then \nthis is the best result.\n\n### Task 3\n\nImplement boosting on the same data that was used for bagging.\n\n`t3`\n\n__Output__\n```\nLearning rate set to 0.019854\n0:\tlearn: 1.0712628\ttotal: 64.1ms remaining: 3m 12s\n1:\tlearn: 1.0446230\ttotal: 100ms remaining: 2m 30s\n2:\tlearn: 1.0206709\ttotal: 152ms remaining: 2m 32s\n3:\tlearn: 0.9960718\ttotal: 202ms remaining: 2m 31s\n...\n2997:\tlearn: 0.0028512\ttotal: 44s\tremaining: 29.4ms\n2998:\tlearn: 0.0028501\ttotal: 44s\tremaining: 14.7ms\n2999:\tlearn: 0.0028491\ttotal: 44s\tremaining: 0us\n\nElapsed time: 46.70735192298889 seconds\nBoosted F1 metric for train set 1.0\nBoosted F1 metric for test set 0.98200460347353\n```\n\n__Conclusion__\\\nBoosting builds many trees one after another, each new tree correcting the errors of the previous ones. Boosting uses trees as its base algorithms and most often works on samples with \nheterogeneous data. 
Boosting helps to find connections where ordinary trees miss them (it helps trees): ordinary trees \n(RandomForest) should find connections but cannot always do so, and regression will not be able to find connections \nif the dependencies are “not visible to the eye.” In this case CatBoost is used (not the best algorithm); there are \nbetter algorithms, such as AdaBoost and XGBoost.\n\n### Task 4\n\nCompare the results of the algorithms (working time and quality of models). Draw conclusions.\n\n__Conclusion__\\\nThe results of the execution of the algorithms, measurements of their operating time and the quality of the models \nare given in the results of the corresponding tasks.\nFirst we built a simple tree, then we did bagging, and finally we did boosting. Boosting works best, although in this run all \nthree models reached the same F1 on the test set (about 0.98), and it is \nthe slowest (about 46.7 seconds against 5.6 seconds for bagging and 0.13 seconds for the simple tree), since 3000 trees are built, but each of them is built relatively quickly. Bagging is slower than \nbuilding a single tree in boosting and slower than building a simple tree.\nTrees work better than regression (for classification mostly), boosting works better than bagging, and bagging works \nbetter than a regular tree. The first level is a tree with a choice of parameters “by eye”, the second level is \nbagging (tuning) and the third level is boosting. Bagging uses intervals and boosting uses iterations. 
The result \nof “F1 metric for test set” is much more important than “F1 metric for training set” (based on train, the machine \nsearches for connections, test is used to check the identified connections), it is this result that we check.\nExecution on a graphics processor (GPU) is faster than on a central processing unit (CPU).\n\n---\n\n## Practice 8 - Teaching methods based on association rules\n\n### Task 1\n\nLoad data.\n\n`t1`\n\n__Output__\n```\n              shrimp            almonds      avocado    vegetables mix  \\\n0            burgers          meatballs         eggs               NaN\n1            chutney                NaN          NaN               NaN\n2             turkey            avocado          NaN               NaN\n3      mineral water               milk   energy bar  whole wheat rice\n4     low fat yogurt                NaN          NaN               NaN\n...              ...                ...          ...               ...\n7495          butter         light mayo  fresh bread               NaN\n7496         burgers  frozen vegetables         eggs      french fries\n7497         chicken                NaN          NaN               NaN\n7498        escalope          green tea          NaN               NaN\n7499            eggs    frozen smoothie  yogurt cake    low fat yogurt\n\n     green grapes whole weat flour yams cottage cheese energy drink  \\\n0             NaN              NaN  NaN            NaN          NaN\n1             NaN              NaN  NaN            NaN          NaN\n2             NaN              NaN  NaN            NaN          NaN\n3       green tea              NaN  NaN            NaN          NaN\n4             NaN              NaN  NaN            NaN          NaN\n...           ...              ...  ...            ...          
...\n7495          NaN              NaN  NaN            NaN          NaN\n7496    magazines        green tea  NaN            NaN          NaN\n7497          NaN              NaN  NaN            NaN          NaN\n7498          NaN              NaN  NaN            NaN          NaN\n7499          NaN              NaN  NaN            NaN          NaN\n\n     tomato juice low fat yogurt green tea honey salad mineral water salmon  \\\n0             NaN            NaN       NaN   NaN   NaN           NaN    NaN\n1             NaN            NaN       NaN   NaN   NaN           NaN    NaN\n2             NaN            NaN       NaN   NaN   NaN           NaN    NaN\n3             NaN            NaN       NaN   NaN   NaN           NaN    NaN\n4             NaN            NaN       NaN   NaN   NaN           NaN    NaN\n...           ...            ...       ...   ...   ...           ...    ...\n7495          NaN            NaN       NaN   NaN   NaN           NaN    NaN\n7496          NaN            NaN       NaN   NaN   NaN           NaN    NaN\n7497          NaN            NaN       NaN   NaN   NaN           NaN    NaN\n7498          NaN            NaN       NaN   NaN   NaN           NaN    NaN\n7499          NaN            NaN       NaN   NaN   NaN           NaN    NaN\n\n     antioxydant juice frozen smoothie spinach  olive oil\n0                  NaN             NaN     NaN        NaN\n1                  NaN             NaN     NaN        NaN\n2                  NaN             NaN     NaN        NaN\n3                  NaN             NaN     NaN        NaN\n4                  NaN             NaN     NaN        NaN\n...                ...             ...     ...        
...\n7495               NaN             NaN     NaN        NaN\n7496               NaN             NaN     NaN        NaN\n7497               NaN             NaN     NaN        NaN\n7498               NaN             NaN     NaN        NaN\n7499               NaN             NaN     NaN        NaN\n\n[7500 rows x 20 columns]\n\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nRangeIndex: 7500 entries, 0 to 7499\nData columns (total 20 columns):\n #   Column             Non-Null Count  Dtype\n---  ------             --------------  -----\n 0   shrimp             7500 non-null   object\n 1   almonds            5746 non-null   object\n 2   avocado            4388 non-null   object\n 3   vegetables mix     3344 non-null   object\n 4   green grapes       2528 non-null   object\n 5   whole weat flour   1863 non-null   object\n 6   yams               1368 non-null   object\n 7   cottage cheese     980 non-null    object\n 8   energy drink       653 non-null    object\n 9   tomato juice       394 non-null    object\n 10  low fat yogurt     255 non-null    object\n 11  green tea          153 non-null    object\n 12  honey              86 non-null     object\n 13  salad              46 non-null     object\n 14  mineral water      24 non-null     object\n 15  salmon             7 non-null      object\n 16  antioxydant juice  3 non-null      object\n 17  frozen smoothie    3 non-null      object\n 18  spinach            2 non-null      object\n 19  olive oil          0 non-null      float64\ndtypes: float64(1), object(19)\nmemory usage: 1.1+ MB\n```\n\n### Task 2\n\nVisualize data (display relative and actual frequency of occurrence for the 20 most popular products on histograms).\n\n`t2`\n\n__Output__\n```\nmineral water        1787\neggs                 1348\nspaghetti            1306\nfrench fries         1282\nchocolate            1230\ngreen tea             990\nmilk                  972\nground beef           737\nfrozen vegetables     715\npancakes              
713\nburgers               654\ncake                  608\ncookies               603\nescalope              595\nlow fat yogurt        573\nshrimp                535\ntomatoes              513\nolive oil             493\nfrozen smoothie       474\nturkey                469\nName: count, dtype: int64\n\nmineral water        0.238267\neggs                 0.179733\nspaghetti            0.174133\nfrench fries         0.170933\nchocolate            0.164000\ngreen tea            0.132000\nmilk                 0.129600\nground beef          0.098267\nfrozen vegetables    0.095333\npancakes             0.095067\nburgers              0.087200\ncake                 0.081067\ncookies              0.080400\nescalope             0.079333\nlow fat yogurt       0.076400\nshrimp               0.071333\ntomatoes             0.068400\nolive oil            0.065733\nfrozen smoothie      0.063200\nturkey               0.062533\nName: count, dtype: float64\n```\n![](images/p8_1.png)\n![](images/p8_2.png)\n\n__Conclusion__\\\nThe stack() function puts all products into a single column (a stack), and value_counts() counts how often each product \noccurs and sorts the products from the most frequent to the least frequent. This highlights the most frequently repeated products. \nTo build relative frequencies, we apply normalization, which brings the counts to values between 0 and 1: \nshape gives the dimension of all the data (the number of rows), and we divide each count by this dimension \n(for example, 1787 / 7500 = 0.238267 for mineral water).\n\n### Task 3\n\nApply the Apriori algorithm using 3 different libraries (apriori_python, apyori, efficient_apriori). 
Select \nhyperparameters for the algorithms so that about 10 best rules are output.\n\n`t3`\n\n__Output__\n```\nburgers ['burgers', 'meatballs', 'eggs']\n\nfirst:\n [[{'frozen vegetables', 'spaghetti', 'turkey', 'milk'}, {'mineral water'}, 0.9], [{'chocolate', 'frozen vegetables', 'olive oil', 'shrimp'}, {'mineral water'}, 0.9], [{'pasta', 'eggs', 'mineral water'}, {'shrimp'}, 0.9090909090909091], [{'herb \u0026 pepper', 'rice', 'mineral water'}, {'ground beef'}, 0.9090909090909091], [{'pancakes', 'ground beef', 'whole wheat rice'}, {'mineral water'}, 0.9090909090909091], [{'red wine', 'soup'}, {'mineral water'}, 0.9333333333333333], [{'pasta', 'mushroom cream sauce'}, {'escalope'}, 0.95], [{'french fries', 'pasta', 'mushroom cream sauce'}, {'escalope'}, 1.0], [{'cake', 'olive oil', 'shrimp'}, {'mineral water'}, 1.0], [{'meatballs', 'cake', 'mineral water'}, {'milk'}, 1.0], [{'olive oil', 'ground beef', 'light cream'}, {'mineral water'}, 1.0]]\nend\n\nsecond:\n [RelationRecord(items=frozenset({'pasta', 'escalope', 'mushroom cream sauce'}), support=0.002533333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta', 'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.95, lift=11.974789915966385)]), RelationRecord(items=frozenset({'red wine', 'soup', 'mineral water'}), support=0.0018666666666666666, ordered_statistics=[OrderedStatistic(items_base=frozenset({'red wine', 'soup'}), items_add=frozenset({'mineral water'}), confidence=0.9333333333333333, lift=3.917179630665921)]), RelationRecord(items=frozenset({'meatballs', 'cake', 'mineral water', 'milk'}), support=0.0010666666666666667, ordered_statistics=[OrderedStatistic(items_base=frozenset({'meatballs', 'cake', 'mineral water'}), items_add=frozenset({'milk'}), confidence=1.0, lift=7.71604938271605)]), RelationRecord(items=frozenset({'shrimp', 'cake', 'olive oil', 'mineral water'}), support=0.0012, ordered_statistics=[OrderedStatistic(items_base=frozenset({'cake', 'olive 
oil', 'shrimp'}), items_add=frozenset({'mineral water'}), confidence=1.0, lift=4.196978175713486)]), RelationRecord(items=frozenset({'pasta', 'shrimp', 'eggs', 'mineral water'}), support=0.0013333333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta', 'eggs', 'mineral water'}), items_add=frozenset({'shrimp'}), confidence=0.9090909090909091, lift=12.744265080713678)]), RelationRecord(items=frozenset({'pasta', 'french fries', 'escalope', 'mushroom cream sauce'}), support=0.0010666666666666667, ordered_statistics=[OrderedStatistic(items_base=frozenset({'french fries', 'pasta', 'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=1.0, lift=12.605042016806722)]), RelationRecord(items=frozenset({'herb \u0026 pepper', 'rice', 'ground beef', 'mineral water'}), support=0.0013333333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb \u0026 pepper', 'rice', 'mineral water'}), items_add=frozenset({'ground beef'}), confidence=0.9090909090909091, lift=9.251264339459725)]), RelationRecord(items=frozenset({'olive oil', 'mineral water', 'ground beef', 'light cream'}), support=0.0012, ordered_statistics=[OrderedStatistic(items_base=frozenset({'olive oil', 'ground beef', 'light cream'}), items_add=frozenset({'mineral water'}), confidence=1.0, lift=4.196978175713486)]), RelationRecord(items=frozenset({'pancakes', 'ground beef', 'whole wheat rice', 'mineral water'}), support=0.0013333333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pancakes', 'ground beef', 'whole wheat rice'}), items_add=frozenset({'mineral water'}), confidence=0.9090909090909091, lift=3.8154347051940785)])]\nend\n\nsecond beautified:\nfrozenset({'pasta', 'mushroom cream sauce'}) frozenset({'escalope'})\nSupport: 0.002533333333333333; Confidence: 0.95; Lift: 11.974789915966385;\n\nfrozenset({'red wine', 'soup'}) frozenset({'mineral water'})\nSupport: 0.0018666666666666666; Confidence: 0.9333333333333333; Lift: 
3.917179630665921;\n\nfrozenset({'meatballs', 'cake', 'mineral water'}) frozenset({'milk'})\nSupport: 0.0010666666666666667; Confidence: 1.0; Lift: 7.71604938271605;\n\nfrozenset({'cake', 'olive oil', 'shrimp'}) frozenset({'mineral water'})\nSupport: 0.0012; Confidence: 1.0; Lift: 4.196978175713486;\n\nfrozenset({'pasta', 'eggs', 'mineral water'}) frozenset({'shrimp'})\nSupport: 0.0013333333333333333; Confidence: 0.9090909090909091; Lift: 12.744265080713678;\n\nfrozenset({'french fries', 'pasta', 'mushroom cream sauce'}) frozenset({'escalope'})\nSupport: 0.0010666666666666667; Confidence: 1.0; Lift: 12.605042016806722;\n\nfrozenset({'herb \u0026 pepper', 'rice', 'mineral water'}) frozenset({'ground beef'})\nSupport: 0.0013333333333333333; Confidence: 0.9090909090909091; Lift: 9.251264339459725;\n\nfrozenset({'olive oil', 'ground beef', 'light cream'}) frozenset({'mineral water'})\nSupport: 0.0012; Confidence: 1.0; Lift: 4.196978175713486;\n\nfrozenset({'pancakes', 'ground beef', 'whole wheat rice'}) frozenset({'mineral water'})\nSupport: 0.0013333333333333333; Confidence: 0.9090909090909091; Lift: 3.8154347051940785;\nend\n\nthird:\n{mushroom cream sauce, pasta} -\u003e {escalope} (conf: 0.950, supp: 0.003, lift: 11.975, conv: 18.413)\n{red wine, soup} -\u003e {mineral water} (conf: 0.933, supp: 0.002, lift: 3.917, conv: 11.426)\n{cake, meatballs, mineral water} -\u003e {milk} (conf: 1.000, supp: 0.001, lift: 7.716, conv: 870400000.000)\n{cake, olive oil, shrimp} -\u003e {mineral water} (conf: 1.000, supp: 0.001, lift: 4.197, conv: 761733333.333)\n{eggs, mineral water, pasta} -\u003e {shrimp} (conf: 0.909, supp: 0.001, lift: 12.744, conv: 10.215)\n{french fries, mushroom cream sauce, pasta} -\u003e {escalope} (conf: 1.000, supp: 0.001, lift: 12.605, conv: 920666666.667)\n{herb \u0026 pepper, mineral water, rice} -\u003e {ground beef} (conf: 0.909, supp: 0.001, lift: 9.251, conv: 9.919)\n{ground beef, light cream, olive oil} -\u003e {mineral water} (conf: 1.000, 
supp: 0.001, lift: 4.197, conv: 761733333.333)\n{ground beef, pancakes, whole wheat rice} -\u003e {mineral water} (conf: 0.909, supp: 0.001, lift: 3.815, conv: 8.379)\n{chocolate, frozen vegetables, olive oil, shrimp} -\u003e {mineral water} (conf: 0.900, supp: 0.001, lift: 3.777, conv: 7.617)\n{frozen vegetables, milk, spaghetti, turkey} -\u003e {mineral water} (conf: 0.900, supp: 0.001, lift: 3.777, conv: 7.617)\nend\n```\n\n__Conclusion__\\\nThe task is to have the machine find associations between items. For example, if 10 people went to the grocery \nstore, the goal is to infer which products tend to be bought together. The dataset is a list of customer \ntransactions in which we look for such associations.\nARL (association rule learning) algorithms learn rules of the form A -\u003e B from the transactions.\nThe first algorithm: rules are the rules the algorithm returns. minSup is the minimum support, a value from 0 to 1: \nthe share of transactions that contain all items of a rule. It is a minimum because it sets the threshold below \nwhich itemsets are discarded. minConf is the minimum confidence: the share of transactions containing A that also \ncontain B, that is, how sure the algorithm is that A is associated with B.\nEach dataset needs its own values, chosen so that the number of rules is neither too small nor too large. The \nsmaller both values are, the longer the search takes; if they are too large (close to 1), no rules may be found. \nThe goal is a balance: the smallest values that still yield useful associations. Of the two, minSup has the larger \neffect. 
Both values must be between 0 and 1; minConf is set high because only the \n10 most reliable rules are needed.\nThe first algorithm returns each rule together with its confidence, for example [{'chocolate', 'frozen vegetables', \n'olive oil', 'shrimp'}, {'mineral water'}, 0.9]: in 90% of the transactions that contain these 4 items, mineral \nwater was bought as well. The value 0.9 is the confidence; only rules with a confidence of at least 80% (minConf) \nare returned.\nSecond algorithm: min_lift is added. Lift is the ratio of how often the items occur together to how often they \nwould occur together if they were independent. A lift of 1 means the items are independent, a lift above 1 means a \npositive dependence (the higher the better), and a lift below 1 means a negative one. Setting min_lift above 1 \nfilters out independent rules. The output type also differs: the items are returned as frozensets, which are \nimmutable sets. The output contains 9 rules rather than exactly 10, because several rules share the same \nconfidence, so the thresholds cannot isolate exactly 10.\nThird algorithm: the output also includes conv (conviction), which compares how often the rule would be wrong if \nthe items were independent with how often it is actually wrong, a wrong prediction being, for example, beer bought \nwithout diapers. The further above 1, the better. The rule {mushroom cream sauce, pasta} -\u003e {escalope} means that \nthe algorithm is confident that escalope is bought together with mushroom cream sauce and pasta.\nMinSup - how often elements occur in the data set, minConf - how often the rule","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvadniks%2Fakabigdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvadniks%2Fakabigdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvadniks%2Fakabigdata/lists"}