{"id":21627533,"url":"https://github.com/billsioros/crime-data-exploration","last_synced_at":"2026-05-06T22:36:17.001Z","repository":{"id":105137537,"uuid":"186845791","full_name":"billsioros/crime-data-exploration","owner":"billsioros","description":"Visual exploration of crime incident reports provided by the Boston Police Department","archived":false,"fork":false,"pushed_at":"2019-06-09T13:38:06.000Z","size":28001,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-18T20:34:39.787Z","etag":null,"topics":["crime-analysis","crime-data","crime-incidents","folium","ipython","ipython-notebook","jupyter","jupyter-notebook"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/billsioros.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-15T14:40:56.000Z","updated_at":"2019-06-09T13:38:08.000Z","dependencies_parsed_at":"2023-04-12T20:25:08.284Z","dependency_job_id":null,"html_url":"https://github.com/billsioros/crime-data-exploration","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/billsioros/crime-data-exploration","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billsioros%2Fcrime-data-exploration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billsioros%2Fcrime-data-exploration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billsioros%2Fc
rime-data-exploration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billsioros%2Fcrime-data-exploration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/billsioros","download_url":"https://codeload.github.com/billsioros/crime-data-exploration/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billsioros%2Fcrime-data-exploration/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32715315,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-06T19:35:05.142Z","status":"ssl_error","status_checked_at":"2026-05-06T19:35:03.996Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crime-analysis","crime-data","crime-incidents","folium","ipython","ipython-notebook","jupyter","jupyter-notebook"],"created_at":"2024-11-25T01:16:51.908Z","updated_at":"2026-05-06T22:36:16.995Z","avatar_url":"https://github.com/billsioros.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Contributors\n\n* Sioros Vasileios (billsioros)\n* Konstantinos Kyriakos (Qwerkykk)\n\n# Reader:\n\n- Reads the *csv* file and stores it in memory in panda format. 
If a pickled copy already exists on disk, it is loaded instead\n- Preprocesses the data (converts NaN values in the 'SHOOTING' column to 'N', drops rows with NaN values, creates the factorized equivalent of each column to be used in the k-means clustering, etc.)\n\n\n```python\nimport os\n\nimport numpy as np\nimport pandas as pd\n\nfrom sklearn.preprocessing import MinMaxScaler\n\nclass Reader:\n\n    headers = [\n        'INCIDENT_NUMBER',\n        'OFFENSE_CODE_GROUP',\n        'DISTRICT',\n        'SHOOTING',\n        'YEAR',\n        'MONTH',\n        'DAY_OF_WEEK',\n        'HOUR',\n        'Lat',\n        'Long'\n    ]\n\n    types = dict(zip(headers, [object, object, object, object, np.int32, np.int32, object, np.int32, np.float64, np.float64]))\n\n    def __init__(self, filename, lat_predicate=lambda entries: entries \u003e 40, lon_predicate=lambda entries: entries \u003c -60):\n\n        if not isinstance(filename, str):\n            raise ValueError(\"'filename' is not an instance of 'str'\")\n\n        if not os.path.isdir('out'):\n            os.mkdir('out')\n\n        pickled = os.path.splitext(os.path.basename(filename))[0] + '.pkl'\n\n        pickled = os.path.join(os.path.curdir, 'out', pickled)\n\n        if os.path.isfile(pickled):\n\n            print('\u003cLOG\u003e: Loading pickled dataset from', \"'\" + pickled + \"'\")\n\n            self.data = pd.read_pickle(pickled)\n\n            print('\u003cLOG\u003e: The dataset consists', len(self.data.index), 'rows and', len(self.data.columns), 'columns')\n\n            return\n\n        print('\u003cLOG\u003e: Processing file', \"'\" + filename + \"'\")\n\n        self.data = pd.read_csv(filename, dtype=self.types, skipinitialspace=True, usecols=self.headers)\n\n        print('\u003cLOG\u003e: The dataset consists', len(self.data.index), 'rows and', len(self.data.columns), 'columns')\n\n        self.data['SHOOTING'].fillna('N', inplace=True)\n\n        print('\u003cLOG\u003e: Dropping NaN 
values')\n\n        self.data.dropna(inplace=True)\n\n        print('\u003cLOG\u003e: Restricting longitude and latitude')\n\n        self.data = self.data[lat_predicate(self.data['Lat']) \u0026 (lon_predicate(self.data['Long']))]\n\n        print('\u003cLOG\u003e: Creating column', \"'\" + 'TIME_PERIOD' + \"'\")\n\n        self.data['TIME_PERIOD'] = ['Night' if hour \u003c= 6 or hour \u003e= 18 else 'Day' for hour in list(self.data['HOUR'])]\n\n        print('\u003cLOG\u003e: Augmenting the dataset by the factorized equivalent of each column')\n\n        gmin = self.data[['Long', 'Lat']].min().min()\n        gmax = self.data[['Long', 'Lat']].max().max()\n\n        for header in self.headers:\n\n            self.data[[header + '_FACTORIZED']] = self.data[[header]].stack().rank(method='dense').unstack()\n\n            self.data[[header + '_FACTORIZED']] = MinMaxScaler((gmin, gmax)).fit_transform(self.data[[header + '_FACTORIZED']])\n\n        print('\u003cLOG\u003e: The dataset consists', len(self.data.index), 'rows and', len(self.data.columns), 'columns')\n\n        print('\u003cLOG\u003e: Saving pickled datafrime to', \"'\" + pickled + \"'\")\n\n        self.data.to_pickle(pickled)\n\n\n    def groupby(self, headers):\n\n        if isinstance(headers, str):\n            headers = set([headers])\n        elif isinstance(headers, list):\n            headers = set(headers)\n        else:\n            raise ValueError(\"'headers' must be an instance of 'str' or 'list'\")\n\n        if not headers.issubset(self.headers):\n            raise ValueError(headers.difference(self.headers), 'header(s) are not supported')\n\n        return self.data.groupby(list(headers))\n```\n\n# KMeans:\n\n- Initialization requires a Reader to be passed as an argument\n- Performs clustering on the dataset according to the geographical location and (optionally) a supplied header / category.\n\n\n```python\nfrom re import sub\n\nfrom sklearn import cluster\n\nfrom reader import Reader\nfrom 
visualizer import Visualizer\n\nclass KMeans:\n\n    def __init__(self, reader):\n\n        if not isinstance(reader, Reader):\n            raise ValueError(\"'reader' is not an instance of 'Reader'\")\n\n        self.data = reader.data\n\n\n    def fit(self, n_clusters=None, header=None):\n\n        if header:\n\n            if not isinstance(header, str):\n                raise ValueError(\"'header' is not an instance of 'str'\")\n\n            print('\u003cLOG\u003e: Clustering according to geographical location and', \"'\" + header.replace('_', ' ').title() + \"'\")\n\n            header = header + '_FACTORIZED'\n\n            data = self.data[['Long', 'Lat', header]]\n\n            n_clusters = len(self.data[header].unique())\n\n        else:\n\n            if not isinstance(n_clusters, int) or n_clusters \u003c= 0:\n                raise ValueError(\"'n_clusters' must have an integer value greater than zero\")\n\n            print('\u003cLOG\u003e: Clustering according to geographical location')\n\n            data = self.data[['Long', 'Lat']]\n\n        print('\u003cLOG\u003e: Running kmeans with', '{0:2}'.format(n_clusters), 'clusters')\n        \n        return cluster.KMeans(n_clusters=n_clusters).fit(data).labels_.astype(float)\n```\n\n# Visualizer:\n\n- Initialization requires a Reader to be passed as an argument\n- Data can be visualized with countplot() and scatterplot()\n\n\n```python\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\nfrom reader import Reader\n\nclass Visualizer:\n\n    def __init__(self, reader):\n\n        if not isinstance(reader, Reader):\n            raise ValueError(\"'reader' is not an instance of 'Reader'\")\n\n        sns.set(style='whitegrid')\n\n        self.data = reader.data\n\n\n    def countplot(self, header, title, squeeze=False, predicate=None, figsize=(16, 6), palette='Set3'):\n\n        if not isinstance(header, str):\n            raise ValueError(\"'header' is not an instance of 'str'\")\n\n        if 
not isinstance(title, str):\n            raise ValueError(\"'title' is not an instance of 'str'\")\n\n        if header == 'DAY_OF_WEEK':\n            order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']\n        else:\n            order = sorted(self.data[header].unique())\n\n        data = self.data\n\n        if predicate:\n            data = data[predicate(data)]\n\n        plt.figure(figsize=figsize)\n\n        axes = sns.countplot(x=header, data=data, order=order, palette=palette)\n\n        if header == 'MONTH':\n            axes.set_xticklabels(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])\n\n        axes.set_title(title)\n\n        axes.set(xlabel='', ylabel='')\n\n        if squeeze:\n            axes.set_xticklabels(axes.get_xticklabels(), rotation=90, fontsize=7, ha='left')\n\n            plt.tight_layout()\n\n        plt.show()\n\n\n    def scatterplot(self, hue, title, figsize=(12, 12), palette='Set2'):\n\n        plt.figure(figsize=figsize)\n\n        axes = sns.scatterplot(x='Long', y='Lat', data=self.data, hue=hue, palette=palette, legend=False)\n\n        axes.set(xlabel='Longitude', ylabel='Latitude')\n\n        if title:\n            axes.set_title(title)\n\n        plt.show()\n```\n\n# Map:\n\n- Initialization requires a Reader to be passed as an argument\n- display() uses a header / category and a coloring attribute to group and colorize the incidents accordingly\n\n\n```python\nimport folium\nfrom folium import IFrame\nfrom folium import Popup\nfrom folium.plugins import MarkerCluster\nfrom IPython.display import display\n\nfrom reader import Reader\n\ntable = \"\"\"\n\u003c!DOCTYPE html\u003e\n\u003chtml\u003e\n\n\u003chead\u003e\n    \u003cstyle\u003e\n        #info {{\n            font-family: \"Trebuchet MS\", Arial, Helvetica, sans-serif;\n            border-collapse: collapse;\n            width: 100%;\n        
}}\n\n        #info td,\n        #info th {{\n            border: 1px solid #ddd;\n            padding: 8px;\n        }}\n\n        #info tr:nth-child(even) {{\n            background-color: #f2f2f2;\n        }}\n\n        #info tr:hover {{\n            background-color: #ddd;\n        }}\n\n        #info th {{\n            padding-top: 12px;\n            padding-bottom: 12px;\n            text-align: left;\n            background-color: rgb(86, 76, 175);\n            color: white;\n        }}\n    \u003c/style\u003e\n\u003c/head\u003e\n\n\u003cbody\u003e\n\n    \u003ctable id=\"info\"\u003e\n        \u003ctr\u003e\n            \u003cth\u003eIncident Number\u003c/th\u003e\n            \u003cth\u003e{}\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e{}\u003c/td\u003e\n            \u003ctd\u003e{}\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/table\u003e\n\n\u003c/body\u003e\n\n\u003c/html\u003e\n\"\"\".format\n\nclass Map:\n\n    def __init__(self, reader, sample_size=500):\n\n        if not isinstance(reader, Reader):\n            raise ValueError(\"'reader' is not an instance of 'Reader'\")\n\n        if sample_size:\n            if not isinstance(sample_size, int) or sample_size \u003c= 0:\n                raise ValueError(\"'sample_size' must have an integer value greater than zero\")\n\n            self.sample_size = sample_size\n\n            self.data = reader.data.sample(n=sample_size)\n\n        else:\n            self.data = reader.data\n\n        self.center_x, self.center_y = self.data['Lat'].mean(), self.data['Long'].mean()\n\n\n    def display(self, header,coloring_attr = 'YEAR', predicate=None, zoom_start=11, popup_width=400, popup_height=100):\n\n        if not isinstance(header, str):\n            raise ValueError(\"'header' is not an instance of 'str'\")\n        \n        if not isinstance(coloring_attr,str):\n            raise ValueError(\"'coloring_attr' is not an instance of 'str'\")\n\n        
data = self.data[['INCIDENT_NUMBER', 'Lat', 'Long', header,coloring_attr]]\n\n        if predicate:\n            data = data[predicate(self.data)]\n\n        locations, popups ,icons = {}, {}, {}\n\n        available_colors = [ 'blue', 'green', 'purple', 'orange', 'darkred',\n            'lightred', 'beige', 'darkblue', 'darkgreen', 'cadetblue',\n            'darkpurple', 'gray']\n        \n        unique_tag = list(set(data[coloring_attr]))\n    \n        color_pallete = {}\n        \n        for tag in unique_tag:\n            color_pallete[tag] = available_colors[unique_tag.index(tag)]\n        \n        for key in color_pallete.keys():\n            print('\u003c' + str(key) +': ' + str(color_pallete[key]) + '\u003e',end =\" \")\n        \n        formatted_header = header.replace('_', ' ').title()\n\n        for _, row in data.iterrows():\n\n            if not row[header] in locations:\n                locations[row[header]] = []\n                popups[row[header]] = []\n                icons[row[header]] = []\n\n            locations[row[header]].append([row['Lat'], row['Long']])    \n            \n            icons[row[header]].append(folium.Icon(color=color_pallete[row[coloring_attr]]))\n            \n            html = table(formatted_header, row['INCIDENT_NUMBER'], str(row[header]).title())\n\n            ifrm = IFrame(html=html, width=popup_width, height=popup_height)\n            popups[row[header]].append(Popup(ifrm))\n\n        underlying = folium.Map(location=[self.center_x, self.center_y], zoom_start=zoom_start)\n\n        for key in locations.keys():\n\n            group = folium.FeatureGroup(str(key).title())\n        \n            group.add_child(MarkerCluster(locations[key], popups[key],icons[key]))\n\n            underlying.add_child(group)\n\n        underlying.add_child(folium.LayerControl())\n\n        display(underlying)\n```\n\n\n```python\nreader = Reader('../data/crime.csv')\n```\n\n    \u003cLOG\u003e: Processing file 
'../data/crime.csv'\n    \u003cLOG\u003e: The dataset consists 327820 rows and 10 columns\n    \u003cLOG\u003e: Dropping NaN values\n    \u003cLOG\u003e: Restricting longitude and latitude\n    \u003cLOG\u003e: Creating column 'TIME_PERIOD'\n    \u003cLOG\u003e: Augmenting the dataset by the factorized equivalent of each column\n    \u003cLOG\u003e: The dataset consists 305542 rows and 21 columns\n    \u003cLOG\u003e: Saving pickled datafrime to '.\\out\\crime.pkl'\n    \n\n\n```python\nvisualizer = Visualizer(reader)\n```\n\n\n```python\nvisualizer.countplot('YEAR', 'Crimes per Year')\n```\n\n\n![png](./img/output_11_0.png)\n\n\n\n```python\nvisualizer.countplot('MONTH', 'Crimes per Month')\n```\n\n\n![png](./img/output_12_0.png)\n\n\n\n```python\nvisualizer.countplot('DAY_OF_WEEK', 'Crimes per Day')\n```\n\n\n![png](./img/output_13_0.png)\n\n\n\n```python\nvisualizer.countplot('DISTRICT', 'Crimes per District')\n```\n\n\n![png](./img/output_14_0.png)\n\n\n\n```python\nvisualizer.countplot('YEAR', 'Shootings per Year', predicate=lambda data: data['SHOOTING'] == 'Y')\n```\n\n\n![png](./img/output_15_0.png)\n\n\n\n```python\nvisualizer.countplot('DISTRICT', 'Shootings per District', predicate=lambda data: data['SHOOTING'] == 'Y')\n```\n\n\n![png](./img/output_16_0.png)\n\n\n\n```python\nvisualizer.countplot('TIME_PERIOD', 'Crimes per Time Period')\n```\n\n\n![png](./img/output_17_0.png)\n\n\n\n```python\nvisualizer.countplot('OFFENSE_CODE_GROUP', 'Types Of Crime During The Day', predicate=lambda data: data['TIME_PERIOD'] == 'Day', squeeze=True)\n```\n\n\n![png](./img/output_18_0.png)\n\n\n\n```python\ntitle = 'Geospatial Clustering [{} clusters]'\n```\n\n\n```python\nvisualizer.scatterplot(KMeans(reader).fit(2), title.format(2))\n```\n\n    \u003cLOG\u003e: Clustering according to geographical location\n    \u003cLOG\u003e: Running kmeans with  2 clusters\n    \n\n\n![png](./img/output_20_1.png)\n\n\n\n```python\nvisualizer.scatterplot(KMeans(reader).fit(3), 
title.format(3))\n```\n\n    \u003cLOG\u003e: Clustering according to geographical location\n    \u003cLOG\u003e: Running kmeans with  3 clusters\n    \n\n\n![png](./img/output_21_1.png)\n\n\n\n```python\nvisualizer.scatterplot(KMeans(reader).fit(5), title.format(5))\n```\n\n    \u003cLOG\u003e: Clustering according to geographical location\n    \u003cLOG\u003e: Running kmeans with  5 clusters\n    \n\n\n![png](./img/output_22_1.png)\n\n\n\n```python\nvisualizer.scatterplot(KMeans(reader).fit(10), title.format(10))\n```\n\n    \u003cLOG\u003e: Clustering according to geographical location\n    \u003cLOG\u003e: Running kmeans with 10 clusters\n    \n\n\n![png](./img/output_23_1.png)\n\n\n\n```python\nvisualizer.scatterplot(KMeans(reader).fit(header='MONTH'), 'Crimes per Month')\n```\n\n    \u003cLOG\u003e: Clustering according to geographical location and 'Month'\n    \u003cLOG\u003e: Running kmeans with 12 clusters\n    \n\n\n![png](./img/output_24_1.png)\n\n\n\n```python\nvisualizer.scatterplot(KMeans(reader).fit(header='OFFENSE_CODE_GROUP'), 'Crimes per Offense Type')\n```\n\n    \u003cLOG\u003e: Clustering according to geographical location and 'Offense Code Group'\n    \u003cLOG\u003e: Running kmeans with 67 clusters\n    \n\n\n![png](./img/output_25_1.png)\n\n\n# *Note*\n\nWe encountered some rendering problems, so we plot only a sample of the data. 
The sample size is passed to the 'Map' constructor as the 'sample_size' parameter and defaults to 500.\n\n\n```python\nMap(reader).display('OFFENSE_CODE_GROUP', coloring_attr='YEAR')\n```\n\n    \u003c2016: blue\u003e \u003c2017: green\u003e \u003c2018: purple\u003e \u003c2015: orange\u003e \n\n\n![png](./img/map.png)\n\n# Conclusions:\n\n- Most incidents occurred in 2017.\n- Most incidents occurred between June and September.\n- Slightly more incidents occurred on Fridays than on other days.\n- The districts with the most incidents are A1, B2, B3, C11 and D4.\n- Most incidents happened during the day and most of them were either larceny or motor vehicle accidents.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbillsioros%2Fcrime-data-exploration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbillsioros%2Fcrime-data-exploration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbillsioros%2Fcrime-data-exploration/lists"}