{"id":14977227,"url":"https://github.com/kunalj101/data-science-hacks","last_synced_at":"2025-10-28T03:30:40.289Z","repository":{"id":46202154,"uuid":"239520370","full_name":"kunalj101/Data-Science-Hacks","owner":"kunalj101","description":"Data Science Hacks consists of tips, tricks to help you become a better data scientist. Data science hacks are for all - beginner to advanced. Data science hacks consist of python, jupyter notebook, pandas hacks and so on. ","archived":false,"fork":false,"pushed_at":"2022-11-09T10:40:31.000Z","size":1782,"stargazers_count":403,"open_issues_count":11,"forks_count":324,"subscribers_count":29,"default_branch":"master","last_synced_at":"2025-02-01T10:41:34.241Z","etag":null,"topics":["computer-vision","data","data-analysis","data-science","data-visualization","dataset","hacks","image-augmentation","ipynb","machine-learning","nlp","nlp-machine-learning","numpy","pandas","pandas-dataframe","pandas-python","pandas-tutorial","python","python3","tips-and-tricks"],"latest_commit_sha":null,"homepage":"https://courses.analyticsvidhya.com/courses/data-science-hacks-tips-and-tricks","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kunalj101.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-10T13:38:04.000Z","updated_at":"2025-01-06T22:52:01.000Z","dependencies_parsed_at":"2023-01-21T18:00:13.331Z","dependency_job_id":null,"html_url":"https://github.com/kunalj101/Data-Science-Hacks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kunalj101%2FData-Science-Hacks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kunalj101%2FData-Science-Hacks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kunalj101%2FData-Science-Hacks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kunalj101%2FData-Science-Hacks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kunalj101","download_url":"https://codeload.github.com/kunalj101/Data-Science-Hacks/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238590593,"owners_count":19497351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","
data","data-analysis","data-science","data-visualization","dataset","hacks","image-augmentation","ipynb","machine-learning","nlp","nlp-machine-learning","numpy","pandas","pandas-dataframe","pandas-python","pandas-tutorial","python","python3","tips-and-tricks"],"created_at":"2024-09-24T13:55:19.291Z","updated_at":"2025-10-28T03:30:39.843Z","avatar_url":"https://github.com/kunalj101.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Science Hacks, Tips and Tricks\nData Science Hacks is created and maintained by Analytics Vidhya for the data science community. \n\nIt includes a variety of tips, tricks and hacks related to data science and machine learning.\n\nThese hacks are for all the data scientists out there. It doesn’t matter if you are a beginner or an advanced professional, these hacks will make you more efficient!\n\nFeel free to contribute your own data science hacks here. Make sure that your hack follows the [contribution guidelines](/CONTRIBUTING.md).\n\n\u003e This repository is part of the free course by [Analytics Vidhya](https://www.analyticsvidhya.com/). To learn more such hacks, visit [Data Science Hacks, Tips and Tricks](https://courses.analyticsvidhya.com/courses/data-science-hacks-tips-and-tricks)\n\n- ### Data Science Hack #1 - Resource Downloader \nHow can you extract image data directly from Chrome in one click?\nImagine that you want to build your own machine learning project but you don't have enough data. Collecting it becomes a daunting task.\nWorry not, you can use the ResourceSaver extension to download the data directly! Let's see how!\n\nSteps:\n1. Install the Chrome extension from the given URL.\n1. Go to Google Images or any webpage from which you want to save the data.\n1. Open Inspect Element and click the ResourceSaver tab.\n1. Click the Save All Resources button and a zip file will be created.\n1. Unzip the file and open the folder encrypted-tbn0.gstatic.com\n1. 
You can find the images here.\n \n- ### [Data Science Hack #2 Pandas Apply](./Code/Pandas%20Apply.ipynb) \nPandas Apply is one of the most commonly used functions for playing with data and creating new variables. It passes each row or column of a DataFrame to a function and returns the result. The function can be either built-in or user-defined. \n\n- ### [Data Science Hack #3 Pandas Boolean Indexing](./Code/Pandas_boolean%20indexing.ipynb) \nIt helps you select a subset of the data based on the values in the DataFrame.\n\n- ### [Data Science Hack #4 Pandas Pivot Table](./Code/pandas_pivot_table.ipynb) \nIt is used to create an MS Excel-style pivot table. Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.\n\n- ### [Data Science Hack #5 Pandas crosstab](./Code/pandas_crosstab.ipynb) \nThe pd.crosstab() function cross-tabulates two or more columns and is used to get an initial “feel” (view) of the data.\n\n- ### [Data Science Hack #6 Pandas str.split](./Code/first%20and%20last%20name%20extraction.ipynb) \nIt is used to apply vectorized string functions on a pandas DataFrame column.\nLet’s say you want to split the names in a DataFrame column into first name and last name.\npandas.Series.str along with split() can be used to perform this task.\n\n- ### [Data Science Hack #7 Extract E-mails from text](./Code/Extract%20E-mails%20from%20text.ipynb) \nHere is an interesting hack to extract email addresses from long pieces of text using just 2 lines of Python code and regular expressions. Extracting information from social media posts and websites has become a common practice in data analytics, but sometimes we end up trying complicated methods to achieve things that can be solved easily by using the right technique. 
\n\n- ### [Data Science Hack #8 Normal Distribution](./Code/Convert%20normal%20Distribution.ipynb)\nMany statistical methods work best when a variable is roughly normally distributed (linear regression, for example, assumes normally distributed residuals), but we all know that's usually not the case in real life. We often need to transform our data towards a normal (Gaussian) distribution.\n \n- ### [Data Science Hack #9 Remove Emojis from text](./Code/Removing%20emojis%20from%20text.ipynb)\nPreprocessing is one of the key steps for improving the performance of a model. \nOne of the main reasons for text preprocessing is to remove unwanted characters such as punctuation, emojis and links, which are not required for our problem statement. \n\n- ### Data Science Hack #10 Elbow method for classifier\nThe Elbow Method is used for identifying the value of k in k-Nearest Neighbors. It's a plot of errors at different values of k, and we select the k value with the least error!\n\n- ### Data Science Hack #11 MinMax Scaler\nAn important part of data analysis is preprocessing. We often need to scale our features. With k-NN, for example, we should always scale the data before building the model, or else it will give spurious results.\n\n- ### [Data Science Hack #12 Feature engineering for time series data](./Code/Hack%20of%20the%20day%20-%20Time%20series.ipynb)\nMost of the data collected today includes date and time variables. There is a lot of information that you can extract from these features and use in your analysis! \n\n- ### [Data Science Hack #13 Dummy data for linear regression](./Code/make_regression.ipynb)\nNeed a quick dataset to try out a regression model? Instead of spending days manually collecting data, you can use scikit-learn's make_regression function, which generates a synthetic regression dataset with the number of samples, features and noise level you specify. 
Because the data is generated for you, you don’t have to collect it manually.\n\n- ### [Data Science Hack #14 HuggingFace Tokenization](./Code/av_hack.ipynb)\nTokenization is the primary task while building the vocabulary. \nHuggingFace created a library for tokenization which provides an implementation of today's most used tokenizers, with a focus on performance and versatility.\nKey features:\nUltra-fast: They can encode 1GB of text in ~20sec on a standard server's CPU\n\n\n- ### [Data Science Hack #15 Divide Continuous and categorical data](./Code/select_dtype.ipynb)\nYou can extract categorical and numeric features into separate DataFrames in just 1 line of code! \nThis can be done using the select_dtypes function.\n\n- ### [Data Science Hack #16 Pandas Profiling](./Code/pandas%20profiling.ipynb)\nDo you want to perform a quick data analysis on your DataFrame? \nYou can use pandas profiling to generate a profile report of your dataset in just 1 line of code!\n\n- ### [Data Science Hack #17 Formatting of DataFrame](./Code/melt().ipynb)\nConvert a wide-form DataFrame into a long-form DataFrame in just 1 line of code!\nIn pd.melt(), one or more columns are used as identifiers. To \"unmelt\" the data, use the pivot() function.\n\n- ### [Data Science Hack #18 Magic Function- %history](./Code/HoD_history.ipynb)\nDo you know how you can get the history of all the commands run inside your Jupyter notebook?\nUse %history, Jupyter notebook's built-in magic function! 
\nNote - Even if you have cut the cells in your notebook, %history will print those commands as well!\n\n- ### [Data Science Hack #19 Heatmap on pandas dataframe](./Code/Styling%20pandas.ipynb)\nCreate a heatmap on a pandas DataFrame using seaborn!\nIt helps you understand the complete range of values at a glance.\n\n- ### [Data Science Hack #20 Plot confusion matrix](./Code/plot_confusion_matrix.ipynb)\nScikit-learn's stable 0.22.1 release came with new features and bug fixes.\nOne new function is plot_confusion_matrix, which generates an extremely intuitive and customisable confusion matrix for your classifier. Note that scikit-learn 1.2 and later replace it with ConfusionMatrixDisplay.from_estimator.\nBonus tip: You can specify the format of the numbers appearing in the boxes using the values_format parameter ('n' for whole numbers, '.2f' for floats, etc.)\n\n- ### [Data Science Hack #21 Ipython Interactive shell](./Code/interactive_notebook.ipynb)\nWhat will be the output if you run the following commands in a single cell of your Jupyter notebook?\ndf.shape\ndf.head()\nOf course, it will be the first five rows of your DataFrame, because only the last expression in a cell is displayed. Can we get the output of both commands run in the same cell? \nYou can do it using InteractiveShell.\n\n- ### Data Science Hack #22 Python tqdm\nMost of you have heard about the tqdm library, and you might be using it to track the progress of long-running for loops. Most of the time, we write complex functions with nested for loops. tqdm allows tracking those too. Here is how you can track nested loops using tqdm in Python.\n\n- ### [Data Science Hack #23 Image Augmentation](./Code/Image%20Augmentation%20-%20Article%20Shoot.ipynb)\nDeep learning models usually require a lot of data for training. But acquiring massive amounts of data comes with its own challenges. Instead of spending days manually collecting data, you can make use of Image Augmentation techniques. It is the process of generating new images. 
These new images are generated using the existing training images and hence we don’t have to collect them manually.\n\n- ### Data Science Hack #24 Setup Dark Jupyter Notebook Theme\n[jupyter-themes](https://github.com/dunovank/jupyter-themes) provides an easy way to change the theme, fonts and much more in your Jupyter notebook. \n\nSteps - \n\n1. Install jupyter-themes -\n   - using anaconda \u003cp\u003e\u003ccode\u003econda install -c conda-forge jupyterthemes\u003c/code\u003e\u003c/p\u003e\n   - using pip \u003cp\u003e\u003ccode\u003epip install jupyterthemes\u003c/code\u003e\u003c/p\u003e\n2. Check the list of themes - \u003cp\u003e\u003ccode\u003ejt -l\u003c/code\u003e\u003c/p\u003e\n3. Select a theme \u003cp\u003e\u003ccode\u003ejt -t chesterish\u003c/code\u003e\u003c/p\u003e\n4. To restore the default theme - \u003cp\u003e\u003ccode\u003ejt -r\u003c/code\u003e\u003c/p\u003e \n\n- ### Data Science Hack #25 Change cell width in jupyter notebook\nTo do this, we use jupyter-themes, which provides an easy way to change the theme, fonts and much more in your Jupyter notebook.\n\nSteps -\n1. Install jupyter-themes -\n   - using anaconda \u003cp\u003e\u003ccode\u003econda install -c conda-forge jupyterthemes\u003c/code\u003e\u003c/p\u003e\n   - using pip \u003cp\u003e\u003ccode\u003epip install jupyterthemes\u003c/code\u003e\u003c/p\u003e\n\n2. Change the theme, cell width and line height \u003cp\u003e\u003ccode\u003ejt -t chesterish -cellw 100% -lineh 170\u003c/code\u003e\u003c/p\u003e\n\n- ### [Data Science Hack #26 Parse_dates in read_csv() to change data type to datetime](./Code/read_csv_ParseDate.ipynb)\nWhat do you do when you need to change the data type of a column to DateTime? 
We can do this directly at the time of reading the data using the parse_dates argument.\n\n- ### Data Science Hack #27 Share jupyter notebook using nbviewer\nYou can share your Jupyter notebook with non-programmers very easily, and the best way to do it is by using [jupyter nbviewer](https://nbviewer.jupyter.org/).\nPro tip - You can use [Binder](https://mybinder.org/) to execute the code from nbviewer in a hosted environment, without installing anything on your machine!\n\n- ### [Data Science Hack #28 Plotting Decision Tree](./Code/Decision%20Tree%20Plot.ipynb)\nDo you know how to plot a decision tree in just 1 line of code? \nSklearn provides a simple function, plot_tree(), to do this task. You can tweak its parameters as per your requirements.\n\n- ### [Data Science Hack #29 Invert Dictionary](./Code/invert_dictionary.ipynb)\nDo you know how you can invert a dictionary in Python?\nA dictionary is a changeable collection of key-value pairs indexed by key (since Python 3.7 it also preserves insertion order). It is widely used in day-to-day programming and machine learning tasks.\n\n- ### [Data Science Hack #30 Interactive plots using plotly](/Code/interactive%20plot%20-%20plotly.ipynb)\n[Cufflinks](https://plot.ly/python/v3/ipython-notebooks/cufflinks/) binds plotly directly to pandas DataFrames! Therefore you can make interactive charts without any hassle or long code.\n\n- ### [Data Science Hack #31 Write python file directly from jupyter notebook cell](./Code/write%20python%20script.ipynb)\nThis hack is about saving the contents of a cell to a .py file using the magic command %%writefile and then running the file in another Jupyter notebook using the magic command [%run](./Code/run%20python%20script.ipynb)\n\n- ### [Data Science Hack #32 Pretty-print Data structures](./Code/pretty%20print.ipynb)\nAre you getting confused while printing some of the data structures? Worry not, it is very common. 
\nThe [pretty-print](https://docs.python.org/3/library/pprint.html) module provides an easy way to print data structures in a visually pleasing way!\n\n- ### [Data Science Hack #33 Date Parser](./Code/Date%20Parser.ipynb)\nThis code allows you to convert dates in any format into a specified format. Many times, we receive dates in various formats in our data. This hack will help you convert all those formats into a single specified format.\n\n- ### [Data Science Hack #34 Feature Selection using SelectFromModel](./Code/FeatureSelection_SelectFromModel.ipynb)\nOne of the ways to perform feature selection is by using the feature_importances_ attribute of the base estimators. Using SelectFromModel, you can specify the estimator and the threshold for feature_importances_. This hack uses 'mean' as the threshold. You can tweak the threshold to get optimum results. To learn more, visit the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel)\n\n- ### [Data Science Hack #35 Convert Strings into Characters](./Code/convert_string_to_characters.ipynb)\nWhat could be the easiest way to convert a string to characters?\nHere is a simple hack which comes in handy while working with text data.\n \n- ### [Data Science Hack #36 Resize Image Size](./Code/Resizing%20images.ipynb)\nWhile building an image classification model using deep learning, all the images must be of the same size. However, as the data comes from different sources, images may have different shapes. So, to convert them to the same shape, we can use the resize function from OpenCV. This hack will help you convert images of any shape to a specified shape.\n\n- ### [Data Science Hack #37 Apply pandas in parallel](./Code/pandarellel.ipynb)\nDoes it take time to perform operations on your pandas DataFrame? 
[Pandarallel](https://github.com/nalepae/pandarallel) is a simple and efficient tool to parallelize pandas operations on all your available CPUs!\n\n- ### [Data Science Hack #38 Generator Expressions vs List comprehension](./Code/generator%20vs%20list.ipynb)\nA generator yields one item at a time and produces each item only on demand. Generators are therefore much more memory efficient than lists, which hold every item in memory at once. This hack compares generator expressions with list comprehensions.\n\n- ### Data Science Hack #39 Test your Regex\nDo you avoid regular expressions because they are hard to read and write, as well as tricky to get right? This hack helps you get your regex correct.\n[regex101](https://regex101.com/) is an online regex tester and debugger with highlighting for PHP, PCRE, Python, Golang and JavaScript.\n\n- ### [Data Science Hack #40 Convert List of Lists to List](./Code/list_of_lists_to_list.ipynb)\nSometimes the data can be in the form of a nested list. For example, the data can be date-wise transaction records for a particular product. However, you might need it in a single dimension only. This hack will help you flatten the list of lists into a single list.\n\n- ### [Data Science Hack #41 Hide Print Statements](./Code/hide_print.ipynb)\nWe often use print statements for debugging purposes. This hack will help you turn off print statements in a particular section of the code, which makes debugging easier.\n\n- ### [Data Science Hack #42 Split PDF Document page-wise](./Code/split_pdf_pages.ipynb)\nThis hack will help you split a single PDF document into its individual pages.\n\n- ### [Data Science Hack #43 Merge PDF Documents](./Code/merge_pdf.ipynb)\nThis hack will help you combine multiple PDF documents into a single document. 
This hack is the inverse of [Hack #42 Split PDF Document page-wise](#data-science-hack-42-split-pdf-document-page-wise)\n\n- ### [Data Science Hack #44 Create a Custom Image DataGenerator in Keras](./Code/CustomDataGen_Keras.ipynb)\nSometimes you need functionality which is not directly provided by Keras's ImageDataGenerator. You can easily create a wrapper around it to suit your needs. \n\n1. For example, suppose your use case is a multi-input deep learning model like this\n\n![](./Data/muti_input_nn.png)\n\n(i.e. a neural network which takes input from multiple data sources and trains on the combined data), and you want the data generator to handle the data preparation on the fly. You can create a wrapper around the ImageDataGenerator class to give the required output. [This notebook](./Code/CustomDataGen_Keras.ipynb) explains a simple solution to this use case. \n\n2. Another use case could be that you want to resize the images from a shape of, say, 150x150 to 224x224, which is generally used by pretrained models. You can customize the ImageDataGenerator without coding your own data generator from the ground up [(Example Notebook)](https://github.com/faizankshaikh/AV_Article_Codes/blob/master/Inception_From_Scratch/improvements/Inception_v1_from_Scratch.ipynb).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkunalj101%2Fdata-science-hacks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkunalj101%2Fdata-science-hacks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkunalj101%2Fdata-science-hacks/lists"}