Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/raksh710/whatsapp_chat_analysis

I analyzed the chats in a WhatsApp group of mine (my friends and me), looking at word count and message count per person per year. Briefly, this is how I proceeded:

1) Exported the chat data from WhatsApp (excluding media files) and used Excel to turn it into a CSV file. The delimiters were semicolon, tab, and colon (the colon for one column only).
2) Imported the data into a Jupyter notebook as a pandas DataFrame. At this point the DataFrame resembled a sparse matrix.
3) The data initially contained junk values, so I had to clean it. I started by removing everything that was not alphanumeric, which also removed all the emojis. To do this, I wrote a function and ran it across the DataFrame; it replaced the non-alphanumeric characters with a pipe (|) symbol, and I later filtered the pipe symbols out of the DataFrame.
4) In total there were a little over 6,100 rows (samples) and 36 columns, all of them containing text.
5) These columns were largely filled with N/A values. I removed every column with more than 95% N/A values, since those could be treated as junk.
6) Converted the date column to datetime and extracted the month and year.
7) After all this I was left with a large text column, a year column, a month column, and a contact info column.
8) Calculated the length of each entry in the text column to find how many letters or alphanumeric characters were used in each message.
9) Grouped the DataFrame by various parameters such as Contact and Year.
10) Made various plots to represent the data for each category.

NOTE: You can refer to the code to understand the entire process in detail.
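The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the sample rows, the column names (`Date`, `Contact`, `Text`, `Junk`), the `DD/MM/YYYY` date format, and the helper `clean_text` are all assumptions made for the example. It also assumes that whitespace is kept during cleaning so that words can still be counted.

```python
import re
import pandas as pd

# Hypothetical sample standing in for the exported chat CSV.
raw = pd.DataFrame({
    "Date": ["12/01/2019", "03/05/2019", "07/02/2020", "21/11/2020"],
    "Contact": ["Asha", "Ravi", "Asha", "Ravi"],
    "Text": ["Hello there 😀", "See you at 5!", "ok", "Happy birthday 🎉🎉"],
    "Junk": [None, None, None, None],  # an almost-empty column
})

def clean_text(text: str) -> str:
    """Replace non-alphanumeric characters with '|' and then drop the pipes.

    Whitespace is preserved here (an assumption) so word counts still work.
    """
    marked = re.sub(r"[^A-Za-z0-9\s]", "|", text)
    return marked.replace("|", "").strip()

# Step 5: drop columns where more than 95% of the values are N/A.
df = raw.loc[:, raw.isna().mean() <= 0.95].copy()

# Step 3: clean the text column (this also strips emojis).
df["Text"] = df["Text"].map(clean_text)

# Step 6: parse dates and extract year and month.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

# Step 8: per-message character length, plus a word count.
df["Length"] = df["Text"].str.len()
df["Words"] = df["Text"].str.split().str.len()

# Step 9: message and word counts per contact per year.
summary = (
    df.groupby(["Contact", "Year"])
      .agg(messages=("Text", "count"), words=("Words", "sum"), chars=("Length", "sum"))
      .reset_index()
)
print(summary)
```

For step 10, `summary` could be pivoted (for example with `summary.pivot(index="Year", columns="Contact", values="messages")`) and passed to a plotting library; plotting is left out here to keep the sketch dependency-light.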


