Welcome to twitstat’s documentation!


About Twitstat

Twitstat is a simple web application that analyses Twitter data to surface insights into trending hashtags and topics. It clusters and charts tweets so that trends around the world are easier to understand.

Twitstat is split into the following modules.

Scraping Module

Twitstat uses Tweepy, a Python client library for the Twitter API, to collect the tweets for analysis. Tweepy first fetches the trending topics for a specified geographical location; these topics are then passed to the API's search method. The search method returns the tweets matching each query, along with related information such as each tweet's likes and retweets.
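
The two-step flow described above (trends first, then one search per trend) can be sketched roughly as follows. This is a minimal illustration rather than Twitstat's actual code: the function name `fetch_trending_tweets` and its parameters are hypothetical, and the method names `get_place_trends` and `search_tweets` assume Tweepy 4.x (older Tweepy releases call them `trends_place` and `search`).

```python
def fetch_trending_tweets(api, woeid, topics_limit=5, tweets_per_topic=50):
    """Return {topic name: [tweet dicts]} for the trends at one location.

    `api` is assumed to be an authenticated tweepy.API instance and `woeid`
    the Where On Earth ID of the location (1 means worldwide).
    """
    # get_place_trends returns a one-element list whose dict holds "trends".
    trends = api.get_place_trends(woeid)[0]["trends"][:topics_limit]

    results = {}
    for trend in trends:
        name = trend["name"]
        # Each trending topic becomes one search query.
        statuses = api.search_tweets(q=name, count=tweets_per_topic)
        results[name] = [
            {
                "text": status.text,
                "likes": status.favorite_count,
                "retweets": status.retweet_count,
            }
            for status in statuses
        ]
    return results
```

Because the function only relies on those two methods, it can be exercised with a stub object in place of a real Tweepy client, which keeps API credentials out of tests.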

Analysis Module

Twitstat uses three major modules to facilitate its data analysis:

  • Preprocessing module

  • Clustering module

  • Sentiment analysis module

Preprocessing Module

Before data can be loaded into any of the analyser functions, it has to be preprocessed, or cleaned. The preprocessing module removes unwanted text such as emoticons, line breaks and punctuation from the tweets. Certain common words (called stop-words) are also removed, as they add little meaning to the text. Finally, each tweet is tokenized (split into individual words) and the words are stemmed (reduced to their root form). These tasks are done with the help of NLTK's algorithms.
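
As a rough sketch of such a cleaning pipeline, the snippet below strips non-ASCII symbols and punctuation, drops stop-words, and stems what remains with NLTK's `PorterStemmer`. The function name `preprocess` and the tiny `STOP_WORDS` set are illustrative assumptions; the choice of the Porter stemmer is also an assumption, since the exact NLTK components Twitstat uses are not stated here.

```python
import string

from nltk.stem import PorterStemmer

# Tiny stand-in stop-word list for illustration only. The fuller option is
# nltk.corpus.stopwords.words("english"), which needs a one-time
# nltk.download("stopwords").
STOP_WORDS = {"a", "an", "and", "are", "is", "it", "of", "the", "to"}

# Map every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

_stemmer = PorterStemmer()


def preprocess(tweet):
    """Clean one tweet and return its list of stemmed tokens."""
    text = tweet.lower()
    # Drop emoji and every other non-ASCII character (this also drops
    # accented letters, which is a simplification).
    text = text.encode("ascii", "ignore").decode()
    # Remove punctuation, which also removes text emoticons such as ":)".
    text = text.translate(_PUNCT_TO_SPACE)
    # split() tokenizes on any whitespace, so line breaks disappear here.
    tokens = text.split()
    return [_stemmer.stem(token) for token in tokens if token not in STOP_WORDS]
```

Calling `preprocess` on each fetched tweet yields token lists that can be joined back into strings before vectorization.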

Clustering Module

Twitstat’s clustering module uses scikit-learn’s DBSCAN clustering algorithm to cluster tweets falling under the trending categories. Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed. Points that lie in sparse regions are classified as outliers.
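
To make the idea concrete, here is a minimal sketch of running DBSCAN over tweet text. How Twitstat turns tweets into points is not described here, so the TF-IDF vectorization, the cosine metric, and the `eps` and `min_samples` values are all assumptions chosen for illustration; `cluster_tweets` is a hypothetical name.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_tweets(tweets, eps=0.7, min_samples=2):
    """Return one cluster label per tweet; -1 marks an outlier."""
    # Represent each tweet as a TF-IDF vector (an assumed representation).
    vectors = TfidfVectorizer().fit_transform(tweets)
    # Two tweets are neighbours when their cosine distance is at most eps.
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return model.fit_predict(vectors).tolist()


tweets = [
    "world cup final tonight",
    "world cup final goal",
    "new phone launch today",
    "new phone launch price",
    "banana",
]
labels = cluster_tweets(tweets)
```

In this toy input the two football tweets share most of their words, as do the two phone tweets, so each pair forms a dense group, while the unrelated last tweet has no neighbours and is labelled `-1`.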

Sentiment Analysis Module

Finally, after the tweets have been split into clusters, the most popular tweet of each cluster is identified. These popular tweets are then fed into TextBlob’s sentiment analysis module, which decides the tone (positive, negative or neutral) of each tweet.

Future Iterations

Twitter + Statistics = Amazing information!

That is why we want to keep improving. Future iterations of Twitstat will include (but are not limited to):

  • A new and improved clustering algorithm to cluster data with higher fidelity

  • Better insights into the data through geo-located tweets and heat maps

  • Gists of each modelled topic for a quick look at what is most talked about in real time

Contributors

Made with love by Aditya Raman and Garima Singh!
