tweetGPT - what Twitter is saying about ChatGPT
Introduction
Background and Motivation
ChatGPT, the OpenAI language model, has been a trending topic in recent months following the release of its latest version and its success in passing the bar exam, AP exams, sommelier exams and the SAT, among others (https://www.businessinsider.com/list-here-are-the-exams-chatgpt-has-passed-so-far-2023-1?r=US&IR=T). ChatGPT has been used to generate essays, song lyrics, emails and even CVs. Despite these positive attributes, ChatGPT, along with other generative models like DALL-E, has received mixed reactions from the public, with concerns about security, copyright infringement, inaccuracies and the potential to perpetuate biases and spread misinformation.
It is against this backdrop that I undertook this project to analyze how Twitter users are talking about ChatGPT.
About the Data
The data for this project was obtained from Kaggle. It is a collection of tweets with the hashtag #chatgpt posted between December 2022 and April 2023, containing discussions about the ChatGPT language model, experiences with using the model and requests for help with usage issues.
Methodology
I cleaned the data using the `dplyr` and `tidytext` packages. The biggest challenge with the data was that, being tweets, it naturally contained emojis, which R treats as Unicode characters. I had to remove them using the `stringi` package.
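As a rough illustration of the emoji-removal step, the sketch below strips emoji with a `stringi` regular expression. The sample tweets, the helper name `remove_emoji` and the exact character classes are assumptions for illustration, not the code used in this project. Note that the Unicode "Symbol, other" category also covers a few non-emoji symbols such as the degree sign.

```r
library(stringi)

# Hypothetical sample tweets; the real data comes from the Kaggle CSV.
tweets <- c("ChatGPT wrote my essay \U0001F600\U0001F525",
            "Is #chatgpt down? \U0001F914")

# Hypothetical helper: drop characters in the Unicode "Symbol, other" (So)
# category, plus skin-tone modifiers, the variation selector and the
# zero-width joiner that often accompany emoji sequences.
remove_emoji <- function(x) {
  x <- stri_replace_all_regex(
    x, "[\\p{So}\\x{1F3FB}-\\x{1F3FF}\\x{FE0F}\\x{200D}]", "")
  # Collapse any whitespace left behind and trim both ends.
  stri_trim_both(stri_replace_all_regex(x, "\\s+", " "))
}

clean <- remove_emoji(tweets)
print(clean)
```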
Analysis
User Analysis
I began by analyzing the user data that I had. The plots below show some of the basic characteristics of the people who tweeted about ChatGPT. From the plots, we can see that only a small proportion of the users were verified. The number of followers ranged from 0 to over 17 million, friends ranged from 0 to over 1 million and favourites ranged from 0 to nearly 1.5 million. In every case the median is far below the mean, so a handful of very large accounts sit alongside many small ones. Overall we can tell that there is a very diverse group of users talking about ChatGPT.
Variable | Min | Max | Mean | Median |
---|---|---|---|---|
followers | 0 | 17728427 | 20387 | 345 |
friends | 0 | 1172077 | 2107 | 366 |
favourites | 0 | 1460610 | 10356 | 1189 |
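A summary table of this shape can be produced with a few lines of base R. This is a minimal sketch: the data frame `users`, its column names and its values are made up for illustration and may not match the Kaggle file.

```r
# Hypothetical user-level data; real column names may differ.
users <- data.frame(
  followers  = c(0, 120, 345, 5000),
  friends    = c(10, 366, 400, 900),
  favourites = c(0, 1189, 2500, 40000)
)

# Hypothetical helper returning the four statistics shown in the table.
summarise_var <- function(x) {
  c(Min = min(x), Max = max(x), Mean = mean(x), Median = median(x))
}

# sapply() returns statistics in rows; transpose so each variable is a row.
user_summary <- t(sapply(users, summarise_var))
print(user_summary)
```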
I also wanted to analyze the common words that showed up in the bios of the people who tweeted about ChatGPT. The wordcloud below illustrates the most common words that appeared, filtered to omit words with a frequency of less than 1000. For this and subsequent wordclouds, the following words were omitted since they are frequent but not informative for analysing the sentiment of the discourse: “https”, “t.co”, “openai”, “ai”, “artificialintelligence”, “http”, “gpt”, “gpt4”, “nannannannannew”, “t”, “chatgpt”, “amp”.
Bio wordcloud
From the wordcloud, we can see that some career terms appear to be popular such as engineer, developer, author, founder, creator and artist. Some terms related to finance appear to be popular too, with words such as crypto, fintech, blockchain and bitcoin showing up. In general, it appears that the ChatGPT conversation draws from a wide range of users.
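The word counts behind a wordcloud like this one can be sketched with `dplyr` and `tidytext`. The sample bios, the column name `description`, the abbreviated custom word list and the low threshold are assumptions for illustration. Removing the standard English `stop_words` is also an assumption, since the write-up only lists the custom terms.

```r
library(dplyr)
library(tidytext)

# Hypothetical bios; `description` is an assumed column name.
bios <- tibble(description = c(
  "AI engineer and founder",
  "Engineer. Crypto developer, chatgpt fan",
  "Author, artist and engineer"
))

# Custom terms to omit (abbreviated from the list given earlier).
custom_stop <- tibble(word = c("https", "t.co", "openai", "ai", "chatgpt", "amp"))

min_freq <- 2  # the write-up uses 1000 on the full dataset

word_counts <- bios %>%
  unnest_tokens(word, description) %>%     # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # assumed: drop common English words
  anti_join(custom_stop, by = "word") %>%  # drop the custom terms
  count(word, sort = TRUE) %>%
  filter(n >= min_freq)

print(word_counts)
```

The resulting `word` and `n` columns are the usual input for a wordcloud plotting function.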
Hashtag wordcloud
I was also interested in seeing what hashtags were used in conjunction with #chatgpt, as these could add more context to the tweets. The following wordcloud shows the result, filtered to omit hashtags with a frequency of less than 100.
From the wordcloud, we can see that some of the common hashtags associated with #chatgpt in this period were #bard, #midjourney, #bing and #dalle, showing that the conversations also reference other models. Tags like #crypto, #nft, #stocks and #generativeai demonstrate the anticipated and applied use of AI in art and finance.
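One way to count co-occurring hashtags is to pull them out of the tweet text with `stringi`. This is a sketch under the assumption that hashtags are extracted from the raw text. The dataset may already provide them in a separate column, and the sample tweets here are made up.

```r
library(stringi)

# Hypothetical tweets for illustration.
tweets <- c("Testing #ChatGPT vs #Bard",
            "#chatgpt and #midjourney for art",
            "No tags here")

# Extract every hashtag; tweets without one contribute nothing.
tags <- unlist(stri_extract_all_regex(tweets, "#\\w+", omit_no_match = TRUE))

# Lowercase so #ChatGPT and #chatgpt match, then drop #chatgpt itself.
tags <- tolower(tags)
tags <- tags[tags != "#chatgpt"]

tag_counts <- sort(table(tags), decreasing = TRUE)
print(tag_counts)
```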
Tweet analysis
Next, I wanted to analyze the tweets themselves. I created the following wordcloud from the tweets, filtered to omit frequencies less than 1000.
Tweet wordcloud
From the wordcloud, we can see that the terms chatbot, answer, bing and bard were among the most common words in the tweets. Elon Musk also made an appearance.
Sentiment Analysis
I was also interested in analyzing the sentiments carried by the tweets around #chatgpt. Using the `afinn` lexicon, each tweet was given a score. The `afinn` scores range from minus five (negative) to plus five (positive) for a word, so the scores were calculated per word and then summed over the tweet. The tweets were then classified into negative, positive and neutral (zero) using the `nrc` lexicon.
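The per-word scoring and per-tweet summing described above can be sketched as follows. Several things here are assumptions: the sample tweets, the `id` and `text` column names, and the tiny stand-in lexicon, whose values are illustrative rather than copied from AFINN. As a simplification, the sketch labels each tweet by the sign of its summed score and does not use the second lexicon mentioned above.

```r
library(dplyr)
library(tidytext)

# Hypothetical tweets; `id` and `text` are assumed column names.
tweets <- tibble(
  id = 1:3,
  text = c("ChatGPT is amazing, I love it",
           "terrible answers, bad bot",
           "asked chatgpt about the weather")
)

# Stand-in lexicon with illustrative values. In practice this would be
# tidytext::get_sentiments("afinn"), which has `word` and `value` columns.
lexicon <- tibble(word  = c("amazing", "love", "terrible", "bad"),
                  value = c(4, 3, -3, -3))

tweet_scores <- tweets %>%
  unnest_tokens(word, text) %>%                     # one lowercase word per row
  left_join(lexicon, by = "word") %>%               # unmatched words get NA
  group_by(id) %>%
  summarise(score = sum(value, na.rm = TRUE)) %>%   # sum word scores per tweet
  mutate(sentiment = case_when(score > 0 ~ "positive",
                               score < 0 ~ "negative",
                               TRUE      ~ "neutral"))

print(tweet_scores)
```

Counting the `sentiment` column then gives the numbers for a bar plot of positive, negative and neutral tweets.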
From the bar plot above, we can see that the majority of tweets were positive overall, with a small fraction being negative and a slightly smaller fraction being neutral. It is important to note that one drawback of the afinn lexicon is that it does not contextualise words but scores each word in isolation, so there is some bias in the way it may classify tweet sentiments. Even though this was partially rectified by summing the word scores within a tweet, the lack of contextualisation should be factored in when considering the general outlook of the tweets between December 2022 and April 2023.
Time analysis
I also decided to investigate how the volume of tweets with #chatgpt changed per day in this time period. The interactive time series graph below shows the tweet volume trend.
From the graph, we can see that there was a gradual decline in mentions of ChatGPT between December 2022 and January 2023, followed by a steady increase until February. From then on, the data alternates between peaks and valleys, with the highest traffic generated on February 7, coinciding with Microsoft’s announcement of its ChatGPT-powered version of Bing. The sharpest dip occurred between the 19th and 21st of February. Ironically, in February, ChatGPT attracted over 1 billion visits. Perhaps people were using the model more than they were tweeting about it.
To look at cumulative tweets alongside tweets per month, I constructed the plot below. The plot clearly shows that the most content was generated in February after a steady increase since January, with a gradual descent in March and into April.
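The daily, monthly and cumulative counts behind plots like these can be computed in base R. This is a minimal sketch with made-up timestamps. It assumes each tweet has a creation timestamp and that UTC is the intended time zone.

```r
# Hypothetical timestamps, one per tweet.
created <- as.POSIXct(c("2023-02-06 10:00:00", "2023-02-07 09:30:00",
                        "2023-02-07 18:45:00", "2023-03-01 12:00:00"),
                      tz = "UTC")

# Truncate each timestamp to its calendar day.
day <- as.Date(created, tz = "UTC")

# Tweets per day, with a running total.
daily <- as.data.frame(table(day), stringsAsFactors = FALSE)
names(daily) <- c("day", "tweets")
daily$day <- as.Date(daily$day)
daily$cumulative <- cumsum(daily$tweets)

# Tweets per month, keyed by "YYYY-MM".
monthly <- table(format(day, "%Y-%m"))

# Day with the highest volume.
peak_day <- daily$day[which.max(daily$tweets)]

print(daily)
print(monthly)
```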
Conclusion
The world is yet to truly understand the impact that advanced language models like ChatGPT will have on society in sectors such as education, art & design and publishing. A few things are clear: this technology is here to stay, and it is only going to get better. Conversations like these, carried on social media sites between novices and experts, proponents and opponents, and creators and consumers, give us a small insight into the way that the technology is being received, understood and used.
Future work
Initially, I had intended to use the Twitter API to collect tweets about ChatGPT in real time, as Twitter discourse is dynamic and incessant. In this way, my analysis would always capture the current state of the conversation on ChatGPT. At the time of publication, my access application for the API had not yet been approved by Twitter.
The Kaggle dataset used in this analysis is updated daily; however, it is quite tedious to download the file and place it in the server repository every day.
References
Dataset
ChatGPT - the tweets. Kaggle, 2023. https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets.