tweetGPT - what Twitter is saying about ChatGPT

Introduction

Background and Motivation

ChatGPT, the OpenAI language model, has been a trending topic in recent months following the release of its latest version and its success in passing the bar exam, AP exams, sommelier exams and the SAT, among others (https://www.businessinsider.com/list-here-are-the-exams-chatgpt-has-passed-so-far-2023-1?r=US&IR=T). ChatGPT has been used to generate essays, song lyrics, emails and even CVs. Despite these successes, ChatGPT, along with other generative AI models like DALL-E, has received mixed reactions from the public, with concerns about security, copyright infringement, inaccuracies and the potential to perpetuate biases and spread misinformation.

It is against this backdrop that I undertook this project to analyze how Twitter users are talking about ChatGPT.

About the Data

The data for this project was obtained from Kaggle. It is a collection of tweets with the hashtag #chatgpt posted between December 2022 and April 2023, covering discussions of the ChatGPT language model, experiences of using the model and requests for help with usage issues.

Methodology

I cleaned the data using the dplyr and tidytext packages. The biggest challenge with the data was that, being tweets, it naturally contained emojis, which R treats as Unicode characters. I removed them using the stringi package.
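The emoji-removal step can be illustrated with a short sketch. It is written in Python with the standard library only and is not the stringi code used in this project. The function name `strip_emojis` and the Unicode ranges are assumptions chosen for illustration, and the ranges cover the most common emoji blocks rather than the full Unicode emoji set.

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (used for flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related symbol blocks
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner (glues emoji sequences)
    "]+"
)


def strip_emojis(text: str) -> str:
    """Replace runs of emoji with a space, then collapse repeated whitespace."""
    without_emojis = EMOJI_PATTERN.sub(" ", text)
    return " ".join(without_emojis.split())


if __name__ == "__main__":
    raw = "ChatGPT is amazing \U0001F916\U0001F525 #chatgpt"
    print(strip_emojis(raw))  # ChatGPT is amazing #chatgpt
```

Replacing emoji with a space rather than deleting them keeps words on either side from being fused together when an emoji sits between them without spaces.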

Analysis

User Analysis

I began by analyzing the user data. The plots below show some basic characteristics of the people who tweeted about ChatGPT. From the plots, we can see that only a small proportion of the users were verified. The number of followers ranged from 0 to over 17 million, friends from 0 to over 1 million and favourites from 0 to nearly 1.5 million. Overall, we can tell that a very diverse group of users is talking about ChatGPT.

Summary Statistics
Variable	Max	Mean	Median
followers	17728427	20387	345
friends	1172077	2107	366
favourites	1460610	10356	1189

I also wanted to analyze the common words that showed up in the bios of the people who tweeted about ChatGPT. The wordcloud below illustrates the most common words, filtered to omit words with a frequency of less than 1000. For this and subsequent wordclouds, the following words were omitted since they are frequent but not informative for analysing the sentiment of the discourse: “https”, “t.co”, “openai”, “ai”, “artificialintelligence”, “http”, “gpt”, “gpt4”, “nannannannannew”, “t”, “chatgpt”, “amp”.
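The counting behind a wordcloud like this can be sketched as follows. This is a minimal Python illustration, not the tidytext code used in the project. The function name `word_frequencies`, the tokenizer pattern and the sample bios are hypothetical. Only the custom omitted words listed above are filtered here, so ordinary English stop words would still need separate handling.

```python
import re
from collections import Counter

# The custom words omitted from the wordclouds, as listed in the text.
CUSTOM_STOP_WORDS = frozenset({
    "https", "t.co", "openai", "ai", "artificialintelligence", "http",
    "gpt", "gpt4", "nannannannannew", "t", "chatgpt", "amp",
})

# Lowercase alphanumeric tokens; internal dots or apostrophes are kept so "t.co" stays whole.
TOKEN = re.compile(r"[a-z0-9]+(?:[.'][a-z0-9]+)*")


def word_frequencies(texts, min_count=1000, stop_words=CUSTOM_STOP_WORDS):
    """Count words across all texts and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN.findall(text.lower())
        counts.update(tok for tok in tokens if tok not in stop_words)
    return {word: n for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical bios, for illustration only.
    bios = [
        "AI engineer and founder",
        "Software engineer | crypto",
        "Founder. Engineer. https://t.co/xyz",
    ]
    print(word_frequencies(bios, min_count=2))  # {'engineer': 3, 'founder': 2}
```

The threshold is a parameter so that the same function can serve the bio and tweet wordclouds (1000) and the hashtag wordcloud (100).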

Bio wordcloud

From the wordcloud, we can see that some career terms are popular, such as engineer, developer, author, founder, creator and artist. Finance-related terms are popular too, with words such as crypto, fintech, blockchain and bitcoin showing up. In general, it appears that the ChatGPT conversation draws from a wide range of users.

Hashtag wordcloud

I was also interested in seeing what hashtags were used in conjunction with #chatgpt, as these could add more context to the tweets. The following wordcloud shows the result, filtered to omit hashtags with a frequency of less than 100.

From the wordcloud, we can see that some of the common hashtags associated with #chatgpt in this period were #bard, #midjourney, #bing and #dalle, showing that the conversations also reference other models. Tags like #crypto, #nft, #stocks and #generativeai demonstrate the anticipated and applied use of AI in art and finance.
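Counting the hashtags that appear alongside #chatgpt can be sketched as below. This is an illustrative Python version, not the code used in the project. It assumes the hashtags are extracted from the tweet text itself. The function name `cooccurring_hashtags` and the sample tweets are hypothetical.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def cooccurring_hashtags(tweets, anchor="chatgpt", min_count=100):
    """Count hashtags that share a tweet with the anchor hashtag.

    Each hashtag is counted once per tweet, case-insensitively, and the
    anchor itself is excluded from the result.
    """
    counts = Counter()
    for tweet in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        if anchor in tags:
            counts.update(tags - {anchor})
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Hypothetical tweets, for illustration only.
    sample = [
        "#ChatGPT vs #Bard",
        "Trying #chatgpt and #bard with #midjourney",
        "#bard only",
    ]
    print(cooccurring_hashtags(sample, min_count=2))  # [('bard', 2)]
```

The third sample tweet is ignored because it does not contain the anchor hashtag, which is why #bard is counted twice rather than three times.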

Tweet analysis

Next, I wanted to analyze the tweets themselves. I created the following wordcloud from the tweets, filtered to omit frequencies less than 1000.

Tweet wordcloud

From the wordcloud, we can see that the terms chatbot, answer, bing and bard were among the most common words in the tweets. Elon Musk also made an appearance.

Sentiment Analysis

I was also interested in analyzing the sentiments carried by the tweets around #chatgpt. Using the AFINN lexicon, each tweet was given a score. AFINN scores range from minus five (negative) to plus five (positive) for a word, so the scores were calculated per word and then summed over the tweet. The tweets were then classified as negative, positive or neutral (zero) using the NRC lexicon.
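The per-word scoring and summing described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the tidytext code used in the project. `TOY_LEXICON` is a hypothetical handful of words with placeholder scores in the AFINN range, not values copied from the lexicon. Labelling a tweet by the sign of its summed score is also an assumption about how the three classes relate to the score.

```python
import re

# Hypothetical mini-lexicon with placeholder scores between -5 and +5.
TOY_LEXICON = {
    "amazing": 4,
    "love": 3,
    "good": 3,
    "wrong": -2,
    "bad": -3,
    "terrible": -3,
}


def tweet_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in a tweet; unknown words score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def classify(score):
    """Label a summed score by its sign (assumed mapping)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in ["I love ChatGPT, it is amazing", "Just tried #chatgpt", "not good"]:
        score = tweet_score(tweet)
        print(tweet, score, classify(score))
```

The last example, "not good", scores +3 because "not" carries no score of its own. That is the context limitation of word-level lexicons in miniature.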

From the bar plot above, we can see that the majority of tweets were positive overall, with a small fraction being negative and a slightly smaller fraction being neutral. It is important to note that one drawback of the AFINN lexicon is that it does not contextualise words but scores each word in isolation, so a phrase such as “not good” still contributes the positive score of “good”, and there is some bias in the way tweet sentiments are classified. Even though this was partially mitigated by summing the word scores in each tweet, the lack of contextualization should be factored in when considering the general outlook of the tweets between December 2022 and April 2023.

Time analysis

I also decided to investigate how the volume of tweets with #chatgpt changed per day in this time period. The interactive time series graph below shows the tweet volume trend.

From the graph, we can see that there was a gradual decline in mentions of ChatGPT between December 2022 and January 2023, followed by a steady increase until February. From then on, the data alternates between peaks and valleys, with the highest traffic generated on February 7, coinciding with Microsoft’s announcement of its ChatGPT-powered version of Bing. The sharpest dip occurred between the 19th and 21st of February. Ironically, in February, ChatGPT attracted over 1 billion visits. Perhaps people were using the model more than they were tweeting about it.

To look at cumulative tweets alongside tweets per month, I constructed the plot below. The plot clearly shows that the most content was generated in February after a steady increase since January, with a gradual decline in March and into April.
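The daily, monthly and cumulative counts behind these plots can be derived from the tweet timestamps alone. The sketch below is an illustrative Python version, not the code used in the project. It assumes ISO-style timestamp strings such as "2023-02-07 18:30:00". The function names and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime
from itertools import accumulate


def daily_counts(timestamps):
    """Map each calendar day to its number of tweets, in date order."""
    days = Counter(datetime.fromisoformat(ts).date() for ts in timestamps)
    return dict(sorted(days.items()))


def monthly_and_cumulative(daily):
    """Return (month, tweets in month, running total) rows in month order."""
    monthly = Counter()
    for day, n in daily.items():
        monthly[day.strftime("%Y-%m")] += n
    months = sorted(monthly)
    totals = [monthly[month] for month in months]
    return list(zip(months, totals, accumulate(totals)))


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    stamps = [
        "2023-01-30 09:00:00",
        "2023-02-07 10:15:00",
        "2023-02-07 18:30:00",
        "2023-03-01 08:00:00",
    ]
    daily = daily_counts(stamps)
    busiest = max(daily, key=daily.get)
    print(busiest.isoformat(), daily[busiest])  # 2023-02-07 2
    print(monthly_and_cumulative(daily))
```

Keeping the daily counts as the base table means the monthly and cumulative views are simple aggregations of the same data, so the plots cannot drift out of step with each other.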

Conclusion

The world is yet to truly understand the impact that advanced language models like ChatGPT will have on society in sectors such as education, art and design, and publishing. A few things are clear: this technology is here to stay and it is only going to get better. Conversations like these, carried on social media sites between novices and experts, proponents and opponents, and creators and consumers, give us a small insight into the way that the technology is being received, understood and used.

Future work

Initially, I had intended to use the Twitter API to collect tweets about ChatGPT in real time, as Twitter discourse is dynamic and incessant. In this way, my analysis would always capture the current state of the conversation on ChatGPT. At the time of publication, my access application for the API had not yet been approved by Twitter.

The Kaggle dataset used in this analysis is updated daily; however, it is quite tedious to download the file and place it in the server repository every day.

References

Dataset

ChatGPT - the tweets. Kaggle, 2023. https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets.

R Packages

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2022. Rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.

Fraley, Chris, Adrian E. Raftery, and Luca Scrucca. 2022. Mclust: Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation. https://mclust-org.github.io/mclust/.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Hester, Jim, and Jennifer Bryan. 2022. Glue: Interpreted String Literals. https://CRAN.R-project.org/package=glue.

Huntington-Klein, Nick. 2023. Vtable: Variable Table for Variable Documentation. https://nickch-k.github.io/vtable/.

Hvitfeldt, Emil. 2022. Textdata: Download and Load Various Text Datasets. https://github.com/EmilHvitfeldt/textdata.

Lang, Dawei. 2023. Wordcloud2: Create Word Cloud by htmlWidget. https://github.com/lchiffon/wordcloud2.

Müller, Kirill, and Hadley Wickham. 2022. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

———. 2022. Sf: Simple Features for R. https://CRAN.R-project.org/package=sf.

Pedersen, Thomas Lin. 2022. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.

R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Robinson, David, Alex Hayes, and Simon Couch. 2022. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.

Robinson, David, and Julia Silge. 2022. Tidytext: Text Mining Using Dplyr, Ggplot2, and Other Tidy Tools. https://github.com/juliasilge/tidytext.

Ryan, Jeffrey A., and Joshua M. Ulrich. 2022. Xts: eXtensible Time Series. https://github.com/joshuaulrich/xts.

Scrucca, Luca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery. 2016. “mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models.” The R Journal 8 (1): 289–317. https://doi.org/10.32614/RJ-2016-021.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC. https://plotly-r.com.

Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2022. Plotly: Create Interactive Web Graphics via Plotly.js. https://CRAN.R-project.org/package=plotly.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). https://doi.org/10.21105/joss.00037.

Slowikowski, Kamil. 2021. Ggrepel: Automatically Position Non-Overlapping Text Labels with Ggplot2. https://github.com/slowkow/ggrepel.

Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2021. Lubridate: Make Dealing with Dates a Little Easier. https://CRAN.R-project.org/package=lubridate.

Vanderkam, Dan, JJ Allaire, Jonathan Owen, Daniel Gromer, and Benoit Thieurmel. 2018. Dygraphs: Interface to Dygraphs Interactive Time Series Charting Library. https://github.com/rstudio/dygraphs.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. https://www.jstatsoft.org/v40/i01/.

———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2022a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.

———. 2022b. Plyr: Tools for Splitting, Applying and Combining Data. https://CRAN.R-project.org/package=plyr.

———. 2022c. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

———. 2022d. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2022. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Maximilian Girlich. 2022. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2022. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.

Wickham, Hadley, and Dana Seidel. 2022. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.

———. 2022. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.

Zeileis, Achim, and Gabor Grothendieck. 2005. “Zoo: S3 Infrastructure for Regular and Irregular Time Series.” Journal of Statistical Software 14 (6): 1–27. https://doi.org/10.18637/jss.v014.i06.

Zeileis, Achim, Gabor Grothendieck, and Jeffrey A. Ryan. 2022. Zoo: S3 Infrastructure for Regular and Irregular Time Series (z’s Ordered Observations). https://zoo.R-Forge.R-project.org/.

Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.