Three approaches to topic discovery with Twitter data

By Prasanna Selvaraj, 30 November 2021

The businesses and brands we work with here at the Developer Platform consistently monitor Twitter via our APIs for a variety of reasons, in a variety of ways. From tracking the latest consumer trends and analyzing competitors to staying ahead of breaking news and responding to customer service requests, Twitter APIs are key for unlocking insights into real-time public conversations impacting business. 

Twitter is a treasure trove of data, but language is complex, and the journey to insights involves organizing, sorting, and filtering a massive number of Tweets. In this article, I will discuss three common approaches to organizing large volumes of Tweets through Topic analysis, a process that identifies and categorizes the underlying themes in Tweet text. I’ll also go over when it makes sense to use natural language processing (NLP) and custom machine learning models (CMLM) for topic and keyword extraction to power industry-specific use cases.

The purpose of this article is to introduce you to some common approaches to Topic discovery with Twitter data so that you can choose the approach that makes the most sense for your use case.   

Topic Discovery 

The first step in Topic analysis is Topic discovery (aka topic detection or entity extraction). The goal of this technique is to organize and understand large collections of Tweet text by assigning tags or categories according to each topic or theme in the Tweet text.

The typical use cases to discover topics from a large volume of Tweets are:

  • Trend analysis
  • Power alerts and recommendations
  • Enhance search and personalization
  • Gain insights (customer feedback, market research, competitive intelligence, etc.)
  • Issue detection (customer service/support issues)

Approach 1: Tweet Annotations

A turnkey solution for topic discovery with the Twitter API is Tweet annotations, which offer named entity recognition and context annotations. 

Twitter categorizes entities as “people,” “places,” “products,” “organizations,” or “other.” Entities are programmatically assigned based on what is explicitly mentioned in the Tweet text and delivered in the entity object within a Tweet payload.

A Tweet is labeled with context annotations when its text matches Twitter’s semantic classification for a topic. Twitter curates a list of keywords, hashtags, and @handles that are relevant to a given topic and uses it to assign context annotation labels. These labels are added based on semantic rules, as opposed to a machine learning approach in which a model is trained to classify text. Context annotations can be used to discover Tweets on topics that may previously have been difficult to surface.

Tweet Annotations Example

Let’s explore Tweet annotations for a set of Tweets specific to the customer experience domain and use annotations as filters to narrow down to specific Tweets of interest. The examples below use a set of 300 Tweets ingested into a database. We’ll start by filtering on entity annotations. The five entity types (image below) provided by Tweet annotations are:

  • Organization
  • Product
  • Person
  • Place
  • Other

The image above shows how the Tweet dataset is distributed across entity annotation types. Twitter programmatically identifies entities from explicit mentions in the Tweet text and enriches the Tweet object by assigning each entity to one of the categories above.

For example, filtering a Tweet dataset by the entity type “Organization” yields the names of the organizations that are mentioned. So if you want to find Tweets that mention a competitor, you can filter on this entity annotation type.
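As a rough sketch of this kind of filter, the snippet below assumes Tweets shaped like Twitter API v2 payloads, where entity annotations sit under `entities.annotations` and carry `type` and `normalized_text` fields. The sample Tweets, the company name "Acme", and the helper `tweets_with_entity` are made up for illustration.

```python
# Illustrative Tweets shaped like Twitter API v2 payloads; the data is made up.
SAMPLE_TWEETS = [
    {
        "id": "1",
        "text": "Acme delivered my order late again",
        "entities": {"annotations": [
            {"type": "Organization", "normalized_text": "Acme", "probability": 0.91},
        ]},
    },
    {
        "id": "2",
        "text": "Great weekend in Lisbon",
        "entities": {"annotations": [
            {"type": "Place", "normalized_text": "Lisbon", "probability": 0.88},
        ]},
    },
    {"id": "3", "text": "No annotations on this one"},
]


def tweets_with_entity(tweets, entity_type, name=None):
    """Return Tweets carrying an annotation of entity_type (and name, if given)."""
    matches = []
    for tweet in tweets:
        # Tweets without annotations simply have no "entities" key.
        annotations = tweet.get("entities", {}).get("annotations", [])
        for annotation in annotations:
            if annotation.get("type") != entity_type:
                continue
            if name and annotation.get("normalized_text", "").lower() != name.lower():
                continue
            matches.append(tweet)
            break  # one matching annotation is enough; avoid duplicates
    return matches


if __name__ == "__main__":
    for tweet in tweets_with_entity(SAMPLE_TWEETS, "Organization", name="acme"):
        print(tweet["id"], tweet["text"])
```

Passing only the entity type returns every Tweet that mentions any organization; adding a name narrows the result to a single competitor.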

Here’s an example of organization as an entity type with its corresponding entity type pairings:

Entity annotations offer a quick and easy way to categorize large volumes of Tweets without any need for third-party entity extraction libraries or APIs. They give a macro-level view of how Tweets are spread across entity types such as “Place”, “Product”, “Organization” and “Person”.
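One way to get that macro-level view is to count annotations per entity type. This is a minimal sketch that again assumes the v2 payload shape (`entities.annotations` with a `type` field); the sample data and the function name `entity_type_distribution` are hypothetical.

```python
from collections import Counter

# Made-up Tweets, trimmed to the fields this example needs.
TWEETS = [
    {"id": "1", "entities": {"annotations": [
        {"type": "Organization", "normalized_text": "Acme"},
        {"type": "Product", "normalized_text": "Acme Phone"},
    ]}},
    {"id": "2", "entities": {"annotations": [
        {"type": "Organization", "normalized_text": "Globex"},
    ]}},
    {"id": "3"},  # a Tweet with no entity annotations
]


def entity_type_distribution(tweets):
    """Count how many annotations of each entity type appear in the dataset."""
    counts = Counter()
    for tweet in tweets:
        for annotation in tweet.get("entities", {}).get("annotations", []):
            if "type" in annotation:
                counts[annotation["type"]] += 1
    return counts


if __name__ == "__main__":
    for entity_type, count in entity_type_distribution(TWEETS).most_common():
        print(f"{entity_type}: {count}")
```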

Similarly, context annotations are delivered in the context_annotations field of the payload. They are semantically classified based on the Tweet text and take the form of domain and entity pairings. A Tweet can carry zero, one, or many such pairings. Currently, Twitter uses a list of 50+ domains.

Unlike entity annotations, context annotations are added based on the semantics of the Tweet text and surface domain and entity pairings. For instance, filtering on “Sports” as a context annotation domain yields the corresponding context entities, which are, of course, specific sports. A pairing such as the “Boxing” entity under the “Sports” domain can be used to zero in on Tweets that are relevant to boxing.

Filtering on “Sports” as a context type and “Boxing” as context name surfaced the following Tweet.
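A filter like this could be sketched as follows. The snippet assumes the v2 payload shape, in which each entry of `context_annotations` holds a `domain` object and an `entity` object, each with a `name`. The Tweets are invented, the domain and entity names are taken from the example above, and `filter_by_context` is a hypothetical helper.

```python
# Made-up Tweets carrying context annotations as domain/entity pairings.
TWEETS = [
    {"id": "10", "text": "What a fight last night!", "context_annotations": [
        {"domain": {"name": "Sports"}, "entity": {"name": "Boxing"}},
    ]},
    {"id": "11", "text": "Match point!", "context_annotations": [
        {"domain": {"name": "Sports"}, "entity": {"name": "Tennis"}},
    ]},
    {"id": "12", "text": "Just a regular Tweet"},  # zero pairings
]


def filter_by_context(tweets, domain, entity=None):
    """Return Tweets with a context pairing in `domain` (and `entity`, if given)."""
    matches = []
    for tweet in tweets:
        for pairing in tweet.get("context_annotations", []):
            if pairing["domain"]["name"] != domain:
                continue
            if entity is not None and pairing["entity"]["name"] != entity:
                continue
            matches.append(tweet)
            break  # stop at the first matching pairing
    return matches


if __name__ == "__main__":
    for tweet in filter_by_context(TWEETS, "Sports", "Boxing"):
        print(tweet["id"], tweet["text"])
```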

Tweet annotations (context and entity annotations) offer a quick path to topic discovery and entity extraction because the annotations are available within the Tweet payload as enrichments. Without a third-party integration or a custom machine learning model, Tweet annotations enable quick wins for trend detection, search, and personalization use cases. Tweet annotations can also drive personalization: intersecting a Tweet’s entity / context pairings with a user’s interests and preferences identifies the Tweets most relevant to that user.
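To make the intersection idea concrete, here is a minimal sketch. It assumes user interests are stored as a set of (domain, entity) name pairs, which is my own modeling choice rather than anything the Twitter API provides. The Tweets and the helpers `annotation_pairs` and `rank_for_user` are hypothetical.

```python
# Made-up Tweets with context annotations in the v2 payload shape.
TWEETS = [
    {"id": "a", "context_annotations": [
        {"domain": {"name": "Sports"}, "entity": {"name": "Boxing"}},
    ]},
    {"id": "b", "context_annotations": [
        {"domain": {"name": "Sports"}, "entity": {"name": "Boxing"}},
        {"domain": {"name": "Sports"}, "entity": {"name": "Tennis"}},
    ]},
    {"id": "c", "context_annotations": [
        {"domain": {"name": "Music"}, "entity": {"name": "Jazz"}},
    ]},
]


def annotation_pairs(tweet):
    """Collect (domain, entity) name pairs from a Tweet's context annotations."""
    return {
        (item["domain"]["name"], item["entity"]["name"])
        for item in tweet.get("context_annotations", [])
    }


def rank_for_user(tweets, user_interests):
    """Keep Tweets that share a pairing with the user, most overlap first."""
    scored = []
    for tweet in tweets:
        overlap = annotation_pairs(tweet) & user_interests
        if overlap:
            scored.append((len(overlap), tweet))
    # sorted() is stable, so ties keep their original order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored]


if __name__ == "__main__":
    interests = {("Sports", "Boxing"), ("Sports", "Tennis")}
    print([tweet["id"] for tweet in rank_for_user(TWEETS, interests)])
```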

Tweet annotations currently do not offer sentiment analysis, but let’s explore the possibilities of topic extraction along with sentiment analysis.

Approach 2: NLP for topic / entity identification

NLP fits best for topic / entity extraction coupled with sentiment analysis. Sentiment analysis is an important factor when it comes to product and brand recognition, customer loyalty, customer satisfaction, advertising, and product acceptance. Sentiment analysis provides the ability to quickly understand consumer attitudes and react accordingly.

NLP analyzes a Tweet’s text in several layers, such as sentiment analysis, entity analysis, entity sentiment analysis, and content classification. Together, these layers provide not only topic / entity extraction but also metrics such as sentiment and resonance that can be leveraged for advanced analytics.

For example, this image illustrates a list of negative and positive topics/entities along with sentiment score, relevance and salience as metrics for a Tweet dataset processed with NLP. With these metrics you could select top entities for subsequent analysis.

Sentiment metrics can surface key insights that may not be available otherwise. For example, some use cases require monitoring underrated or neutral topics, that is, topics that are neither strongly positive nor strongly negative. In this case, sentiment analysis provides a score range, and a neutral band of scores can be selected to surface Tweets that represent an underrated topic.
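Picking a neutral band might look like the sketch below. It assumes each topic already has a sentiment score between -1.0 (most negative) and 1.0 (most positive); the actual range depends on the NLP engine you use. The topics, scores, band width of 0.25, and the function `neutral_topics` are all illustrative.

```python
# Made-up topic scores; assumed range is -1.0 (negative) to 1.0 (positive).
TOPIC_SCORES = {
    "delivery": -0.8,
    "packaging": 0.1,
    "app update": -0.05,
    "support": 0.9,
}


def neutral_topics(topic_scores, band=0.25):
    """Return topics whose score lies within +/- band of zero, most neutral first."""
    neutral = [topic for topic, score in topic_scores.items() if abs(score) <= band]
    return sorted(neutral, key=lambda topic: abs(topic_scores[topic]))


if __name__ == "__main__":
    print(neutral_topics(TOPIC_SCORES))
```

Widening or narrowing `band` trades recall for precision, so it is worth tuning against a sample of Tweets you have reviewed by hand.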

The image below illustrates a heatmap of topics based on sentiment scores, where the topics from positive to negative are shaded from darker to lighter colors. The idea is that these topics can be fed into a business process for subsequent analysis based on the Tweets matching a particular topic and their sentiment scores. For example, Tweets with negative sentiment can be fed into a customer service tool that auto-creates cases for customer representatives to act on. Similarly, Tweets with the highest positive sentiment can be surfaced to marketing for brand promotion and campaigns.
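The routing step can be sketched as a simple threshold rule. This assumes each Tweet already carries a `sentiment` score between -1.0 and 1.0 from whichever NLP engine you use. The cutoffs, the queue names, and the `route_by_sentiment` function are placeholders, not part of any real customer service or marketing tool.

```python
# Made-up Tweets with a precomputed sentiment score (assumed range -1.0 to 1.0).
TWEETS = [
    {"id": "1", "text": "My order never arrived", "sentiment": -0.9},
    {"id": "2", "text": "Love the new release!", "sentiment": 0.8},
    {"id": "3", "text": "Order arrived today", "sentiment": 0.1},
]


def route_by_sentiment(tweets, negative_cutoff=-0.5, positive_cutoff=0.5):
    """Split Tweets into queues for customer service, marketing, or neither."""
    queues = {"customer_service": [], "marketing": [], "unrouted": []}
    for tweet in tweets:
        score = tweet["sentiment"]
        if score <= negative_cutoff:
            queues["customer_service"].append(tweet)  # candidate for a support case
        elif score >= positive_cutoff:
            queues["marketing"].append(tweet)  # candidate for brand promotion
        else:
            queues["unrouted"].append(tweet)
    return queues


if __name__ == "__main__":
    for queue, items in route_by_sentiment(TWEETS).items():
        print(queue, [tweet["id"] for tweet in items])
```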

This image illustrates the analysis of entity and keyword sentiment scores:

I've selected the Tweet below based on the topic “Thanks bot orders,” which the NLP engine picked as the top entity based on its sentiment score. This example illustrates how topic discovery with NLP, coupled with sentiment analysis, can surface Tweets / conversations that may not be discoverable with other methods. This type of insight can be significant for businesses looking to improve their customer experience.

Gaining new insight is an important outcome of both market research and competitive analysis.

While Tweet annotations and NLP are great tools for topic extraction, they cannot classify Tweets according to domain- or industry-specific needs. This is an area where custom machine learning models (CMLM) can be leveraged to classify Tweets for a specific industry problem.

Approach 3: Custom Machine Learning Models

If you want to classify Tweets into predefined or domain-specific categories, custom machine learning models (CMLM) can be leveraged. CMLM is best suited for tagging Tweets with domain-specific categories or labels. For instance, a customer experience management tool can use a CMLM to categorize Tweets as order issues, order tracking, or product returns. Additionally, Tweets that cannot be bucketed under a specific category can be tagged as noisy / spammy Tweets.

The image below is based on a tool which leverages CMLM, where the model predicts whether a Tweet is an “order issue” or “order tracking” type.

A CMLM does not output a single category; instead, it assigns a probability score to each of the predefined categories. A single Tweet analyzed by a CMLM therefore yields a combination of categories with probability scores. A high probability score for a specific category may be an indicator that the Tweet belongs in that category.
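Turning those scores into a label can be sketched as follows. The snippet assumes the model returns a mapping from category name to probability; the category names come from the examples in this article, while the scores, the 0.7 threshold, and the `pick_category` helper are illustrative. Falling back to “noisy” when no category is confident enough is one possible policy, not the only one.

```python
def pick_category(scores, threshold=0.7, fallback="noisy"):
    """Return (label, probability) for the top category, or the fallback label
    when even the top probability is below the threshold."""
    if not scores:
        return fallback, 0.0
    label, probability = max(scores.items(), key=lambda item: item[1])
    if probability >= threshold:
        return label, probability
    return fallback, probability


if __name__ == "__main__":
    # Made-up model output for two Tweets.
    confident = {"order issue": 0.82, "order tracking": 0.11, "product returns": 0.07}
    ambiguous = {"order issue": 0.40, "order tracking": 0.35, "product returns": 0.25}
    print(pick_category(confident))
    print(pick_category(ambiguous))
```

The threshold is a business decision as much as a technical one: a higher value sends more Tweets to the fallback bucket but reduces misclassified cases.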

This image shows categories “product returns” and “noisy” Tweets based on a high probability score:

CMLMs identify context when classifying Tweets, which is crucial as the problem domain expands. For example, a consumer packaged goods (CPG) CMLM could detect new products and brands without developers having to author new rules. This is significant in advanced research areas where predefined problem sets do not exist. While CMLM isn’t a silver bullet, it can be effective if the use case involves advanced research or repetitive manual classification of Tweets at high volume.

TL;DR: Determining the best approach 

If you’ve made it this far, you should have a general understanding of the three approaches to detect and discover topics of interest from a Tweet dataset. Remember, for use cases such as trend analysis and issue detection, it’s best to start with small experiments. 

Here’s a short recap and some resources to dive deeper into each approach:

Approach 1: Tweet Annotations

Best for: Quick wins with a low barrier to entry for Topic discovery

Learn more: 

Approach 2: NLP

Best for: Categorizing Tweets with sentiment analysis

Want to learn more about discovering topics using NLP with sentiment analysis? Check out this demo:

Approach 3: CMLM 

Best for: Specific problem domains 

If you want to leverage a custom machine learning model to classify Tweets with domain-based tagging, check out this demo: