Developer Guide

Twitter API Toolkit for Google Cloud: Filtered Stream


By Prasanna Selvaraj

Detecting trends from Twitter requires listening to real-time Twitter APIs and processing Tweets on the fly. And while trend detection can be complex work, to categorize trends, Tweet themes and topics must also be identified—another potentially complex endeavor as it involves integrating with NER (Named Entity Recognition) / NLP (Natural Language Processing) services.

 

The Twitter API Toolkit for Google Cloud: Filtered Stream solves these challenges and supports the developer with a trend detection framework that can be installed on Google Cloud in 60 minutes or less.

Why use the Twitter API toolkit for Google Cloud: Filtered Stream?

  • Provides a framework for detecting macro- and micro-level trends across domains and industry verticals

  • Designed to scale horizontally, processing Tweet volumes on the order of millions of Tweets per day

  • Automates the data pipeline process to ingest Tweets into Google Cloud

  • Visualization of trends in an easy-to-use dashboard

How much time will this take? 60 mins is all you need

In 60 minutes, or less, you’ll learn the basics of the Twitter API and Tweet annotations—plus, you’ll gain experience in Google Cloud, Analytics, and the foundations of data science.

 

Which Cloud services does this toolkit leverage, and what are the costs?

  • This toolkit requires a Twitter API account; sign up for free Essential or Elevated access today to get started. Essential access allows 500K Tweets/month and Elevated access allows 2 million Tweets/month
  • This toolkit leverages Google BigQuery, App Engine, and DataStudio. For information on pricing, refer to the Google Cloud pricing page

What kind of dashboard can you build with the toolkit?

  • Below are a few real-time dashboard illustrations built with the Filtered Stream toolkit.

  • Fig 1 depicts a real-time ‘Gaming’ dashboard of video game conversations on Twitter. You can get insights on trending topics in gaming, popular hashtags, and the underlying Tweets streamed in real time.

  • Similarly, Fig 2 depicts a real-time analysis of the ‘Dogecoin’ cryptocurrency.

  • With the toolkit, you can build a real-time trend detection dashboard and monitor trends for configured rules as they unfold on the platform. The dashboards below are examples built with the toolkit, illustrating real-time trends in gaming.

Give me the big picture

This toolkit comprises five core components: a service that listens to real-time Tweets; a topic/queue that Tweets are pushed to; a CRON job that triggers a Tweet loader service, which pulls the Tweets from the topic/queue and stores them in a database; and, finally, a dashboard that visualizes the Tweets by querying the database with SQL.

 

  • Tweet streamer service (Node.js component)

The Tweet streamer service listens to the real-time Twitter Filtered Stream API and pushes the Tweets temporarily to a topic based on Google PubSub. The Filtered Stream rules are governed by the Filtered Stream rules API.
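To make the streamer's role concrete, the sketch below shows one way an incoming chunk from the Filtered Stream connection can be parsed. This is an illustrative assumption, not the toolkit's actual code: the stream delivers newline-delimited JSON, and blank keep-alive lines (heartbeats) carry no Tweet data.

```javascript
// Illustrative sketch of the streamer's chunk handling (the real toolkit code
// may differ). The Filtered Stream delivers newline-delimited JSON; blank
// lines are keep-alive heartbeats and carry no Tweet data. A production
// implementation must also buffer partial lines split across chunks.
function parseStreamChunk(chunk) {
  return chunk
    .toString('utf8')
    .split('\r\n')
    .filter((line) => line.trim().length > 0) // drop heartbeat lines
    .map((line) => JSON.parse(line));
}

// One chunk containing a single Tweet followed by a heartbeat.
const tweets = parseStreamChunk(
  Buffer.from('{"data":{"id":"1","text":"doge to the moon"}}\r\n\r\n')
);
```

Each parsed Tweet object would then be published to the PubSub topic described next.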

 

  • Stream topic based on Google PubSub

The stream topic based on Google PubSub acts as a shock absorber for this architecture. When there is a sudden surge of Tweets, the toolkit can handle the increased volume with the help of PubSub. The PubSub topic acts as temporary storage for the Tweets, batching them for the Tweet loader service to store in the database. It also acts as a shield, protecting the database from a huge number of ingestion calls.

 

  • CRON job based on Google Cloud Scheduler

A CRON job based on Google Cloud Scheduler will act as a poller that will trigger the Tweet loader service at regular intervals.

 

  • Tweet loader service (Node.js component)

The Tweet loader service, triggered by the CRON job, pulls the Tweets in batch mode (25 Tweets per pull by default, configurable via the config.js file) and stores them in a BigQuery database.
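The loader's batching step can be sketched as below. The field names and helper functions here are illustrative assumptions, not the toolkit's actual schema: each pulled PubSub message is mapped to a flat BigQuery row, and rows are grouped into batches of 25.

```javascript
// Sketch of the loader's batching step (illustrative; the real toolkit's
// schema and function names may differ).
const BATCH_SIZE = 25; // mirrors the batch size configurable in config.js

// Map one pulled PubSub message to a flat BigQuery row.
function toRow(message) {
  const tweet = JSON.parse(message.data.toString('utf8')).data;
  return {
    id: tweet.id,
    text: tweet.text,
    created_at: tweet.created_at,
    tweet_url: 'https://twitter.com/i/status/' + tweet.id,
  };
}

// Group pulled messages into insert batches of BATCH_SIZE rows each.
function toBatches(messages) {
  const batches = [];
  for (let i = 0; i < messages.length; i += BATCH_SIZE) {
    batches.push(messages.slice(i, i + BATCH_SIZE).map(toRow));
  }
  return batches;
}

// Example: 30 pulled messages produce two batches (25 + 5 rows).
const batches = toBatches(
  Array.from({ length: 30 }, (_, i) => ({
    data: Buffer.from(
      JSON.stringify({ data: { id: String(i), text: 'hi', created_at: '2022-02-17T00:00:00Z' } })
    ),
  }))
);
```

Batching the inserts this way keeps the number of BigQuery write calls small even when Tweet volume spikes.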

 

  • Google DataStudio as a Dashboard for analytics

Google DataStudio is the dashboard for trend detection and connects to BigQuery via a SQL query that takes a time interval as a parameter. Trends can be analyzed over intervals ranging from minutes to hours. For example, you can analyze trends from “60 minutes ago” by passing the time interval variable to the SQL query.
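The same parameterized query can also be run programmatically against BigQuery. Below is a minimal sketch of building the query options the @google-cloud/bigquery Node.js client accepts; the table name and helper function are illustrative assumptions, not part of the toolkit.

```javascript
// Minimal sketch (not part of the toolkit): build query options with the time
// interval as a named parameter. The table name below is a placeholder.
function buildTrendQuery(table, minutesAgo) {
  return {
    query:
      'SELECT * FROM `' + table + '` ' +
      'WHERE created_at > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL @time_interval MINUTE)',
    params: { time_interval: minutesAgo },
  };
}

const q = buildTrendQuery('myDataset.tweets', 60); // "trends from 60 minutes ago"
// To run it: const [rows] = await new BigQuery().query(q);
```

Named parameters like `@time_interval` keep the interval out of the SQL string itself, which is the same mechanism DataStudio uses for its time-interval control.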

 

As a user of this toolkit, you need to perform four steps:

  1. Add rules to the stream with the Filtered Stream rules API endpoint

  2. Install and configure the toolkit from GitHub in your Google Cloud project

  3. Configure the CRON job - Google Cloud Scheduler

  4. Configure the dashboard, by connecting to the BigQuery database with DataStudio

Prerequisites: As a developer, what do I need to run this toolkit?

How should I use this toolkit? - Tutorial

Step One: Add rules to the stream

 

  1. Add rules to the stream. Let’s listen to Tweets related to “DogeCoin”; however, we only want a random 10% sample of those Tweets. This can be accomplished with the following request:

 

      curl -X POST 'https://api.twitter.com/2/tweets/search/stream/rules' -H "Content-type: application/json" -H "Authorization: Bearer <<YOUR_BEARER_TOKEN>>" -d  '{  "add": [ { "value" : "(doge) sample:10"}] }'
    
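The same payload can be built programmatically. The sketch below uses a hypothetical helper (not part of the toolkit) and also shows the optional `tag` field, which Twitter echoes back with every Tweet that matches the rule:

```javascript
// Hypothetical helper (not part of the toolkit) that builds the add-rules
// payload for POST /2/tweets/search/stream/rules. The optional tag is echoed
// back with each matching Tweet, which helps when running several rules.
function buildAddRulesPayload(rules) {
  return {
    add: rules.map(({ value, tag }) => (tag ? { value, tag } : { value })),
  };
}

const payload = buildAddRulesPayload([
  { value: '(doge) sample:10', tag: 'dogecoin-10pct' },
]);
console.log(JSON.stringify(payload));
// {"add":[{"value":"(doge) sample:10","tag":"dogecoin-10pct"}]}
```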

2. Validate the rules

      curl https://api.twitter.com/2/tweets/search/stream/rules -H "Content-type: application/json" -H "Authorization: Bearer <<YOUR_BEARER_TOKEN>>"
    

3. You should get an output like the one below if no rules were previously added. If you have previously added rules, they will also be returned here; ensure you delete them.

      {
        "data": [
          {
            "id": "1494395695620575239",
            "value": "doge"
          }
        ],
        "meta": {
          "sent": "2022-02-17T19:46:16.150Z",
          "result_count": 1
        }
      }
    
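If the response does contain previously added rules, they can be removed by POSTing a delete payload with their IDs to the same /rules endpoint. The helper below is an illustrative sketch, not part of the toolkit:

```javascript
// Illustrative sketch: collect the rule IDs from a GET /rules response and
// build the delete payload accepted by POST /2/tweets/search/stream/rules.
function buildDeletePayload(rulesResponse) {
  const ids = (rulesResponse.data || []).map((rule) => rule.id);
  return ids.length ? { delete: { ids } } : null; // null when nothing to delete
}

const del = buildDeletePayload({
  data: [{ id: '1494395695620575239', value: 'doge' }],
});
// POST `del` as the request body to the same /rules endpoint.
```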

Step Two: Install and configure the toolkit (Tweet Streamer and Loader service)

  1. GitHub repo

  2. Access the Google Cloud console and launch the “Cloud Shell”. Ensure you are on the right Google Cloud Project

  3. Set the Google Project ID and enable the BigQuery API by running the following commands:

      gcloud config set project <<PROJECT_ID>>
      gcloud services enable bigquery.googleapis.com

    

4. Ensure you have the BigQuery Data Owner role. Navigate to the Google Cloud Console -> Choose IAM under the main menu -> Add a new IAM permission

Principal: Your Google account email address

Role: BigQuery Data Owner

5. From the “Cloud Shell” terminal command prompt, download the code for the toolkit by executing the command:

      git clone https://github.com/twitterdev/gcloud-toolkit-filtered-stream.git
    

6. Navigate to the source code folder

      cd gcloud-toolkit-filtered-stream
    

7. Make changes to the configuration file using your favorite editor, like vi or emacs.
Once you’ve made the following changes, save them and quit the editor.

      vi config.js

Edit line #5 in config.js by inserting the Twitter API bearer token (ensure the word ‘Bearer’ is prepended before the token, followed by a space).

Edit line #19 in config.js by inserting the Google Cloud project ID.
    

8. Back in the Cloud Shell, deploy the code to App Engine by executing the command below. Note that the deployment can take a few minutes.

      gcloud app deploy

If prompted:

  • Authorize the command
  • Choose a region for deployment (e.g., 18 for us-east1)
  • Accept the default config with Y

    

9. After the deployment, get the URL endpoint for the deployed application with the command:

      gcloud app browse -s default

The above command will output an app endpoint URL similar to this one:

      https://trend-detection-dot.uc.r.appspot.com/

    

10. Use the endpoint URL from the output of the previous step and make a curl or browser request with “/stream” appended to the request path. This will invoke the toolkit, and it will start listening to the Tweets as defined by the rules in the “stream/rules” endpoint.

      curl https://<<APP_ENDPOINT_URL>>/stream
    

11. Start tailing the log file for the deployed application

      gcloud app logs tail -s default
    

12. If you don’t see messages like “Received Tweet” or “~~ Heartbeat Payload ~~” in the logs console, the stream may be disconnected. To reconnect to the stream, make a curl/browser request as below:

      curl https://<<APP_ENDPOINT_URL>>/stream
    
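When reconnecting, it is good practice to back off between attempts rather than hitting the endpoint repeatedly. A simple exponential backoff schedule might look like the sketch below; the delays are illustrative, not the toolkit's actual values.

```javascript
// Illustrative backoff schedule (not the toolkit's actual values): double the
// delay on every failed reconnect attempt, capped at one minute.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// attempts 0, 1, 2, 3 -> 1000, 2000, 4000, 8000 ms
const delays = [0, 1, 2, 3].map((a) => backoffDelayMs(a));
```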

Step Three: Configure the CRON job - Google Cloud Scheduler

  1. Create a Google Cloud Scheduler Job by navigating to the Google Cloud console and clicking the Cloud Scheduler option under the main menu.

  2. Create a new Cloud Scheduler job and define the schedule with the frequency below. Ensure a space between each asterisk, like “* * * * *”.

This will ensure that the Cloud Scheduler job triggers every minute.

 

      * * * * *
    

3. Configure the execution with “Target Type” as “HTTP” and insert your application endpoint URL as below:

      https://<<YOUR_APP_ENDPOINT_URL>>/stream/poll/2/30000
    

The request path “/stream/poll” points to the “Tweet loader” service. The parameters 2 and 30000 refer to the invocation frequency of the Tweet loader service and the delay between invocations in milliseconds. For example, “2/30000” will invoke the “Tweet loader” service 2 times within a minute, with a delay of 30000 milliseconds (30 seconds). If you anticipate more Tweets for a topic, increase the invocation frequency and decrease the delay to increase consumption. This calibration can be fine-tuned based on monitoring of a specific topic like “Crypto” or “Dogecoin”.
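The relationship between the two path parameters can be sketched as follows; `parsePollPath` is a hypothetical helper for illustration only, showing that frequency times delay should fit within the scheduler's one-minute trigger window:

```javascript
// Hypothetical helper (for illustration only): parse the poll path and check
// that the full polling cycle fits in Cloud Scheduler's one-minute window.
function parsePollPath(path) {
  const parts = path.split('/').filter(Boolean); // ['stream', 'poll', '2', '30000']
  const frequency = Number(parts[2]); // invocations per scheduler trigger
  const delayMs = Number(parts[3]);   // delay between invocations, in ms
  return { frequency, delayMs, fitsInMinute: frequency * delayMs <= 60000 };
}

const cfg = parsePollPath('/stream/poll/2/30000');
// 2 invocations x 30000 ms = 60000 ms, which fits in the one-minute window.
```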

Step four: Configure the Trends dashboard with Google DataStudio

  1. Use the SQL query below for trend detection in DataStudio.

  2. Replace <<datasetId.table_name>> in the SQL below with your own. It should look something like “sixth-hawk.tweets”.

      SELECT
        context.entity.name AS ENTITY_NAME,
        context.domain.name AS DOMAIN_NAME,
        context.domain.id AS C_ID,
        entity.normalized_text AS ENTITY_TEXT,
        entity.type AS ENTITY_TYPE,
        COUNT(*) AS MENTIONS,
        TRENDS.text AS TWEET_TXT,
        TRENDS.tweet_url AS TWEET_URL,
        TRENDS.public_metrics.like_count AS likes,
        TRENDS.public_metrics.quote_count AS quotes,
        TRENDS.public_metrics.reply_count AS replies,
        TRENDS.public_metrics.retweet_count AS retweets
      FROM
        `<<datasetId.table_name>>` AS TRENDS,
        UNNEST(context_annotations) AS context,
        UNNEST(entities.annotations) AS entity
      WHERE created_at > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL @time_interval MINUTE)
      GROUP BY
        ENTITY_NAME, DOMAIN_NAME, ENTITY_TEXT, ENTITY_TYPE, C_ID, TWEET_TXT, TWEET_URL, likes, quotes, replies, retweets
      ORDER BY
        MENTIONS DESC
    

Step Five: Twitter Compliance

It is crucial that any developer who stores Twitter content offline ensures the data reflects user intent and the current state of content on Twitter. For example, when someone on Twitter deletes a Tweet or their account, protects their Tweets, or scrubs the geoinformation from their Tweets, it is critical for both Twitter and our developers to honor that person’s expectations and intent. The batch compliance endpoints provide developers an easy tool to help maintain Twitter data in compliance with the Twitter Developer Agreement and Policy.

Optional - Delete the Google Cloud project to avoid any overage costs

      gcloud projects delete <<PROJECT_ID>>
    

Troubleshooting

Use this forum for support, issues, and questions.


What's next?

Process, analyze, and visualize Tweets with the Twitter API Toolkit for Google Cloud: Recent Search

Read this guide on Post-processing Twitter data with the Google Cloud Platform to incorporate the search query with a user interface, advanced analytics, and integration with natural language processing

You also might be interested in this blog post on Topic discovery with Twitter data