This toolkit consists of five core components: a streamer service that listens for real-time Tweets, a topic/queue that the Tweets are pushed to, a CRON job that triggers a Tweet loader service, which pulls the Tweets from the topic/queue and stores them in a database, and, finally, a dashboard that visualizes the Tweets by querying the database with SQL.
Tweet streamer service (Node.js component)
The Tweet streamer service listens to the real-time Twitter PowerTrack API and pushes incoming Tweets temporarily to a topic based on Google PubSub. The PowerTrack rules that filter the stream are managed via the PowerTrack Rules API.
Stream topic based on Google PubSub
The stream topic based on Google PubSub acts as a shock absorber for this architecture. When there is a sudden surge of Tweets, the toolkit can handle the increased volume with the help of PubSub: the topic temporarily stores the Tweets and batches them for the Tweet loader service, which then writes them to the database. This also shields the database from a huge number of individual ingestion calls.
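The drain step can be sketched as a simple batching helper (the function name is illustrative; the 500-message batch size mirrors the loader's configurable pull size):

```javascript
// Split a backlog of buffered messages into fixed-size batches, mirroring
// how the loader drains the PubSub topic. The batch size is a placeholder
// here; the toolkit makes it configurable via config.js.
function toBatches(messages, batchSize = 500) {
  const batches = [];
  for (let i = 0; i < messages.length; i += batchSize) {
    batches.push(messages.slice(i, i + batchSize));
  }
  return batches;
}
```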
CRON job based on Google Cloud Scheduler
A CRON job based on Google Cloud Scheduler acts as a poller, triggering the Tweet loader service at regular intervals.
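Such a job could be created with a gcloud invocation along these lines (the job name, target URI, and five-minute schedule are placeholders, not values prescribed by the toolkit):

```shell
# Hypothetical example: hit the Tweet loader's HTTP endpoint every 5 minutes.
gcloud scheduler jobs create http tweet-loader-trigger \
  --schedule="*/5 * * * *" \
  --uri="https://example.com/tweet-loader" \
  --http-method=POST
```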
Tweet loader service (Node.js component)
The Tweet loader service, triggered by the CRON job, pulls the Tweets in batch mode (500 Tweets per pull, configurable via the config.js file) and stores them in a BigQuery database.
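The core of that step is turning pulled messages into table rows. A sketch, assuming base64-encoded payloads as returned by the PubSub REST pull API (the row field names are illustrative; the real schema lives in the toolkit's BigQuery table):

```javascript
// Decode a batch of pulled PubSub messages into row objects ready for a
// BigQuery streaming insert. Field names are placeholders.
function messagesToRows(receivedMessages) {
  return receivedMessages.map(({ message }) => {
    const tweet = JSON.parse(Buffer.from(message.data, 'base64').toString());
    return {
      id: tweet.id,
      text: tweet.text,
      created_at: tweet.created_at,
    };
  });
}
```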
Google Data Studio as a dashboard for analytics
Google Data Studio is used to create a dashboard for trend detection; it connects to BigQuery via a SQL query that takes a time interval as a parameter. The time interval is a range (e.g. 30 minutes, 60 minutes) used to query Tweets by their creation time, so trends can be analyzed over windows ranging from minutes to hours. For example, you can analyze trends from “60 minutes ago” by passing the time interval variable to the SQL query.
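The parameterized query could look like the following sketch, built here as a string in Node.js (the project, dataset, table, and column names are placeholders; `TIMESTAMP_SUB` is standard BigQuery SQL):

```javascript
// Build the trend-detection query for a given look-back window in minutes.
// Table and column names are hypothetical stand-ins for the toolkit's schema.
function buildTrendQuery(intervalMinutes) {
  return `
    SELECT text, created_at
    FROM \`my_project.tweets.tweets\`
    WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL ${intervalMinutes} MINUTE)
    ORDER BY created_at DESC`;
}
```

Passing 60 as the interval yields the “60 minutes ago” view from the example above.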
As a user of this toolkit, you need to perform four steps:
1. Add rules to the stream with the PowerTrack Rules API endpoint
2. Install and integrate the toolkit from GitHub in your Google Cloud project
3. Configure the CRON job in Google Cloud Scheduler
4. Configure the dashboard by connecting Data Studio to the BigQuery database