Consuming streaming data — Twitter Developers


Overview

PowerTrack, Volume (e.g. Decahose), and Replay streams use the streaming HTTP protocol to deliver data through an open streaming API connection. Rather than delivering data in batches through repeated requests by your client app, as might be expected from a REST API, a single connection is opened between your app and the API, with new results sent through that connection whenever new matches occur. The result is a low-latency delivery mechanism that can support very high throughput.
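As a rough illustration of this interaction, the Python sketch below opens a single long-lived connection and yields activities as they arrive. The URL, headers, and the injectable `open_fn` are illustrative placeholders, not part of any official client library; activity boundaries are the \r\n delimiters discussed later in this page.

```python
import urllib.request

def stream_activities(url, headers=None, open_fn=urllib.request.urlopen,
                      read_timeout=90):
    """Open one long-lived HTTP connection and yield activities as they
    arrive. `open_fn` is injectable so the parsing logic can be exercised
    without a live network connection."""
    req = urllib.request.Request(url, headers=headers or {})
    resp = open_fn(req, timeout=read_timeout)
    buf = b""
    while True:
        chunk = resp.read(8192)       # blocks until data arrives
        if not chunk:                 # zero-byte read: server closed stream
            break
        buf += chunk
        while b"\r\n" in buf:
            line, buf = buf.split(b"\r\n", 1)
            if line:                  # empty lines are keep-alive heartbeats
                yield line
```

In a real client the reconnect and back-off logic described later in this page would wrap this loop.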

When a problem occurs with a connection or a request, the API describes it with a JSON error message before closing the connection. For example:

  {"error":{"message":"Forced Disconnect: Too many connections. (Allowed Connections = 2)","sent":"2017-01-11T18:12:52+00:00"}}
  {"error":{"message":"Invalid date format for query parameter 'fromDate'. Expected format is 'yyyyMMddHHmm'. For example, '201701012315' for January 1st, 11:15 pm 2017 UTC.\n\n","sent":"2017-01-11T17:04:13+00:00"}}
  {"error":{"message":"Force closing connection to <description of connection> because it reached the maximum allowed backup (buffer size is <size>).","sent":"2017-01-11T17:04:13+00:00"}}

 

Keep-alive signals

At least every 10 seconds, the stream sends a keep-alive signal, or heartbeat, in the form of an \r\n (carriage return and line feed) through the open connection to prevent your client from timing out. Your client application should be tolerant of these \r\n characters in the stream.

If your client properly implements a read timeout on your HTTP library as described here, your app will be able to rely on the HTTP protocol and your HTTP library to throw an event if no data is read within this period, and you will not need to explicitly monitor for the \r\n character.

This event will typically be an exception being thrown or some other event depending on the HTTP library used. It is highly recommended to wrap your HTTP methods with error/event handlers to detect these timeouts. On timeout, your application should attempt to reconnect.
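A minimal Python sketch of this pattern, assuming a blocking read loop with a socket-level timeout; the function and parameter names are illustrative, and `open_fn` is injectable so the logic can be exercised without a network:

```python
import socket
import urllib.request

READ_TIMEOUT = 30  # seconds without any bytes (Tweets or \r\n heartbeats)

def read_until_stall(url, handle_bytes, open_fn=urllib.request.urlopen):
    """Read a stream until it ends or stalls. The socket-level read timeout
    raises an exception when no bytes arrive within READ_TIMEOUT seconds;
    the caller should treat either return value as a cue to reconnect."""
    resp = open_fn(url, timeout=READ_TIMEOUT)
    try:
        while True:
            chunk = resp.read(8192)   # blocks until data, EOF, or timeout
            if not chunk:
                return "closed"       # zero-byte read: server closed stream
            handle_bytes(chunk)
    except (socket.timeout, TimeoutError):
        return "timeout"              # stalled: time out and reconnect
```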

Gzip compression

Streams are delivered in compressed Gzip format. Your client will need to decompress the data as it’s read off the line. If you consume a low-volume stream, some libraries (e.g. Java’s GZIPInputStream or many Ruby stream consumers) handle decompression of incoming data poorly, and will need to be overridden in order to decompress the Tweets as they are received, without waiting for a set threshold of data to be received. As an example of this, see our Java code HERE.
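In Python, for contrast with the Java example linked above, incremental decompression can be done with the standard library's `zlib.decompressobj`, which emits decompressed bytes as soon as they can be decoded rather than waiting for a buffer threshold (the helper name is illustrative):

```python
import zlib

def make_gzip_reader():
    """Return a feed() function that decompresses gzip'd stream bytes
    incrementally: each call returns whatever is decodable so far, so
    low-volume streams are not held back by internal buffering."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: gzip header
    def feed(raw_bytes):
        return decompressor.decompress(raw_bytes)
    return feed
```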

Chunked encoding

Streaming API connections will be encoded using chunked transfer encoding, as indicated by the presence of a Transfer-Encoding: chunked HTTP header in the response. Because most HTTP libraries deal with chunked transfer encoding transparently, this document will assume that your code has access to the reassembled HTTP stream and does not have to deal with this encoding.

If this is not the case, note that Tweets and other streamed messages will not necessarily fall on HTTP chunk boundaries – be sure to use the delimiter defined above to determine activity boundaries when reassembling the stream.
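A Python sketch of that reassembly, assuming you keep a carry-over buffer between reads (names are illustrative); a trailing partial activity stays in the buffer until the next chunk completes it:

```python
def split_activities(buffer, chunk):
    """Append a newly received chunk to the carry-over buffer and split out
    complete activities on the \r\n delimiter. Activities may span HTTP
    chunk boundaries, so the trailing partial line is carried over.
    Empty entries in the result are keep-alive heartbeats."""
    buffer += chunk
    *complete, rest = buffer.split(b"\r\n")
    return complete, rest
```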

Backup buffering

The stream will send data to you as fast as it arrives from the source firehose, which can result in high volumes in many cases. If Twitter’s system cannot write new data to the stream right away (for example if your client is not reading fast enough, or there is a network bottleneck, etc), it will buffer the content on its end to allow your client to catch up. However, when this buffer is full, a forced disconnect will be initiated to drop the connection, and the buffered Tweets will be dropped and not resent. See below for more details.

One way to identify times where your app is falling behind is to compare the timestamp of the Tweets being received with the current time, and track this over time.
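For example, in Python, assuming the activity carries an ISO-8601 timestamp such as the postedTime field in Gnip activity payloads (the function name is illustrative):

```python
from datetime import datetime, timezone

def activity_lag_seconds(posted_time, now=None):
    """How far behind the stream your client is, given a Tweet's timestamp
    (e.g. '2017-01-11T18:12:52.000Z'). Track this over time: a steadily
    growing lag means your client is falling behind."""
    if isinstance(posted_time, str):
        posted_time = datetime.strptime(
            posted_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - posted_time).total_seconds()
```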

Although stream backups can never be completely eliminated due to potential latency and hiccups over the public internet, they can be largely avoided through proper configuration of your app. To minimize the occurrence of backups:

  • Ensure that your client is reading the stream fast enough. Typically you should not do any real processing work as you read the stream. Read the stream and hand the activity to another thread/process/data store to do your processing asynchronously.
  • Ensure that your data center has inbound bandwidth sufficient to accommodate large sustained data volumes as well as significantly larger spikes (e.g. 3-4x normal volume). For filtered streams like PowerTrack, the volume and corresponding bandwidth required on your end are wholly dependent on what you are tracking, and how many Tweets those filters match.

A simple way to test the speed of your app is to connect to the stream using your app while also connecting in parallel via a redundant connection using curl. The curl connection shows the performance you would see with your app code taken out of the chain, allowing you to eliminate variables as you debug.

Volume monitoring

Consider monitoring your stream data volumes for unexpected deviations. A data volume decrease may be symptomatic of a different issue than a stream disconnection: in that situation, the stream would still be receiving the keep-alive signal and probably some new activity data. However, a significantly decreased number of Tweets should lead you to investigate whether anything is causing the decrease in inbound data volume to your application or network, and to check status.gnip.com for any related notices.

To create such monitoring, you could track the number of new Tweets you expect to see in a set amount of time. If a stream’s data volume falls far enough below the specified threshold, and does not recover within a set period of time, then alerts and notifications should be initiated. You may also want to monitor for a large increase in data volume, particularly if you are in the process of modifying rules in a PowerTrack stream, or if an event occurs that produces a spike in Tweet activity.
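One possible shape for such monitoring in Python, with the expected volume and tolerance band left entirely to you (class and parameter names are illustrative):

```python
from collections import deque

class VolumeMonitor:
    """Count activities per fixed window and flag deviations beyond a
    tolerance band around an expected baseline. Recent window counts are
    retained so you can also require the deviation to persist before
    alerting."""

    def __init__(self, expected_per_window, tolerance=0.5, history=12):
        self.expected = expected_per_window
        self.tolerance = tolerance
        self.windows = deque(maxlen=history)  # recent closed-window counts
        self.current = 0

    def record(self, n=1):
        """Call once per activity received in the current window."""
        self.current += n

    def close_window(self):
        """End the current window and classify its volume."""
        count, self.current = self.current, 0
        self.windows.append(count)
        low = self.expected * (1 - self.tolerance)
        high = self.expected * (1 + self.tolerance)
        if count < low:
            return "low"    # investigate inbound volume; check status.gnip.com
        if count > high:
            return "high"   # possible spike event or runaway rule
        return "ok"
```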

Because each of our customers has very different expectations around what “normal” data volume should be for their streams, we do not have a general recommendation for a specific percentage decrease/increase or period of time. We recommend that each customer determine their own metrics for this type of monitoring.

Furthermore, for PowerTrack streams, in order to keep a close watch and tight control over the volume of data your app is pulling in, your app should monitor the volume being consumed on a rule-by-rule basis in realtime or near-realtime. Additionally, we encourage you to implement measures within your app that will alert your team if the volume passes a pre-set threshold, and to possibly introduce other measures such as automated deletion of rules that are bringing in too much data, or disconnecting from the stream completely in extreme circumstances. These steps will help prevent the “runaway volume” scenario that could result in large data bills.

Disconnections

For Twitter’s purposes, stream disconnections are grouped into two categories – client disconnections and forced disconnections.

Client disconnects

Client disconnects occur when your application independently terminates the connection to the data stream, whether because your code actively closes the connection, or where network settings or conditions terminate the connection.

Common issues in client code that can create this behavior include:

  • Closing the connection, rather than allowing it to remain open indefinitely
  • Timing out the connection even though data (activities, system messages, or keep-alive signals) is still being sent through the connection
  • Decompressing the compressed stream using incorrect evaluation of chunk sizes
  • Network or firewall restrictions in your infrastructure can close your connection to the data stream (e.g. firewall session limits) – be sure to check for anything along these lines that might sever existing connections from time to time.
  • Again, using curl and comparing disconnect behavior can be a productive exercise. If your app is experiencing client-side disconnects, yet your curl connection persists, then your code needs some attention. If curl and your app disconnect at the same approximate time, then perhaps your network is timing out your connection. 

Forced disconnects

Forced disconnections occur when our system actively disconnects your app from the stream. There are three types of forced disconnects.

  • Full Buffer – your app is not reading the data fast enough, or a network bottleneck is slowing data flow.
  • Too many connections – your app established too many simultaneous connections to the data stream. When this occurs, Twitter will wait 1 minute, and then disconnect the most recently established connection if the limit is still being exceeded.
  • Server Maintenance – the Twitter team deployed a change or update to the system servers. This will be accompanied by a “Twitter is closing for operational reasons” message. These deploys happen approximately every two weeks, and are announced at this status page.

Forced disconnects close your connection by sending a zero-byte chunk, in accordance with standard HTTP chunked-encoding practice. See Section 9 of this page for reference: http://www.httpwatch.com/httpgallery/chunked. Note that in the case of a full buffer, our system sends the zero-byte chunk, but your app may never receive it due to whatever is causing the data backup on the internet or within your app.

To account for this situation, be sure that your app times out and reconnects if it does not see any data (either a Tweet activity or the keep-alive signal) for more than 30 seconds. Forced disconnects are registered as a “force disconnect” on your console/dashboard, in contrast to a “client disconnect,” which is not initiated by Twitter’s servers.

To minimize forced disconnects for a full buffer, see Backup buffering above.

Reconnecting

Regardless of how your client gets disconnected, you should configure your app to reconnect immediately. If the first reconnection attempt is unsuccessful, your app should use an exponential back-off pattern for subsequent attempts (e.g. wait 1 second, then 2 seconds, then 4, 8, 16, and so on), with some reasonable upper limit. If this upper limit is reached, your client should notify your team so that you can investigate further.
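The back-off schedule above can be sketched as a simple generator (Python, illustrative; in production you would typically also add random jitter and a give-up-and-alert condition at the cap):

```python
def backoff_delays(base=1, cap=64):
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... up to
    a cap, then the cap forever. The consumer sleeps for each yielded
    value before the next reconnection attempt, and resets by creating a
    fresh generator after a successful reconnect."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)
```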

Be sure that your app can handle all of the following scenarios:

  • Connection is closed by something on your end (client disconnect)
  • Connection is closed by Twitter with a zero-byte chunk (forced disconnect)
  • Your app has received zero data through the connection – i.e. no new Tweets AND no keep-alive carriage return signals – for more than 30 seconds. Your app must allow the connection to time out in this situation – note that by default, some clients’ persistent connections can remain open indefinitely unless your code closes them based on an absence of the above signals. In contrast, if you are still receiving the keep-alive signal but no new data, your app should remain connected – there may be an issue on the Twitter side preventing data from being sent, or there may simply be no new activities that match your filters. Remaining connected as long as you are at least receiving the keep-alive signal ensures that you will receive new matching Tweets as soon as Twitter receives them. This is typically accomplished by setting the read timeout on your client to something beyond 30 seconds.

Missing data from disconnections

Any time you disconnect from the stream, you are potentially missing data that would be sent through at that time. However, Twitter provides multiple ways to mitigate these disconnects and recover data when they occur. See below for more information.

  • Redundant Connections - Consume the stream from multiple servers to prevent missed data when one is disconnected.
  • Backfill - Reconnect within 5 minutes and request a backfill of missed data.
  • Replay - Recover data from within the last 5 days using a separate stream.

Special considerations for Replay streams

Replay allows Gnip customers to recover activities missed due to technical hiccups from a rolling window of historical data, and uses the Streaming API for data delivery. Full documentation for the Replay API can be found HERE.

However, your client must take into account the following requirements and special considerations when consuming data from a Replay stream.

Reconnections

Note that if you are disconnected from the stream during a Replay request, your app should not reconnect to the same URL used to initiate the request without altering the fromDate. Doing so will restart the request from the beginning and repeat the portion of the time period covered before the disconnect. Instead, when a disconnect occurs before a request completes, your client must adjust the fromDate and toDate in the URL so that the request begins at the minute of the last activity you collected prior to the disconnection, picking up from the minute you left off and minimizing the amount of overlap.
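A sketch of rebuilding the request URL from the minute of the last collected activity, using the yyyyMMddHHmm format shown in the error-message examples earlier on this page; the endpoint path and parameter handling here are placeholders, not the documented Replay URL:

```python
from datetime import datetime

def replay_resume_url(base_url, last_activity_utc, to_date):
    """Rebuild a Replay request URL so fromDate is the minute (UTC) of the
    last activity received before the disconnect. `base_url` stands in for
    your actual Replay endpoint; `to_date` is already a yyyyMMddHHmm
    string."""
    from_date = last_activity_utc.strftime("%Y%m%d%H%M")
    return f"{base_url}?fromDate={from_date}&toDate={to_date}"
```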

Duplicates

When delivering the results of a Replay request, we will not omit activities you may already have received from a prior request or from your normal PowerTrack stream, so your app should deduplicate accordingly. Note, however, that for billing purposes we deduplicate your activity counts across PowerTrack Replay requests occurring on the same day (based on UTC time), so you are billed only once for each unique activity delivered via Replay. Activities delivered via the realtime PowerTrack stream are counted separately from those delivered via Replay.
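One way to deduplicate in Python is a bounded, least-recently-used set of activity ids; the class name and capacity below are illustrative, and the bound keeps memory flat on long-running clients:

```python
from collections import OrderedDict

class Deduplicator:
    """Drop activities already seen (by id) across the realtime stream and
    Replay requests. An OrderedDict acts as a bounded LRU set: when the
    capacity is exceeded, the least recently seen id is evicted."""

    def __init__(self, max_ids=1_000_000):
        self.seen = OrderedDict()
        self.max_ids = max_ids

    def is_new(self, activity_id):
        """Return True the first time an id is seen, False on repeats."""
        if activity_id in self.seen:
            self.seen.move_to_end(activity_id)   # refresh recency
            return False
        self.seen[activity_id] = None
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)        # evict the oldest id
        return True
```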

Recovery

As mentioned above, PowerTrack and Volume Streams provide important features to help with realtime streams. See our documentation on Additional Streams, Redundant Connections, Replay Streams, and Backfill HERE.

Ready to build your solution?

Review the documentation to get started.
