Full-Archive Search metadata timeline
How Twitter evolved as a platform, and how that affected the JSON used to encode Tweets, is discussed here. That article also begins the discussion of how these JSON details affect creating the filters needed to find your historical signal of interest. This article continues that discussion by exploring how these details affect writing filters for Full-Archive Search. This, and a complementary article about Historical PowerTrack, will serve as a ‘compare and contrast’ discussion of the two Twitter historical products.
The enterprise-tier Full-archive Search was launched in August 2015, and the premium-tier version was launched in February 2018. These search products enable customers to immediately access any publicly available Tweet. With Full-archive Search you submit a single query and receive a response in classic RESTful fashion. Full-archive Search implements (up to) 500-Tweets-per-response pagination, and supports up to a 60-requests-per-minute (rpm) rate-limit for premium, 120 rpm for enterprise. Given these details, Full-archive Search can be used to rapidly retrieve Tweets, and at large scale using concurrent requests.
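The pagination and rate-limit figures above can be sketched as a small paging loop. This is a minimal illustration, not the product's client: `fetch_page` is a hypothetical callable standing in for the HTTP request, and the sketch assumes each response is a dictionary with a `results` list and an optional `next` token.

```python
import time
from typing import Callable, Dict, Iterator, Optional

MAX_TWEETS_PER_RESPONSE = 500  # upper bound per response, per the text above


def max_tweets_per_minute(requests_per_minute: int) -> int:
    """Ceiling on sequential throughput: full pages at the full request rate."""
    return MAX_TWEETS_PER_RESPONSE * requests_per_minute


def paginate(fetch_page: Callable[[Optional[str]], Dict],
             requests_per_minute: int = 60,
             sleep: Callable[[float], None] = time.sleep) -> Iterator[Dict]:
    """Yield Tweets page by page, pausing between requests to respect the limit.

    fetch_page(token) is a hypothetical function that performs one request
    and returns the parsed response (assumed keys: 'results' and 'next').
    """
    min_interval = 60.0 / requests_per_minute
    next_token = None
    while True:
        page = fetch_page(next_token)
        yield from page.get("results", [])
        next_token = page.get("next")
        if next_token is None:
            break  # no further pages
        sleep(min_interval)
```

At 60 rpm this ceiling works out to 30,000 Tweets per minute, and at 120 rpm to 60,000, before any concurrency is added.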
Unlike Historical PowerTrack, whose archive is based on a set of Tweet flat-files on disk, the Full-archive Search Tweet archive is much like an on-line database. As with all databases, it supports making queries on its contents. It also makes use of an index to enable high-performance data retrieval. With Full-archive search endpoints, the querying language is made up of PowerTrack Operators, and these Operators each correspond to a Tweet JSON attribute that is indexed.
Also, like Historical PowerTrack, there are Tweet attributes that are current to the time a query is made. For example, if you are using Search API to access a Tweet posted in 2010 today, the user's profile description, account 'home' location, display name, and Tweet metrics for Favorites and Retweet counts will be updated to today’s values and not what they were in 2010.
Below is a timeline of when Full-archive search endpoint Operators begin matching. In some cases Operator matching began well after a ‘communication convention’ became commonplace on Twitter (see here for a discussion of user conventions being integrated into the Twitter platform). For example, @Replies emerged as a user convention in 2006, but did not become a first-class object or event with ‘supporting’ JSON until early 2007. Accordingly, matching on @Replies in 2006 requires an examination of the Tweet body, rather than relying on the in_reply_to_status_id: PowerTrack Operator.
The details provided here were generated using Full-Archive Search (a product of hundreds of searches), and were informed by the Twitter timeline provided HERE. This timeline is not 100% complete or precise. If you identify another filtering/metadata “born on date” fundamental to your use-case, please let us know.
Note that the underlying Search index is subject to being rebuilt. Accordingly, these timeline details are subject to change.
- March 26 - lang:. An example of Tweet metadata being backfilled while generating the Search index.
- July 13 -
- October 6 - has:symbols. $cashtags (or symbols) for discussing stock symbols do not become common until early 2009. Until then most usages were probably slang (e.g., $slang).
- October 26 -
- November 23 -
- January 30 - First first-class @reply (in_reply_to_user_id),
- August 23 - Hashtags emerge as a common convention for organizing topics and conversations. First real use a week later.
- May 15 - is:retweet. Note that this Operator starts matching with the ‘beta’ release of official Retweets and its “Via @” pattern. During this beta period, the Tweet verb is ‘post’ and the original Tweet is not included in the payload.
- August 13 - Final version of official Retweets is released with the “RT @” pattern, a verb set to ‘share’, and the ‘retweeted_status’ attribute containing the original Tweet (thus approximately doubling the JSON payload size).
- March 6 - point_radius: geo Operators begin matching.
- August 28 - has:videos (Until February 2015, this Operator matches on Tweets with links to select video hosting sites such as youtube.com, vimeo.com, and vivo.com).
- July 20 - has:images begins matching. Native photos officially announced August 9, 2010.
- December 3 - (Approximately) Some Enhanced URL metadata with HTML title and description begins in payloads. Enhanced metadata more fully emerged in May 2016.
- February 10 - has:videos matches on ‘native’ Twitter videos.
- February 17 - profile_locality: Profile Geo Operators begin matching.
- February 17 - place: Tweet geo Operators begin matching.
- May 1 - Enhanced URL metadata more fully available, and was officially announced as part of the Gnip 2.0 launch in August 2016. No associated Operators for these metadata with Search APIs.
- February 22 - Poll metadata become available in enriched native format. No associated Operators for these metadata.
Given all the above timeline information, it is clear that there are many details to consider when writing Search API filters. There are two key things to consider:
- Some metadata have ‘born-on’ dates, so filters can result in false negatives. Such searches include Operators reliant on metadata that did not exist for all or part of the search period. For example, if you are searching for Tweets with the has:images Operator, you will not have any matches for periods before July 2011. That is because that Operator matches on native photos (attached to a Tweet using the Twitter user-interface). For a more complete data set of photo-sharing Tweets, filters for before July 2011 would need to contain rule clauses that match on common URLs for photo hosting.
- Some metadata has been backfilled with metadata from a time after the Tweet was posted.
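The false-negative risk described above can be checked before a search is submitted. The following is a minimal sketch under stated assumptions: the `BORN_ON` table is deliberately partial and holds only Operators whose dates appear in this article, and `operators_at_risk` is a hypothetical helper name, not part of any Twitter library.

```python
import re
from datetime import date
from typing import List

# Partial table of 'born-on' dates, taken from the timeline in this article.
BORN_ON = {
    "is:retweet": date(2009, 5, 15),
    "point_radius:": date(2010, 3, 6),
    "has:videos": date(2010, 8, 28),
    "has:images": date(2011, 7, 20),
    "profile_locality:": date(2015, 2, 17),
    "place:": date(2015, 2, 17),
    "url_description:": date(2016, 5, 1),
}


def operators_at_risk(query: str, from_date: date) -> List[str]:
    """Return Operators in the query that cannot match at the start of the period."""
    at_risk = []
    for operator, born in BORN_ON.items():
        # Operators without a trailing colon must not run into a longer word.
        tail = "" if operator.endswith(":") else r"(?!\w)"
        pattern = r"(?<!\w)" + re.escape(operator) + tail
        if born > from_date and re.search(pattern, query):
            at_risk.append(operator)
    return at_risk
```

For instance, `operators_at_risk("snow has:images", date(2010, 1, 1))` flags `has:images`, signalling that the early part of the period needs a fallback clause such as URL matching.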
As discussed HERE there are several attribute types that are commonly focused on when creating PowerTrack queries:
- Twitter Profiles
- Original or shared Tweets
- Tweet language classification
- Geo-referencing Tweets
- Shared links and media
Some of these have product-specific behavior while others have identical behavior. See below for more details.
Twitter Profiles
The Search APIs serve historical Tweets with the user profile data set as it is at the time of retrieval. If you request a Tweet from 2014, the user’s profile metadata will reflect how it exists at query-time.
Original Tweets and Retweets
The is:retweet Operator enables users to either include or exclude Retweets. Users of this Operator need two strategies for Retweet matching (or not matching) when their search period reaches back before August 2009. Before August 2009, the Tweet message itself needs to be checked, using exact phrase matching, for matches on the “RT @” pattern (if you are filtering on Retweets from between May and August 2009, the “Via @” pattern should be included as well). For periods after August 2009, the is:retweet Operator is available.
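The two strategies can be combined into a single clause chosen by search period. This is a rough sketch, assuming the May 15 and August 13 timeline dates fall in 2009 and using the “RT @” pattern named in the timeline; `retweet_clause` is a hypothetical helper, and the quoted strings rely on exact phrase matching.

```python
from datetime import date

BETA_RETWEETS = date(2009, 5, 15)      # 'Via @' beta period begins
OFFICIAL_RETWEETS = date(2009, 8, 13)  # final version of official Retweets


def retweet_clause(from_date: date, to_date: date) -> str:
    """Build a Retweet-matching clause suited to the requested period."""
    clauses = []
    if from_date < OFFICIAL_RETWEETS:
        # Early Retweets are only visible as text in the Tweet message.
        clauses.append('"RT @"')
        if to_date >= BETA_RETWEETS:
            clauses.append('"Via @"')
    if to_date >= OFFICIAL_RETWEETS:
        clauses.append("is:retweet")
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " OR ".join(clauses) + ")"
```

A period spanning all three eras yields `("RT @" OR "Via @" OR is:retweet)`; to exclude Retweets instead, the returned clause can be negated with a leading `-`.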
Tweet language classifications
For filtering on a Tweet’s language classification, Twitter’s historical products are quite different. When the Search archive was built, all Tweets were backfilled with the Twitter language classification. Therefore the lang: Operator is available for the entire Tweet archive.
Geo-referencing Tweets
As discussed HERE there are three primary ways to geo-reference Tweets:
Geographical references in Tweet message. Matching on geographic references in the Tweet message, while often the most challenging method since it depends on local knowledge, is an option for the entire Tweet archive. Here is an example geo-referenced match from 2006 for the San Francisco area based on a ‘golden gate’ filter.
Tweets geo-tagged by the user. With the Search APIs, matching on Tweets with some Geo Operators starts in March 2010, and with others in February 2015:
- March 6, 2010:
- February 17, 2015:
- March 6, 2010:
Account profile ‘home’ location set by the user. Profile Geo Operators are available in both Historical PowerTrack and the Search APIs. With the Search APIs, these Profile Geo metadata are available starting in February 2015. For Tweets posted before Profile Geo metadata became available, the bio_location: Operator is available, which can be used to match on non-normalized user input.
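One way to act on this is to split a search period at the Profile Geo start date and issue a different query for each segment. The sketch below is illustrative only: `profile_geo_plan` is a hypothetical helper, the start date is taken from the February 17, 2015 timeline entry, and each segment is treated as a half-open interval (start inclusive, end exclusive), which is an assumption about how the dates are passed to a request.

```python
from datetime import date
from typing import List, Tuple

PROFILE_GEO_START = date(2015, 2, 17)  # Profile Geo Operators begin matching


def profile_geo_plan(place_name: str, from_date: date,
                     to_date: date) -> List[Tuple[date, date, str]]:
    """Return (start, end, query) segments covering [from_date, to_date)."""
    value = '"' + place_name + '"'
    plan = []
    if from_date < PROFILE_GEO_START:
        # Before Profile Geo: match the raw, non-normalized profile location.
        plan.append((from_date, min(to_date, PROFILE_GEO_START),
                     "bio_location:" + value))
    if to_date > PROFILE_GEO_START:
        # After Profile Geo: match the normalized locality.
        plan.append((max(from_date, PROFILE_GEO_START), to_date,
                     "profile_locality:" + value))
    return plan
```

Because bio_location: matches free-form user input, the earlier segment will generally be noisier than the later one and may need extra spelling variants.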
Shared links and media
In March 2012, the expanded URL enrichment was introduced. Before this time, the Tweet payloads included only the URL as provided by the user. So, if the user included a shortened URL it can be challenging to match on (expanded) URLs of interest. With the Search APIs, these metadata are available starting in March 2012.
In July 2016, the enhanced URL enrichment was introduced. This enhanced version provides a web site’s HTML title and description in the Tweet payload, along with Operators for matching on those. These metadata begin emerging in December 2014.
In September 2016, Twitter introduced ‘native attachments’, where a trailing shared link is not counted against the 140-character Tweet limit. Both URL enrichments still apply to these shared links.
Here is when related Search Operators begin matching:
- 2006 October 26 -
- 2011 July 20 -
- 2011 August - url: with the Expanded URLs enrichment. As early as September 2006, (url:"spotify.com" OR url:gnip OR url:microsoft OR url:google OR url:youtube) matches http://twitter.com/Adam/statuses/16602, even though there is no urls metadata in the twitter_entities and gnip objects. “youtube.com” is an example of message content that, without any urls metadata, matches url:youtube.
- 2015 February 10 - has:videos for native videos. Between 2010/08/28 and 2015/02/10, this Operator matches on Tweets with links to select video hosting sites such as youtube.com, vimeo.com, and vivo.com.
- 2016 May 1 - url_description:, based on the Enhanced URLs enrichment, generally available. First Enhanced URL metadata began appearing in December 2014.