Full-archive search - Metadata and filtering timeline

Full-Archive Search metadata timeline

This article discusses how the historical changes behind Full-archive Search affect the filters you need to create to find your historical signal of interest. This article, together with a complementary article about Historical PowerTrack, serves as a ‘compare and contrast’ discussion of the two X historical products.

Product overview

The enterprise-tier Full-archive Search was launched in August 2015, and the premium-tier version was launched in February 2018. These search products enable customers to immediately access any publicly available Post. With Full-archive Search you submit a single query and receive a response in classic RESTful fashion. Full-archive Search paginates responses at up to 500 Posts per response, and supports a rate limit of up to 60 requests per minute (rpm) for premium and 120 rpm for enterprise. Given these details, Full-archive Search can be used to retrieve Posts rapidly, and at large scale using concurrent requests.
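
The pagination and rate-limit figures above can be sketched as a small paging loop. This is a minimal illustration, not an official client: `fetch_page` is a hypothetical callable that you would implement to call the Search endpoint, and the `results` and `next` response fields are assumptions about the response shape.

```python
import time
from typing import Callable, Iterator, Optional

MAX_RESULTS = 500          # upper bound on Posts per response, from the text above
REQUESTS_PER_MINUTE = 60   # premium limit from the text above; 120 for enterprise


def iter_posts(fetch_page: Callable[[Optional[str]], dict],
               rpm: int = REQUESTS_PER_MINUTE,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield Posts page by page, pausing between requests to stay under `rpm`.

    `fetch_page(next_token)` is hypothetical: it should send one request
    (passing the token when it is not None) and return the decoded response.
    """
    min_interval = 60.0 / rpm   # seconds to wait between two requests
    next_token = None
    first_request = True
    while True:
        if not first_request:
            sleep(min_interval)
        first_request = False
        page = fetch_page(next_token)
        # Assumed response shape: a list of Posts plus an optional paging token.
        yield from page.get("results", [])
        next_token = page.get("next")
        if next_token is None:
            break
```

Waiting a fixed interval between requests is a deliberately conservative choice; a client that issues concurrent requests would need a shared limiter instead.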

Unlike Historical PowerTrack, whose archive is based on a set of Post flat-files on disk, the Full-archive Search Post archive is much like an on-line database. As with all databases, it supports making queries on its contents. It also makes use of an index to enable high-performance data retrieval. With Full-archive search endpoints, the querying language is made up of PowerTrack Operators, and these Operators each correspond to a Post JSON attribute that is indexed.

As with Historical PowerTrack, some Post attributes are current as of the time a query is made. For example, if you use the Search API today to access a Post posted in 2010, the user's profile description, account 'home' location, display name, and Post metrics for Favorite and Retweet counts will reflect today’s values, not what they were in 2010.

Metadata timelines

Below is a timeline of when Full-archive search endpoint Operators begin matching. In some cases Operator matching began well after a ‘communication convention’ became commonplace on X. For example, @Replies emerged as a user convention in 2006, but did not become a first-class object or event with ‘supporting’ JSON until early 2007. Accordingly, matching on @Replies in 2006 requires an examination of the Post body, rather than relying on the to: and in_reply_to_status_id: PowerTrack Operators.

The details provided here were generated using Full-Archive Search (a product of hundreds of searches). This timeline is not 100% complete or precise. If you identify another filtering/metadata “born on date” fundamental to your use-case, please let us know.

Note that the underlying Search index is subject to being rebuilt. Accordingly, these timeline details are subject to change.

2006

  • March 26 - lang:. An example of Post metadata being backfilled while generating the Search index.
  • July 13 - has:mentions begins matching.
  • October 6 - has:symbols begins matching. $cashtags (or symbols) for discussing stock symbols do not become common until early 2009. Until then most usages were probably slang (e.g., $slang).
  • October 26 - has:links begins matching.
  • November 23 - has:hashtags begins matching.

2007

  • January 30 - First first-class @reply (in_reply_to_user_id). in_reply_to_status_id: begins matching.
  • August 23 - Hashtags emerge as a common convention for organizing topics and conversations. First real use a week later.

2009

  • May 15 - is:retweet. Note that this Operator starts matching with the ‘beta’ release of official Retweets and its “Via @” pattern. During this beta period, the Post verb is ‘post’ and the original Post is not included in the payload.
  • August 13 - Final version of official Retweets is released with the “RT @” pattern, a verb set to ‘share’, and the ‘retweeted_status’ attribute containing the original Post (thus approximately doubling the JSON payload size).

2010

  • March 6 - has:geo, bounding_box: and point_radius: geo Operators begin matching.
  • August 28 - has:videos (Until February 2015, this Operator matches on Posts with links to select video hosting sites such as youtube.com, vimeo.com, and vivo.com).

2011

  • July 20 - has:media and has:images begin matching. Native photos officially announced August 9, 2011.

2014

  • December 3 - (Approximately) Some Enhanced URL metadata with HTML title and description begins in payloads. Enhanced metadata more fully emerged in May 2016.

2015

  • February 10 - has:videos matches on ‘native’ X videos.
  • February 17 - has:profile_geo, profile_country:, profile_region:, profile_locality: Profile Geo Operators begin matching.
  • February 17 - place_country: and place: Post geo Operators begin matching.

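2016

  • May 1 - url_title: and url_description: Operators, based on the Enhanced URLs enrichment, become generally available.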

2017

  • February 22 - Poll metadata become available in enriched native format. No associated Operators for these metadata.

2022

  • September 27 - All Post objects created since this date have Edited Post metadata available. All Enterprise endpoints that provide Post objects were updated to provide this metadata starting on this date. The edit metadata provided includes edit_history and edit_controls objects. These metadata will not be returned for Posts that were created before September 27, 2022. Currently, there are no Enterprise Operators available that match these metadata.  To learn more about Edit Post metadata, check out the Edit Posts fundamentals page.

 


Filtering tips

Given all the above timeline information, it is clear that there are a lot of details to consider when writing Search API filters. There are two key things to consider:

  • Some metadata have ‘born-on’ dates, so filters can result in false negatives. Such searches include Operators reliant on metadata that did not exist for all or part of the search period. For example, if you are searching for Posts with the has:images Operator, you will not have any matches for periods before July 2011. That is because that Operator matches on native photos (attached to a Post using the X user-interface). For a more complete data set of photo-sharing Posts, filters for before July 2011 would need to contain rule clauses that match on common URLs for photo hosting.
  • Some metadata have been backfilled with values from a time after the Post was posted. For example, the lang: Operator matches back to March 2006 because language classifications were backfilled when the Search index was generated.
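
One way to guard against the false negatives described above is to check a search period against Operator ‘born-on’ dates before running a query. The following is a minimal sketch: the dates are copied from the timeline in this article (a subset only), while the `BORN_ON` table and the `uncovered_period` helper are hypothetical names introduced for illustration.

```python
from datetime import date
from typing import Optional, Tuple

# Born-on dates copied from the timeline in this article (subset).
BORN_ON = {
    "has:mentions": date(2006, 7, 13),
    "has:links": date(2006, 10, 26),
    "has:hashtags": date(2006, 11, 23),
    "is:retweet": date(2009, 5, 15),
    "has:geo": date(2010, 3, 6),
    "has:images": date(2011, 7, 20),
    "has:media": date(2011, 7, 20),
    "place_country:": date(2015, 2, 17),
}


def uncovered_period(operator: str, start: date, end: date) -> Optional[Tuple[date, date]]:
    """Return the part of [start, end) that predates the Operator, or None.

    A non-None result is the period in which the Operator cannot match,
    so a different filter (for example, URL or keyword clauses) is needed.
    """
    born = BORN_ON[operator]
    if start >= born:
        return None
    return (start, min(end, born))
```

For a has:images search spanning 2010 through 2011, the helper reports that the period from January 1, 2010 up to July 20, 2011 needs an alternative filter.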

There are several attribute types that are commonly focused on when creating PowerTrack queries:

  • X Profiles
  • Original or shared Posts
  • Post language classification
  • Geo-referencing Posts
  • Shared links and media

Some of these have product-specific behavior while others have identical behavior. See below for more details.

X Profiles

The Search APIs serve historical Posts with the user profile data as it is at the time of retrieval. If you request a Post from 2014, the user’s profile metadata will reflect how it exists at query time.

Original Posts and Retweets

The PowerTrack is:retweet Operator enables users to either include or exclude Retweets. Users of this Operator need two strategies for matching (or not matching) Retweets in data from before August 2009. Before August 2009, the Post message itself needs to be checked, using exact phrase matching, for the “RT @” pattern. If you are filtering on Retweets posted between May and August 2009, the “Via @” pattern should be included as well. For periods after August 2009, the is:retweet Operator is available.
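
The period-dependent Retweet strategy can be sketched as a helper that picks a rule clause by date. This is an illustration only: the cut-off dates come from the 2009 timeline entries (May 15 and August 13), the “RT @” and “Via @” patterns are the ones named in that timeline, and `retweet_clause` is a hypothetical name. Whether a quoted phrase behaves exactly as intended should be verified against your own queries.

```python
from datetime import date

BETA_RETWEETS = date(2009, 5, 15)      # 'beta' official Retweets, "Via @" pattern
OFFICIAL_RETWEETS = date(2009, 8, 13)  # final official Retweets, "RT @" pattern


def retweet_clause(day: date) -> str:
    """Return a rule clause for matching Retweets posted on `day`."""
    if day < BETA_RETWEETS:
        # No Retweet metadata yet: match the user convention in the message.
        return '"RT @"'
    if day < OFFICIAL_RETWEETS:
        # Beta period: include the "Via @" pattern as well.
        return '("RT @" OR "Via @")'
    # Official Retweets: the Operator is available.
    return "is:retweet"
```

To exclude Retweets instead, the returned clause can be negated in the rule, for example by prefixing it with a minus sign.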

Post language classifications

For filtering on a Post's language classification, X's historical products are quite different. When the Search archive was built, all Posts were backfilled with the X language classification. Therefore the lang: Operator is available for the entire Post archive.

Geo-referencing Posts

There are three primary ways to geo-reference Posts:

  • Geographical references in Post message. Matching on geographic references in the Post message, while often the most challenging method since it depends on local knowledge, is an option for the entire Post archive. Here is an example geo-referenced match from 2006 for the San Francisco area based on a ‘golden gate’ filter.

  • Posts geo-tagged by the user. With the search APIs the ability to start matching on Posts with some Geo Operators started in March 2010, and with others in February 2015:

    • March 6, 2010: has:geo, bounding_box: and point_radius:
    • February 17, 2015: place_country: and place:
  • Account profile ‘home’ location set by the user. Profile Geo Operators are available in both Historical PowerTrack and the Search APIs. With the Search APIs, Profile Geo metadata are available starting in February 2015. For Posts posted before Profile Geo metadata became available, the bio_location: Operator is available, which can be used to match on non-normalized user input.
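
The choice between Profile Geo Operators and bio_location: can be sketched as a date check. This is a minimal illustration under stated assumptions: the February 17, 2015 date comes from the timeline above, `profile_location_clause` is a hypothetical helper, and the free-text value passed to bio_location: is whatever users are likely to have typed, so the two clauses are not equivalent in what they match.

```python
from datetime import date

PROFILE_GEO_START = date(2015, 2, 17)  # Profile Geo Operators begin matching


def profile_location_clause(day: date, country_code: str, free_text: str) -> str:
    """Return a profile-location rule clause suited to Posts posted on `day`."""
    if day >= PROFILE_GEO_START:
        # Normalized Profile Geo metadata is available.
        return f"profile_country:{country_code}"
    # Earlier Posts: fall back to non-normalized user input.
    return f'bio_location:"{free_text}"'
```

For example, a search over 2012 would use a clause such as bio_location:"San Francisco", while a search over 2016 could use profile_country:US.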

Shared links and media

In March 2012, the expanded URL enrichment was introduced. Before this time, the Post payloads included only the URL as provided by the user. So, if the user included a shortened URL it can be challenging to match on (expanded) URLs of interest. With the Search APIs, these metadata are available starting in March 2012.

In July 2016, the enhanced URL enrichment was introduced. This enhanced version provides a web site’s HTML title and description in the Post payload, along with Operators for matching on those. These metadata begin emerging in December 2014.

In September 2016 X introduced ‘native attachments’ where a trailing shared link is not counted against the 140 Post character limit. Both URL enrichments still apply to these shared links.

Here are when related Search Operators begin matching:

  • 2006 October 26 - has:links
  • 2011 July 20 - has:images and has:media
  • 2011 August - url: with the Expanded URLs enrichment. As early as September 2006, (url:"spotify.com" OR url:gnip OR url:microsoft OR url:google OR url:youtube) matches http://twitter.com/Adam/statuses/16602, even though there is no urls[] metadata in twitter_entities and gnip objects. “youtube.com” is an example of message content that, without any urls[] metadata, matches url:youtube.
  • 2015 February 10 - has:videos for native videos. Between 2010/08/28 and 2015/02/10, this Operator matches on Posts with links to select video hosting sites such as youtube.com, vimeo.com, and vivo.com.
  • 2016 May 1 - url_title: and url_description:, based on the Enhanced URLs enrichment, generally available. First Enhanced URL metadata began appearing in December 2014.
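
Because the has:videos Operator changes meaning over time, it can help to make that explicit when planning a search. The sketch below encodes the two dates given in the list above; `has_videos_meaning` is a hypothetical helper and the returned strings are descriptive labels, not API values.

```python
from datetime import date
from typing import Optional

VIDEO_LINKS_START = date(2010, 8, 28)   # has:videos matches links to video sites
NATIVE_VIDEO_START = date(2015, 2, 10)  # has:videos matches native videos


def has_videos_meaning(day: date) -> Optional[str]:
    """Describe what has:videos matches for Posts posted on `day`."""
    if day < VIDEO_LINKS_START:
        return None  # the Operator does not match yet
    if day < NATIVE_VIDEO_START:
        return "links to select video hosting sites"
    return "native videos"
```

A search that spans February 2015 therefore mixes two kinds of matches, and may be easier to interpret if it is split at that date.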

Next steps