Expanded and Enhanced URLs

The Expanded and Enhanced URL enrichment automatically expands shortened URLs included in the body of a Tweet and includes the resulting URL as metadata within the payload. It also provides HTML page metadata: the title and description of the destination page.

Tweet payload

The Expanded and Enhanced URL enrichment can be found within the entities object of the Tweet payload, specifically in the entities.urls.unwound object. It provides the following metadata fields:

  • Expanded URL (unwound.url)
  • Expanded HTTP Status (unwound.status)
  • Expanded URL HTML title - 300 character limit (unwound.title)
  • Expanded URL HTML description - 1000 character limit (unwound.description)

Below is an example payload:

  "entities": {
    "hashtags": [
      
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/HkTkwFq8UT",
        "expanded_url": "http:\/\/bit.ly\/2wYTb9y",
        "display_url": "bit.ly\/2wYTb9y",
        "unwound": {
          "url": "https:\/\/www.forbes.com\/sites\/laurencebradford\/2016\/12\/08\/11-websites-to-learn-to-code-for-free-in-2017\/",
          "status": 200,
          "title": "11 Websites To Learn To Code For Free In 2017",
          "description": "It\u2019s totally possible to learn to code for free...but what are the best resources to achieve that? Here are 11 websites where you can get started."
        },
        "indices": [
          10,
          33
        ]
      }
    ],
    "user_mentions": [
      
    ],
    "symbols": [
      
    ]
  },
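
As a minimal sketch of reading these fields, assume the Tweet payload has already been parsed into a Python dict. The helper name unwound_urls is hypothetical and not part of any Twitter library, and the sketch assumes a URL entity may arrive without an unwound object, in which case it is skipped:

```python
def unwound_urls(tweet):
    """Return the `unwound` object of every URL entity that has one."""
    results = []
    for url_entity in tweet.get("entities", {}).get("urls", []):
        unwound = url_entity.get("unwound")
        if unwound is not None:  # assumption: the object may be absent
            results.append(unwound)
    return results


# Trimmed copy of the example payload above.
tweet = {
    "entities": {
        "urls": [
            {
                "url": "https://t.co/HkTkwFq8UT",
                "expanded_url": "http://bit.ly/2wYTb9y",
                "unwound": {
                    "url": "https://www.forbes.com/sites/laurencebradford/2016/12/08/11-websites-to-learn-to-code-for-free-in-2017/",
                    "status": 200,
                    "title": "11 Websites To Learn To Code For Free In 2017",
                },
            }
        ]
    }
}

for unwound in unwound_urls(tweet):
    print(unwound["status"], unwound["title"])
```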

PowerTrack stream filtering

The following PowerTrack operators will filter and provide a tokenized match on the related fields of URL metadata:

url:

  • Example: “url:tennis”
  • Tokenized match on any Expanded URL that includes the word tennis
  • Can also be used as a filter to include or exclude links from a specific website, for example “url:npr.org”

url_title:

  • Example: “url_title:tennis”
  • Tokenized match on any Expanded URL HTML title that includes the word tennis
  • Matches on the HTML title data included in the payload, which is limited to 300 characters.

url_description:

  • Example: “url_description:tennis”
  • Tokenized match on any Expanded URL HTML description that includes the word tennis
  • Matches on the HTML description included in the payload, which is limited to 1000 characters.
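
To build intuition for what a tokenized match means, the sketch below approximates it locally. The exact tokenization PowerTrack applies is not described here, so this is only an assumption: text is lowercased and split on anything that is not an ASCII letter or digit, and a keyword matches when its tokens appear as a contiguous run. The function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase and split on anything that is not an ASCII letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def tokenized_match(keyword, field):
    """True if the keyword's tokens occur as a contiguous run in the field."""
    needle, haystack = tokens(keyword), tokens(field)
    if not needle:
        return False
    return any(
        haystack[i:i + len(needle)] == needle
        for i in range(len(haystack) - len(needle) + 1)
    )


# "tennis" matches as a whole token, but not as part of a longer word.
print(tokenized_match("tennis", "https://example.com/sports/tennis-news"))
print(tokenized_match("tennis", "https://example.com/tennisnews"))
# A domain-style keyword is treated as the token sequence ["npr", "org"].
print(tokenized_match("npr.org", "https://www.npr.org/sections/news/"))
```

Under these assumptions the same function can stand in for url:, url_title:, and url_description: by passing the corresponding payload field.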

HTTP Status Codes

The Expanded URL enrichment also provides the HTTP status code for the final URL we attempt to unwind. In normal cases, this is 200. Values in the 400 series indicate problems resolving the URL.

Various status codes may be returned when attempting to unwind a URL. If we get a redirect while unwinding a URL, we follow it and any further redirects indefinitely until we either:

  • Hit a 200 series code (success)
  • Hit a non-redirect code outside the 200 series (failure)
  • Time out because the final URL could not be resolved in a reasonable amount of time (returns a 408 - timeout)
  • Hit an exception of some sort
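
The loop described above can be sketched roughly as follows. This is not the actual implementation: the unwind function, the head callable (a stand-in for an HTTP HEAD request that returns a status code and an absolute redirect target), the set of redirect codes, and the 10-second default are all assumptions made for illustration. Exceptions are collapsed to 400 here for brevity.

```python
import time

REDIRECT_CODES = {301, 302, 303, 307, 308}  # assumed set of redirect statuses


def unwind(url, head, timeout_seconds=10.0, clock=time.monotonic):
    """Follow redirects starting at `url`; return (final_url, status).

    `head` is a caller-supplied function: head(url) -> (status, location),
    where location is an absolute URL or None.
    """
    deadline = clock() + timeout_seconds
    while True:
        if clock() > deadline:
            return url, 408  # could not resolve in a reasonable time
        try:
            status, location = head(url)
        except Exception:
            return url, 400  # simplified: every exception becomes 400 here
        if status in REDIRECT_CODES and location:
            url = location  # keep following the redirect chain
            continue
        return url, status  # 200-series success or non-redirect failure


# Made-up responses standing in for real network calls.
fake_responses = {
    "https://short.example/a": (301, "https://short.example/b"),
    "https://short.example/b": (302, "https://www.example.com/article"),
    "https://www.example.com/article": (200, None),
}

print(unwind("https://short.example/a", lambda u: fake_responses[u]))
```

Injecting head and clock keeps the sketch runnable without network access and makes the timeout path easy to exercise.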

If an exception is hit, we use the following mapping between reasons and status codes returned:

Reason                          Status Code Returned
SSL Exceptions                  403 (Forbidden)
Unwinding not allowed by URL    405
Socket Timeout                  408 (Timeout)
Unknown Host Exception          404 (Not Found)
Unsupported Operation           404 (Not Found)
Connect Exception               404 (Not Found)
Illegal Argument                400 (Bad Request)
Everything else                 400 (Bad Request)
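
For readers who want to mirror this mapping in their own tooling, here is a hedged sketch in Python. The pairing of each reason with a Python exception type is an assumption chosen as the closest analog, not the classes the service itself uses, and UnwindingNotAllowed is a hypothetical exception defined only for this example.

```python
import io
import socket
import ssl


class UnwindingNotAllowed(Exception):
    """Hypothetical: raised when a URL does not permit unwinding."""


# Checked in order, so more specific types must come before broader ones.
# Each Python type is an assumed analog of the reason named in the table.
EXCEPTION_STATUS = [
    (ssl.SSLError, 403),             # SSL Exceptions
    (UnwindingNotAllowed, 405),      # Unwinding not allowed by URL
    (socket.timeout, 408),           # Socket Timeout
    (socket.gaierror, 404),          # Unknown Host Exception
    (io.UnsupportedOperation, 404),  # Unsupported Operation
    (ConnectionError, 404),          # Connect Exception
    (ValueError, 400),               # Illegal Argument
]


def status_for_exception(exc):
    """Return the status code reported for an exception raised while unwinding."""
    for exc_type, status in EXCEPTION_STATUS:
        if isinstance(exc, exc_type):
            return status
    return 400  # Everything else


print(status_for_exception(socket.gaierror("unknown host")))
print(status_for_exception(KeyError("anything else")))
```

The list is ordered rather than keyed by type because some Python exceptions inherit from more than one entry (for example, io.UnsupportedOperation is also a ValueError), so the first match decides the code.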

 

FAQ

To resolve a shortened link as described above, our system sends HTTP HEAD requests to the URL provided, and follows any redirects until it arrives at the final URL. This URL (NOT the content of the page itself) is then included in the response payload.

For requests made to the Full Archive Search API, we currently only support expanded URL data for Tweets 13 months old or newer.