Historical PowerTrack Technical Details

Technical Overview

The Historical PowerTrack (HPT) API brings the same filtering capabilities developed for real-time streaming to the entire archive of public Tweets. The HPT API was launched in July 2012 by Gnip, and serves data from an archive first assembled for the HPT launch. HPT makes available every public Tweet ever posted, and was designed to deliver Tweet volumes at scale.

The Historical PowerTrack API is used to manage the lifecycle of a historical Job. Using the API, a Job is created with up to 1,000 filtering rules (each up to 2,048 characters long), covering a research period as long as needed. When the Job is created, it is assigned a Universally Unique ID (UUID) that you will use when making API requests and when accessing information via Job URLs. Next, a rough estimate of the number of associated Tweets is provided. This estimate is accurate to an ‘order of magnitude’: are there 100M Tweets associated with my filters, or 100,000? If the Job is accepted, every Tweet posted during the period of interest is examined for a match against any of the included rules.

Jobs can produce millions of Tweets, requiring large amounts of storage space. To keep files small enough to download quickly, data files are generated as a 10-minute time-series, with each file covering a ten-minute period. Depending on the data volumes associated with the Job’s filters, even these 10-minute files can contain many thousands of Tweets. Conversely, if a 10-minute period doesn’t have any data associated with it, the file can be ‘silent.’ Since each hour of the Job’s time period can generate up to 6 files, the number of data files generated by an HPT Job can be large. For example, a 90-day Job can produce up to 12,960 files, one for each 10-minute period of those 90 days.

The data files are hosted on Amazon’s Simple Storage Service (S3) and are available for 15 days. These files are gzip-compressed JSON files encoded in UTF-8. All timestamps used in a Job description, included in API responses, used in filenames, and in the returned Tweet data are in UTC.

When a Job is complete, a list of download links is provided at a Job URL, via the Historical PowerTrack API or by your account manager. Given that this list can contain thousands of links, some form of download automation is needed to retrieve the data. Please refer to Downloading Historical PowerTrack Files for our best practices on downloading your data.
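The file-count arithmetic above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the HPT API: the name `max_file_count` is hypothetical, and it assumes the Job's start time falls on a 10-minute boundary.

```python
from datetime import datetime, timedelta, timezone

FILE_PERIOD = timedelta(minutes=10)  # each data file covers ten minutes


def max_file_count(start: datetime, end: datetime) -> int:
    """Upper bound on the number of data files for a Job covering [start, end).

    Hypothetical helper; assumes `start` is aligned to a 10-minute boundary.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    # timedelta // timedelta yields an int; negating twice gives ceiling division.
    return -(-(end - start) // FILE_PERIOD)


start = datetime(2014, 1, 1, tzinfo=timezone.utc)
print(max_file_count(start, start + timedelta(days=90)))  # 12960
```

The result is an upper bound only, since the actual number of files depends on how much data matches the Job's rules.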

Historical PowerTrack Job URLs

One of the main resources that you will use throughout the Historical PowerTrack purchase process is the Job URL. This URL is built from your Enterprise account name and Job UUID. Its different forms let you review your Job status, access a list of your download files in JSON format, and download a list of your download files in CSV format.

The root host domain:

gnip-api.gnip.com

The URL path has the following pattern:

/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/

 

Here are the Job URLs that you will use to access your data:

  • Individual Job Status:

    http://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json
    
  • Download links as a JSON array:

    http://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.json
    
  • Download links as a CSV:

    http://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.csv
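
As a minimal sketch, the three Job URLs listed above can be assembled from an account name and a Job UUID as follows. The function name `job_urls` and the dictionary keys are hypothetical, chosen only for illustration; the host, scheme, and path pattern are copied from the URLs above.

```python
from urllib.parse import quote

BASE_URL = "http://gnip-api.gnip.com"  # root host domain shown above


def job_urls(account_name: str, job_uuid: str) -> dict:
    """Build the status, JSON-results, and CSV-results URLs for one Job.

    Hypothetical helper; the keys of the returned dict are illustrative.
    """
    # Percent-encode both values so unusual characters cannot break the path.
    root = (
        f"{BASE_URL}/historical/powertrack/accounts/{quote(account_name, safe='')}"
        f"/publishers/twitter/jobs/{quote(job_uuid, safe='')}"
    )
    return {
        "status": f"{root}.json",
        "results_json": f"{root}/results.json",
        "results_csv": f"{root}/results.csv",
    }


for name, url in job_urls("my-account", "gv96x96q3a").items():
    print(name, url)
```

Here `my-account` is a placeholder account name, and the UUID is the example value used in the file-naming section of this document.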
    

Working with Historical PowerTrack files

Here are some high-level details that provide technical background on the Historical PowerTrack (HPT) product and the data files it generates. This information will help you work with the data files after you have downloaded them, or develop your own automation script or application.

  • Files are gzip compressed.
  • File contents are formatted in JSON.
  • File contents use the UTF-8 character set.
  • All timestamps in filenames and data are in UTC.
  • Each HPT job has a Universally Unique ID (UUID) associated with it. This UUID is referenced when making API requests and is used to name the resulting files.
  • HPT generates a 10-minute time-series of files. A file is only generated if the ten-minute period it covers has activity.
  • Data is encoded in JSON.
    • Individual activities are written as ‘atomic’ JSON objects, and are not placed in a JSON array.
    • Each file has a single “info” footer:
      {"info":{"message":"Replay Request Completed","sent":"2014-05-15T17:47:27+00:00","activity_count":895}}
  • Each time period includes its starting ‘top’ unit of time and excludes the start of the next period. For example, the first hour of the day (00:00 - 01:00 UTC) would produce up to 6 files covering these 10-minute time periods:
    • 00:00:00-00:09:59 UTC
    • 00:10:00-00:19:59 UTC
    • 00:20:00-00:29:59 UTC
    • 00:30:00-00:39:59 UTC
    • 00:40:00-00:49:59 UTC
    • 00:50:00-00:59:59 UTC
  • Some planning numbers:
    • 6 files per hour.
    • 144 files per day.
    • 4,320 files per 30-day month.
    • 52,560 files per year.
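
Because activities are written as standalone JSON objects and not as a JSON array, a file cannot be loaded with a single `json.load` call. The sketch below shows one possible way to read a downloaded file: it decodes objects one after another with the standard library's `json.JSONDecoder.raw_decode` and separates the “info” footer from the activities. The function name `read_activities` is hypothetical, and the code assumes that objects are separated only by whitespace (such as newlines) and that the file fits in memory.

```python
import gzip
import json


def read_activities(path):
    """Return (activities, info) from one gzip-compressed HPT data file.

    Hypothetical helper; assumes objects are separated only by whitespace.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()

    decoder = json.JSONDecoder()
    activities, info = [], None
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        # The footer is the object whose only key is "info".
        if isinstance(obj, dict) and set(obj) == {"info"}:
            info = obj["info"]
        else:
            activities.append(obj)
    return activities, info
```

If a footer is present, comparing `info["activity_count"]` with `len(activities)` is a simple way to check that a file was downloaded and parsed completely.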

File-naming conventions

  • HPT file names are a composite of the following details:
    • Job start date, YYYYMMDD.
    • Job end date, YYYYMMDD.
    • Job UUID.
    • Starting time of 10-minute period, YYYYMMDDHHMM.
    • A static “activities” string.
    • File extension of “.json.gz” (gzip-compressed JSON files).

These details are combined in the following pattern:

<start_date>-<end_date>_{JOB_UUID}<10-min-starting-time>_activities.json.gz

For example, given a Job UUID of gv96x96q3a covering a period of 2014-05-16 to 2014-05-20, the first hour of 2014-05-17 would produce the following 6 files:

  • 20140516-20140520_gv96x96q3a201405170000_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170010_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170020_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170030_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170040_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170050_activities.json.gz
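
When automating downloads, it can help to recover these details from a file name. The following is a minimal sketch based only on the naming convention described above; the function name `parse_hpt_filename` and the keys of the returned dictionary are hypothetical. Since the UUID and the 10-minute starting time are not separated by a delimiter, the pattern treats the last 12 digits before “_activities” as the starting time.

```python
import re
from datetime import datetime, timezone

# <start_date>-<end_date>_{JOB_UUID}<10-min-starting-time>_activities.json.gz
FILENAME_RE = re.compile(
    r"^(?P<start>\d{8})-(?P<end>\d{8})_(?P<uuid>.+)(?P<period>\d{12})"
    r"_activities\.json\.gz$"
)


def parse_hpt_filename(filename: str) -> dict:
    """Split an HPT data file name into its parts (hypothetical helper)."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an HPT data file name: {filename!r}")
    return {
        "job_start": datetime.strptime(match["start"], "%Y%m%d").date(),
        "job_end": datetime.strptime(match["end"], "%Y%m%d").date(),
        "job_uuid": match["uuid"],
        # All timestamps in file names are in UTC.
        "period_start": datetime.strptime(match["period"], "%Y%m%d%H%M").replace(
            tzinfo=timezone.utc
        ),
    }


parts = parse_hpt_filename("20140516-20140520_gv96x96q3a201405170010_activities.json.gz")
print(parts["job_uuid"], parts["period_start"].isoformat())
```

Sorting downloaded files by `period_start` restores the chronological order of the 10-minute time-series.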