
Advanced filtering with geo data

Introduction

In our “Filtering Tweets by location” tutorial, we introduced the two types of geographical metadata found in Tweets (see below), the available enterprise operators used to filter for geo data, and a brief example of how to use these operators.

  1. Tweet level - geo metadata available at the Tweet level
  2. Account level - geo metadata available at the user account level (provided by user in public profile)
     

In this tutorial, we will briefly review some of that information and then show how to create effective filters using these operators to target both Tweet level and account level geo in Tweets. If you've read the “Filtering Tweets by location” tutorial, you may want to skip the review section below and jump straight to the “Building effective filters” portion of this tutorial.

This guide is intended to support premium and enterprise developers with a use case that involves filtering Tweets by geographical attributes. While some of the content may be applicable to the broader developer platform (for example, to developers using the Twitter API v2), this tutorial assumes that a developer has access to the Profile Geo enrichment which is only available in the premium and enterprise APIs today.
 

A brief review

Before diving in, we'll briefly review the different types of geographical metadata available, where to find it, and how prevalent it is (volume estimates).
 

1. Geo-tagged Tweets (Tweet-level)

Prevalence: ~1-2% of Tweets are geo-tagged

Individual Tweets can be tagged with location information at the time a Tweet is published. There are two types of location information available at the Tweet level:

  1. Precise location (LONG, LAT) - decimal degree coordinates for the exact location
    1. Note: In June 2019, Twitter removed the ability to tag Tweets with precise location when using the Twitter iOS or Android app. The ability to tag a Tweet with precise location still exists through the API, with some third-party clients (via API), and when using the in-app camera on Twitter for iOS and Android.
  2. A Twitter "place" (e.g., a local coffee shop, neighborhood, or city) includes a name, type, country code, and bounding box consisting of four [LONG, LAT] coordinates that define the area.

Tweets tagged with a place use venues powered by Foursquare and make up the vast majority (~80%) of geo-tagged Tweets that are published on the platform. It is possible for a Tweet to be tagged with both a "place" and a precise location; however, the usage of precise location is very limited.

Precise location data is rendered in the root-level "coordinates" object of a Tweet payload (example below):

      "coordinates": {
        "type": "Point",
        "coordinates": [
          -149.90629456,
          61.19710597
        ]
      }
    

Note: the payload also contains a "geo" object that has the coordinates in reverse order [LAT, LONG], but it is deprecated and we recommend using the "coordinates" object instead.
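As a minimal sketch of reading these fields, the helper below returns a Tweet's precise location as a (long, lat) pair. It prefers the "coordinates" object and only falls back to the deprecated "geo" object, swapping its order. The function name tweet_point and the sample payload are illustrative assumptions, not part of the API.

```python
from typing import Optional, Tuple


def tweet_point(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (long, lat) for a geo-tagged Tweet, or None if absent."""
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # already ordered [LONG, LAT]
        return (lon, lat)
    geo = tweet.get("geo")
    if geo and geo.get("coordinates"):
        lat, lon = geo["coordinates"]  # deprecated object stores [LAT, LONG]
        return (lon, lat)
    return None


# Sample payload fragment taken from the example above.
sample = {
    "coordinates": {"type": "Point", "coordinates": [-149.90629456, 61.19710597]}
}
print(tweet_point(sample))
```

Returning a single, consistently ordered pair keeps downstream code from having to remember which object uses which order.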

Place data is rendered in the root-level "place" object of a Tweet payload. Places can have the following types: POI, neighborhood, city (most common), country, and admin. Here's an example below:

      "place": {
  "id": "07d9c93846c86001",
  "url": "https://api.twitter.com/1.1/geo/id/07d9c93846c86001.json",
  "place_type": "poi",
  "name": "Cannon Mine Coffee",
  "full_name": "Cannon Mine Coffee",
  "country_code": "US",
  "country": "United States",
  "bounding_box": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -105.090323,
          39.996365
        ],
        [
          -105.090323,
          39.996365
        ],
        [
          -105.090323,
          39.996365
        ],
        [
          -105.090323,
          39.996365
        ]
      ]
    ]
  },
  "attributes": {
    
  }
}
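The bounding box in the example above has four identical corners, which is how a single-point place such as this POI appears. The sketch below reduces a place's polygon to its west/south/east/north extent and flags that degenerate case. The helper names bbox_extent and is_point_place are assumptions made for illustration.

```python
from typing import Tuple


def bbox_extent(place: dict) -> Tuple[float, float, float, float]:
    """Return (west, south, east, north) from a place's bounding_box polygon."""
    ring = place["bounding_box"]["coordinates"][0]  # list of [LONG, LAT] pairs
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return (min(lons), min(lats), max(lons), max(lats))


def is_point_place(place: dict) -> bool:
    """True when every corner of the bounding box is the same coordinate."""
    west, south, east, north = bbox_extent(place)
    return west == east and south == north


# Abbreviated copy of the "Cannon Mine Coffee" place object shown above.
poi = {
    "place_type": "poi",
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[[-105.090323, 39.996365]] * 4],
    },
}
print(bbox_extent(poi), is_point_place(poi))
```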
    


2. Profile location (account level)

Prevalence: ~30-40% of Tweets contain some profile location information.

The user object contained within a Tweet payload may also contain geographic information. Users can customize their profile through several fields, including a "location" field. If present, the Profile Geo Enrichment (only available with enterprise APIs) adds structured geodata relevant to the user provided location value by geocoding and normalizing location strings where possible.

Note: The user provided location field accepts any valid string, even if it isn't a real location or place. Users may choose to input a fictional place or something generic that isn't tied to an actual location. In these instances, the profile geo enrichment will not add any structured metadata (as you might expect) and the "derived" object will not be present in the payload. 

The user defined location is rendered in the "user.location" object of a Tweet payload (see below):

      "user": {
  "id": 495309159,
  "id_str": "495309159",
  "name": "Twitter New York City",
  "screen_name": "TwitterNYC",
  "location": "New York, NY",
  ...
    

For enterprise API customers that have Profile Geo enabled, the additional metadata is found in the "user.derived" object of a Tweet payload (see below):

Example at the region/state level:

      "derived": {
        "locations": [
          {
            "country": "United States",
            "country_code": "US",
            "region": "New York",
            "full_name": "New York, United States",
            "geo": {
              "coordinates": [
                -75.4999,
                43.00035
              ],
              "type": "point"
            }
          }
        ]
      }
    

Example at the locality (city) level:

      "derived": {
        "locations": [
          {
            "country": "United States",
            "country_code": "US",
            "locality": "Boulder",
            "region": "Colorado",
            "sub_region": "Boulder County",
            "full_name": "Boulder, Colorado, United States",
            "geo": {
              "coordinates": [
                -105.27055,
                40.01499
              ],
              "type": "point"
            }
          }
        ]
      }
    

A more nuanced example with sub_region (county level):

London is the capital and largest city of both England and the United Kingdom. The example below shows how it is represented in the Profile Geo portion of the payload:

      {
 "locations": [
   {
     "country": "United Kingdom",
     "country_code": "GB",
     "locality": "London",
     "region": "England",
     "sub_region": "Greater London",
     "full_name": "London, England, United Kingdom",
     "geo": {
       "coordinates": [
         -0.12574,
         51.50853
       ],
       "type": "point"
     }
   }
 ]
}
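Because the "derived" object is absent whenever the enrichment cannot geocode a profile location, code that reads it should tolerate missing keys. The following is one possible sketch; the helper names are hypothetical and the sample Tweet is abbreviated from the London example above.

```python
from typing import List, Set


def profile_locations(tweet: dict) -> List[dict]:
    """Return the Profile Geo locations for a Tweet, or [] when not present."""
    user = tweet.get("user") or {}
    derived = user.get("derived") or {}
    return derived.get("locations") or []


def profile_country_codes(tweet: dict) -> Set[str]:
    """Collect the country codes found in the derived locations."""
    return {
        loc["country_code"]
        for loc in profile_locations(tweet)
        if "country_code" in loc
    }


london_tweet = {
    "user": {
        "location": "London",
        "derived": {
            "locations": [
                {
                    "country": "United Kingdom",
                    "country_code": "GB",
                    "locality": "London",
                    "region": "England",
                    "sub_region": "Greater London",
                }
            ]
        },
    }
}
print(profile_country_codes(london_tweet))

# A profile with a location string that could not be geocoded has no "derived".
print(profile_locations({"user": {"location": "somewhere over the rainbow"}}))
```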
    

In summary, these are the two types of geographical metadata that you can expect to find in Tweets when working with the API. It's important to keep in mind that both types of geo metadata are elective, and controlled by the user. Additionally, Twitter does not validate the location information provided at the Tweet or account level against IP address or other sources of signal; therefore, the geo metadata may not represent where a user was when a Tweet was sent, or where they're geographically based (profile location).


List of Operators

The tables below list the available operators for each type of geographical metadata as well as a brief description and which field in the payload the operator matches on.
 

Tweet location

Tweet location operators Description

has:geo

Not a supported operator with Search API

Matches Tweets that have been geo-tagged with a precise location or location in the form of a Twitter "place."


Matches object: coordinates*, place*

place:

Matches Tweets tagged with the specified location or Twitter place ID


Matches object: place.full_name, place.id

place_country:

Matches Tweets where the country code associated with a tagged place/location matches the given ISO alpha-2 character code.


Matches object: place.country_code

point_radius:

Radius must be less than 25mi

Matches against the precise location of a geo-tagged Tweet and/or against a “place” geo polygon, where the place is fully contained within the defined region.


Matches object: coordinates.coordinates, place.bounding_box.coordinates

bounding_box:

Width and height of the bounding box must be less than 25mi

Matches against the precise location of a geo-tagged Tweet and/or against a “place” geo polygon, where the place is fully contained within the defined region.


Matches object: coordinates.coordinates, place.bounding_box.coordinates
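If you post-process matched Tweets locally, a great-circle distance check is one way to approximate the circular region that point_radius describes. This is a sketch only: the Earth radius constant and the function names are assumptions, and it does not describe how the operator itself is implemented.

```python
import math
from typing import Tuple

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles (assumed constant)


def haversine_miles(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in miles between two (long, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(min(1.0, a)))


def within_radius(
    point: Tuple[float, float], center: Tuple[float, float], radius_mi: float
) -> bool:
    """True when a (long, lat) point lies within radius_mi of the center."""
    return haversine_miles(point[0], point[1], center[0], center[1]) <= radius_mi


print(within_radius((-122.40, 37.77), (-122.41, 37.78), 25))
```

Both arguments use (long, lat) order to match the "coordinates" object and the operator's argument order.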


Profile location

Profile location operators

Description

has:profile_geo

Not a supported operator with Search API

Matches Tweets that have any Profile Geo metadata present


Matches object: user.derived[]

profile_country:

Exact match on the “country_code” field from the derived locations object.


Matches object: user.derived.locations[].country_code

profile_region:

Exact match on the “region” field from the derived locations object.


Matches object: user.derived.locations[].region

profile_locality:

Exact match on the “locality” field from the derived locations object.


Matches object: user.derived.locations[].locality

profile_subregion:

Not a supported operator with Search API

Exact match on the "sub_region" field from the derived locations object.


Matches object: user.derived.locations[].sub_region

*bio_location:

Not a supported operator with Search API

Tokenized match on the optional user location field in their account profile.


Matches object: user.location

*The only operator that doesn't use/require the enterprise Profile Geo enrichment.
 

Proxy geo operators

The table below contains operators that can proxy for geographic information:

Proxy geo operators

Description

bio:

Matches a keyword or phrase within a user's bio description.

Note: bio descriptions are user-generated text and may not include location information

Matches object: user.description

lang:

Matches Tweets that have been classified by Twitter as being of a particular language.


Matches object: lang

keyword

Matches a keyword within the body of a Tweet (including both URLs and unwound URLs). This is a tokenized match.


Matches object: text

"exact phrase"

Matches an exact phrase within the body of a Tweet.


Matches object: text

Note: The keyword and exact phrase operators above can be used to match explicit mentions of a place in a Tweet.
 

Building effective filters

Upfront decisions

Before you create any queries or rules, you'll need to decide up front whether or not you want to filter on geo-tagged Tweets, profile location, or a combination of both. This really depends on your use case and how you plan to use the Tweets collected from the API. For example, if you want to plot Tweets on a map for visual display, it may be best to focus on geo-tagged Tweets. However, if you want to follow conversations or analyze sentiment of a topic and how it varies across regions, the comparatively large volume of Tweets with profile location may suit you better.

Another factor to consider is the availability of the data. As mentioned in the review section, approximately 1-2% of Tweets are geo-tagged, but 30-40% of Tweets published contain some profile location information. Given the limited volume of geo-tagged Tweets, it can be difficult to satisfy a use case that's focused solely on Tweets with geo data. In that case, you may want to explore filtering on profile location in addition to, or in place of, geo-tagged Tweets.
 

Get place IDs

Some of the examples below will utilize a "place ID" (e.g., 96683cc9126741d1). There are two ways to programmatically retrieve Twitter place information:

  1. GET geo/reverse_geocode - takes a latitude and a longitude and retrieves up to 20 places (note: the place ID returned can be used as the value with the 'place:' operator)
  2. GET geo/search - search by lat/long or query (free form text) to get a list of all the valid places (note: the place ID returned can be used as the value with the 'place:' operator)
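For the geo/search endpoint, the request path can be assembled programmatically so that free-form text is percent-encoded exactly once. The sketch below only builds the path string and sends nothing. The function name and its defaults are assumptions, and authentication is left to whatever client you use.

```python
from urllib.parse import quote, urlencode


def geo_search_path(query: str, granularity: str = "country", max_results: int = 2) -> str:
    """Build the path and query string for a GET geo/search request."""
    params = {
        "query": query,  # pass the raw text; it is encoded below
        "granularity": granularity,
        "max_results": max_results,
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return "/1.1/geo/search.json?" + urlencode(params, quote_via=quote)


print(geo_search_path("United States"))
```

Passing the raw string (rather than an already-encoded one) avoids encoding the "%" character a second time.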

Let's use the GET geo/search endpoint to get the place ID for the “United States,” which we'll use in the first real world example below:

Request:

twurl '/1.1/geo/search.json?query=United%20States&granularity=country&max_results=2'

Response:

      {
  "query": {
    "params": {
      "granularity": "country",
      "query": "United%20States",
      "trim_place": false
    },
    "type": "search",
    "url": "https://api.twitter.com/1.1/geo/search.json?query=United%2520States&granularity=country&max_results=2"
  },
  "result": {
    "places": [
      {
        "id": "6416b8512febefc9",
        "name": "United Kingdom",
        "full_name": "United Kingdom",
        "country": "United Kingdom",
        "country_code": "GB",
        "url": "https://api.twitter.com/1.1/geo/id/6416b8512febefc9.json",
        "place_type": "country",
        "attributes": {},
        "bounding_box": {
          "type": "Polygon",
          "coordinates": [
            [
              [
                -8.662663,
                49.1626564
              ],
              [
                -8.662663,
                60.86165
              ],
              [
                1.768926,
                60.86165
              ],
              [
                1.768926,
                49.1626564
              ],
              [
                -8.662663,
                49.1626564
              ]
            ]
          ]
        },
        "centroid": [
          -1.9280975903801871,
          54.3306827
        ],
        "contained_within": []
      },
      {
        "id": "96683cc9126741d1",
        "name": "United States",
        "full_name": "United States",
        "country": "United States",
        "country_code": "US",
        "url": "https://api.twitter.com/1.1/geo/id/96683cc9126741d1.json",
        "place_type": "country",
        "attributes": {},
        "bounding_box": {
          "type": "Polygon",
          "coordinates": [
            [
              [
                -179.231086,
                13.182335
              ],
              [
                -179.231086,
                71.434357
              ],
              [
                179.859685,
                71.434357
              ],
              [
                179.859685,
                13.182335
              ],
              [
                -179.231086,
                13.182335
              ]
            ]
          ]
        },
        "centroid": [
          -98.99308143101959,
          36.890333500000004
        ],
        "contained_within": []
      }
    ]
  }
}
    

Two results were returned and we can quickly see that we want the "id": "96683cc9126741d1" field from the "United States" place.
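Picking the right entry by eye works for two results, but the same selection can be scripted. The sketch below assumes the response has already been parsed into a dictionary. The function name find_place_id is hypothetical, and the sample is abbreviated from the response above.

```python
from typing import Optional


def find_place_id(
    response: dict, full_name: str, place_type: Optional[str] = None
) -> Optional[str]:
    """Return the ID of the first place whose full_name matches exactly."""
    for place in response.get("result", {}).get("places", []):
        if place.get("full_name") != full_name:
            continue
        if place_type and place.get("place_type") != place_type:
            continue
        return place.get("id")
    return None


# Abbreviated copy of the geo/search response shown above.
response = {
    "result": {
        "places": [
            {"id": "6416b8512febefc9", "full_name": "United Kingdom", "place_type": "country"},
            {"id": "96683cc9126741d1", "full_name": "United States", "place_type": "country"},
        ]
    }
}
print(find_place_id(response, "United States", place_type="country"))
```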
 

Real-world examples

In this section, we'll use a real-world use case and walk through how you can apply geo filters to a rule at different levels of specificity: country, state/province, city, and neighborhood level. Geo data is often just one element of a query or rule that you're looking to build, so for the purposes of this section, we'll use the following scenario:

Use case: Collect Tweets about the Tokyo Summer Olympic Games with varying levels of granularity. More specifically, we will use the following elements to compose the rule:

  1. Mentions: @Tokyo2020, @Olympics
    1. This will capture both @ mentions and replies to the specified handle.
  2. Hashtags: #Tokyo2020
  3. Keywords: olympics, "olympic games", "olympic team"
  4. Exclude Retweets: -is:retweet
    1. Note: A Retweet cannot be tagged with a location; therefore, we do not need to add this negation clause to rules that target geo-tagged Tweets. It will only be added in the rules that target profile geo.
  5. Geo filters: these will change based on each level of granularity below.

The conditions above will be combined together, as clauses, to form the following rule value (see documentation on building rules):

       "value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") (<geo-filter-clauses>) -is:retweet"

Results: ~734,000 (without any geo clauses)
    

The results count above represents Tweet volume for a 3-month period (March-June 2021) for the query before any geo filters are applied. In the sections below, the results are also shared for each filter option to illustrate the impact of the geo filters on the volume of matching Tweets over the same period.
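Since every rule in the sections that follow shares the same use case clauses, it can help to generate the rule value from its parts. This is a minimal sketch; build_rule_value and USE_CASE_TERMS are illustrative names, and json.dumps is used only to show how the inner quotes end up escaped in a JSON payload.

```python
import json

# Use case clauses from the list above.
USE_CASE_TERMS = [
    "@Tokyo2020",
    "@Olympics",
    "#Tokyo2020",
    "olympics",
    '"olympic games"',
    '"olympic team"',
]


def build_rule_value(geo_clause: str = "", exclude_retweets: bool = False) -> str:
    """Combine the use case terms with an optional geo clause and negation."""
    parts = ["(" + " OR ".join(USE_CASE_TERMS) + ")"]
    if geo_clause:
        parts.append(geo_clause)
    if exclude_retweets:
        # Only needed for profile geo rules; Retweets cannot be geo-tagged.
        parts.append("-is:retweet")
    return " ".join(parts)


rule = build_rule_value("profile_country:US", exclude_retweets=True)
print(json.dumps({"value": rule}))
```

Keeping the geo clause as a single argument makes it easy to swap in each of the filters discussed below while leaving the rest of the rule untouched.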

Next, we're going to add geo filters at varying levels of granularity – starting with the country level.
 

How to filter for Tweets at the country level

Now, let's say that we want to limit our Tweet results to those that are in a specific country. We must first identify the operators that provide the ability to filter at the country level:

  1. place_country:
  2. profile_country:
  3. place:

We'll use the United States as an example. The following are a few options to accomplish this:

      Filter for geo-tagged Tweets sent from the US
"value": "place_country:US"

Filter for Tweets from accounts in the US
"value": "profile_country:US"

Filter for Tweets tagged with the exact place ID for the US
(this is a narrow subset of place_country:US)
"value": "place:96683cc9126741d1"

    

Now, let's add in the additional clauses to create a rule value that encompasses our use case and compare the results (note: the results represent Tweet volume that matches each query from March-June 2021):

      "value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") place_country:US"

Results: ~5,500

"value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") profile_country:US -is:retweet"

Results: ~125,000

"value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") place:96683cc9126741d1"

Results: ~10
    

You'll notice that the results increase significantly (20x) when using the profile_country operator instead of place_country. This is because significantly more Tweets contain profile location compared to Tweet-level location. Additionally, very few Tweets matched the specific place ID for the "United States." The place ID has to be an exact match; it doesn't encompass all of the places such as states and cities that roll up to the US as the place_country operator does.
 

How to filter for Tweets at the state/province level

In most cases, "region" represents the state/province level, so the following two operators can help us collect Tweets at the state/province level:

  1. place:
  2. profile_region:

We'll use California as our example state that we want to limit our results to. The following are a few options to accomplish this:

      Filter for geo-tagged Tweets sent from California
"value": "place:CA"

    

Note: "place:CA" will capture all Tweets tagged at the city level (e.g., "Los Angeles, CA" or "San Francisco, CA"). We assign the value "CA" to the "place:" operator to do a substring match on what's contained within the "place.full_name" field, for example:

"full_name": "West Hollywood, CA"

Of all geo-tagged Tweets, over 80% are tagged at the city-level. This will also capture Tweets with precise location (lat, long) as Twitter reverse geocodes it, so it includes the city and state in the "place.full_name" object.

      Filter for Tweets tagged with the exact place ID for California
"value": "place:fbd6d2f5a4e4a15e"

Filter for Tweets from accounts in California
"value": "profile_region:California"

    

Now, let's add in the additional clauses to create a rule value that encompasses our use case and compare the results (note: the results represent Tweet volume from March-June 2021):

      "value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") place:CA"

Results: ~700

"value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") place:fbd6d2f5a4e4a15e"

Results: ~30

"value": "((@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") profile_region:California) -is:retweet"

Results: ~17,100

    

As seen above, the results increased significantly again with the profile location clause (profile_region) instead of the Tweet-level location clause (place:CA). This is expected, as 30-40% of Tweets contain profile location information, whereas only 1-2% of Tweets are geo-tagged. While roughly 3x more Tweets matched the specific place ID for California (~30) than matched the place ID for the United States in the previous section (~10), the volume was still nominal compared to the other filter options.

Important notes about filtering for the "state" level:

  • The "place:CA" clause will NOT match Tweets tagged at the neighborhood level or tagged with a point of interest (POI). These would need to be captured by using bounding boxes (covered in the next section) or with additional place operator clauses that reference the city name.
  • For example, "place:Los Angeles" will match on this pattern: "place.full_name": "{Neighborhood}, Los Angeles"
  • The clause "place:California" will only match Tweets tagged at the state level ("place.full_name": "California, USA"). It won't match the majority of Tweets tagged at the city level, because their place object doesn't contain the word "California" (for example, "place.full_name": "San Francisco, CA").
     

How to filter for Tweets at the city level

Locality generally refers to the "city level". The following operators can help us collect Tweets at the city level:

  1. place:
  2. point_radius:
  3. bounding_box:
  4. profile_locality:

To continue with the West Coast (US) example, we'll use San Francisco as the example city here. The following are a few options to help us filter results to San Francisco, CA:

      Filter for geo-tagged Tweets in San Francisco, CA
"value": "place:\"San Francisco, CA\""

Filter for Tweets tagged with the exact place ID for San Francisco
"value": "place:5a110d312052166f"

Filter for Tweets tagged with an exact location (x,y) or a place
object that is fully contained within the defined region
"value": "point_radius:[-122.4461400159226 37.759828999999996 25mi]"

    

Tip: The response from the GET geo/search endpoint includes a "centroid" object that gives you the center coordinates (long, lat) of a given place. We’re using those centroid values for the point_radius long/lat arguments in the clause above.
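One way to turn a centroid into a point_radius clause is sketched below. The function name is hypothetical, and the 25-mile cap reflects the largest radius used in this tutorial's examples rather than a statement of the operator's exact limit.

```python
from typing import Sequence


def point_radius_clause(centroid: Sequence[float], radius_mi: float) -> str:
    """Format a point_radius clause from a [LONG, LAT] centroid."""
    lon, lat = centroid
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("centroid must be ordered [LONG, LAT] in decimal degrees")
    if not (0 < radius_mi <= 25):
        # Assumed cap, based on the largest radius used in this tutorial.
        raise ValueError("radius must be greater than 0 and at most 25 miles")
    # repr() keeps the full precision of the coordinates; :g drops a
    # trailing ".0" from whole-number radii.
    return f"point_radius:[{lon!r} {lat!r} {radius_mi:g}mi]"


print(point_radius_clause([-122.5, 37.75], 25))
print(point_radius_clause([-122.5, 37.75], 0.4))
```

Validating the coordinate ranges catches the most common mistake, which is passing the pair in (lat, long) order for locations where the longitude falls outside the valid latitude range.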

      Filter for Tweets tagged with an exact location (x,y) or a place
object that is fully contained within the defined region
"value": "bounding_box:[-122.521071 37.701803 -122.350457 37.817365]"

Filter for Tweets from accounts in San Francisco
"value": "profile_locality:\"San Francisco\""
    

Now, let's add in the additional clauses to create a rule value that encompasses our use case and compare the results (note: the results represent Tweet volume from March-June 2021):

      "value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") place:\"San Francisco, CA\""

Results: ~50

"value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") place:5a110d312052166f"

Results: ~50

"value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") point_radius:[-122.4461400159226 37.759828999999996 25mi]"

Results: ~300

"value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") bounding_box:[-122.521071 37.701803 -122.350457 37.817365]"

Results: ~200

    

Tip: This online tool allows you to select a visual bounding box on a map. If you select TSV format (bottom left), it will give you the decimal degree coordinates in the correct order to provide as arguments to the bounding box operator (simply copy/paste and change the tabs to a space).
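The copy/paste step described in the tip can also be scripted. The sketch below assumes the TSV line holds four values already ordered west, south, east, north, and it keeps the original text of each number so no precision is lost. The function name is illustrative.

```python
def bounding_box_from_tsv(tsv_line: str) -> str:
    """Convert a tab-separated 'west south east north' line into a clause."""
    fields = [field.strip() for field in tsv_line.strip().split("\t")]
    if len(fields) != 4:
        raise ValueError("expected four tab-separated values")
    # Parse only to validate; the original strings are reused in the output.
    west, south, east, north = (float(field) for field in fields)
    if not (west < east and south < north):
        raise ValueError("expected order: west, south, east, north")
    return "bounding_box:[" + " ".join(fields) + "]"


print(bounding_box_from_tsv("-122.521071\t37.701803\t-122.350457\t37.817365"))
```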

      "value": "(@Tokyo2020 OR @Olympics OR #Tokyo2020 OR olympics OR \"olympic games\" OR \"olympic team\") profile_locality:\"San Francisco\" -is:retweet"

Results: ~1650
    

How to filter for Tweets at the neighborhood level

Neighborhood level filtering may be useful for very granular filtering where city-level is too broad. The following operators can help us collect Tweets at the neighborhood level:

  1. place:
  2. point_radius:
  3. bounding_box:
  4. bio_location:

To continue with our theme, we'll use neighborhoods in the San Francisco area for the examples below. The following are a few options to help us filter results to the SoMa neighborhood in San Francisco (note: the results represent Tweet volume from March-June 2021):

Note: given the level of granularity here and the low likelihood of Tweets matching both the geo clause and use case clauses (e.g., the Olympics), we’ve run the numbers without the use case clauses:

      Filter for geo-tagged Tweets in SoMa, San Francisco
"value": "place:\"South of market, San Francisco\" OR place:\"SoMa, San Francisco\""

Results: ~1000

    

Tip: Neighborhoods can sometimes have multiple names referring to the same place. In the example above, "SoMa" is short for "South of Market" and is actually tagged more commonly on Twitter than its longer alternative. Combining the clauses with an OR ensures that the rule captures both permutations of the same place.

      Filter for Tweets tagged with the exact place ID for SoMa and
South of Market (they have unique place IDs)
"value": "place:2b6ff8c22edd9576 OR place:1d019624e6b4dcff"

Results: ~1000

Filter for Tweets tagged with an exact location (x,y) or a place
object that is fully contained within the defined region
"value": "point_radius:[-122.40848289157051 37.77823196999999 0.4mi]"

Results: ~2200
    

Tip: The response from the GET geo/search endpoint includes a "centroid" object that gives you the center coordinates (long, lat) of a given place. We're using those centroid values for the point_radius long/lat arguments in the clause above.

      Same as above but slightly expanded the radius value by 0.1 mi
"value": "point_radius:[-122.40848289157051 37.77823196999999 0.5mi]"

Results: ~10700

Filter for Tweets tagged with an exact location (x,y) or a place
object that is fully contained within the defined region
"value": "bounding_box:[-122.418875 37.765248 -122.38037 37.795198]"

Results: ~9900
    
[Figure: visual plot of the SoMa, San Francisco bounding box region on a map]
      Filter for Tweets from accounts with a bio location of "SoMa"
"value": "bio_location:SoMa OR bio_location:\"South of Market\""

Results: ~100,000
    

Note: As written, the bio_location clauses above are less precise and may retrieve unwanted Tweets. For example, bio_location:SoMa may also match other locations such as Soma, a town in Turkey. You may want to add some negation clauses to the rule such as:

"value": "bio_location:SoMa -bio_location:Turkey"

Let’s break down the results above. You’ll notice that we got the same results using the place operator with a string location as we did with the place ID. These rules are identical in this instance, so you may use the text or the place ID as the results will be the same.

We used two options for the point_radius rule. The first uses a radius of 0.4 miles (a best guess of the region that encompasses the SoMa neighborhood) and the second uses a marginally larger radius (0.5 miles); however, the results are very different. The slightly expanded radius in the second rule yields nearly 5x more results. Keep this in mind as you define your rules with this operator, as small changes can yield large differences.
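To put the two radii in perspective, the short calculation below compares the areas of the two circles with the result counts reported above. The covered area grows by about 1.56x while the results grow by roughly 4.9x, so the jump is not explained by area alone. One plausible contributor, stated here as an assumption rather than a measured cause, is that a place only matches when its polygon is fully contained in the region, so a slightly larger circle can suddenly include entire places.

```python
import math


def circle_area_sq_mi(radius_mi: float) -> float:
    """Area of a circle, in square miles, for a radius given in miles."""
    return math.pi * radius_mi ** 2


area_small = circle_area_sq_mi(0.4)
area_large = circle_area_sq_mi(0.5)
area_ratio = area_large / area_small

# Result counts reported in this section for the 0.5mi and 0.4mi rules.
results_ratio = 10700 / 2200

print(f"area: {area_small:.2f} -> {area_large:.2f} sq mi ({area_ratio:.2f}x)")
print(f"results: {results_ratio:.1f}x")
```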

Lastly, you'll notice that the point_radius (0.5 miles) and bounding_box rules returned close to the same number of results. These operators are very similar to each other, with the difference being the shape of the area you define (circle vs rectangle). You may find bounding boxes are more useful in covering a larger region as you can chain them together, with each box covering up to 625 square miles (25 mi x 25 mi).

Tip: see this repository of scripts to help convert a large rectangular geographic area into smaller 25-mile square bounding boxes.
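To illustrate the idea behind such a conversion, here is a rough sketch that tiles a rectangle into smaller boxes. It is not the referenced scripts. It assumes roughly 69 miles per degree of latitude, scales longitude by the cosine of the latitude, uses a 24-mile default side to leave some margin under the limit, and does not handle regions that cross the antimeridian. All names are hypothetical.

```python
import math
from typing import List

MILES_PER_DEG_LAT = 69.0  # rough average; an assumption of this sketch


def tile_bounding_boxes(
    west: float, south: float, east: float, north: float, max_side_mi: float = 24.0
) -> List[str]:
    """Split a rectangle into bounding_box clauses with sides <= max_side_mi."""
    if not (west < east and south < north):
        raise ValueError("expected west < east and south < north")

    lat_step = max_side_mi / MILES_PER_DEG_LAT
    n_rows = max(1, math.ceil((north - south) / lat_step))
    row_height = (north - south) / n_rows

    clauses = []
    for row in range(n_rows):
        row_south = south + row * row_height
        row_north = row_south + row_height
        # A degree of longitude is widest at the edge of the row closest to
        # the equator, so size the columns for that latitude.
        if row_south <= 0.0 <= row_north:
            widest_lat = 0.0
        else:
            widest_lat = min(abs(row_south), abs(row_north))
        miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(widest_lat))
        lon_step = max_side_mi / miles_per_deg_lon
        n_cols = max(1, math.ceil((east - west) / lon_step))
        col_width = (east - west) / n_cols
        for col in range(n_cols):
            col_west = west + col * col_width
            col_east = col_west + col_width
            clauses.append(
                f"bounding_box:[{col_west:.6f} {row_south:.6f} "
                f"{col_east:.6f} {row_north:.6f}]"
            )
    return clauses


tiles = tile_bounding_boxes(-120.0, 34.0, -118.0, 36.0)
print(len(tiles))
print(tiles[0])
```

The resulting clauses can be joined with OR inside a rule's geo filter group, for example with " OR ".join(tiles).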
 

Conclusion

This tutorial reviewed the geographic metadata found in Tweets and the operators available to filter on geo attributes, then explored real-world use cases showing how to filter for Tweets from the country level down to the neighborhood level. For additional information on some of the topics covered in this tutorial, please review the following resources:

  1. Available operators (PowerTrack)
  2. Data dictionary: Geo objects
  3. Profile geo
  4. Bounding boxes (scripts)
  5. Tweet Location FAQs (for the consumer experience)