
Step 3: Building initial filters

 


This is step 3 of the learning path, How to detect signal from noise and build powerful filtering rules.

 

As you think about what Tweets matter to you, don’t worry at first about writing rules with the correct syntax and operators. Instead, start by writing “pseudo rules” in a format that comes naturally to you. This will enable you to identify and really understand what signal you’re looking for. 

The table below demonstrates how you might start listing filters that matter to you. In this example, we want to analyze the conversation around veganism and vegan foods/diets in the US, among users who are influential on Twitter. On the left-hand side you will find examples of “pseudo rules,” and on the right-hand side these have been translated using the correct rule logic and syntax. 

At the bottom of the table, we have combined the different segments together to form one single rule. Remember that each rule can be up to 2,048 characters long, with no limit on the number of clauses or operators. With the PowerTrack API, you can have thousands of concurrent rules on one single stream. 
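As a small illustration of the length limit mentioned above, the sketch below checks a rule against the 2,048-character ceiling before it is submitted. The helper name `check_rule_length` is hypothetical, and it assumes the limit is counted the same way Python's `len()` counts characters, which the text above does not confirm.

```python
MAX_RULE_LENGTH = 2048  # per-rule character limit stated in the text above


def check_rule_length(rule: str, limit: int = MAX_RULE_LENGTH) -> int:
    """Return the remaining character budget; raise if the rule is too long.

    Assumption: the limit is measured like len(rule) in Python.
    """
    remaining = limit - len(rule)
    if remaining < 0:
        raise ValueError(
            f"Rule is {-remaining} characters over the {limit}-character limit"
        )
    return remaining


rule = "(bio:vegan OR bio:veganism) followers_count:5000 lang:en"
print(f"{check_rule_length(rule)} characters left")
```

Running a check like this while you assemble segments makes it easier to see how much room is left for further clauses.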

| # | Pseudo rule in “human readable” format | Translation to rule logic with correct syntax |
|---|---|---|
| | Accounts of interest | |
| 1 | Accounts where bio contains the terms: “vegan” or “veganism” | bio:vegan OR bio:veganism |
| 2 | Influential accounts - in this example, we define this as having a substantial follower base (5,000+) | followers_count:5000 |
| 3 | Accounts that are (or have been) active, and have 1,000+ Tweets on their timeline | statuses_count:1000 |
| 4 | Influential accounts - in this example, we define this as having been added to 20+ Lists on Twitter | listed_count:20 |
| | Keywords and exact phrase filters (matching on Tweet content) | |
| 5 | Filter through Tweets that contain the following keywords: vegan, vegans, veganism, veganuary. (Please note the simplified version on row 6) | vegan OR vegans OR veganism OR veganuary |
| 6 | Simplified version of row 5 | contains:vegan |
| 7 | Filter through the keywords “environment” and “earth,” when they are no more than 3 tokens apart from the terms “food” or “diet.” | "food environment"~3 OR "diet environment"~3 OR "food earth"~3 OR "diet earth"~3 |
| 8 | Filter through relevant hashtags (standalone) | #plantbaseddiet OR #veganlifestyle OR #govegan OR #veganrecipes OR #whatveganseat OR #vegancommunity OR #veganfortheanimals |
| 9 | Filter through relevant hashtags (only when used in the specific context of vegan food and/or diets) | (contains:vegan OR contains:food OR contains:diet) (#fortheanimals OR #crueltyfree OR #plantbased) |
| 10 | Limit the search to Tweets in English | lang:en |
| | Tweet attributes or Tweet types | |
| 11 | Only include original Tweets | -is:retweet -is:reply -is:quote |
| | Geo attributes | |
| 12 | Users who are in the US. Please note: most Tweets do not have geo attributes attached to them and using these geo filtering operators will significantly narrow down the scope of our request. | (profile_country:US OR place_country:US) |
| | Combine the above groupings into one rule (see the rule breakdown under "Validating filter logic" below for more details on how to do this) | ((contains:vegan OR \"food environment\"~3 OR \"diet environment\"~3 OR \"food earth\"~3 OR \"diet earth\"~3 OR #plantbaseddiet OR #veganlifestyle OR #govegan OR #veganrecipes OR #whatveganseat OR #vegancommunity OR #veganfortheanimals) OR (contains:vegan OR contains:food OR contains:diet) (#fortheanimals OR #crueltyfree OR #plantbased)) (profile_country:US OR place_country:US) (bio:vegan OR bio:veganism) followers_count:5000 statuses_count:1000 listed_count:20 lang:en -is:retweet -is:reply -is:quote |

 

Please note the addition of the backslash (\) to escape quotation marks. Rules are submitted to the Rules API as JSON string values, so you will have to escape quotation marks in this way when you add your rules to PowerTrack.
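If you build the request body programmatically, a JSON serializer can take care of the escaping for you. The sketch below is one way to do that in Python. It assumes the Rules API accepts a body shaped like `{"rules": [{"value": ..., "tag": ...}]}`, which mirrors the `value` and `tag` fields shown in the validation responses later on this page; the helper name `build_rules_payload` is hypothetical.

```python
import json


def build_rules_payload(rule_value: str, tag: str) -> str:
    """Serialize one rule into a JSON request body.

    Assumption: the body shape {"rules": [{"value": ..., "tag": ...}]}.
    json.dumps escapes the double quotes inside the rule automatically.
    """
    return json.dumps({"rules": [{"value": rule_value, "tag": tag}]})


# Write the rule with plain quotation marks; no manual backslashes needed.
rule = '"food environment"~3 OR "diet environment"~3'
payload = build_rules_payload(rule, "twitter-api-1")
print(payload)
```

The printed body contains `\"food environment\"~3`, the same escaped form used in the combined rule above, while the Python string itself stays readable.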

 


Validating filter logic

At a high level, this rule can be broken down into ten groupings (listed below), all of which are combined with “AND” logic (expressed by whitespace between them). This means that a Tweet must satisfy the conditions in all ten groupings to match your rule.

1. ((contains:vegan OR \"food environment\"~3 OR \"diet environment\"~3 OR \"food earth\"~3 OR \"diet earth\"~3 OR #plantbaseddiet OR #veganlifestyle OR #govegan OR #veganrecipes OR #whatveganseat OR #vegancommunity OR #veganfortheanimals) OR (contains:vegan OR contains:food OR contains:diet) (#fortheanimals OR #crueltyfree OR #plantbased))
2. (profile_country:US OR place_country:US)
3. (bio:vegan OR bio:veganism)
4. followers_count:5000
5. statuses_count:1000
6. listed_count:20
7. lang:en
8. -is:retweet
9. -is:reply
10. -is:quote
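The manual breakdown above can also be approximated in code. The sketch below splits a rule on whitespace that sits outside parentheses and quotation marks, which is where the “AND” boundaries fall in this example. The function name `split_top_level` is hypothetical, it is not part of any PowerTrack tooling, and it expects the rule in its plain form (without the JSON backslash escapes). A rule with an “OR” at the top level would come back with `OR` as its own item, so treat this as a reading aid rather than a parser.

```python
def split_top_level(rule: str) -> list[str]:
    """Split a rule on whitespace outside parentheses and quotation marks."""
    groups, current = [], []
    depth, in_quotes = 0, False
    for ch in rule:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        if ch.isspace() and depth == 0 and not in_quotes:
            # Top-level whitespace closes the current grouping.
            if current:
                groups.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        groups.append("".join(current))
    return groups


# Shortened version of the combined rule, for readability.
rule = (
    '((contains:vegan OR "food earth"~3) OR (contains:food) (#plantbased)) '
    "(profile_country:US OR place_country:US) (bio:vegan OR bio:veganism) "
    "followers_count:5000 statuses_count:1000 listed_count:20 lang:en "
    "-is:retweet -is:reply -is:quote"
)
for number, group in enumerate(split_top_level(rule), start=1):
    print(number, group)
```

For this shortened rule the loop prints ten groupings, matching the numbered list above.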


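If you take the first grouping, you’ll notice that it can itself be divided into two groupings that are separated by “OR” logic (see below). Because “AND” logic (whitespace) is applied before “OR” logic, the two parenthesized clauses that follow the “OR” are joined together first and form the second grouping. This means that only one of these two groupings needs to be satisfied for a Tweet to match.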

1. (contains:vegan OR \"food environment\"~3 OR \"diet environment\"~3 OR \"food earth\"~3 OR \"diet earth\"~3 OR #plantbaseddiet OR #veganlifestyle OR #govegan OR #veganrecipes OR #whatveganseat OR #vegancommunity OR #veganfortheanimals)

OR

2. (contains:vegan OR contains:food OR contains:diet) (#fortheanimals OR #crueltyfree OR #plantbased)


When troubleshooting a rule (for example, if you need to understand why a Tweet did or did not match), it can be helpful to break the rule down manually, as demonstrated in the example above. This helps you pinpoint which section of the rule is returning unwanted data or excluding valuable data.
 

Validating rule syntax

With the PowerTrack Rules API, you can use the POST /validation endpoint to validate the syntax of your rule. 

This endpoint will highlight syntax errors, such as the lowercase "or" (which should be an uppercase OR) in the example below: 

{
  "summary": {
    "valid": 0,
    "not_valid": 1
  },
  "detail": [
    {
      "rule": {
        "value": "((contains:vegan or \"food environment\"~3 OR \"diet environment\"~3 OR \"food earth\"~3 OR \"diet earth\"~3 OR #plantbaseddiet OR #veganlifestyle OR #govegan OR #veganrecipes OR #whatveganseat OR #vegancommunity OR #veganfortheanimals) OR (contains:vegan OR contains:food OR contains:diet) (#fortheanimals OR #crueltyfree OR #plantbased)) (profile_country:US OR place_country:US) (bio:vegan OR bio:veganism) followers_count:5000 statuses_count:1000 listed_count:20 lang:en -is:retweet -is:reply -is:quote",
        "tag": "twitter-api-1"
      },
      "valid": false,
      "message": "Ambiguous use of or as a keyword. Use OR to logically join two clauses, or \"or\" to find occurrences of or in text (at position 18)\n"
    }
  ],
  "sent": "2021-05-10T11:12:53.335Z"
}

    


The API will return a "valid": true response when your rule is syntactically correct:
 

{
  "summary": {
    "valid": 1,
    "not_valid": 0
  },
  "detail": [
    {
      "rule": {
        "value": "((contains:vegan OR \"food environment\"~3 OR \"diet environment\"~3 OR \"food earth\"~3 OR \"diet earth\"~3 OR #plantbaseddiet OR #veganlifestyle OR #govegan OR #veganrecipes OR #whatveganseat OR #vegancommunity OR #veganfortheanimals) OR (contains:vegan OR contains:food OR contains:diet) (#fortheanimals OR #crueltyfree OR #plantbased)) (profile_country:US OR place_country:US) (bio:vegan OR bio:veganism) followers_count:5000 statuses_count:1000 listed_count:20 lang:en -is:retweet -is:reply -is:quote",
        "tag": "twitter-api-1"
      },
      "valid": true
    }
  ],
  "sent": "2021-05-10T11:13:16.354Z"
}
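
When you validate many rules at once, it can help to pull the failures out of the response programmatically. The sketch below reads a response shaped like the two samples above and returns the tag and message of each rule marked as not valid. It only parses the response body; sending the request is left out, and the helper name `invalid_rules` is hypothetical. The rule value in the sample is shortened for readability.

```python
import json


def invalid_rules(response_text: str) -> list[tuple[str, str]]:
    """Return (tag, message) pairs for every rule the response marks invalid.

    Assumes the response layout shown in the samples above:
    a "detail" list whose entries carry "rule", "valid" and "message".
    """
    body = json.loads(response_text)
    problems = []
    for entry in body.get("detail", []):
        if not entry.get("valid", False):
            tag = entry.get("rule", {}).get("tag", "")
            problems.append((tag, entry.get("message", "").strip()))
    return problems


# Sample response modeled on the invalid example above (rule shortened).
sample = json.dumps({
    "summary": {"valid": 0, "not_valid": 1},
    "detail": [{
        "rule": {"value": "((contains:vegan or #govegan))", "tag": "twitter-api-1"},
        "valid": False,
        "message": "Ambiguous use of or as a keyword. Use OR to logically "
                   "join two clauses, or \"or\" to find occurrences of or "
                   "in text (at position 18)\n",
    }],
})

for tag, message in invalid_rules(sample):
    print(f"{tag}: {message}")
```

An empty list means every rule in the response passed syntax validation.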