
Walkthrough: What this means in practice

 

This is the final article of the learning path, How to detect signal from noise and build powerful filtering rules.

 

In this example, we’d like to monitor all Tweets that mention the Twitter API or Twitter Developer Platform. 

  • We have identified a list of keywords, phrases, #hashtags, and URLs that are relevant in the context of the content we want to monitor. 
  • We own three official accounts (@TwitterAPI, @TwitterDev, and @AdsAPI) and we want to filter through all Tweets that mention these accounts.
     

This is a relatively challenging example because the keyword “Twitter” will match many Tweets, the majority of which are unrelated to the Twitter API. The keyword is matched not only against the Tweet text but also against internal fields, such as URL metadata, that contain the term “Twitter.” See Step 2 and Step 3 below for additional details and examples on filtering out unwanted Tweets and refining this filtering rule.
 

Step 1: Identifying “signal” and building an initial rule

Each numbered row gives the pseudo rule in “human readable” format. The “Rule:” line that follows one or more rows gives their translation to rule/logic with correct syntax.

Accounts of interest

  1 Tweets that mention any of the following accounts: @TwitterAPI @TwitterDev @AdsAPI
  Rule: @TwitterAPI OR @TwitterDev OR @AdsAPI

Keywords and exact phrase filters (matching on Tweet content)

  2 Twitter API or Twitter APIs
  3 Twitter developer platform or Twitter dev or TwitterDev
  Rule: "Twitter API" OR "Twitter APIs" OR "Twitter developer platform" OR "Twitter dev" OR TwitterDev

  4 API or APIs (in the context of Twitter)
  5 Endpoint or endpoints (in the context of Twitter)
  6 Developer or developers or dev or devs (in the context of Twitter)
  Rule: Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers OR dev OR devs)

  7 Any URL that contains “developer.twitter.com”
  8 Any URL that contains “twitterdevfeedback.uservoice.com”
  9 Any URL that contains “github.com/twitterdev”
  10 Any URL that contains “twitch.tv/twitterdev”
  11 Any URL that contains “dev.to/twitterdev”
  Rule: url:"developer.twitter.com" OR url:"twitterdevfeedback.uservoice.com" OR url:"github.com/twitterdev" OR url:"twitch.tv/twitterdev" OR url:"dev.to/twitterdev"

  12 #TwitterAPI
  13 #TwitterAPIv2
  14 Twitter and #v2
  15 Twitter and #EarlyAccess
  Rule: #TwitterAPI OR #TwitterAPIv2 OR Twitter (#v2 OR #EarlyAccess)

  16 Twitter and “v2” or “Labs”
  Rule: Twitter (v2 OR Labs)

  17 Enterprise API or Enterprise APIs (in the context of Twitter)
  18 Premium API or Premium APIs (in the context of Twitter)
  19 Standard API or Standard APIs (in the context of Twitter)
  20 Public API or Public APIs (in the context of Twitter)
  Rule: These will be captured by the following rule, which we have already outlined above:
  Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers)

  21 Tweet object or Tweet payload
  Rule: "Tweet object" OR "Tweet payload"

  22 User object (in the context of Twitter)
  Rule: Twitter "user object"

Tweet attributes or Tweet types

  23 None

Geo attributes

  24 None

Combining the above groupings into one rule:
 
Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers OR dev OR devs OR #v2 OR #EarlyAccess OR v2 OR Labs OR \"user object\") OR TwitterDev OR url:\"developer.twitter.com\" OR url:\"twitterdevfeedback.uservoice.com\" OR url:\"github.com/twitterdev\" OR url:\"twitch.tv/twitterdev\" OR url:\"dev.to/twitterdev\" OR #TwitterAPI OR #TwitterAPIv2 OR \"Tweet object\" OR \"Tweet payload\" OR @TwitterAPI OR @TwitterDev OR @AdsAPI


Note that, where possible, we merged groupings together. Specifically:

  1. We combined all keywords and “exact phrase matches” that need to be paired with the keyword “Twitter” into one set of parentheses. 
    Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers OR dev OR devs OR #v2 OR #EarlyAccess OR v2 OR Labs OR \"user object\")
    
  2. Most of the exact phrase matches identified in rows 2-3 are already covered by the rule portion outlined in rows 4-6. The only keyword that had to be added separately and combined with “OR” logic is “TwitterDev.”
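
The merging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `or_group` and the list names are hypothetical and do not come from any Twitter library. The quotes are written unescaped here because the rule is held in a plain Python string.

```python
def or_group(terms):
    """Join terms with OR and wrap the result in parentheses."""
    return "(" + " OR ".join(terms) + ")"


# Terms that only count as signal when paired with the keyword "Twitter"
# (rows 4-6, 14-16, and 22 of the table above).
twitter_context_terms = [
    "API", "APIs", "endpoint", "endpoints", "developer", "developers",
    "dev", "devs", "#v2", "#EarlyAccess", "v2", "Labs", '"user object"',
]

# Clauses that are specific enough to stand on their own.
standalone_clauses = [
    "TwitterDev",
    'url:"developer.twitter.com"',
    'url:"twitterdevfeedback.uservoice.com"',
    'url:"github.com/twitterdev"',
    'url:"twitch.tv/twitterdev"',
    'url:"dev.to/twitterdev"',
    "#TwitterAPI",
    "#TwitterAPIv2",
    '"Tweet object"',
    '"Tweet payload"',
    "@TwitterAPI",
    "@TwitterDev",
    "@AdsAPI",
]

# One parenthesized group paired with "Twitter", then everything else OR'ed on.
signal = " OR ".join(
    ["Twitter " + or_group(twitter_context_terms)] + standalone_clauses
)
print(signal)
```

Keeping the terms in lists makes it easier to add or drop a keyword later without editing one very long string by hand.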
     

Step 2: Identifying "noise"

For the use case in this example, we have identified the following as unwanted “noise.”

  • We are not interested in any requests for Twitter support; for example, Tweets relating to blocked, restricted, or suspended accounts.
  • We are not interested in verification requests.
  • Although we are interested in Tweets that mention the accounts @TwitterAPI, @TwitterDev, and @AdsAPI, we are not interested in Tweets from these accounts (as we own them).
  • We are not interested in automated content (in other words, we want to filter out Tweets from bots).
  • We want to filter out Retweets to avoid unnecessary noise. In other words, we do not need duplicate content, and we can use the Engagement API to track engagement metrics, such as number of Retweets for a given Tweet.
  • We are only interested in organic content (in other words, we want to filter out promoted Tweets).
     

Please note: What you might consider to be unwanted “noise” will very much depend on your use case and intended outcome.
 

Each entry gives the pseudo rule in “human readable” format, followed by a “Rule:” line with its translation to rule/logic with correct syntax.

Identifying common patterns of noise and creating narrow casted rules

  Filter out Tweets that mention the following terms: verify, verification, verified, blue badge, blocked, suspended, restricted, restrictions
  Rule: -contains:verif -"blue badge" -blocked -suspended -contains:restrict

  Filter out Tweets sent from @TwitterAPI @TwitterDev @AdsAPI
  Rule: -from:TwitterAPI -from:TwitterDev -from:AdsAPI

  Filter out Tweets that mention these accounts: @TwitterSupport, @verified
  Rule: -@TwitterSupport -@verified

Filter out Tweets from bots

  Filter out Tweets from users with a bio that contains the word “bot” or “TwitterBot”
  Rule: -bio:bot -bio_name:bot -bio_location:bot -bio:TwitterBot -bio_name:TwitterBot -bio_location:TwitterBot

  Filter out Tweets from accounts that follow 10 or fewer users (although bots may have a lot of followers, they themselves tend not to follow many users).
  Rule: -friends_count:0..10

Using attribute filters

  Filter out Retweets
  Rule: -is:retweet

  Filter out promoted content
  Rule: -is:nullcast


By combining all of the above, we get the following grouping, which will allow us to filter out unwanted Tweets.

-contains:verif -\"blue badge\" -blocked -suspended -contains:restrict -from:TwitterAPI -from:TwitterDev -from:AdsAPI -@TwitterSupport -@verified -bio:bot -bio_name:bot -bio_location:bot -bio:TwitterBot -bio_name:TwitterBot -bio_location:TwitterBot -friends_count:0..10 -is:retweet -is:nullcast 


Please note: It is considered best practice to not group together negations by applying the negating hyphen (-) to an entire group within parentheses. Instead, you should negate each individual operator, stringing them together with whitespace (in other words, “AND” logic), as exemplified above.
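
As a rough sketch of this practice, the noise grouping can be built by negating each operator on its own and joining the results with whitespace. The helper name `negate_each` is hypothetical, and the check that rejects parenthesized groups is an illustrative safeguard, not a rule enforced by any Twitter tooling.

```python
def negate_each(operators):
    """Prefix every operator with '-' and join with spaces (AND logic)."""
    for op in operators:
        # Refuse grouped or already-negated input so that the hyphen is
        # always applied to one individual operator at a time.
        if op.startswith("(") or op.startswith("-"):
            raise ValueError(f"negate individual operators, not {op!r}")
    return " ".join("-" + op for op in operators)


noise_operators = [
    "contains:verif", '"blue badge"', "blocked", "suspended",
    "contains:restrict",
    "from:TwitterAPI", "from:TwitterDev", "from:AdsAPI",
    "@TwitterSupport", "@verified",
    "bio:bot", "bio_name:bot", "bio_location:bot",
    "bio:TwitterBot", "bio_name:TwitterBot", "bio_location:TwitterBot",
    "friends_count:0..10",
    "is:retweet", "is:nullcast",
]

noise = negate_each(noise_operators)
print(noise)
```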
 

Step 3: Refining your rule

If we return to our simplified rule structure below: 

 (SIGNAL)      -NOISE
    ↑             ↑
Grouping 1    Grouping 2


We now have:

1 A first grouping with our signal (in other words, what we want to include)

Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers OR dev OR devs OR #v2 OR #EarlyAccess OR v2 OR Labs OR \"user object\") OR TwitterDev OR url:\"developer.twitter.com\" OR url:\"twitterdevfeedback.uservoice.com\" OR url:\"github.com/twitterdev\" OR url:\"twitch.tv/twitterdev\" OR url:\"dev.to/twitterdev\" OR #TwitterAPI OR #TwitterAPIv2 OR \"Tweet object\" OR \"Tweet payload\" OR @TwitterAPI OR @TwitterDev OR @AdsAPI
2 A second grouping with our noise (in other words, what we want to exclude)

-contains:verif -\"blue badge\" -blocked -suspended -contains:restrict -from:TwitterAPI -from:TwitterDev -from:AdsAPI -@TwitterSupport -@verified -bio:bot -bio_name:bot -bio_location:bot -bio:TwitterBot -bio_name:TwitterBot -bio_location:TwitterBot -friends_count:0..10 -is:retweet -is:nullcast
3 Leading to the following rule:

Please note: We added a set of parentheses around the positive clause to ensure that the negations are properly applied to the entirety of the rule.

(Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers OR dev OR devs OR #v2 OR #EarlyAccess OR v2 OR Labs OR \"user object\") OR TwitterDev OR url:\"developer.twitter.com\" OR url:\"twitterdevfeedback.uservoice.com\" OR url:\"github.com/twitterdev\" OR url:\"twitch.tv/twitterdev\" OR url:\"dev.to/twitterdev\" OR #TwitterAPI OR #TwitterAPIv2 OR \"Tweet object\" OR \"Tweet payload\" OR @TwitterAPI OR @TwitterDev OR @AdsAPI) -contains:verif -\"blue badge\" -blocked -suspended -contains:restrict -from:TwitterAPI -from:TwitterDev -from:AdsAPI -@TwitterSupport -@verified -bio:bot -bio_name:bot -bio_location:bot -bio:TwitterBot -bio_name:TwitterBot -bio_location:TwitterBot -friends_count:0..10 -is:retweet -is:nullcast
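
The (SIGNAL) -NOISE composition can be sketched as follows. The example uses deliberately shortened signal and noise strings to stay readable. The payload shape (`rules`, `value`, `tag`) and the tag name are assumptions about how a rule might be submitted as JSON, so check them against the rules endpoint documentation before relying on them. The sketch also suggests why the quotes in the rules above appear as \": that is what a double quote looks like once the rule is serialized as a JSON string.

```python
import json


def combine(signal, noise):
    """Wrap the positive clause in parentheses, then append the negations."""
    return "(" + signal + ") " + noise


# Shortened stand-ins for the full groupings shown above.
signal = 'Twitter (API OR "user object") OR "Tweet object" OR @TwitterDev'
noise = '-"blue badge" -is:retweet'

rule = combine(signal, noise)

# Assumed payload shape; the tag value is a hypothetical label.
payload = json.dumps({"rules": [{"value": rule, "tag": "twitter-api-mentions"}]})

print(rule)
print(payload)  # double quotes inside the rule now appear as \"
```

Without the outer parentheses, the negations would bind only to the last OR clause rather than to the whole positive grouping.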


An initial analysis of the data returned by the above rule shows that it returns unwanted data, and it quickly becomes clear that the scope of our rule needs to be reduced.

To run this initial analysis of our filtering rule, we reviewed the Tweet payloads as they were delivered by the PowerTrack API in real time. However, if your filtering rule is built with operators that are also available with the Search API, you can also use the Search API to review counts and/or Tweet payloads that match your query.

Specifically, the following section of the rule is problematic in this case.

Twitter (API OR APIs OR endpoint OR endpoints OR developer OR developers OR dev OR devs OR #v2 OR #EarlyAccess OR v2 OR Labs OR \"user object\")


This is because the keyword “Twitter” will match many Tweets on the platform: fields contained in the Tweet payload, such as URLs, will contain the term “Twitter,” even if the keyword “Twitter” is not explicitly mentioned in the Tweet text. As a result, the keyword “Twitter” becomes redundant in our rule, and any Tweet that contains the terms “API,” “endpoint,” “developer,” etc. will match this rule, even if they do not relate to “Twitter.”

At this point, there are two different ways of narrowing down the scope of this rule. 

1 We can use the “exact phrase match” operator to refine the above grouping:

\"Twitter API\" OR \"Twitter APIs\" OR \"Twitter endpoint\" OR \"Twitter endpoints\" OR \"Twitter developer\" OR \"Twitter developers\" OR \"Twitter dev\" OR \"Twitter devs\" OR \"Twitter v2\" OR \"Twitter Labs\" OR \"Twitter user object\" OR Twitter (#v2 OR #EarlyAccess)
2 Another solution would be to use the proximity parameter, which matches a Tweet where the specified keywords are no more than N tokens apart, to refine the above grouping. For example:

\"Twitter API\"~3 OR \"Twitter APIs\"~3 OR \"Twitter endpoint\"~3 OR \"Twitter endpoints\"~3 OR \"Twitter developer\"~3 OR \"Twitter developers\"~3 OR \"Twitter dev\"~3 OR \"Twitter devs\"~3 OR \"Twitter v2\"~3 OR \"Twitter Labs\"~3 OR \"Twitter user object\" OR Twitter (#v2 OR #EarlyAccess)
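
Both refinements follow a regular pattern, so they can be generated from the same keyword list. The sketch below is a minimal illustration: the helper names `exact_phrases` and `proximity_phrases` are hypothetical, and the distance of 3 simply mirrors the example above. As in that example, “user object” is kept as an exact phrase in both options.

```python
def exact_phrases(anchor, terms):
    """Build OR'ed exact phrase matches such as "Twitter API"."""
    return " OR ".join(f'"{anchor} {term}"' for term in terms)


def proximity_phrases(anchor, terms, distance):
    """Build OR'ed proximity matches such as "Twitter API"~3."""
    return " OR ".join(f'"{anchor} {term}"~{distance}' for term in terms)


terms = [
    "API", "APIs", "endpoint", "endpoints", "developer", "developers",
    "dev", "devs", "v2", "Labs",
]
hashtag_clause = "Twitter (#v2 OR #EarlyAccess)"

# Option 1: exact phrase matches only.
option_1 = " OR ".join(
    [exact_phrases("Twitter", terms + ["user object"]), hashtag_clause]
)

# Option 2: proximity matches, with "user object" left as an exact phrase.
option_2 = " OR ".join(
    [proximity_phrases("Twitter", terms, 3), '"Twitter user object"', hashtag_clause]
)

print(option_1)
print(option_2)
```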

Some additional testing and analysis of the data returned can help us decide which of the two options works best for our use case and ensure that we get the data we want to extract.

Refining your rules and overall ruleset is an iterative process, but it is well worth your time to ensure that you build robust filtering rules.
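
One way to get a feel for the difference before testing against real data is a small local comparison. The matcher below is a toy approximation written for this walkthrough. It is not how PowerTrack tokenizes or matches Tweets, and it assumes that “no more than N tokens apart” means the token positions differ by at most N, in either order. The sample texts are made up.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9#@_]+", text.lower())


def exact_phrase(text, phrase):
    """True if the phrase tokens appear consecutively and in order."""
    tokens, words = tokenize(text), tokenize(phrase)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def within_distance(text, first, second, max_distance):
    """True if both keywords occur at most max_distance positions apart."""
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= max_distance for i in first_pos for j in second_pos)


samples = [
    "Loving the new Twitter API docs",
    "The API that Twitter shipped today is great",
    "Our payments API has a new endpoint",
]

for text in samples:
    print(
        exact_phrase(text, "Twitter API"),
        within_distance(text, "Twitter", "API", 3),
        text,
    )
```

In this toy comparison, the second sample is caught only by the proximity check, which hints at the trade-off: proximity matching is more forgiving of word order and filler words, while exact phrases are stricter.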

 

 

 

That's it! You've completed all articles that are included in the learning path: How to detect signal from noise and build powerful filtering rules.

If you haven't already, please continue reading through our PowerTrack API or Search API resources. You might also want to consider exploring the new Twitter API v2 version of these endpoints, filtered stream and search Tweets.

 
