Counting characters

Twitter limits Tweet length to a specific number of characters for display (historically the well-known 140 characters, although this limit has been made more flexible over time). This makes the definition of a “character”, and how characters are counted, central to any Twitter application. This page aims to explain how the Twitter server-side code counts characters for the Tweet length restriction. The examples on this page are provided mostly in the Ruby programming language, but the concepts should be applicable to all languages.

For programmers with experience in Unicode processing, the short answer to the question is that Tweet length is measured by the number of codepoints in the NFC-normalized version of the text. If that’s too geeky, keep reading and we’ll explain.

Note that the current best definition of the algorithm for counting characters on Twitter is described in the page on the twitter-text Tweet parsing library. This documentation will be refreshed shortly to reflect the most recent changes.

See How Twitter wraps URLs with t.co for more information on how URLs affect character counting.

Twitter Character Encoding

All Twitter attributes accept UTF-8 encoded text via the API. All other encodings must be converted to UTF-8 before sending them to Twitter in order to guarantee that the data is not corrupted.
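
For example, converting Latin-1 input before sending it to the API might look like the following hedged sketch (assuming Ruby 1.9 or later, where String#encode and String#force_encoding are built in):

    latin1_text = "caf\xE9".force_encoding("ISO-8859-1") # "café" encoded as Latin-1
    utf8_text   = latin1_text.encode("UTF-8")            # convert before sending to Twitter
    utf8_text.bytes.map { |b| format("0x%02X", b) }
    # => ["0x63", "0x61", "0x66", "0xC3", "0xA9"]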

Definition of a Character

While Wikipedia has an article on Character (computing), it offers a very technical and purposely vague definition. The definition we’re interested in here is not the general definition of a character in computing but rather what “character” means when we say “a specific number of characters”.

For many Tweets, all characters are single-byte, and this page is of no use: the number of characters in a Tweet is effectively equal to the byte length of the text. If you use anything beyond the most basic letters, numbers, and punctuation, the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same but use a different number of bytes:

café    0x63 0x61 0x66 0xC3 0xA9         Using the single “é” character, called the “composed” character.
café    0x63 0x61 0x66 0x65 0xCC 0x81    Using “e” plus the combining diacritical mark, which overlaps the “e”.
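
You can see the two representations directly in Ruby. A minimal sketch, assuming Ruby 1.9 or later (where strings are encoding-aware):

    composed   = "caf\u00E9"  # "é" as the single precomposed codepoint U+00E9
    decomposed = "cafe\u0301" # "e" followed by U+0301 COMBINING ACUTE ACCENT

    composed == decomposed    # => false, the underlying byte sequences differ
    composed.bytes.map   { |b| format("0x%02X", b) }
    # => ["0x63", "0x61", "0x66", "0xC3", "0xA9"]
    decomposed.bytes.map { |b| format("0x%02X", b) }
    # => ["0x63", "0x61", "0x66", "0x65", "0xCC", "0x81"]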

Counting Characters

General Concepts

The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented, this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact that we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.

Nearly all user input methods automatically convert the longer combining-mark version into the composed version, but the Twitter API cannot count on that. Even if it could, the byte length of the “é” character is two bytes rather than the one byte you might expect. Below there is some more specific information on how to get this information out of Ruby/Rails, but first we’ll cover the general concepts that should be available in any language.

The Unicode Standard covers much more than a listing of characters with numbers associated. Unicode does provide such a list of “codepoints” (more info at http://www.unicode.org/charts/), which is the U+XXXX notation you sometimes see. The Unicode Standard also provides several different ways to encode those codepoints (UTF-8 and UTF-16 are examples, but there are others). The Unicode Standard also provides detailed information on how to deal with character issues such as Sorting, Regular Expressions and, of importance to this issue, Normalization.
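
For example, here is how codepoints and their U+XXXX labels relate to UTF-8 bytes, as a hedged Ruby 1.9+ sketch:

    word = "caf\u00E9"   # the composed form of "café"
    word.codepoints.to_a # => [99, 97, 102, 233]
    word.codepoints.map { |cp| format("U+%04X", cp) }
    # => ["U+0063", "U+0061", "U+0066", "U+00E9"]
    word.bytesize        # => 5 UTF-8 bytes for those 4 codepoints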

Combining Diacritical Marks - A Prelude to Normalization

So, back in the café, the issue of multiple byte sequences having the same on-screen representation was breezed right by. There is an entire section of the Unicode tables devoted to the “Combining Diacritical Marks” (see that Unicode “block” at http://www.unicode.org/charts/PDF/U0300.pdf). These are not stand-alone characters but rather additional “diacritical marks” used together with base characters in many languages. For example, the ¨ over the ü, common to German, or the ˜ over the ñ in Spanish. There are a great many combinations needed to cover all the languages in the world, so Unicode provides some simple building blocks: the Combining Diacritical Marks.

For the most common characters (like é, ü and company) there is also a character just for the combination. The reasons for that are mostly historical, but since they exist it’s something we’ll always need to be aware of. This historical oddity is the exact reason for the two “café” representations. If you look back at the representations you’ll see one uses 0x65 0xCC 0x81, where 0x65 is simply the letter “e” and 0xCC 0x81 is the Combining Diacritical Mark for ´. Since there are multiple ways to represent the same thing in Unicode, the Unicode Standard provides information on how to normalize the different representations.
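
To make the building-block idea concrete, here is a hedged Ruby 1.9+ sketch showing the same mechanism producing several accented letters:

    "u\u0308" # => "ü" ("u" + U+0308 COMBINING DIAERESIS)
    "n\u0303" # => "ñ" ("n" + U+0303 COMBINING TILDE)
    "e\u0301" # => "é" ("e" + U+0301 COMBINING ACUTE ACCENT)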

Unicode Normalization

The Unicode Standard describes two kinds of normalization equivalence, Canonical and Compatibility, with composed and decomposed forms of each (NFC, NFD, NFKC and NFKD). There is a full description of the different options in Unicode Standard Annex #15, the report on normalization. The normalization report is 32 pages and covers the issue in great detail. Reproducing the entire report here would be of very little use, so instead we’ll focus on the normalization Twitter uses.

Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the fully composed character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints (U+0065 and U+0301) encoded as three bytes.
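
As a sketch of this rule, assuming Ruby 2.2+ (which ships String#unicode_normalize; this is an illustration, not Twitter’s production code):

    composed   = "caf\u00E9"  # 4 codepoints, 5 UTF-8 bytes
    decomposed = "cafe\u0301" # 5 codepoints, 6 UTF-8 bytes

    decomposed.unicode_normalize(:nfc) == composed # => true

    # Count codepoints after NFC normalization; both forms measure 4.
    composed.unicode_normalize(:nfc).length   # => 4
    decomposed.unicode_normalize(:nfc).length # => 4

In Ruby 1.9+, String#length counts codepoints rather than bytes, which matches Twitter’s counting once the text has been NFC-normalized.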

Language Specific Information

Ruby Specific Information

In Ruby 1.8, multi-byte characters are supported via the ActiveSupport::Multibyte class. That class provides a method for Unicode normalization, but unfortunately the results of the length method are not intuitive. For some examples, check out http://gist.github.com/159484 … this was actually the script used when troubleshooting character-counting issues in the Twitter code base. After the experimentation shown in the gist above, we settled on the following code for checking Tweet length (minus other comments surrounding it, which are a short version of this page):

    class String
      # Tweet length as Twitter counts it: the number of characters in
      # the NFC-normalized (:c) version of the string.
      def display_length
        ActiveSupport::Multibyte::Chars.new(self).normalize(:c).length
      end
    end
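
A hedged usage sketch (the output assumes a Rails-era ActiveSupport in which Chars#normalize is available):

    "caf\u00E9".display_length  # => 4 (composed form)
    "cafe\u0301".display_length # => 4 (decomposed form is normalized first)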

Java Specific Information

Java is the language the Unicode Consortium used for its example implementation of normalization. The original code, along with an applet demo, can be found at http://unicode.org/reports/tr15/Normalizer.html.

Perl Specific Information (and a command line tool)

The World Wide Web Consortium (W3C) provides a command line tool written in Perl for performing character normalization. Information on the latest version of this tool can be found at http://www.w3.org/International/charlint.

PHP Specific Information

In PHP, normalization can be performed by the Normalizer class, which is part of the intl extension.