Question 1

Why does my emoji count as 2, 4 or even 7 characters?

Accepted Answer

It depends on what you mean by "character". Most emoji lie outside the Basic Multilingual Plane (BMP), so each one takes 2 UTF-16 code units (which is why .length is 2 in JavaScript) and 4 UTF-8 bytes. Compound emoji like 👩‍💻 are zero-width joiner (ZWJ) sequences: 👩‍💻 is 3 code points, 5 UTF-16 code units and 11 UTF-8 bytes, but a grapheme count of 1.
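As a rough illustration, the counts above can be reproduced in Python with nothing but the standard library. The helper names utf16_units and utf8_bytes are made up for this sketch; they are not part of any library.

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units (what JavaScript's .length reports)."""
    # "utf-16-le" is used so that no byte-order mark is added to the output.
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


grinning = "\U0001F600"                      # 😀, one code point outside the BMP
technologist = "\U0001F469\u200D\U0001F4BB"  # woman + zero-width joiner + laptop

print(len(grinning), utf16_units(grinning), utf8_bytes(grinning))              # 1 2 4
print(len(technologist), utf16_units(technologist), utf8_bytes(technologist))  # 3 5 11
```

Python's len() counts code points, so it gives a third number again. None of these is the grapheme count, which needs Unicode text segmentation rules that the standard library does not provide.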

Question 2

What's the difference between code points and characters?

Accepted Answer

A code point is a single Unicode value (like U+1F600 😀). A user-perceived character (a "grapheme cluster") may be made of several code points joined together: the family emoji 👨‍👩‍👧‍👦 is one grapheme but seven code points (four people and three zero-width joiners).
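To see the individual code points inside a single grapheme, one option is to list them with Python's unicodedata module. This is only a sketch; describe is a hypothetical helper name.

```python
import unicodedata

# Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def describe(s: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, Unicode name) for every code point in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in s]


print(len(family))  # 7 code points
for code, name in describe(family):
    print(code, name)
```

The listing alternates between the four people and the three ZERO WIDTH JOINER code points that glue them into one visible glyph.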

Question 3

Why are UTF-8 and UTF-16 byte counts different?

Accepted Answer

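They're different encodings. UTF-8 uses 1–4 bytes per code point and is efficient for ASCII, which takes 1 byte per character. UTF-16 uses 2 or 4 bytes per code point and is more compact for CJK, where most characters take 2 bytes instead of the 3 they need in UTF-8. For "café" (with é as the single code point U+00E9) UTF-8 is 5 bytes, UTF-16 is 8 bytes, not counting a byte order mark.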
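The comparison can be checked with a short Python sketch. The function name byte_lengths is invented for this example, and the Japanese and decomposed "café" strings are extra test inputs chosen here, not examples from the calculator itself.

```python
def byte_lengths(s: str) -> dict[str, int]:
    """Byte length of s in UTF-8 and in UTF-16 (without a byte order mark)."""
    return {
        "utf-8": len(s.encode("utf-8")),
        # "utf-16-le" omits the 2-byte BOM that plain "utf-16" would prepend.
        "utf-16": len(s.encode("utf-16-le")),
    }


print(byte_lengths("caf\u00E9"))   # precomposed é: {'utf-8': 5, 'utf-16': 8}
print(byte_lengths("cafe\u0301"))  # e + combining accent: {'utf-8': 6, 'utf-16': 10}
print(byte_lengths("日本語"))       # CJK: {'utf-8': 9, 'utf-16': 6}
```

The two spellings of "café" look identical on screen but differ in length, so normalizing input before measuring it can matter.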

Question 4

Which length should I use for a database VARCHAR limit?

Accepted Answer

It depends on the database and the column's character set, not on the collation. PostgreSQL VARCHAR(n) counts code points; MySQL VARCHAR(n) with utf8mb4 also counts code points, but the row size limit is in bytes and each character can take up to 4 of them. Use the UTF-8 byte count for "will it fit in a fixed-size field?" questions.
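For the "will it fit?" case, a minimal Python sketch might check the byte budget and, if needed, cut the string without leaving half a code point behind. Both function names and the byte limits are hypothetical; a real schema has its own limits.

```python
def fits_utf8(s: str, max_bytes: int) -> bool:
    """True if the UTF-8 encoding of s fits in max_bytes."""
    return len(s.encode("utf-8")) <= max_bytes


def truncate_utf8(s: str, max_bytes: int) -> str:
    """Cut s so its UTF-8 encoding fits in max_bytes without splitting a code point."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # Slicing bytes may leave an incomplete trailing sequence;
    # errors="ignore" drops those leftover bytes when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(fits_utf8("caf\u00E9", 4))             # False: é needs 2 bytes, 5 in total
print(truncate_utf8("caf\u00E9", 4))         # caf
print(truncate_utf8("a\U0001F600", 3))       # a (the 4-byte emoji does not fit)
```

This keeps code points whole, but it can still cut through a grapheme cluster such as a ZWJ emoji sequence, so treat it as a starting point only.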

String Length online

Get the exact character and byte length of any string in multiple encodings

How to Use the String Length Calculator

How the String Length Calculator Works

Frequently Asked Questions
