
What Is the Correct Locale Tag? en_US vs. en-US

How would you thank your user? How would you know what language they are using? And finally, how would you distinguish one language variant from another (US English from British English)? For all of these cases you need a locale tag. It is designed to store the information that matters for proper localization. There are several notations out there, but only one of them can be considered the true standard.

What Is a Locale Tag?

A locale is an identifier that carries information about the user's preferred language and region, plus optional data such as script or encoding. Usually it looks like this:

en-US

where en is the language and US is the region.

Locale tag can help you store user region preference

Why is that information important? Consider an application that displays color names and needs to be translated into other languages. Color is an English word, that's obvious, so tagging it with en should be enough.

Or is that colour?

Exactly. Even a single language can be spoken or spelled differently in different regions. If you want to meet users' expectations in those countries, you'll use color for the United States (en-US) and colour for Great Britain (en-GB). That's why you need a specific locale tag.
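To make the color/colour case concrete, here is a minimal sketch of a lookup that tries the full locale tag first and then falls back to the bare language. The MESSAGES catalog, the color_label key and the translate function are names I made up for illustration; they don't come from any particular library.

```python
# Hypothetical message catalog keyed by locale tag (illustrative data only).
MESSAGES = {
    "en": {"color_label": "Color"},
    "en-GB": {"color_label": "Colour"},
}


def translate(key: str, locale_tag: str) -> str:
    """Look up `key` for the full tag first, then for the bare language."""
    language = locale_tag.split("-")[0]
    for candidate in (locale_tag, language):
        catalog = MESSAGES.get(candidate, {})
        if key in catalog:
            return catalog[key]
    raise KeyError(f"no translation for {key!r} in {locale_tag!r}")


print(translate("color_label", "en-US"))  # no en-US entry, falls back to "en"
print(translate("color_label", "en-GB"))  # uses the en-GB entry
```

Keeping the generic en entry as the fallback means you only have to store the strings that actually differ between regions.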

Region also matters when configuring your application's date formats, currencies or time zones, which is one more reason to be specific about the user's locale tag.
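As a rough illustration of region-dependent formatting, the sketch below picks a date pattern by locale tag. The DATE_PATTERNS table and format_date are hypothetical names, and the table is hand-written for just two locales; a real application would more likely take such patterns from CLDR data than hard-code them.

```python
from datetime import date

# Hypothetical per-locale patterns (hand-written for this example).
DATE_PATTERNS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}


def format_date(value: date, locale_tag: str) -> str:
    """Format `value` with the pattern registered for `locale_tag`."""
    return value.strftime(DATE_PATTERNS[locale_tag])


day = date(2024, 3, 5)
print(format_date(day, "en-US"))  # 03/05/2024
print(format_date(day, "en-GB"))  # 05/03/2024
```

The same date reads as March 5th in one region and could be mistaken for May 3rd in the other, which is why the language subtag alone is not enough.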

en_US vs. en-US – Which Is Correct?

You have probably noticed that in the previous paragraphs I've been using the notation with a hyphen, but the underscore is also widespread. In fact, the two notations come from different standards, and it's easy to mix them up when working as a programmer.

For example, when you're configuring a Unix environment, you have to define the POSIX locale with an underscore, as en_US.UTF-8.

On the other hand, if you're setting the locale (called CultureInfo) in your C# app, it will be en-US.

We need to go deeper. Let's see what standards we have and where they are used.

IETF Language Tag

IETF language tags (defined in BCP 47) are used by other modern standards, such as those related to HTML, the World Wide Web Consortium and Unicode. I would call it the standard precisely because of Unicode (and those are the one and only standards you should use). A tag consists of subtags for language and country, delimited with a hyphen.

So en-US is the standard.

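In its simplest form, the syntax is a two- or three-letter ISO 639 language code followed by a region, given either as a two-letter ISO 3166-1 country code or as a three-digit UN M.49 area code (as in es-419).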
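Here is a small sketch that checks and normalizes such simple tags. It assumes a two- or three-letter language subtag and an optional region that is either two letters or three digits, and it applies the conventional casing (lowercase language, uppercase region). The name normalize_simple_tag is mine, and the function deliberately ignores scripts, variants and extensions.

```python
import re

# language: 2-3 letters; optional region: 2 letters or 3 digits.
_SIMPLE_TAG = re.compile(r"([A-Za-z]{2,3})(?:-([A-Za-z]{2}|[0-9]{3}))?")


def normalize_simple_tag(tag: str) -> str:
    """Return the tag with conventional casing, e.g. 'EN-us' -> 'en-US'."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language-region tag: {tag!r}")
    language, region = match.groups()
    language = language.lower()
    return language if region is None else f"{language}-{region.upper()}"


print(normalize_simple_tag("EN-us"))   # en-US
print(normalize_simple_tag("es-419"))  # es-419
```

Note that an underscore form such as en_US is rejected here with a ValueError, because the hyphen is the only delimiter in this notation.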

IETF tags are used in almost every modern technology, and you should use them too :)

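PS: There's a catch with Java, which can use an underscore (for example, Locale.toString() returns en_US), but that exists mainly for backwards compatibility; hyphenated tags are also accepted in the Java environment, for example through Locale.forLanguageTag.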

ISO 15897

ISO 15897 is the one used when defining a POSIX locale in a Unix environment. Unlike IETF, it separates language from region with an underscore. It can have (and usually does have) a third part, the preferred encoding, placed after a dot. For example, the POSIX locale used on my machine is:

pl_PL.UTF-8
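If you need to move between the two notations, a conversion can be sketched like this. The function name posix_to_ietf is hypothetical, and the sketch assumes the usual language_REGION.encoding@modifier layout. It simply drops the encoding and any modifier, which loses information for locales where the modifier carries the script (for example sr_RS@latin).

```python
def posix_to_ietf(posix_locale: str) -> str:
    """Convert e.g. 'pl_PL.UTF-8' to 'pl-PL' (encoding and modifier dropped)."""
    name = posix_locale.split("@", 1)[0]  # drop a modifier such as '@euro'
    name = name.split(".", 1)[0]          # drop the encoding such as '.UTF-8'
    if name in ("C", "POSIX"):
        raise ValueError("the C/POSIX locale has no language or region")
    language, _, region = name.partition("_")
    if region:
        return f"{language.lower()}-{region.upper()}"
    return language.lower()


print(posix_to_ietf("pl_PL.UTF-8"))  # pl-PL
print(posix_to_ietf("en_US.UTF-8"))  # en-US
```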

Microsoft’s LCID

It's hard to call it a standard, but it's good to know that such a thing exists. Microsoft products also need to handle locales, but instead of naming them with language and country codes, they operate on numeric identifiers, so every language-country pair is identified by a different number.

Numeric locale tags are standard in Microsoft Windows

For example, 1033 is the identifier for the English language in the US region. In hex it is 0409, and in binary:

0000 0100 0000 1001

The lower ten bits, 00 0000 1001, are the language identifier, and the six higher bits, 0000 01, denote the country (Microsoft calls this part the sublanguage). You can read the full LCID list on Microsoft's Go Global Developer Center.
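The bit arithmetic above can be checked with a few lines of code. This is only a sketch: split_lcid is a name I chose, and it looks at the lower 16 bits of the identifier only, which is the part described here.

```python
from typing import Tuple


def split_lcid(lcid: int) -> Tuple[int, int]:
    """Split the lower 16 bits into (language part, country part)."""
    language_part = lcid & 0x3FF         # lower ten bits
    country_part = (lcid >> 10) & 0x3F   # next six bits
    return language_part, country_part


lcid = 1033
print(f"{lcid:04X}")     # 0409
print(f"{lcid:016b}")    # 0000010000001001
print(split_lcid(lcid))  # (9, 1)
```

So 1033 breaks down into language 9 (English) and country part 1 (United States).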

What Other Information Can You Find in a Locale Tag?

So, now we know that a locale tag can store information about the language and region preferred by the user. That is usually enough, but it's not perfect. Even within one region, people may use different calendars (for example Gregorian vs. Islamic). So, according to the IETF U extension, we can add more information to our language tag, for example the calendar (ca) or the first day of the week (fw).

The full list of possible extensions is available in Unicode Technical Standard #35.

So, knowing that, how can we use it? Let's assume that I want to describe my preferences in one very specific locale tag. I would like my language to be English, my region to be Poland, my script to be Latin, my calendar to be Gregorian, and my first day of the week to be Sunday. Let's build it!

We already know the first part, en-PL, but where do we put the others?

The script goes between the language and the region:

en-Latn-PL

All other parameters must be placed after the -u- extension marker as key-value pairs, so ca-gregory will be my calendar and fw-sun my first day of the week:

en-Latn-PL-u-ca-gregory-fw-sun

That way you can build very long and specific locale tags that precisely reflect the user's preferences.
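The steps above can be sketched in code as a pair of helpers that build such a tag and take it apart again. Both build_tag and parse_tag are hypothetical names, and this is a simplified illustration rather than a full implementation of the standard: it assumes lowercase -u- as the only extension, one value per key, and no variant subtags.

```python
from typing import Dict, Optional


def build_tag(language: str, script: Optional[str] = None,
              region: Optional[str] = None,
              keywords: Optional[Dict[str, str]] = None) -> str:
    """Join the parts in the order language-Script-REGION-u-key-value."""
    parts = [language.lower()]
    if script:
        parts.append(script.title())
    if region:
        parts.append(region.upper())
    if keywords:
        parts.append("u")
        for key, value in sorted(keywords.items()):
            parts.extend([key.lower(), value.lower()])
    return "-".join(parts)


def parse_tag(tag: str) -> dict:
    """Split a tag produced by build_tag back into its parts."""
    main, _, extension = tag.partition("-u-")
    subtags = main.split("-")
    result = {"language": subtags[0].lower(), "script": None,
              "region": None, "keywords": {}}
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            result["script"] = subtag.title()
        elif (len(subtag) == 2 and subtag.isalpha()) or \
                (len(subtag) == 3 and subtag.isdigit()):
            result["region"] = subtag.upper()
    pieces = extension.split("-") if extension else []
    # Keywords are read as consecutive key-value pairs.
    for key, value in zip(pieces[0::2], pieces[1::2]):
        result["keywords"][key] = value
    return result


tag = build_tag("en", "Latn", "PL", {"ca": "gregory", "fw": "sun"})
print(tag)  # en-Latn-PL-u-ca-gregory-fw-sun
print(parse_tag(tag))
```

In production code I would reach for an existing locale library instead of hand-rolled parsing, since the full grammar has many more cases than this sketch covers.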

Why Should You Know That?

I'm pretty sure that we're living in times when the language barrier will be eliminated. We see globalization almost everywhere. Providing services for users from many regions with different preferences challenges us to let them express those preferences in a convenient way. Locale tags were created for exactly that reason.

Let's answer the question one more time: which locale is correct, en-US or en_US? Both are correct, but en-US is the Unicode standard and you should stick to it.


PS: A quick note here: I'm working on a series of articles about the L10n industry and all the services that help with the i18n process. If you're running (or just know of) any good translation, content quality or global search engine marketing service I can't leave out, please contact me :)

Image credits: Charis Tsevis (https://www.flickr.com/photos/tsevis/6127839974/, CC BY-NC-ND); woodleywonderworks (https://www.flickr.com/photos/wwworks/4759535950/, CC BY)