Three-letter Codes for Identifying Languages
One feature of Ethnologue since its inception as a database in 1971 has been a system of three-letter codes for uniquely identifying languages. These became part of the publication in 1984. In the interest of fostering the uniform identification of all the world's languages in information systems, beginning with the 14th edition (2000), SIL International has released the complete set of three-letter codes (plus indexing information involving countries and alternate names) as downloadable data tables that the public may incorporate into their own database applications and dynamic web sites. Prior to the publication of the 15th edition in 2005, Ethnologue worked in cooperation with the International Organization for Standardization (ISO) to create a new international standard for language codes. This was fully adopted in 2007 as ISO 639-3, Codes for the representation of names of languages — Part 3: Alpha‐3 code for comprehensive coverage of languages. The current downloadable tables are compatible with the latest updates to that standard. Examples of efforts that are already using these codes as a standard for language identification are the Open Language Archives Community and its participating archives.
Any application that makes use of these language identifiers is just one click away from access to the full language descriptions that are available in Ethnologue. That is, for any language identifier [abc] that may be stored in a database, an application may present a link to the following URL in order to give the user access to Ethnologue's description of that language:
https://ethnologue.com/language/abc
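As a minimal sketch of how an application might build that link, the following Python helper formats the URL from a stored identifier. The function name `ethnologue_language_url` and the validation rule (exactly three lowercase ASCII letters, matching the examples in this document) are assumptions made for illustration.

```python
import re

# Assumption: identifiers are exactly three lowercase ASCII letters,
# as in the examples shown in this document.
_CODE_PATTERN = re.compile(r"[a-z]{3}")


def ethnologue_language_url(code: str) -> str:
    """Return the Ethnologue language page URL for a three-letter identifier."""
    if not _CODE_PATTERN.fullmatch(code):
        raise ValueError(f"not a three-letter language identifier: {code!r}")
    return f"https://ethnologue.com/language/{code}"


print(ethnologue_language_url("aaa"))
```

Rejecting malformed input before building the link keeps stray values in a database from turning into broken URLs.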
The remainder of this document describes the relationship of the download tables to standards, explains their structure, gives some hints on how to use them, offers links for downloading them, and sets out their terms of use.
Relation to Standards
The 15th edition of Ethnologue (2005) marked an important milestone in the development of the language identifiers, namely, their emergence as part of the draft international standard, ISO 639-3. (See The History of Ethnologue for a fuller discussion of the history of the language identifiers.) ISO 639-3 was devised to enable the uniform identification of all known human languages in a wide range of applications, particularly information systems. It provides as complete an enumeration of languages as possible, including living, extinct, ancient, and constructed languages, whether major or minor. Ethnologue does not cover this entire scope; it seeks to catalog all known living languages, languages that have gone extinct since the inception of Ethnologue around 1950, and languages that have no native speakers but are still in use as a second language in certain communities. Long-extinct and constructed languages that fall outside this scope are documented by Linguist List.
The most widely used standard for identifying languages in Internet documents (such as in HTTP headers or HTML metadata or in the XML lang attribute) is BCP 47 of the Internet Engineering Task Force. In that standard, any three-letter identifier from ISO 639-3 is recognized as a valid language identifier. Thus any of the three-letter codes reported in Ethnologue is valid for use in Internet documents.
Structure of the Code Tables
Three files make up the package of data tables that SIL International releases in support of the ISO 639-3 standard for language identifiers. They are tab-delimited files in which each line represents one row of a database table. The characters are encoded in the 8-bit standard known as ISO 8859-1 (which is a subset of the default Windows code page 1252). See Downloading the Code Tables for the latest version of the tables.
LanguageCodes.tab | The complete list of three-letter language identifiers used in the current Ethnologue (along with name, primary country, and language status). |
CountryCodes.tab | The list of two-letter country codes that are used in the main language code table. |
LanguageIndex.tab | An index for finding languages by country and by all known names (including primary name, alternate names, and dialect names). |
The following declarations provide the formal definitions for SQL data tables into which the tab-delimited files can be loaded:
CREATE TABLE LanguageCodes (
   LangID      char(3) NOT NULL,      -- Three-letter code
   CountryID   char(2) NOT NULL,      -- Main country where used
   LangStatus  char(1) NOT NULL,      -- L(iving), (e)X(tinct)
   Name        varchar(75) NOT NULL   -- Primary name in that country
);

CREATE TABLE CountryCodes (
   CountryID   char(2) NOT NULL,      -- Two-letter code from ISO3166
   Name        varchar(75) NOT NULL,  -- Country name
   Area        varchar(10) NOT NULL   -- World area
);

CREATE TABLE LanguageIndex (
   LangID      char(3) NOT NULL,      -- Three-letter code for language
   CountryID   char(2) NOT NULL,      -- Country where this name is used
   NameType    char(2) NOT NULL,      -- L(anguage), LA(lternate),
                                      -- D(ialect), DA(lternate)
                                      -- LP,DP (a pejorative alternate)
   Name        varchar(75) NOT NULL   -- The name
);
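One possible way to load a tab-delimited file into such a table is sketched below with Python's standard `csv` and `sqlite3` modules. The helper name `load_tab_table` is hypothetical, SQLite is only one of many engines that could be used, and an in-memory string stands in for the real file. When opening the actual download, the encoding passed to `open()` should be checked against the files themselves, since this document mentions both ISO 8859-1 and UTF-8.

```python
import csv
import io
import sqlite3

# Schema for one table, following the declarations above.
SCHEMA = """
CREATE TABLE LanguageCodes (
    LangID     char(3) NOT NULL,
    CountryID  char(2) NOT NULL,
    LangStatus char(1) NOT NULL,
    Name       varchar(75) NOT NULL
);
"""


def load_tab_table(conn, table, columns, text_stream):
    """Insert rows from a tab-delimited stream, skipping the header line."""
    reader = csv.reader(text_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # the first line holds column names, not data
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    rows = [row for row in reader if row]  # ignore blank lines
    conn.executemany(sql, rows)
    return len(rows)


# Stand-in for a real file handle such as
# open("LanguageCodes.tab", encoding=..., newline="").
sample = io.StringIO(
    "LangID\tCountryID\tLangStatus\tName\n"
    "aaa\tNG\tL\tGhotuo\n"
    "aab\tNG\tL\tAlumu-Tesu\n"
    "aac\tPG\tL\tAri\n"
)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
loaded = load_tab_table(
    conn, "LanguageCodes", ["LangID", "CountryID", "LangStatus", "Name"], sample
)
print(loaded)
```

The table and column names are passed in by the caller rather than read from the header, so the sketch does not depend on the exact header spelling in the downloaded files.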
Using the Code Tables
LanguageCodes.tab lists the 7,600+ distinct language identifiers used in the current Ethnologue database. All values in the Name column are unique; in cases where distinct languages have the same name, a parenthetical disambiguator is added. The following shows the entries for the first six language identifiers:
LangID  CountryID  LangStatus  Name
------  ---------  ----------  -------------
aaa     NG         L           Ghotuo
aab     NG         L           Alumu-Tesu
aac     PG         L           Ari
aad     PG         L           Amal
aae     IT         L           Albanian, Arbëreshë
aaf     IN         L           Aranadan
We see that aaa and aab denote living languages spoken in Nigeria, aac and aad denote living languages spoken in Papua New Guinea, and so on. When a language is spoken in more than one country, the CountryID gives the country that is considered primary, usually the country of origin or the country where most of the speakers are located.
CountryCodes.tab lists the two-letter identifier and name for the countries reported on by Ethnologue. The codes are from the international standard known as ISO 3166-1 (1997. Codes for the representation of names of countries and their subdivisions—Part 1: Country codes. Geneva: International Organization for Standardization. http://www.din.de/gremien/nas/nabd/iso3166ma/). The following shows the entries for the first five codes in the list:
CountryID  Name                   Area
---------  ---------------------  ----------
AD         Andorra                Europe
AE         United Arab Emirates   Asia
AF         Afghanistan            Asia
AG         Antigua and Barbuda    Americas
AI         Anguilla               Americas
The CountryCodes.tab table can be used to narrow the search for an identifier to a particular country. The user would choose a country from the country list in order to select the appropriate country code. That code would then be used in a SQL query to restrict the language identifier list to just entries for that country. For instance, if the user were interested only in Afghanistan, the following SQL query would return just the table rows for that country:
SELECT * FROM LanguageCodes WHERE CountryID='AF'
Alternatively, the following link to the Ethnologue website could be used to generate a report listing all the languages for Afghanistan:
http://www.ethnologue.com/country/AF
LanguageIndex.tab documents over 55,000 distinct names used for the languages and their dialects. The entries in this index of names indicate in which country each name is used. The table thus contains over 70,000 records since many of the names are used in more than one country and some are used with more than one language or dialect. The following shows the entries in the name index for the first three language identifiers:
LangID  CountryID  NameType  Name
------  ---------  --------  -------------
aaa     NG         L         Ghotuo
aab     NG         D         Alumu
aab     NG         D         Tesu
aab     NG         DA        Arum
aab     NG         L         Alumu-Tesu
aab     NG         LA        Alumu
aab     NG         LA        Arum-Cesu
aab     NG         LA        Arum-Chessu
aab     NG         LA        Arum-Tesu
aac     PG         D         Serea
aac     PG         L         Ari
We see that aaa has just one name, Ghotuo; aab has four alternate names, two dialect names, and an alternate dialect name in addition to its primary name; aac has a dialect name in addition to the primary name of Ari.
The LanguageIndex.tab table would be used to implement a search by name. For instance, the following query returns the three-letter codes for all the languages that use the name xyz:
SELECT DISTINCT LangID FROM LanguageIndex
WHERE Name='xyz'
Note that DISTINCT is used since the same language could be known by the same name in multiple countries. To allow the user to verify that a proposed identifier is indeed the right one, the software would offer the following link to the Ethnologue website to see a report giving detailed information about the selected language (where abc is the proposed three-letter identifier):
http://www.ethnologue.com/language/abc
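The name search and the verification link can be combined in application code. The following is a rough sketch using Python's `sqlite3` module with a few rows copied from the name index excerpt above; the function names `build_index`, `find_language_ids`, and `verification_links` are hypothetical. The search term is passed as a bound parameter rather than pasted into the SQL string, which is a common precaution when the name comes from user input.

```python
import sqlite3

# Sample rows copied from the name index excerpt in this document.
INDEX_ROWS = [
    ("aaa", "NG", "L", "Ghotuo"),
    ("aab", "NG", "D", "Alumu"),
    ("aab", "NG", "D", "Tesu"),
    ("aab", "NG", "DA", "Arum"),
    ("aab", "NG", "L", "Alumu-Tesu"),
    ("aab", "NG", "LA", "Alumu"),
    ("aac", "PG", "D", "Serea"),
    ("aac", "PG", "L", "Ari"),
]


def build_index(rows):
    """Create an in-memory LanguageIndex table holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LanguageIndex ("
        "LangID char(3) NOT NULL, CountryID char(2) NOT NULL, "
        "NameType char(2) NOT NULL, Name varchar(75) NOT NULL)"
    )
    conn.executemany("INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)", rows)
    return conn


def find_language_ids(conn, name):
    """Return the distinct identifiers of languages known by this exact name."""
    cur = conn.execute(
        "SELECT DISTINCT LangID FROM LanguageIndex WHERE Name = ? ORDER BY LangID",
        (name,),
    )
    return [row[0] for row in cur]


def verification_links(conn, name):
    """Return one Ethnologue link per matching identifier."""
    return [
        f"http://www.ethnologue.com/language/{lang_id}"
        for lang_id in find_language_ids(conn, name)
    ]


conn = build_index(INDEX_ROWS)
print(find_language_ids(conn, "Alumu"))
print(verification_links(conn, "Ari"))
```

In the sample rows, Alumu appears twice for aab (once as a dialect name and once as an alternate language name), so DISTINCT also collapses that kind of repetition.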
Another application of the LanguageIndex.tab table is to find all the countries in which a given language is spoken. For instance, the following query returns the names of all the countries in which language abc is spoken:
SELECT DISTINCT C.Name
FROM CountryCodes AS C
JOIN LanguageIndex AS L ON C.CountryID=L.CountryID
WHERE L.LangID='abc'
In this case DISTINCT must be used since a language could have multiple names in a given country.
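The effect of DISTINCT on this join can be seen in a small sketch using Python's `sqlite3` module. The index rows are taken from the excerpt shown earlier; the country names come from the discussion of the sample data, while the Area values for these two countries and the function name `countries_for_language` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CountryCodes (
    CountryID char(2) NOT NULL,
    Name      varchar(75) NOT NULL,
    Area      varchar(10) NOT NULL
);
CREATE TABLE LanguageIndex (
    LangID    char(3) NOT NULL,
    CountryID char(2) NOT NULL,
    NameType  char(2) NOT NULL,
    Name      varchar(75) NOT NULL
);
""")

# Area values here are assumed for illustration only.
conn.executemany(
    "INSERT INTO CountryCodes VALUES (?, ?, ?)",
    [("NG", "Nigeria", "Africa"), ("PG", "Papua New Guinea", "Pacific")],
)
conn.executemany(
    "INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)",
    [
        ("aab", "NG", "L", "Alumu-Tesu"),
        ("aab", "NG", "D", "Alumu"),
        ("aab", "NG", "D", "Tesu"),
        ("aac", "PG", "L", "Ari"),
    ],
)


def countries_for_language(conn, lang_id):
    """Return the names of all countries with index entries for this language."""
    cur = conn.execute(
        "SELECT DISTINCT C.Name FROM CountryCodes AS C "
        "JOIN LanguageIndex AS L ON C.CountryID = L.CountryID "
        "WHERE L.LangID = ? ORDER BY C.Name",
        (lang_id,),
    )
    return [row[0] for row in cur]


# Without DISTINCT the join yields one row per name, not one per country.
rows_without_distinct = conn.execute(
    "SELECT COUNT(*) FROM CountryCodes AS C "
    "JOIN LanguageIndex AS L ON C.CountryID = L.CountryID "
    "WHERE L.LangID = ?",
    ("aab",),
).fetchone()[0]

print(countries_for_language(conn, "aab"))
print(rows_without_distinct)
```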
Finally, the LanguageIndex.tab table can be used to find all the languages spoken in a particular country. Whereas the query illustrated previously retrieves all languages whose primary country is Afghanistan, the following query retrieves all languages spoken in Afghanistan:
SELECT DISTINCT LangID FROM LanguageIndex
WHERE CountryID='AF'
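The difference between the two queries can be illustrated with a small sketch in Python's `sqlite3` module. All rows below are placeholders invented for the example: the identifiers qaa and qab come from the range set aside for local use, the names are made up, and the function names are hypothetical. The placeholder language qab has a different primary country but is also indexed under AF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LanguageCodes (
    LangID     char(3) NOT NULL,
    CountryID  char(2) NOT NULL,
    LangStatus char(1) NOT NULL,
    Name       varchar(75) NOT NULL
);
CREATE TABLE LanguageIndex (
    LangID    char(3) NOT NULL,
    CountryID char(2) NOT NULL,
    NameType  char(2) NOT NULL,
    Name      varchar(75) NOT NULL
);
""")

# Placeholder data only: qaa and qab are local-use codes, names are invented.
conn.executemany(
    "INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)",
    [("qaa", "AF", "L", "Example A"), ("qab", "AE", "L", "Example B")],
)
conn.executemany(
    "INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)",
    [
        ("qaa", "AF", "L", "Example A"),
        ("qab", "AE", "L", "Example B"),
        ("qab", "AF", "L", "Example B"),
        ("qab", "AF", "LA", "Example B-alt"),
    ],
)


def primary_languages(conn, country_id):
    """Languages whose primary country is country_id (LanguageCodes)."""
    cur = conn.execute(
        "SELECT LangID FROM LanguageCodes WHERE CountryID = ? ORDER BY LangID",
        (country_id,),
    )
    return [row[0] for row in cur]


def spoken_languages(conn, country_id):
    """Languages with any index entry in country_id (LanguageIndex)."""
    cur = conn.execute(
        "SELECT DISTINCT LangID FROM LanguageIndex "
        "WHERE CountryID = ? ORDER BY LangID",
        (country_id,),
    )
    return [row[0] for row in cur]


print(primary_languages(conn, "AF"))
print(spoken_languages(conn, "AF"))
```

With this placeholder data the first call returns only qaa, while the second returns both qaa and qab.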
Downloading the Code Tables
The code tables (as tab-delimited, UTF-8 encoded plain text files) may be downloaded individually by clicking the following links. In each case, the first line contains the column names rather than the first row of data.
- LanguageCodes.tab (144K)
- CountryCodes.tab (5K)
- LanguageIndex.tab (1,325K)
Or download the complete set of tables with the terms of use statement in a single zip file:
- Language_Code_Data_20240221.zip (427K)
- Language_Code_Data_20230221.zip (421K)
- Language_Code_Data_20220221.zip (447K)
- Language_Code_Data_20210221.zip (441K)
- Language_Code_Data_20200221.zip (439K)
- Language_Code_Data_20190221.zip (423K)
- Language_Code_Data_20180221.zip (414K)
- Language_Code_Data_20170221.zip (403K)
- Language_Code_Data_20160222.zip (396K)
- Language_Code_Data_20150221.zip (385K)
- Language_Code_Data_20140425.zip (381K)
- Language_Code_Data_20130225.zip (347K)
Terms of Use
In the interest of fostering the use of ISO 639-3 for the uniform identification of all the world's languages in information systems, SIL International releases certain information from the Ethnologue database for use in the development of information systems, specifically, information that describes language identifiers in terms of alternate names and countries where spoken. You are welcome to download the information as provided and incorporate the supplied tables into your own database application. You are authorized to include the information in a product that you make available to the public (even on a commercial basis), provided that you:
- cite SIL International and this website (ethnologue.com) as the source of the information,
- do not modify or extend the codes other than those set aside for local use (i.e. qaa to qtz),
- do not redistribute the code tables for download, and
- use only the data in the tables and no other data posted on this site. Other information on this site should be accessed by supplying a link like the following (where abc is an ISO 639-3 code):
http://www.ethnologue.com/language/abc
SIL International periodically updates the supplied information, and intends this site to be the sole distribution source in order to ensure uniformity of versions. You are not authorized to redistribute the code tables for download, whether in the exact form they were obtained from this site or in a modified form you have developed, without the written consent of SIL International (see instructions above).