The Language of Spam: Spammers do their Homework before Spamming Specific Regions
Posted on behalf of Dan Bleaken, Malware Data Analyst, Symantec Hosted Services
Globally, over the past month, spam has accounted for roughly 75 percent of all email in circulation. About 75 percent of that spam is sent from one of the 10 to 20 heavyweight botnets: huge networks of infected PCs, in some cases more than 1 million strong, sending spam 24/7. The remaining 25 percent of spam is sent via some other technique, such as:
• spam sent manually/automatically in large volumes using possibly thousands of newly generated, automatic CAPTCHA-broken, free webmail accounts
• spam sent manually/automatically using a compromised private webmail account, e.g. company webmail or university webmail
• spam sent manually/automatically using servers with a weak SMTP AUTH password, which the spammers have guessed
• spam sent manually/automatically via an open relay i.e. no authentication/relay restrictions
• spam sent manually/automatically from the spammers’ own machine(s), using domains that they have purchased (although this isn’t a particularly clever method for the spammer to use!)
Regardless of the technique or resource used to send the spam, different spammers have different objectives and different priorities. Some spammers want to send to recipients globally because they are trying to sell a product (e.g. Viagra, watches) that can be shipped anywhere in the world via some spammy website. Other spammers want to mail globally in an attempt to trick, defraud or scam their victims (e.g. Advance Fee Fraud, phishing etc.). These globally distributed spam campaigns account for the vast majority of spam.
However, there is always an element of spam that is targeted at a particular region, a particular country, or speakers of a particular language. In these cases, spammers can tailor their campaigns more closely to the recipient, perhaps drawing on local knowledge, an understanding of local customs, and references to well-known local people, places and products.
Which spammers? Globally active spammers can target specific regions or languages as part of their normal business, or regionally based spam gangs may choose to spam exclusively ‘on their doorstep’. To send region- or language-specific spam, spammers could use a botnet (choosing either to send from bots all over the world or to hire bots specifically in that region, which adds further legitimacy), or they could use one of the other resources listed above.
What we see is a storm of spam consisting of both globally and locally circulating campaigns. Some spam campaigns are seen by recipients all over the world; others are seen only by recipients in a particular country or region.
Globally, English language spam is dominant and always has been, accounting for 90 to 95 percent of global spam. The proportion of spam that is written in English was higher in 2009 than in 2008. The largest spam campaigns tend to be sent from botnets, in enormous volumes, and in English. From the spammer’s point of view this maximizes the chances of a response.
However, the proportion of spam that is in English can be dramatically lower in certain regions and countries. Globally, Russian language spam was popular in 2008 but is much less so this year. French and German language spam is seen fairly frequently, with occasional large bursts above the normal background level for those languages.
Portuguese is slowly becoming more common. Very occasionally we see a big wave of Italian spam. Chinese and Japanese do not represent a large proportion of global spam. Globally, the most common languages after English are French, Portuguese, Russian and German, in that order.
This is all relative, of course: small percentages still account for massive volumes sent. For example, Italian language spam accounts for approximately 0.02 percent of global spam. Symantec estimates the total daily global spam volume at approximately 50 billion messages, so 0.02 percent equates to roughly 10 million Italian language spam messages per day.
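The conversion from a percentage share to a daily message count is a single multiplication. As a minimal sketch in Python, taking the percentage and the daily volume quoted above as inputs (the function name is purely illustrative):

```python
from fractions import Fraction


def messages_per_day(share_percent: str, total_daily_volume: int) -> int:
    """Convert a percentage share of global spam into a daily message count."""
    # Fraction avoids floating-point rounding on very small percentages.
    share = Fraction(share_percent) / 100
    return int(share * total_daily_volume)


# Inputs quoted in the text: 0.02 percent of roughly 50 billion messages a day.
italian_daily = messages_per_day("0.02", 50_000_000_000)
print(f"{italian_daily:,}")
```

The same function can be reused for any other language share, given its percentage of global spam.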
This is the global average language breakdown, but what do recipients experience in individual countries? MessageLabs Intelligence analyzed spam received by tens of thousands of domains based all over the world to investigate the language of spam received in a selection of 29 countries: Australia, Austria, Belgium, Brazil, Canada, China, Denmark, Finland, France, Germany, Hong Kong, India, Indonesia, Italy, Japan, Malaysia, Netherlands, New Zealand, Norway, Portugal, Singapore, South Africa, Spain, Sweden, Switzerland, Taiwan, Thailand, United Kingdom, USA.
Of the 29 countries studied, all have English as the top spam language except Brazil, whose top spam language is Portuguese (41 percent of spam received in Brazil is in Portuguese). The countries where recipients are most likely to receive English language spam are South Africa, Switzerland and, surprisingly, Thailand and India. The countries where recipients are least likely to receive English language spam are Brazil, Taiwan, Italy and Malaysia, although even here approximately 40 percent of spam is in English. Brazil has the highest percentage of spam in the local language (Portuguese at 41 percent), followed by Italy (Italian at 35 percent) and China (Chinese at 19 percent). The Nordic countries (Sweden, Finland, Norway and Denmark) have a very low proportion of local language spam. Taiwan, China, Hong Kong, Singapore and Indonesia have the highest percentages of Chinese language spam, especially China and Taiwan (18 to 20 percent of spam). Everywhere else gets a small proportion of Chinese language spam.
The following table shows what proportion of spam received in each of the 29 countries studied is in the local language of that country (where the primary local language is not English).
Next, MessageLabs Intelligence analyzed the language breakdown for the country-code top-level domains (TLDs) of the 29 countries analyzed above: .au (Australia), .at (Austria), .be (Belgium), .br (Brazil), .ca (Canada), .cn (China), .dk (Denmark), .fi (Finland), .fr (France), .de (Germany), .hk (Hong Kong), .in (India), .id (Indonesia), .it (Italy), .jp (Japan), .my (Malaysia), .nl (Netherlands), .nz (New Zealand), .no (Norway), .pt (Portugal), .sg (Singapore), .za (South Africa), .es (Spain), .se (Sweden), .ch (Switzerland), .tw (Taiwan), .th (Thailand), .uk (United Kingdom), .us (USA). This time, rather than attributing spam to a country based on the location of the recipient business, the analysis attributed it based on the apparent country of the recipient’s domain.
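The difference between the two analyses comes down to which key each message is grouped by. A minimal sketch in Python, assuming each spam message has already been labelled with its language and with the country of the recipient business; the record layout, the sample rows and the function names are all hypothetical and are not MessageLabs data or tooling:

```python
from collections import Counter, defaultdict

# Hypothetical records: (recipient address, country of the business, spam language)
SAMPLE = [
    ("sales@example.com.br", "Brazil", "English"),
    ("info@example.com", "Brazil", "Portuguese"),
    ("hr@example.com", "Brazil", "Portuguese"),
    ("admin@example.cn", "USA", "English"),
    ("office@example.com", "China", "Chinese"),
]


def tld_of(address: str) -> str:
    """Return the last label of the recipient's domain, e.g. '.br'."""
    domain = address.rsplit("@", 1)[1]
    return "." + domain.rsplit(".", 1)[1].lower()


def language_share(records, key):
    """Group records by key(record) and return {group: {language: fraction}}."""
    counts = defaultdict(Counter)
    for record in records:
        counts[key(record)][record[2]] += 1
    return {
        group: {lang: n / sum(langs.values()) for lang, n in langs.items()}
        for group, langs in counts.items()
    }


# First analysis: group by where the recipient business is located.
by_location = language_share(SAMPLE, key=lambda r: r[1])
# Second analysis: group by the TLD of the recipient's domain.
by_tld = language_share(SAMPLE, key=lambda r: tld_of(r[0]))
```

In this toy sample the only .br address receives English spam, while the Brazil-based addresses on a .com domain receive Portuguese spam, so the two groupings give different local-language shares for the same messages.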
The following table shows what proportion of spam for each of the 29 TLDs studied is in the local language of that country (where the primary local language is not English).
The results were surprising. Most people would probably expect the TLD of a given country to have a higher proportion of local language spam than the myriad of domains owned by businesses based in that country. In fact, it is the opposite. For example, 5 percent of the spam received by Brazilian domains ending in .br is in Portuguese, compared with 41 percent for domains of businesses located in Brazil. Similarly, 1 percent of the spam received by Chinese domains ending in .cn is in Chinese, compared with 19 percent for domains of businesses located in China. Finally, 6 percent of the spam received by German domains ending in .de is in German, compared with 18 percent for domains of businesses located in Germany. A similar result is seen for the other countries.
So what’s the explanation for this? Why would recipients be more likely to get, say, Chinese spam to a .com domain in China than to a .cn domain? It suggests that when spammers prepare a new campaign in a specific language, they frequently do not simply select domains with the appropriate TLD. They look deeper than that, for any domains of businesses that are based in the appropriate country.
This makes good sense. Imagine that you want to send a Chinese language spam run. Choosing email addresses containing .cn domains may seem like a good idea, but what are the chances that the recipients of the emails will speak Chinese? Actually, the chances may be quite low. Many global businesses buy variants of their main .com domain in different countries around the world. So a company based in, say, the USA, may have a .com, a .co.uk, a .com.au, a .cn etc. However, actively searching for domains registered in Chinese-speaking countries, or where the administrative contacts for the domain are in a Chinese-speaking country, greatly increases the likelihood that the recipients of the spam run speak Chinese.
Of course there are always exceptions, but the spammers have to rely on probability. Usually, the number of different domains to which spam is sent is very large, and it would be extremely time-consuming to check and research every targeted domain individually. The spammers probably use some automated technique to suggest the best domains to target with a language-specific spam run.
One way of doing this would be to perform a ‘whois’ lookup on domains and search within the results for the appearance of certain words. For example, for a Chinese spam run, spammers might perform a whois lookup on one thousand domains and search the results for ‘China’, ‘Hong Kong’, ‘Beijing’ and so on. This way the spammers could gather a list of domains for organizations that are highly likely to have Chinese-speaking employees or users. The whois results for the domain of a Hong Kong based company, for example, would match such a search, and spammers would be right to assume that the employees of this company are very likely to speak Chinese.
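The keyword search described above can be sketched in a few lines of Python. This is an illustration of the idea only, not a tool known to be used by any spammer: it assumes the whois text has already been retrieved by some other means, and the sample records, keyword list and function name are invented for the example.

```python
import re

# Hypothetical keyword list for a Chinese language run, using the examples in the text.
KEYWORDS = ("China", "Hong Kong", "Beijing")


def matching_keywords(whois_text: str, keywords=KEYWORDS) -> list[str]:
    """Return the keywords that appear as whole words in a whois record."""
    found = []
    for keyword in keywords:
        # \b stops 'China' from matching inside a longer word such as 'Chinatown'.
        if re.search(rf"\b{re.escape(keyword)}\b", whois_text, re.IGNORECASE):
            found.append(keyword)
    return found


# Invented whois output for illustration; these are not real registrations.
SAMPLE_WHOIS = {
    "example-hk.com": "Registrant City: Kowloon\nRegistrant Country: Hong Kong\n",
    "example-us.cn": "Registrant City: Chicago\nRegistrant Country: United States\n",
}

# Keep only the domains whose whois text mentions at least one keyword.
likely_chinese_speaking = [
    domain for domain, text in SAMPLE_WHOIS.items() if matching_keywords(text)
]
```

In this invented sample the .com domain registered in Hong Kong is selected and the .cn domain registered in the USA is not, which mirrors the reasoning about TLDs above. A plain keyword match is crude, so any real use of this approach would still rely on probability rather than certainty.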
Other results for spam language by TLD:
The TLDs most likely to receive English spam are .in (India), .sg (Singapore), .id (Indonesia), .jp (Japan) and .hk (Hong Kong). The TLDs least likely to receive English spam (although it still makes up more than 45 percent) are .dk (Denmark), .se (Sweden), .fi (Finland), .tw (Taiwan) and .th (Thailand). The TLDs most likely to receive Chinese spam are .sg (Singapore), .hk (Hong Kong), .tw (Taiwan) and .at (Austria!). The TLDs least likely to receive Chinese spam are .pt (Portugal), .us (USA), .nl (Netherlands) and .de (Germany). English is the most likely language for all TLDs studied.