Word usage in spam

Created: 16 Mar 2010 • Updated: 18 Mar 2010
Posted on behalf of Mathew Nisbet, Malware Data Analyst, Symantec Hosted Services

There is a huge variety in the types of spam that are sent all over the internet, but there are patterns to be found in the chaos.
One way to see patterns is to look at the words most commonly used in spam. If we take a random sample of global spam over a one-week period, there is quite a jumble of topics, but even through all the noise you can see that certain words still stand out, as illustrated here (the larger a word, the more often it occurs):
As you can see, the popular words are fairly generic, but all seem geared towards provoking an immediate reaction. This is further indicated by the fact that 5 of the top 6 words have an exclamation mark. Spammers like to create a sense of urgency in their messages, as the less time someone spends thinking about a message, the less likely they are to realise it is in fact a scam of some type.
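As a rough illustration of the counting behind a word cloud like this, the sketch below tallies words across a handful of messages and reports how many of the most frequent ones end in an exclamation mark. It is a minimal sketch, not the method used for the MessageLabs Intelligence data: the function names, the tokenising rule and the sample messages are all made up for the example.

```python
import re
from collections import Counter

# Assumed tokenising rule: lowercase words, with a trailing "!" kept
# attached so that "now!" and "now" are counted as different words.
TOKEN = re.compile(r"[a-z0-9']+!?")


def top_words(messages, n=6):
    """Return the n most frequent words across all messages."""
    counts = Counter()
    for text in messages:
        counts.update(TOKEN.findall(text.lower()))
    return counts.most_common(n)


def exclamation_count(top):
    """Count how many of the given (word, count) pairs end in '!'."""
    return sum(1 for word, _ in top if word.endswith("!"))


# Invented sample messages, for illustration only.
sample = [
    "Buy now! Limited offer!",
    "Act now! Save today!",
    "Special offer! Buy now!",
]

top = top_words(sample, n=3)
print(top)
print(exclamation_count(top), "of the top", len(top), "words end in '!'")
```

A word cloud tool then simply scales each word by its count.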
Individual botnets have different profiles from general spam, though: they tend to use more restricted sets of words, as they stick to a smaller number of set topics. The reason for this is the way botnets are used. Spammers pay botnet 'herders' (the people responsible for spreading the malware used to build botnets, and for controlling the infected machines) for the use of their botnet, because botnets can send mail in volumes far greater than any individual spammer could manage. Botnets also make it far less likely that the spammer will get caught and prosecuted, as there is no single source to trace the spam back to. With the use of the botnet effectively going to the highest bidder, each botnet will only be sending a small number of topics at any given time, from a small number of spammers who are able to pay for the service. Below is a series of pictures showing the top words from 4 of the top 5 spam botnets, and a screenshot of a sample mail from each.








Bobax is slightly different from the first three, as a greater number of words occur in its spam. However, it is still clear that the vast majority of its spam is limited to a particular topic.
The last of the top 5 botnets is Cutwail, and Cutwail behaves differently to the other 4. Rather than being limited to a small number of topics, which makes certain words stand out clearly from all others, it has lots of different topics, all used in similar volumes. One reason for this could be that, in addition to standard spam, which Cutwail still sends in abundance, Cutwail tends to be used to spread a lot of malware as well (Cutwail is responsible for most of the emails being used to try to spread the Bredolab family of malware, for example). With the objective being to deliver malware, it makes sense for the topic to be changed frequently, as the topic is just a means to get a user's attention, and having lots of topics therefore increases the chances of the mail (and its attachment) being opened. This differs from the other top botnets, which are mostly trying to sell something, so the topic of their spam is the whole point of the mail. Though their wording may be swapped around a bit, mostly it's just the same words in a different order, or with different pictures.
Below is a picture that shows the word distribution from Cutwail spam. Notice how the variance in size across the words is much less than in the others, and also how many more words there are overall:


Also of note is that the Cutwail sample shows that a lot more thought has gone into the text of the email. With spam from the other major botnets, the only objective is to get the user to go to a website, so their spam contains very little text: maybe one or two lines, and a link. For Cutwail, however, the objective is a little different. Its senders have to convince the user to open the attached file, so they need to make the mail look as legitimate as possible. They may even copy text from legitimate emails or websites for their own use. This is why Cutwail malware e-mails are longer and use good English, and why Cutwail spam uses a wider variety of words.
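The difference between a botnet that sticks to one topic and one that spreads across many can also be put into numbers. The sketch below shows two simple measures one might use: the share of all word occurrences taken by the single most common word, and a normalised entropy that is close to 1 when words are used in similar volumes and lower when a few words dominate. These measures are an assumption chosen for illustration, not something taken from the MessageLabs Intelligence analysis, and the word counts are invented.

```python
import math
from collections import Counter


def top_share(counts):
    """Fraction of all word occurrences taken by the most common word."""
    total = sum(counts.values())
    return max(counts.values()) / total


def normalized_entropy(counts):
    """Shannon entropy of the word distribution, scaled to the range 0..1.

    Values near 1 mean words are used in similar volumes; lower values
    mean a few words dominate.
    """
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Invented word counts: one narrow, single-topic profile and one broad profile.
narrow = Counter({"pills": 8, "cheap": 1, "order": 1})
broad = Counter({"invoice": 3, "delivery": 3, "account": 2, "statement": 2})

for name, counts in (("narrow", narrow), ("broad", broad)):
    print(name, round(top_share(counts), 2), round(normalized_entropy(counts), 2))
```

On real data, a profile like the one described for Cutwail would be expected to show a lower top-word share and a higher normalised entropy than the single-topic botnets.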

These images are created using Wordle to display MessageLabs Intelligence data.