Last updated: June 23, 2020
This article discusses the types of approaches to spam detection that we can use in various circumstances. We’ll also see what training datasets are publicly available for the construction of spam detectors in English and in some other languages.
A message is considered spam if it’s unwanted by the receiver. This means that no message is spam in the abstract; a message is spam only in relation to a particular receiver who doesn’t want it.
This is a problem for us if we want to build a spam detector: different users have different preferences, and we can’t predict a priori which messages they’ll want to receive and which they won’t. We can see what this means with an example.
Common spam emails are, for instance, those that relate to the sale of medicines, financial services, or adult content. They may not, however, be spam if the user is:
Theoretically, we want the user to first express their preferences to us in a sufficiently clear manner, so we can then build a system that filters out all unwanted messages. Let’s, for now, imagine that this happened and that we have a clear picture of what the user wants and what they don’t.
So how do we then proceed to build a spam filter from here?
There are three main approaches to the creation of a system for the detection of spam in a corpus of emails. The first approach is rule-based and works by classifying as spam all texts that satisfy certain sets of RegEx patterns:
Programmers define these patterns in advance, so the rules are static: they don’t adapt to new kinds of spam unless someone edits them by hand.
We should select this approach when we have to share the same spam filter across multiple users; in doing so, keep in mind that the system will be difficult to scale since all rules must be input by hand.
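As an illustration of the rule-based approach, here is a minimal sketch in Python. The patterns, the `SPAM_PATTERNS` list, and the `is_spam_by_rules` function are hypothetical examples invented for this sketch; a real filter would need a much larger, hand-maintained rule set.

```python
import re

# Hypothetical, hand-written rules (assumptions for illustration only).
SPAM_PATTERNS = [
    re.compile(r"\bfree\s+money\b", re.IGNORECASE),
    re.compile(r"\bcheap\s+(pills|meds|medicines)\b", re.IGNORECASE),
    re.compile(r"\bclick\s+here\b", re.IGNORECASE),
]


def is_spam_by_rules(text, min_matches=1):
    """Return True if at least `min_matches` patterns match the text."""
    matches = sum(1 for pattern in SPAM_PATTERNS if pattern.search(text))
    return matches >= min_matches
```

Every new kind of spam requires a new pattern in the list, which is the scaling limit mentioned above.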
Another approach is the statistical method, which we can implement by means of machine learning.
This method requires us to extract features from words or texts and compare them against the associated spam labels. In doing so, a machine learning algorithm can learn how to classify previously unseen texts as spam on the basis of what it learned from a labeled training dataset.
This is, for instance, how the coefficient matrix of a Bernoulli Naive Bayesian classifier for spam looks after training:
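To make the statistical approach more concrete, the following is a minimal sketch of a Bernoulli Naive Bayes classifier written in plain Python. The class, the tokenizer, and the four training messages are assumptions made up for this example; a production system would more likely rely on an established machine learning library and a far larger dataset.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


class BernoulliNaiveBayes:
    def fit(self, texts, labels):
        docs = [tokenize(t) for t in texts]
        self.vocabulary = set().union(*docs)
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.word_prob = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            # Laplace smoothing: (documents containing w + 1) / (documents + 2)
            self.word_prob[c] = {
                w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
                for w in self.vocabulary
            }
        return self

    def log_scores(self, text):
        words = tokenize(text)
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for w in self.vocabulary:
                p = self.word_prob[c][w]
                # Bernoulli model: absent words also contribute evidence.
                score += math.log(p) if w in words else math.log(1.0 - p)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
texts = [
    "win free money now",
    "cheap pills free offer",
    "meeting agenda for monday",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]
model = BernoulliNaiveBayes().fit(texts, labels)
```

The `word_prob` dictionary plays the role of the learned coefficients: for each class, it stores the smoothed probability that a given word appears in a message of that class.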
Finally, there are hybrid approaches that merge rule-based and statistical methods. They are less common in commercial applications, at least for emails in the English language, where statistical methods are the most common.
It should be noted that, while the statistical approach is the most suited for English, this is not generally valid for all languages. Some, and in particular Chinese, benefit more from using hybrid methods instead. This means that we should consider the specificity of language structures when building spam detectors.
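One possible way to combine the two methods is to let rule matches adjust the probability produced by a statistical model. The sketch below assumes that `spam_probability` is any callable returning a value between 0 and 1; the rule, the `rule_boost` value, and the threshold are hypothetical choices for illustration, not a description of how any particular hybrid system works.

```python
import re

# Hypothetical high-precision rule (an assumption for this sketch).
HARD_RULES = [re.compile(r"\bwire\s+transfer\b", re.IGNORECASE)]


def hybrid_is_spam(text, spam_probability, threshold=0.5, rule_boost=0.3):
    """Combine a statistical score with rule matches.

    `spam_probability` is assumed to be a callable that maps a text to
    an estimate of P(spam | text) in the range [0, 1].
    """
    p = spam_probability(text)
    if any(rule.search(text) for rule in HARD_RULES):
        # A rule hit raises the statistical estimate, capped at 1.0.
        p = min(1.0, p + rule_boost)
    return p >= threshold
```

For a language where the statistical model alone is weaker, the rules could encode language-specific patterns that the model tends to miss.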
Regardless of the approach we choose, when creating a spam filter we’ll need to first create a base detector that identifies spam, and then update it online with the preferences expressed by the user.
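The idea of a base detector that is later updated online can be sketched with simple per-class word counts, which are cheap to update one message at a time. The `OnlineSpamCounts` class below is a hypothetical simplification: it scores only the words present in a message, and the example messages are invented.

```python
import math
from collections import Counter


class OnlineSpamCounts:
    """Keeps per-class word counts that can be updated incrementally."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def update(self, text, label):
        """Add one labeled message ("spam" or "ham") to the counts."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(set(text.lower().split()))

    def spam_log_odds(self, text):
        """Positive values lean towards spam, negative towards ham."""
        n_spam = self.doc_counts["spam"]
        n_ham = self.doc_counts["ham"]
        score = math.log((n_spam + 1) / (n_ham + 1))
        for word in set(text.lower().split()):
            p_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            score += math.log(p_spam / p_ham)
        return score


# Base detector trained on generic data (toy examples).
detector = OnlineSpamCounts()
detector.update("pharmacy discount offer", "spam")
detector.update("project status report", "ham")
score_before = detector.spam_log_odds("pharmacy discount")

# A user who wants these messages marks two of them as not spam.
detector.update("pharmacy discount order", "ham")
detector.update("pharmacy discount invoice", "ham")
score_after = detector.spam_log_odds("pharmacy discount")
```

After the user’s feedback, the score for the same message moves from the spam side to the ham side, without retraining from scratch.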
So how do we create a base spam classifier, which we can then adapt to the specific preferences of any individual user?
We need a training dataset. To find a dataset we have two options: either we develop it ourselves on the basis of the individual user’s behavior, or we find a premade one and adapt it.
When filtering spam for a user, the ideal dataset is the one which that user creates. By using that dataset, the spam filter would replicate exactly that user’s preferences, and not somebody else’s.
This means that, ideally, we’d like to ask the user to label for us all messages they receive. They would then tell us which emails they consider to be spam, and we’d guess whether they want any given new email on the basis of their preferences.
Users do exactly this when they flag an email as spam in an email client. As a consequence, we might think to take advantage of this behavior and use it to build our training dataset. This is however seldom possible for two main reasons:
This also means that building a base spam detector from individual user preferences is unlikely, though not impossible in principle. It can only happen when new users receive many spam messages and diligently label them for us.
The second solution, which is not ideal but very common in practice, is to use a premade dataset for the identification of spam. Several datasets of this type are available, for both commercial and scientific purposes:
For languages other than English, a ready-made public training dataset for spam classification is not always available. In that case, we can find datasets of texts, and of internet texts in particular, and build on those.
If we can put some work into a text corpus we can sometimes re-adapt it for training a spam classifier. If, for example, a certain dataset lacks the labels for spam classification, that doesn’t mean we can’t use it. In that case, we can still either label the texts manually or use automatic methods for their labeling.
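As one example of an automatic labeling method, simple keyword heuristics can assign provisional labels to an unlabeled corpus and set aside the texts they can’t decide on for manual review. The keyword lists and function names below are hypothetical and English-only; for another language, they would have to be replaced with language-specific terms.

```python
import re

# Hypothetical hint words (assumptions for this sketch).
SPAM_HINTS = re.compile(r"\b(free|winner|discount|click here)\b", re.IGNORECASE)
HAM_HINTS = re.compile(r"\b(meeting|invoice|attached|regards)\b", re.IGNORECASE)


def weak_label(text):
    """Return "spam", "ham", or None when the heuristics can't decide."""
    spam_hits = len(SPAM_HINTS.findall(text))
    ham_hits = len(HAM_HINTS.findall(text))
    if spam_hits > ham_hits:
        return "spam"
    if ham_hits > spam_hits:
        return "ham"
    return None


def label_corpus(texts):
    """Split texts into automatically labeled pairs and undecided texts."""
    labeled, undecided = [], []
    for text in texts:
        label = weak_label(text)
        if label is None:
            undecided.append(text)
        else:
            labeled.append((text, label))
    return labeled, undecided
```

Labels produced this way are noisy, so it’s worth checking a sample by hand before training a classifier on them.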
We can see here some text datasets for languages other than English:
Since user emails contain personal and sensitive information, it’s generally hard to find open datasets for the classification of emails. With the exception of the ones in English, indicated above, such datasets are generally not publicly available. If we still need a more specific dataset, we can sometimes obtain access to it from those who have one.
All email service providers own proprietary datasets, and can in some cases grant access to third parties. This type of access is normally given on the basis of a commercial or research contract. Programmers or researchers can negotiate this type of contract directly with the service providers.
In this article, we’ve seen what the main approaches to spam filtering are. We commonly use the rule-based approach when we share the same spam filter between multiple users. We use the statistical approach instead when the spam filter must be scalable.
We’ve also seen what the most common public datasets for the initial training of spam classifiers are. We discussed which ones are available in English and other languages, and saw what features they possess.