Spam Filtering, Bayesian Approach
Introduction
Spam refers to unwanted messages sent, usually in bulk, via email, text (SMS), or instant messaging applications. These unsolicited messages are often used for advertising products, scamming recipients, or even initiating cyberattacks. Because of its sheer volume, spam can clutter inboxes, making it difficult for users to distinguish important messages from irrelevant ones.
To combat this issue, software solutions known as spam filters are employed. These filters automatically delete spam messages or move them to a separate folder, ensuring that users can focus on more important communications. The core functionality of a spam filter is to classify messages as either spam or not spam (commonly referred to as "ham").
This article explores one of the many techniques used to identify spam messages: the Naive Bayes algorithm. While this approach is not novel, my goal is to walk through the fundamental mathematics behind this powerful algorithm, as well as discuss the assumptions and adaptations made when implementing it in code.
Theory
According to Bayes' theorem, the probability of an event $A$ occurring given that event $B$ has occurred equals the probability of $B$ given $A$ times the probability of $A$, divided by the probability of $B$ [1]. This is a mouthful in words, but mathematically it is expressed as: $$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $$
Spam filters that use this approach ask the question, "What is the probability that the message received is spam given the content?". For a one-word message, this can be expressed as:
$$ P(S|W) = \frac{P(W|S) \times P(S)}{P(W)} $$
Similarly, the probability of the message being ham is: $$ P(H|W) = \frac{P(W|H) \times P(H)}{P(W)} $$
Where:
- $P(S|W)$ is the probability of the message being spam given the word.
- $P(H|W)$ is the probability of the message being ham given the word.
- $P(W|S)$ is the probability of the word occurring given that the message is spam.
- $P(W|H)$ is the probability of the word occurring given that the message is ham.
- $P(W)$ is the probability of the word.
To classify the message, we use a simple decision rule: if $P(S|W) > P(H|W)$, the message is classified as spam; otherwise, it is classified as ham. By comparing the two probabilities:
$$ \frac{P(W|S) \times P(S)}{\cancel{P(W)}} > \frac{P(W|H) \times P(H)}{\cancel{P(W)}} $$
we eliminate the need to calculate $P(W)$.
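As a quick, purely hypothetical illustration: suppose the word "prize" appears in $30\%$ of spam messages and $2\%$ of ham messages, and that spam and ham are equally likely overall. Then $P(W|S) \times P(S) = 0.30 \times 0.5 = 0.15$, while $P(W|H) \times P(H) = 0.02 \times 0.5 = 0.01$, so a one-word message containing "prize" would be classified as spam.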
In real-world scenarios, one-word messages are rare. Most messages contain multiple words, so we modify our equation to account for this: $P(S|W)$ becomes $P(S|w_0,w_1,w_2,\ldots,w_{N-1})$ and $P(H|W)$ becomes $P(H|w_0,w_1,w_2,\ldots,w_{N-1})$. Assuming the words occur independently of one another within each class, the likelihood of the whole message factors into a product over its words: $$P(w_0,w_1,w_2,\ldots,w_{N-1}|S) = P(w_0|S)\times P(w_1|S) \times P(w_2|S) \times \ldots \times P(w_{N-1}|S) = \prod_{n=0}^{N-1}{P(w_n|S)}$$ and likewise for ham, $P(w_0,w_1,w_2,\ldots,w_{N-1}|H) = \prod_{n=0}^{N-1}{P(w_n|H)}$.
This assumption of independence is what makes the algorithm "Naive." While words in sentences are not truly independent, the assumption simplifies calculations and still yields reasonable performance.
Applying Bayes' theorem to both posteriors and cancelling the shared denominator $P(w_0,w_1,w_2,\ldots,w_{N-1})$, our comparison logic then becomes: $$P(S|w_0,w_1,w_2,\ldots,w_{N-1}) > P(H|w_0,w_1,w_2,\ldots,w_{N-1})$$ $$ P(S)\prod_{n=0}^{N-1}{P(w_n|S)} > P(H)\prod_{n=0}^{N-1}{P(w_n|H)}$$
Addressing Numerical Challenges
Computers represent numbers using a finite number of bits, which limits the range of values they can accurately handle. Since probabilities are between 0 and 1, multiplying many probabilities can result in extremely small values, potentially leading to underflow errors where the product is misrepresented as zero.
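A short, self-contained sketch (with made-up probabilities) makes the problem concrete: multiplying many small word probabilities underflows to zero, while summing their logarithms stays comfortably within the representable range.

```python
import math

# Hypothetical per-word probabilities: small, but none of them is zero.
probs = [0.001] * 200

product = 1.0
for p in probs:
    product *= p        # 0.001**200 = 1e-600, far below the smallest
                        # positive double (~5e-324), so this underflows
print(product)          # 0.0

log_sum = sum(math.log(p) for p in probs)
print(log_sum)          # about -1381.55, easily representable
```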
To mitigate this, we take the logarithm of both sides of the inequality; because the logarithm is monotonically increasing, the direction of the comparison is preserved: $$ \log \left(P(S)\prod_{n=0}^{N-1}{P(w_n|S)}\right) > \log \left(P(H)\prod_{n=0}^{N-1}{P(w_n|H)} \right)$$
Using the logarithmic property that the log of a product is the sum of the logs, we rewrite the inequality as: $$ \log(P(S)) + \sum_{n=0}^{N-1}{\log(P(w_n|S))} > \log(P(H)) + \sum_{n=0}^{N-1}{\log(P(w_n|H))} $$
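As a minimal sketch of how this decision rule might look in Python (not the exact code from the repository), the classification step reduces to comparing two log-scores. Here `log_prior_spam`, `log_prior_ham`, `log_likelihood_spam`, and `log_likelihood_ham` are assumed to have been estimated from training data, and unseen words fall back to a small default log-probability.

```python
import math

def classify(words, log_prior_spam, log_prior_ham,
             log_likelihood_spam, log_likelihood_ham,
             default=math.log(1e-6)):
    """Label a tokenized message by comparing spam and ham log-scores."""
    spam_score = log_prior_spam + sum(log_likelihood_spam.get(w, default) for w in words)
    ham_score = log_prior_ham + sum(log_likelihood_ham.get(w, default) for w in words)
    return "spam" if spam_score > ham_score else "ham"
```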
Implementation
The source code for my implementation can be found on GitHub. The dataset used for this project was obtained from Kaggle. For storing ham and spam word counts, Python dictionaries were used as the datastore; the implementation can be extended to use a database or other storage solutions depending on the requirements. The filter was trained using $500$ messages from the dataset.
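A condensed sketch of what the training step could look like, assuming the messages are available as (label, text) pairs and using plain dictionaries as the datastore; the preprocessing and smoothing choices shown here (lowercasing, whitespace tokenization, add-one smoothing) are illustrative and may differ from the repository.

```python
import math
from collections import defaultdict

def train(messages):
    """messages: iterable of (label, text) pairs with label in {"spam", "ham"}."""
    word_counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
    message_counts = {"spam": 0, "ham": 0}

    for label, text in messages:
        message_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1

    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    totals = {c: sum(word_counts[c].values()) for c in ("spam", "ham")}
    n_messages = sum(message_counts.values())

    # Log-priors log P(class) and add-one (Laplace) smoothed log-likelihoods log P(w|class);
    # smoothing keeps words unseen in one class from contributing log(0) to a score.
    log_priors = {c: math.log(message_counts[c] / n_messages) for c in ("spam", "ham")}
    log_likelihoods = {
        c: {w: math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
            for w in vocab}
        for c in ("spam", "ham")
    }
    return log_priors, log_likelihoods
```

The returned structures pair naturally with the `classify` sketch from the Theory section, e.g., passing `log_priors["spam"]` and `log_likelihoods["spam"]` as the spam-side arguments.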
Result
Despite the relatively small size of the training dataset, the model achieved an accuracy and F1 score exceeding $90\%$. This performance is particularly impressive given the minimal computational resources required for training and inference.
Limitations
While the results are promising, the filter has some limitations:
Assumption of Word Independence:
The Naive Bayes algorithm assumes that words in a message are independent of each other. In reality, this is rarely the case, as the context and relationships between words often play a significant role in determining whether a message is spam or ham.
Sensitivity to Imbalanced Data:
If the dataset has an uneven distribution of spam and ham messages, the model's predictions may become biased toward the majority class. Proper balancing or the use of techniques like class weighting is necessary to address this.
Limited Vocabulary:
The filter is trained on a relatively small dataset, which means it may struggle with messages containing words or phrases not present in the training data. This can lead to reduced accuracy when encountering new or uncommon terms.
No Contextual Understanding:
The filter relies solely on word frequencies and does not account for more complex patterns or contextual nuances in messages. Advanced techniques, such as deep learning models, could improve performance by capturing these subtleties.
Performance on Noisy Data:
The model may perform poorly on messages with typos, slang, or non-standard formatting, as these can affect the probabilities calculated for classification.
Future Improvements
To address these limitations, future iterations of the filter could incorporate techniques such as:
- Using a larger and more diverse training dataset.
- Employing preprocessing steps to handle noisy data (e.g., stemming, lemmatization, or typo correction), as sketched below.
- Exploring hybrid models that combine Naive Bayes with more sophisticated machine learning techniques.
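As one example of the preprocessing idea, a small normalization pass could be placed in front of both training and classification. The sketch below assumes NLTK is installed and uses its Porter stemmer; this is just one of many reasonable choices.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed (pip install nltk)

stemmer = PorterStemmer()

def normalize(text):
    """Lowercase, keep alphabetic tokens only, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]

# e.g. "prizes" and "winnings" reduce to the stems "prize" and "win",
# so differently inflected forms share a single vocabulary entry.
print(normalize("FREE Prizes!! Claim your winnings now"))
```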
References
[1] Bayes' theorem, Wikipedia, https://en.wikipedia.org/wiki/Bayes%27_theorem (accessed Jan. 5, 2025).
This document includes insights generated by ChatGPT, a language model developed by OpenAI (https://openai.com/chatgpt).