Heavily inspired by Tsoding's "Email Spam Filter in Go" video.
Formula (Bayes' theorem):

$$P(C|D) = \frac{P(D|C) \cdot P(C)}{P(D)}$$

where:
- $P(D)$ - probability that a given document exists, based on the current Bag-of-Words (word order is not relevant)
- $P(C)$ - probability that a `class` of document is `True` (in this case, the `SPAM` class)
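
As a rough illustration of how these quantities combine under Bayes' theorem, here is a minimal Go sketch. The function name `posterior` and all the numbers are made up for this example; they are not taken from the project or from the dataset.

```go
package main

import "fmt"

// posterior applies Bayes' theorem: P(C|D) = P(D|C) * P(C) / P(D).
// All inputs are probabilities in [0, 1]; pD must be non-zero.
// (Hypothetical helper, named for this example only.)
func posterior(pDGivenC, pC, pD float64) float64 {
	return pDGivenC * pC / pD
}

func main() {
	// Made-up numbers for illustration only.
	pDGivenSpam := 0.02 // P(D|C): how likely this document is among spam
	pSpam := 0.3        // P(C): share of spam among all documents
	pD := 0.01          // P(D): how likely this document is overall

	// prints: P(SPAM|D) = 0.60
	fmt.Printf("P(SPAM|D) = %.2f\n", posterior(pDGivenSpam, pSpam, pD))
}
```

With these numbers the document would be classified as spam, since the result is above 0.5.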
Go to https://www2.aueb.gr/users/ion/data/enron-spam/, download and extract `enron1`, `enron2`, etc., and put them in `data/`.
- generate Bag-of-Words for a directory
- compute $P(D)$ for a given document (at a `filePath`)
- compute $P(C)$
- compute $P(D|C)$
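
The steps above could be sketched roughly as follows. This is not the project's actual code: the names `Bag`, `bagFromDir`, `logPDGivenC` and `pSpamGivenD` are hypothetical, and the sketch assumes whitespace tokenization, add-one smoothing, log-space arithmetic, and $P(D)$ computed as $P(D|SPAM) P(SPAM) + P(D|HAM) P(HAM)$.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"path/filepath"
	"strings"
)

// Bag maps a word to the number of times it was seen (word order is ignored).
type Bag map[string]int

// addText lowercases text, splits it on whitespace, and counts each word.
func (b Bag) addText(text string) {
	for _, w := range strings.Fields(strings.ToLower(text)) {
		b[w]++
	}
}

// total returns the number of word occurrences stored in the bag.
func (b Bag) total() int {
	n := 0
	for _, c := range b {
		n += c
	}
	return n
}

// bagFromDir builds a Bag from every file directly inside dir and
// reports how many files (documents) were read.
func bagFromDir(dir string) (Bag, int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, 0, err
	}
	bag, docs := Bag{}, 0
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, 0, err
		}
		bag.addText(string(data))
		docs++
	}
	return bag, docs, nil
}

// logPDGivenC returns log P(D|C), treating words as independent and using
// add-one smoothing so an unseen word does not zero out the product.
func logPDGivenC(doc string, class Bag, vocabSize int) float64 {
	denom := float64(class.total() + vocabSize)
	sum := 0.0
	for _, w := range strings.Fields(strings.ToLower(doc)) {
		sum += math.Log(float64(class[w]+1) / denom)
	}
	return sum
}

// pSpamGivenD returns P(SPAM|D) = P(D|SPAM) P(SPAM) / P(D),
// with P(D) = P(D|SPAM) P(SPAM) + P(D|HAM) P(HAM).
func pSpamGivenD(doc string, spam, ham Bag, spamDocs, hamDocs int) float64 {
	vocab := map[string]struct{}{}
	for w := range spam {
		vocab[w] = struct{}{}
	}
	for w := range ham {
		vocab[w] = struct{}{}
	}
	pSpam := float64(spamDocs) / float64(spamDocs+hamDocs) // P(C)
	ls := logPDGivenC(doc, spam, len(vocab)) + math.Log(pSpam)
	lh := logPDGivenC(doc, ham, len(vocab)) + math.Log(1-pSpam)
	// Same ratio as above, rearranged so tiny probabilities do not underflow.
	return 1 / (1 + math.Exp(lh-ls))
}

func main() {
	// Tiny in-memory corpus so the sketch runs without the dataset.
	spam, ham := Bag{}, Bag{}
	spam.addText("win money now")
	spam.addText("free money offer")
	ham.addText("meeting notes attached")
	ham.addText("lunch meeting tomorrow")

	// prints: P(SPAM | "free money") = 0.857
	fmt.Printf("P(SPAM | %q) = %.3f\n", "free money",
		pSpamGivenD("free money", spam, ham, 2, 2))
}
```

To run it against the real data, the two bags could instead come from something like `bagFromDir("data/enron1/spam")` and `bagFromDir("data/enron1/ham")`, assuming each extracted archive contains `spam/` and `ham/` subdirectories.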