Spam filtering made easy for you

Overview

spammy

PyPI version Build Status Python Versions percentagecov Requirements Status License

Author: Tasdik Rahman
Latest version: 1.0.3

1   Overview

spammy : Spam filtering at your service

spammy powers the web app https://plino.herokuapp.com

2   Features

  • train the classifier on your own dataset to classify your emails into spam or ham
  • Dead simple to use. See usage
  • Blazingly fast once the classifier is trained. (See benchmarks)
  • Custom exceptions raised so that when you miss something, spammy tells you where did you go wrong in a graceful way
  • Written in uncomplicated python
  • Built on top of the giant shoulders of nltk

3   Example

[back to top]

  • Your data directory structure should be something similar to
$ tree /home/tasdik/Dropbox/projects/spammy/examples/test_dataset
/home/tasdik/Dropbox/projects/spammy/examples/test_dataset
├── ham
│   ├── 5458.2001-04-25.kaminski.ham.txt
│   ├── 5459.2001-04-25.kaminski.ham.txt
│   ...
│   ...
│   └── 5851.2001-05-22.kaminski.ham.txt
└── spam
    ├── 4136.2005-07-05.SA_and_HP.spam.txt
    ├── 4137.2005-07-05.SA_and_HP.spam.txt
    ...
    ...
    └── 5269.2005-07-19.SA_and_HP.spam.txt

Example

>>> import os
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>>
>>> # directory structure
>>> os.listdir(directory)
['spam', 'Summary.txt', 'ham']
>>> os.listdir(os.path.join(directory, 'spam'))[:3]
['4257.2005-04-06.BG.spam.txt', '0724.2004-09-21.BG.spam.txt', '2835.2005-01-19.BG.spam.txt']
>>>
>>> # Spammy object created
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>>
>>> SPAM_TEXT = \
... """
... My Dear Friend,
...
... How are you and your family? I hope you all are fine.
...
... My dear I know that this mail will come to you as a surprise, but it's for my
... urgent need for a foreign partner that made me to contact you for your sincere
... genuine assistance My name is Mr.Herman Hirdiramani, I am a banker by
... profession currently holding the post of Director Auditing Department in
... the Islamic Development Bank(IsDB)here in Ouagadougou, Burkina Faso.
...
... I got your email information through the Burkina's Chamber of Commerce
... and industry on foreign business relations here in Ouagadougou Burkina Faso
... I haven'disclose this deal to any body I hope that you will not expose or
... betray this trust and confident that I am about to repose on you for the
... mutual benefit of our both families.
...
... I need your urgent assistance in transferring the sum of Eight Million,
... Four Hundred and Fifty Thousand United States Dollars ($8,450,000:00) into
... your account within 14 working banking days This money has been dormant for
... years in our bank without claim due to the owner of this fund died along with
... his entire family and his supposed next of kin in an underground train crash
... since years ago. For your further informations please visit
... (http://news.bbc.co.uk/2/hi/5141542.stm)
... """
>>> cl.classify(SPAM_TEXT)
'spam'
>>>

3.1   Accuracy of the classifier

>>> from spammy import Spammy
>>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
>>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
>>> cl.train()
>>> data_dir = '/home/tasdik/Dropbox/projects/spammy/examples/test_dataset'
>>>
>>> cl.accuracy(directory=data_dir, label='spam', limit=300)
0.9554794520547946
>>> cl.accuracy(directory=data_dir, label='ham', limit=300)
0.9033333333333333
>>>

NOTE:

4   Installation

[back to top]

NOTE: spammy currently supports only python2

Install the dependencies first

$ pip install nltk==3.2.1, beautifulsoup4==4.4.1

To install use pip:

$ pip install spammy

or if you don't have pip``use ``easy_install

$ easy_install spammy

Or build it yourself (only if you must):

$ git clone https://github.com/tasdikrahman/spammy.git
$ python setup.py install

4.1   Upgrading

To upgrade the package,

$ pip install -U spammy

4.2   Installation behind a proxy

If you are behind a proxy, then this should work

$ pip --proxy [username:password@]domain_name:port install spammy

5   Benchmarks

[back to top]

Spammy is blazingly fast once trained

Don't believe me? Have a look

>>> import timeit
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>> SPAM_TEXT_2 = \
... """
... INTERNATIONAL MONETARY FUND (IMF)
... DEPT: WORLD DEBT RECONCILIATION AGENCIES.
... ADVISE: YOUR OUTSTANDING PAYMENT NOTIFICATION
...
... Attention
... A power of attorney was forwarded to our office this morning by two gentle men,
... one of them is an American national and he is MR DAVID DEANE by name while the
... other person is MR... JACK MORGAN by name a CANADIAN national.
... This gentleman claimed to be your representative, and this power of attorney
... stated that you are dead; they brought an account to replace your information
... in other to claim your fund of (US$9.7M) which is now lying DORMANT and UNCLAIMED,
...  below is the new account they have submitted:
...                     BANK.-HSBC CANADA
...                     Vancouver, CANADA
...                     ACCOUNT NO. 2984-0008-66
...
... Be further informed that this power of attorney also stated that you suffered.
... """
>>>
>>> def classify_timeit():
...    result = cl.classify(SPAM_TEXT_2)
...
>>> timeit.repeat(classify_timeit, number=5)
[0.1810469627380371, 0.16121697425842285, 0.16121196746826172]
>>>

6   Contributing

[back to top]

Refer CONTRIBUTING page for details

6.1   Roadmap

  • Include more algorithms for increased accuracy
  • python3 support

7   Licensing

[back to top]

Spammy is built by Tasdik Rahman and licensed under GPLv3.

spammy Copyright (C) 2016 Tasdik Rahman([email protected])

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

You can find a full copy of the LICENSE file here

8   Credits

[back to top]

If you'd like give me credit somewhere on your blog or tweet a shout out to @tasdikrahman, well hey, I'll take it.

9   Donation

If you have found my little bits of software of any use to you, you can help me pay my internet bills :)

Paypal badge

Instamojo

gratipay

patreon

Comments
  • Question: Classifiers used

    Question: Classifiers used

    Hi team,

    ~~Does Spammy use Naive Bayes or SVM classifier? I'm guessing more towards SVM as Plino describes it as Machine Learning.~~

    Also is Plino an HTTP implementation/API for Spammy or are there any other changes to it?

    EDIT: Plino's github page describes it as using Naive Bayes which I guess is still ML (but very outdated); it doesn't understand context and can't keep up with the evolving nature of spam.

    My final question is can Spammy be rewritten to use SVM or perhaps LSA?

    Kind Regards, Pavin Joseph.

    opened by pavinjosdev 1
  • Update nltk to 3.2.4

    Update nltk to 3.2.4

    There's a new version of nltk available. You are currently using 3.2.2. I have updated it to 3.2.4

    These links might come in handy: PyPI | Changelog | Homepage

    Changelog

    3.2.4

    Alex Constantin, Hatem Nassrat, Liling Tan

    3.2.3

    Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman, Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs, Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson, Arthur Tilley, jmhutch, Yorwba, eromoe and others

    Got merge conflicts? Close this PR and delete the branch. I'll create a new PR for you.

    Happy merging! 🤖

    opened by pyup-bot 1
  • Update nltk to 3.2.3

    Update nltk to 3.2.3

    There's a new version of nltk available. You are currently using 3.2.2. I have updated it to 3.2.3

    These links might come in handy: PyPI | Changelog | Homepage

    Changelog

    3.2.3

    Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman, Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs, Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson, Arthur Tilley, jmhutch, Yorwba, eromoe and others

    Got merge conflicts? Close this PR and delete the branch. I'll create a new PR for you.

    Happy merging! 🤖

    opened by pyup-bot 1
  • Update beautifulsoup4 to 4.6.0

    Update beautifulsoup4 to 4.6.0

    There's a new version of beautifulsoup4 available. You are currently using 4.5.3. I have updated it to 4.6.0

    These links might come in handy: PyPI | Homepage

    I couldn't find a changelog for this release. Do you know where I can find one? Tell me!

    Got merge conflicts? Close this PR and delete the branch. I'll create a new PR for you.

    Happy merging! 🤖

    opened by pyup-bot 0
  • Initial Update

    Initial Update

    Hi 👊

    This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

    Once you have closed this issue, I'll create seperate pull requests for every update as soon as I find one.

    That's it for now!

    Happy merging! 🤖

    opened by pyup-bot 0
  • build(deps): bump nltk from 3.2.2 to 3.4.5

    build(deps): bump nltk from 3.2.2 to 3.4.5

    Bumps nltk from 3.2.2 to 3.4.5.

    Changelog

    Sourced from nltk's changelog.

    Version 3.5 2019-08-07

    • drop support for Python 2
    • minor bug fixes and clean ups

    Thanks to the following contributors to 3.5: Nicolas Darr, Gerhard Kremer, Liling Tan

    Version 3.4.5 2019-08-20

    • Fixed security bug in downloader: Zip slip vulnerability - for the unlikely situation where a user configures their downloader to use a compromised server https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14751)

    Thanks to the following contributors to 3.4.5: Mike Salvatore

    Version 3.4.4 2019-07-04

    • fix bug in plot function (probability.py)
    • add improved PanLex Swadesh corpus reader

    Thanks to the following contributors to 3.4.4: Devashish Lal, Liling Tan

    Version 3.4.3 2019-06-07

    • add Text.generate()
    • add QuadgramAssocMeasures
    • add SSP to tokenizers
    • return confidence of best tag from AveragedPerceptron
    • make plot methods return Axes objects
    • don't require list arguments to PositiveNaiveBayesClassifier.train
    • fix Tree classes to work with native Python copy library
    • fix inconsistency for NomBank
    • fix random seeding in LanguageModel.generate
    • fix ConditionalFreqDist mutation on tabulate/plot call
    • fix broken links in documentation
    • fix misc Wordnet issues
    • update installation instructions

    Thanks to the following contributors to 3.4.3: alvations, Bharat123rox, cifkao, drewmiller, free-variation, henchc irisxzhou, nick-ulle, ppartarr, simonepri, yigitsever, zhaoyanpeng

    Version 3.4.1 2019-04-17

    • add chomsky_normal_form for CFGs
    • add meteor score
    • add minimum edit/Levenshtein distance based alignment function
    • allow access to collocation list via text.collocation_list()
    • support corenlp server options
    • drop support for Python 3.4
    ... (truncated)
    Commits
    • acca8d5 updates for 3.4.5
    • 083bbf0 updates for 3.4.5
    • f59d7ed CVE-2019-14751:
    • 2554ff4 updates for 3.4.4
    • fbda919 drop comment about implementation which is no longer accurate, and which did ...
    • 8bcc98a Merge pull request #2319 from BLaZeKiLL/BLaZeKiLL-polt-bug-fix
    • f6a4f38 Merge pull request #2291 from alvations/better-panlex
    • 8c75c56 Merge pull request #2324 from minho42/Fix-typo
    • afe23b3 Fix typo
    • ecdcc57 fixed retrieval of conditions from kwargs
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot ignore this [patch|minor|major] version will close this PR and stop Dependabot creating any more for this minor/major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Update nltk to 3.2.5

    Update nltk to 3.2.5

    There's a new version of nltk available. You are currently using 3.2.2. I have updated it to 3.2.5

    These links might come in handy: PyPI | Changelog | Homepage

    Changelog

    3.2.5

    Ali Abdullah, Lakhdar Benzahia, Henry Elder, Campion Fellin, Tsolak Ghukasyan, Thanh Ha, Jean Helie, Nelson Liu, Nathan Schneider, Chintan Shah, Fábio Silva, Liling Tan, Ziyao Wei, Zicheng Xu, Albert Au Yeung, AbdealiJK, porqupine, sbagan, xprogramer

    3.2.4

    Alex Constantin, Hatem Nassrat, Liling Tan

    3.2.3

    Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman, Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs, Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson, Arthur Tilley, jmhutch, Yorwba, eromoe and others

    Got merge conflicts? Close this PR and delete the branch. I'll create a new PR for you.

    Happy merging! 🤖

    opened by pyup-bot 0
Owner
Tasdik Rahman
Engineering Platform @gojek, former SRE @razorpay. Weekend chef, Backpacker, past contributor to @oVirt (Redhat).
Tasdik Rahman
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.5k Jan 4, 2023
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.2k Feb 17, 2021
An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Khalid Saifullah 37 Sep 5, 2022
Khandakar Muhtasim Ferdous Ruhan 1 Dec 30, 2021
DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

DeepAmandine This is an artificial intelligence based on GPT-3 that you can chat with, it is very nice and makes a lot of jokes. We wish you a good ex

BuyWithCrypto 3 Apr 19, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Easy, fast, effective, and automatic g-code compression!

Getting to the meat of g-code. Easy, fast, effective, and automatic g-code compression! MeatPack nearly doubles the effective data rate of a standard

Scott Mudge 97 Nov 21, 2022
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 6, 2023
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.8k Dec 27, 2022
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

deepset 1.6k Dec 27, 2022
🏖 Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

Axel Springer Ideas Engineering GmbH 231 Nov 18, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 11, 2021
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.5k Feb 18, 2021
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

deepset 1.1k Feb 14, 2021
🏖 Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

Axel Springer Ideas Engineering GmbH 220 Feb 10, 2021
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 17, 2021
A fast and easy implementation of Transformer with PyTorch.

FasySeq FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which

宁羽 7 Jul 18, 2022
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

?? The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Hugging Face 15k Jan 2, 2023
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

null 49 Dec 17, 2022