Spam filtering made easy for you

Tasdik Rahman

Last update: Dec 18, 2022

spammy

Author:	Tasdik Rahman
Latest version:	1.0.3

Contents

1 Overview
2 Features
3 Example
- 3.1 Accuracy of the classifier
4 Installation
- 4.1 Upgrading
- 4.2 Installation behind a proxy
5 Benchmarks
6 Contributing
- 6.1 Roadmap
7 Licensing
8 Credits
9 Donation

1 Overview

spammy : Spam filtering at your service

spammy powers the web app https://plino.herokuapp.com

2 Features

Train the classifier on your own dataset to classify your emails as spam or ham
Dead simple to use. See usage
Blazingly fast once the classifier is trained. (See benchmarks)
Custom exceptions are raised, so when you miss something, spammy tells you gracefully where you went wrong
Written in uncomplicated Python
Built on the giant shoulders of nltk

3 Example

Your data directory structure should look something like this:

$ tree /home/tasdik/Dropbox/projects/spammy/examples/test_dataset
/home/tasdik/Dropbox/projects/spammy/examples/test_dataset
├── ham
│   ├── 5458.2001-04-25.kaminski.ham.txt
│   ├── 5459.2001-04-25.kaminski.ham.txt
│   ...
│   ...
│   └── 5851.2001-05-22.kaminski.ham.txt
└── spam
    ├── 4136.2005-07-05.SA_and_HP.spam.txt
    ├── 4137.2005-07-05.SA_and_HP.spam.txt
    ...
    ...
    └── 5269.2005-07-19.SA_and_HP.spam.txt
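Before training, it can help to confirm that a dataset really has this shape. The following is a minimal sketch using only the standard library; `count_labelled_files` and `make_demo_dataset` are hypothetical helpers written for this illustration and are not part of spammy. The sketch assumes the only requirement is one `ham` and one `spam` subdirectory holding `.txt` files, as in the tree above.

```python
import os
import tempfile


def count_labelled_files(dataset_dir, labels=("ham", "spam")):
    """Return {label: number of .txt files} for each expected subdirectory.

    Raises ValueError if a label directory is missing.
    Hypothetical helper, not part of spammy.
    """
    counts = {}
    for label in labels:
        label_dir = os.path.join(dataset_dir, label)
        if not os.path.isdir(label_dir):
            raise ValueError("missing directory: %s" % label_dir)
        counts[label] = sum(
            1 for name in os.listdir(label_dir) if name.endswith(".txt")
        )
    return counts


def make_demo_dataset():
    """Build a throwaway dataset with the ham/spam layout shown above."""
    root = tempfile.mkdtemp()
    for label, n_files in (("ham", 3), ("spam", 2)):
        os.makedirs(os.path.join(root, label))
        for i in range(n_files):
            path = os.path.join(root, label, "%04d.%s.txt" % (i, label))
            with open(path, "w") as handle:
                handle.write("sample %s message %d\n" % (label, i))
    return root


demo_root = make_demo_dataset()
demo_counts = count_labelled_files(demo_root)
print(demo_counts)
```

Running the same check against your own corpus directory before calling `train()` gives an early, readable error if a label directory is missing or misspelled.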

Example

>>> import os
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>>
>>> # directory structure
>>> os.listdir(directory)
['spam', 'Summary.txt', 'ham']
>>> os.listdir(os.path.join(directory, 'spam'))[:3]
['4257.2005-04-06.BG.spam.txt', '0724.2004-09-21.BG.spam.txt', '2835.2005-01-19.BG.spam.txt']
>>>
>>> # Spammy object created
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>>
>>> SPAM_TEXT = \
... """
... My Dear Friend,
...
... How are you and your family? I hope you all are fine.
...
... My dear I know that this mail will come to you as a surprise, but it's for my
... urgent need for a foreign partner that made me to contact you for your sincere
... genuine assistance My name is Mr.Herman Hirdiramani, I am a banker by
... profession currently holding the post of Director Auditing Department in
... the Islamic Development Bank(IsDB)here in Ouagadougou, Burkina Faso.
...
... I got your email information through the Burkina's Chamber of Commerce
... and industry on foreign business relations here in Ouagadougou Burkina Faso
... I haven'disclose this deal to any body I hope that you will not expose or
... betray this trust and confident that I am about to repose on you for the
... mutual benefit of our both families.
...
... I need your urgent assistance in transferring the sum of Eight Million,
... Four Hundred and Fifty Thousand United States Dollars ($8,450,000:00) into
... your account within 14 working banking days This money has been dormant for
... years in our bank without claim due to the owner of this fund died along with
... his entire family and his supposed next of kin in an underground train crash
... since years ago. For your further informations please visit
... (http://news.bbc.co.uk/2/hi/5141542.stm)
... """
>>> cl.classify(SPAM_TEXT)
'spam'
>>>
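The README does not say which algorithm sits behind `train()` and `classify()`. A comment further down reports that Plino's GitHub page describes it as Naive Bayes, so, purely for intuition and assuming that description applies here, this is a tiny standard-library multinomial Naive Bayes with add-one smoothing. `TinyNaiveBayes` and `tokenize` are hypothetical names, and this is not spammy's code.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes; deliberately crude."""
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes(object):
    """Multinomial Naive Bayes with add-one smoothing (illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, labelled_texts):
        for text, label in labelled_texts:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / float(total_docs))
            label_total = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1.0) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


toy = TinyNaiveBayes()
toy.train([
    ("urgent transfer of dormant funds to your account", "spam"),
    ("claim your unclaimed fund now", "spam"),
    ("meeting notes attached for the risk model review", "ham"),
    ("lunch on friday with the research group", "ham"),
])
print(toy.classify("urgent claim on a dormant fund"))    # spam
print(toy.classify("notes from the research meeting"))   # ham
```

The toy mirrors the two-step shape of the session above (train once, then classify arbitrary text), but its tokenizer and smoothing are choices made for this sketch only.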

3.1 Accuracy of the classifier

>>> from spammy import Spammy
>>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
>>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
>>> cl.train()
>>> data_dir = '/home/tasdik/Dropbox/projects/spammy/examples/test_dataset'
>>>
>>> cl.accuracy(directory=data_dir, label='spam', limit=300)
0.9554794520547946
>>> cl.accuracy(directory=data_dir, label='ham', limit=300)
0.9033333333333333
>>>
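The README does not spell out how `accuracy()` arrives at these figures. A minimal sketch, assuming it reports the fraction of messages with a known label that the classifier assigns to that same label; `label_accuracy` and `keyword_classifier` are hypothetical stand-ins, not spammy functions.

```python
def label_accuracy(classify, texts, label):
    """Fraction of `texts` (all known to be `label`) that `classify` gets right."""
    if not texts:
        raise ValueError("need at least one text to score")
    correct = sum(1 for text in texts if classify(text) == label)
    return correct / float(len(texts))


def keyword_classifier(text):
    """Stand-in classifier: flags any text mentioning 'fund' as spam."""
    return "spam" if "fund" in text.lower() else "ham"


spam_texts = ["Claim your FUND today", "dormant fund transfer", "cheap pills"]
ham_texts = ["meeting at noon", "quarterly fund report attached"]

spam_acc = label_accuracy(keyword_classifier, spam_texts, "spam")
ham_acc = label_accuracy(keyword_classifier, ham_texts, "ham")
print(spam_acc, ham_acc)
```

Scoring each label separately, as the session above does, shows whether mistakes are mostly missed spam or mostly legitimate mail flagged as spam, which a single combined figure would hide.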

NOTE: More examples can be found in the examples directory.

4 Installation

NOTE: spammy currently supports only Python 2.

Install the dependencies first:

$ pip install nltk==3.2.1 beautifulsoup4==4.4.1

To install, use pip:

$ pip install spammy

or, if you don't have pip, use easy_install:

$ easy_install spammy

Or build it yourself (only if you must):

$ git clone https://github.com/tasdikrahman/spammy.git
$ cd spammy
$ python setup.py install

4.1 Upgrading

To upgrade the package:

$ pip install -U spammy

4.2 Installation behind a proxy

If you are behind a proxy, this should work (the bracketed username:password@ part is optional):

$ pip --proxy [username:password@]domain_name:port install spammy

5 Benchmarks

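Spammy is blazingly fast once trained.

Don't believe me? Have a look. Each number returned by timeit.repeat below is the total time, in seconds, for 5 calls to classify: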

>>> import timeit
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>> SPAM_TEXT_2 = \
... """
... INTERNATIONAL MONETARY FUND (IMF)
... DEPT: WORLD DEBT RECONCILIATION AGENCIES.
... ADVISE: YOUR OUTSTANDING PAYMENT NOTIFICATION
...
... Attention
... A power of attorney was forwarded to our office this morning by two gentle men,
... one of them is an American national and he is MR DAVID DEANE by name while the
... other person is MR... JACK MORGAN by name a CANADIAN national.
... This gentleman claimed to be your representative, and this power of attorney
... stated that you are dead; they brought an account to replace your information
... in other to claim your fund of (US$9.7M) which is now lying DORMANT and UNCLAIMED,
...  below is the new account they have submitted:
...                     BANK.-HSBC CANADA
...                     Vancouver, CANADA
...                     ACCOUNT NO. 2984-0008-66
...
... Be further informed that this power of attorney also stated that you suffered.
... """
>>>
>>> def classify_timeit():
...    result = cl.classify(SPAM_TEXT_2)
...
>>> timeit.repeat(classify_timeit, number=5)
[0.1810469627380371, 0.16121697425842285, 0.16121196746826172]
>>>
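To turn those batch totals into a per-call latency, divide by the number of calls in each batch and take the minimum, which the `timeit` documentation recommends as the most useful summary. The sketch below reuses the three figures reported above; `stand_in_classify` is a hypothetical placeholder workload so the snippet runs without spammy installed.

```python
import timeit

# Figures reported above: total seconds for each batch of 5 classify() calls.
reported_totals = [0.1810469627380371, 0.16121697425842285, 0.16121196746826172]
calls_per_batch = 5

# Slower batches mostly reflect interference from other processes,
# so the minimum is the figure to quote.
best_per_call = min(reported_totals) / calls_per_batch
print("best: %.1f ms per call" % (best_per_call * 1000))  # best: 32.2 ms per call


def stand_in_classify():
    """Placeholder workload; swap in cl.classify(SPAM_TEXT_2) with spammy installed."""
    return sum(len(word) for word in "a stand in for real work".split())


# repeat is passed explicitly because its default differs between Python versions.
totals = timeit.repeat(stand_in_classify, repeat=3, number=5)
per_call = [total / 5 for total in totals]
print(min(per_call))
```

On the machine used for the session above, that works out to roughly 32 ms per classified message after training.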

6 Contributing

Refer to the CONTRIBUTING page for details.

6.1 Roadmap

Include more algorithms for increased accuracy
Python 3 support

7 Licensing

Spammy is built by Tasdik Rahman and licensed under GPLv3.

spammy Copyright (C) 2016 Tasdik Rahman

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

You can find a full copy of the LICENSE file here

8 Credits

If you'd like to give me credit somewhere on your blog or tweet a shout-out to @tasdikrahman, well hey, I'll take it.

9 Donation

If you have found my little bits of software of any use to you, you can help me pay my internet bills :)

Comments

Question: Classifiers used

Hi team,

~~Does Spammy use Naive Bayes or SVM classifier? I'm guessing more towards SVM as Plino describes it as Machine Learning.~~

Also is Plino an HTTP implementation/API for Spammy or are there any other changes to it?

EDIT: Plino's GitHub page describes it as using Naive Bayes, which I guess is still ML (but very outdated); it doesn't understand context and can't keep up with the evolving nature of spam.

My final question is can Spammy be rewritten to use SVM or perhaps LSA?

Kind Regards, Pavin Joseph.

opened by pavinjosdev 1

Update nltk to 3.2.4

There's a new version of nltk available. You are currently using 3.2.2. I have updated it to 3.2.4

These links might come in handy: PyPI | Changelog | Homepage

Changelog

3.2.4

Alex Constantin, Hatem Nassrat, Liling Tan

3.2.3

Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman, Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs, Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson, Arthur Tilley, jmhutch, Yorwba, eromoe and others

opened by pyup-bot 1

Update nltk to 3.2.3

There's a new version of nltk available. You are currently using 3.2.2. I have updated it to 3.2.3

These links might come in handy: PyPI | Changelog | Homepage

Changelog

3.2.3

Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman, Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs, Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson, Arthur Tilley, jmhutch, Yorwba, eromoe and others

opened by pyup-bot 1

Update beautifulsoup4 to 4.6.0

There's a new version of beautifulsoup4 available. You are currently using 4.5.3. I have updated it to 4.6.0

These links might come in handy: PyPI | Homepage

I couldn't find a changelog for this release. Do you know where I can find one? Tell me!

opened by pyup-bot 0

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

opened by pyup-bot 0

build(deps): bump nltk from 3.2.2 to 3.4.5

Bumps nltk from 3.2.2 to 3.4.5.

Changelog

Sourced from nltk's changelog.

Version 3.5 2019-08-07

drop support for Python 2

minor bug fixes and clean ups

Thanks to the following contributors to 3.5: Nicolas Darr, Gerhard Kremer, Liling Tan

Version 3.4.5 2019-08-20

Fixed security bug in downloader: Zip slip vulnerability - for the unlikely situation where a user configures their downloader to use a compromised server (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14751)

Thanks to the following contributors to 3.4.5: Mike Salvatore

Version 3.4.4 2019-07-04

fix bug in plot function (probability.py)

add improved PanLex Swadesh corpus reader

Thanks to the following contributors to 3.4.4: Devashish Lal, Liling Tan

Version 3.4.3 2019-06-07

add Text.generate()

add QuadgramAssocMeasures

add SSP to tokenizers

return confidence of best tag from AveragedPerceptron

make plot methods return Axes objects

don't require list arguments to PositiveNaiveBayesClassifier.train

fix Tree classes to work with native Python copy library

fix inconsistency for NomBank

fix random seeding in LanguageModel.generate

fix ConditionalFreqDist mutation on tabulate/plot call

fix broken links in documentation

fix misc Wordnet issues

update installation instructions

Thanks to the following contributors to 3.4.3: alvations, Bharat123rox, cifkao, drewmiller, free-variation, henchc, irisxzhou, nick-ulle, ppartarr, simonepri, yigitsever, zhaoyanpeng

Version 3.4.1 2019-04-17

add chomsky_normal_form for CFGs

add meteor score

add minimum edit/Levenshtein distance based alignment function

allow access to collocation list via text.collocation_list()

support corenlp server options

drop support for Python 3.4

... (truncated)

Commits

acca8d5 updates for 3.4.5

083bbf0 updates for 3.4.5

f59d7ed CVE-2019-14751:

2554ff4 updates for 3.4.4

fbda919 drop comment about implementation which is no longer accurate, and which did ...

8bcc98a Merge pull request #2319 from BLaZeKiLL/BLaZeKiLL-polt-bug-fix

f6a4f38 Merge pull request #2291 from alvations/better-panlex

8c75c56 Merge pull request #2324 from minho42/Fix-typo

afe23b3 Fix typo

ecdcc57 fixed retrieval of conditions from kwargs

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0

Update nltk to 3.2.5

There's a new version of nltk available. You are currently using 3.2.2. I have updated it to 3.2.5

These links might come in handy: PyPI | Changelog | Homepage

Changelog

3.2.5

Ali Abdullah, Lakhdar Benzahia, Henry Elder, Campion Fellin, Tsolak Ghukasyan, Thanh Ha, Jean Helie, Nelson Liu, Nathan Schneider, Chintan Shah, Fábio Silva, Liling Tan, Ziyao Wei, Zicheng Xu, Albert Au Yeung, AbdealiJK, porqupine, sbagan, xprogramer

3.2.4

Alex Constantin, Hatem Nassrat, Liling Tan

3.2.3

Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman, Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs, Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson, Arthur Tilley, jmhutch, Yorwba, eromoe and others

opened by pyup-bot 0

Owner

Tasdik Rahman

Engineering Platform @gojek, former SRE @razorpay. Weekend chef, Backpacker, past contributor to @oVirt (Redhat).

GitHub http://spammy.rtfd.io/
