Headless chatbot that detects spam and posts links to it to chatrooms for quick deletion.

Overview

SmokeDetector

Headless chatbot that detects spam and posts it to chatrooms. Uses ChatExchange, takes questions from the Stack Exchange realtime tab, and accesses answers via the Stack Exchange API.

[Image: example chat post]

Documentation

User documentation is in the wiki.

Detailed documentation for setting up and running SmokeDetector is in the wiki.

Basic setup

To set up SmokeDetector, please use

git clone https://github.com/Charcoal-SE/SmokeDetector.git
cd SmokeDetector
git checkout deploy
sudo pip3 install -r requirements.txt --upgrade
pip3 install --user -r user_requirements.txt --upgrade

Next, copy config.sample to a new file called config, and edit the values required.

To run, use python3 nocrash.py (preferably in a daemon-able mode, like a screen session.) You can also use python3 ws.py, but then SmokeDetector will be shut down after 6 hours; when running from nocrash.py, it will be restarted. (This is to be sure that closed websockets, if any, are reopened.)
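
The restart behaviour described above can be pictured as a small supervisor loop. The following is a minimal sketch under stated assumptions, not the actual contents of nocrash.py: the names run_with_restart and DEFAULT_COMMAND are hypothetical, the restart limit exists only to keep the example bounded, and the real script may decide when to restart differently.

```python
import subprocess
import sys
import time

# Hypothetical default: run ws.py with the current interpreter.
DEFAULT_COMMAND = (sys.executable, "ws.py")


def run_with_restart(command=DEFAULT_COMMAND, max_restarts=3, delay_seconds=0.0):
    """Run `command`, starting it again each time it exits.

    The command is run once and then restarted up to `max_restarts` times.
    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for run_index in range(max_restarts + 1):
        completed = subprocess.run(command)
        exit_codes.append(completed.returncode)
        if run_index < max_restarts:
            # Pause before the next start so a crash loop does not spin.
            time.sleep(delay_seconds)
    return exit_codes
```

Calling run_with_restart() with the defaults would start ws.py and start it again after each exit, which is roughly the effect the 6-hour shutdown relies on to reopen closed websockets.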

Virtual environment setup

Running in a virtual environment is a good way to isolate dependency packages from your local system. To set up SmokeDetector in a virtual environment, you can use

git clone https://github.com/Charcoal-SE/SmokeDetector.git
cd SmokeDetector
git config user.email "[email protected]"
git config user.name "SmokeDetector"
git checkout deploy

python3 -m venv env
env/bin/pip3 install -r requirements.txt --upgrade
env/bin/pip3 install --user -r user_requirements.txt --upgrade

Next, copy config.sample to a new file called config and edit it as described above. To run SmokeDetector in this virtual environment, use env/bin/python3 nocrash.py.

[Note: On some systems (e.g. macOS and Linux), the --user option may need to be removed from the last pip3 command in the instructions above. In other circumstances, however, the --user option is known to be necessary. Further testing is needed to resolve the discrepancy.]

Docker setup

Running in a Docker container is an even better way to isolate dependency packages from your local system. To set up SmokeDetector in a Docker container, follow the steps below.

  1. Grab the Dockerfile and build an image of SmokeDetector:
DATE=$(date +%F)
mkdir temp
cd temp
wget https://raw.githubusercontent.com/Charcoal-SE/SmokeDetector/master/Dockerfile
docker build -t smokey:$DATE .
  2. Create a container from the image you just built:
docker create --name=mysmokedetector smokey:$DATE
  3. Start the container. Don't worry, SmokeDetector won't run until it's ready, so you have the chance to edit the configuration file before SmokeDetector runs.
docker start mysmokedetector

Copy config.sample to a new file named config and edit the values required, then copy the file into the container with this command:

docker cp config mysmokedetector:/home/smokey/SmokeDetector/config
  4. If you would like to set up additional things (SSH, Git, etc.), you can do so with a Bash shell in the container:
docker exec -it mysmokedetector bash

After you're ready, put a file named ready under /home/smokey:

touch ~smokey/ready

Automate Docker deployment with Docker Compose

This section assumes basic familiarity with Docker and Docker Compose.

The first thing you need is a properly filled config file. You can start with the sample.

Create a directory (name it whatever you like) and place the config file and the docker-compose.yml file in it. Run docker-compose up -d and your SmokeDetector instance is up.

If you want additional control, such as memory and CPU constraints, you can edit docker-compose.yml and add the following keys to the smokey service. The example values are the recommended values.

restart: always  # when your host reboots Smokey can autostart
mem_limit: 512M
cpus: 0.5  # Recommend 2.0 or more for spam waves

Requirements

SmokeDetector only supports Stack Exchange logins, and runs on Python 3.6 or higher, for now.

To allow committing blacklist and watchlist modifications back to GitHub, your system also needs Git 1.8 or higher, although we recommend Git 2.11+.

License

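Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
  • MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.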

Contribution Licensing

By submitting your contribution for inclusion in the work as defined in the Apache-2.0 license, you agree that it be dual licensed as above, without any additional terms or conditions.

Comments
  • Logos & Branding

    Andy suggested, and I agree, that we should have a cool logo.

    More to the point, it'd be good to have slightly more consistent branding across all our stuff - at the moment, Smokey has one icon, GH has another, and our web projects have another. And no offence to our web projects an' all, but that icon took me literally two minutes to make (and let's not even talk about my artwork on Smokey's icon).

    So. Logos, and branding. I'm going to throw my logo ideas so far into the thread below - use standard :+1: :-1: reactions to indicate your thoughts about them.

    opened by ArtOfCode- 63
  • Integrate DeepSmoke

    @tanmayb123 has created an API for us to query. StackOverflow posts only, for the time being.

    The API is basically

    99.239.154.69/dsd/index.php?q=[body270urlencoded]
    

    ... where body270urlencoded is the first 270 bytes of the post body.

    Detailed transcript starting around here: https://chat.stackexchange.com/transcript/message/39458944#39458944

    ... but details are a bit further down.
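
As an illustration of the query format described above, here is a minimal sketch that builds the request URL from a post body. The host and path are taken from the text above; the http:// scheme, the function name build_deepsmoke_url and the max_bytes parameter are assumptions made for this example, and no request is actually sent.

```python
from urllib.parse import quote

# Host and path copied from the description above; the scheme is an assumption.
DEEPSMOKE_ENDPOINT = "http://99.239.154.69/dsd/index.php"


def build_deepsmoke_url(post_body: str, max_bytes: int = 270) -> str:
    """Return the query URL for the first `max_bytes` bytes of the post body."""
    truncated = post_body.encode("utf-8")[:max_bytes]
    # quote() accepts bytes and percent-encodes them directly, so a multi-byte
    # character cut in half by the slice does not raise a decoding error.
    return f"{DEEPSMOKE_ENDPOINT}?q={quote(truncated, safe='')}"
```

Truncating the encoded bytes rather than the string keeps the limit at 270 bytes even when the body contains non-ASCII characters.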

    type: feature request status: planned 
    opened by tripleee 45
  • Yet Another Debate: Should we use a database instead of flat-files?

    This could apply to either Classic or NG, so I'm leaving it here.

    Context: Smokey currently saves most of its state data in local pickle files:

    $ ls *.p *.pickle
    apiCalls.pickle     bodyfetcherMaxIds.p        falsePositives.p  notifications.p     whyData.p
    autoIgnoredPosts.p  bodyfetcherQueue.p         ignoredPosts.p    whitelistedUsers.p
    blacklistedUsers.p  bodyfetcherQueueTimings.p  latestMessages.p  whyDataAllspam.p
    

    Proposal: Should we use a database for storing all (or most) of this state data, or should we stick to flat files? Why/not? What do we gain/lose?

    (If you were there for the Great Database Debate, assume this is SQLite, because of lack of install/maintenance.)
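
To make the proposal concrete, here is a minimal sketch of what keeping this state in SQLite could look like. It is not a design anyone in the discussion has agreed on: the StateStore class, the table layout and the choice to keep values pickled inside a BLOB column are all assumptions made for illustration.

```python
import pickle
import sqlite3


class StateStore:
    """Tiny key-value store keeping pickled values in one SQLite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS state (name TEXT PRIMARY KEY, value BLOB)"
        )

    def save(self, name, value):
        # The connection context manager commits on success and rolls back
        # if the statement raises.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO state (name, value) VALUES (?, ?)",
                (name, pickle.dumps(value)),
            )

    def load(self, name, default=None):
        row = self.conn.execute(
            "SELECT value FROM state WHERE name = ?", (name,)
        ).fetchone()
        return default if row is None else pickle.loads(row[0])
```

With this shape, something like StateStore("state.db").save("ignoredPosts", posts) would replace writing ignoredPosts.p, and all state would live in a single file with transactional writes.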

    type: feedback wanted 
    opened by ArtOfCode- 44
  • findspam.py: bad_ns_for_url_domain()

    Identify rogue name server (for now, concentrate on the Indian pharma spammer's favorite namecheaphosting.com) in domain names in URLs.

    requirements.txt: pull in dnspython

    findspam.py: post_links(): refactor into a separate function so that multiple methods can invoke it.
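
The matching step of such a check can be sketched as follows. This is not the code from the pull request: the function name has_bad_nameserver and the BAD_NAMESERVER_DOMAINS tuple are hypothetical, and the DNS lookup itself (which the PR delegates to dnspython) is left out, so the caller is assumed to pass in the name servers already resolved for the URL's domain.

```python
# Hypothetical list of rogue name server domains; the entry comes from the text above.
BAD_NAMESERVER_DOMAINS = ("namecheaphosting.com",)


def has_bad_nameserver(nameservers, bad_domains=BAD_NAMESERVER_DOMAINS):
    """Return True if any name server belongs to one of `bad_domains`.

    `nameservers` is an iterable of host names such as
    "ns1.namecheaphosting.com."; a trailing dot and letter case are ignored.
    """
    for nameserver in nameservers:
        name = nameserver.lower().rstrip(".")
        for bad in bad_domains:
            # Match the domain itself or any host below it, but not a
            # different domain that merely ends with the same characters.
            if name == bad or name.endswith("." + bad):
                return True
    return False
```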

    opened by tripleee 30
  • Do we need to change our blacklisting guidelines?

    When we wrote our blacklisting guidelines in October last year, we set the following requirements:

    • Website has been used in at least 2 confirmed instances (tp fed back reports) of spam (You can use https://metasmoke.erwaysoftware.com/search to find other instances of a website being used for spam).
    • Website is not used legitimately in other posts on Stack Exchange.
    • Website is not currently caught in any of these filters: - bad keyword in body - blacklisted website - pattern matching website

    Circumstances have changed since then, and the number of blacklists has grown. With the addition of the !!/blacklist-* commands, over 830 more websites/keywords/usernames have been added to our blacklists. In fact, 106 (!!!) were made in the last five days alone. Many of these websites are already caught by one or two of the reasons specified above.

    Considering this, I think we need to have a discussion over whether these guidelines need to be changed to reflect the way we should/are using blacklists now. What should our new guidelines be?

    • Do we want to be blacklisting every spammy site that we see? Do we want to leave it to extreme circumstances?
    • Should we instead focus our time on improving our pattern-matching-* reasons?
    • Should average autoflag weight of matched posts have anything to do with this?
    • Should manually reported/posts with only 1 reason be given extra weight when counting the need for a blacklist?
    • Are our current guidelines just fine, and do we just need to enforce them more?

    Other things we should think about:

    • If we are going to blacklist everything, do we want to automate it somehow?
    • What sort of a performance hit does blacklisting make? (I think Art ran some stats on this a while ago, maybe they need to be re-run with the updated codebase)
    • How much do dormant blacklists clutter the list? Do we need to think about code readability?
    • Should blacklist entries be removed if they don't have any hits after a certain time?
    • What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines? Should it have separate ones? Do we need to change the way that it is implemented, to give it 0 weight or not send reports to MS?

    What does everyone think about this?

    area: blacklists type: feedback wanted type: policy 
    opened by angussidney 29
  • Major refactoring of globalvars.py

    According to this message, perform major refactoring over globalvars.py.

    Detailed changes are in commit messages.

    Please don't merge this PR until all refactoring is done.

    Note: draft PR won't work as it will not request review, which defeats the point.

    opened by user12986714 27
  • Limit "notify" to users that have been active in the room in the past minutes.

    At this moment, in the SO Close Vote Reviewers chatroom, SD is notifying 6 different users for every single report.

    Example:

    [ SmokeDetector | MS ] Few unique characters in body: SolvedSOLVEDSOLVED by Furkan Ayık on stackoverflow.com (@​PraveenKumar @​AndrasDeak @πάνταῥεῖ @​FrankerZ @​tripleee @​dorukayhan)

    The amount of users getting notified has steadily been growing. Imo, it's getting a little annoying. Only a portion of the users actually respond to these notifications, and they get notified even when they haven't been active for hours.

    I'd like to request these notifications to be filtered on user activity. IE: Don't bother notifying a user that hasn't been active in the last hour.

    Just to be clear: I have no issue with these users. Just that the list of notifications is getting close to the length of the actual report.
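
The requested filter could look roughly like the sketch below. It assumes SmokeDetector can supply a mapping from user name to the time of that user's last chat message, which is not confirmed above; the function name users_to_notify and the one-hour default are likewise only illustrative.

```python
from datetime import datetime, timedelta


def users_to_notify(subscribers, last_active, now, max_idle=timedelta(hours=1)):
    """Return subscribers whose last chat activity is within `max_idle` of `now`.

    `last_active` maps user name -> datetime of their latest message; users
    missing from the mapping are treated as inactive.
    """
    return [
        user
        for user in subscribers
        if user in last_active and now - last_active[user] <= max_idle
    ]
```

The order of the subscriber list is preserved, so the notification suffix of a report would simply get shorter rather than reshuffled.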

    type: feature request 
    opened by Cerbrus 26
  • Refactor blacklists; implement !!/watch-ip, !!/watch-ns, !!/watch-asn and !!/blacklist-xx for the first two

    Refactor blacklists.py to have a set of classes implementing the various blacklist types, with behavior implemented by way of mixins for the individual classes.

    Based on this refactoring, simplify chatcommands.py and extend to support the new blacklist types.

    Some related refactoring in gitmanager.py to let the blacklist take care of its own updates, and in findspam.py to correspondingly update how we query for matches.

    Finally, simplify globalvars.py by putting all the Github-managed black- and watchlists in a single global dict. This arrangement simplifies handling of the lists as a whole, and finding the correct list based on a keyword like "watch-number".

    area: commands status: confirmed 
    opened by tripleee 25
  • What do we need in a central blacklist tracker?

    As brought up in https://github.com/Charcoal-SE/metasmoke/issues/257, there's a desire to move blacklists out of GitHub and into some other tool. Whether that tool is metasmoke, another app, or some other alien-tech-powered solution, we'll want to know what it should do. Here's my first take:

    • Store blacklist entries categorized by type ([watch|blacklist] [website, keyword,username], etc.)
    • Have an easy-to-use way to add entries
      • From a web UI, instances, and/or metasmoke
    • Have reasonably good uptime
    • (?) See what changed since the last update?

    Thoughts? Am I missing the mark? What else do we need?

    type: feedback wanted status: agreed 
    opened by Undo1 25
  • Add command to show flagged posts that have not been deleted yet (or post them automatically)

    Sometimes spam reports about smaller sites get less attention than they need, either if they're followed by many other reports or if only few people are online at the time.

    I would suggest that Smokey should keep a list of all reported posts that got positive feedback or no feedback at all yet and are not yet removed from the site. A command like !!/pending would then show a list of all those reports that still need more flags or feedback. Example:

    "Skin care tips" by "SpamUser" on webmasters.stackexchange.com [MS] (reported 12 minutes ago, 1 tp, 0 naa, 0 fp, post score -3)
    "Best essay writing service" by "Writer" on graphicdesign.stackexchange.com [MS] (reported 6 minutes ago, no feedback yet, post score -1)
    

    This would be very helpful to make sure no reports slip through and to verify if anything needs more flags after a bunch of reports appeared without having to walk through the links manually.

    Additionally, it might be useful to not only post this report on demand but also automatically for posts in the list that were reported more than e.g. 10 minutes ago.

    type: feature request status: completed 
    opened by ByteCommander 25
  • Relicense under dual MIT/Apache 2.0

    We currently don't have a specified license for SmokeDetector - and thus, by default, it is under full copyright. This is very restrictive, and we'd like to change it to something more permissive (in this case, dual licensed under MIT/Apache 2.0). We'll need consent from all contributors to this repository to do so:

    • [x] @ProgramFOX
    • [x] @normalhuman
    • [x] @hichris1234
    • [x] @Undo1
    • [x] @Manishearth
    • [x] @BrockA
    • [x] @rschrieken
    • [x] @AWegnerGitHub
    • [x] @Seth-Johnson
    • [x] @JC3
    • [x] @rekire
    • [x] @durron597
    • [x] @AstroCB
    • [x] @Siguza
    • [x] @ArcticEcho
    • [x] @K-Guan
    • [x] @ByteCommander
    • [x] @TheGuywithTheHat
    • [x] @apnorton
    • [x] @michaelb958
    • [x] @ndrewh

    To agree to relicensing, just leave this comment below or otherwise indicate consent:

    I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to choose either at their option.
    

    Some more info:

    This involves adding the following to the README and including the full text of both licenses in the repository:

    ## License
    
    Licensed under either of
    
     * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
     * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
    
    at your option.
    
    ### Contribution
    
    Unless you explicitly state otherwise, any contribution intentionally submitted
    for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
    additional terms or conditions.
    

    MIT is fairly permissive, so it's preferred by most, however it requires you to include the license in everything using the code. On the other hand, Apache doesn't have this issue, but is incompatible with GPLv2. A dual license gives users the freedom to choose a license of their choice.

    opened by Undo1 24
  • Should we use type hints?

    Is your feature request related to a problem? Please describe.

    Python has support for type hints. They have advantages (IDEs and other tools can help users avoid making type-related errors by checking, for one). I'm not aware of any major disadvantages currently (SD supports Python 3.6 and they were added in 3.5, so compatibility isn't a huge concern I think). Overall, I personally tend to like them as having my IDE verify I don't mess up types is handy.

    Describe the solution you'd like

    Use type hints on new code added to SD (and maybe some existing code, although I don't think that is really a priority).

    Describe alternatives you've considered

    N/A

    Additional context

    I asked if they should be used and Makyen said it would be reasonable to open an issue on GH to discuss it
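
For readers unfamiliar with the feature, this is roughly what a type-hinted helper looks like. The function first_match is a made-up example, not code from SmokeDetector; it uses typing.List and typing.Optional so that it also runs on Python 3.6.

```python
from typing import List, Optional


def first_match(keywords: List[str], text: str) -> Optional[str]:
    """Return the first keyword found in `text`, or None if none match."""
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            return keyword
    return None
```

A checker or IDE can then flag a call such as first_match("spam", 42) before the code is ever run, while the hints themselves have no effect at runtime.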

    type: feedback wanted 
    opened by CoconutMacaroon 1
  • Add per-entry ignore of NS validation failures in CI testing

    This adds the ability to ignore NS validation failures during CI testing on a per-entry basis in the *nses.yml files. If an entry in those files has the value ignore_validation_errors: true, then validation errors will not cause CI failures.

    This is primarily in order to be able to prevent CI failures due to the persistent intermittent issues with validating dns-parking.com.. This PR adds ignore_validation_errors: true to the dns-parking.com. entry in watched_nses.yml, which will result in not seeing CI testing failures when that domain fails DNS validation.

    There probably should be some automated testing which still validates, and fails on, entries which have ignore_validation_errors: true, as we do want to know about such domains if they actually stop having a DNS entry. However, this PR just patches over the persistent frustration of looking at CI failures and finding that it's yet another time that dns-parking.com. had an intermittent problem. No attempt is made here to address the issue of wanting some type of reporting for such domains when they actually go away.
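
The per-entry behaviour can be sketched as below. This is not the test code from the PR: the function name failing_entries, the "ns" key and the is_valid callable (standing in for the real DNS validation) are assumptions; only the ignore_validation_errors key comes from the description above.

```python
def failing_entries(entries, is_valid):
    """Return the name servers whose validation failure should fail CI.

    `entries` is a list of dicts as they might be loaded from a *nses.yml
    file; `is_valid` is a callable that takes a name server and returns
    True when it validates.
    """
    failures = []
    for entry in entries:
        if is_valid(entry["ns"]):
            continue
        if entry.get("ignore_validation_errors", False):
            # Validation failed, but this entry has opted out of failing CI.
            continue
        failures.append(entry["ns"])
    return failures
```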

    status: confirmed 
    opened by makyen 3
  • Wishlist: !!/test-scan or !!/scan-test - non-reporting test of a post

    Is your feature request related to a problem? Please describe.

    When you !!/scan a message, it will also be reported if it triggers a reason. If that's not what you want, there is no way really to test whether the post will trigger, short of copying it into a local JSON representation and testing locally (which is harder than it needs to be, because different parts of Smoke Detector look at different representations of posts).

    Describe the solution you'd like

    It would be nice if there was an option to !!/scan, like the opposite of !!/scan-force, where you only want to get a report of the result.

    In some ways, this is also similar to the !!/test command, so I'm thinking there should probably be an alias with that prefix.

    Describe alternatives you've considered

    I wanted to test a new expression I watched, but as outlined above, the only way to really know if it works as intended is to scan an actual post which contains the target phrase. Similarly, the only way to check against false positives is to scan posts which are not supposed to match (typically then posts which exhibit a corner case which you want to test you have covered adequately).

    You can add a test case to tests/test_findspam.py but this again hinges on getting the representation of the post right in the test case. If you guess wrong what the API actually returns for the post you are targeting, the test proves nothing.

    Additional context

    https://chat.stackexchange.com/transcript/message/61838114#61838114 where I rescanned the wrong post, adding insult to injury.

    opened by tripleee 3
  • Better WebSocket error recovery in EditWatcher and DeletionWatcher

    Some of the exceptions logged here indicate we're rebooting sometimes when there's an error on the EditWatcher and/or DeletionWatcher WebSockets. We could use better recovery from WebSocket exceptions in both of these modules (which use similar code, so similar error recovery seems reasonable). We already retry the main WebSocket, so it would be reasonable to have somewhat similar code in EditWatcher and DeletionWatcher.
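
One generic shape such recovery could take is a retry loop with exponential backoff, sketched below. It does not reflect how the main WebSocket is currently retried: run_with_retries and its parameters are hypothetical, and connect_and_listen stands in for whatever function opens the socket and processes messages.

```python
import time


def run_with_retries(connect_and_listen, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `connect_and_listen` until it returns without raising.

    Waits base_delay * 2**attempt seconds between attempts and re-raises
    the last exception once `max_attempts` is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return connect_and_listen()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The sleep argument is injectable only so that the loop can be exercised in tests without actually waiting.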

    type: bug type: feature request area: DeletionWatcher status: confirmed area: EditWatcher 
    opened by makyen 0
  • The regex package isn't thread safe (tracking issue)

    opened by makyen 4
Owner
Charcoal
We make nice things that stop spam.