Headless chatbot that detects spam and posts links to it to chatrooms for quick deletion.

Charcoal

Last update: Dec 21, 2022

Overview

SmokeDetector

Headless chatbot that detects spam and posts it to chatrooms. Uses ChatExchange, takes questions from the Stack Exchange realtime tab, and accesses answers via the Stack Exchange API.

Example chat post:

Documentation

User documentation is in the wiki.

Detailed documentation for setting up and running SmokeDetector is in the wiki.

Basic setup

To set up SmokeDetector, please use

git clone https://github.com/Charcoal-SE/SmokeDetector.git
cd SmokeDetector
git checkout deploy
sudo pip3 install -r requirements.txt --upgrade
pip3 install --user -r user_requirements.txt --upgrade

Next, copy config.sample to a new file called config, and edit the values required.

To run, use python3 nocrash.py (preferably in a daemon-able mode, such as a screen session). You can also use python3 ws.py, but then SmokeDetector will shut down after 6 hours; when run from nocrash.py, it is restarted automatically. (The periodic restart ensures that any closed websockets are reopened.)
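The restart behavior described above follows a common supervisor pattern. The sketch below shows that general pattern only; it is not the code in nocrash.py, and the function name, the restart cap, and the delay parameter are assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_with_restarts(command, max_restarts=3, delay_seconds=0.0):
    """Run `command` repeatedly, starting it again each time it exits.

    Returns the list of exit codes observed. A real supervisor would
    usually loop forever; the cap keeps this sketch easy to try out.
    """
    exit_codes = []
    for _ in range(max_restarts):
        # subprocess.call blocks until the child process exits.
        exit_codes.append(subprocess.call(command))
        time.sleep(delay_seconds)
    return exit_codes


# Hypothetical usage from the repository root:
# run_with_restarts([sys.executable, "ws.py"], max_restarts=3, delay_seconds=5)
```

Because each run is a fresh process, any connection that went stale in the previous run is reopened on restart.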

Virtual environment setup

Running in a virtual environment is a good way to isolate dependency packages from your local system. To set up SmokeDetector in a virtual environment, you can use

git clone https://github.com/Charcoal-SE/SmokeDetector.git
cd SmokeDetector
git config user.email "[email protected]"
git config user.name "SmokeDetector"
git checkout deploy

python3 -m venv env
env/bin/pip3 install -r requirements.txt --upgrade
env/bin/pip3 install --user -r user_requirements.txt --upgrade

Next, copy the config file and edit it as described above. To run SmokeDetector in this virtual environment, use env/bin/python3 nocrash.py.

[Note: On some systems (e.g. macOS and Linux), some circumstances may require that the --user option be removed from the last pip3 command line in the above instructions. However, the --user option is known to be necessary in other circumstances. Further testing is necessary to resolve the discrepancy.]

Docker setup

Running in a Docker container is an even better way to isolate dependency packages from your local system. To set up SmokeDetector in a Docker container, follow the steps below.

Grab the Dockerfile and build an image of SmokeDetector:

DATE=$(date +%F)
mkdir temp
cd temp
wget https://raw.githubusercontent.com/Charcoal-SE/SmokeDetector/master/Dockerfile
docker build -t smokey:$DATE .

Create a container from the image you just built:

docker create --name=mysmokedetector smokey:$DATE

Start the container (for example, with docker start mysmokedetector). Don't worry, SmokeDetector won't run until it's ready, so you have the chance to edit the configuration file before SmokeDetector runs.

Copy config.sample to a new file named config and edit the values required, then copy the file into the container with this command:

docker cp config mysmokedetector:/home/smokey/SmokeDetector/config

If you would like to set up additional tools (SSH, Git, etc.), you can do so with a Bash shell in the container:

docker exec -it mysmokedetector bash

After you're ready, put a file named ready under /home/smokey:

touch ~smokey/ready

Automate Docker deployment with Docker Compose

This section assumes you know the basics of Docker and Docker Compose.

The first thing you need is a properly filled config file. You can start with the sample.

Create a directory (name it whatever you like) and place the config file and the docker-compose.yml file in it. Run docker-compose up -d and your SmokeDetector instance is up.

If you want additional control, such as memory and CPU constraints, you can edit docker-compose.yml and add the following keys to smokey. The example values are the recommended ones.

restart: always  # when your host reboots Smokey can autostart
mem_limit: 512M
cpus: 0.5  # Recommend 2.0 or more for spam waves

Requirements

SmokeDetector only supports Stack Exchange logins, and runs on Python 3.6 or higher, for now.

To allow committing blacklist and watchlist modifications back to GitHub, your system also needs Git 1.8 or higher, although we recommend Git 2.11+.
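A small preflight check can confirm these version requirements before installation. The following is a minimal sketch, not part of SmokeDetector; both helper names are hypothetical, and the Git helper only parses text you pass in (for example, the output of git --version).

```python
import re
import sys


def python_ok(minimum=(3, 6)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


def parse_git_version(output):
    """Extract (major, minor) from `git --version` output, or None."""
    match = re.search(r"git version (\d+)\.(\d+)", output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


# Hypothetical usage:
# version = parse_git_version("git version 2.39.2")
# meets_minimum = version is not None and version >= (1, 8)
```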

License

Licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)

at your option.

Contribution Licensing

By submitting your contribution for inclusion in the work as defined in the Apache-2.0 license, you agree that it be dual licensed as above, without any additional terms or conditions.

Comments

Logos & Branding

Andy suggested, and I agree, that we should have a cool logo.

More to the point, it'd be good to have slightly more consistent branding across all our stuff - at the moment, Smokey has one icon, GH has another, and our web projects have another. And no offence to our web projects an' all, but that icon took me literally two minutes to make (and let's not even talk about my artwork on Smokey's icon).

So. Logos, and branding. I'm going to throw my logo ideas so far into the thread below - use standard :+1: :-1: reactions to indicate your thoughts about them.

opened by ArtOfCode- 63
Integrate DeepSmoke
@tanmayb123 has created an API for us to query. StackOverflow posts only, for the time being.

The API is basically

99.239.154.69/dsd/index.php?q=[body270urlencoded]

... where body270urlencoded is the first 270 bytes of the post body.
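As an illustration of the query format described above, the sketch below builds the request URL from a post body. The function name is hypothetical, the http:// scheme is an assumption (the issue gives only host and path), and the endpoint may no longer be reachable, so the code builds the URL without sending a request.

```python
from urllib.parse import quote

# Host and path as quoted in the issue; the scheme is an assumption.
DEEPSMOKE_ENDPOINT = "http://99.239.154.69/dsd/index.php"


def build_deepsmoke_url(body, limit=270):
    """Return the query URL for the first `limit` bytes of `body`."""
    # Truncate on UTF-8 bytes rather than characters, since the
    # issue specifies the first 270 bytes of the post body.
    prefix = body.encode("utf-8")[:limit]
    # quote() accepts bytes; safe="" percent-encodes "/" as well.
    return DEEPSMOKE_ENDPOINT + "?q=" + quote(prefix, safe="")
```

Truncating on bytes can split a multi-byte character at the boundary; percent-encoding the raw bytes sidesteps decoding errors on the client side, though how the server treats a partial character is unknown.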

Detailed transcript starting around here: https://chat.stackexchange.com/transcript/message/39458944#39458944

... but details are a bit further down.
type: feature request status: planned
opened by tripleee 45
Yet Another Debate: Should we use a database instead of flat-files?
This could apply to either Classic or NG, so I'm leaving it here.

Context: Smokey currently saves most of its state data in local pickle files:

$ ls *.p *.pickle
apiCalls.pickle      bodyfetcherMaxIds.p         falsePositives.p   notifications.p      whyData.p
autoIgnoredPosts.p   bodyfetcherQueue.p          ignoredPosts.p     whitelistedUsers.p
blacklistedUsers.p   bodyfetcherQueueTimings.p   latestMessages.p   whyDataAllspam.p

Proposal: Should we use a database for storing all (or most) of this state data, or should we stick to flat files? Why/not? What do we gain/lose?

(If you were there for the Great Database Debate, assume this is SQLite, because of lack of install/maintenance.)
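To make the SQLite option concrete, here is a minimal sketch of a key/value state store using Python's built-in sqlite3 module. It is not a proposal from the thread: the table layout and function names are assumptions, and values are still pickled so that existing state objects could be stored unchanged.

```python
import pickle
import sqlite3


def open_state_db(path=":memory:"):
    """Open (or create) a SQLite database with one key/value table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state ("
        "name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def save_state(conn, name, value):
    """Store `value` under `name`, replacing any previous entry."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT OR REPLACE INTO state (name, data) VALUES (?, ?)",
            (name, pickle.dumps(value)),
        )


def load_state(conn, name, default=None):
    """Return the value stored under `name`, or `default` if absent."""
    row = conn.execute(
        "SELECT data FROM state WHERE name = ?", (name,)
    ).fetchone()
    return default if row is None else pickle.loads(row[0])
```

One gain over separate pickle files is that each write is transactional, so an interrupted save does not leave a half-written file. A fuller design would probably use real columns instead of pickled blobs so the data can be queried.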
type: feedback wanted
opened by ArtOfCode- 44
findspam.py: bad_ns_for_url_domain()

Identify rogue name server (for now, concentrate on the Indian pharma spammer's favorite namecheaphosting.com) in domain names in URLs.

requirements.txt: pull in dnspython

findspam.py: post_links(): refactor into a separate function so that multiple methods can invoke it.
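The name-server check described in this PR could look roughly like the sketch below. It is not the code from findspam.py: the function signature is hypothetical, and the DNS lookup is passed in as a callable (lookup_ns) so the matching logic can be tried without network access. In practice that callable could wrap a DNS library such as dnspython.

```python
from urllib.parse import urlparse

# Name-server suffix named in the PR description; extend as needed.
BAD_NS_SUFFIXES = ("namecheaphosting.com",)


def bad_ns_for_url(url, lookup_ns, bad_suffixes=BAD_NS_SUFFIXES):
    """Return the rogue name servers serving the host in `url`.

    `lookup_ns` is a hypothetical callable that maps a domain name
    to an iterable of name-server host names.
    """
    host = urlparse(url).hostname
    if not host:
        return []
    bad = []
    for ns in lookup_ns(host):
        # DNS answers are often fully qualified (trailing dot).
        ns = ns.rstrip(".").lower()
        if any(ns == s or ns.endswith("." + s) for s in bad_suffixes):
            bad.append(ns)
    return bad
```

A real lookup would also need to handle hosts such as www.example.com, where NS records usually exist only on the parent zone.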

opened by tripleee 30
Do we need to change our blacklisting guidelines?
When we wrote our blacklisting guidelines in October last year, we set the following requirements:

Website has been used in at least 2 confirmed instances (tp fed back reports) of spam (You can use https://metasmoke.erwaysoftware.com/search to find other instances of a website being used for spam).

Website is not used legitimately in other posts on Stack Exchange.

Website is not currently caught in any of these filters:

- bad keyword in body
- blacklisted website
- pattern matching website

Circumstances have changed since then, and the number of blacklists has grown. With the addition of the !!/blacklist-* commands, over 830 more websites/keywords/usernames have been added to our blacklists. In fact, 106 (!!!) were made in the last five days alone. Many of these websites are already caught by one or two of the reasons specified above.

Considering this, I think we need to have a discussion over whether these guidelines need to be changed to reflect the way we are using, or should be using, blacklists now. What should our new guidelines be?

Do we want to be blacklisting every spammy site that we see? Do we want to leave it to extreme circumstances?

Should we instead focus our time on improving our pattern-matching-* reasons?

Should average autoflag weight of matched posts have anything to do with this?

Should manually reported posts, or posts with only 1 reason, be given extra weight when counting the need for a blacklist?

Are our current guidelines just fine, and do we just need to enforce them more?

Other things we should think about:

If we are going to blacklist everything, do we want to automate it somehow?

What sort of a performance hit does blacklisting make? (I think Art ran some stats on this a while ago, maybe they need to be re-run with the updated codebase)

How much do dormant blacklists clutter the list? Do we need to think about code readability?

Should blacklist entries be removed if they don't have any hits after a certain time?

What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines? Should it have separate ones? Do we need to change the way that it is implemented, to give it 0 weight or not send reports to MS?

What does everyone think about this?
area: blacklists type: feedback wanted type: policy
opened by angussidney 29
Major refactoring of globalvars.py

According to this message, perform major refactoring over globalvars.py.

Detailed changes are in commit messages.

Please don't merge this PR until all refactoring is done.

Note: draft PR won't work as it will not request review, which defeats the point.

opened by user12986714 27
Limit "notify" to users that have been active in the room in the past minutes.

At this moment, in the SO Close Vote Reviewers chatroom, SD is notifying 6 different users for every single report.

Example:

[ SmokeDetector | MS ] Few unique characters in body: SolvedSOLVEDSOLVED by Furkan Ayık on stackoverflow.com (@PraveenKumar @AndrasDeak @πάνταῥεῖ @FrankerZ @tripleee @dorukayhan)

The number of users getting notified has been growing steadily. Imo, it's getting a little annoying. Only a portion of the users actually respond to these notifications, and they get notified even when they haven't been active for hours.

I'd like to request that these notifications be filtered on user activity, i.e., don't bother notifying a user that hasn't been active in the last hour.

Just to be clear: I have no issue with these users. Just that the list of notifications is getting close to the length of the actual report.
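The requested filter can be sketched as follows. This is an illustration only: the function name, the last_active mapping (user name to Unix timestamp of their latest message in the room), and the one-hour default are assumptions, not existing SmokeDetector code.

```python
import time


def users_to_notify(subscribers, last_active, window_seconds=3600, now=None):
    """Return the subscribers seen in the room within `window_seconds`.

    Users with no recorded activity are treated as inactive.
    """
    now = time.time() if now is None else now
    return [
        user for user in subscribers
        if now - last_active.get(user, float("-inf")) <= window_seconds
    ]
```

With this filter, a report would ping only the users who have spoken recently, while the subscription list itself stays unchanged.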
type: feature request

opened by Cerbrus 26
Refactor blacklists; implement !!/watch-ip, !!/watch-ns, !!/watch-asn and !!/blacklist-xx for the first two

Refactor blacklists.py to have a set of classes implementing the various blacklist types, with behavior implemented by way of mixins for the individual classes.

Based on this refactoring, simplify chatcommands.py and extend to support the new blacklist types.

Some related refactoring in gitmanager.py to let the blacklist take care of its own updates, and in findspam.py to correspondingly update how we query for matches.

Finally, simplify globalvars.py by putting all the Github-managed black- and watchlists in a single global dict. This arrangement simplifies handling of the lists as a whole, and finding the correct list based on a keyword like "watch-number".
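The structure described above (list classes composed from mixins, plus a single dict keyed by list name) might look roughly like this. The class names, matching rules, and registry keys are hypothetical and do not reproduce the PR's code; the standard re module is used here for simplicity.

```python
import re


class RegexMatchMixin:
    """Treat entries as case-insensitive regular expressions."""

    def matches(self, text):
        return [e for e in self.entries if re.search(e, text, re.IGNORECASE)]


class ExactMatchMixin:
    """Treat entries as exact, case-insensitive whitespace-separated tokens."""

    def matches(self, text):
        tokens = set(text.lower().split())
        return [e for e in self.entries if e.lower() in tokens]


class BaseList:
    """Shared storage and update behavior for every list type."""

    def __init__(self, entries=()):
        self.entries = list(entries)

    def add(self, entry):
        if entry not in self.entries:
            self.entries.append(entry)


class KeywordWatchlist(RegexMatchMixin, BaseList):
    pass


class IPWatchlist(ExactMatchMixin, BaseList):
    pass


# Single registry keyed by list name, e.g. "watch-ip" for !!/watch-ip.
LISTS = {
    "watch-keyword": KeywordWatchlist(),
    "watch-ip": IPWatchlist(),
}
```

A chat command handler can then look up LISTS[name] and call add() or matches() without knowing which kind of list it is dealing with.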
area: commands status: confirmed

opened by tripleee 25
What do we need in a central blacklist tracker?
As brought up in https://github.com/Charcoal-SE/metasmoke/issues/257, there's a desire to move blacklists out of GitHub and into some other tool. Whether that tool is metasmoke, another app, or some other alien-tech-powered solution, we'll want to know what it should do. Here's my first take:

Store blacklist entries categorized by type ([watch|blacklist] [website, keyword, username], etc.)

Have an easy-to-use way to add entries

From a web UI, instances, and/or metasmoke

Have reasonably good uptime

(?) See what changed since the last update?

Thoughts? Am I missing the mark? What else do we need?
type: feedback wanted status: agreed
opened by Undo1 25
Add command to show flagged posts that have not been deleted yet (or post automatically)
Sometimes spam reports about smaller sites get less attention than they need, either if they're followed by many other reports or if only few people are online at the time.

I would suggest that Smokey should keep a list of all reported posts that got positive feedback or no feedback at all yet and are not yet removed from the site. A command like !!/pending would then show a list of all those reports that still need more flags or feedback. Example:

"Skin care tips" by "SpamUser" on webmasters.stackexchange.com [MS] (reported 12 minutes ago, 1 tp, 0 naa, 0 fp, post score -3)
"Best essay writing service" by "Writer" on graphicdesign.stackexchange.com [MS] (reported 6 minutes ago, no feedback yet, post score -1)

This would be very helpful to make sure no reports slip through and to verify if anything needs more flags after a bunch of reports appeared without having to walk through the links manually.

Additionally, it might be useful to not only post this report on demand but also automatically for posts in the list that were reported more than e.g. 10 minutes ago.
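The selection rule and output format proposed above can be sketched like this. The dict keys and function names are assumptions made for the example, timestamps are in seconds, and the [MS] link from the sample output is omitted.

```python
def pending_reports(reports):
    """Keep reports that are still live and have tp feedback or none at all."""
    pending = []
    for r in reports:
        total_feedback = r["tp"] + r["naa"] + r["fp"]
        if not r["deleted"] and (r["tp"] > 0 or total_feedback == 0):
            pending.append(r)
    return pending


def format_report(r, now):
    """Render one pending report in the style of the example above."""
    minutes = int((now - r["reported_at"]) // 60)
    if r["tp"] + r["naa"] + r["fp"] == 0:
        feedback = "no feedback yet"
    else:
        feedback = "{} tp, {} naa, {} fp".format(r["tp"], r["naa"], r["fp"])
    return '"{}" by "{}" on {} (reported {} minutes ago, {}, post score {})'.format(
        r["title"], r["user"], r["site"], minutes, feedback, r["score"]
    )
```

The automatic variant suggested in the issue would apply the same filter on a timer and post only reports older than the chosen threshold.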
type: feature request status: completed
opened by ByteCommander 25
Relicense under dual MIT/Apache 2.0
We currently don't have a specified license for SmokeDetector - and thus, by default, it is under full copyright. This is very restrictive, and we'd like to change it to something more permissive (in this case, dual licensed under MIT/Apache 2.0). We'll need consent from all contributors to this repository to do so:

[x] @ProgramFOX

[x] @normalhuman

[x] @hichris1234

[x] @Undo1

[x] @Manishearth

[x] @BrockA

[x] @rschrieken

[x] @AWegnerGitHub

[x] @Seth-Johnson

[x] @JC3

[x] @rekire

[x] @durron597

[x] @AstroCB

[x] @Siguza

[x] @ArcticEcho

[x] @K-Guan

[x] @ByteCommander

[x] @TheGuywithTheHat

[x] @apnorton

[x] @michaelb958

[x] @ndrewh

To agree to relicensing, just leave this comment below or otherwise indicate consent:

I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to choose either at their option.

Some more info:

This involves adding the following to the README and including the full text of both licenses in the repository:

## License

Licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

MIT is fairly permissive, so most people prefer it; however, it requires you to include the license in everything that uses the code. Apache doesn't have this issue, but it is incompatible with GPLv2. A dual license lets users choose whichever of the two suits them.
opened by Undo1 24
Should we use type hints?

Is your feature request related to a problem? Please describe.

Python has support for type hints. They have advantages (IDEs and other tools can help users avoid making type-related errors by checking, for one). I'm not aware of any major disadvantages currently (SD supports Python 3.6 and they were added in 3.5, so compatibility isn't a huge concern I think). Overall, I personally tend to like them as having my IDE verify I don't mess up types is handy.

Describe the solution you'd like

Use type hints on new code added to SD (and maybe some existing code, although I don't think that is really a priority).
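For readers unfamiliar with the syntax, here is a small example of the annotation style under discussion. The function is hypothetical and not taken from SmokeDetector; it uses the typing module's List and Optional, which are available on Python 3.6.

```python
from typing import List, Optional


def first_match(keywords: List[str], body: str) -> Optional[str]:
    """Return the first keyword found in `body`, or None if there is none."""
    lowered = body.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            return keyword
    return None
```

The annotations are not enforced at runtime; an IDE or a separate static checker reads them and flags calls such as first_match("loan", 42).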

Describe alternatives you've considered

N/A

Additional context

I asked if they should be used and Makyen said it would be reasonable to open an issue on GH to discuss it
type: feedback wanted

opened by CoconutMacaroon 1
Add per-entry ignore of NS validation failures in CI testing

This adds the ability to ignore NS validation failures during CI testing on a per-entry basis in the *nses.yml files. If an entry in those files has the value ignore_validation_errors: true, then validation errors will not cause CI failures.

This is primarily in order to be able to prevent CI failures due to the persistent intermittent issues with validating dns-parking.com.. This PR adds ignore_validation_errors: true to the dns-parking.com. entry in watched_nses.yml, which will result in not seeing CI testing failures when that domain fails DNS validation.

There probably should be some automated testing which still validates, and fails on, entries which have ignore_validation_errors: true, as we do want to know about such domains if they actually stop having a DNS entry. However, this PR just patches over the persistent frustration of looking at CI failures and finding that it's yet another time that dns-parking.com. had an intermittent problem. No attempt is made here to address the issue of wanting some type of reporting for such domains when they actually go away.
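The per-entry behavior described in this PR can be sketched as below. This is not the PR's test code: the function name, the ns key, and the validate callable are assumptions; only the ignore_validation_errors flag comes from the description.

```python
def failing_entries(entries, validate):
    """Return the name servers whose validation failure should fail CI.

    `entries` is a list of dicts as they might be loaded from a
    *nses.yml file; `validate` is a hypothetical callable returning
    True when the name server validates.
    """
    failures = []
    for entry in entries:
        if validate(entry["ns"]):
            continue
        if entry.get("ignore_validation_errors", False):
            # Known-flaky entry: skip it instead of failing the build.
            continue
        failures.append(entry["ns"])
    return failures
```

A follow-up check, as the description suggests, could run the same validation over only the flagged entries and report the results without failing the build.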
status: confirmed

opened by makyen 3
Wishlist: !!/test-scan or !!/scan-test - non-reporting test of a post

Is your feature request related to a problem? Please describe.

When you !!/scan a message, it will also be reported if it triggers a reason. If that's not what you want, there is no way really to test whether the post will trigger, short of copying it into a local JSON representation and testing locally (which is harder than it needs to be, because different parts of Smoke Detector look at different representations of posts).

Describe the solution you'd like

It would be nice if there was an option to !!/scan, like the opposite of !!/scan-force, where you only want to get a report of the result.

In some ways, this is also similar to the !!/test command, so I'm thinking there should probably be an alias with that prefix.

Describe alternatives you've considered

I wanted to test a new expression I watched, but as outlined above, the only way to really know if it works as intended is to scan an actual post which contains the target phrase. Similarly, the only way to check against false positives is to scan posts which are not supposed to match (typically posts which exhibit a corner case that you want to make sure you have covered adequately).

You can add a test case to tests/test_findspam.py but this again hinges on getting the representation of the post right in the test case. If you guess wrong what the API actually returns for the post you are targeting, the test proves nothing.

Additional context

https://chat.stackexchange.com/transcript/message/61838114#61838114 where I rescanned the wrong post, adding insult to injury.

opened by tripleee 3
Better WebSocket error recovery in EditWatcher and DeletionWatcher

Some of the exceptions logged here indicate we're rebooting sometimes when there's an error on the EditWatcher and/or DeletionWatcher WebSockets. We could use better recovery from WebSocket exceptions in both of these modules (which use similar code, so similar error recovery seems reasonable). We already retry the main WebSocket, so it would be reasonable to have somewhat similar code in EditWatcher and DeletionWatcher.
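One common shape for this kind of recovery is a bounded retry loop with exponential backoff. The sketch below is generic and not the existing retry code for the main WebSocket: the function name and parameters are hypothetical, and it catches OSError (which includes ConnectionError) as a stand-in for whatever exception types the WebSocket library actually raises.

```python
import time


def run_with_reconnect(connect_and_listen, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call `connect_and_listen` until it returns normally, retrying on errors.

    Returns the number of attempts used. The last exception is re-raised
    once `max_attempts` is exhausted, so the caller can still decide to
    reboot as a last resort.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect_and_listen()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the loop easy to test, since a test can record the requested delays instead of waiting.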
type: bug type: feature request area: DeletionWatcher status: confirmed area: EditWatcher

opened by makyen 0
The regex package isn't thread safe (tracking issue)

I ended up taking a look at the code for the regex package. There are some issues in it which make it not thread safe. I've created the issue Thread safety: need lock for both read and write (_cache and other global variables) in that repository. This issue is for tracking it here.
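The fix named in the upstream issue title (a lock covering both reads and writes of a shared cache) follows a standard pattern, sketched below. This is not the regex package's internal code: it uses the standard library's re module as a stand-in, and the cache, lock, and function names are hypothetical.

```python
import re
import threading

_cache = {}
_cache_lock = threading.Lock()


def compile_cached(pattern, flags=0):
    """Compile `pattern`, guarding the shared cache with a lock."""
    key = (pattern, flags)
    # The lock covers both the lookup and the insert, so two threads
    # cannot interleave a read with another thread's write.
    with _cache_lock:
        compiled = _cache.get(key)
        if compiled is None:
            compiled = re.compile(pattern, flags)
            _cache[key] = compiled
    return compiled
```

Holding the lock during compilation keeps the sketch simple at the cost of serializing cache misses; a finer-grained design could compile outside the lock and then insert under it.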
type: bug status: confirmed area: thread safety

opened by makyen 4

Owner

Charcoal

We make nice things that stop spam.

GitHub https://metasmoke.erwaysoftware.com
