4CAT: Capture and Analysis Toolkit

Digital Methods Initiative

Last update: Dec 20, 2022

Overview

4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to make the capture and analysis of data from these platforms accessible to people through a web interface, without requiring any programming or web scraping skills. Our target audience is researchers, students and journalists interested in using Digital Methods in their work.

In 4CAT, you create a dataset from a given platform according to a given set of parameters; the result of this (usually a CSV file containing matching items) can then be downloaded or analysed further with a suite of analytical 'processors', which range from simple frequency charts to more advanced analyses such as the generation and visualisation of word embedding models.
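A downloaded CSV can also be inspected outside 4CAT. The following is a minimal sketch of a simple frequency count using only the Python standard library. The sample data and the column names (`id`, `author`, `body`) are assumptions made for illustration, not a documented 4CAT export schema.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a CSV file downloaded from 4CAT; the column
# names ("id", "author", "body") are assumptions, not a documented schema.
SAMPLE_CSV = """id,author,body
1,alice,first post
2,bob,second post
3,alice,third post
"""


def count_by_column(csv_file, column):
    """Count how often each value occurs in one column of a CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


counts = count_by_column(io.StringIO(SAMPLE_CSV), "author")
for author, n in counts.most_common():
    print(author, n)
```

For a real file, `io.StringIO(SAMPLE_CSV)` would be replaced with `open(path, newline="", encoding="utf-8")`.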

4CAT ships with a growing number of data sources corresponding to popular platforms, and you can add further data sources using 4CAT's Python API. The following data sources are currently actively supported:

4chan
8kun
Bitchute
Parler
Reddit
Telegram
Twitter API (Academic and regular tracks)

The following platforms are supported through other tools, from which you can import data into 4CAT for analysis:

Facebook (via CrowdTangle exports)
Instagram (via CrowdTangle)
TikTok (via tiktok-scraper)

A number of other platforms have built-in support that is untested, or requires e.g. special API access. You can view the full list of data sources in the GitHub repository.

Install

You can install 4CAT locally or on a server via Docker or manually. The usual

docker-compose up

will work, but detailed and alternative installation instructions are available in our wiki. Currently 4chan, 8chan, and 8kun require additional steps; please see the wiki.

Please check our issues and create one if you experience any problems (pull requests are also very welcome).

Components

4CAT consists of several components, each in a separate folder:

backend: A standalone daemon that collects and processes data, as queued via the tool's web interface or API.
webtool: A Flask app that provides a web front-end for searching and analysing the stored data.
common: Assets and libraries.
datasources: Data source definitions. Each definition is a set of configuration options, database definitions and Python scripts for processing the data. If you want to set up your own data sources, refer to the wiki.
processors: A collection of data processing scripts that can plug into 4CAT and manipulate or process datasets created with 4CAT. There is an API you can use to make your own processors.

Credits & License

4CAT was created at OILab and the Digital Methods Initiative at the University of Amsterdam. The tool was inspired by TCAT, a tool with comparable functionality that can be used to scrape and analyse Twitter data.

4CAT development is supported by the Dutch PDI-SSH foundation through the CAT4SMR project.

4CAT is licensed under the Mozilla Public License, 2.0. Refer to the LICENSE file for more information.

Comments

Allow autologin to _always_ work (or perhaps disable login?)

I am running a 4cat server in Docker, with an apache2 reverse proxy in front. It works fine except for one small thing.

MYSERVER.domain hosts my apache proxy.

In settings -> Flask settings I have: Auto-login name = MYSERVER.domain

However, when I access 4cat through the proxy, I don't want to be met with a login page. I just want to be inside. I was thinking that Auto-login name would whitelist hosts so they could bypass login?
enhancement

opened by anderscollstrup 21

Docker swarm server: Cannot make flask frontend work and login (not using default docker-compose) flask overwriting settings values in database

Hi, I have 4cat running in a docker swarm server. After slightly modifying the compose file to make it compatible with docker swarm, and adjusting the environment variables, I got it running, but I cannot log in. I see this is a security feature of Flask. I have read https://github.com/digitalmethodsinitiative/4cat/issues/269 and it is also related to issue https://github.com/digitalmethodsinitiative/4cat/issues/272. I cannot find the whitelist or where it is, since there is no config.py any more.

Here is a dump of the settings table in my postgresql database; maybe it is relevant.

DATASOURCES               | {"bitchute": {}, "custom": {}, "douban": {}, "customimport": {}, "parler": {}, "reddit": {"boards": "*"}, "telegram": {}, "twitterv2": {"id_lookup": false}}
 4cat.name                 | "4CAT"
 4cat.name_long            | "4CAT: Capture and Analysis Toolkit"
 4cat.github_url           | "https://github.com/digitalmethodsinitiative/4cat"
 path.versionfile          | ".git-checked-out"
 expire.timeout            | 0
 expire.allow_optout       | true
 logging.slack.level       | "WARNING"
 logging.slack.webhook     | null
 mail.admin_email          | null
 mail.host                 | null
 mail.ssl                  | false
 mail.username             | null
 mail.password             | null
 mail.noreply              | "noreply@localhost"
 SCRAPE_TIMEOUT            | 5
 SCRAPE_PROXIES            | {"http": []}
 IMAGE_INTERVAL            | 3600
 explorer.max_posts        | 100000
 flask.flask_app           | "webtool/fourcat"
 flask.secret_key          | "2e3037b7533c100f324e472a"
 flask.https               | false
 flask.autologin.name      | "Automatic login"
 flask.autologin.api       | ["localhost", "4cat.coraldigital.mx", "\"4cat.coraldigital.mx\"", "51.81.52.207", "0.0.0.0"]
 flask.server_name         | ""
 flask.autologin.hostnames | ["*"]

docker issue

opened by hydrosIII 17

Cannot make flask frontend work
Backend is running:

root@my-4cat-server:/usr/local/4cat#
root@my-4cat-server:/usr/local/4cat# ps -ef | grep python
root 497 1 0 10:36 ? 00:00:02 /usr/bin/python3 /usr/bin/fail2ban-server -xf start
root 516 1 0 10:36 ? 00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
4cat 18989 1 59 12:39 ? 00:00:01 /usr/bin/python3 4cat-daemon.py start
root 19008 891 0 12:39 pts/0 00:00:00 grep python
root@my-4cat-server:/usr/local/4cat#

root@my-4cat-server:/usr/local/4cat#
root@my-4cat-server:/usr/local/4cat# pip install python-dotenv
Collecting python-dotenv
Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-0.20.0
root@my-4cat-server:/usr/local/4cat#
root@my-4cat-server:/usr/local/4cat# FLASK_APP=webtool flask run --host=0.0.0.0

Serving Flask app "webtool"
Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
Debug mode: off
Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
/usr/local/lib/python3.9/dist-packages/flask/sessions.py:208: UserWarning: "localhost" is not a valid cookie domain, it must contain a ".". Add an entry to your hosts file, for example "localhost.localdomain", and use that instead.
  warnings.warn(
MY PC IP - - [10/Jun/2022 12:36:54] "GET / HTTP/1.1" 404 -

And I get 404 in my browser when I point to http://server_ip:5000

4cat is installed using this guide, following the "Install 4cat manually" section: https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT
docker issue
opened by anderscollstrup 17
Issue with migrate.py preventing me from running 4cat or accessing web interface
Hello, thanks for making this tool available. I'd be grateful for any tips: I'm getting an 'EOFError: EOF when reading a line' message when I run docker-compose up. I'm using Windows 10 Home. I initially tried to install 4cat manually to scrape 4chan, but I couldn't get it to work so I uninstalled and then tried to install through Docker.

I'm using Windows Powershell to run the command because when I run docker-compose up in Ubuntu 20.04 LTS I'm getting this message:

'The command 'docker-compose' could not be found in this WSL 2 distro. We recommend to activate the WSL integration in Docker Desktop settings.

See https://docs.docker.com/desktop/windows/wsl/ for details.'

The WSL integration is activated in Docker Desktop settings by default. Could it be because I didn't bind-mount the folder I'm storing 4cat in to the Linux file system? I skipped that step and just stored 4cat in /c/users/myusername/ on Windows.

This is the message I get when I run docker-compose up command from Powershell:

PS C:\users\myusername\4cat> docker-compose up
[+] Running 2/2
Container cat_db_1 Running 0.0s
Container api Recreated 0.7s
Attaching to api, db_1
api | Waiting for postgres...
api | PostgreSQL started
api | 1
api | Seed present
api | Starting app
api | Running migrations
api |
api | 4CAT migration agent
api | ------------------------------------------
api | Current 4CAT version: 1.9
api | Checked out version: 1.16
api | The following migration scripts will be run:
api | migrate-1.9-1.10.py
api | migrate-1.10-1.11.py
api | migrate-1.11-1.12.py
api | migrate-1.12-1.13.py
api | migrate-1.13-1.14.py
api | migrate-1.14-1.15.py
api | WARNING: Migration can take quite a while. 4CAT will not be available during migration.
api | If 4CAT is still running, it will be shut down now.
api | Do you want to continue [y/n]? Traceback (most recent call last):
api | File "helper-scripts/migrate.py", line 142, in
api | if not args.yes and input("").lower() != "y":
api | EOFError: EOF when reading a line
api exited with code 1
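The traceback above shows `input()` raising `EOFError`, which is what Python does when a prompt is read from a closed or non-interactive standard input, as is common inside a container. The sketch below shows one way a confirmation prompt could tolerate that. It is not the code of `migrate.py`. The `confirm` helper and the `--yes` flag name are assumptions for illustration (the traceback only shows that an `args.yes` attribute exists).

```python
import argparse


def confirm(prompt, assume_yes=False, read=input):
    """Return True if the user agrees, without crashing when stdin is closed."""
    if assume_yes:
        return True
    try:
        return read(prompt).strip().lower() == "y"
    except EOFError:
        # No terminal attached (for example a container without a TTY):
        # treat the missing answer as "no" instead of raising.
        return False


def parse_args(argv=None):
    # Hypothetical command-line interface; the flag name is an assumption.
    parser = argparse.ArgumentParser(description="Hypothetical migration runner")
    parser.add_argument("--yes", "-y", action="store_true",
                        help="answer 'yes' to all prompts")
    return parser.parse_args(argv)


# Usage: with --yes the prompt is skipped entirely.
args = parse_args(["--yes"])
if confirm("Do you want to continue [y/n]? ", assume_yes=args.yes):
    print("Running migrations...")
else:
    print("Migration cancelled.")
```

Treating a missing answer as "no" is the conservative choice here: a migration is only started when consent is explicit, either interactively or via the flag.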
opened by robbydigital 15
Unknown local index '4chan_posts' in search request
We managed to overcome our previous issue thanks to your advice. However, we are now stuck with an error related to the indexes, which appears whenever we query 4chan.

First we generated the sphinx.conf using helper_script/generate_sphinx_config.py. This results in the following indexes:

[...]

/* Indexes */

index 4cat_index {
    min_infix_len = 3
    html_strip = 1
    type = template
    charset_table = 0..9, a..z, _, A..Z->a..z, U+47, U+58, U+40, U+41, U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c,$
}

index 4chan_posts : 4cat_index {
    type = plain
    source = 4chan_posts_old
    path = /opt/sphinx/data/4chan_posts
}

index 4chan_posts : 4cat_index {
    type = plain
    source = 4chan_posts_new
    path = /opt/sphinx/data/4chan_posts
}

[...]

However, starting sphinx with this setup results in the following error:

Mar 16 11:48:44 dev sphinxsearch[505]: ERROR: section '4chan_posts' (type='index') already exists in /etc/sphinxsearch/sphinx.conf line 51 col 19.

I then attempted to comment out one of the indexes and/or change the path, which allows sphinx to start. However, another error then appears once a collection has been initiated:

16-03-2020 11:50:54 | ERROR (threading.py:884): Sphinx crash during query deb9cfe3e0a47d56612fd6e453208ed6: (1064, "unknown local index '4chan_posts' in search request\x00")

Hope you can once again help me figure out how the indexes should be set.
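The first error reported above is caused by two `index` sections sharing one name. A small check like the following could flag such duplicates before the search daemon is started. This is a sketch that only looks at `index NAME` headers with a regular expression. It is not a full parser for the configuration format, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the start of a section such as "index 4chan_posts : 4cat_index {"
INDEX_HEADER = re.compile(r"^\s*index\s+([^\s:{]+)", re.MULTILINE)


def duplicate_index_names(config_text):
    """Return the sorted names of index sections defined more than once."""
    names = INDEX_HEADER.findall(config_text)
    return sorted(name for name, n in Counter(names).items() if n > 1)


SAMPLE = """
index 4cat_index { type = template }
index 4chan_posts : 4cat_index { type = plain source = 4chan_posts_old }
index 4chan_posts : 4cat_index { type = plain source = 4chan_posts_new }
"""

print(duplicate_index_names(SAMPLE))
```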
opened by bornakke 12
Installing problem: frontend failed to run with 'docker-compose up' command

When running the command docker-compose up, the database and backend components start fine, but the frontend component never finishes starting and always gets stuck at "[INFO] Booting worker with pid: 12". The problem persists after restarting the frontend component in the Docker UI.
docker issue

opened by baiyuan523 11

Error "string indices must be integers" from search_twitter.py:403

From our 4cat.log

21-09-2021 10:48:11 | INFO (processor.py:890): Running processor count-posts on dataset a5eeaf86aa27ff91f212d35880090d70
21-09-2021 10:48:11 | INFO (processor.py:890): Running processor attribute-frequencies on dataset 659e224c54209146f7551523e8d26633
21-09-2021 10:48:11 | ERROR (worker.py:890): Processor count-posts raised TypeError while processing dataset a5eeaf86aa27ff91f212d35880090d70 (via 76e33804acca3ac18d3cfa8de8059780) in count_posts.py:59->processor.py:316->search_twitter.py:403:
   string indices must be integers

21-09-2021 10:48:11 | ERROR (worker.py:890): Processor attribute-frequencies raised TypeError while processing dataset 659e224c54209146f7551523e8d26633 (via 01db05ce10f58b320a397d68b61986a2) in rank_attribute.py:132->processor.py:316->search_twitter.py:403:
   string indices must be integers

The line in question is from SearchWithTwitterAPIv2.map_item() https://github.com/digitalmethodsinitiative/4cat/blob/f0e01fb500b7dafb58a05873cf34bf15e288a88c/datasources/twitterv2/search_twitter.py#L403

and I haven't found a good way to bring 4CAT under a debugger and/or inform me of an ID for the violating tweet.

Could this be related to #169 ?
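"string indices must be integers" is the `TypeError` Python raises when code indexes a string with a key, for example `item["id"]` where `item` is a `str` rather than a `dict`. One way to find the offending record without a debugger is to wrap the mapping step and collect failures together with their position and ID. The sketch below uses a hypothetical `map_item` stand-in and made-up items. It is not the code of `search_twitter.py`.

```python
def map_item(item):
    """Hypothetical stand-in for a mapper that expects a dict-like item."""
    return {"id": item["id"], "body": item["text"]}


def map_items_safely(items, mapper):
    """Map items one by one, collecting failures instead of aborting."""
    mapped, failures = [], []
    for position, item in enumerate(items):
        try:
            mapped.append(mapper(item))
        except (TypeError, KeyError) as error:
            # Keep enough context to locate the offending record later.
            item_id = item.get("id") if isinstance(item, dict) else None
            failures.append((position, item_id, repr(error)))
    return mapped, failures


items = [
    {"id": "1", "text": "a well-formed item"},
    "a bare string instead of an object",  # item["id"] raises TypeError
    {"id": "3"},                           # missing "text" raises KeyError
]

mapped, failures = map_items_safely(items, map_item)
for failure in failures:
    print(failure)
```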

opened by xmacex 10

AttributeError: 'Namespace' object has no attribute 'release'

Fresh installation on Mac with Docker from local files. Any idea what I did wrong?

4cat_backend:

Waiting for postgres...
PostgreSQL started
Database already created

Traceback (most recent call last):
  File "helper-scripts/migrate.py", line 66, in
    if args.release:
AttributeError: 'Namespace' object has no attribute 'release'

4cat_backend EXITED (1)
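This kind of `AttributeError` occurs when code reads an attribute from an `argparse.Namespace` that the parser never defined, for instance when a script and its argument definitions come from different versions. The sketch below reproduces the situation and shows two generic ways around it. It is an illustration under that assumption, not a description of how the project addressed the issue.

```python
import argparse

# A parser that defines --yes but not --release.
parser = argparse.ArgumentParser()
parser.add_argument("--yes", action="store_true")
args = parser.parse_args([])

# Reading args.release here would raise:
#   AttributeError: 'Namespace' object has no attribute 'release'
# Option 1: read the attribute defensively with a default.
release = getattr(args, "release", False)
print(release)

# Option 2: define the argument, so the attribute always exists.
parser.add_argument("--release", action="store_true")
args = parser.parse_args(["--release"])
print(args.release)
```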
bug deployment

opened by psegovias 9
Docker setup fails to "import config" on macOS Big Sur (M1)
Discussed in https://github.com/digitalmethodsinitiative/4cat/discussions/191

Originally posted by p-charis October 25, 2021

Hey everyone! First, thanks a million to the developers for building this & making it available :)

Now, I managed to get 4CAT working on a macOS (latest version-M1 native) but only after I removed the following lines from the docker-setup.py file (line #36 onwards). Without these lines the installation wouldn't work as it returned the error that no module named config was found. I suspect it might have sth to do with the way that Docker runs on macOS generally and the paths it creates, but I haven't figured it out yet. So, I just wanted to let the Devs know, as well as other macOS users that, if they've had a similar problem, they could try this workaround.

# Ensure filepaths exist
import config
for path in [config.PATH_DATA,
             config.PATH_IMAGES,
             config.PATH_LOGS,
             config.PATH_LOCKFILE,
             config.PATH_SESSIONS,
             ]:
    if Path(config.PATH_ROOT, path).is_dir():
        pass
    else:
        os.makedirs(Path(config.PATH_ROOT, path))
bug docker issue
opened by p-charis 8
Tokeniser exclusion list ignores last word in list

I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list there are still 6,307 instances of the word. So it's getting most but not all. I'm having the same issue with some common swear words that I'm trying to filter out: most are gone, but some remain. Is there a reason for this?

Thanks for any insight!
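The cause of the behaviour described above is not established here. As a general illustration, partial matches like this can arise when the reject list and the tokens are compared without normalisation, so that differences in case, surrounding punctuation, or whitespace left after splitting the list prevent a match. The sketch below normalises both sides before comparing. All function names are hypothetical and this is not the Tokenise processor's code.

```python
import re


def normalise(word):
    """Lower-case a word and strip surrounding whitespace and punctuation."""
    return re.sub(r"^\W+|\W+$", "", word.strip().lower())


def parse_reject_list(raw):
    """Split a comma-separated reject list, dropping empty entries."""
    return {normalise(part) for part in raw.split(",") if normalise(part)}


def filter_tokens(tokens, reject):
    """Keep only the tokens whose normalised form is not rejected."""
    return [token for token in tokens if normalise(token) not in reject]


# Note the stray spaces and the trailing newline after the last word.
reject = parse_reject_list("foo, Bar ,baz\n")
tokens = ["Foo", "bar!", "keep", "baz", "BAZ."]
print(filter_tokens(tokens, reject))
```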

opened by robbydigital 6
Datasource that interfaces with a TCAT instance
It works, and arguably fixes #117, but:

The form looks hideous with the million query fields. Do we need them all for 4CAT? Is there a way to make it look better?

The list of bins displayed in the 'create dataset' form simply lists bins from all instances. This can get really long really fast when supporting multiple instances. A custom form control may be necessary to make this user-friendly.

The list of bins is loaded synchronously whenever get_options() is run. The result should probably be cached or updated in the background (with a separate worker...?)

The data format now follows that of twitterv2's map_item(), but there is quite a bit more data in the TCAT export that we could include.
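For the point about the bin list being loaded synchronously, one simple option is a time-based cache around the loader. The sketch below is a generic illustration. `CachedValue` and `load_bins` are hypothetical names, and the injectable clock exists only to make the example deterministic.

```python
import time


class CachedValue:
    """Cache the result of a loader function for a fixed number of seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Only hit the slow source when there is no fresh value.
            self._value = self._loader()
            self._loaded_at = now
        return self._value


calls = []


def load_bins():
    """Hypothetical slow call that lists the bins of a remote instance."""
    calls.append(1)
    return ["bin-a", "bin-b"]


fake_now = [0.0]
bins = CachedValue(load_bins, ttl_seconds=300, clock=lambda: fake_now[0])

bins.get()
bins.get()            # served from the cache, loader is not called again
fake_now[0] = 301.0
bins.get()            # cache expired, loader runs a second time
print(len(calls))
```

A background worker that refreshes the value would avoid even the occasional slow request, at the cost of more moving parts.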
opened by stijn-uva 6
Update 'FAQ' and 'About' pages

The 'About' page should probably refer to documentation and guides etc rather than the 'news' thing it's doing now, and the FAQ is still very 4chan-oriented.
enhancement (mostly) front-end

opened by stijn-uva 0
Feature request: allow data from linked telegram chat channels to be collected

Telegram chats have linked "discussion" channels, where users can respond to messages in the main channel. Occasionally, these are also public, and if so, can also be found by the API. It would be useful to allow users to also automatically collect data from these chat channels if they're found.

A note on this and future feature requests: we're (https://github.com/GateNLP) putting in some additions to the telegram data collector on our end. Thought it might be worth checking if there's scope for them to be added to the original/main instance.

If any issues with this/they don't really fit with what you have in mind for your instance, all fine, we'll continue to maintain them on our own fork instead!

Linked pull request: https://github.com/digitalmethodsinitiative/4cat/pull/322
enhancement data source

opened by muneerahp 1
LIHKG data source

A data source for LIHKG. It uses the web interface's web API, which seems reasonably straightforward and stable. There is some rate limiting, which 4CAT tries to respect by pacing requests and implementing an exponential backoff.
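Exponential backoff means doubling the waiting time after each rate-limited attempt, up to some ceiling. The following is a generic sketch of that technique, not the data source's actual code. The `RateLimited` exception, the `fetch` callable and all parameter values are assumptions for illustration.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch function when asked to slow down."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, jitter=random.random):
    """Call fetch(), waiting base_delay * 2**attempt between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter avoids many clients retrying in lockstep.
            sleep(delay + jitter() * base_delay)


# Demonstration with a fetch function that fails twice, then succeeds.
attempts = []
waits = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return "payload"


result = fetch_with_backoff(flaky_fetch, sleep=waits.append, jitter=lambda: 0.0)
print(result, waits)
```

In real use, `sleep` and `jitter` keep their defaults. They are replaced here only so that the demonstration runs instantly and deterministically.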
enhancement data source questionable

opened by stijn-uva 0
ability to count frequency for specific (and multiple) keywords over time

a processor that can filter on multiple particular words or phrases within a dataset, and outputs the count values (overall, or over time) per item, outputting a .csv that can be imported into raw graphs to compare the evolution of different words/phrases over time, either in absolute or in relative numbers.
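To make the request concrete, here is a minimal sketch of such a count, written as a standalone script rather than as a 4CAT processor. The column names (`timestamp`, `body`), the sample rows and the per-month grouping are assumptions. It counts the items in which each keyword occurs as a case-insensitive substring.

```python
import csv
import io
from collections import Counter

# Hypothetical dataset; real exports may use different column names.
SAMPLE = """timestamp,body
2022-01-03 10:00:00,cats are great
2022-01-15 12:30:00,Dogs and cats
2022-02-01 09:00:00,only dogs here
"""


def keyword_counts_per_month(rows, keywords):
    """Count, per month, the items whose body contains each keyword."""
    counts = Counter()
    for row in rows:
        month = row["timestamp"][:7]  # "YYYY-MM"
        text = row["body"].lower()
        for keyword in keywords:
            if keyword.lower() in text:
                counts[(month, keyword)] += 1
    return counts


rows = csv.DictReader(io.StringIO(SAMPLE))
counts = keyword_counts_per_month(rows, ["cats", "dogs"])

# Write the result as CSV, one row per month and keyword.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["month", "keyword", "items"])
for (month, keyword), n in sorted(counts.items()):
    writer.writerow([month, keyword, n])
print(out.getvalue())
```

Relative numbers could be derived by also counting all items per month and dividing each keyword count by that total.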
processors data source

opened by daniel-dezeeuw 0
Warn about need to update Docker `.env` file when upgrading 4CAT to new version

When using Docker, the .env file can be used to ensure you pull a particular version of 4CAT. If you then upgrade 4CAT interactively, we cannot modify the .env file (which exists on the user's host machine). If a user then removes or rebuilds 4CAT, it will pull the version listed in the .env file, which will not be the version that 4CAT was last upgraded to.

I will look at adding a warning/notification to the upgrade logs to notify users of the need to update their .env file.
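A check of this kind could look roughly like the sketch below, which parses simple `KEY=value` lines and compares a pinned version with the running one. The key name `PINNED_VERSION`, the sample file contents and the function names are hypothetical. The actual variable used in 4CAT's .env file is not stated here.

```python
def parse_env(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def env_version_warning(env_text, running_version, key="PINNED_VERSION"):
    """Return a warning if the .env file pins a different version, else None."""
    pinned = parse_env(env_text).get(key)
    if pinned is None or pinned == running_version:
        return None
    return (f".env pins version {pinned}, but version {running_version} is "
            f"running; update {key} in .env to keep this version on rebuild.")


# Hypothetical .env contents; the key name is an assumption.
SAMPLE_ENV = """
# version to pull when the containers are (re)built
PINNED_VERSION=1.26
"""

warning = env_version_warning(SAMPLE_ENV, "1.29")
print(warning)
```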
enhancement deployment

opened by dale-wahl 0

Releases(v1.29)

v1.29(Oct 6, 2022)
Snapshot of 4CAT as of October 2022. Many changes and fixes since the last official release, including:

Restart and upgrade 4CAT via the web interface (#181, #287, #288)

Addition of several processors for Twitter datasets to increase inter-operability with DMI-TCAT

DMI-TCAT data source, which can interface with a DMI-TCAT instance to create datasets from tweets stored therein (#226)

LinkedIn data source, to be used together with Zeeschuimer

Fixes & improvements to Docker container set-up and build process (#269, #270, #290)

A number of processors have been updated to transparently filter NDJSON datasets instead of turning them into CSV datasets (#253, #282, #291, #292)

And many smaller fixes & updates

From this release onwards, 4CAT can be upgraded to the latest release via the Control Panel in the web interface.
v1.26(May 10, 2022)
Many updates:

Configuration is now stored in the database and (mostly) editable via the web GUI

The Telegram datasource now collects more data and stores the 'raw' message objects as NDJSON

Dialogs in the web UI now use custom widgets instead of alert()

Twitter datasets will retrieve the expected amount of tweets before capturing and ask for confirmation if it is a high number

Various fixes and tweaks to the Dockerfiles

New extended data source information pages with details about limitations, caveats, useful links, etc

And much more

v1.25(Feb 24, 2022)
Snapshot of 4CAT as of 24 February 2022. Many changes and fixes since the last official release, including:

Explore and annotate your datasets interactively with the new Explorer (beta)

Datasets can be set to automatically get deleted after a set amount of time, and can be made private

Incremental refinement of the web interface

Twitter datasets can be exported to a DMI-TCAT instance

User accounts can now be deactivated (banned)

Many smaller fixes and new features

v1.21(Sep 28, 2021)
Snapshot of 4CAT as of 28 September 2021. Many changes and fixes since the last official release, including:

User management via control panel

Improved Docker support

Improved 4chan data dump import helper scripts

Improved country code filtering for 4chan/pol/ datasets

More robust and versatile network analysis processors

Various new filter processors

Topic modeling processor

Support for non-academic Twitter API queries

Option to download NDJSON datasets as CSV

Support for hosting 4CAT with a non-root URL

And many more

v1.18a(May 7, 2021)

A release to trigger publication on Zenodo.
v1.17(Apr 8, 2021)

Tagging 4CAT at 1.17 because the previous release was super mega outdated
v1.9b1(Jan 17, 2020)

First public release, licensed under the MPL 2.0
v1.0b1(Feb 28, 2019)
4CAT is now ready for wider use! It offers...

An API that can be used to queue and manipulate queries programmatically

Diverse analytical post-processors that may be combined to further analyse data sets

A flexible interface for adding various data sources

A robust scraper

A very retro interface


Owner

Digital Methods Initiative

The Digital Methods Initiative (DMI) is one of Europe's leading Internet Studies research groups. Research tools it develops are collected here.

