https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions
Web Studies companion datasources and processors for 4CAT
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions
- Owner: digitalmethodsinitiative
- License: mpl-2.0
- Created: 2024-08-29T09:30:45.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-12-03T16:47:52.000Z (6 months ago)
- Last Synced: 2025-12-06T21:40:36.112Z (6 months ago)
- Language: Python
- Size: 374 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Web Studies: a 4CAT Extension
Web Studies is a companion extension to the [4CAT Capture and Analysis Toolkit](https://github.com/digitalmethodsinitiative/4cat?tab=readme-ov-file#-4cat-capture-and-analysis-toolkit). It adds functionality to 4CAT by using [Selenium](https://www.selenium.dev/) together with a [Firefox browser](https://www.mozilla.org/en-US/firefox/) to collect data from web sources.
# Features
## New datasources
### General web studies
- Selenium URL Collector
  - Collect HTML, text, and links from a list of URLs
- Web Archive Collector
  - Use [Web Archive's Wayback Machine](https://web.archive.org/) to collect archives of a URL over time
- Screenshot Generator
  - Take screenshots of web pages
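
To give a sense of the kind of output the Selenium URL Collector describes (text and links taken from a page's HTML), here is a minimal sketch using only Python's standard library. It is not the extension's implementation, which drives Firefox through Selenium; the class name `LinkAndTextExtractor` and the helper `extract` are hypothetical, and the sketch works on an HTML string that has already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextExtractor(HTMLParser):
    """Collect visible text and absolute link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return the page text and the list of links found in `html`."""
    parser = LinkAndTextExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "links": parser.links}
```

Calling `extract(page_source, url)` on each collected page would yield one record per URL. A real browser-driven collector also sees content rendered by JavaScript, which this static sketch does not.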
### App store studies
- Apple Store
  - Collect data on [Apple's](https://www.apple.com) apps
- Google Store
  - Collect data on [Google's](https://play.google.com/store/apps) apps
### Cloud app store studies
- Microsoft Azure App Store
  - Collect data on [Microsoft Azure](https://azuremarketplace.microsoft.com/en-US/) applications
- Amazon Web Services (AWS) Marketplace
  - Collect data on [AWS](https://aws.amazon.com/marketplace) applications
## New analysis processors
- Take screenshots of any column containing URLs
- Detect trackers
  - Provide a list of source code snippets to search for in collected HTML
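
The tracker detection idea (searching collected HTML for known source code snippets) can be sketched as a plain substring search. This is an illustration only, not the processor's actual code: the function `detect_trackers` and the `TRACKER_SIGNATURES` mapping are hypothetical, and the signature strings are just examples of the kind of snippet a user might supply.

```python
# Hypothetical signature list: tracker name -> snippets to look for.
# In practice the list of snippets would be supplied by the user.
TRACKER_SIGNATURES = {
    "Google Analytics": [
        "google-analytics.com/analytics.js",
        "googletagmanager.com/gtag/js",
    ],
    "Meta Pixel": ["connect.facebook.net"],
}


def detect_trackers(html, signatures=TRACKER_SIGNATURES):
    """Return {tracker name: [matched snippets]} for snippets present in html."""
    lowered = html.lower()
    found = {}
    for name, snippets in signatures.items():
        # Case-insensitive substring match of each snippet against the page.
        hits = [snippet for snippet in snippets if snippet.lower() in lowered]
        if hits:
            found[name] = hits
    return found
```

Substring matching is simple and transparent, but it can miss obfuscated or dynamically injected scripts, so results are best read as a lower bound.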
# Installation
These extensions are designed to work with [4CAT v1.46](https://github.com/digitalmethodsinitiative/4cat/releases/tag/v1.46) or later.
For instructions on adding the "I do not care about cookies" browser extension, [see below](https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions/blob/main/README.md#browser-extensions).
## Docker installation
1. Download/clone the extensions into both the 4CAT backend and frontend containers
   - `docker exec 4cat_backend git clone https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git extensions/web_studies/`
   - `docker exec 4cat_frontend git clone https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git extensions/web_studies/`
2. Restart the 4CAT containers
   - Run `docker compose restart` from the 4CAT directory where the `docker-compose.yml` and `.env` files were previously downloaded
   - This will automatically install the necessary dependencies, Firefox, and Geckodriver
3. Activate the desired new datasources from the 4CAT Control Panel
   - Control Panel -> Settings -> Data sources
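
For repeated setups, the first two Docker steps above could be scripted. The sketch below only wraps the commands already listed; the function names `build_install_commands` and `run_install` are hypothetical, and it assumes the container names `4cat_backend` and `4cat_frontend` used in this README.

```python
import subprocess

REPO_URL = "https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git"
TARGET_DIR = "extensions/web_studies/"
CONTAINERS = ("4cat_backend", "4cat_frontend")


def build_install_commands():
    """Return the clone commands for each container, then the restart command."""
    commands = [
        ["docker", "exec", container, "git", "clone", REPO_URL, TARGET_DIR]
        for container in CONTAINERS
    ]
    commands.append(["docker", "compose", "restart"])
    return commands


def run_install(fourcat_dir):
    """Run each command from the directory that holds docker-compose.yml."""
    for command in build_install_commands():
        # check=True stops at the first failing command.
        subprocess.run(command, cwd=fourcat_dir, check=True)
```

Calling `run_install("/path/to/4cat")` would execute the three commands in order. Activating the datasources in the Control Panel remains a manual step.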
## Direct/manual installation
1. Download or clone this repository and copy the folders into the `extensions` folder in your 4CAT directory
   - `git clone https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git extensions/web_studies/`
2. Run 4CAT's migrate script to install the necessary packages
   - `python helper-scripts/migrate.py`
   - Note: `fourcat_install.py` is only designed to run on Linux systems. On other systems you will need to set up the following yourself:
     - Install the Python packages from `requirements.txt`
     - Download Firefox
     - Download the Geckodriver release compatible with that version of Firefox (https://github.com/mozilla/geckodriver/releases/)
     - Adjust the settings in the 4CAT interface via `Control Panel -> Settings -> selenium` to point to the Firefox and Geckodriver programs
3. Activate the desired datasources from the 4CAT Control Panel
   - Control Panel -> Settings -> Data sources
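
When setting up Firefox and Geckodriver by hand, it can help to check where (or whether) the two programs are found before entering their paths in the selenium settings. The following is a minimal sketch using the standard library; the helper names `locate_browser_tools` and `missing_tools` are hypothetical, and it assumes the executables are called `firefox` and `geckodriver`, which may differ on your system.

```python
import shutil


def locate_browser_tools(names=("firefox", "geckodriver")):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def missing_tools(located):
    """List the names for which no executable was found."""
    return [name for name, path in located.items() if path is None]
```

A result of `None` only means the program is not on `PATH`; on macOS and Windows, Firefox is often installed elsewhere, in which case the full path can be entered in the 4CAT settings directly.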
## Browser extensions
Some datasources/processors can make use of a [Firefox extension that removes cookie warnings](https://addons.mozilla.org/en-US/firefox/addon/istilldontcareaboutcookies/). To install:
### Docker
1. Download the extension into the backend container (the most recent version is available at the link above)
   - `docker exec 4cat_backend wget https://addons.mozilla.org/firefox/downloads/file/4216095/istilldontcareaboutcookies-1.1.4.xpi`
2. Enable the extension in the 4CAT Control Panel
   - Control Panel -> Settings -> selenium
   - Update "Firefox Extensions" by setting `path` to the downloaded filename
     - e.g. `{"i_dont_care_about_cookies": {"path": "istilldontcareaboutcookies-1.1.4.xpi", "always_enabled": false}}`
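
Because the "Firefox Extensions" setting is a JSON value, a small helper can produce it without quoting mistakes when the extension version changes. This is a sketch under the assumption that the setting has exactly the shape shown in the example above; the function `firefox_extension_setting` is hypothetical and not part of 4CAT.

```python
import json


def firefox_extension_setting(key, xpi_path, always_enabled=False):
    """Build the JSON text for one entry of the "Firefox Extensions" setting."""
    if not xpi_path.endswith(".xpi"):
        raise ValueError("expected a path to an .xpi file")
    # json.dumps renders Python's False as the JSON literal false.
    return json.dumps({key: {"path": xpi_path, "always_enabled": always_enabled}})
```

For example, `firefox_extension_setting("i_dont_care_about_cookies", "istilldontcareaboutcookies-1.1.4.xpi")` returns the same text as the example value above, ready to paste into the settings field.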