Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/filipinascimento/coordinationz

Host: GitHub
URL: https://github.com/filipinascimento/coordinationz
Owner: filipinascimento
License: mit
Created: 2024-04-29T19:50:54.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-11-14T19:33:58.000Z (about 1 month ago)
Last Synced: 2024-11-14T20:11:58.488Z (about 1 month ago)
Language: Jupyter Notebook
Size: 39.8 MB
Stars: 0
Watchers: 4
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# coordinationz
A collection of scripts and a package for analyzing coordination in social media data.

To install the package, download the git repository and run the following command in the root directory:
```bash
pip install .
```

To install the package in development mode, run the following commands in the root directory:
```bash
pip install meson-python ninja numpy
pip install --no-build-isolation -e .
```

For debug mode, use the following command for local installation:
```bash
pip install --no-build-isolation -U -e . -Csetup-args=-Dbuildtype=debug
```
To debug the C code, use gdb:
```bash
gdb -ex=run -args python
```

## Run for INCAS datasets (e.g., phase2a or phase2b)
First install the package as described above.
The next step is setting up the config.toml file. You can use config_template.toml as a template.

```bash
cp config_template.toml config.toml
```

Set up the paths for the INCAS datasets and networks:
```toml
# Location of jsonl files
INCAS_DATASETS = "/mnt/osome/INCAS/datasets"

# Location where the preprocessed datasets will be stored
PREPROCESSED_DATASETS = "Data/Preprocessed"

# Location of the outputs
NETWORKS = "Outputs/Networks"
FIGURES = "Outputs/Figures"
TABLES = "Outputs/Tables"
CONFIGS = "Outputs/Configs"
```

The `INCAS_DATASETS` folder should contain the uncompressed jsonl files.

First, the files should be preprocessed. This can be done by running the following python script:
```bash
python pipeline/preprocess/preprocessINCAS.py
```
Here, `dataname` is the dataset name passed to the script; it corresponds to the dataset's `.jsonl` file in the `INCAS_DATASETS` folder. Together with the preprocessed data, the script generates a .txt file with some information about the dataset.
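Since the dataset name is tied to a `.jsonl` file name, one way to see which names are available is to list the `.jsonl` files in the datasets folder. This is a small sketch under that assumption; `available_datasets` is a hypothetical helper, and the file names in the demonstration are made up.

```python
import tempfile
from pathlib import Path


def available_datasets(datasets_dir: str) -> list[str]:
    """Return the names of the .jsonl files in a folder, without the extension."""
    return sorted(path.stem for path in Path(datasets_dir).glob("*.jsonl"))


# Demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("example_a.jsonl", "example_b.jsonl", "notes.txt"):
        Path(tmp, name).touch()
    names = available_datasets(tmp)
    print(names)  # ['example_a', 'example_b']
```

In practice the function would be pointed at the folder configured as `INCAS_DATASETS`.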

The parameters of the indicators can be set in the config.toml file.

Currently, only co-hashtag, co-URL and co-retweets are supported.

To run the indicators, you can use the `pipeline/indicators.py` script by running the following command:
```bash
python pipeline/indicators.py
```
Here, `dataname` is the name of the dataset and `indicator` is the indicator to be run. Indicators are selected with the `-i` option, for example `-i cohashtag coretweet courl`.

You can add a suffix to the output files with the `--suffix` parameter:
```bash
python pipeline/indicators.py --suffix
```
If no suffix is provided, a timestamp will be used as the suffix.

Such a process will generate files in the output directories defined by `NETWORKS`, `TABLES`, and `CONFIGS`.

In particular, the `TABLES` folder will contain the suspicious pairs of users and clusters in CSV format.
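To get a first look at the generated tables without assuming anything about their columns, the CSV files can be summarized by header and row count. This is only a sketch: `summarize_tables` is a hypothetical helper, and the demonstration file and its column names are invented for illustration.

```python
import csv
import tempfile
from pathlib import Path


def summarize_tables(tables_dir: str) -> dict[str, tuple[list[str], int]]:
    """Map each CSV file name to its header and its number of data rows."""
    summary = {}
    for path in sorted(Path(tables_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # empty list for an empty file
            row_count = sum(1 for _ in reader)
        summary[path.name] = (header, row_count)
    return summary


# Demonstration with an invented table; real column names will differ.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo.csv").write_text("col_a,col_b\n1,2\n3,4\n", encoding="utf-8")
    demo_summary = summarize_tables(tmp)
    print(demo_summary)  # {'demo.csv': (['col_a', 'col_b'], 2)}
```

Calling `summarize_tables("Outputs/Tables")` would do the same for the folder configured as `TABLES`.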

The `NETWORKS` folder will contain the networks in xnet format. xnet format can be read by using the xnetwork package:
```bash
pip install xnetwork
```
and using the following code:
```python
import xnetwork as xn
g = xn.load("network.xnet")
```

The result is an igraph network. You can convert it to the networkx format by using the following code:
```python
network = g.to_networkx()
```

The config file used to generate the data will be copied to the `CONFIGS` directory. A new section will be added to the copied config with extra parameters about the run.

## Text similarity indicators
The text similarity indicators can be run by adding `usctextsimilarity`, `textsimilarity` or `coword` to the indicator list. For instance, `pipeline/indicators.py -i cohashtag coretweet courl textsimilarity`. `usctextsimilarity` and `textsimilarity` require the installation of the faiss and sentence-transformers packages. A GPU is recommended for performance.
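A quick way to check whether the two optional dependencies are importable before starting a long run is sketched below. `missing_packages` is a hypothetical helper, not part of the package; the import names `faiss` and `sentence_transformers` are the module names those libraries expose.

```python
from importlib.util import find_spec

# Package name -> importable module name.
REQUIRED_MODULES = {
    "faiss": "faiss",
    "sentence-transformers": "sentence_transformers",
}


def missing_packages(required: dict[str, str] = REQUIRED_MODULES) -> list[str]:
    """Return the packages whose modules cannot be found in this environment."""
    return [package for package, module in required.items()
            if find_spec(module) is None]


# Prints an empty list when both packages are installed.
print(missing_packages())
```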

## Run for IO datasets
Repeat the same steps as for INCAS datasets, but set the `IO_DATASETS` variable in the config.toml file to the location of the IO datasets. Also, for preprocessing, use the `pipeline/preprocess/preprocessIO.py` script.

## Submitted methodologies
To generate the results submitted for the evaluation datasets, use the following procedures:

First, preprocess the dataset according to the preprocessing instructions above.

### For the UNION approach:
- Copy the `config_template_union.toml` to `config_union.toml` and set the PATHS accordingly.
- Run the following command:
```bash
python pipeline/indicators.py -c config_union.toml -i cohashtag coretweet courl coword -s union
```
Here, the dataset name passed to the script is the filename of the dataset (for the evaluation dataset it should be `TA2_full_eval_NO_GT_nat_2024-06-03` or `TA2_full_eval_NO_GT_nat+synth_2024-06-03`).
- The results will be stored in the `Outputs/Tables` (or the folder defined in the config file).

### For the SOFTUNION approach:
- Copy the `config_template_softunion.toml` to `config_softunion.toml` and set the PATHS accordingly.
- Run the following command:
```bash
python pipeline/indicators.py -c config_softunion.toml -i cohashtag coretweet courl coword -s softunion
```
Here, the dataset name passed to the script is the filename of the dataset (for the evaluation dataset it should be `TA2_full_eval_NO_GT_nat_2024-06-03` or `TA2_full_eval_NO_GT_nat+synth_2024-06-03`).
- The results will be stored in the `Outputs/Tables` (or the folder defined in the config file).
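The two approaches differ only in the config file and the suffix, so the argument lists can be generated from the approach name. The sketch below rebuilds the commands exactly as written in this section; `indicator_command` is a hypothetical helper, and the dataset name is left out because its position on the command line is not shown here.

```python
import shlex

# Indicators used by both submitted approaches, as listed in the commands above.
INDICATORS = ["cohashtag", "coretweet", "courl", "coword"]


def indicator_command(approach: str) -> list[str]:
    """Build the argument list for one approach ("union" or "softunion")."""
    return [
        "python", "pipeline/indicators.py",
        "-c", f"config_{approach}.toml",
        "-i", *INDICATORS,
        "-s", approach,
    ]


for approach in ("union", "softunion"):
    print(shlex.join(indicator_command(approach)))
```

The resulting lists can be passed to `subprocess.run` once the dataset name has been added in the place the script expects it.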