Code for replicating "Right and left, partisanship predicts vulnerability to misinformation" by Dimitar Nikolov, Alessandro Flammini and Filippo Menczer
Introduction
In this repository, you can find code and instructions for reproducing the plots from Right and left, partisanship predicts vulnerability to misinformation by Dimitar Nikolov, Alessandro Flammini, and Filippo Menczer.
To start, clone the repo:
$ git clone https://github.com/dimitargnikolov/twitter-misinformation.git
You should run all subsequent commands from the directory where you cloned the repo.
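For example:

$ cd twitter-misinformation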
Datasets
There are three datasets you need to obtain. Before you begin, create a data directory at the root of the repo.
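For example, from the repo root:

$ mkdir data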
Link Sharing on Twitter
This dataset contains a set of link sharing actions that occurred on Twitter during the month of June 2017. It is available on the Harvard Dataverse. Once downloaded, place it in the data directory so that it appears as data/domain-shares.data (see the directory listing below).
Political Valence
This is a dataset from Facebook, which gives political valence scores to several popular news sites. You can request access to the dataset from Dataverse. Once you have access, put the top500.csv file into the data directory.
Misinformation
This is a dataset of manually curated sources of misinformation, available at OpenSources.co. Clone it from GitHub into your data directory:
$ git clone https://github.com/BigMcLargeHuge/opensources.git data/opensources
data Directory
Once you obtain all the data as described above, your data directory should look like this:
data
├── domain-shares.data
├── opensources
│   ├── CONTRIBUTING.md
│   ├── LICENSE
│   ├── README.md
│   ├── badges.txt
│   ├── releasenotes.txt
│   └── sources
│       ├── sources.csv
│       └── sources.json
└── top500.csv
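Before running the workflow, you can optionally verify that everything is in place with a short Python check (a minimal sketch based on the listing above; run it from the repo root):

import os

# Paths taken from the directory listing above.
expected = [
    'data/domain-shares.data',
    'data/top500.csv',
    'data/opensources/sources/sources.csv',
    'data/opensources/sources/sources.json',
]

for path in expected:
    print('{}\t{}'.format('OK' if os.path.exists(path) else 'MISSING', path))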
Environment
Make sure you have Python 3 installed on your system. Then, set up a virtualenv with the required modules at the root of the cloned repository:
$ virtualenv -p python3 VENV
$ source VENV/bin/activate
$ pip install -r requirements.txt
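If virtualenv is not available on your system, Python 3's built-in venv module is a reasonable substitute for the first command above (an alternative, not the setup documented by the authors); the remaining steps are the same:

$ python3 -m venv VENV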
From now on, any time you want to run the analysis, activate your virtual environment with:
$ source VENV/bin/activate
Workflow
The replication code is contained in the .py files in the scripts directory. You can automate their execution with the provided snakemake workflow:
$ cd workflow
$ snakemake -p
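Note: depending on your snakemake version, you may need to pass the number of cores explicitly; for example (the core count here is arbitrary):

$ snakemake -p --cores 4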
The execution will display the actual shell commands being executed, so you can run them individually if you want. You can inspect the workflow/Snakefile file to see how the inputs and outputs for each script are specified. In addition, you can execute each script with

$ python <script_name.py> --help

to learn about what it does.
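For example, assuming your virtual environment is active and you are at the repo root, the following prints the options of the tweet-cleaning script:

$ python scripts/clean_tweets.py --help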
At the end of the execution, the generated plots will be in the data directory.
To regenerate the plots from scratch, in the workflow directory you can do:
$ snakemake clean
$ snakemake -p
Contact
If you have any questions about running this code or obtaining the data, please open an issue in this repository and we will get back to you as soon as possible.
Code Snippets
import os
import argparse
import csv
import logging
from operator import itemgetter

from utils import domain
from config import DEBUG_LEVEL

logging.basicConfig(level=DEBUG_LEVEL)


def read_domain_data(filepath, domain_col, data_cols_to_read, delimiter, skip_rows):
    domains = {}
    row_count = 0  # for debugging
    with open(filepath, 'r') as f:
        reader = csv.reader(f, delimiter=delimiter)

        # skip headers
        if skip_rows is not None:
            for _ in range(skip_rows):
                row_count += 1
                next(reader)

        # process the rows
        for row in reader:
            if domain_col >= len(row):
                raise ValueError('Invalid domain index: {}, {}'.format(domain_col, ', '.join(row)))

            d = domain(row[domain_col])
            if d in domains:
                logging.info('Domain has already been processed: {}. Skipping new values.'.format(d))
                logging.info('Existing data: {}'.format(', '.join(domains[d])))
                logging.info('New data: {}'.format(', '.join(row)))

            new_row = []
            if data_cols_to_read is not None:
                for idx in data_cols_to_read:
                    if idx >= len(row):
                        raise ValueError('Invalid index: {}, {}'.format(idx, ', '.join(row)))
                    elif idx == domain_col:
                        logging.info('Data column the same as the domain column. Skipping.')
                    else:
                        new_row.append(row[idx])

            row_count += 1
            domains[d] = new_row

    return domains


def main():
    parser = argparse.ArgumentParser(
        description=('Create a list of domains with standardized URLs. '
                     'Do this either from a primary CSV or from a provided list.'),
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument('dest_file', type=str,
                        help=('Destination file for the combined data. '
                              'The normalized domain will always be in the first column, '
                              'followed by the columns to keep from the primary file, '
                              'followed by the columns to keep from the secondary file.'))
    parser.add_argument('-p', '--primary_csv', type=str,
                        help=('A CSV file containing domains. '
                              'All domains from this file will be kept in the final output.'))
    parser.add_argument('-s', '--secondary_csv', type=str, default=None,
                        help=('A CSV containing domain data. '
                              'Domains from this file will not be kept '
                              'unless they appear in the primary file.'))
    parser.add_argument('-domain1', '--primary_domain_col', type=int, default=0,
                        help='The column with the domain in the primary source file.')
    parser.add_argument('-data1', '--primary_data_cols', type=int, nargs='+',
                        help='Columns with additional data from the primary source file to keep in the output.')
    parser.add_argument('-delim1', '--primary_delim', type=str, default='\t',
                        help='The delimiter in the primary source file.')
    parser.add_argument('-skip1', '--primary_skip_rows', type=int, default=0,
                        help='The number of header rows in the primary source file to skip.')
    parser.add_argument('-domain2', '--secondary_domain_col', type=int, default=0,
                        help='The column with the domain in the secondary source file.')
    parser.add_argument('-data2', '--secondary_data_cols', type=int, nargs='+',
                        help='Columns with additional data from the secondary source file to keep in the output.')
    parser.add_argument('-delim2', '--secondary_delim', type=str, default='\t',
                        help='The delimiter in the secondary source file.')
    parser.add_argument('-skip2', '--secondary_skip_rows', type=int, default=0,
                        help='The number of header rows in the secondary source file to skip.')
    parser.add_argument('-ddelim', '--dest_delim', type=str, default='\t',
                        help='The delimiter in the destination file.')
    parser.add_argument('-dhead', '--dest_col_headers', type=str, nargs='+',
                        help=('The column headers in the destination file. '
                              'Must match the number of columns being kept from both source files, '
                              'plus the first column for the domain.'))
    parser.add_argument('-exclude', '--exclude_domains', type=str, nargs='+',
                        help='A list of domains to exclude.')
    parser.add_argument('-include', '--include_domains', type=str, nargs='+',
                        help='A list of additional domains to include in the final list.')
    args = parser.parse_args()

    if os.path.dirname(args.dest_file) != '' and not os.path.exists(os.path.dirname(args.dest_file)):
        os.makedirs(os.path.dirname(args.dest_file))

    if (args.primary_csv is None or not os.path.exists(args.primary_csv)) and args.include_domains is None:
        raise ValueError('No input provided.')

    # read the CSVs
    logging.debug('Reading primary file.')
    if args.primary_csv is not None:
        primary_data = read_domain_data(
            args.primary_csv,
            args.primary_domain_col,
            args.primary_data_cols,
            args.primary_delim,
            args.primary_skip_rows
        )
    else:
        primary_data = {}

    if args.include_domains is not None:
        for raw_d in args.include_domains:
            d = domain(raw_d)
            if d not in primary_data:
                primary_data[d] = []

    logging.debug('Reading secondary file.')
    if args.secondary_csv is not None:
        secondary_data = read_domain_data(
            args.secondary_csv,
            args.secondary_domain_col,
            args.secondary_data_cols,
            args.secondary_delim,
            args.secondary_skip_rows
        )
    else:
        secondary_data = {}

    # combine the data from both files into rows
    excluded_domains = frozenset(args.exclude_domains) if args.exclude_domains is not None else frozenset()
    combined_rows = []
    for d in primary_data.keys():
        if d in excluded_domains:
            logging.info('Skipping {}'.format(d))
            continue
        new_row = [d]
        new_row.extend(primary_data[d])
        if d in secondary_data:
            new_row.extend(secondary_data[d])
        combined_rows.append(new_row)

    sorted_data = sorted(combined_rows)

    # write the data to the dest file
    logging.debug('Writing combined file.')
    with open(args.dest_file, 'w') as f:
        writer = csv.writer(f, delimiter=args.dest_delim)
        if args.dest_col_headers is not None:
            sorted_data.insert(0, args.dest_col_headers)
        else:
            sorted_data.insert(0, ['domain'])
        writer.writerows(sorted_data)


if __name__ == '__main__':
    main()
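Judging by its command-line flags, this script appears to be scripts/create_domain_list.py, which the Snakefile rules below invoke. As a rough standalone usage sketch (the input and output file names here are hypothetical), it merges a comma-delimited source list into a tab-delimited domain list:

$ python scripts/create_domain_list.py data/example-domains.tsv \
    -p data/example-sources.csv -domain1 0 -data1 1 -delim1 , -skip1 1 \
    -dhead domain category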
shell:
    '''
    rm -rf {data_dir}/tweets/clean {data_dir}/tweets/with-* {data_dir}/tweets/only-* \
           {data_dir}/counts {data_dir}/sources \
           {data_dir}/indexed-tweets {data_dir}/measures \
           {data_dir}/plots
    '''.format(**config)

shell:
    'python {code_dir}/scripts/clean_tweets.py {{threads}} {{input}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_links.py {{threads}} {{input}} {{output}} --transform_fn=domain -hdr Domain "Link Count"'.format(**config)

shell:
    'python {code_dir}/scripts/index_tweets.py {{threads}} {index_level} {{input}} {{output}}'.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -p {{input.news}} \
        -domain1 0 -data1 1 -delim1 , -skip1 1 \
        -dhead domain "political bias" \
        -ddelim $'\t' \
        -exclude en.wikipedia.org amazon.com vimeo.com m.youtube.com youtube.com whitehouse.gov twitter.com
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -p {{input.misinfo}} \
        -domain1 0 -data1 1 2 3 -delim1 , -skip1 1 \
        -dhead domain type1 type2 type3 \
        -ddelim $'\t'
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -s {{input}} \
        -domain2 0 -data2 1 -delim2 $'\t' -skip2=0 \
        -dhead domain pagerank \
        -ddelim $'\t' \
        -include Snopes.com PolitiFact.com FactCheck.org OpenSecrets.org TruthOrFiction.com HoaxSlayer.com
    '''.format(**config)

shell:
    'python {code_dir}/scripts/strip_tweets.py {{threads}} --domains={{input.domains}} {{input.tweets}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/expand_misinfo_dataset.py {{threads}} {{input}} {{wildcards.misinfo_type}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_tweets.py {{threads}} {{input}} {{output}} -hdr User "Tweet Count"'.format(**config)

shell:
    'python {code_dir}/scripts/strip_tweets.py {{threads}} --users={{input.users}} {{input.tweets}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_tweets.py {{threads}} {{input}} {{output}} -hdr User "Tweet Count"'.format(**config)

shell:
    'python {code_dir}/scripts/index_tweets.py {{threads}} {index_level} {{input}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_links.py {{threads}} {{input}} {{output}} --transform_fn=domain -hdr Domain "Link Counts"'.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
        --min_num_tweets={tweets_to_sample}
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
        --num_tweets={tweets_to_sample}
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
        --num_tweets={tweets_to_sample} --use_partition
    '''.format(**config)