Code for replicating "Right and left, partisanship predicts vulnerability to misinformation" by Dimitar Nikolov, Alessandro Flammini and Filippo Menczer
Introduction
In this repository, you can find code and instructions for reproducing the plots from Right and left, partisanship predicts vulnerability to misinformation by Dimitar Nikolov, Alessandro Flammini, and Filippo Menczer.
To start, clone the repo:
$ git clone https://github.com/dimitargnikolov/twitter-misinformation.git
You should run all subsequent commands from the directory where you cloned the repo.
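For example:

$ cd twitter-misinformation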
Datasets
There are three datasets you need to obtain. Before you begin, create a data directory at the root of the repo.
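For example, from the repo root:

$ mkdir data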
Link Sharing on Twitter
This dataset contains a set of link sharing actions that occurred on Twitter during the month of June 2017. It is available on the Harvard Dataverse. Once downloaded, place it in the data directory so that it appears as data/domain-shares.data (see the directory listing below).
Political Valence
This is a dataset from Facebook, which gives political valence scores to several popular news sites. You can request access to the dataset from Dataverse. Once you have access, put the top500.csv file into the data directory.
Misinformation
This is a dataset of manually curated sources of misinformation, available at OpenSources.co. Clone it from GitHub into your data directory:
$ git clone https://github.com/BigMcLargeHuge/opensources.git data/opensources
data Directory
Once you obtain all the data as described above, your data directory should look like this:
data
├── domain-shares.data
├── opensources
│   ├── CONTRIBUTING.md
│   ├── LICENSE
│   ├── README.md
│   ├── badges.txt
│   ├── releasenotes.txt
│   └── sources
│       ├── sources.csv
│       └── sources.json
└── top500.csv
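Before running the workflow, you can optionally verify that everything is in place with a short Python check (a minimal sketch based on the listing above; run it from the repo root):

import os

# Paths taken from the directory listing above.
expected = [
    'data/domain-shares.data',
    'data/top500.csv',
    'data/opensources/sources/sources.csv',
    'data/opensources/sources/sources.json',
]

for path in expected:
    print('{}\t{}'.format('OK' if os.path.exists(path) else 'MISSING', path))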
Environment
Make sure you have Python 3 installed on your system. Then, set up a virtualenv with the required modules at the root of the cloned repository:
$ virtualenv -p python3 VENV
$ source VENV/bin/activate
$ pip install -r requirements.txt
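If virtualenv is not available on your system, Python 3's built-in venv module is a reasonable substitute for the first command above (an alternative, not the setup documented by the authors); the remaining steps are the same:

$ python3 -m venv VENV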
From now on, any time you want to run the analysis, activate your virtual environment with:
$ source VENV/bin/activate
Workflow
The replication code is contained in the .py files in the scripts directory. You can automate their execution with the provided snakemake workflow:
$ cd workflow
$ snakemake -p
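Note: depending on your snakemake version, you may need to pass the number of cores explicitly; for example (the core count here is arbitrary):

$ snakemake -p --cores 4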
The execution will display the actual shell commands being executed, so you can run them individually if you want. You can inspect the workflow/Snakefile file to see how the inputs and outputs for each script are specified. In addition, you can execute each script with

$ python <script_name.py> --help

to learn about what it does.
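For example, assuming your virtual environment is active and you are at the repo root, the following prints the options of the tweet-cleaning script:

$ python scripts/clean_tweets.py --help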
At the end of the execution, the generated plots will be in the data directory.
To regenerate the plots from scratch, in the workflow directory you can do:
$ snakemake clean
$ snakemake -p
Contact
If you have any questions about running this code or obtaining the data, please open an issue in this repository and we will get back to you as soon as possible.
Code Snippets
import os
import argparse
import csv
import logging
from operator import itemgetter

from utils import domain
from config import DEBUG_LEVEL

logging.basicConfig(level=DEBUG_LEVEL)


def read_domain_data(filepath, domain_col, data_cols_to_read, delimiter, skip_rows):
    domains = {}
    row_count = 0  # for debugging
    with open(filepath, 'r') as f:
        reader = csv.reader(f, delimiter=delimiter)

        # skip headers
        if skip_rows is not None:
            for _ in range(skip_rows):
                row_count += 1
                next(reader)

        # process the rows
        for row in reader:
            if domain_col >= len(row):
                raise ValueError('Invalid domain index: {}, {}'.format(domain_col, ', '.join(row)))

            d = domain(row[domain_col])
            if d in domains:
                logging.info('Domain has already been processed: {}. Skipping new values.'.format(d))
                logging.info('Existing data: {}'.format(', '.join(domains[d])))
                logging.info('New data: {}'.format(', '.join(row)))

            new_row = []
            if data_cols_to_read is not None:
                for idx in data_cols_to_read:
                    if idx >= len(row):
                        raise ValueError('Invalid index: {}, {}'.format(idx, ', '.join(row)))
                    elif idx == domain_col:
                        logging.info('Data column the same as the domain column. Skipping.')
                    else:
                        new_row.append(row[idx])

            row_count += 1
            domains[d] = new_row

    return domains


def main():
    parser = argparse.ArgumentParser(
        description=('Create a list of domains with standardized URLs. '
                     'Do this either from a primary CSV or from a provided list.'),
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument('dest_file', type=str,
                        help=('Destination file for the combined data. '
                              'The normalized domain will always be in the first column, '
                              'followed by the columns to keep from the primary file, '
                              'followed by the columns to keep from the secondary file.'))
    parser.add_argument('-p', '--primary_csv', type=str,
                        help=('A CSV file containing domains. '
                              'All domains from this file will be kept in the final output.'))
    parser.add_argument('-s', '--secondary_csv', type=str, default=None,
                        help=('A CSV containing domain data. '
                              'Domains from this file will not be kept '
                              'unless they appear in the primary file.'))
    parser.add_argument('-domain1', '--primary_domain_col', type=int, default=0,
                        help='The column with the domain in the primary source file.')
    parser.add_argument('-data1', '--primary_data_cols', type=int, nargs='+',
                        help='Columns with additional data from the primary source file to keep in the output.')
    parser.add_argument('-delim1', '--primary_delim', type=str, default='\t',
                        help='The delimiter in the primary source file.')
    parser.add_argument('-skip1', '--primary_skip_rows', type=int, default=0,
                        help='The number of header rows in the primary source file to skip.')
    parser.add_argument('-domain2', '--secondary_domain_col', type=int, default=0,
                        help='The column with the domain in the secondary source file.')
    parser.add_argument('-data2', '--secondary_data_cols', type=int, nargs='+',
                        help='Columns with additional data from the secondary source file to keep in the output.')
    parser.add_argument('-delim2', '--secondary_delim', type=str, default='\t',
                        help='The delimiter in the secondary source file.')
    parser.add_argument('-skip2', '--secondary_skip_rows', type=int, default=0,
                        help='The number of header rows in the secondary source file to skip.')
    parser.add_argument('-ddelim', '--dest_delim', type=str, default='\t',
                        help='The delimiter in the destination file.')
    parser.add_argument('-dhead', '--dest_col_headers', type=str, nargs='+',
                        help=('The column headers in the destination file. '
                              'Must match the number of columns being kept from both source files, '
                              'plus the first column for the domain.'))
    parser.add_argument('-exclude', '--exclude_domains', type=str, nargs='+',
                        help='A list of domains to exclude.')
    parser.add_argument('-include', '--include_domains', type=str, nargs='+',
                        help='A list of additional domains to include in the final list.')
    args = parser.parse_args()

    if os.path.dirname(args.dest_file) != '' and not os.path.exists(os.path.dirname(args.dest_file)):
        os.makedirs(os.path.dirname(args.dest_file))

    if (args.primary_csv is None or not os.path.exists(args.primary_csv)) and args.include_domains is None:
        raise ValueError('No input provided.')

    # read the CSVs
    logging.debug('Reading primary file.')
    if args.primary_csv is not None:
        primary_data = read_domain_data(
            args.primary_csv,
            args.primary_domain_col,
            args.primary_data_cols,
            args.primary_delim,
            args.primary_skip_rows
        )
    else:
        primary_data = {}

    if args.include_domains is not None:
        for raw_d in args.include_domains:
            d = domain(raw_d)
            if d not in primary_data:
                primary_data[d] = []

    logging.debug('Reading secondary file.')
    if args.secondary_csv is not None:
        secondary_data = read_domain_data(
            args.secondary_csv,
            args.secondary_domain_col,
            args.secondary_data_cols,
            args.secondary_delim,
            args.secondary_skip_rows
        )
    else:
        secondary_data = {}

    # combine the data from both files into rows
    excluded_domains = frozenset(args.exclude_domains) if args.exclude_domains is not None else frozenset()
    combined_rows = []
    for d in primary_data.keys():
        if d in excluded_domains:
            logging.info('Skipping {}'.format(d))
            continue
        new_row = [d]
        new_row.extend(primary_data[d])
        if d in secondary_data:
            new_row.extend(secondary_data[d])
        combined_rows.append(new_row)

    sorted_data = sorted(combined_rows)

    # write the data to the dest file
    logging.debug('Writing combined file.')
    with open(args.dest_file, 'w') as f:
        writer = csv.writer(f, delimiter=args.dest_delim)
        if args.dest_col_headers is not None:
            sorted_data.insert(0, args.dest_col_headers)
        else:
            sorted_data.insert(0, ['domain'])
        writer.writerows(sorted_data)


if __name__ == '__main__':
    main()
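Judging by its command-line flags, this script appears to be scripts/create_domain_list.py, which the Snakefile rules below invoke. As a rough standalone usage sketch (the input and output file names here are hypothetical), it merges a comma-delimited source list into a tab-delimited domain list:

$ python scripts/create_domain_list.py data/example-domains.tsv \
    -p data/example-sources.csv -domain1 0 -data1 1 -delim1 , -skip1 1 \
    -dhead domain category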
shell:
    '''
    rm -rf {data_dir}/tweets/clean {data_dir}/tweets/with-* {data_dir}/tweets/only-* \
           {data_dir}/counts {data_dir}/sources \
           {data_dir}/indexed-tweets {data_dir}/measures \
           {data_dir}/plots
    '''.format(**config)

shell:
    'python {code_dir}/scripts/clean_tweets.py {{threads}} {{input}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_links.py {{threads}} {{input}} {{output}} --transform_fn=domain -hdr Domain "Link Count"'.format(**config)

shell:
    'python {code_dir}/scripts/index_tweets.py {{threads}} {index_level} {{input}} {{output}}'.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -p {{input.news}} \
        -domain1 0 -data1 1 -delim1 , -skip1 1 \
        -dhead domain "political bias" \
        -ddelim $'\t' \
        -exclude en.wikipedia.org amazon.com vimeo.com m.youtube.com youtube.com whitehouse.gov twitter.com
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -p {{input.misinfo}} \
        -domain1 0 -data1 1 2 3 -delim1 , -skip1 1 \
        -dhead domain type1 type2 type3 \
        -ddelim $'\t'
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/create_domain_list.py {{output}} -s {{input}} \
        -domain2 0 -data2 1 -delim2 $'\t' -skip2=0 \
        -dhead domain pagerank \
        -ddelim $'\t' \
        -include Snopes.com PolitiFact.com FactCheck.org OpenSecrets.org TruthOrFiction.com HoaxSlayer.com
    '''.format(**config)

shell:
    'python {code_dir}/scripts/strip_tweets.py {{threads}} --domains={{input.domains}} {{input.tweets}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/expand_misinfo_dataset.py {{threads}} {{input}} {{wildcards.misinfo_type}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_tweets.py {{threads}} {{input}} {{output}} -hdr User "Tweet Count"'.format(**config)

shell:
    'python {code_dir}/scripts/strip_tweets.py {{threads}} --users={{input.users}} {{input.tweets}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_tweets.py {{threads}} {{input}} {{output}} -hdr User "Tweet Count"'.format(**config)

shell:
    'python {code_dir}/scripts/index_tweets.py {{threads}} {index_level} {{input}} {{output}}'.format(**config)

shell:
    'python {code_dir}/scripts/count_links.py {{threads}} {{input}} {{output}} --transform_fn=domain -hdr Domain "Link Counts"'.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
        --min_num_tweets={tweets_to_sample}
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
        --num_tweets={tweets_to_sample}
    '''.format(**config)

shell:
    '''
    python {code_dir}/scripts/compute_hbias.py {{threads}} {{input}} {{output}} \
        --num_tweets={tweets_to_sample} --use_partition
    '''.format(**config)