Linking in the News Study Data Processor

Code to generate data for a study into cross-national linking norms in online news.

Install

conda install --file requirements.txt, or pip install -r requirements.txt
pip install click==7.1.2 - click had to be manually downgraded (issue)
python -m spacy download en_core_web_sm

Process

Download all the stories with HTML into the input folder, with one subfolder per media source (see Queries below for the full query)
Run run-pipeline.sh to export the metadata ndjson/csv (in export/links-by-media) - this takes a while!
Combine the CSV files into one by running combine-link-csvs.sh and combine-story-csvs.sh, for use with Tableau or RStudio
Run python remove-duplicates.py to remove duplicates from the links-all.csv file, creating links-all-no-dupes.csv
Combine the NDJSON files into one: cat export/links-by-media/ndjson/*.ndjson > export/links-by-media/all.ndjson
Run python export-domains.py to write a file for each media source with all the domains linked to/from
Combine those into one file of all unique domains linked to: cat export/domain-links-by-media/*.csv | sort | uniq > export/all-domains.txt
Run python fetch-domain-info.py to check for Media Cloud metadata for each media source
Run python export-network-graphs.py to generate network graphs for each country, with full source metadata embedded
Run python export-internal-to-external.py to create a CSV with platform / non-platform link data by country
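
The duplicate-removal step can be pictured with a short sketch. This is not the code in remove-duplicates.py; it is a minimal illustration that assumes a duplicate is a row identical to an earlier one in every column. The function name drop_duplicate_rows and the sample column names are hypothetical.

```python
import csv
import io


def drop_duplicate_rows(reader, writer):
    """Copy rows from reader to writer, skipping exact repeats.

    Assumption: two rows are duplicates when every column matches.
    The real remove-duplicates.py may use a different key.
    """
    seen = set()
    kept = 0
    for row in reader:
        key = tuple(row)
        if key in seen:
            continue
        seen.add(key)
        writer.writerow(row)
        kept += 1
    return kept


# Tiny in-memory example; the column names are made up for illustration.
sample = io.StringIO(
    "source_domain,target_domain\n"
    "example.com,example.org\n"
    "example.com,example.org\n"
    "example.com,example.net\n"
)
output = io.StringIO()
kept_rows = drop_duplicate_rows(csv.reader(sample), csv.writer(output))
print(kept_rows)  # counts the header row plus the unique data rows
```

The same function should work on real files if they are opened with newline="" and wrapped in csv.reader and csv.writer. Keeping every row key in memory is the simplest approach, but it may need rethinking for very large link files.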

Importing to Kibana (optional)

To export to our Media Cloud Kibana:

Set up a tunnel to Kibana: ssh -6 -L 9200:$(ssh -6 bly.srv.mediacloud.org dokku elasticsearch:info kibana-elasticsearch --internal-ip):9200 bly.srv.mediacloud.org
Run the import: elasticsearch_loader --es-host http://[::1]:9200/ --index linkstudy-1 --index-settings-file export/story-link-mapping.json json export/story-links-sample1.ndjson --json-lines
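
Before loading a large file it can help to confirm that every line parses as JSON. The following is a minimal sketch, assuming the NDJSON files hold one JSON object per line. The helper name count_ndjson_records is hypothetical and is not part of this repository or of elasticsearch_loader.

```python
import io
import json


def count_ndjson_records(lines):
    """Return (valid, invalid) counts for an iterable of NDJSON lines.

    Blank lines are ignored; a line counts as valid only if it parses
    to a JSON object (a dict).
    """
    valid = invalid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            invalid += 1
            continue
        if isinstance(record, dict):
            valid += 1
        else:
            invalid += 1
    return valid, invalid


# In-memory example; the field names are made up for illustration.
sample = io.StringIO(
    '{"story_id": 1, "link": "https://example.org/a"}\n'
    '\n'
    '{"story_id": 2, "link": "https://example.org/b"}\n'
    'not json\n'
)
counts = count_ndjson_records(sample)
print(counts)
```

Passing an open file handle instead of the in-memory sample would check a file on disk line by line without reading it all into memory.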

Queries

The original query that created our corpus of stories (matches around 800k stories):

import datetime as dt
import mediacloud.api

FETCH_TEXT = False
FETCH_RAW_HTML = True
COLLECTIONS = [
    34412476,  # UK national
    34412282,  # Australia national
    34412126,  # Kenya national
    34412238,  # S Africa national
    34412313,  # Philippines national
    34412118,  # India national
    34412234,  # US national
]
QUERY = '* language:en'
FQ = " OR ".join([
    mediacloud.api.MediaCloud.dates_as_query_clause(dt.date(2020, 2, 2), dt.date(2020, 2, 8)),  # inclusive
    mediacloud.api.MediaCloud.dates_as_query_clause(dt.date(2020, 5, 10), dt.date(2020, 5, 16)),  # inclusive
    mediacloud.api.MediaCloud.dates_as_query_clause(dt.date(2020, 8, 16), dt.date(2020, 8, 22)),  # inclusive
    mediacloud.api.MediaCloud.dates_as_query_clause(dt.date(2020, 10, 25), dt.date(2020, 10, 31))  # inclusive
])
LIMIT = None
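
The four date clauses are joined with OR, so a story matches if it was published in any one of the sampled windows. The sketch below checks how many days each window covers, counting both ends, and shows the shape of the combined filter. It deliberately does not call the Media Cloud client: date_clause is a hypothetical stand-in, and the publish_day range syntax it emits is an assumption, not necessarily the exact string that dates_as_query_clause returns.

```python
import datetime as dt

# The four sampling windows from the query above (start and end inclusive).
WINDOWS = [
    (dt.date(2020, 2, 2), dt.date(2020, 2, 8)),
    (dt.date(2020, 5, 10), dt.date(2020, 5, 16)),
    (dt.date(2020, 8, 16), dt.date(2020, 8, 22)),
    (dt.date(2020, 10, 25), dt.date(2020, 10, 31)),
]


def days_inclusive(start, end):
    """Number of calendar days in [start, end], counting both ends."""
    return (end - start).days + 1


def date_clause(start, end):
    """Hypothetical stand-in for dates_as_query_clause (format is assumed)."""
    return "publish_day:[{} TO {}]".format(start.isoformat(), end.isoformat())


window_lengths = [days_inclusive(start, end) for start, end in WINDOWS]
combined_filter = " OR ".join(date_clause(start, end) for start, end in WINDOWS)

print(window_lengths)
print(combined_filter)
```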

dataculturegroup/news-linking-study-data
