
newsfeedback

Tool for extracting and saving news article metadata at regular intervals. It utilizes Beautiful Soup 4, trafilatura and Selenium to extract and, if desired, filter article metadata across three different pipelines depending on a site's structure.

Note: 🏗 This tool and its README are currently under construction 🏗


💻 Installation and Usage

If you use pipx, you can install with pipx install newsfeedback. Alternatively, you can install via pip: pip install newsfeedback. There you go! You can now run newsfeedback --help and the commands outlined below. To run the tests, type pytest.
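In short:

```bash
# Install in an isolated environment via pipx...
pipx install newsfeedback

# ...or install via pip
pip install newsfeedback

# Verify the installation and see all available commands
newsfeedback --help

# Run the test suite (presumably from a checkout of the repository)
pytest
```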

📦 Getting Started - Default

"Out the box", newsfeedback retrieves a list of homepages to be extracted from the default homepage config file. It is recommended to proceed with extracting metadata with these unchanged settings once to get acquainted with the functionalities of the tool.

  1. After installing newsfeedback, run newsfeedback pipeline-picker -u '[LINK OF YOUR CHOICE]'. This URL must be in the config file.
  2. Check the output folder (default: newsfeedback/output) for the CSV - this will be the structure of your exported CSVs.
  3. If satisfied, proceed to run newsfeedback get-data, adding -t [INTEGER] to specify, in hours, how often newsfeedback is to grab data (see the example session below).

Note: This defaults to every 6 hours and extracts data from all default homepage URLs in the config. If you wish to extract data from only one URL, add it to a custom config with `newsfeedback add-homepage-url` and then re-run Step 3.
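A minimal first session might look like this; the URL below is a placeholder, so substitute a homepage that is actually listed in your config:

```bash
# Step 1: trial extraction for a single homepage from the config
newsfeedback pipeline-picker -u 'https://www.example-news-site.com'

# Step 2: inspect the resulting CSV
ls newsfeedback/output

# Step 3: start continuous collection, e.g. every 12 hours
newsfeedback get-data -t 12
```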

💻 Running newsfeedback on a server

If you want to run newsfeedback on a dedicated server, please make sure that Chrome is installed on said server. Otherwise, you may be met with a Chrome binary error when using the Pur Abo pipeline. If you encounter regularly occurring timeouts while using the Pur Abo pipeline, your server may not have enough memory; at least 2 GB appear to be needed.
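Before deploying, a quick sanity check along these lines can save debugging time (assuming a typical Linux server; the exact Chrome binary name and package manager vary by distribution):

```bash
# Confirm a Chrome binary is available for Selenium
google-chrome --version || echo "Chrome not found - install it first"

# Check available memory (the Pur Abo pipeline seems to need at least 2 GB)
free -h
```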

🗂 Commands

newsfeedback --help

Get an overview of the command line commands.

newsfeedback add-homepage-url

Add a homepage URL to your config file (and thus to your metadata extraction workflow) via prompt.

newsfeedback generate-config

Generate a new user config. Prompts the user to select either metadata or homepage as the type of config and then clones the default settings into the new user config file.

If a user-generated homepage config already exists, missing default homepage URLs will be copied into the user-generated config file.

If a user-generated metadata config already exists, the user will be met with an error and prompted to adjust settings manually in the config.

newsfeedback pipeline-picker
  • -u → URL of the website you wish to extract metadata from
  • -o → output folder (default: newsfeedback/output)

Extracts article links and metadata from a homepage that is stored in the config file. A typical use case is a trial run for the extraction of data from a newly added homepage.
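For example, a trial run that writes the CSV to a custom folder (the URL is a placeholder):

```bash
newsfeedback pipeline-picker -u 'https://www.example-news-site.com' -o 'my_output'
```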

newsfeedback get-data
  • -t → interval, in hours, at which newsfeedback extracts the metadata and links (default: 6)

Using the homepages listed in the user config file (or the default config file, should the former not exist), metadata is extracted.
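Typical invocations:

```bash
# Extract from all configured homepages every 6 hours (the default)
newsfeedback get-data

# Extract every 24 hours instead
newsfeedback get-data -t 24
```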

🎨 Customizing your parameters

Extraction and filtering pipelines

beautifulsoup : using Beautiful Soup 4, URLs are extracted from the homepage HTML. As this initial URL collection is very broad, subsequent filtering is recommended. This is the most comprehensive pipeline for URL collection, but it also yields a higher proportion of irrelevant URLs, especially if filtering is not enabled.

  • high URL retrieval success rates
  • high rates of irrelevant URLs
  • filtering recommended

trafilatura : using trafilatura, articles are extracted from the given homepage URL. Success rates depend on the homepage HTML structure, but the proportion of irrelevant URLs is very low.

  • URL retrieval success depends on homepage HTML structure
  • low rates of irrelevant URLs
  • filtering is not needed

purabo : if a news portal requires consent to either a Pur Abo (paid subscription) or data tracking (e.g. ZEIT Online or heise), the consent button must be clicked via Selenium before the article URLs can be collected. Once the consent button has been clicked, the Pur Abo pipeline continues with the same functionality as the beautifulsoup pipeline. Note: oftentimes, article URLs can still be retrieved without consent, as the page is loaded behind the overlay, so only use this pipeline if the others fail.

  • only needed for very few homepages
  • dependent on Selenium and the Beautiful Soup pipeline
  • high rates of irrelevant URLs
  • filtering recommended

Filters apply to URLs only. newsfeedback's filters are based on a simple whitelist, with the eventual goal of allowing user additions to the whitelist rules. Due to this tool still being in its infancy, these filters are far from sophisticated ☺

Once article URLs have been extracted and, if need be, filtered, metadata is extracted with trafilatura.bare_extraction.

Adding data to the config file

If you wish to generate a custom config file, run newsfeedback add-homepage-url and follow the instructions. You will be asked for the URL and the desired pipeline (either beautifulsoup, trafilatura or purabo). This creates a fresh user config containing the desired URL. If you wish to extract metadata from the default homepages as well, please run newsfeedback generate-config and select homepage, as this copies the missing default URLs into the user-generated config. newsfeedback will automatically refer to the user-generated config, if present, as the standard config for data collection.
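As a sequence of commands (both prompt for input interactively):

```bash
# Create a user config and add a new homepage URL to it
newsfeedback add-homepage-url

# Optionally copy the missing default homepage URLs into the user config
# (select "homepage" when prompted for the config type)
newsfeedback generate-config
```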

Changing the types of metadata collected

By default, newsfeedback collects an article's title, url, description and date. If you wish to collect other categories of metadata, simply generate a user config file with newsfeedback generate-config and then manually adjust the settings within this file. Possible categories of metadata are: title, author, url, hostname, description, sitename, date, categories, tags, fingerprint, id, license, body, comments, commentsbody, raw_text, text, language. Note that not all websites provide all categories.
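In command form (the second step is manual):

```bash
# 1. Create a user metadata config (select "metadata" when prompted)
newsfeedback generate-config

# 2. Open the generated config file in an editor and add or remove
#    metadata categories from the list above
```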


[Rahel Winter and Felix Victor Münch](mailto:[email protected], [email protected]) under MIT.
