Tool for extracting and saving news article metadata at regular intervals. It utilizes Beautiful Soup 4, trafilatura and Selenium to extract and, if desired, filter article metadata across three different pipelines depending on a site's structure.
Note: 🏗 This tool and its README are currently under construction 🏗
If you use pipx, you can install newsfeedback with `pipx install newsfeedback`. Alternatively, you can install via pip: `pip install newsfeedback`.
There you go! You can now run `newsfeedback --help` and the commands outlined below. To run the tests, type `pytest`.
Out of the box, newsfeedback retrieves the list of homepages to be extracted from the default homepage config file. It is recommended to run one extraction with these unchanged settings first to get acquainted with the tool's functionality.
1. After installing newsfeedback, run `newsfeedback pipeline-picker -u '[LINK OF YOUR CHOICE]'`. This URL must be in the config file.
2. Check the output folder (default: `newsfeedback/output`) for the CSV - this will be the structure of your exported CSVs.
3. If satisfied, proceed to run `newsfeedback get-data`, adding `-t [INTEGER]` to specify the interval, in hours, at which newsfeedback is to grab data.
Note: This defaults to every 6 hours and extracts data from all default homepage URLs in the config. If you wish to only extract data from one URL, add it to a custom config with `newsfeedback add-homepage-url` and then re-run Step 3.
If you want to run newsfeedback on a dedicated server, please make sure that you have Chrome installed on said server. Otherwise, you may be met with a Chrome binary error when using the Pur Abo pipeline. If you are met with regularly occurring timeouts while using the Pur Abo pipeline, your server may not have enough memory. It seems that at least 2GB are needed.
`newsfeedback --help`

Get an overview of the command line commands.
`newsfeedback add-homepage-url`

Add a homepage URL to your config file (and thus to your metadata extraction workflow) via prompt.
`newsfeedback generate-config`

Generate a new user config. Prompts the user to select either metadata or homepage as the type of config and then clones the default settings into the new user config file.
If a user-generated homepage config already exists, missing default homepage URLs will be copied into it.
If a user-generated metadata config already exists, the user will be met with an error and prompted to adjust the settings manually in the config file.
`newsfeedback pipeline-picker`

`-u` → URL of the website you wish to extract metadata from.
`-o` → output folder (default: `newsfeedback/output`)

Extracts article links and metadata from a homepage that is stored in the config file. A typical use case is a trial run for the extraction of data from a newly added homepage.
`newsfeedback get-data`

newsfeedback extracts the metadata and article links once every `-t` (default: 6) hours.
Metadata is extracted from the homepages listed in the user config file (or the default config file, should the former not exist).
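newsfeedback's own scheduler is not shown here; as a rough illustration of what an every-N-hours extraction loop amounts to, here is a minimal stdlib sketch (the function and the stand-in task are illustrative, not newsfeedback's actual code):

```python
import time

def run_at_interval(task, hours=6.0, iterations=None):
    """Call `task` once every `hours` hours.

    `iterations` caps the number of runs (None = run forever); it exists
    purely so this sketch can be demonstrated without blocking.
    """
    runs = 0
    while iterations is None or runs < iterations:
        task()  # e.g. extract metadata from all configured homepages
        runs += 1
        if iterations is not None and runs >= iterations:
            break
        time.sleep(hours * 3600)

# Illustrative stand-in for newsfeedback's extraction step.
results = []
run_at_interval(lambda: results.append("extracted"), hours=6, iterations=1)
```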
beautifulsoup: using Beautiful Soup 4, URLs are extracted from the homepage HTML. As this initial URL collection is very broad, subsequent filtering is recommended. This is the most comprehensive pipeline for URL collection, but it also has a higher quota of irrelevant URLs, especially if filtering is not turned on.
- high URL retrieval success rates
- high rates of irrelevant URLs
- filtering recommended
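The idea behind this pipeline can be sketched in a few lines. newsfeedback itself uses Beautiful Soup 4; the sketch below uses Python's standard-library parser so it runs without dependencies, but the principle is the same - collect every `href` on the homepage and resolve it against the base URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects every href found in anchor tags, resolved to an absolute URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

homepage_html = '<a href="/politik/artikel-1.html">Read</a><a href="#top">Top</a>'
collector = LinkCollector("https://example.com")
collector.feed(homepage_html)
# collector.links now holds every URL found - broad, hence the need for filtering
```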
trafilatura:
- URL retrieval success depends on the homepage's HTML structure
- low rates of irrelevant URLs
- filtering is not needed

purabo (Pur Abo):
- only needed for very few homepages
- dependent on the Selenium and Beautiful Soup pipeline
- high rates of irrelevant URLs
- filtering recommended
Filters apply to URLs only. newsfeedback's filters are based on a simple whitelist, with the eventual goal of allowing user additions to the whitelist rules. As this tool is still in its infancy, these filters are far from sophisticated ☺
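As a hedged illustration of what such a URL whitelist might look like (the prefixes below are made up for the example; they are not newsfeedback's actual rules):

```python
from urllib.parse import urlparse

# Hypothetical whitelist: URL path prefixes considered article-like.
WHITELIST_PREFIXES = ("/politik/", "/wirtschaft/", "/kultur/")

def filter_urls(urls, prefixes=WHITELIST_PREFIXES):
    """Keep only URLs whose path starts with a whitelisted prefix."""
    return [u for u in urls if urlparse(u).path.startswith(prefixes)]

candidates = [
    "https://example.com/politik/artikel-1.html",
    "https://example.com/impressum",
]
articles = filter_urls(candidates)
```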
Once article URLs have been extracted and, if need be, filtered, metadata is extracted with `trafilatura.bare_extraction`.
If you wish to generate a custom config file, run `newsfeedback add-homepage-url` and follow the instructions. You will be asked for the URL and the desired pipeline (either beautifulsoup, trafilatura or purabo). This spawns an empty user config and adds the desired URL. If you wish to extract metadata from the default homepages as well, please run `newsfeedback generate-config` and select homepage, as this copies the missing default URLs into the user-generated config. If a user-generated config is present, newsfeedback will automatically refer to it as the standard config for data collection.
By default, newsfeedback collects an article's `title`, `url`, `description` and `date`. If you wish to collect other categories of metadata, simply generate a user config file with `newsfeedback generate-config` and then manually adjust the settings within this file. Possible categories of metadata are: `title`, `author`, `url`, `hostname`, `description`, `sitename`, `date`, `categories`, `tags`, `fingerprint`, `id`, `license`, `body`, `comments`, `commentsbody`, `raw_text`, `text` and `language`. Note that not all websites provide all categories.
[Rahel Winter and Felix Victor Münch](mailto:[email protected], [email protected]) under the MIT license.