Merge branch 'master' of https://github.com/AndyTheFactory/newspaper4k into docs-0.9.3
AndyTheFactory committed Mar 27, 2024
2 parents c0834d2 + 9989040 commit a4aba95
Showing 7 changed files with 173 additions and 104 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Change Log

## 0.9.3.1 (2024-03-18)

Fixes for dependencies on Python >= 3.11: the pinned NumPy version was incompatible with Google Colab, which is now resolved.
Also fixed a typo in the Nepali language code, which was "np" instead of the correct "ne".

## 0.9.3 (2024-03-18)
Massive improvements in multi-language capabilities: added over 40 new languages and completely reworked the language module, making it much easier to add new languages. Additionally, added support for Google News as a source; you can now search and parse news by keyword, topic, location, or website.
Integrated cloudscraper as an optional dependency. If installed, it is used as a layer over requests, attempting to bypass Cloudflare protection.
47 changes: 45 additions & 2 deletions README.md
@@ -118,6 +118,31 @@ print(len(articles))

print(articles[0].title)
```
## Google News support

As of version 0.9.3, Newspaper4k supports Google News as a special Source object.

First, make sure you have the `gnews` extra installed, since we rely on the [GNews package](https://github.com/ranahaani/GNews/) to get the articles from Google News. You can install it using pip like this:

``` bash
pip install newspaper4k[gnews]
```

Then you can use the `GoogleNewsSource` class to get articles from Google News:
``` python
from newspaper.google_news import GoogleNewsSource

source = GoogleNewsSource(
    country="US",
    period="7d",
    max_results=10,
)

source.build(top_news=True)

print(source.article_urls())
# ['https://www.cnn.com/2024/03/18/politics/trump-464-million-dollar-bond/index.html', 'https://www.cnn.com/2024/03/18/politics/supreme-court-new-york-nra/index.html', ...
source.download_articles()
```


## Multilanguage features

@@ -152,13 +177,14 @@ detailed guides using newspaper.
# Features

- Multi-threaded article download framework
-- Newspaper category detection
+- Newspaper website category structure detection
- News url identification
- Google News integration
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Keyword building from the extracted text
-- Autoatic article text summarization
+- Automatic article text summarization
- Author extraction from text
- Easy to use Command Line Interface (`python -m newspaper....`)
- Output in various formats (json, csv, text)
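The keyword-building feature in the list above can be illustrated with a minimal frequency-based sketch. This is illustrative only, not newspaper4k's actual implementation (which uses per-language stopword lists and stemming); the tiny `STOPWORDS` set here is a placeholder:

``` python
import re
from collections import Counter

# Tiny illustrative stopword list; the real library ships full per-language lists.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def keywords(text: str, top_n: int = 5) -> list:
    """Rank words by frequency after lowercasing and stopword removal."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

kws = keywords(
    "The economy grew last quarter. Economists say the economy may slow, "
    "but growth in the economy surprised many economists."
)
# kws[0] is "economy" (3 occurrences), kws[1] is "economists" (2 occurrences)
```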
@@ -171,15 +197,32 @@ detailed guides using newspaper.

Using the dataset from [ScrapingHub](https://github.com/scrapinghub/article-extraction-benchmark), I created an [evaluator script](tests/evaluation/evaluate.py) that compares the performance of newspaper against its previous versions. This way we can see how each newspaper update improves or worsens the performance of the library.

<h3 align="center">Scraperhub Article Extraction Benchmark</h3>

| Version | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score |
|--------------------|-------------------|------------------------|---------------------|-----------------|
| Newspaper3k 0.2.8 | 0.8660 | 0.9128 | 0.9071 | 0.9100 |
| Newspaper4k 0.9.0 | 0.9212 | 0.8992 | 0.9336 | 0.9161 |
| Newspaper4k 0.9.1 | 0.9224 | 0.8895 | 0.9242 | 0.9065 |
| Newspaper4k 0.9.2 | 0.9426 | 0.9070 | 0.9087 | 0.9078 |
| Newspaper4k 0.9.3 | 0.9531 | 0.9585 | 0.9339 | 0.9460 |


Precision, recall and F1 are computed from the overlap of shingles (word n-grams of size 4) between the extracted text and the reference text. The corpus BLEU score is computed using [NLTK's bleu_score](https://www.nltk.org/api/nltk.translate.bleu).
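As an illustration of the shingle-based metric, precision and recall over 4-gram shingles can be computed as below. This is a hedged sketch of the general technique, not the evaluator script itself:

``` python
def shingles(text: str, n: int = 4) -> set:
    """Return the set of word n-grams (shingles) in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def precision_recall_f1(extracted: str, reference: str, n: int = 4):
    """Precision/recall/F1 from the overlap of n-gram shingles."""
    ext, ref = shingles(extracted, n), shingles(reference, n)
    if not ext or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(ext & ref)
    precision = overlap / len(ext)   # fraction of extracted shingles that are correct
    recall = overlap / len(ref)      # fraction of reference shingles recovered
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a lazy dog",
)
# 3 of 6 shingles overlap on each side, so p = r = f1 = 0.5
```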

We also use our own, newly created dataset, the [Newspaper Article Extraction Benchmark](https://github.com/AndyTheFactory/article-extraction-dataset) (NAEB), a collection of over 400 articles from 200 different news sources, to evaluate the performance of the library.

<h3 align="center">Newspaper Article Extraction Benchmark</h3>

| Version | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score |
|--------------------|-------------------|------------------------|---------------------|-----------------|
| Newspaper3k 0.2.8 | 0.8445 | 0.8760 | 0.8556 | 0.8657 |
| Newspaper4k 0.9.0 | 0.8357 | 0.8547 | 0.8909 | 0.8724 |
| Newspaper4k 0.9.1 | 0.8373 | 0.8505 | 0.8867 | 0.8682 |
| Newspaper4k 0.9.2 | 0.8422 | 0.8888 | 0.9240 | 0.9061 |
| Newspaper4k 0.9.3 | 0.8695 | 0.9140 | 0.8921 | 0.9029 |


# Requirements and dependencies

The following system packages are required:
2 changes: 1 addition & 1 deletion newspaper/languages/ne.py
@@ -48,4 +48,4 @@ def tokenizer(text):
"""
punct = re.escape(string.punctuation)
text = re.sub(rf"[\s\t{punct}]+", " ", text)
-    return trivial_tokenize(text, "np")
+    return trivial_tokenize(text, "ne")
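The fix above only changes the language code passed to the Indic NLP tokenizer ("ne" is the correct ISO 639-1 code for Nepali). The preceding regex step, which collapses runs of whitespace and punctuation into single spaces, can be reproduced standalone; this sketch omits the `trivial_tokenize` call so it runs without the Indic NLP dependency:

``` python
import re
import string

def normalize(text: str) -> str:
    """Collapse runs of whitespace and punctuation into single spaces,
    mirroring the preprocessing step in newspaper/languages/ne.py."""
    punct = re.escape(string.punctuation)
    return re.sub(rf"[\s\t{punct}]+", " ", text)

out = normalize("hello,  world!!")
# -> "hello world " (trailing "!!" also collapses to a single space)
```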
4 changes: 2 additions & 2 deletions newspaper/version.py
@@ -5,6 +5,6 @@
"""
To change the version of entire package, just edit this one location.
"""
-# version 0.9.3
-version_info = (0, 9, 3)
+# version 0.9.3.1
+version_info = (0, 9, 3, 1)
__version__ = ".".join(map(str, version_info))
