Not working on New York Times #394

AndyTheFactory · 2023-10-24T17:36:48Z

Issue by JohnChu101
Thu Aug 22 03:34:16 2019
Originally opened as codelucas/newspaper#729

As mentioned in several issues (#645, #363), newspaper doesn't work on the New York Times.
I tested two versions of the New York Times: the English version and the Chinese version (https://cn.nytimes.com).
The Chinese version doesn't have a paywall, so newspaper should be able to extract its full content. However, in both cases newspaper extracts only 3 or 4 paragraphs, and they are not from the beginning of the article.
Is there any way I can solve this?
Thanks.

My code:

from newspaper import Article, Config as NewspaperConfig

url = "https://www.nytimes.com/2019/08/21/business/economy/jobs-growth-revision.html"
conf = NewspaperConfig()
# keep_article_html=True also keeps the HTML of the extracted article body
article = Article(url, config=conf, keep_article_html=True)
article.download()
article.parse()
print(article.article_html)  # extracted body as HTML
print(article.text)          # extracted body as plain text

The URLs I tested with:
https://www.nytimes.com/2019/08/21/business/economy/jobs-growth-revision.html
https://cn.nytimes.com/china/20190821/china-hong-kong-social-media-soft-power/
https://cn.nytimes.com/morning-brief/20190822/hong-kong-protests-british-consulate-us-sanctions-fentanyl/


AndyTheFactory · 2023-10-24T17:36:49Z

Comment by jecarr
Mon May 10 06:20:58 2021

If it's any help, #885 works with your first URL. With the second URL, the last sentence is missing, and with the third URL I think a few more sentences are missing. I can't read the Chinese version well enough to determine exactly which sentences are missing, but the linked PR captures more than the master branch. Hope it helps!

AndyTheFactory · 2024-01-07T22:27:40Z

The Chinese article does not work because is_highlink_density does not use a proper tokenizer to split text in non-European languages.
As a result, many divs appear to have high link density.

The multi-language handling needs to be rethought.
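
To illustrate the effect described above, here is a minimal, self-contained sketch. It is not the library's actual is_highlink_density code: the scoring formula, the 1.0 threshold, and the helper names (whitespace_tokens, cjk_aware_tokens, is_high_link_density) are assumptions made up for this example. It only shows why splitting on whitespace can make a Chinese paragraph with a single short link look link-heavy.

```python
import re

# Hypothetical helpers for illustration only; not newspaper's real API.

def whitespace_tokens(text):
    """Split on whitespace, which works for space-delimited languages."""
    return text.split()

def cjk_aware_tokens(text):
    """Count each CJK ideograph as one token; keep other runs as words."""
    return re.findall(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+", text)

def is_high_link_density(text, link_texts, tokenize, threshold=1.0):
    """Assumed heuristic: (link words / all words) * number of links."""
    words = tokenize(text)
    if not words:
        return True
    link_words = sum(len(tokenize(t)) for t in link_texts)
    score = (link_words / len(words)) * len(link_texts)
    return score >= threshold

# A Chinese sentence with no spaces, containing one short link.
text = "香港抗议活动持续数月，中国政府在社交媒体上展开宣传攻势。"
links = ["社交媒体"]

# Whitespace splitting sees the whole sentence as one "word" and the link
# as one "word", so the paragraph is flagged as link-heavy.
print(is_high_link_density(text, links, whitespace_tokens))

# Per-character tokenization gives a realistic ratio and the paragraph is kept.
print(is_high_link_density(text, links, cjk_aware_tokens))
```

With whitespace splitting, the sentence and its link each count as a single token, so the ratio reaches the assumed threshold; with per-character tokens the link is a small fraction of the text. A real fix would likely use a language-specific tokenizer rather than this regular expression.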

AndyTheFactory · 2024-03-12T22:59:34Z

fixed in 0.9.3

AndyTheFactory added the sites not working label Nov 12, 2023

AndyTheFactory added this to the Release 0.9.2 milestone Nov 12, 2023

AndyTheFactory self-assigned this Dec 19, 2023

AndyTheFactory added the bug Something isn't working label Jan 7, 2024

AndyTheFactory modified the milestones: Release 0.9.2, Release 0.9.3 Jan 11, 2024

AndyTheFactory closed this as completed Mar 12, 2024
