Not working on New York Times #394

Closed
AndyTheFactory opened this issue Oct 24, 2023 · 3 comments
Labels: bug (Something isn't working), sites not working

Comments

@AndyTheFactory
Owner

Issue by JohnChu101
Thu Aug 22 03:34:16 2019
Originally opened as codelucas/newspaper#729


As mentioned in several issues (#645, #363), newspaper doesn't work on the New York Times.
I tested two editions of the New York Times: the English edition and the Chinese edition (https://cn.nytimes.com).
The Chinese edition doesn't have a paywall, so newspaper should be able to extract its full content. However, in both cases newspaper only extracts 3 or 4 paragraphs, and they are not from the beginning of the article.
Is there any way I can solve this?
Thanks.

My code:

from newspaper import Article, Config as NewspaperConfig

url = "https://www.nytimes.com/2019/08/21/business/economy/jobs-growth-revision.html"
conf = NewspaperConfig()
# keep_article_html=True keeps the extracted article HTML in article.article_html
article = Article(url, config=conf, keep_article_html=True)
article.download()  # fetch the page
article.parse()     # extract the text and the article HTML
print(article.article_html)
print(article.text)

The URLs I tested with:
https://www.nytimes.com/2019/08/21/business/economy/jobs-growth-revision.html
https://cn.nytimes.com/china/20190821/china-hong-kong-social-media-soft-power/
https://cn.nytimes.com/morning-brief/20190822/hong-kong-protests-british-consulate-us-sanctions-fentanyl/

@AndyTheFactory
Owner Author

Comment by jecarr
Mon May 10 06:20:58 2021


If it's any help, #885 works with your first URL. With the second URL, the last sentence is missing, and with the third URL I think a few more sentences are missing. I can't read the Chinese version well enough to determine exactly which sentences are missing, but the linked PR captures more than the master branch. Hope it helps!

@AndyTheFactory
Owner Author

The Chinese article does not work because is_highlink_density does not use a proper tokenizer to split non-European languages.
As a result, many divs are wrongly flagged as having high link density.

The multi-language handling needs to be rethought.
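
To illustrate the effect described above, here is a minimal, self-contained sketch. It is not the library's actual code: the function names, the made-up sample sentence, the regex-based tokenizer, and the 1.0 threshold are all assumptions chosen for illustration. It shows how a whitespace split can treat a whole Chinese paragraph as a single word, so that even two short links dominate the score.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace; suits languages that separate words with spaces."""
    return text.split()


def cjk_aware_tokens(text):
    """Hypothetical tokenizer: one token per CJK ideograph,
    other runs of word characters are kept whole."""
    return re.findall(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+", text)


def link_density_score(text, link_texts, tokenize):
    """Share of tokens that sit inside links, scaled by the number of links."""
    words = tokenize(text)
    if not words:
        return float("inf")
    link_words = [tok for link in link_texts for tok in tokenize(link)]
    return len(link_words) / len(words) * len(link_texts)


# Made-up sample paragraph with two short links (assumption, not from the issue).
paragraph = "今天的天气很好，我们去公园散步。更多内容请看新闻和视频。"
links = ["新闻", "视频"]

THRESHOLD = 1.0  # assumed cut-off for "high link density"

naive = link_density_score(paragraph, links, whitespace_tokens)
aware = link_density_score(paragraph, links, cjk_aware_tokens)

print("whitespace split:", naive, "high density:", naive >= THRESHOLD)
print("CJK-aware split: ", aware, "high density:", aware >= THRESHOLD)
```

With the whitespace split, the paragraph counts as one word while each link also counts as one, so the block is flagged; with per-character tokens the same paragraph falls well under the assumed threshold.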

AndyTheFactory added the bug label on Jan 7, 2024
@AndyTheFactory
Owner Author

Fixed in version 0.9.3.
