Not working on New York Times #394
Comments
Comment by jecarr: If it's any help, #885 works with your first URL. With the second URL the last sentence is missing, and with the third URL I think a few more sentences are missing. I can't read the Chinese version well enough to tell exactly which sentences are missing, but the linked PR captures more than the master branch. Hope it helps!
The Chinese article does not work because is_highlink_density does not use a proper tokenizer to split non-European languages. The multi-language handling needs to be rethought.
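To illustrate the kind of failure described here, below is a minimal sketch, not newspaper's actual code. It assumes that link density is computed as the ratio of words inside links to words in the whole node, and that words are obtained by splitting on whitespace. The helper names (`count_words_whitespace`, `count_words_cjk_aware`, `link_density`) and the sample sentence are hypothetical. Chinese text has no spaces between words, so a whitespace split sees a whole paragraph as one token and a short link can look like 100% of the text.

```python
import re

# Hypothetical helpers for illustration only; not newspaper's implementation.
# Matches characters in the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_words_whitespace(text):
    """Count tokens by splitting on whitespace (reasonable for European languages)."""
    return len(text.split())


def count_words_cjk_aware(text):
    """Count each CJK ideograph as one token, plus any other alphanumeric tokens."""
    cjk_count = len(CJK_CHAR.findall(text))
    # Replace ideographs with spaces, then keep tokens that contain letters or digits.
    rest = [t for t in CJK_CHAR.sub(" ", text).split()
            if any(c.isalnum() for c in t)]
    return cjk_count + len(rest)


def link_density(text, link_text, counter):
    """Ratio of link tokens to all tokens, using the given token counter."""
    total = counter(text)
    if total == 0:
        return 1.0
    return counter(link_text) / total


# Made-up sample paragraph containing one short link.
paragraph = "香港的抗议活动已经持续了数月,社交媒体成为新的战场。"
link = "社交媒体"

whitespace_density = link_density(paragraph, link, count_words_whitespace)
cjk_density = link_density(paragraph, link, count_words_cjk_aware)
print(whitespace_density, cjk_density)
```

With whitespace splitting the paragraph and the link each count as a single token, so the paragraph looks like it consists entirely of link text and would likely be discarded. Counting per character is only a rough stand-in for a real Chinese tokenizer, but it is enough to bring the ratio down to a plausible value.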
Fixed in 0.9.3.
Issue by JohnChu101
Thu Aug 22 03:34:16 2019
Originally opened as codelucas/newspaper#729
As mentioned in several issues (#645, #363), newspaper doesn't work on the New York Times.
I tested two versions of the New York Times: the English version and the Chinese version (https://cn.nytimes.com).
The Chinese version doesn't have a paywall, so newspaper should be able to extract its full content. However, in both cases newspaper only extracts 3 or 4 paragraphs, and they are not from the beginning of the article.
Is there any way I can solve this?
Thanks.
My code:
The URLs I tested with:
https://www.nytimes.com/2019/08/21/business/economy/jobs-growth-revision.html
https://cn.nytimes.com/china/20190821/china-hong-kong-social-media-soft-power/
https://cn.nytimes.com/morning-brief/20190822/hong-kong-protests-british-consulate-us-sanctions-fentanyl/