
Question about the future of snscrape #1037

Open · lukaszpeee opened this issue Nov 19, 2023 · 18 comments
Labels: question (Further information is requested)

Comments

@lukaszpeee

Hi,

I think snscrape is an amazing library that offers numerous possibilities. I have a question about the current situation, since it has been going on for a few months now. I would like to use snscrape for my projects, but it is currently broken. Do you think there is any chance that the situation will change?

Best regards

lukaszpeee added the question (Further information is requested) label on Nov 19, 2023
@JustAnotherArchivist
Owner

Yeah, I haven't had enough spare time recently to play whack-a-mole with Elon's minions. I do intend to resume development, but I can't currently say when that will happen.

JustAnotherArchivist changed the title from "Question about the feature" to "Question about the future of snscrape" on Nov 24, 2023
@Krishna-Singhal

> Yeah, I haven't had enough spare time recently to play whack-a-mole with Elon's minions. I do intend to resume development, but I can't currently say when that will happen.

Could you please revive this amazing lib for Twitter (or X)? ;)

@Evandro72

Good luck my friend!

@Dev-Anky07

@JustAnotherArchivist I can help you with automated login using a headless browser instance; the login step can then be skipped on later runs by passing in a browser profile in which the credentials are already saved:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=C:\\Path")  # path to your Chrome profile
w = webdriver.Chrome(service=Service("C:\\Users\\chromedriver.exe"), options=options)

That workaround was for a single-user application, though, and I don't know how you'd be able to implement it at this scale.

One way would be to ask users for their login details, which would then be used to authenticate the automated login.

I used Selenium and the send_keys() function.

Let me know if that's something you'd like; I just want to help as much as I can.
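For reference, here is a very rough sketch of that send_keys() login flow, reusing the w driver from the snippet above. The login URL, the field names "text" and "password", and the fixed sleeps are assumptions; X's login flow changes often and in practice needs explicit waits and extra verification steps.

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

w.get("https://x.com/i/flow/login")  # assumed login entry point
time.sleep(5)  # crude wait for the username form to render
w.find_element(By.NAME, "text").send_keys("your_username" + Keys.RETURN)
time.sleep(5)  # crude wait for the password step
w.find_element(By.NAME, "password").send_keys("your_password" + Keys.RETURN)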

@TheTechRobo
Contributor

@Dev-Anky07 Authentication won't be supported in snscrape: #270


@leleobhz

> @Dev-Anky07 Authentication won't be supported in snscrape: #270

Hello!

I think #270 needs to be revisited, at least for Twitter, since the API (and paying for it) is the recognized and official way to get data out of Twitter, and Twitter's new stance on scraping will at least require following their API guidelines.

@JustAnotherArchivist
Owner

It does not. If you want to use the API, there are several API clients already. Also, regular use of an official API isn't scraping.

@leleobhz

> It does not. If you want to use the API, there are several API clients already. Also, regular use of an official API isn't scraping.

The way it is today, snscrape cannot scrape anything from Twitter, and I doubt it will be able to while Musk owns the network. Also, the idea here is to give users options, since mass access to tweets can only be reached via the API. I'm not using snscrape as a client but as a scraper for specific terms, and I think this would be very useful, judging by the number of forks of snscrape that exist just to support Twitter auth and the API.

@JustAnotherArchivist
Owner

You seem to misunderstand what snscrape is. It's a scraper, not an API client. And more specifically, it's for scraping publicly accessible content. Anything behind authentication walls has always been outside of snscrape's scope and design goal. If people want to maintain a fork going beyond that scope, they can do that (so long as they comply with GPLv3+). It might be useful to them. It's not something I will entertain though.

When I started writing snscrape, there was no usable software for scraping Twitter. There were and are usable API clients, and I'm not going to reinvent the wheel and write another one. Again, please use one of those many existing ones if you want to use the API.

snscrape can't scrape Twitter anymore, and the best thing it might do in the foreseeable future is retrieving individual tweets (useful for hydrating tweet ID lists, although it won't work for age-restricted or protected tweets) and a profile's most popular tweets. Those are the only things that are still publicly accessible as far as I know. I will likely remove all other Twitter scrapers.
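A minimal sketch of that hydration use case, assuming the TwitterTweetScraper class that snscrape has historically exposed is still present and working (which, per the above, may not last):

import snscrape.modules.twitter as sntwitter

tweet_ids = [1234567890123456789]  # placeholder IDs to hydrate
for tweet_id in tweet_ids:
    # each scraper yields the individual tweet, if it is still publicly accessible
    for tweet in sntwitter.TwitterTweetScraper(tweet_id).get_items():
        print(tweet)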

@leleobhz

> You seem to misunderstand what snscrape is. It's a scraper, not an API client. And more specifically, it's for scraping publicly accessible content. Anything behind authentication walls has always been outside of snscrape's scope and design goal. If people want to maintain a fork going beyond that scope, they can do that (so long as they comply with GPLv3+). It might be useful to them. It's not something I will entertain though.
>
> When I started writing snscrape, there was no usable software for scraping Twitter. There were and are usable API clients, and I'm not going to reinvent the wheel and write another one. Again, please use one of those many existing ones if you want to use the API.
>
> snscrape can't scrape Twitter anymore, and the best thing it might do in the foreseeable future is retrieving individual tweets (useful for hydrating tweet ID lists, although it won't work for age-restricted or protected tweets) and a profile's most popular tweets. Those are the only things that are still publicly accessible as far as I know. I will likely remove all other Twitter scrapers.

I saw a Nitter-based scraper that works, within Nitter's own very real limitations and API questions; maybe something from Nitter could be used here too. I consider Twitter data public because the restriction is not about who can access it, only about whether there is an "official" way to reach it. It's different from Facebook, for example, which lets each user decide what is public or not. On Twitter, everything is available except private accounts, as it always was. Musk wants to make robots, crawlers, scrapers, etc. vanish from Twitter, but that doesn't change the way Twitter handles information.

For that reason, I think Twitter scrapers deserve attention, since Twitter is still relevant for public information and debate. I understand your point about the API (and I agree with your reasoning), but simply deprecating the scrapers because there is no known or easy way to access the data would do the opposite of what scrapers have always set out to do.

@JustAnotherArchivist
Owner

Nitter only works with accounts now, as far as I'm aware. It previously used guest tokens, but those can't be generated anymore since late January, and the last ones expired a few days ago.

@DevanshD3

Hey @JustAnotherArchivist, I just wanted to scrape some tweets from a few accounts. My dad wanted some tweets and was manually copy-pasting them; being a fresh CS grad, I had to intervene, but this thread made me sad. Can we use snscrape to do that, or is that capability also unavailable?

@vzhb

vzhb commented May 24, 2024

I also need this project very much. Is there anything I can help with?

@DROBNJAK

How about adding the ability to use proxies to snscrape? If one keeps changing IP addresses, one can avoid Elon's ire.

@TheTechRobo
Contributor

snscrape already supports proxies. On the CLI, you can use the HTTPS_PROXY environment variable; in the module, there is a proxies argument to the scraper. The issue AFAIK is not IP addresses, it's that very little data is available if you are not authenticated or using the official API.
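A minimal sketch of the module-level usage, assuming the proxies argument takes a requests-style mapping as described above; the proxy URL and search query are placeholders:

import snscrape.modules.twitter as sntwitter

# requests-style proxy mapping; the URL is a placeholder
proxies = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
scraper = sntwitter.TwitterSearchScraper("from:example", proxies=proxies)
for tweet in scraper.get_items():  # requests are routed through the proxy
    print(tweet)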

@DROBNJAK

DROBNJAK commented Jul 14, 2024

There is another strategy that other, similar kinds of software use, and that is "humanising" the activity. When a human visits a web page, they don't just grab as much as they can as fast as they can; instead, they go back and forth, stop and do nothing for a while, go back, and so on. Obviously throughput goes down, but one can parallelise with proxies and at least keep the accounts alive and keep going. If you want to speed it up, slow it down.
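A minimal sketch of that pacing idea, with arbitrary numbers: sleep a random, human-ish amount of time between requests, and occasionally idle much longer.

import random
import time

def humanised_sleep(min_s=2.0, max_s=15.0):
    # roughly one request in ten is followed by a long "walked away" pause
    if random.random() < 0.1:
        time.sleep(random.uniform(60, 180))
    else:
        time.sleep(random.uniform(min_s, max_s))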

@DROBNJAK

...another thing to consider is cookies. One can't have five proxies and collect only one cookie jar; Twitter will suss you out on the spot. There must be one cookie jar per proxy, and the cookies need to be carefully kept in sync with their proxies.
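A minimal sketch of that pairing, using one requests session (and therefore one cookie jar) per proxy; the proxy URLs are placeholders:

import requests

proxy_urls = ["http://proxy1:8080", "http://proxy2:8080"]
sessions = []
for proxy in proxy_urls:
    s = requests.Session()  # each session keeps its own cookie jar...
    s.proxies = {"http": proxy, "https": proxy}  # ...tied to exactly one proxy
    sessions.append(s)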
