dumpfile_name in regexps.py doesn't support dumps for services such as wikivoyage #1
Comments
Hello! Thanks for getting in touch. I'm not familiar with the wiki dumps from other services; could you provide a URL for where they can be obtained? I'm afraid extending wikidump to support them will not be as simple as expanding the dumpfile_name RE, as wikidump currently identifies only one dump per language. If you plan on using wikidump with both wikipedia and wikivoyage(?) data, this functionality will need to be extended. I'm not able to work on this myself at the moment, but I welcome any code contributions to extend the scope of files that wikidump can process.
On 2013-04-16 01:54, saffsd wrote:
Hi, I downloaded the dump from http://dumps.wikimedia.org/backup-index.html - To be honest, I've spent some time with wikidump, but still have no idea Regards,
Hi Tomasz. Indeed, I'm afraid wikidump does not do what you want. The main purpose I wrote wikidump for was to provide random access to the wikipedia dump files, by automatically indexing the title and category of each page and mapping these to offsets into the raw file. I also ended up implementing routines to extract only the "text" part of each page.
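For anyone reading along, the title-to-offset idea described above can be sketched roughly as follows. This is not wikidump's actual code: `build_title_index` and `read_page` are hypothetical names, only titles are indexed (not categories), and it assumes an uncompressed dump where `<page>` and `<title>` each sit on their own line, as in standard MediaWiki exports.

```python
import re

# Matches the title element on a single line of the dump.
TITLE_RE = re.compile(rb"<title>(.*?)</title>")


def build_title_index(path):
    """Map each page title to the byte offset of the line opening its <page>."""
    index = {}
    page_start = None
    with open(path, "rb") as f:
        while True:
            offset = f.tell()          # byte offset of the line about to be read
            line = f.readline()
            if not line:
                break
            if b"<page>" in line:
                page_start = offset
            elif page_start is not None:
                m = TITLE_RE.search(line)
                if m:
                    index[m.group(1).decode("utf-8")] = page_start
                    page_start = None  # wait for the next <page>
    return index


def read_page(path, offset):
    """Seek straight to a stored offset and return the raw XML of one page."""
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            chunks.append(line)
            if b"</page>" in line:
                break
    return b"".join(chunks).decode("utf-8")
```

With an index like this, fetching a page is a single seek plus a short read, with no need to rescan the whole dump.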
It says all in the title :)
Sample dump name: enwikivoyage-20130330-pages-articles.xml
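As a rough illustration of a file name pattern that would accept the sample above, here is a minimal sketch. It is not the actual `dumpfile_name` pattern from regexps.py: the pattern, the list of project suffixes, the optional `.bz2` ending, and the `parse_dump_name` helper are all assumptions made for this example.

```python
import re

# Hypothetical pattern, NOT the dumpfile_name RE shipped in regexps.py.
# The project list is illustrative and may be incomplete.
DUMPFILE_NAME = re.compile(
    r"^(?P<lang>[a-z_]+?)"
    r"(?P<project>wiki|wikivoyage|wiktionary|wikibooks|wikinews|"
    r"wikiquote|wikisource|wikiversity)"
    r"-(?P<date>\d{8})-pages-articles\.xml(?:\.bz2)?$"
)


def parse_dump_name(name):
    """Return (lang, project, date) for a recognised dump name, else None."""
    m = DUMPFILE_NAME.match(name)
    if m is None:
        return None
    return m.group("lang"), m.group("project"), m.group("date")


print(parse_dump_name("enwikivoyage-20130330-pages-articles.xml"))
print(parse_dump_name("enwiki-20130330-pages-articles.xml"))
```

Capturing the project as its own group is what would let a caller tell a wikipedia dump from a wikivoyage dump in the same language, though the rest of wikidump would still need to handle more than one dump per language.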