
dumpfile_name in regexps.py doesn't support dumps for services such as wikivoyage #1

Open
tptak opened this issue Apr 15, 2013 · 3 comments
tptak commented Apr 15, 2013

The title says it all :)
Sample dump name: enwikivoyage-20130330-pages-articles.xml
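To illustrate the problem, here is a minimal sketch in Python. The exact pattern in `regexps.py` is not shown in this thread, so both regexes below are illustrative assumptions: the first stands in for a pattern that only matches plain `<lang>wiki-` names, and the second shows one way it could be widened to also match project dumps such as wikivoyage.

```python
import re

# Hypothetical stand-in for the current dumpfile_name pattern: it only
# matches plain "<lang>wiki-" names like "enwiki-20130330-pages-articles.xml".
old_pattern = re.compile(
    r'(?P<lang>[a-z]+)wiki-(?P<date>\d{8})-pages-articles\.xml'
)

# A widened sketch that also captures an optional project suffix, so names
# such as "enwikivoyage-20130330-pages-articles.xml" match as well.
new_pattern = re.compile(
    r'(?P<lang>[a-z]+)wiki(?P<project>[a-z]*)-(?P<date>\d{8})-pages-articles\.xml'
)

m = new_pattern.match('enwikivoyage-20130330-pages-articles.xml')
# m.group('lang') is 'en', m.group('project') is 'voyage'
```

Note that the widened pattern still matches plain wikipedia dumps (with an empty `project` group), so existing behaviour would be preserved.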

@ghost ghost assigned saffsd Apr 15, 2013
saffsd (Owner) commented Apr 15, 2013

Hello! Thanks for getting in touch. I'm not familiar with the wiki dumps from other services; could you provide a URL for where they can be obtained? I'm afraid extending wikidump to support them will not be as simple as expanding the dumpfile_name RE, as wikidump currently identifies only one dump per language. If you plan on using wikidump with both wikipedia and wikivoyage(?) data, this functionality will need to be extended. I'm not able to work on this myself at the moment, but I welcome any code contributions to extend the scope of files that wikidump can process.

tptak (Author) commented Apr 16, 2013


Hi,

I downloaded the dump from http://dumps.wikimedia.org/backup-index.html -
wikivoyage is there.

To be honest, I've spent some time with wikidump, but I still have no idea
what I can do with it :( I need a parser that converts MediaWiki markup
into HTML. From what I've understood so far, wikidump is not what I'm
looking for. It appears that mwlib does that, but they moved the
htmlwriter somewhere along the way, and now I'm looking for that as well.
But I still wanted to let you know that not all wiki dump names contain
"wiki-" :)

Regards,
Tomasz Ptak

saffsd (Owner) commented Apr 18, 2013

Hi Tomasz. Indeed, I'm afraid wikidump does not do what you want. I wrote wikidump mainly to provide random access to the Wikipedia dump files, by automatically indexing the title and category of each page and mapping these to byte offsets into the raw file. I also ended up implementing routines to extract only the "text" part of each page.
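The indexing idea described above can be sketched as follows. This is not wikidump's actual code, just a minimal illustration of the technique: scan the raw pages-articles XML once, record the byte offset of each `<page>` element, and map each page title to that offset so the page can later be read with a single `seek`. The `<page>` and `<title>` tags are from the standard MediaWiki export format; the function name and return shape are invented for this example.

```python
import re

def build_title_index(path):
    """Map each page title to the byte offset of its <page> element.

    Illustrative sketch only: assumes <page> and <title> open on their
    own lines, as they do in standard MediaWiki dump files.
    """
    index = {}
    page_offset = None
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()          # remember where this line starts
            line = f.readline()
            if not line:
                break
            if b'<page>' in line:
                page_offset = offset   # start of the current page element
            elif b'<title>' in line and page_offset is not None:
                m = re.search(rb'<title>(.*?)</title>', line)
                if m:
                    index[m.group(1).decode('utf-8')] = page_offset
    return index
```

With such an index, fetching an article is a `seek` to `index[title]` followed by reading until the matching `</page>`, instead of rescanning the whole multi-gigabyte dump.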
