
dumpfile_name in regexps.py doesn't support dumps for services such as wikivoyage #1

Open
tptak opened this issue Apr 15, 2013 · 3 comments
tptak commented Apr 15, 2013

The title says it all :)
Sample dump name: enwikivoyage-20130330-pages-articles.xml
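To illustrate the problem, here is a minimal sketch in Python. The exact pattern in `regexps.py` is not shown in this thread, so both regexes below are illustrative assumptions: the first stands in for a pattern that only matches plain `<lang>wiki-` names, and the second shows one way it could be widened to also match project dumps such as wikivoyage.

```python
import re

# Hypothetical stand-in for the current dumpfile_name pattern: it only
# matches plain "<lang>wiki-" names like "enwiki-20130330-pages-articles.xml".
old_pattern = re.compile(
    r'(?P<lang>[a-z]+)wiki-(?P<date>\d{8})-pages-articles\.xml'
)

# A widened sketch that also captures an optional project suffix, so names
# such as "enwikivoyage-20130330-pages-articles.xml" match as well.
new_pattern = re.compile(
    r'(?P<lang>[a-z]+)wiki(?P<project>[a-z]*)-(?P<date>\d{8})-pages-articles\.xml'
)

m = new_pattern.match('enwikivoyage-20130330-pages-articles.xml')
# m.group('lang') is 'en', m.group('project') is 'voyage'
```

Note that the widened pattern still matches plain wikipedia dumps (with an empty `project` group), so existing behaviour would be preserved.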

@ghost ghost assigned saffsd Apr 15, 2013
saffsd (Owner) commented Apr 15, 2013

Hello! Thanks for getting in touch. I'm not familiar with the wiki dumps from other services; could you provide a URL for where they can be obtained? I'm afraid extending wikidump to support them will not be as simple as expanding the dumpfile_name RE, as wikidump currently identifies only one dump per language. If you plan on using wikidump with both wikipedia and wikivoyage(?) data, this functionality will need to be extended. I'm not able to work on this myself at the moment, but I welcome any code contributions to extend the scope of files that wikidump can process.

tptak (Author) commented Apr 16, 2013


Hi,

I downloaded the dump from http://dumps.wikimedia.org/backup-index.html -
wikivoyage is there.

To be honest, I've spent some time with wikidump, but I still have no idea
what I can do with it :( I need a parser that converts MediaWiki markup
into HTML. From what I've understood so far, wikidump is not what I'm
looking for. It appears that mwlib does that, but they moved the
htmlwriter somewhere along the way, and now I'm looking for that as well.
But I still wanted to let you know that not all wiki dump names contain
"wiki-" :)

Regards,
Tomasz Ptak

saffsd (Owner) commented Apr 18, 2013

Hi Tomasz. Indeed, I'm afraid wikidump does not do what you want. I wrote wikidump mainly to provide random access to the Wikipedia dump files, by automatically indexing the title and category of each page and mapping these to byte offsets into the raw file. I also ended up implementing routines to extract only the "text" part of each page.
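The indexing idea described above can be sketched as follows. This is not wikidump's actual code, just a minimal illustration of the technique: scan the raw pages-articles XML once, record the byte offset of each `<page>` element, and map each page title to that offset so the page can later be read with a single `seek`. The `<page>` and `<title>` tags are from the standard MediaWiki export format; the function name and return shape are invented for this example.

```python
import re

def build_title_index(path):
    """Map each page title to the byte offset of its <page> element.

    Illustrative sketch only: assumes <page> and <title> open on their
    own lines, as they do in standard MediaWiki dump files.
    """
    index = {}
    page_offset = None
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()          # remember where this line starts
            line = f.readline()
            if not line:
                break
            if b'<page>' in line:
                page_offset = offset   # start of the current page element
            elif b'<title>' in line and page_offset is not None:
                m = re.search(rb'<title>(.*?)</title>', line)
                if m:
                    index[m.group(1).decode('utf-8')] = page_offset
    return index
```

With such an index, fetching an article is a `seek` to `index[title]` followed by reading until the matching `</page>`, instead of rescanning the whole multi-gigabyte dump.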
