
Document this module and make it easier for others to re-run #3

Open
ragesoss opened this issue Oct 7, 2020 · 25 comments


@ragesoss
Member

ragesoss commented Oct 7, 2020

This module hasn't been used to recreate the original analysis Kevin did in several years. Try to make it work, document any problems found, and add documentation and/or fixes to make it practical to do similar analyses on a regular basis.

@tab1tha
Contributor

tab1tha commented Oct 11, 2020

I'm on it.
How should I submit the documentation, and in what format would you prefer it? Also, could you please link me to Kevin's original analysis? I have not come across it yet.

@ragesoss
Member Author

I believe this is Kevin's analysis based on this module: https://meta.wikimedia.org/wiki/Research:Wiki_Ed_student_editor_contributions_to_sciences_on_Wikipedia

@ragesoss
Member Author

As for the format... I guess the best option would probably be to add a new markdown file with details on how to use it, along with inline comments for anything within the code that you think should be clarified.

@tab1tha
Contributor

tab1tha commented Oct 13, 2020

okay. Thank you

@tab1tha
Contributor

tab1tha commented Oct 16, 2020

[Help needed] The main issue I have been having for days is that `mwdumps --wiki=enwiki --verbose /home/tab1tha/Documents` takes hours but does not run to completion. I have had to keep interrupting it with Ctrl+C and rerunning it so that a few more files are downloaded each time.
Do I need to use all the files? What arguments can I use to select only the relevant files?

I am planning to replicate Kevin's research using topic contribution data for the year 2020.

This is the level at which the command is at now: https://pastebin.com/SFVSXKwp

@ragesoss
Member Author

Thanks for the update! Hmm... I suspect that Kevin may have done this from Toolforge, and even if he didn't, that's probably the best way around the problem you're facing. https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction

I suggest going through the process to get Toolforge access and trying to do it from there, since using a server within the same cloud environment should make the dump downloads much faster and more reliable.

@tab1tha
Contributor

tab1tha commented Oct 16, 2020

Thank you. I'll go through the guide, set it up and give it another try.

@tab1tha
Contributor

tab1tha commented Oct 18, 2020

I requested Toolforge access, and it says here (https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart) that I have to wait a week for it to be granted.

Are there any other related tasks that I can be working on until then?

@ragesoss
Member Author

Hopefully it will be less than a week, but here's another related analysis module you could look at: https://github.com/WikiEducationFoundation/academic_classification

Similarly to this one, it's from the work Kevin was doing several years ago and we'd love to be able to easily re-run similar analyses on more recent data, so documenting where the bottlenecks and problems are will be helpful.

@tab1tha
Contributor

tab1tha commented Oct 19, 2020

Okay. I'm checking it out now

@tab1tha
Contributor

tab1tha commented Oct 27, 2020

I have received toolforge access and it's taking me surprisingly long to understand how to use it. I apologize for my speed so far, I am doing my best to make a substantial contribution before the 30th.

@ragesoss
Member Author

Thanks @tab1tha! Sorry I couldn't provide a more clear-cut way to dive in.

@tab1tha
Contributor

tab1tha commented Oct 27, 2020

I have a few questions. Do I need to create a Toolforge tool?
Where do I run commands like `mwdumps --wiki=enwiki --verbose /home/tab1tha/Documents`? Is it in the Toolforge shell? I can't find the Toolforge shell.
I have, however, succeeded in accessing the dumps from PAWS using the command `ls /public/dumps/public`.

@ragesoss
Member Author

Yes, creating a tool might be the best way to go.

The 'toolforge' shell probably just means the terminal once you've logged on to toolforge via SSH. If you can get to the PAWS dumps, I think that means you're in the toolforge shell already.

@tab1tha
Contributor

tab1tha commented Oct 27, 2020

Ohh. This is helpful. Thank you

@tab1tha
Contributor

tab1tha commented Oct 28, 2020

[Update: help needed]
This commit shows the work I have done so far. I am at the point where I cannot write the output files to a folder I created. This might be because of some restrictions on the Toolforge platform. The error log is pasted here.

I receive Error [13], which says that I do not have file permissions, but when I check, it shows that I do have all the file permissions for that folder. Trying to use sudo with the command fails too, with this error.

@ragesoss
Member Author

@tab1tha it looks like `out=/demo_results` specifies an absolute path, rather than a path relative to your home directory or your tool's directory. Maybe that's why you're getting a permissions error?

@tab1tha
Contributor

tab1tha commented Oct 28, 2020

Using the path `/home/tambetabitha/demo_results` instead yields this: https://pastebin.com/DhMSriMh
It now says that `enwiki-20201001-stub-meta-history9.xml.gz` is not a directory.

@ragesoss
Member Author

That seems like progress, perhaps. I don't know why it would be trying to treat that gzip file as a directory, though.

@tab1tha
Contributor

tab1tha commented Oct 28, 2020

I have been trying to figure that out too. I'm looking at the code now.

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

I think it fails because the regex in `topics.cmdline._get_files_to_work_on` specifies that the filename must end with `.xml`. However, mine is still gzipped and ends in `.gz`.

```python
def _get_files_to_work_on(input_dir):
    raw_files = [join(input_dir, f) for f in listdir(input_dir)
                 if isfile(join(input_dir, f))]
    dump_files = [f for f in raw_files
                  if re.match(r'.*stub-meta-history(\d+)\.xml', f)]
    return dump_files
```

It is therefore necessary to unzip the file before passing it as a command line argument. Alternatively, we could adjust the regex in the `topics.cmdline` module so that it accepts both zipped and unzipped files and, in the case where the file is zipped, unzips it using gzip.

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

In the meantime, considering that the commit adding Demonstration.md is part of pull request 5, I have changed the pull request's name to a more appropriate one.

Also, am I on track with respect to the content and format of Demonstration.md so far? Is there something else that you expected or would want me to add?

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

To enable handling of `.gz` files, I have considered adding a try-except clause to the `topics.cmdline._get_files_to_work_on` function, like so:

```python
def _get_files_to_work_on(input_dir):
    raw_files = [join(input_dir, f) for f in listdir(input_dir)
                 if isfile(join(input_dir, f))]
    try:
        dump_files = [f for f in raw_files
                      if re.match(r'.*stub-meta-history(\d+)\.xml', f)]
    except Exception:
        dump_files = [gzip.open(f) for f in raw_files
                      if re.match(r'.*stub-meta-history(\d+)\.xml\.gz', f)]
    return dump_files
```

Is this okay? What would you prefer?
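One caveat with the try-except approach above: a list comprehension whose filter matches nothing simply returns an empty list and raises no exception, so the `except` branch would never be reached. A quick demonstration:

```python
import re

names = ['enwiki-20201001-stub-meta-history9.xml.gz']

# The .gz name fails a pattern anchored to end at .xml, but filtering
# raises no exception; it just yields an empty list.
matched = [f for f in names
           if re.match(r'.*stub-meta-history(\d+)\.xml$', f)]
print(matched)  # -> []
```

Branching on the filename suffix (e.g. `f.endswith('.gz')`) and choosing `gzip.open` versus `open` accordingly avoids relying on an exception that never fires.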

@ragesoss
Member Author

Anything that works is fine with me! I don't have much of a sense for what is the most Pythonic way to do things, so use your best judgment.

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

Okay. I'm on it !
