
Document this module and make it easier for others to re-run #3

Open
ragesoss opened this issue Oct 7, 2020 · 25 comments


@ragesoss
Member

ragesoss commented Oct 7, 2020

This module hasn't been used to recreate the original analysis Kevin did in several years. Try to make it work, document any problems found, and add documentation and/or fixes to make it practical to do similar analyses on a regular basis.

@tab1tha
Contributor

tab1tha commented Oct 11, 2020

I'm on it.
How should I submit the documentation, and in what format would you prefer it? Also, could you please link me to Kevin's original analysis? I have not come across it yet.

@ragesoss
Member Author

I believe this is Kevin's analysis based on this module: https://meta.wikimedia.org/wiki/Research:Wiki_Ed_student_editor_contributions_to_sciences_on_Wikipedia

@ragesoss
Member Author

As for the format... I guess the best option would probably be to add a new markdown file with details on how to use it, along with inline comments for anything within the code that you think should be clarified.

@tab1tha
Contributor

tab1tha commented Oct 13, 2020

okay. Thank you

@tab1tha
Contributor

tab1tha commented Oct 16, 2020

[Help needed] The main issue I have been having for days is that `mwdumps --wiki=enwiki --verbose /home/tab1tha/Documents` takes hours but does not run to completion. I have had to keep interrupting it with Ctrl+C and rerunning it so that a few more files are downloaded each time.
Do I need to use all the files? What arguments can I use to select only the relevant files?

I am planning to replicate Kevin's research using topic contribution data for the year 2020.

This is the level at which the command is at now: https://pastebin.com/SFVSXKwp

@ragesoss
Member Author

Thanks for the update! Hmm... I suspect that Kevin may have done this from Toolforge, and even if he didn't, that's probably the best way around the problem you're facing. https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction

I suggest going through the process to get Toolforge access and trying to do it from there, since using a server within the same cloud environment should make the dump downloads much faster and more reliable.

@tab1tha
Contributor

tab1tha commented Oct 16, 2020

Thank you. I'll go through the guide, set it up and give it another try.

@tab1tha
Contributor

tab1tha commented Oct 18, 2020

I requested Toolforge access, and it says here (https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart) that I have to wait a week for it to be granted.

Are there any other related tasks that I can be working on until then?

@ragesoss
Member Author

Hopefully it will be less than a week, but here's another related analysis module you could look at: https://github.com/WikiEducationFoundation/academic_classification

Similarly to this one, it's from the work Kevin was doing several years ago and we'd love to be able to easily re-run similar analyses on more recent data, so documenting where the bottlenecks and problems are will be helpful.

@tab1tha
Contributor

tab1tha commented Oct 19, 2020

Okay. I'm checking it out now

@tab1tha
Contributor

tab1tha commented Oct 27, 2020

I have received toolforge access and it's taking me surprisingly long to understand how to use it. I apologize for my speed so far, I am doing my best to make a substantial contribution before the 30th.

@ragesoss
Member Author

Thanks @tab1tha! Sorry I couldn't provide a more clear-cut way to dive in.

@tab1tha
Contributor

tab1tha commented Oct 27, 2020

I have a few questions. Do I need to create a Toolforge tool?
Where do I run commands like `mwdumps --wiki=enwiki --verbose /home/tab1tha/Documents`? Is it in the Toolforge shell? I can't find the Toolforge shell.
I have, however, succeeded in accessing the dumps from PAWS using the command `ls /public/dumps/public`.

@ragesoss
Member Author

Yes, creating a tool might be the best way to go.

The 'toolforge' shell probably just means the terminal once you've logged on to toolforge via SSH. If you can get to the PAWS dumps, I think that means you're in the toolforge shell already.

@tab1tha
Contributor

tab1tha commented Oct 27, 2020

Ohh. This is helpful. Thank you

@tab1tha
Contributor

tab1tha commented Oct 28, 2020

[Update: help needed]
This commit shows the work I have done so far. I am at the point where I cannot write the output files to a folder I created. This might be because of some restrictions on the Toolforge platform. The error log is pasted here.

I receive Error [13], which says that I do not have file permissions, but when I check, it shows that I do have all the file permissions for that folder. Trying to use sudo with the command fails too, with this error.

@ragesoss
Member Author

@tab1tha it looks like `out=/demo_results` specifies an absolute path, rather than a path relative to your home directory or your tool's directory. Maybe that's why you're getting a permissions error?

@tab1tha
Contributor

tab1tha commented Oct 28, 2020

Using the path `/home/tambetabitha/demo_results` instead yields this: https://pastebin.com/DhMSriMh
It now says that `enwiki-20201001-stub-meta-history9.xml.gz` is not a directory.

@ragesoss
Member Author

That seems like progress, perhaps. I don't know why it would be trying to treat that gzip file as a directory, though.

@tab1tha
Contributor

tab1tha commented Oct 28, 2020

I have been trying to figure that out too. I'm looking at the code now.

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

I think it fails because the regex in `topics.cmdline._get_files_to_work_on` specifies that the filename must end with `.xml`. However, mine is still gzipped and ends in `.gz`.

```python
def _get_files_to_work_on(input_dir):
    raw_files = [join(input_dir, f) for f in listdir(input_dir)
                 if isfile(join(input_dir, f))]
    dump_files = [f for f in raw_files
                  if re.match(r'.*stub-meta-history(\d+)\.xml', f)]
    return dump_files
```

It is therefore necessary to unzip the file before passing it as a command line argument. Alternatively, we could adjust the regex in the `topics.cmdline` module so that it accepts both zipped and unzipped files and, in the case where the file is zipped, unzips it using gzip.

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

In the meantime, considering that the commit adding Demonstration.md is part of pull request 5, I have changed the pull request's name to a more appropriate one.

Also, am I on track with respect to the content and format of Demonstration.md so far? Is there something else that you expected or would want me to add?

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

To enable handling of `.gz` files, I have considered adding a try-except clause to the `topics.cmdline._get_files_to_work_on` function, like so:

```python
def _get_files_to_work_on(input_dir):
    raw_files = [join(input_dir, f) for f in listdir(input_dir)
                 if isfile(join(input_dir, f))]
    try:
        dump_files = [f for f in raw_files
                      if re.match(r'.*stub-meta-history(\d+)\.xml', f)]
    except Exception:
        dump_files = [gzip.open(f) for f in raw_files
                      if re.match(r'.*stub-meta-history(\d+)\.xml\.gz', f)]
    return dump_files
```

Is this okay? What would you prefer?
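One caveat with the try-except approach above: a list comprehension whose filter matches nothing simply returns an empty list and raises no exception, so the `except` branch would never be reached. A quick demonstration:

```python
import re

names = ['enwiki-20201001-stub-meta-history9.xml.gz']

# The .gz name fails a pattern anchored to end at .xml, but filtering
# raises no exception; it just yields an empty list.
matched = [f for f in names
           if re.match(r'.*stub-meta-history(\d+)\.xml$', f)]
print(matched)  # -> []
```

Branching on the filename suffix (e.g. `f.endswith('.gz')`) and choosing `gzip.open` versus `open` accordingly avoids relying on an exception that never fires.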

@ragesoss
Member Author

Anything that works is fine with me! I don't have much of a sense for what is the most Pythonic way to do things, so use your best judgment.

@tab1tha
Contributor

tab1tha commented Oct 29, 2020

Okay. I'm on it !
