add Dockerfile for automated build #24

Open · wants to merge 1 commit into master
Conversation

lpmi-13 commented Mar 14, 2019

I'm not exactly sure why, but when I tried to use line continuation to put all the ENV variable declarations on one line, everything blew up, so I'm leaving them as multiple layers for now. I've also not had much luck building things in the smaller Alpine containers, which is why I went with the bigger (but more robust) Ubuntu 16.04 base image.
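For what it's worth, the usual single-layer form is one ENV instruction with backslash continuations, where the backslash must be the very last character on each line (a stray trailing space is the classic way this "blows up"). A minimal sketch with hypothetical variable names, not necessarily the ones in this PR:

```dockerfile
FROM ubuntu:16.04

# One ENV instruction -> one layer. Each backslash must be the final
# character on its line; trailing whitespace breaks the continuation.
ENV OPENCCG_HOME=/openccg \
    PATH=/openccg/bin:$PATH \
    LD_LIBRARY_PATH=/openccg/lib
```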

Ideally, we would have a Dockerfile that features a clean build with everything in source control, but if that's not feasible, then I suggest we might use this to put something in a publicly accessible docker hub image, to be used with docker-compose, and then it wouldn't have to constantly pull in dependencies from sourceforge.

I also only did cursory testing on this (i.e., with the tccg command in /grammars/tiny), so there might be something I missed during the compilation process, but I'm happy to update this PR in that event.

Lastly, if it seems sensible to add a short section to the README about installing Docker, running it as non-root (at least on Linux), and using it to build and run the image, I'm happy to add that as part of this PR as well.

lpmi-13 (Author) commented Mar 18, 2019

Have updated the README with some basic install and usage instructions for docker, as well as grabbing most of the jar files from the maven repository rather than sourceforge. I'm still not entirely sure which functionality is missing (didn't bother building KenLM or hunting down NERApp.jar, but can take another look if necessary). Tested with the instructions from the README and everything seemed to have built alright though.

Happy to either add on to this PR or open a new one if there is additional functionality that needs more build steps.

mwhite14850 (Member) commented Mar 18, 2019 via email

lpmi-13 (Author) commented Mar 18, 2019

Ah right, that and the taggers README are just what I'd been missing!

Will give them a read and update this PR when they're integrated into the docker build.

mwhite14850 (Member) commented Mar 19, 2019 via email

lpmi-13 (Author) commented Apr 5, 2019

I'm a bit stuck at the following section from the ccgbank-README:

Since the pre-built English models and CCGbank data for training
represent much larger downloads than the OpenCCG core files, they are
available as separate downloads (where YYYY-MM-DD represents the date
of creation):

english-models.YYYY-MM-DD.tgz
ccgbank-data.YYYY-MM-DD.tgz


I wasn't able to find these in either the openccg project or anywhere on the ccgBank site. Would you be able to provide a bit of guidance?

dmhowcroft (Contributor):

Ah, there's no Git LFS or similar solution set up for the data files, so they're still hosted on Source Forge: https://sourceforge.net/projects/openccg/files/data/

I don't think this is mentioned explicitly in the README; there's just the pointer to get the libs from SF.

lpmi-13 (Author) commented Apr 7, 2019

Got it. If those two archives don't tend to change much (it looks like 2013 was the last update?), are there any objections to just storing the uncompressed versions in the GitHub source?

dmhowcroft (Contributor):

I think we should at least keep them out of the main branch on GitHub because not everyone needs them. 90 MB might not be much by today's standards (with terribly wasteful Electron apps all over the place), but I would rather handle it separately in the Dockerfile and let folks who just need the git repo avoid downloading it altogether.

lpmi-13 (Author) commented Apr 8, 2019

Ah, fair point. I guess we might want to decide whether the Dockerfile should target just the bare-bones minimum functionality, then?

I'm sure it's my lack of domain knowledge here, but I wasn't really able to figure out from the README what the basic functions of the project are (eg, as a subset of the complete set of functions). Would it be possible to specify the default functions (and ideally with examples of the commands and expected outputs) we definitely want in a docker container, and then I can aim to target that?

What the README suggests to me is that we have three basic functions that we would expect in any minimally functional installation (specified in the "Trying it out", "Visualizing semantic graphs", and "Creating disjunctive logical forms" sections), though do please correct me if that's not the case.

Alternately, if we would prefer to have a bit more of the functionality, including the parsing and tagging (as specified in docs/taggers-README), I'm happy to attempt to add that in as well, since solving the installation/configuration issues once in the docker container would make it scalable for future use by other researchers.

mwhite14850 (Member) commented Apr 10, 2019 via email

lpmi-13 (Author) commented Apr 12, 2019

Hi Mike,

I appreciate your patience with this whole process. I'm very aware of my lack of context on this project, and that's probably leading to a lot of questions that wouldn't otherwise arise. I'm definitely focused on creating something that's useful for you and the project users, so I'm very happy for you and the other maintainers to drive this.

In terms of options for Docker implementations, we could technically add a second Dockerfile for a non-default container, though this isn't usually done and is considered a bit non-standard (ergo, perhaps not immediately obvious to users that the option exists).

What might be easier is to set up the build so that if the user wants to use those extra models, they would be able to download those locally and have them automatically mounted into the default container at runtime. One of the advantages of this is it would just involve passing additional parameters in the command rather than any difference in the Dockerfile per se.
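That runtime-mount approach might look something like this (the image name and container paths here are hypothetical, purely for illustration):

```shell
# Default usage, no extra models baked in or mounted:
docker run --rm -it openccg/openccg

# Users who have downloaded the English models locally mount them in;
# only the command changes, not the Dockerfile:
docker run --rm -it \
    -v "$PWD/english-models:/openccg/models:ro" \
    openccg/openccg
```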

And then in terms of end-usage, I'm assuming that the results of most of the commands involve writing things to files (rather than, say, just outputting results of exploratory data analysis to the console)? This would primarily have implications for whether users would need to enter the running container vs. just firing commands at it (the former being slightly more complex), but in any case it would be possible to set up the final container to take input and return output...just trying to think through what the final usable solution looks like.

lpmi-13 (Author) commented May 7, 2019

So I've added in a conditional script in the Dockerfile to deal with the english-models.YYYY-MM-DD.tgz file if it exists, and skip it if it does not.
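The conditional can be sketched roughly like this (the filename pattern comes from the ccgbank-README; the exact script in the PR may differ, and the echo messages are my own):

```shell
#!/bin/sh
# Extract the optional English-models archive if one is present,
# otherwise skip it without failing the build.
archive=$(ls english-models.*.tgz 2>/dev/null | head -n 1)
if [ -n "$archive" ]; then
    echo "extracting $archive"
    tar xzf "$archive"
else
    echo "no english-models archive found, skipping"
fi
```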

In addition, I've now gotten the following commands to complete successfully:

  • tccg

  • ccg-draw-graph -i tb.xml -v graphs/g

  • ccg-build -f build-ps.xml test-novel &> logs/log.ps.test.novel &

  • ccg-build -f build-rz.xml test-novel &> logs/log.rz.test.novel &

...though this is where I'm currently stuck:

  • Building English models from the CCGBank
    You'll also need to create a symbolic link to
    your original CCGbank directory from $OPENCCG_HOME/ccgbank/.
    (what would the original CCGbank directory be? I'm unable to find anything in the system that looks like ccgbank1.1)

mwhite14850 (Member) commented May 7, 2019 via email

lpmi-13 (Author) commented May 8, 2019

Hi Mike,

Alright, I think we're almost there. I've added a comment as per your suggestion into the README documentation and skipped building the English models for now.

Just to confirm, are the POS and supertaggers intended for normal use? If so, I can go ahead and get those working in the docker container as well, but wanted to check with you first just in case those are a similar feature like the CCGBank English model building and shouldn't be included in the default container.

Thanks,
Adam

mwhite14850 (Member) commented May 8, 2019 via email

lpmi-13 (Author) commented May 9, 2019

I've been able to compile both the maxent toolkit (from source on GitHub) and the SRILM package (available from one of the Google Code archives...version 1.6.0, but it seems like it might work).

The current sticking point is when I attempt to run the command:
ccg-build -f build-original.xml &> logs/log.original &

the training fails with the following log output:

Buildfile: /openccg/ccgbank/build-original.xml

init:

make-corpus-splits:
     [echo] Making corpus splits in ./original/corpus

BUILD FAILED
/openccg/ccgbank/build-original.xml:46: /openccg/ccgbank/ccgbank1.1/data/AUTO does not exist.

Total time: 0 seconds

...which leads me to believe that it might be related to the same issue as previously, where since I don't have the CCGBANK data, it fails. Any thoughts?
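Since the CCGbank is licensed and can't be baked into a public image, one option (just a sketch; the path is taken from the error message above, and the guard logic is my own suggestion) is to gate that build step the same way as the English models:

```shell
#!/bin/sh
# Only attempt the CCGbank-derived build when the (licensed) corpus
# has actually been mounted or copied into the container.
CCGBANK_AUTO="${OPENCCG_HOME:-/openccg}/ccgbank/ccgbank1.1/data/AUTO"
if [ -d "$CCGBANK_AUTO" ]; then
    ccg-build -f build-original.xml > logs/log.original 2>&1
else
    echo "CCGbank data not found at $CCGBANK_AUTO, skipping model build"
fi
```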

mwhite14850 (Member) commented May 9, 2019 via email

lpmi-13 (Author) commented May 9, 2019

ah, got it. I don't happen to have any access to the CCGbank, but perhaps you might know a user who would be interested in testing out the docker container build?

In any event, does your note about the maxent and SRILM mean that we wouldn't necessarily want them in the default docker build? I'm assuming the KenLM download, as massive as it is, is also not something we'd want in the default docker build.

I'm happy to remove the steps for adding/compiling the maxent/SRILM stuff if that seems appropriate, and do please let me know if anything else would be necessary to finish this pull request for the docker automated build.

Thanks,
Adam

mwhite14850 (Member) commented May 9, 2019 via email

lpmi-13 (Author) commented May 10, 2019

I tried parsing and realizing using gigaword4.5g.kenlm.bin, and it seems to have completed successfully, though I'm not sure if that was because it correctly picked it up, or if it fell back to the default trigram model. Would there be something in particular in the log files that might indicate that?

I was also curious about the line in the README to set the LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OPENCCG_HOME/lib

It looked like $LD_LIBRARY_PATH wasn't resolving to anything, since it hadn't been set yet, so I attempted to just set it to $OPENCCG_HOME/lib like so:
export LD_LIBRARY_PATH=$OPENCCG_HOME/lib

Was there a previous step where $LD_LIBRARY_PATH had been set...or should that be a different variable?

dmhowcroft (Contributor):

LD_LIBRARY_PATH is a standard linux environment variable that is usually empty but can contain a list of places to look for libraries before searching in the standard places. (More here)

In Linux, the environment variable LD_LIBRARY_PATH is a colon-separated set of directories where libraries should be searched for first, before the standard set of directories; this is useful when debugging a new library or using a nonstandard library for special purposes.

So I don't think it's unusual for it to be unset right now.
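One small detail worth noting: the README's form works, but when the variable starts out unset the result begins with a colon, and glibc's loader treats that empty entry as "search the current directory". A defensive sketch that appends without introducing the empty entry (the default OPENCCG_HOME value here is just an example):

```shell
# ${VAR:+text} expands to "text" only when VAR is set and non-empty,
# so no stray leading colon appears when LD_LIBRARY_PATH is unset.
OPENCCG_HOME="${OPENCCG_HOME:-/openccg}"   # example value
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$OPENCCG_HOME/lib"
echo "$LD_LIBRARY_PATH"
```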

Per @mwhite14850's suggestion, I think I can test the docker image, but I'm very busy for the coming weeks so I can't guarantee a particular time for testing. If I find the time, I'll update this thread that I'm working on it, and otherwise I'll try to get back to it sometime in June.

mwhite14850 (Member) commented May 10, 2019 via email

lpmi-13 (Author) commented May 15, 2019

ah, thanks for the pointer @dmhowcroft ...I'm a bit embarrassed to admit I've never encountered the LD_LIBRARY_PATH in my adventures through linux land. Good to know!

I've been examining the log files, though now I can't seem to get the parser to work, which is strange, since it was working earlier with the same Dockerfile. At any rate, no rush to test this, since I need to do some investigation into this anyway.

lpmi-13 (Author) commented May 19, 2019

alright, so I've confirmed that the gigaword4.5g.kenlm.bin file works if present. The lines you mentioned were only in the logs when I removed the file. Additionally, I've pushed the version of the Dockerfile that works for all the basic functionality, along with lines to get the maxent toolkit and SRILM working, which are commented out for now (plus a line in the README detailing the same).

Again, this could do with a full test to see if I've set it up correctly. Version 1.6 of SRILM was the only one I could find freely available (and possible to pull in via Dockerfile).

No rush on this. Feel free to test when convenient and we can move forward from there.

dmhowcroft (Contributor):

I just tried to install on my laptop running Fedora 30 and encountered the following error at the end of the Docker build process:

Error: Could not find or load main class org.apache.tools.ant.launch.Launcher
The command '/bin/sh -c ./models.sh &&     mvn dependency:copy-dependencies -DoutputDirectory='./lib' &&     mv lib/stanford-corenlp-3.9.2.jar ccgbank/stanford-nlp/stanford-core-nlp.jar &&     jar xf lib/stanford-corenlp-3.9.2-models.jar &&     cp edu/stanford/nlp/models/ner/* ccgbank/stanford-nlp/classifiers/ &&     rm -rf edu &&     ccg-build' returned a non-zero code: 1

Steps to reproduce:

1. Install Docker and start it:

       sudo dnf install docker
       sudo systemctl start docker

2. Clone the repo and check out the right branch:

       git clone https://github.com/lpmi-13/openccg.git
       cd openccg
       git checkout dockerize

3. Run the Docker build process:

       sudo docker build .

mwhite14850 (Member) commented Jul 16, 2019 via email

lpmi-13 (Author) commented Jul 16, 2019

The Maven pull request can go in whenever you're ready, and I'll take a look at reproducing this error in the Docker build today or tomorrow.

mwhite14850 (Member) commented Jul 16, 2019 via email

lpmi-13 (Author) commented Jul 18, 2019

The failure is due to javacc, which may somehow have gotten out of sync. I notice it's also included in the master branch now via the merge two days ago, so I'll attempt to update using that instead and we can see where we are.

3 participants