add Dockerfile for automated build #24
base: master
Conversation
Have updated the README with some basic install and usage instructions for Docker, as well as grabbing most of the jar files from the Maven repository rather than SourceForge. I'm still not entirely sure which functionality is missing (didn't bother building KenLM or hunting down NERApp.jar, but can take another look if necessary). Tested with the instructions from the README and everything seemed to have built alright, though. Happy to either add on to this PR or open a new one if there is additional functionality that needs more build steps.
I gather you haven't tried any of the steps from docs/ccgbank-README?
(Looks like that's not linked from the main README, which it should be.)
Ah right, that and the taggers README are just what I'd been missing! Will give them a read and update this PR when they're integrated into the docker build.
Cool beans!
I'm a bit stuck at the following section from the ccgbank-README, which refers to the pre-built English models and CCGbank data for training (english-models.YYYY-MM-DD.tgz). I wasn't able to find these in either the openccg project or anywhere on the CCGbank site. Would you be able to provide a bit of guidance?
Ah, there's no Git LFS or similar solution set up for the data files, so they're still hosted on SourceForge: https://sourceforge.net/projects/openccg/files/data/ I don't think this is mentioned explicitly in the README; there's just the pointer to get the libs from SF.
got it... if those two archives aren't tending to move that much (looks like 2013 was the last update?), then any objections to just storing the uncompressed version in the GitHub source?
I think we should at least keep them out of the main branch on GitHub because not everyone needs them. 90 MB might not be much by today's standards (with terribly wasteful Electron apps all over the place), but I would rather handle it separately in the Dockerfile and let folks who just need the git repo avoid downloading it altogether.
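One way to square that, sketched purely as an assumption (the SourceForge download URL pattern and archive name are placeholders, not verified against the real file listing):

```dockerfile
# Hypothetical Dockerfile step: fetch the pre-built models at image build
# time so the git repository stays small. The archive name is a placeholder
# for whichever english-models.YYYY-MM-DD.tgz is current on SourceForge.
RUN curl -fSL -o /tmp/english-models.tgz \
        "https://sourceforge.net/projects/openccg/files/data/<archive>/download" && \
    tar -xzf /tmp/english-models.tgz -C /openccg && \
    rm /tmp/english-models.tgz
```

This keeps the 90 MB out of the repo while still baking the models into the image for users who want them.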
ah, fair point. I guess we might want to decide on whether the Dockerfile would be intended to target just the bare-bones minimum functionality then? I'm sure it's my lack of domain knowledge here, but I wasn't really able to figure out from the README what the basic functions of the project are (e.g., as a subset of the complete set of functions). Would it be possible to specify the default functions (and ideally with examples of the commands and expected outputs) we definitely want in a docker container, and then I can aim to target that?
What the README suggests to me is that we have three basic functions that we would expect in any minimally functional installation (specified in the "Trying it out", "Visualizing semantic graphs", and "Creating disjunctive logical forms" sections), though do please correct me if that's not the case.
Alternately, if we would prefer to have a bit more of the functionality, including the parsing and tagging (as specified in docs/taggers-README), I'm happy to attempt to add that in as well, since solving the installation/configuration issues once in the docker container would make it scalable for future use by other researchers.

Hi Adam
I would say there are two different kinds of users, namely (1) ones interested in using or creating precise, domain-specific grammars and (2) ones interested in using the broad-coverage English grammar for parsing and (especially) realization.
I would agree that the first group of users would appreciate not having to download large model files that they don't need. The second group of users would generally also like to have the basic functionality in "Trying it out" and "Visualizing semantic graphs". (I'm not sure how much "Creating disjunctive logical forms" is getting used.)
Not sure what this means for the Dockerfile, though; is it possible to have two, or one with options?
Mike
Hi Mike,

I appreciate your patience with this whole process. I'm very aware of my lack of context on this project, and that's probably leading to a lot of questions that wouldn't otherwise arise. I'm definitely focused on creating something that's useful for you and the project users, so I'm very happy for you and the other maintainers to drive this.

In terms of options for Docker implementations, we could technically add a second Dockerfile for a non-default container, though this isn't usually done and is considered a bit non-standard (ergo, perhaps not immediately obvious to users that the option exists). What might be easier is to set up the build so that if the user wants to use those extra models, they would be able to download them locally and have them automatically mounted into the default container at runtime. One of the advantages of this is that it would just involve passing additional parameters in the command rather than any difference in the Dockerfile per se.

And then in terms of end usage, I'm assuming that the results of most of the commands involve writing things to files (rather than, say, just outputting results of exploratory data analysis to the console)? This would primarily have implications for whether users would need to enter the running container vs. just firing commands at it (the former being slightly more complex), but in any case it would be possible to set up the final container to take input and return output... just trying to think through what the final usable solution looks like.
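A minimal sketch of the runtime-mount idea described above (the image tag, local model directory, and container path are all assumptions for illustration):

```shell
# Build the default image without the large models baked in, then mount
# locally downloaded models at runtime only for users who need them.
docker build -t openccg .
docker run --rm -it \
    -v "$PWD/english-models:/openccg/models" \
    openccg
```

Users who don't need the models would simply omit the `-v` flag and get the bare-bones container.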
So I've added in a conditional script in the Dockerfile to deal with the english-models.YYYY-MM-DD.tgz file if it exists, and skip it if it does not.
In addition, I've now gotten the following commands to complete successfully:
- tccg
- ccg-draw-graph -i tb.xml -v graphs/g
- ccg-build -f build-ps.xml test-novel &> logs/log.ps.test.novel &
- ccg-build -f build-rz.xml test-novel &> logs/log.rz.test.novel &
...though this is where I'm stuck currently, at "Building English models from the CCGbank": the README says you'll also need to create a symbolic link to your original CCGbank directory from $OPENCCG_HOME/ccgbank/. (What would the original CCGbank directory be? I'm unable to find anything in the system that looks like ccgbank1.1.)

Hi Adam
The CCGbank is licensed by the LDC and can only be obtained from them directly; that's why this part is set up the way it is.
Perhaps it would make sense to just skip this and document the reason why? I would say that only the most expert users would be likely to want to do this step anyway.
Thanks
Mike
Hi Mike,
Alright, I think we're almost there. I've added a comment as per your suggestion into the README documentation and skipped building the English models for now.
Just to confirm, are the POS and supertaggers intended for normal use? If so, I can go ahead and get those working in the docker container as well, but wanted to check with you first just in case those are a similar feature to the CCGbank English model building and shouldn't be included in the default container.
Thanks,
Adam
Thanks for the update!
Yes, all the taggers are for normal use.
Mike
I've been able to compile both the maxent toolkit (from source on GitHub) as well as the SRILM package (available from one of the Google Code archives... version 1.6.0, but it seems like it might work). The current sticking point is when I attempt to run the command:
ccg-build -f build-original.xml &> logs/log.original &
the training fails with the following log output:
Buildfile: /openccg/ccgbank/build-original.xml
init:
make-corpus-splits:
[echo] Making corpus splits in ./original/corpus
BUILD FAILED
/openccg/ccgbank/build-original.xml:46: /openccg/ccgbank/ccgbank1.1/data/AUTO does not exist.
Total time: 0 seconds
...which leads me to believe that it might be related to the same issue as previously: since I don't have the CCGbank data, it fails. Any thoughts?

Yes, I'm sure that's the same issue.
If you have an LDC license and have or can get the CCGbank, you could test this out. Otherwise the tests for parsing and generating novel text with the existing models are as far as I'd expect you to be able to get.
Note that the maxent toolkit and SRILM are primarily for training models from scratch; in principle the JNI code for using SRILM as a runtime language model could also be used, but it hasn't been anytime recently, as it's mostly superseded by KenLM.
ah, got it. I don't happen to have any access to the CCGbank, but perhaps you might know a user who would be interested in testing out the docker container build?
In any event, does your note about maxent and SRILM mean that we wouldn't necessarily want them in the default docker build? I'm assuming the KenLM download, as massive as it is, is also not something we'd want in the default docker build.
I'm happy to remove the steps for adding/compiling the maxent/SRILM stuff if that seems appropriate, and do please let me know if anything else would be necessary to finish this pull request for the docker automated build.
Thanks,
Adam
Yes, maybe comment out the bits for maxent and SRILM.
Have you been able to test that using gigaword4.5g.kenlm.bin works if it's
there? The log files from running (parsing and) realization on novel text
should be slightly different if it's in the expected location.
Perhaps Dave Howcroft could try out the docker container build ...
I tried parsing and realizing using gigaword4.5g.kenlm.bin, and it seems to have completed successfully, though I'm not sure if that was because it correctly picked it up, or if it fell back to the default trigram model. Would there be something in particular in the log files that might indicate that?
I was also curious about the line in the README to set the LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OPENCCG_HOME/lib
It looked like $LD_LIBRARY_PATH wasn't resolving to anything, since it hadn't been set yet, so I attempted to just set it to $OPENCCG_HOME/lib like so:
export LD_LIBRARY_PATH=$OPENCCG_HOME/lib
Was there a previous step where $LD_LIBRARY_PATH had been set... or should that be a different variable?

So I don't think it's unusual for it to be unset right now. Per @mwhite14850's suggestion, I think I can test the docker image, but I'm very busy for the coming weeks, so I can't guarantee a particular time for testing. If I find the time, I'll update this thread to say I'm working on it; otherwise I'll try to get back to it sometime in June.

If it can't find the big LM, the log file should contain this message:
"Reusing trigram model as a stand-in for the big LM"
If that message isn't there, that should mean that it found the big LM successfully; to test this, just temporarily move or rename the gigaword4.5g.kenlm.bin file and see if the message appears when running again.
This message is in the file ccgbank/plugins/MyNgramCombo.java, one of a set of plugin classes used to do flexible configuration.
ah, thanks for the pointer @dmhowcroft... I'm a bit embarrassed to admit I've never encountered LD_LIBRARY_PATH in my adventures through Linux land. Good to know! I've been examining the log files, though now I can't seem to get the parser to work, which is strange, since it was working earlier with the same Dockerfile. At any rate, no rush to test this, since I need to do some investigation into this anyway.
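For what it's worth, a common shell idiom sidesteps the unset-variable question entirely (offered here as a suggestion, not the project's current README text): only prepend the existing value when it is set, so an unset LD_LIBRARY_PATH doesn't leave a stray leading ":" (which the dynamic loader would treat as the current directory).

```shell
# The OPENCCG_HOME default below is an assumed install path for illustration.
: "${OPENCCG_HOME:=/openccg}"
# ${VAR:+...} expands to "..." only if VAR is set and non-empty, so this
# yields "/openccg/lib" when LD_LIBRARY_PATH was unset, and
# "<old value>:/openccg/lib" when it was already set.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$OPENCCG_HOME/lib"
echo "$LD_LIBRARY_PATH"
```

This works the same whether or not an earlier step set the variable.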
alright, so I've confirmed that the gigaword4.5g.kenlm.bin file works if present. The lines you mentioned were only in the logs when I removed the file. Additionally, I've pushed the version of the Dockerfile that works for all the basic functionality, along with lines to get the maxent toolkit and SRILM working, which are commented out for now (plus a line in the README detailing the same). Again, this could do with a full test to see if I've set it up correctly. Version 1.6 of SRILM was the only one I could find freely available (and possible to pull in via the Dockerfile). No rush on this. Feel free to test when convenient and we can move forward from there.
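The check described above can be scripted; this is a sketch (the log path is an assumption based on the ccg-build commands earlier in the thread, and the message is the one quoted from ccgbank/plugins/MyNgramCombo.java):

```shell
# If the stand-in message is present in the realization log, the big KenLM
# model was NOT found and the default trigram model was used instead.
LOG=logs/log.rz.test.novel   # assumed log location
if [ -f "$LOG" ] && grep -q "Reusing trigram model as a stand-in for the big LM" "$LOG"; then
    echo "fell back to trigram model"
else
    echo "big LM loaded (or log file missing)"
fi
```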
I just tried to install on my laptop running Fedora 30 and encountered the following error at the end of the Docker build process:
Error: Could not find or load main class org.apache.tools.ant.launch.Launcher
The command '/bin/sh -c ./models.sh && mvn dependency:copy-dependencies -DoutputDirectory='./lib' && mv lib/stanford-corenlp-3.9.2.jar ccgbank/stanford-nlp/stanford-core-nlp.jar && jar xf lib/stanford-corenlp-3.9.2-models.jar && cp edu/stanford/nlp/models/ner/* ccgbank/stanford-nlp/classifiers/ && rm -rf edu && ccg-build' returned a non-zero code: 1
Steps to reproduce:
0. Install Docker and start it:
sudo dnf install docker
sudo systemctl start docker
1. Clone the repo and check out the right branch:
git clone https://github.com/lpmi-13/openccg.git
cd openccg
git checkout dockerize
2. Run the Docker build process:
sudo docker build .

I was going to go ahead and accept the maven pull request; does this issue affect that?
Thanks!
The maven pull request can go in whenever you're ready, and I'll take a look at reproducing this error in the docker build today or tomorrow.
Ok, great, just did the merge (as noted on the pull request thread).
Thanks!
The failure is due to javacc, which may somehow have gotten out of sync. I notice it's also included in the master branch now via the merge two days ago, so I'll attempt to update using that instead and we can see where we are. |
I'm not exactly sure why, but when I tried to use line continuation to put all the ENV variable declarations in one instruction, everything blew up... so I'm leaving it as multiple layers for now. I've also not had much luck building things in the smaller Alpine containers, so that's why I went with the bigger (but more robust) Ubuntu 16.04 base image.
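One possible cause, offered as a guess: the space-separated `ENV key value` form only declares a single variable, so continuations work only with the `key=value` form. A sketch (the variable names and paths are illustrative, not taken from the actual Dockerfile):

```dockerfile
# One layer for several environment variables; the key=value form is
# required when declaring more than one variable in a single ENV.
ENV OPENCCG_HOME=/openccg \
    PATH=$PATH:/openccg/bin \
    LD_LIBRARY_PATH=/openccg/lib
```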
Ideally, we would have a Dockerfile that features a clean build with everything in source control, but if that's not feasible, then I suggest we might use this to put something in a publicly accessible docker hub image, to be used with docker-compose, and then it wouldn't have to constantly pull in dependencies from sourceforge.
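A hedged sketch of what the docker-compose route could look like (the Docker Hub image name is hypothetical; nothing has been published there):

```yaml
# docker-compose.yml sketch: pull a pre-built image from Docker Hub
# instead of building locally and re-downloading dependencies each time.
services:
  openccg:
    image: lpmi13/openccg:latest   # hypothetical published image
    volumes:
      - ./grammars:/openccg/grammars
    stdin_open: true
    tty: true
```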
I also only did cursory testing on this (ie, with the tccg command in /grammars/tiny), so there might be something I missed during the compilation process, but happy to update this PR in that event.
Lastly, if it seems sensible to add in a short section in the README about installing docker, running it as non-root (at least in linux), and using it to build and run the image, I'm happy to add that as part of this PR as well.