Allow to specify hadoop minor version (2.4 and 2.6 at the moment) #56
base: branch-2.0
Conversation
Thanks @jamborta - I will take a look at this soon. Meanwhile if anybody else can try this out and give any comments that will be great!
Added hadoop 2.7 as it has much better s3 support.
@shivaram would be good to add hadoop2.6 and hadoop2.7 to your spark-related-packages s3 bucket, as pulling it from www.apache.org gets slow from time to time.
@@ -11,10 +11,15 @@ SCALA_VERSION="2.10.3"

if [[ "0.7.3 0.8.0 0.8.1" =~ $SPARK_VERSION ]]; then
  SCALA_VERSION="2.9.3"
  wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
I'm not sure we need a scala installation on the cluster anymore as Spark should just work with a JRE. But it seems fine to have this if people find it useful
Never tried Spark without Scala. Even spark-shell does not need Scala?
Yes - recent Spark distributions include the Scala libraries that provide the shell and other support. But since this is useful regardless, let's keep it.
Thanks @jamborta -- Sorry for the delay in reviewing. Overall the change looks pretty good. Left some comments inline
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.3.0-bin-hadoop2.4.tgz
  fi
  ;;
2.0.0) |
Can't this be handled in the default case? Also, is hadoop1 supported with 2.0.0?
Left the default case to cover 1.4 - 1.6.2. Could turn it around and set the default case to Spark 2.0.0 and 2.0.1, but that looked like more code. 😄
I could not find hadoop1 jars in spark-related-packages, so I did not include it
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
  wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
What about hadoop-2.7 here? Was it left out as only > 2.0.0 supports hadoop 2.7? In that case I think we can have two big case statements -- one for Spark major versions < 2.0 and one for major version >= 2.0? FWIW the main goal is to avoid making this file too long.
It was to cover 1.4 - 1.6.2, so no hadoop 2.7.
would be happy to rewrite, though.
Yeah I was thinking that we could write two big case statements one to handle 1.x and the other to handle 2.x (we can add sub-case statements within them for specific 1.x quirks etc.)
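To make that suggestion concrete, here is a minimal sketch of the two-case structure (variable names follow the existing script; the version lists and error handling are illustrative, not the merged code):

```bash
# Sketch: one outer branch per Spark major version line.
case "$SPARK_VERSION" in
  1.*)
    # 1.x: only the Hadoop builds this PR targets (2.4 and 2.6).
    case "$HADOOP_MINOR_VERSION" in
      2.4|2.6)
        wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz
        ;;
      *)
        echo "Hadoop $HADOOP_MINOR_VERSION is not supported for Spark $SPARK_VERSION" >&2
        exit 1
        ;;
    esac
    ;;
  2.*)
    # 2.x: Hadoop 2.7 becomes available as well.
    case "$HADOOP_MINOR_VERSION" in
      2.4|2.6|2.7)
        wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz
        ;;
      *)
        echo "Hadoop $HADOOP_MINOR_VERSION is not supported for Spark $SPARK_VERSION" >&2
        exit 1
        ;;
    esac
    ;;
esac
```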
          format(v=spark_version, hv=hadoop_version), file=stderr)
    sys.exit(1)
if hadoop_version == "yarn" and hadoop_minor_version != "2.4" and hadoop_minor_version != "2.6" and hadoop_minor_version != "2.7":
    print("Spark version: {v}, does not support Hadoop minor version: {hm}".
Might be useful to list the supported minor versions. Also, can we make this a list at the top of the file? Might be easier to add more hadoop versions later on.
ok. added.
@@ -13,6 +13,9 @@ then
  # Not yet supported
  echo "Tachyon git hashes are not yet supported. Please specify a Tachyon release version."
# Pre-package tachyon version
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]
Also need to handle 2.7? I actually think we can turn off Tachyon if Spark version >= 2.0 or hadoop version = yarn.
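For what it's worth, that guard might look like the sketch below (assumptions: `$SPARK_VERSION` is a numeric release, and `$HADOOP_MAJOR_VERSION` can take the value `yarn`, mirroring the `hadoop_version == "yarn"` check in the Python snippet above):

```bash
# Sketch: skip Tachyon entirely for Spark >= 2.0 or yarn builds.
SPARK_MAJOR="${SPARK_VERSION%%.*}"
if [[ "$SPARK_MAJOR" -ge 2 || "$HADOOP_MAJOR_VERSION" == "yarn" ]]; then
  echo "Skipping Tachyon setup for Spark $SPARK_VERSION / Hadoop $HADOOP_MAJOR_VERSION"
else
  # ... existing Tachyon download and setup logic ...
  :
fi
```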
added.
@@ -80,6 +80,12 @@
    "2.0.0",
])

VALID_HADOOP_MINOR_VERSIONS = set([
    "2.4",
    "2.5",
This should be 2.6?
@shivaram updated the code to handle the 4 distinct ranges of Spark versions.
@shivaram would also be good to include hadoop-2.7.0 and hadoop-2.6.0 in your s3 bucket. It takes about 10 mins to pull that from apache.org (and needs to be done twice).
@jamborta - I agree it would be more convenient to have Hadoop hosted on there as well, but the last time I brought this issue up I was told that going forward only Spark packages would be hosted on S3. The way I got around this on my own project, Flintrock, is by downloading from the (often slow) Apache mirrors by default, but allowing users to specify an alternate download location if desired. Dunno if we want to do the same for spark-ec2. If Databricks or the AMPLab (I'm not sure who owns the
Yeah, it's awkward that Apache / Amazon wouldn't have a fast way to download Hadoop on EC2. I'm talking to @pwendell and @JoshRosen to see if we can add Hadoop stuff to spark-related-packages.
Thanks to @JoshRosen I now have permission to upload to the spark-related-packages bucket. Right now the Hadoop 2.6 and Hadoop 2.7 tar.gz files should be up there. @jamborta Could you test this and let me know if they work fine?
Not to distract from the PR, but is it now "official" that new releases of Hadoop will be uploaded to S3 going forward? Or is this just a one-off?
I would like to keep this a one-off and not make it "official", as I don't think we have enough resources to track all Hadoop releases and keep this up to date (like, say, we do for Spark). However, I think we can undertake hosting every Hadoop version that Spark release binaries are built for (right now this list is 2.3, 2.4, 2.6 and 2.7 for the 2.0.1 release). Does that sound good?
Yes, that sounds good to me. The releases of Hadoop that Spark targets are the only ones that users of spark-ec2, Flintrock, and similar tools are going to care about anyway.
@shivaram updated the s3 path in the code. All works fine.
It looks like there are some conflicts - could you resolve them?
Just done.
Thanks @jamborta for the updates. I did one more pass.
  wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
elif [[ "2.0.0" =~ $SPARK_VERSION ]]; then
  SCALA_VERSION="2.11.8"
  wget http://downloads.lightbend.com/scala/2.11.8/scala-$SCALA_VERSION.tgz
I've also uploaded this to the s3 bucket. Let's switch to that to avoid depending on the lightbend source?
if "." in spark_version: | ||
parts = spark_version.split(".") | ||
if parts[0].isdigit(): | ||
if parts[0].isdigit() and parts[0].isdigit(): |
Redundant if check? I guess this should be `parts[1].isdigit()`?
    sys.exit(1)
if hadoop_minor_version == "2.7" and spark_major_minor_version < 2.0:
    print("Spark version: {v}, does not support Hadoop minor version: {hm}".
          format(v=spark_version, hm=hadoop_minor_version, sv=",".join(VALID_HADOOP_MINOR_VERSIONS)), file=stderr)
The variable `sv` is not used in this error message or the next one?
0.7.3)
case "$SPARK_VERSION" in
  # 0.7.3 - 1.0.2
  0\.[7-9]\.[0-3]|1\.0\.[0-2])
I'm not sure this will work, as 0.8.0 and 0.9.0 have incubating in their package names. My take would be to keep the existing long form for these early versions.
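Concretely, the long form keeps explicit entries for the incubating releases; a sketch (the `-incubating` package names follow the Apache naming of that era, and the exact version list here is illustrative):

```bash
# Sketch: a version-range regex would build the wrong URL for releases
# whose artifacts carry the -incubating suffix, so list them explicitly.
case "$SPARK_VERSION" in
  0.8.0|0.8.1)
    wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-incubating-bin-hadoop1.tgz
    ;;
  *)
    wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
    ;;
esac
```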
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then | ||
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop1.tgz | ||
wget http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop1.tgz |
The version numbers here should not be hard coded to 2.0.0? Also, it might be good to keep the future-proof solution we had of `spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz`?
So thinking more about this, I think the idea should be that we do the checking of available Spark / Hadoop version combinations in the Python file (it's easier to read / review / maintain than bash). Then the bash script just does the downloading / setup to handle corner cases like `incubating` etc. What do you think?
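Under that split, the bash side could shrink to something like this (a hypothetical sketch: it assumes spark_ec2.py has already rejected invalid Spark/Hadoop combinations, so the script only resolves naming quirks and downloads):

```bash
# Sketch: validation lives in spark_ec2.py; bash only builds the URL.
SUFFIX=""
case "$SPARK_VERSION" in
  0.8.*) SUFFIX="-incubating" ;;  # corner case: incubating package names
esac
wget "http://s3.amazonaws.com/spark-related-packages/spark-${SPARK_VERSION}${SUFFIX}-bin-hadoop${HADOOP_MINOR_VERSION}.tgz"
```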
@shivaram updated the code based on your comments.
@shivaram - Sorry to keep butting in on this PR. If we are going to start hosting Hadoop on S3, shouldn't we use the latest maintenance releases? In other words, 2.7.3 instead of 2.7.0, 2.6.5 instead of 2.6.0, etc.
@nchammas That's a reasonable question - I uploaded 2.7.0 as @jamborta had used that in the PR. On the one hand, I don't mind starting off with 2.7.3 right now as it's better to have the maintenance release. On the other hand, we might have to pick / live with the latest maintenance version as of now, as I don't think we can keep updating this when say 2.7.4 comes out. @jamborta Any thoughts on this?
I agree that we could put the latest maintenance release for 2.4, 2.6 and 2.7 as it is now, and stick with those.
@shivaram if you upload the latest releases I can update this PR.
@jamborta 2.7.3 and 2.6.5 are now in the same S3 bucket.
@shivaram code updated.