This repository has been archived by the owner on Nov 23, 2017. It is now read-only.

Allow specifying the Hadoop minor version (2.4 and 2.6 at the moment) #56

Open
wants to merge 31 commits into branch-2.0

Conversation

@jamborta (Author) commented Oct 2, 2016

No description provided.

@shivaram (Contributor) commented Oct 3, 2016

Thanks @jamborta - I will take a look at this soon. Meanwhile, if anybody else can try this out and give any comments, that would be great!

@jamborta (Author) commented Oct 8, 2016

Added Hadoop 2.7 as it has much better S3 support.

@jamborta (Author) commented Oct 10, 2016

@shivaram it would be good to add hadoop-2.6 and hadoop-2.7 to your spark-related-packages S3 bucket, as pulling them from www.apache.org gets slow from time to time.

@@ -11,10 +11,15 @@ SCALA_VERSION="2.10.3"

if [[ "0.7.3 0.8.0 0.8.1" =~ $SPARK_VERSION ]]; then
SCALA_VERSION="2.9.3"
wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
Contributor:

I'm not sure we need a scala installation on the cluster anymore as Spark should just work with a JRE. But it seems fine to have this if people find it useful

Author:

I've never tried Spark without Scala. Does even spark-shell not need Scala?

Contributor:

Yes - recent Spark distributions include the Scala libraries that provide the shell and other support. But since this is a useful thing irrespective, let's keep it.

@shivaram (Contributor) left a comment

Thanks @jamborta -- Sorry for the delay in reviewing. Overall the change looks pretty good. Left some comments inline

wget http://s3.amazonaws.com/spark-related-packages/spark-1.3.0-bin-hadoop2.4.tgz
fi
;;
2.0.0)
Contributor:

Can't this be handled in the default case ? Also is hadoop1 supported with 2.0.0 ?

Author:

Left the default case to cover 1.4 - 1.6.2. I could turn it around and set the default case to Spark 2.0.0 and 2.0.1, but that looked like more code. 😄
I could not find hadoop1 jars in spark-related-packages, so I did not include it.

wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
Contributor:

What about hadoop-2.7 here ? Was it left out as only > 2.0.0 supports hadoop 2.7 ? In that case I think we can have two big case statements -- one for spark major versions < 2.0 and one for major version >= 2.0 ? FWIW the main goal is to avoid making this file too long

Author:

It was to cover 1.4 - 1.6.2, so no Hadoop 2.7.

Author:

Would be happy to rewrite it, though.

Contributor:

Yeah I was thinking that we could write two big case statements one to handle 1.x and the other to handle 2.x (we can add sub-case statements within them for specific 1.x quirks etc.)
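A minimal sketch of what that two-case-statement layout could look like in spark/init.sh (the version patterns and the set of supported Hadoop builds are illustrative assumptions, not the final list):

    # Sketch only: outer case on the Spark major version, inner case on the
    # Hadoop minor version for the 1.x-specific quirks.
    case "$SPARK_VERSION" in
      1.*)
        case "$HADOOP_MINOR_VERSION" in
          2.4|2.6)
            wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz
            ;;
          *)
            echo "Unsupported Hadoop minor version: $HADOOP_MINOR_VERSION"
            exit 1
            ;;
        esac
        ;;
      2.*)
        # Spark 2.x release binaries also exist for Hadoop 2.7.
        wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz
        ;;
    esac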

format(v=spark_version, hv=hadoop_version), file=stderr)
sys.exit(1)
if hadoop_version == "yarn" and hadoop_minor_version != "2.4" and hadoop_minor_version != "2.6" and hadoop_minor_version != "2.7":
print("Spark version: {v}, does not support Hadoop minor version: {hm}".
Contributor:

Might be useful to list the supported minor versions. Also can we make this a list at the top of the file ? Might be easier to add more hadoop versions later on

Author:

ok. added.

@@ -13,6 +13,9 @@ then
# Not yet supported
echo "Tachyon git hashes are not yet supported. Please specify a Tachyon release version."
# Pre-package tachyon version
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]
Contributor:

Also need to handle 2.7 ? I actually think we can turn off tachyon if Spark version >= 2.0 / or hadoop version = yarn.
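A rough sketch of the "turn off Tachyon" idea for tachyon/init.sh, assuming SPARK_VERSION is a plain x.y.z release string at this point (git hashes are rejected earlier in the script):

    # Sketch only: skip the Tachyon setup entirely for Spark 2.x and later.
    SPARK_MAJOR_VERSION="${SPARK_VERSION%%.*}"
    if [[ "$SPARK_MAJOR_VERSION" -ge 2 ]]; then
      echo "Skipping Tachyon setup for Spark $SPARK_VERSION"
      exit 0
    fi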

Author:

added.


@@ -80,6 +80,12 @@
"2.0.0",
])

VALID_HADOOP_MINOR_VERSIONS = set([
"2.4",
"2.5",
Contributor:

This should be 2.6 ?


@jamborta (Author)
@shivaram updated the code to handle the 4 distinct ranges of Spark versions.

@jamborta (Author)
@shivaram it would also be good to include hadoop-2.7.0 and hadoop-2.6.0 in your S3 bucket. It takes about 10 mins to pull them from apache.org (and it needs to be done twice).

@nchammas (Contributor)
@jamborta - I agree it would be more convenient to have Hadoop hosted on there as well, but the last time I brought this issue up I was told that going forward only Spark packages would be hosted on S3.

The way I got around this on my own project, Flintrock, is by downloading from the (often slow) Apache mirrors by default, but allowing users to specify an alternate download location if desired. Dunno if we want to do the same for spark-ec2.

If Databricks or the AMPLab (I'm not sure who owns the spark-related-packages S3 bucket) hosts the various versions of Hadoop on there, it would be more convenient for everybody, but I'm not sure they want to take on that responsibility going forward. And I can understand why, since it's a cost (in time and money) and there are alternative solutions out there already.

@shivaram (Contributor)
Yeah, it's awkward that Apache / Amazon wouldn't have a fast way to download Hadoop on EC2. I'm talking to @pwendell and @JoshRosen to see if we can add Hadoop stuff to spark-related-packages.

@shivaram (Contributor)
Thanks to @JoshRosen I now have permission to upload to the spark-related-packages bucket. Right now Hadoop 2.6 and Hadoop 2.7 tar.gz files should be up there. @jamborta Could you test this and let me know if they work fine ?

@nchammas (Contributor)
Not to distract from the PR, but is it now "official" that new releases of Hadoop will be uploaded to S3 going forward? Or is this just a one-off?

@shivaram (Contributor)
I would like to keep this a one-off and not make it "official", as I don't think we have enough resources to track all Hadoop releases and keep this up to date (like, say, we do for Spark). However, I think we can undertake hosting every Hadoop version that Spark release binaries are built for (right now this list is 2.3, 2.4, 2.6 and 2.7 for the 2.0.1 release). Does that sound good?

@nchammas (Contributor)
Yes, that sounds good to me. The releases of Hadoop that Spark targets are the only ones that users of spark-ec2, Flintrock, and similar tools are going to care about anyway.

@jamborta (Author)
@shivaram updated the S3 path in the code. All works fine.

@shivaram (Contributor)
It looks like there are some conflicts - Could you resolve them ?

@jamborta (Author)
just done

@shivaram (Contributor) left a comment

Thanks @jamborta for the updates. I did one more pass.

wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
elif [[ "2.0.0" =~ $SPARK_VERSION ]]; then
SCALA_VERSION="2.11.8"
wget http://downloads.lightbend.com/scala/2.11.8/scala-$SCALA_VERSION.tgz
Contributor:

I've also uploaded this to the S3 bucket. Let's switch to that to avoid depending on the Lightbend source?
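In that case the Lightbend URL above could presumably be swapped for the same bucket used elsewhere in the script (assuming the uploaded tarball keeps the scala-$SCALA_VERSION.tgz naming):

    # Sketch only: fetch Scala 2.11.8 from the spark-related-packages bucket
    # instead of downloads.lightbend.com.
    SCALA_VERSION="2.11.8"
    wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz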

if "." in spark_version:
parts = spark_version.split(".")
if parts[0].isdigit():
if parts[0].isdigit() and parts[0].isdigit():
Contributor:

redundant if check ? I guess this should be parts[1].isdigit() ?

sys.exit(1)
if hadoop_minor_version == "2.7" and spark_major_minor_version < 2.0:
print("Spark version: {v}, does not support Hadoop minor version: {hm}".
format(v=spark_version, hm=hadoop_minor_version, sv=",".join(VALID_HADOOP_MINOR_VERSIONS)), file=stderr)
Contributor:

the variable sv is not used in this error message or the next one ?

0.7.3)
case "$SPARK_VERSION" in
# 0.7.3 - 1.0.2
0\.[7-9]\.[0-3]|1\.0\.[0-2])
Contributor:

I'm not sure this will work as 0.8.0 and 0.9.0 have "incubating" in their package names. My take would be to keep the existing long form for these early versions.

if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop1.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop1.tgz
Contributor:

The version numbers here should not be hard-coded to 2.0.0? Also, it might be good to keep the future-proof solution we had of `spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz`?

So thinking more about this, I think the idea should be that we do the checking of available Spark / Hadoop version combinations in the Python file (it's easier to read / review / maintain than bash). Then the bash script just does the downloading / setup to handle corner cases like incubating etc. What do you think?
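Under that split, the bash side could shrink to something like the following sketch, with spark_ec2.py having already validated the Spark/Hadoop combination before launch (variable names as used elsewhere in this PR; the error handling is an assumption):

    # Sketch only: spark/init.sh just builds the URL and downloads; all
    # version-combination checks happen in spark_ec2.py.
    SPARK_PACKAGE="spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz"
    if ! wget "http://s3.amazonaws.com/spark-related-packages/$SPARK_PACKAGE"; then
      echo "ERROR: could not download $SPARK_PACKAGE"
      exit 1
    fi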

@jamborta (Author) commented Oct 24, 2016

@shivaram updated the code based on your comments. validate_spark_hadoop_version in Python should capture all the checks that are needed, but I still kept the same logic in spark/init.sh. I think some of the logic can be removed from init.sh if we don't want to duplicate the checks.

@nchammas (Contributor)
@shivaram - Sorry to keep butting in on this PR.

If we are going to start hosting Hadoop on S3, shouldn't we use the latest maintenance releases?

In other words, 2.7.3 instead of 2.7.0, 2.6.5 instead of 2.6.0, etc...

@shivaram (Contributor)
@nchammas That's a reasonable question - I uploaded 2.7.0 as @jamborta had used that in the PR. On the one hand, I don't mind starting off with 2.7.3 right now, as it's better to have the maintenance release. On the other hand, we might have to pick / live with, say, the latest maintenance version as of now, as I don't think we can keep updating this when, say, 2.7.4 comes out.

@jamborta Any thoughts on this ?

@jamborta (Author) commented Oct 25, 2016

I agree that we could put up the latest maintenance releases for 2.4, 2.6 and 2.7 as they are now, and stick with those.

@jamborta (Author)
@shivaram if you upload the latest releases I can update this PR.

@shivaram (Contributor)
@jamborta 2.7.3 and 2.6.5 are now in the same S3 bucket

@jamborta (Author)
@shivaram code updated
