This repository has been archived by the owner on Nov 23, 2017. It is now read-only.

Allow specifying the Hadoop minor version (2.4 and 2.6 at the moment) #56

Open
wants to merge 31 commits into branch-2.0

Conversation

@jamborta (Author) commented Oct 2, 2016

No description provided.

@shivaram (Contributor) commented Oct 3, 2016

Thanks @jamborta - I will take a look at this soon. Meanwhile, if anybody else can try this out and give any comments, that would be great!

@jamborta (Author) commented Oct 8, 2016

Added Hadoop 2.7 as it has much better S3 support.

@jamborta (Author) commented Oct 10, 2016

@shivaram it would be good to add hadoop-2.6 and hadoop-2.7 to your spark-related-packages S3 bucket, as pulling them from www.apache.org gets slow from time to time.

@@ -11,10 +11,15 @@ SCALA_VERSION="2.10.3"

if [[ "0.7.3 0.8.0 0.8.1" =~ $SPARK_VERSION ]]; then
SCALA_VERSION="2.9.3"
wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
Contributor:

I'm not sure we need a scala installation on the cluster anymore as Spark should just work with a JRE. But it seems fine to have this if people find it useful

Author:

I've never tried Spark without Scala. Does even spark-shell not need Scala?

Contributor:

Yes - recent Spark distributions include the Scala libraries that provide the shell and other support. But since this is a useful thing irrespective, let's keep it.

@shivaram (Contributor) left a comment

Thanks @jamborta -- Sorry for the delay in reviewing. Overall the change looks pretty good. Left some comments inline

wget http://s3.amazonaws.com/spark-related-packages/spark-1.3.0-bin-hadoop2.4.tgz
fi
;;
2.0.0)
Contributor:

Can't this be handled in the default case ? Also is hadoop1 supported with 2.0.0 ?

Author:

Left the default case to cover 1.4 - 1.6.2. I could turn it around and set the default case to Spark 2.0.0 and 2.0.1, but that looked like more code. 😄
I could not find hadoop1 jars in spark-related-packages, so I did not include it.

wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
Contributor:

What about hadoop-2.7 here ? Was it left out as only > 2.0.0 supports hadoop 2.7 ? In that case I think we can have two big case statements -- one for spark major versions < 2.0 and one for major version >= 2.0 ? FWIW the main goal is to avoid making this file too long

Author:

It was to cover 1.4 - 1.6.2, so no Hadoop 2.7.

Author:

Would be happy to rewrite it, though.

Contributor:

Yeah I was thinking that we could write two big case statements one to handle 1.x and the other to handle 2.x (we can add sub-case statements within them for specific 1.x quirks etc.)
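A minimal sketch of what that two-case-statement layout could look like in spark/init.sh (the version patterns and the set of supported Hadoop builds are illustrative assumptions, not the final list):

    # Sketch only: outer case on the Spark major version, inner case on the
    # Hadoop minor version for the 1.x-specific quirks.
    case "$SPARK_VERSION" in
      1.*)
        case "$HADOOP_MINOR_VERSION" in
          2.4|2.6)
            wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz
            ;;
          *)
            echo "Unsupported Hadoop minor version: $HADOOP_MINOR_VERSION"
            exit 1
            ;;
        esac
        ;;
      2.*)
        # Spark 2.x release binaries also exist for Hadoop 2.7.
        wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz
        ;;
    esac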

format(v=spark_version, hv=hadoop_version), file=stderr)
sys.exit(1)
if hadoop_version == "yarn" and hadoop_minor_version != "2.4" and hadoop_minor_version != "2.6" and hadoop_minor_version != "2.7":
print("Spark version: {v}, does not support Hadoop minor version: {hm}".
Contributor:

Might be useful to list the supported minor versions. Also can we make this a list at the top of the file ? Might be easier to add more hadoop versions later on

Author:

ok. added.

@@ -13,6 +13,9 @@ then
# Not yet supported
echo "Tachyon git hashes are not yet supported. Please specify a Tachyon release version."
# Pre-package tachyon version
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]
Contributor:

Also need to handle 2.7 ? I actually think we can turn off tachyon if Spark version >= 2.0 / or hadoop version = yarn.
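A rough sketch of the "turn off Tachyon" idea for tachyon/init.sh, assuming SPARK_VERSION is a plain x.y.z release string at this point (git hashes are rejected earlier in the script):

    # Sketch only: skip the Tachyon setup entirely for Spark 2.x and later.
    SPARK_MAJOR_VERSION="${SPARK_VERSION%%.*}"
    if [[ "$SPARK_MAJOR_VERSION" -ge 2 ]]; then
      echo "Skipping Tachyon setup for Spark $SPARK_VERSION"
      exit 0
    fi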

Author:

added.


@@ -80,6 +80,12 @@
"2.0.0",
])

VALID_HADOOP_MINOR_VERSIONS = set([
"2.4",
"2.5",
Contributor:

This should be 2.6 ?


@jamborta (Author)
@shivaram updated the code to handle the 4 distinct ranges of Spark versions.

@jamborta (Author)
@shivaram it would also be good to include hadoop-2.7.0 and hadoop-2.6.0 in your S3 bucket. It takes about 10 mins to pull them from apache.org (and it needs to be done twice).

@nchammas (Contributor)
@jamborta - I agree it would be more convenient to have Hadoop hosted on there as well, but the last time I brought this issue up I was told that going forward only Spark packages would be hosted on S3.

The way I got around this on my own project, Flintrock, is by downloading from the (often slow) Apache mirrors by default, but allowing users to specify an alternate download location if desired. Dunno if we want to do the same for spark-ec2.

If Databricks or the AMPLab (I'm not sure who owns the spark-related-packages S3 bucket) hosts the various versions of Hadoop on there, it would be more convenient for everybody, but I'm not sure they want to take on that responsibility going forward. And I can understand why, since it's a cost (in time and money) and there are alternative solutions out there already.

@shivaram (Contributor)
Yeah, it's awkward that Apache / Amazon wouldn't have a fast way to download Hadoop on EC2. I'm talking to @pwendell and @JoshRosen to see if we can add Hadoop stuff to spark-related-packages.

@shivaram (Contributor)
Thanks to @JoshRosen I now have permission to upload to the spark-related-packages bucket. Right now Hadoop 2.6 and Hadoop 2.7 tar.gz files should be up there. @jamborta Could you test this and let me know if they work fine ?

@nchammas (Contributor)
Not to distract from the PR, but is it now "official" that new releases of Hadoop will be uploaded to S3 going forward? Or is this just a one-off?

@shivaram (Contributor)
I would like to keep this a one-off and not make it "official", as I don't think we have enough resources to track all Hadoop releases and keep this up to date (like, say, we do for Spark). However, I think we can undertake hosting every Hadoop version that Spark release binaries are built for (right now this list is 2.3, 2.4, 2.6 and 2.7 for the 2.0.1 release). Does that sound good?

@nchammas (Contributor)
Yes, that sounds good to me. The releases of Hadoop that Spark targets are the only ones that users of spark-ec2, Flintrock, and similar tools are going to care about anyway.

@jamborta (Author)
@shivaram updated the S3 path in the code. All works fine.

@shivaram (Contributor)
It looks like there are some conflicts - Could you resolve them ?

@jamborta (Author)
just done

@shivaram (Contributor) left a comment

Thanks @jamborta for the updates. I did one more pass.

wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
elif [[ "2.0.0" =~ $SPARK_VERSION ]]; then
SCALA_VERSION="2.11.8"
wget http://downloads.lightbend.com/scala/2.11.8/scala-$SCALA_VERSION.tgz
Contributor:

I've also uploaded this to the S3 bucket. Let's switch to that to avoid depending on the Lightbend source?
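In that case the Lightbend URL above could presumably be swapped for the same bucket used elsewhere in the script (assuming the uploaded tarball keeps the scala-$SCALA_VERSION.tgz naming):

    # Sketch only: fetch Scala 2.11.8 from the spark-related-packages bucket
    # instead of downloads.lightbend.com.
    SCALA_VERSION="2.11.8"
    wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz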

if "." in spark_version:
parts = spark_version.split(".")
if parts[0].isdigit():
if parts[0].isdigit() and parts[0].isdigit():
Contributor:

redundant if check ? I guess this should be parts[1].isdigit() ?

sys.exit(1)
if hadoop_minor_version == "2.7" and spark_major_minor_version < 2.0:
print("Spark version: {v}, does not support Hadoop minor version: {hm}".
format(v=spark_version, hm=hadoop_minor_version, sv=",".join(VALID_HADOOP_MINOR_VERSIONS)), file=stderr)
Contributor:

the variable sv is not used in this error message or the next one ?

0.7.3)
case "$SPARK_VERSION" in
# 0.7.3 - 1.0.2
0\.[7-9]\.[0-3]|1\.0\.[0-2])
Contributor:

I'm not sure this will work as 0.8.0 and 0.9.0 have "incubating" in their package names. My take would be to keep the existing long form for these early versions.

if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop1.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop1.tgz
Contributor:

The version numbers here should not be hard-coded to 2.0.0? Also, it might be good to keep the future-proof solution we had of `spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz`?

So thinking more about this, I think the idea should be that we do the checking of available Spark / Hadoop version combinations in the Python file (it's easier to read / review / maintain than bash). Then the bash script just does the downloading / setup to handle corner cases like incubating etc. What do you think?
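Under that split, the bash side could shrink to something like the following sketch, with spark_ec2.py having already validated the Spark/Hadoop combination before launch (variable names as used elsewhere in this PR; the error handling is an assumption):

    # Sketch only: spark/init.sh just builds the URL and downloads; all
    # version-combination checks happen in spark_ec2.py.
    SPARK_PACKAGE="spark-$SPARK_VERSION-bin-hadoop$HADOOP_MINOR_VERSION.tgz"
    if ! wget "http://s3.amazonaws.com/spark-related-packages/$SPARK_PACKAGE"; then
      echo "ERROR: could not download $SPARK_PACKAGE"
      exit 1
    fi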

@jamborta (Author) commented Oct 24, 2016

@shivaram updated the code based on your comments. validate_spark_hadoop_version in Python should capture all the checks that are needed, but I still kept the same logic in spark/init.sh. I think some of the logic can be removed from init.sh if we don't want to duplicate the checks.

@nchammas (Contributor)
@shivaram - Sorry to keep butting in on this PR.

If we are going to start hosting Hadoop on S3, shouldn't we use the latest maintenance releases?

In other words, 2.7.3 instead of 2.7.0, 2.6.5 instead of 2.6.0, etc...

@shivaram (Contributor)
@nchammas That's a reasonable question - I uploaded 2.7.0 as @jamborta had used that in the PR. On the one hand, I don't mind starting off with 2.7.3 right now, as it's better to have the maintenance release. On the other hand, we might have to pick / live with, say, the latest maintenance version as of now, as I don't think we can keep updating this when, say, 2.7.4 comes out.

@jamborta Any thoughts on this ?

@jamborta (Author) commented Oct 25, 2016

I agree that we could put up the latest maintenance releases for 2.4, 2.6 and 2.7 as they are now, and stick with those.

@jamborta (Author)
@shivaram if you upload the latest releases I can update this PR.

@shivaram (Contributor)
@jamborta 2.7.3 and 2.6.5 are now in the same S3 bucket

@jamborta (Author)
@shivaram code updated
