prashan_pul #1 (Open)

wants to merge 3,444 commits into master

Conversation


@prashanC prashanC commented Oct 3, 2016

No description provided.

ankurdave and others added 30 commits January 13, 2014 14:54
Improving documentation and identifying a potential bug in the CC calculation.
`sbt/sbt doc` used to fail. This fixed it.
Updated JavaStreamingContext to make scaladoc compile.

`sbt/sbt doc` used to fail. This fixed it.
The bug was due to a misunderstanding of the activeSetOpt parameter to
Graph.mapReduceTriplets. Passing EdgeDirection.Both causes
mapReduceTriplets to run only on edges with *both* vertices in the
active set. This commit adds EdgeDirection.Either, which causes
mapReduceTriplets to run on edges with *either* vertex in the active
set. This is what connected components needed.
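For illustration, a minimal sketch of how a connected-components-style computation might pass the new direction (GraphX API of that era; `graph` is an assumed `Graph[Long, _]` holding component ids, `activeVertices` the vertices changed in the last round, and the message logic is illustrative):

```scala
import org.apache.spark.graphx._

val messages: VertexRDD[Long] = graph.mapReduceTriplets[Long](
  // Send the source's component id to the destination (illustrative only).
  triplet => Iterator((triplet.dstId, triplet.srcAttr)),
  (a, b) => math.min(a, b),
  // Either: run on edges with at least one endpoint in the active set.
  activeSetOpt = Some((activeVertices, EdgeDirection.Either))
)
```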
…ing serialization support for GraphImpl to address issues with failed closure capture.
…aphx

Conflicts:
	graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
Improved logic of finding new files in FileInputDStream

Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, then fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch time> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it.

The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). Then in the current interval, it ignores files that were seen in the previous interval as well as those with a mod time older than X. So if a new file gets reported by HDFS in the current interval but has a mod time in the previous interval, it will still be considered. However, files whose mod time is earlier than the previous interval (that is, earlier than X) will be ignored. This is the current limitation, and a future version should improve this behavior further.
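A rough sketch of that filtering predicate (illustrative names only, not the actual FileInputDStream internals):

```scala
// prevFiles: files found in the previous interval; minModTime: mod time of
// the oldest of them (the X above).
def isNewFile(path: String, modTime: Long,
              prevFiles: Set[String], minModTime: Long): Boolean = {
  if (prevFiles.contains(path)) false   // already picked up last interval
  else modTime >= minModTime            // older than X is ignored (the stated limitation)
}
```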

Also reduced line lengths in DStream to <=100 chars.
JoshRosen and others added 30 commits January 28, 2014 20:20
Switch from MUTF8 to UTF8 in PySpark serializers.

This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.
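For context, `DataOutputStream.writeUTF` emits modified UTF-8 with a 2-byte length prefix, which is where the 64 kB cap comes from. A hypothetical helper showing the alternative encoding (not the actual serializer code):

```scala
import java.io.DataOutputStream
import java.nio.charset.StandardCharsets

// Standard UTF-8 with a 4-byte length prefix: no 64 kB limit, unlike
// writeUTF's modified UTF-8 with its 2-byte (unsigned short) length.
def writeUTF8(s: String, out: DataOutputStream): Unit = {
  val bytes = s.getBytes(StandardCharsets.UTF_8)
  out.writeInt(bytes.length)
  out.write(bytes)
}
```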
Updated Spark Streaming Programming Guide

Here is the updated version of the Spark Streaming Programming Guide. This is still a work in progress, but the major changes are in place, so feedback is most welcome.

In general, I have tried to make the guide easier to understand even for a reader who does not know much about Spark. The updated website is hosted here:

http://www.eecs.berkeley.edu/~tdas/spark_docs/streaming-programming-guide.html

The major changes are:
- Overview illustrates the use cases of Spark Streaming - various input sources and various output sources
- An example right after the overview to quickly give an idea of what a Spark Streaming program looks like
- Made the Java API and examples first-class citizens like Scala by using tabs to show both Scala and Java examples (similar to the AMPCamp tutorial's code tabs)
- Highlighted the DStream operations updateStateByKey and transform because of their powerful nature
- Updated driver node failure recovery text to highlight automatic recovery in Spark standalone mode
- Added information about linking to and using external input sources like Kafka and Flume
- In general, reorganized the sections to better separate the Basic section from the more advanced sections like Tuning and Recovery

Todos:
- Links to the docs of external Kafka, Flume, etc.
- Illustrate the window operation with a figure as well as an example.

Author: Tathagata Das <[email protected]>

== Merge branch commits ==

commit 18ff105
Author: Tathagata Das <[email protected]>
Date:   Tue Jan 28 21:49:30 2014 -0800

    Fixed a lot of broken links.

commit 34a5a60
Author: Tathagata Das <[email protected]>
Date:   Tue Jan 28 18:02:28 2014 -0800

    Updated github url to use SPARK_GITHUB_URL variable.

commit f338a60
Author: Tathagata Das <[email protected]>
Date:   Mon Jan 27 22:42:42 2014 -0800

    More updates based on Patrick and Harvey's comments.

commit 89a81ff
Author: Tathagata Das <[email protected]>
Date:   Mon Jan 27 13:08:34 2014 -0800

    Updated docs based on Patrick's PR comments.

commit d5b6196
Author: Tathagata Das <[email protected]>
Date:   Sun Jan 26 20:15:58 2014 -0800

    Added spark.streaming.unpersist config and info on StreamingListener interface.

commit e3dcb46
Author: Tathagata Das <[email protected]>
Date:   Sun Jan 26 18:41:12 2014 -0800

    Fixed docs on StreamingContext.getOrCreate.

commit 6c29524
Author: Tathagata Das <[email protected]>
Date:   Thu Jan 23 18:49:39 2014 -0800

    Added example and figure for window operations, and links to Kafka and Flume API docs.

commit f06b964
Author: Tathagata Das <[email protected]>
Date:   Wed Jan 22 22:49:12 2014 -0800

    Fixed missing endhighlight tag in the MLlib guide.

commit 036a7d4
Merge: eab351d a1cd185
Author: Tathagata Das <[email protected]>
Date:   Wed Jan 22 22:17:42 2014 -0800

    Merge remote-tracking branch 'apache/master' into docs-update

commit eab351d
Author: Tathagata Das <[email protected]>
Date:   Wed Jan 22 22:17:15 2014 -0800

    Update Spark Streaming Programming Guide.
Issue with failed worker registrations

I've been going through the Spark source after having some odd issues with workers dying and not coming back. After some digging (I'm very new to Scala and Spark) I believe I've found a worker registration issue. It looks to me like a failed registration follows the same code path as a successful registration, which ends up with workers believing they are connected (since they received a `RegisteredWorker` event) even though they are not registered on the Master.

This is a quick fix that I hope addresses this issue (assuming I didn't completely misread the code and I'm about to look like a silly person :P)

I'm opening this pr now to start a chat with you guys while I do some more testing on my side :)
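From the commit messages below, the shape of the fix is roughly the following (a sketch with stand-in types, not the actual Master code):

```scala
// Stand-ins for the real messages; only the control flow matters here.
sealed trait RegistrationResponse
case class RegisteredWorker(masterUrl: String) extends RegistrationResponse
case class RegisterWorkerFailed(message: String) extends RegistrationResponse

def handleRegistration(workerAddress: String,
                       registered: scala.collection.mutable.Set[String],
                       masterUrl: String): RegistrationResponse = {
  if (registered.contains(workerAddress)) {
    // Fail loudly rather than acknowledging a registration that never happened.
    RegisterWorkerFailed("Attempted to re-register worker at same address: " + workerAddress)
  } else {
    registered += workerAddress   // persist only on success
    RegisteredWorker(masterUrl)   // acknowledge (the real code also runs schedule())
  }
}
```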

Author: Erik Selin <[email protected]>

== Merge branch commits ==

commit 973012f
Author: Erik Selin <[email protected]>
Date:   Tue Jan 28 23:36:12 2014 -0500

    break logwarning into two lines to respect line character limit.

commit e3754dc
Author: Erik Selin <[email protected]>
Date:   Tue Jan 28 21:16:21 2014 -0500

    add log warning when worker registration fails due to attempt to re-register on same address.

commit 14baca2
Author: Erik Selin <[email protected]>
Date:   Wed Jan 22 21:23:26 2014 -0500

    address code style comment

commit 71c0d7e
Author: Erik Selin <[email protected]>
Date:   Wed Jan 22 16:01:42 2014 -0500

    Make a failed registration not persist, not send a `RegisteredWorker` event and not run `schedule`, but rather send a `RegisterWorkerFailed` message to the worker attempting to register.
Added spark.shuffle.file.buffer.kb to configuration doc.

Author: Reynold Xin <[email protected]>

== Merge branch commits ==

commit 0eea1d7
Author: Reynold Xin <[email protected]>
Date:   Wed Jan 29 14:40:48 2014 -0800

    Added spark.shuffle.file.buffer.kb to configuration doc.
Add GraphX to assembly/pom.xml

Author: Ankur Dave <[email protected]>

== Merge branch commits ==

commit bb0b33e
Author: Ankur Dave <[email protected]>
Date:   Fri Jan 31 15:24:52 2014 -0800

    Add GraphX to assembly/pom.xml
Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency

Looks like there are some ⇒ Unicode characters (maybe from scalariform) in the Scala code.
This PR changes them to => to get some consistency in the Scala code.

If we wanted ⇒ as the default, we could use the sbt plugin scalariform to make sure all Scala code has ⇒ instead of =>.

Also removed unused imports found in TwitterInputDStream.scala while I was there =)

Author: Henry Saputra <[email protected]>

== Merge branch commits ==

commit 29c1771
Author: Henry Saputra <[email protected]>
Date:   Sat Feb 1 22:05:16 2014 -0800

    Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency.
Remove explicit conversion to PairRDDFunctions in cogroup()

As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however.
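For illustration, a sketch of the difference (Spark of that era; local mode just to make it runnable):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // the RDD-to-PairRDDFunctions implicit

val sc = new SparkContext("local", "cogroup-example")
val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
val b = sc.parallelize(Seq(("k1", 10)))

// With the implicit in scope, cogroup is available directly on the pair RDD,
// with no explicit `new PairRDDFunctions(...)` wrapping.
val grouped = a.cogroup(b)
```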

Author: Aaron Davidson <[email protected]>

== Merge branch commits ==

commit aa4a63f
Author: Aaron Davidson <[email protected]>
Date:   Sun Feb 2 23:48:04 2014 -0800

    Remove explicit conversion to PairRDDFunctions in cogroup()

    As SparkContext._ is already imported, using the implicit conversion
    appears to make the code much cleaner. Perhaps there was some sinister
    reason for doing the conversion explicitly, however.
Refactor RDD sampling and add randomSplit to RDD (update)

Replace SampledRDD with PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sampling with/without replacement can easily be integrated via BernoulliSampler and PoissonSampler. The benefits are:

1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513
2) Stratified sampling and importance sampling can be implemented in the same manner as well.

Unit tests are included for samplers and RDD.randomSplit.

This should perform better than my previous pull request, where the BernoulliSampler creates many Iterator instances:
https://github.com/apache/incubator-spark/pull/513
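A short usage sketch of the resulting API (`sc` is an assumed SparkContext):

```scala
val data = sc.parallelize(1 to 1000000)

// randomSplit: weights are normalized, and each split is sampled
// partition-wise from the same parent, so the splits are disjoint.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// sample: without replacement is Bernoulli per partition; with
// replacement is Poisson, as described above.
val sampled = data.sample(withReplacement = false, fraction = 0.1, seed = 42)
```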

Author: Xiangrui Meng <[email protected]>

== Merge branch commits ==

commit e8ce957
Author: Xiangrui Meng <[email protected]>
Date:   Mon Feb 3 12:21:08 2014 -0800

    more docs to PartitionwiseSampledRDD

commit fbb4586
Author: Xiangrui Meng <[email protected]>
Date:   Mon Feb 3 00:44:23 2014 -0800

    move XORShiftRandom to util.random and use it in BernoulliSampler

commit 987456b
Author: Xiangrui Meng <[email protected]>
Date:   Sat Feb 1 11:06:59 2014 -0800

    relax assertions in SortingSuite because the RangePartitioner has large variance in this case

commit 3690aae
Author: Xiangrui Meng <[email protected]>
Date:   Sat Feb 1 09:56:28 2014 -0800

    test split ratio of RDD.randomSplit

commit 8a410bc
Author: Xiangrui Meng <[email protected]>
Date:   Sat Feb 1 09:25:22 2014 -0800

    add a test to ensure seed distribution and minor style update

commit ce7e866
Author: Xiangrui Meng <[email protected]>
Date:   Fri Jan 31 18:06:22 2014 -0800

    minor style change

commit 750912b
Author: Xiangrui Meng <[email protected]>
Date:   Fri Jan 31 18:04:54 2014 -0800

    fix some long lines

commit c446a25
Author: Xiangrui Meng <[email protected]>
Date:   Fri Jan 31 17:59:59 2014 -0800

    add complement to BernoulliSampler and minor style changes

commit dbe2bc2
Author: Xiangrui Meng <[email protected]>
Date:   Fri Jan 31 17:45:08 2014 -0800

    switch to partition-wise sampling for better performance

commit a1fca52
Merge: ac712e4 cf6128f
Author: Xiangrui Meng <[email protected]>
Date:   Fri Jan 31 16:33:09 2014 -0800

    Merge branch 'sample' of github.com:mengxr/incubator-spark into sample

commit cf6128f
Author: Xiangrui Meng <[email protected]>
Date:   Sun Jan 26 14:40:07 2014 -0800

    set SampledRDD deprecated in 1.0

commit f430f84
Author: Xiangrui Meng <[email protected]>
Date:   Sun Jan 26 14:38:59 2014 -0800

    update code style

commit a8b5e20
Author: Xiangrui Meng <[email protected]>
Date:   Sun Jan 26 12:56:27 2014 -0800

    move package random to util.random

commit ab0fa2c
Author: Xiangrui Meng <[email protected]>
Date:   Sun Jan 26 12:50:35 2014 -0800

    add Apache headers and update code style

commit 985609f
Author: Xiangrui Meng <[email protected]>
Date:   Sun Jan 26 11:49:25 2014 -0800

    add new lines

commit b21bddf
Author: Xiangrui Meng <[email protected]>
Date:   Sun Jan 26 11:46:35 2014 -0800

    move samplers to random.IndependentRandomSampler and add tests

commit c02dacb
Author: Xiangrui Meng <[email protected]>
Date:   Sat Jan 25 15:20:24 2014 -0800

    add RandomSampler

commit 8ff7ba3
Author: Xiangrui Meng <[email protected]>
Date:   Fri Jan 24 13:23:22 2014 -0800

    init impl of IndependentlySampledRDD
Fixed typo in scaladoc

Author: Stevo Slavić <[email protected]>

== Merge branch commits ==

commit 0a77f78
Author: Stevo Slavić <[email protected]>
Date:   Tue Feb 4 15:30:27 2014 +0100

    Fixed typo in scaladoc
Fixed wrong path to compute-classpath.cmd

compute-classpath.cmd is in bin, not in sbin directory

Author: Stevo Slavić <[email protected]>

== Merge branch commits ==

commit 23deca3
Author: Stevo Slavić <[email protected]>
Date:   Tue Feb 4 15:01:47 2014 +0100

    Fixed wrong path to compute-classpath.cmd

    compute-classpath.cmd is in bin, not in sbin directory
Fix line end character stripping for Windows

LogQuery Spark example would produce unwanted results when run on the Windows platform because of different, platform-specific trailing line end characters (not only \n but \r too).

This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform-specific stuff.
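For instance, Scala's `stripLineEnd` does exactly this (a sketch of the idea, not the exact example code):

```scala
// stripLineEnd drops a trailing \n, \r\n, or \r, so the same code behaves
// identically on Unix and Windows input.
val line = "GET /index.html 200\r\n"   // as read on Windows
assert(line.stripLineEnd == "GET /index.html 200")
```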

Author: Stevo Slavić <[email protected]>

== Merge branch commits ==

commit 1e43ba0
Author: Stevo Slavić <[email protected]>
Date:   Wed Feb 5 14:48:29 2014 +0100

    Fix line end character stripping for Windows

    LogQuery Spark example would produce unwanted results when run on the Windows platform because of different, platform-specific trailing line end characters (not only \n but \r too).

    This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform-specific stuff.
…#544.

Fixed warnings in test compilation.

This commit fixes two problems: a redundant import, and a
deprecated function.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit da9d2e1
Author: Kay Ousterhout <[email protected]>
Date:   Wed Feb 5 11:41:51 2014 -0800

    Fixed warnings in test compilation.

    This commit fixes two problems: a redundant import, and a
    deprecated function.
remove actorToWorker in master.scala, which is actually not used

actorToWorker is actually not used in the code... just remove it

Author: CodingCat <[email protected]>

== Merge branch commits ==

commit 52656c2
Author: CodingCat <[email protected]>
Date:   Thu Feb 6 00:28:26 2014 -0500

    remove actorToWorker in master.scala, which is actually not used
…s #526.

spark on yarn - yarn-client mode doesn't always exit immediately

https://spark-project.atlassian.net/browse/SPARK-1049

If you run in yarn-client mode but you don't get all the workers you requested right away and then exit your application, the application master stays around until it gets the number of workers you initially requested. This is a waste of resources. The AM should exit immediately once the client goes away.

This fix simply checks whether the driver has closed while it's waiting for the initial number of workers.

Author: Thomas Graves <[email protected]>

== Merge branch commits ==

commit 03f40a6
Author: Thomas Graves <[email protected]>
Date:   Fri Jan 31 11:23:10 2014 -0600

    spark on yarn - yarn-client mode doesn't always exit immediately
Fix off-by-one error with task progress info log.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit 29798fc
Author: Kay Ousterhout <[email protected]>
Date:   Wed Feb 5 13:40:01 2014 -0800

    Fix off-by-one error with task progress info log.
Python api additions

Author: Prashant Sharma <[email protected]>

== Merge branch commits ==

commit 8b51591
Author: Prashant Sharma <[email protected]>
Date:   Fri Jan 24 11:50:29 2014 +0530

    Josh's and Patrick's review comments.

commit d37f967
Author: Prashant Sharma <[email protected]>
Date:   Thu Jan 23 17:27:17 2014 +0530

    fixed doc tests

commit 27cb54b
Author: Prashant Sharma <[email protected]>
Date:   Thu Jan 23 16:48:43 2014 +0530

    Added keys and values methods for PairFunctions in python

commit 4ce76b3
Author: Prashant Sharma <[email protected]>
Date:   Thu Jan 23 13:51:26 2014 +0530

    Added foreachPartition

commit 05f0534
Author: Prashant Sharma <[email protected]>
Date:   Thu Jan 23 13:02:59 2014 +0530

    Added coalesce function to Python API

commit 6568d2c
Author: Prashant Sharma <[email protected]>
Date:   Thu Jan 23 12:52:44 2014 +0530

    added repartition function to python API.
SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone.

Author: Sandy Ryza <[email protected]>

== Merge branch commits ==

commit 1f2443d
Author: Sandy Ryza <[email protected]>
Date:   Thu Feb 6 15:03:50 2014 -0800

    SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone
Inform DAG scheduler about all started/finished tasks.

Previously, the DAG scheduler was not always informed
when tasks started and finished. The simplest example here
is for speculated tasks: the DAGScheduler was only told about
the first attempt of a task, meaning that SparkListeners were
also not told about multiple task attempts, so users can't see
what's going on with speculation in the UI.  The DAGScheduler
also wasn't always told about finished tasks, so in the UI, some
tasks will never be shown as finished (this occurs, for example,
if a task set gets killed).

The other problem is that the fairness accounting was wrong
-- the number of running tasks in a pool was decreased when a
task set was considered done, even if all of its tasks hadn't
yet finished.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit c8d547d
Author: Kay Ousterhout <[email protected]>
Date:   Wed Jan 15 16:47:33 2014 -0800

    Addressed Reynold's review comments.

    Always use a TaskEndReason (remove the option), and explicitly
    signal when we don't know the reason. Also, always tell
    DAGScheduler (and associated listeners) about started tasks, even
    when they're speculated.

commit 3fee1e2
Author: Kay Ousterhout <[email protected]>
Date:   Wed Jan 8 22:58:13 2014 -0800

    Fixed broken test and improved logging

commit ff12fca
Author: Kay Ousterhout <[email protected]>
Date:   Sun Dec 29 21:08:20 2013 -0800

    Inform DAG scheduler about all finished tasks.

    Previously, the DAG scheduler was not always informed
    when tasks finished. For example, when a task set was
    aborted, the DAG scheduler was never told when the tasks
    in that task set finished. The DAG scheduler was also
    never told about the completion of speculated tasks.
    This led to confusion with SparkListeners because information
    about the completion of those tasks was never passed on to
    the listeners (so in the UI, for example, some tasks will never
    be shown as finished).

    The other problem is that the fairness accounting was wrong
    -- the number of running tasks in a pool was decreased when a
    task set was considered done, even if all of its tasks hadn't
    yet finished.
Only run ResubmitFailedStages event after a fetch fails

Previously, the ResubmitFailedStages event was called every
200 milliseconds, leading to a lot of unnecessary event processing
and clogged DAGScheduler logs.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit e603784
Author: Kay Ousterhout <[email protected]>
Date:   Wed Feb 5 11:34:41 2014 -0800

    Re-add check for empty set of failed stages

commit d258f0e
Author: Kay Ousterhout <[email protected]>
Date:   Wed Jan 15 23:35:41 2014 -0800

    Only run ResubmitFailedStages event after a fetch fails

    Previously, the ResubmitFailedStages event was called every
    200 milliseconds, leading to a lot of unnecessary event processing
    and clogged DAGScheduler logs.
External spilling - generalize batching logic

The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way.
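One way to picture that intermediate layer: bound the raw file stream at each batch's length so whatever is stacked on top can pre-fetch freely without crossing a batch boundary. A hedged sketch using Guava's `ByteStreams.limit` (the on-disk layout and names are assumptions, not the actual spill code):

```scala
import java.io.{DataInputStream, InputStream}
import com.google.common.io.ByteStreams

// Assumed layout: each serialized batch is preceded by its byte length.
// The returned stream reports EOF at the batch boundary, so compression
// and deserialization streams built on it cannot read into the next batch.
def nextBatchStream(file: DataInputStream): InputStream = {
  val batchLength = file.readLong()
  ByteStreams.limit(file, batchLength)
}
```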

Author: Andrew Or <[email protected]>

== Merge branch commits ==

commit 3ddeb7e
Author: Andrew Or <[email protected]>
Date:   Wed Feb 5 12:09:32 2014 -0800

    Also privatize fields

commit 090544a
Author: Andrew Or <[email protected]>
Date:   Wed Feb 5 10:58:23 2014 -0800

    Privatize methods

commit 13920c9
Author: Andrew Or <[email protected]>
Date:   Tue Feb 4 16:34:15 2014 -0800

    Update docs

commit bd5a1d7
Author: Andrew Or <[email protected]>
Date:   Tue Feb 4 13:44:24 2014 -0800

    Typo: phyiscal -> physical

commit 287ef44
Author: Andrew Or <[email protected]>
Date:   Tue Feb 4 13:38:32 2014 -0800

    Avoid reading the entire batch into memory; also simplify streaming logic

    Additionally, address formatting comments.

commit 3df7005
Merge: a531d2e 164489d
Author: Andrew Or <[email protected]>
Date:   Mon Feb 3 18:27:49 2014 -0800

    Merge branch 'master' of github.com:andrewor14/incubator-spark

commit a531d2e
Author: Andrew Or <[email protected]>
Date:   Mon Feb 3 18:18:04 2014 -0800

    Relax assumptions on compressors and serializers when batching

    This commit introduces an intermediate layer of an input stream on the batch level.
    This guards against interference from higher level streams (i.e. compression and
    deserialization streams), especially pre-fetching, without specifically targeting
    particular libraries (Kryo) and forcing shuffle spill compression to use LZF.

commit 164489d
Author: Andrew Or <[email protected]>
Date:   Mon Feb 3 18:18:04 2014 -0800

    Relax assumptions on compressors and serializers when batching

    This commit introduces an intermediate layer of an input stream on the batch level.
    This guards against interference from higher level streams (i.e. compression and
    deserialization streams), especially pre-fetching, without specifically targeting
    particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
SPARK-1062 Add rdd.intersection(otherRdd) method
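Usage is straightforward; note the shuffle the merge commits below mention (`sc` is an assumed SparkContext):

```scala
// intersection() returns the distinct elements present in both RDDs and
// shuffles both inputs to co-locate equal elements.
val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5, 6))
val common = a.intersection(b)   // contains 3 and 4
```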

Author: Andrew Ash <[email protected]>

== Merge branch commits ==

commit 5d9982b
Author: Andrew Ash <[email protected]>
Date:   Thu Feb 6 18:11:45 2014 -0800

    Minor fixes

    - style: (v,null) => (v, null)
    - mention the shuffle in Javadoc

commit b86d02f
Author: Andrew Ash <[email protected]>
Date:   Sun Feb 2 13:17:40 2014 -0800

    Overload .intersection() for numPartitions and custom Partitioner

commit bcaa349
Author: Andrew Ash <[email protected]>
Date:   Sun Feb 2 13:05:40 2014 -0800

    Better naming of parameters in intersection's filter

commit b10a6af
Author: Andrew Ash <[email protected]>
Date:   Sat Jan 25 23:06:26 2014 -0800

    Follow spark code format conventions of tab => 2 spaces

commit 965256e
Author: Andrew Ash <[email protected]>
Date:   Fri Jan 24 00:28:01 2014 -0800

    Add rdd.intersection(otherRdd) method
TeX formulas in the documentation

using MathJax, and splitting the MLlib documentation by techniques

see jira
https://spark-project.atlassian.net/browse/MLLIB-19
and
https://github.com/shivaram/spark/compare/mathjax

Author: Martin Jaggi <[email protected]>

== Merge branch commits ==

commit 0364bfa
Author: Martin Jaggi <[email protected]>
Date:   Fri Feb 7 03:19:38 2014 +0100

    minor polishing, as suggested by @pwendell

commit dcd2142
Author: Martin Jaggi <[email protected]>
Date:   Thu Feb 6 18:04:26 2014 +0100

    enabling inline latex formulas with $.$

    same mathjax configuration as used in math.stackexchange.com

    sample usage in the linear algebra (SVD) documentation

commit bbafafd
Author: Martin Jaggi <[email protected]>
Date:   Thu Feb 6 17:31:29 2014 +0100

    split MLlib documentation by techniques

    and linked from the main mllib-guide.md site

commit d1c5212
Author: Martin Jaggi <[email protected]>
Date:   Thu Feb 6 16:59:43 2014 +0100

    enable mathjax formula in the .md documentation files

    code by @shivaram

commit d73948d
Author: Martin Jaggi <[email protected]>
Date:   Thu Feb 6 16:57:23 2014 +0100

    minor update on how to compile the documentation
Make sbt download an atomic operation

Modifies the `sbt/sbt` script to gracefully recover when a previous invocation died in the middle of downloading the SBT jar.

Author: Jey Kottalam <[email protected]>

== Merge branch commits ==

commit 6c600eb
Author: Jey Kottalam <[email protected]>
Date:   Fri Jan 17 10:43:54 2014 -0800

    Make sbt download an atomic operation
Kill drivers in postStop() for Worker.

JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068

Author: Qiuzhuang Lian <[email protected]>

== Merge branch commits ==

commit 9c19ce6
Author: Qiuzhuang Lian <[email protected]>
Date:   Sat Feb 8 16:07:39 2014 +0800

    Kill drivers in postStop() for Worker.
     JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068
Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <[email protected]>

== Merge branch commits ==

commit 1b00a8a
Author: Mark Hamstra <[email protected]>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT
SPARK-1066: Add developer scripts to repository.

These are some developer scripts I've been maintaining in a separate public repo. This patch adds them to the Spark repository so they can evolve here and are clearly accessible to all committers.

I may do some small additional clean-up in this PR, but wanted to put them here in case others want to review. There are a few types of scripts here:

1. A tool to merge pull requests.
2. A script for packaging releases.
3. A script for auditing release candidates.

Author: Patrick Wendell <[email protected]>

== Merge branch commits ==

commit 5d5d331
Author: Patrick Wendell <[email protected]>
Date:   Sat Feb 8 22:11:47 2014 -0800

    SPARK-1066: Add developer scripts to repository.
[WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j

To fix this - we add a check when initializing log4j.
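Presumably the check is whether log4j already has appenders configured on the root logger. A hedged sketch of that idea (log4j 1.x API; the resource name is a guess, not confirmed from the patch):

```scala
import org.apache.log4j.{LogManager, PropertyConfigurator}

// If the root logger has no appenders, assume the user never configured
// log4j and install Spark's bundled defaults; otherwise leave theirs alone.
val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
if (!log4jInitialized) {
  val defaults = getClass.getClassLoader
    .getResource("org/apache/spark/log4j-defaults.properties")  // assumed path
  PropertyConfigurator.configure(defaults)
}
```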

Author: Patrick Wendell <[email protected]>

== Merge branch commits ==

commit ffdce51
Author: Patrick Wendell <[email protected]>
Date:   Fri Feb 7 15:22:29 2014 -0800

    Logging fix
Added example Python code for sort

I added example Python code for sort. Right now, PySpark has limited examples for new people wanting to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code.

Author: jyotiska <[email protected]>

== Merge branch commits ==

commit 8ad8faf
Author: jyotiska <[email protected]>
Date:   Sun Feb 9 11:00:41 2014 +0530

    Added comments in code on collect() method

commit 6f98f1e
Author: jyotiska <[email protected]>
Date:   Sat Feb 8 13:12:37 2014 +0530

    Updated python example code sort.py

commit 945e39a
Author: jyotiska <[email protected]>
Date:   Sat Feb 8 12:59:09 2014 +0530

    Added example python code for sort
[SPARK-1060] startJettyServer should explicitly use IP information

https://spark-project.atlassian.net/browse/SPARK-1060

In the current implementation, the webserver in Master/Worker is started with

val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers)

inside startJettyServer:

val server = new Server(currentPort) // here, the Server takes "0.0.0.0" as the hostname, i.e. it will always bind to the IP address of the first NIC

this can cause the wrong IP binding, e.g. if the host has two NICs, N1 and N2, and the user specifies SPARK_LOCAL_IP as N2's IP address, then when starting the web server, for the reason stated above, it will always bind to N1's address
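The fix, then, is to bind Jetty to an explicit address instead of the port-only constructor. A minimal sketch against the Jetty API (`hostName` and `port` are assumed values, and this is not the actual JettyUtils code):

```scala
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

// Binding to an InetSocketAddress pins the listener to the intended NIC
// instead of whatever the wildcard/port-only constructor resolves to.
val hostName = "192.168.1.2"   // e.g. the value of SPARK_LOCAL_IP
val port = 8080
val server = new Server(new InetSocketAddress(hostName, port))
server.start()
```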

Author: CodingCat <[email protected]>

== Merge branch commits ==

commit 6c6d9a8
Author: CodingCat <[email protected]>
Date:   Thu Feb 6 14:53:34 2014 -0500

    startJettyServer should explicitly use IP information