The df.join(df2, JoinType("inner"), colInt("myCol")) notation is too verbose for me.
We have been thinking lately about joins in Doric and their equivalents in Spark. In Doric we have changed the signature of the join methods so that the join type is more explicit:
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
def join(df2: Dataset[_], joinType: String, col: DoricColumn[_], cols: DoricColumn[_]*): DataFrame
Join type
The first improvement I would like to discuss is the use of a JoinType object. This way the developer would still get an error at runtime (just as Spark does), but all the errors would be reported together (the joinType error plus the Doric column errors).
This new method would live among the others:
def join(df2: Dataset[_], joinType: String, col: DoricColumn[_], cols: DoricColumn[_]*): DataFrame
df.join(df2, "inner", colInt("myCol"))
def join(df2: Dataset[_], joinType: JoinType, col: DoricColumn[_], cols: DoricColumn[_]*): DataFrame
df.join(df2, JoinType("inner"), colInt("myCol"))
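To make the error-accumulation idea concrete, here is a minimal, self-contained sketch. It is not the actual Doric implementation: the names JoinType, joinType, and combine are assumed for illustration, and plain Either stands in for whatever validation type Doric uses internally. The point is that a bad join-type string becomes a value-level error that can be merged with column errors instead of failing on its own.

```scala
// Hypothetical sketch, NOT the real Doric API: validate the join-type
// string and accumulate its error together with column errors.
object JoinTypeSketch {
  final case class JoinType(name: String)

  private val valid =
    Set("inner", "cross", "outer", "full", "left", "right", "semi", "anti")

  // Either an error list or the validated join type, so it composes
  // with other validations instead of throwing immediately.
  def joinType(name: String): Either[List[String], JoinType] =
    if (valid.contains(name.toLowerCase)) Right(JoinType(name.toLowerCase))
    else Left(List(s"Unsupported join type: $name"))

  // Error accumulation: if both sides fail, report both error lists,
  // so the user sees every problem in a single run.
  def combine[A, B](
      a: Either[List[String], A],
      b: Either[List[String], B]
  ): Either[List[String], (A, B)] =
    (a, b) match {
      case (Right(x), Right(y)) => Right((x, y))
      case (Left(e1), Left(e2)) => Left(e1 ++ e2)
      case (Left(e), _)         => Left(e)
      case (_, Left(e))         => Left(e)
    }
}
```

With this shape, df.join(df2, JoinType("iner"), colInt("myCol")) could report both the misspelled join type and any column errors in one pass, rather than one at a time.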
Join type enforcement
The idea of using a JoinType object is to avoid confusion between Doric and Spark usage and to get early errors, but maybe some other solution could simplify this. Today I thought about a join object that would expose the most commonly used join methods, so the type of join would be easier to see and a join-type error could never occur. It would look like this:
df.join.inner(df2, colLong(id))
df.join.leftAnti(df2, colLong(id))
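A rough sketch of how such a builder could be wired up, using stand-in types instead of Spark's Dataset and Doric's DoricColumn (all names here are assumptions, not the real Doric API). Each join type becomes a method, so an invalid join-type string is unrepresentable at the call site:

```scala
// Hypothetical sketch, NOT the real Doric API: a `join` builder exposing
// one method per join type, reached via an implicit extension on the
// dataset type. Stand-in case classes replace Spark/Doric types, and the
// "join" just renders a descriptive string so the sketch stays runnable.
object JoinBuilderSketch {
  final case class Dataset(name: String)
  final case class DoricColumn(name: String)

  final class JoinBuilder(left: Dataset) {
    private def join(right: Dataset, tpe: String, cols: Seq[DoricColumn]): String =
      s"${left.name} $tpe JOIN ${right.name} ON ${cols.map(_.name).mkString(", ")}"

    // One explicit method per join type: no join-type string to get wrong.
    def inner(right: Dataset, cols: DoricColumn*): String    = join(right, "INNER", cols)
    def leftAnti(right: Dataset, cols: DoricColumn*): String = join(right, "LEFT ANTI", cols)
  }

  // Extension method so `df.join.inner(...)` reads as in the proposal.
  implicit class DatasetOps(private val ds: Dataset) {
    def join: JoinBuilder = new JoinBuilder(ds)
  }
}
```

A design trade-off worth noting: this fixes the set of join types at compile time, so any join type Spark adds later needs a new method, whereas the String and JoinType variants pass through unchanged.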
EDIT
Let's vote on the different options using examples:
df.join(df2, "inner", colInt("myCol"))
df.join(df2, JoinType("inner"), colInt("myCol"))
df.join.inner(df2, colLong(id))
df.innerJoin(df2, colLong(id))