Support read unstructured excel file #901
301f53c to 2b3361f
Oh, it's a bit different. When withDefaultHeader is set to true, the row specified by skipRows is not used as the header. If skipRows is set, data is read starting from that row, with automatically generated headers. However, as @Jolanrensen pointed out, I hadn't considered that particular case. I also agree with changing the argument to firstRowIsHeader = true.
"There are no defined cells on header row number ${skipRows + 1} (1-based index). Pass `columns` argument to specify what columns to read or make sure the index is correct"
else -> {
    val notEmptyRow = sheet.rowIterator().asSequence().maxByOrNull { it.lastCellNum }
I might name this largestRow or something like that. Good solution though :) didn't know about lastCellNum
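For context, Apache POI's Row.getLastCellNum() returns the index of the last cell plus one (or -1 for a row with no cells), so the row that maximizes it is the widest row on the sheet. A minimal sketch of the same selection, using plain lists as a stand-in for POI rows; the `largestRow` helper is hypothetical and only borrows the name suggested above:

```kotlin
// Stand-in for a sheet: each inner list is one row; shorter rows simply have fewer cells.
// Hypothetical helper, not part of DataFrame or Apache POI.
fun largestRow(rows: List<List<String?>>): List<String?>? =
    // With POI rows, the equivalent selector would be { it.lastCellNum }.
    rows.maxByOrNull { it.size }

fun main() {
    val sheet = listOf(
        listOf("Report 2024"),          // a title row with a single cell
        listOf("id", "name", "price"),  // the widest row
        listOf("1", "apple"),
    )
    println(largestRow(sheet)?.size) // prints 3
}
```

Like the snippet under review, this returns null for a sheet with no rows, so the caller still has to handle that case.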
@@ -103,11 +107,21 @@ public fun DataFrame.Companion.readExcel(
    stringColumns: StringColumns? = null,
    rowsCount: Int? = null,
    nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
    firstRowIsHeader: Boolean = true,
I recently learned about bytecode compatibility, which will break when adding this extra argument (https://kotlinlang.org/docs/api-guidelines-backward-compatibility.html#avoid-adding-arguments-to-existing-api-functions). This means that when you use a library that depends on DataFrame 0.14 while your own project depends on DataFrame 0.15 and that library calls readExcel(), you'll get a NoSuchMethodError. We're adding a detector plugin for these cases.
We can solve this, however, by adding some extra DeprecationLevel.HIDDEN overloads of the old version of these functions. I could do this for you if you like :) The rest looks good, so afterwards we could merge it.
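A minimal sketch of the hidden-overload pattern described here, following the linked Kotlin guideline. `readTable` is a hypothetical stand-in with a made-up, shortened parameter list; it is not the real readExcel signature:

```kotlin
// New version of a hypothetical library function: it gains the extra
// `firstRowIsHeader` parameter, which changes its JVM signature.
fun readTable(path: String, rowsCount: Int? = null, firstRowIsHeader: Boolean = true): String =
    "path=$path rowsCount=$rowsCount firstRowIsHeader=$firstRowIsHeader"

// Old signature, kept only for binary compatibility. HIDDEN removes it from
// source-level overload resolution, but the method still exists in the
// bytecode, so libraries compiled against the old version keep linking.
@Deprecated("Kept for binary compatibility", level = DeprecationLevel.HIDDEN)
fun readTable(path: String, rowsCount: Int? = null): String =
    readTable(path, rowsCount, firstRowIsHeader = true)

fun main() {
    // Resolves to the new overload, because the hidden one is invisible to source callers.
    println(readTable("data.xlsx")) // prints path=data.xlsx rowsCount=null firstRowIsHeader=true
}
```

The hidden overload keeps the old default arguments so that already-compiled callers that relied on them can still find the method they were linked against.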
Thank you so much for the review and for accepting my answer!
I truly appreciate it.
e53634e to b6fe681
The current DataFrame.readExcel function only supports structured Excel files.
Ideally everyone would create structured Excel files, but as shown in the image, with unstructured files the current approach of always treating the first row as the header makes DataFrame difficult to use.
The Python pandas library supports this case, which makes it very convenient to use there.
To support unstructured Excel files, I've added a parameter called withDefaultHeader.
When set to true, headers are generated automatically using NameRepairStrategy, which makes it possible to read unstructured Excel files.
When withDefaultHeader is set to true, it operates as [NameRepairStrategy.MAKE_UNIQUE].
Is there a better approach?
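To make the proposed behavior concrete, here is a rough sketch with rows modeled as plain lists. Everything in it is an assumption for illustration: `columnLabel` and `toTable` are hypothetical helpers, and naming the generated headers after Excel column letters is one possible scheme, not necessarily what this PR does.

```kotlin
// Hypothetical helper: converts a 0-based column index to an Excel-style
// column label (0 -> "A", 25 -> "Z", 26 -> "AA").
fun columnLabel(index: Int): String {
    require(index >= 0) { "index must be non-negative" }
    var n = index
    val sb = StringBuilder()
    while (n >= 0) {
        sb.append('A' + n % 26)
        n = n / 26 - 1
    }
    return sb.reverse().toString()
}

// Hypothetical sketch of the two modes. Returns (header, data rows).
fun toTable(
    rows: List<List<String?>>,
    skipRows: Int = 0,
    firstRowIsHeader: Boolean = true,
): Pair<List<String>, List<List<String?>>> {
    val remaining = rows.drop(skipRows)
    if (firstRowIsHeader) {
        // The first remaining row supplies the names; blank cells fall back to a generated label.
        val header = remaining.firstOrNull().orEmpty().mapIndexed { i, cell -> cell ?: columnLabel(i) }
        return header to remaining.drop(1)
    }
    // No header row: generate one label per column of the widest row and keep every row as data.
    val width = remaining.maxOfOrNull { it.size } ?: 0
    return List(width) { columnLabel(it) } to remaining
}

fun main() {
    val rows = listOf(
        listOf("Quarterly report"),
        listOf("apple", "3", "1.5"),
        listOf("pear", "7"),
    )
    val (header, data) = toTable(rows, skipRows = 1, firstRowIsHeader = false)
    println(header)    // prints [A, B, C]
    println(data.size) // prints 2
}
```

Generated labels of this kind are unique by construction, which is in the spirit of the MAKE_UNIQUE behavior mentioned above.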