Support read unstructured excel file #901

khm0651 · 2024-10-01T12:17:19Z

The current DataFrame.readExcel function only supports structured Excel files.

Ideally everyone would create Excel files in a structured format. But with unstructured files like the one shown in the image, the current DataFrame behavior of always treating the first row as the header makes them hard to work with.

The Python pandas library supports this, making it very efficient to use.

As a result, I've implemented support for unstructured Excel formats through a parameter called withDefaultHeader.

When set to true, it automatically generates headers using NameRepairStrategy, which enables support for unstructured Excel formats.

When withDefaultHeader is set to true, it operates as [NameRepairStrategy.MAKE_UNIQUE].

Is there a better approach?

Jolanrensen · 2024-10-01T13:40:20Z

Thanks for your contribution!

Some thoughts:

So if I'm correct, withDefaultHeader = true will make the columns be named according to Excel columns, like "A", "B", "C" etc.? It's not really clear to me from your description of the argument that this will happen.

I might change the argument to firstRowIsHeader = true, and document that when it is explicitly set to false, DF will fall back to Excel letter column names; otherwise it will take the first row (after skipRows) as the header.

Also, am I correct that in the current implementation, column J will not be included in the result?
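
For illustration, a call using the proposed argument might look roughly like the sketch below. The file name `report.xlsx` and the `skipRows = 2` value are hypothetical; `firstRowIsHeader` is the parameter name suggested above, and the snippet assumes the `dataframe-excel` module is on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readExcel
import java.io.File

fun main() {
    // Hypothetical unstructured sheet: two rows of notes, then data with no header row.
    val df = DataFrame.readExcel(
        File("report.xlsx"), // assumed path, not from the PR
        skipRows = 2,
        firstRowIsHeader = false, // fall back to Excel letter column names
    )
    // Print the generated column names to inspect the fallback.
    println(df.columnNames())
}
```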

khm0651 · 2024-10-01T14:35:53Z

Oh, it's a bit different. When withDefaultHeader is set to true, the row that remains after skipRows is not used as the header. If skipRows is set, data is read starting from that row, with automatically generated headers.
The headers are generated from the Excel column letters: "A", "B", "C", and so on.

However, I hadn't considered the case @Jolanrensen pointed out, so it needs to be handled as well.
To handle it, the maximum column count across the rows should be calculated and used.

I also agree with changing the argument to firstRowIsHeader = true.
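
The "A", "B", "C" naming discussed above can be sketched in plain Kotlin. This is only an illustration of the naming scheme, not necessarily how DataFrame generates the names; `excelColumnName` and `defaultHeaders` are hypothetical helpers introduced here.

```kotlin
// Hypothetical helper: converts a zero-based column index to an Excel-style
// column name, e.g. 0 -> "A", 25 -> "Z", 26 -> "AA".
fun excelColumnName(index: Int): String {
    require(index >= 0) { "Column index must be non-negative, got $index" }
    val name = StringBuilder()
    var n = index
    while (n >= 0) {
        name.append('A' + n % 26) // least significant letter first
        n = n / 26 - 1            // bijective base-26: there is no "zero" digit
    }
    return name.reverse().toString()
}

// Hypothetical helper: default header names for a sheet with `width` columns.
fun defaultHeaders(width: Int): List<String> = (0 until width).map { excelColumnName(it) }
```

With this scheme, a sheet that is ten columns wide gets the headers "A" through "J", which is why the column count has to cover the widest row.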

Jolanrensen · 2024-10-30T10:59:15Z

dataframe-excel/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/xlsx.kt

-            "There are no defined cells on header row number ${skipRows + 1} (1-based index). Pass `columns` argument to specify what columns to read or make sure the index is correct"
+
+        else -> {
+            val notEmptyRow = sheet.rowIterator().asSequence().maxByOrNull { it.lastCellNum }


I might name this largestRow or something like that. Good solution though :) I didn't know about lastCellNum.
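
To show why the widest row matters, here is a small dependency-free sketch. Rows are modelled as plain lists rather than POI `Row` objects, and `lastCellNum` and `columnCount` are stand-ins written for this example, so it only approximates the behavior of the snippet above.

```kotlin
// Stand-in for POI's Row.lastCellNum (index of the last cell plus one).
// Assumption: POI returns -1 for a row with no cells; this stand-in returns 0.
fun lastCellNum(row: List<String?>): Int = row.indexOfLast { it != null } + 1

// Number of columns to generate headers for: the width of the widest row,
// mirroring the maxByOrNull { it.lastCellNum } idea from the diff.
fun columnCount(rows: List<List<String?>>): Int =
    rows.maxOfOrNull { lastCellNum(it) } ?: 0
```

If only the first row were used to size the header, a sheet whose first row has one cell and whose second row has four would lose three columns; taking the maximum avoids that.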

Jolanrensen · 2024-10-30T11:08:37Z

dataframe-excel/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/xlsx.kt

@@ -103,11 +107,21 @@ public fun DataFrame.Companion.readExcel(
    stringColumns: StringColumns? = null,
    rowsCount: Int? = null,
    nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
+    firstRowIsHeader: Boolean = true,


I recently learned about bytecode compatibility, which will break when adding this extra argument (https://kotlinlang.org/docs/api-guidelines-backward-compatibility.html#avoid-adding-arguments-to-existing-api-functions). This means that when you use a library that depends on DataFrame 0.14 while your own project depends on DataFrame 0.15, and that library calls readExcel(), you'll get a NoSuchMethodError. We're adding a detector plugin for these cases.
We can solve this, however, by adding some extra DeprecationLevel.HIDDEN overloads with the old signatures of these functions. I could do this for you if you like :) The rest looks good, so afterwards we could merge it.
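
The HIDDEN-overload pattern can be sketched as follows. `readTable` is a hypothetical function used purely for illustration; it is not the actual readExcel signature or the code that was merged.

```kotlin
// New signature: firstRowIsHeader was added as a trailing parameter.
fun readTable(path: String, skipRows: Int = 0, firstRowIsHeader: Boolean = true): String =
    "path=$path, skipRows=$skipRows, firstRowIsHeader=$firstRowIsHeader"

// Old signature, kept so that already-compiled callers still link.
// HIDDEN removes it from source-level overload resolution, so new code
// always resolves to the function above.
@Deprecated("Kept for binary compatibility", level = DeprecationLevel.HIDDEN)
fun readTable(path: String, skipRows: Int = 0): String =
    readTable(path, skipRows, firstRowIsHeader = true)
```

Code compiled against the old two-parameter version keeps finding its method at runtime, while a source call such as `readTable("a.xlsx")` resolves to the new overload without ambiguity.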

khm0651 · 2024-11-01

Thank you so much for the review and for accepting my answer!
I truly appreciate it.

Support read unstructured excel file

2b3361f

khm0651 force-pushed the support_unstructured_excel_file branch from 301f53c to 2b3361f Compare October 1, 2024 12:25

lint fix

0d08506

Jolanrensen self-requested a review October 1, 2024 12:49

hare added 2 commits October 2, 2024 00:45

Clearly modify kdoc and argument name

94641c9

add edge case

b289573

Jolanrensen requested changes Oct 30, 2024

Jolanrensen added the enhancement New feature or request label Oct 30, 2024

Jolanrensen added this to the 0.15.0 milestone Oct 30, 2024

Jolanrensen added 2 commits October 31, 2024 15:05

Merge branch 'refs/heads/master' into fork/khm0651/support_unstructur…

eb85027

…ed_excel_file

solved binary compatibility issues and renamed notEmptyRow to largestRow

b6fe681

Jolanrensen force-pushed the support_unstructured_excel_file branch from e53634e to b6fe681 Compare October 31, 2024 14:22

Jolanrensen merged commit 77b2f56 into Kotlin:master Oct 31, 2024
3 of 4 checks passed
