
assertDataFrameApproximateEquals params show when failing #404

Closed
eruizalo opened this issue Jan 14, 2024 · 5 comments · Fixed by #405

Comments

@eruizalo
Contributor

Hello @holdenk, I've noticed that since commit 6f812cc the way the assertDataFrameApproximateEquals method displays DataFrames that are not approximately equal has changed. The new approach calls show with the default parameters (truncating the text and limiting the output to 20 rows).

I would like to be able to parameterize this "show" in some way, so that at least the truncate and numRows parameters can be passed to assertDataFrameApproximateEquals. Another option could be a DataFrame => Unit lambda parameter (whose default keeps the current behaviour) so that we can customize it per project. That way we could even convert the DataFrame into JSON when we have complex structures and properly see which fields are not equal.
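A minimal sketch of the default-lambda idea in plain Scala (the object name, method name, and the simplified Seq[Double] signature are all illustrative stand-ins, not the actual spark-testing-base API, which works on DataFrames): the failure reporter defaults to something like the current show behaviour but can be overridden per call.

```scala
object CustomShowSketch {
  // Default failure reporter: analogous to df.show(), which prints at most
  // the first 20 rows with truncated columns.
  def defaultShow(mismatches: Seq[(Double, Double)]): Unit =
    mismatches.take(20).foreach { case (e, a) => println(s"expected=$e actual=$a") }

  // Compare element-wise within a tolerance; on failure, hand the mismatching
  // pairs to a caller-supplied reporter (the default keeps current behaviour).
  def assertApproxEquals(
      expected: Seq[Double],
      actual: Seq[Double],
      tol: Double,
      onMismatch: Seq[(Double, Double)] => Unit = defaultShow
  ): Boolean = {
    val mismatches = expected.zip(actual).filter { case (e, a) => math.abs(e - a) > tol }
    if (mismatches.nonEmpty) onMismatch(mismatches)
    mismatches.isEmpty
  }
}
```

A caller could then pass e.g. `onMismatch = ms => println(ms.mkString("\n"))`, the same way the real method could accept `df => df.show(50, false)`.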

I've been working on this and some other contributions (though I'm not sure whether you want a separate issue for each feature):

  • Timestamp tolerance (using java.time.Duration), so a different tolerance can be used for decimals and timestamps in the same DataFrame
  • Struct approx equality (recursive calls): I think that if you use a tolerance on a field inside a struct, the comparison currently fails even when the value is within the tolerance
  • BigDecimal tolerance: Compare Java and Scala BigDecimal with tolerance #213 was never merged and the PR may now have conflicts, so I took the liberty of including it
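For the BigDecimal point, a tolerant comparison across Java and Scala BigDecimal could look like the following sketch (object and method names are hypothetical, not the API proposed in #213); converting to Scala's BigDecimal keeps the arithmetic exact rather than going through Double:

```scala
import java.math.{BigDecimal => JBigDecimal}

object BigDecimalToleranceSketch {
  // A Java and a Scala BigDecimal are approximately equal when their
  // absolute difference does not exceed the tolerance.
  def approxEquals(a: JBigDecimal, b: BigDecimal, tol: BigDecimal): Boolean =
    (BigDecimal(a) - b).abs <= tol
}
```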

What do you think?

@holdenk
Owner

holdenk commented Jan 15, 2024

Using show sounds good, and a custom lambda for comparison sounds reasonable, provided we keep the current default comparison.

I'm not so sure about JSON-based comparison; that sounds slow/brittle to me, but I could be wrong.

eruizalo added a commit to eruizalo/spark-testing-base that referenced this issue Jan 17, 2024
@eruizalo
Contributor Author

eruizalo commented Jan 17, 2024

Hi @holdenk,

I'm not sure I explained myself properly, and since I had the features almost finished, I think it's better to just show them to you.

I tried to split this issue into 4 commits, each containing a single possible feature, so that if you don't like any of them it can be removed or updated individually. I can also create separate issues/PRs if you prefer that:

  • Timestamp tolerance - cb0b0ac: here I separate the timestamp tolerance from the decimal tolerance, so both tolerances can be tested in the same row (decimal tolerance is usually 0 < tol < 1, but timestamp tolerance is usually > 1000)
  • BigDecimal tolerance - 88b173d: copy/paste from Compare Java and Scala BigDecimal with tolerance #213
  • Enable struct approx equality - 3ebf545: I think the comparison currently fails when a decimal or a timestamp inside a struct should pass the validation due to the tolerance
  • Custom show when failing - 5e77125: currently, if one DataFrame differs from another, the assertion simply calls show() without any parameters, so wide columns are truncated. With a custom show function we could do df.show(50, false) or df.toJSON.show(false) (this is what I meant about JSON/big structures). It then becomes the developer's decision how to display the differences and how many rows to include (the default behaviour remains)
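The timestamp-tolerance idea can be sketched in isolation (names are illustrative; the real change would live inside the DataFrame row comparison): a java.time.Duration expresses the tolerance naturally, instead of overloading the numeric tolerance used for decimals.

```scala
import java.sql.Timestamp
import java.time.Duration

object TimestampToleranceSketch {
  // Two timestamps are approximately equal when they differ by no more than
  // the given Duration; this is independent of any numeric tolerance applied
  // to decimal columns in the same row.
  def approxEquals(a: Timestamp, b: Timestamp, tol: Duration): Boolean =
    math.abs(a.getTime - b.getTime) <= tol.toMillis
}
```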

Please tell me what you think about these features, and whether I should separate them into new issues/PRs or they require any changes.

@holdenk
Owner

holdenk commented Jan 18, 2024

Ah, that makes more sense; sorry for misunderstanding part of the first issue. This sounds reasonable. Do you have a PR up that I can review?

eruizalo added a commit to eruizalo/spark-testing-base that referenced this issue Jan 18, 2024
@eruizalo
Contributor Author

Hi @holdenk, I just created #405. Let me know your thoughts.

@holdenk
Owner

holdenk commented Jan 18, 2024

Fantastic, I'll try to take a look either tomorrow or early next week :)

eruizalo added a commit to eruizalo/spark-testing-base that referenced this issue Jan 19, 2024
holdenk pushed a commit that referenced this issue Feb 13, 2024
…custom show (#405)

* feat: [~] #404 Timestamp tolerance

* fix: [~] #404 PR#213 BigDecimal tolerance

* feat: [~] #404 enable struct approxEquals

* feat: [+] #404 custom show when failing approxEquals

* feat: [~] retro-compatibility & release notes

* feat: [~] create deprecated functions - backward compatibility