
assertDataFrameApproximateEquals params show when failing #404

Closed
eruizalo opened this issue Jan 14, 2024 · 5 comments · Fixed by #405

Comments

@eruizalo
Contributor

Hello @holdenk, I've noticed that since commit 6f812cc the way the assertDataFrameApproximateEquals method displays DataFrames that are not approximately equal has changed. The new approach calls show with the default parameters (truncating the text and limiting the output to 20 rows).

I would like to be able to parameterize this "show" in some way, so that at least the truncate and numRows parameters can be passed to assertDataFrameApproximateEquals. Another option could be a DataFrame => Unit lambda parameter (whose default keeps the current behaviour) so that we can customize it per project. That way we could even convert the DataFrame into JSON when we have complex structures and properly see which fields are not equal.
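A minimal sketch of the default-lambda idea in plain Scala (the object name, method name, and the simplified Seq[Double] signature are all illustrative stand-ins, not the actual spark-testing-base API, which works on DataFrames): the failure reporter defaults to something like the current show behaviour but can be overridden per call.

```scala
object CustomShowSketch {
  // Default failure reporter: analogous to df.show(), which prints at most
  // the first 20 rows with truncated columns.
  def defaultShow(mismatches: Seq[(Double, Double)]): Unit =
    mismatches.take(20).foreach { case (e, a) => println(s"expected=$e actual=$a") }

  // Compare element-wise within a tolerance; on failure, hand the mismatching
  // pairs to a caller-supplied reporter (the default keeps current behaviour).
  def assertApproxEquals(
      expected: Seq[Double],
      actual: Seq[Double],
      tol: Double,
      onMismatch: Seq[(Double, Double)] => Unit = defaultShow
  ): Boolean = {
    val mismatches = expected.zip(actual).filter { case (e, a) => math.abs(e - a) > tol }
    if (mismatches.nonEmpty) onMismatch(mismatches)
    mismatches.isEmpty
  }
}
```

A caller could then pass e.g. `onMismatch = ms => println(ms.mkString("\n"))`, the same way the real method could accept `df => df.show(50, false)`.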

I've been working on this and some other contributions (though I'm not sure whether you want a separate issue for each feature):

  • Timestamp tolerance (using java.time.Duration), so a different tolerance can be used for decimals and timestamps in the same DataFrame
  • Struct approx equality (recursive calls): I think that if you use a tolerance on a field inside a struct, the comparison currently fails even when the value is within the tolerance
  • BigDecimal tolerance: Compare Java and Scala BigDecimal with tolerance #213 was never merged and the PR may now have conflicts, so I took the liberty of including it
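For the BigDecimal point, a tolerant comparison across Java and Scala BigDecimal could look like the following sketch (object and method names are hypothetical, not the API proposed in #213); converting to Scala's BigDecimal keeps the arithmetic exact rather than going through Double:

```scala
import java.math.{BigDecimal => JBigDecimal}

object BigDecimalToleranceSketch {
  // A Java and a Scala BigDecimal are approximately equal when their
  // absolute difference does not exceed the tolerance.
  def approxEquals(a: JBigDecimal, b: BigDecimal, tol: BigDecimal): Boolean =
    (BigDecimal(a) - b).abs <= tol
}
```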

What do you think?

@holdenk
Owner

holdenk commented Jan 15, 2024

Using show sounds good, and a custom lambda for comparison sounds reasonable, provided we keep the current default comparison.

I'm not so sure about JSON-based comparison; that sounds slow/brittle to me, but I could be wrong.

eruizalo added a commit to eruizalo/spark-testing-base that referenced this issue Jan 17, 2024
@eruizalo
Contributor Author

eruizalo commented Jan 17, 2024

Hi @holdenk,

I'm not sure I explained myself properly, and since I had the features almost finished, I think it's better to just show them to you.

I tried to split this issue into 4 commits, each containing a single possible feature, so that if you don't like any of them it can be removed or updated individually. I can also create separate issues/PRs if you prefer that:

  • Timestamp tolerance - cb0b0ac: here I separate the timestamp tolerance from the decimal tolerance, so both tolerances can be tested in the same row (decimal tolerance is usually 0 < tol < 1, but timestamp tolerance is usually > 1000)
  • BigDecimal tolerance - 88b173d: copy/paste from Compare Java and Scala BigDecimal with tolerance #213
  • Enable struct approx equality - 3ebf545: I think the comparison currently fails when a decimal or a timestamp inside a struct should pass the validation due to the tolerance
  • Custom show when failing - 5e77125: currently, if one DataFrame differs from another, the assertion simply calls show() without any parameters, so wide columns are truncated. With a custom show function we could do df.show(50, false) or df.toJSON.show(false) (this is what I meant about JSON/big structures). It then becomes the developer's decision how to display the differences and how many rows to include (the default behaviour remains)
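The timestamp-tolerance idea can be sketched in isolation (names are illustrative; the real change would live inside the DataFrame row comparison): a java.time.Duration expresses the tolerance naturally, instead of overloading the numeric tolerance used for decimals.

```scala
import java.sql.Timestamp
import java.time.Duration

object TimestampToleranceSketch {
  // Two timestamps are approximately equal when they differ by no more than
  // the given Duration; this is independent of any numeric tolerance applied
  // to decimal columns in the same row.
  def approxEquals(a: Timestamp, b: Timestamp, tol: Duration): Boolean =
    math.abs(a.getTime - b.getTime) <= tol.toMillis
}
```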

Please tell me what you think about these features, and whether I should separate them into new issues/PRs or they require any changes.

@holdenk
Owner

holdenk commented Jan 18, 2024

Ah, that makes more sense; sorry for misunderstanding part of the first issue. This sounds reasonable. Do you have a PR up that I can review?

eruizalo added a commit to eruizalo/spark-testing-base that referenced this issue Jan 18, 2024
@eruizalo
Contributor Author

Hi @holdenk, I just created #405. Let me know your thoughts.

@holdenk
Owner

holdenk commented Jan 18, 2024

Fantastic, I'll try to take a look either tomorrow or early next week :)

eruizalo added a commit to eruizalo/spark-testing-base that referenced this issue Jan 19, 2024
holdenk pushed a commit that referenced this issue Feb 13, 2024
…custom show (#405)

* feat: [~] #404 Timestamp tolerance

* fix: [~] #404 PR#213 BigDecimal tolerance

* feat: [~] #404 enable struct approxEquals

* feat: [+] #404 custom show when failing approxEquals

* feat: [~] retro-compatibility & release notes

* feat: [~] create deprecated functions - backward compatibility