Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

union_relations guesses data types incorrectly so database fails #957

Open
1 of 5 tasks
EugenioGastelumGBR opened this issue Oct 9, 2024 · 0 comments
Open
1 of 5 tasks
Labels
bug Something isn't working triage

Comments

@EugenioGastelumGBR
Copy link

EugenioGastelumGBR commented Oct 9, 2024

Describe the bug

I am using dbt_utils.union_relations to union two tables that have the same column names, but not same data types. So the macro compiles the code by doing a SELECT cast(col1 as {data_type}) as col1.... for each table it has in the UNION on the compiled code. However, I am trying to union two models A and B, both have already materialized as tables. And Both have a column called keyword_id as numeric but of different size. Table A has it defined as numeric(1,0) while table B as numeric(18,0) and it's fine because at table A the values are single digits while at B they are very large.

The issue is, when I do dbt_utils.union_relations(relations=[ref('A'),ref('B')] it fails, while if I just invert the order and do dbt_utils.union_relations(relations=[ref('B'),ref('A')] it works

I checked the compiled code to see why. And its because when A comes before B, it guesses the data type of col keyword_id as number(1,0) (in other words, it took its decision based on table A apparently). While in the scenario where we have table B and then A it guesses the data type as number(18,0).

In the compiled code of scenario 1 (A and then B) it does

SELECT ...., cast(keyword_id as number(1,0)) as keyword_id,...<more_columns> FROM A
UNION 
SELECT ...., cast(keyword_id as number(1,0)) as keyword_id,...<more_columns> FROM B

While in scenario 2 (B then A):

SELECT ...., cast(keyword_id as number(18,0)) as keyword_id,...<more_columns> FROM B
UNION 
SELECT ...., cast(keyword_id as number(18,0)) as keyword_id,...<more_columns> FROM A

So the order in which we add the relations in the macro has a an impact on how it guesses the data types. And in a certain combination of order of tables and their data types it might choose data types that will error out.

Steps to reproduce

Have two models which have the same column names, they can be identical, just in one model add a column keyword_id as always a single digit number; and in the other model make it a lot of digits (18 for example). Both models need to materialize as tables (in my case it was in snowflake) and have to be built and populated. Then create a third model and use the union macro to see the compile code in the scenario where you put the order of relations as model1 and then model2, vs the reverse scenario. Under the compile code check the way it casts the keyword_id column and how it changes from one scenario to another

Expected results

Ideally, the order on how we pass the references to the macro should not impact in the final result on how it guesses the data type. Or in another possible solution, the union macro should accept some data types parameter or some parameter to define a strategy on how to guess data types

System information

The contents of your packages.yml file:
packages:

  • package: dbt-labs/dbt_utils
    version: 1.3.0
  • package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Are you interested in contributing the fix?

I would like to contribute if I had some guidance, but I have never navigated the repo before and I am not an expert on the source code

@EugenioGastelumGBR EugenioGastelumGBR added bug Something isn't working triage labels Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant