[Feature][connector-hive] hive sink connector support overwrite mode #7843

Open
3 tasks done
Adamyuanyuan opened this issue Oct 15, 2024 · 0 comments · May be fixed by #7891

Search before asking

  • I have searched the existing feature requests and found no similar requirement.

Description

Add a flag that controls whether the Hive sink writes in overwrite mode. When it is set to true, the existing data of a non-partitioned table is deleted before the new data is inserted. For a partitioned table, only the data in the partitions being written is deleted before the new data is inserted.

Hive inserts currently always run in append mode, but some jobs need to overwrite existing data, similar to insert overwrite, or to delete data before inserting. There are several ways to implement this:

  1. Scheduling workflows: as a temporary workaround, configure a workflow that first deletes the corresponding table and then performs the insertion.
  2. Upper-layer data integration products: a pipeline or similar mechanism deletes the data before the insertion runs.
  3. Native support for "overwrite" mode in SeaTunnel: of the three options, implementing the feature directly in the SeaTunnel core is the most convenient. Using SeaTunnel's two-phase commit logic, data is first written to a temporary directory; the existing target directory is then deleted (using deleteFile(directory)), and finally the temporary directory is renamed into place. This gives better data consistency, because the time between deleting and renaming the directory is on the order of milliseconds. It reuses existing utility classes, so the code change is small, and it works significantly better than the upper-layer methods.

The implementation follows the logic of Flink's overwrite operator.
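
The write, delete, rename sequence from option 3 can be sketched roughly as follows. This is only an illustration on the local filesystem using java.nio.file, not the connector's actual code: the class OverwriteCommitSketch and the methods commit and deleteDirectory are hypothetical names, and deleteDirectory merely stands in for the deleteFile(directory) utility mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local-filesystem stand-in for the commit phase of the sink.
class OverwriteCommitSketch {

    // Recursively deletes a directory if it exists (stand-in for deleteFile(directory)).
    static void deleteDirectory(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Commit phase: the data has already been written to tempDir.
    static void commit(Path tempDir, Path targetDir, boolean overwrite) throws IOException {
        if (overwrite) {
            // Overwrite: drop the old data, then rename the temporary directory into place.
            deleteDirectory(targetDir);
            Path parent = targetDir.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.move(tempDir, targetDir);
        } else {
            // Append (current behaviour): move the new files next to the existing ones.
            Files.createDirectories(targetDir);
            List<Path> newFiles;
            try (Stream<Path> files = Files.list(tempDir)) {
                newFiles = files.collect(Collectors.toList());
            }
            for (Path file : newFiles) {
                Files.move(file, targetDir.resolve(file.getFileName()));
            }
            Files.delete(tempDir);
        }
    }
}
```

Because the old data is removed only at commit time, after the new data has been fully written, a failed write leaves the existing table contents untouched in this sketch.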

On the configuration side, only a new overwrite parameter (default false) needs to be added to the Hive sink:

sink {
  Hive {
    table_name = "default.test_overwrite_1"
    metastore_uri = "thrift://hadoop-master1.orb.local:9083"
    overwrite = true
    source_table_name = "source_table"
  }
}

Expected Logic

| Operation Type | Logic Description |
| --- | --- |
| Hive table overwrite | Delete old table data, write new data |
| Hive partitioned table overwrite | Delete only the data of the partitions being overwritten, write the corresponding partition data |
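
The expected logic above amounts to choosing which directories to clear before the new data is moved into place. A minimal sketch of that choice, assuming partitions map to subdirectories such as dt=2024-10-15 under the table location; the class OverwriteTargetSketch, the method directoriesToClear, and the example paths are hypothetical and do not come from the connector:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: decide which directories an overwrite commit has to clear.
class OverwriteTargetSketch {

    /**
     * @param tableLocation     root directory of the table, e.g. "/warehouse/test_overwrite_1"
     * @param partitioned       whether the Hive table is partitioned
     * @param writtenPartitions relative partition directories this job wrote to,
     *                          e.g. "dt=2024-10-15"; ignored for non-partitioned tables
     */
    static Set<String> directoriesToClear(
            String tableLocation, boolean partitioned, List<String> writtenPartitions) {
        Set<String> targets = new LinkedHashSet<>();
        if (!partitioned) {
            // Non-partitioned table: all old table data is replaced.
            targets.add(tableLocation);
            return targets;
        }
        // Partitioned table: only the partitions that receive new data are replaced;
        // the set removes duplicates when many files land in the same partition.
        for (String partition : writtenPartitions) {
            targets.add(tableLocation + "/" + partition);
        }
        return targets;
    }
}
```

In this sketch, partitions that receive no new data never appear in the result, so their existing files are left alone.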

Usage Scenario

Hive inserts currently always append, but some jobs need to replace the existing data instead, as with insert overwrite or a delete-before-insert step.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


Adamyuanyuan changed the title from "[Feature][connector-hive] hive connector support overwrite mode" to "[Feature][connector-hive] hive sink connector support overwrite mode" on Oct 15, 2024