
spark-tensorflow-connector: gzip codec ignored in latest master version (scala 2.12) #172

tekumara opened this issue Oct 11, 2020 · 5 comments

@tekumara

#131 introduced the codec option, e.g.:

    df.write.format("tfrecords").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)

This works using org.tensorflow:spark-tensorflow-connector_2.11:1.15.0 from Maven Central.
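For reference, that working dependency in sbt form (coordinates as above):

    libraryDependencies += "org.tensorflow" % "spark-tensorflow-connector_2.11" % "1.15.0"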

However, this setting is ignored when building from master, i.e. commit 77abfa1 (Scala 2.12 and Spark 3.0.0).
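A minimal way to check the regression (a sketch, not from the original report; it assumes an active `spark` session and a writable `path`, and checks whether the codec's file extension shows up on the output):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SaveMode

    // Write a trivial DataFrame with the gzip codec requested.
    val df = spark.range(100).toDF("x")
    df.write.format("tfrecords")
      .option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)

    // If the codec is honored, every part file should carry the .gz suffix;
    // on the 2.12/Spark 3.0.0 master build they come out uncompressed.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val allGzipped = fs.listStatus(new Path(path))
      .map(_.getPath.getName)
      .filter(_.startsWith("part-"))
      .forall(_.endsWith(".gz"))
    println(s"all part files gzip-compressed: $allGzipped")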

@acastelli1

Is there any news on this?

@TylerBrabham

+1 This is blocking me.

@dx-xp-team

+1 Same for me. This is blocking.

@acastelli1

Any news on this? It's a bit annoying. It's happening for me when writing to Google Cloud Storage.

@mNemlaghi

In case anyone is still blocked by this: I managed to overcome the issue and write tfrecords with gzip compression on Spark 3.0.0 and Scala 2.12.10. I simply replaced the spark-tensorflow-connector jar with spark-tfrecord_2.12-0.3.0.jar. The latter seems to be based on the former, with one slight change in the code: the format name is tfrecord instead of tfrecords. This should work:

    df.write.format("tfrecord").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)
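And for reading the compressed files back with the same jar (a sketch, assuming `spark` and `path` as above, and that the reader picks the codec from the .gz extension):

    val readBack = spark.read.format("tfrecord")
      .option("recordType", "Example")
      .load(path)
    readBack.show()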
