
spark-tensorflow-connector: gzip codec ignored in latest master version (scala 2.12) #172

tekumara opened this issue Oct 11, 2020 · 5 comments

@tekumara

#131 introduced the codec option, e.g.:

    df.write.format("tfrecords").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)

This works using org.tensorflow:spark-tensorflow-connector_2.11:1.15.0 from Maven Central.
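For reference, that working dependency in sbt form (coordinates as above):

    libraryDependencies += "org.tensorflow" % "spark-tensorflow-connector_2.11" % "1.15.0"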

However, this setting is ignored when building from master, i.e. commit 77abfa1 (Scala 2.12 and Spark 3.0.0).
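A minimal way to check the regression (a sketch, not from the original report; it assumes an active `spark` session and a writable `path`, and checks whether the codec's file extension shows up on the output):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SaveMode

    // Write a trivial DataFrame with the gzip codec requested.
    val df = spark.range(100).toDF("x")
    df.write.format("tfrecords")
      .option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)

    // If the codec is honored, every part file should carry the .gz suffix;
    // on the 2.12/Spark 3.0.0 master build they come out uncompressed.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val allGzipped = fs.listStatus(new Path(path))
      .map(_.getPath.getName)
      .filter(_.startsWith("part-"))
      .forall(_.endsWith(".gz"))
    println(s"all part files gzip-compressed: $allGzipped")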

@acastelli1

Is there any news on this?

@TylerBrabham

+1 This is blocking me.

@dx-xp-team

+1 Same for me. This is blocking.

@acastelli1

Any news on this? It's a bit annoying. It's happening for me when writing to Google Cloud Storage.

@mNemlaghi

In case anyone is still blocked by this: I managed to overcome the issue and write tfrecords with gzip compression on Spark 3.0.0 and Scala 2.12.10. I simply replaced the spark-tensorflow-connector jar with spark-tfrecord_2.12-0.3.0.jar. The latter seems to be based on the former, with one slight change in the code: the format name is tfrecord instead of tfrecords. This should work:

    df.write.format("tfrecord").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)
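And for reading the compressed files back with the same jar (a sketch, assuming `spark` and `path` as above, and that the reader picks the codec from the .gz extension):

    val readBack = spark.read.format("tfrecord")
      .option("recordType", "Example")
      .load(path)
    readBack.show()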
