Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactoring code to parquet to zip2parquet #525

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from
Open

Conversation

blublinsky
Copy link
Collaborator

Why are these changes needed?

modified code2 parquet to support both code and language files and moved it to universal

Related issue number (if any).

#520

@shahrokhDaijavad
Copy link
Member

@blublinsky Is this new version completely backward compatible with the old version of code2parquet?

@blublinsky
Copy link
Collaborator Author

Yes. By default it will work exactly the same as current (all the current tests pass). If you specify code_data == False, it will not try to decide which programming language it is - just store the file content as is

@shahrokhDaijavad
Copy link
Member

Thanks, @blublinsky. Perfect! From a technical point of view, David needs to approve, but a note to myself that if this is approved, the example Jupyter notebook for code that Shivdeep has created should be modified with the new name and location of the transform.

@daw3rd
Copy link
Member

daw3rd commented Aug 27, 2024

i'm not sure I think this has to be backwards compatibile with code2parquet. As such, I would suggest the following:

  1. code is NOT the default
  2. There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example {"programming_language_column" : "some name", }

Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.

I would like to do more here than just generalize for code and .txt. Some more design seems needed.

@blublinsky
Copy link
Collaborator Author

'm not sure I think this has to be backwards compatibile with code2parquet. This was initial requirement from @shahrokhDaijavad

As such, I would suggest the following:

  1. code is NOT the default
    Does not matter to me
  2. There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example {"programming_language_column" : "some name", }
    We need them for code support

Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.
The last conversation with @nirmdesai was that we want all files

@Bytes-Explorer
Copy link
Collaborator

Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR.

@Bytes-Explorer Bytes-Explorer self-requested a review August 28, 2024 10:53
@blublinsky
Copy link
Collaborator Author

Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR.

@Bytes-Explorer please take it up with @nirmdesai . It was his request

@Bytes-Explorer
Copy link
Collaborator

Ok, will clarify. Lets not merge this PR till then.

@nirmdesai
Copy link
Collaborator

@Bytes-Explorer , @blublinsky, @touma-I : Team, since we have various notebooks and other artifacts that already depend on Code2Parquet, we cannot make breaking changes to this transform. It is on me that I did not explicitly clarify this earlier!

For now, it would be best to add Any2Parquet as a separate transform that can read any file content as binary and produce a parquet.

In future, if all notebooks / users were using "pip install" to use a specific stable version of DPK, we would be free to make breaking changes in developing the next release without affecting all the users. I know you all are moving in this direction already.

@shahrokhDaijavad
Copy link
Member

shahrokhDaijavad commented Aug 28, 2024

Thanks for the clarifications, @nirmdesai. Since we now want to ingest more than just text files, doing Any2Parquet as a separate module is the best path forward.

In defense of what was done, the changes done to the Code2Parquet were completely backward-compatible with the old version and would not have broken any notebook or artifact, if the name of the module and the path to its directory had not changed. At the same time, the name Code2Parquet was not appropriate, if it was handling both code and non-code text.

Having said that, let's move towards the new Any2Parquet.

@touma-I touma-I self-requested a review October 14, 2024 12:36
Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@touma-I to check if this is still relevant or should be closed without merge.

@shahrokhDaijavad
Copy link
Member

@touma-I The code that Boris developed under this PR is valuable because it is a "generalization" of what we have with the current code2parquet. However, in order not to run into issues of backward compatibility with the notebooks/examples that use the current code2parquet, the most useful thing to do is to make changes to this PR, so that it keeps the current code2parquet as is (under the Code directory) and the new modified version by Boris, which should be renamed any2parquet, will be a separate transform under the universal directory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants