Feature Request: Support for Additional File Formats in Data Preview #458

Open
pingdoom opened this issue Aug 10, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@pingdoom

Hello CSGHub Team,

I hope this message finds you well. I've been exploring the CSGHub platform and I'm impressed with its capabilities, especially in managing large model assets and datasets. It's evident that a lot of thought and effort has gone into making CSGHub a comprehensive asset management platform.

One area where I believe CSGHub could be enhanced is in the support for additional file formats in the dataset preview functionality. Currently, CSGHub provides excellent support for previewing datasets in common formats. However, as datasets become increasingly complex and diverse, the need to support additional formats becomes apparent.

Enhancement Request:
I would like to request the addition of support for the following file formats in the dataset preview functionality:

  • HDF5 (.h5)
  • Apache Parquet (.parquet)
  • Avro (.avro)

These formats are widely used in the data science and machine learning communities for storing large, complex datasets. Supporting these formats would significantly enhance the usability of CSGHub for a broader audience and facilitate more efficient data exploration and management.

Justification:

  • HDF5: Widely used in academia and industry for storing large datasets, especially in the fields of physics, astronomy, and bioinformatics.
  • Apache Parquet: Offers efficient data compression and encoding schemes. It's heavily adopted in data engineering pipelines and supports schema evolution.
  • Avro: A row-based storage format that's ideal for data serialization. It's commonly used in data streaming architectures.

Potential Implementation:
While I understand that adding support for these formats might require considerable effort, perhaps starting with HDF5, given its widespread use, could be a beneficial first step. Utilizing existing open-source libraries for reading these formats could also streamline the implementation process.
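As a rough illustration of the library-based approach, here is a minimal Python sketch of a preview helper. It is not CSGHub code: the function names `detect_format` and `preview` are hypothetical, and the choice of `h5py`, `pyarrow`, and `fastavro` as readers is an assumption about which open-source libraries might be used. Detection relies on the leading signature bytes each format defines, and the third-party readers are imported only when needed.

```python
import itertools

# Leading signature bytes defined by each format's specification.
MAGIC = {
    b"\x89HDF\r\n\x1a\n": "hdf5",   # HDF5 superblock signature
    b"PAR1": "parquet",              # Parquet magic number
    b"Obj\x01": "avro",              # Avro object container file header
}


def detect_format(path):
    """Return 'hdf5', 'parquet', 'avro', or None based on the file header.

    Assumption: the signature sits at offset 0. HDF5 files with a user
    block place it at a later offset and would not be recognized here.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def preview(path, limit=20):
    """Return a small, bounded preview of the file (hypothetical helper)."""
    fmt = detect_format(path)
    if fmt == "parquet":
        import pyarrow.parquet as pq  # assumed dependency
        batches = pq.ParquetFile(path).iter_batches(batch_size=limit)
        first = next(batches, None)
        return first.to_pylist() if first is not None else []
    if fmt == "avro":
        import fastavro  # assumed dependency
        with open(path, "rb") as fh:
            return list(itertools.islice(fastavro.reader(fh), limit))
    if fmt == "hdf5":
        import h5py  # assumed dependency
        summary = []

        def visit(name, obj):
            # HDF5 is hierarchical, so list datasets rather than rows.
            if isinstance(obj, h5py.Dataset):
                summary.append(
                    {"name": name, "shape": obj.shape, "dtype": str(obj.dtype)}
                )

        with h5py.File(path, "r") as f:
            f.visititems(visit)
        return summary[:limit]
    raise ValueError(f"unsupported or unrecognized format: {path}")
```

Reading only the first batch or the first few records keeps the preview cheap even for very large files, which seems relevant to the performance concerns any real implementation would face.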

I believe that extending dataset preview capabilities to include these formats would make CSGHub even more versatile and valuable to the data science and machine learning communities.

Thank you for considering this enhancement request. I'm looking forward to seeing how CSGHub continues to evolve and meet the needs of its users.

@SeanHH86
Contributor

SeanHH86 commented Aug 11, 2024

@pingdoom Thanks for raising this and for providing more information and justification for previewing data in those formats. Dataset preview is a key feature for us, and many requirements are coming in. Making those datasets previewable on CSGHub is on our roadmap, and we are working on the dataset viewer. There are more things to consider, including security, performance, and usability. We look forward to receiving more feedback from you.

Have a nice day!

@Rader added the enhancement (New feature or request) label on Aug 12, 2024
3 participants