suggestion: Upload dataset with jsonl format #38 (Open)

iainmwallace opened this issue Jul 15, 2017 · 4 comments
@iainmwallace commented Jul 15, 2017

Enabling dataset upload via JSONL might improve performance and reduce parsing mistakes compared to CSV.
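
To make the parsing point concrete: a field with embedded quotes and commas is a classic CSV failure mode, while JSONL encodes each row as a self-contained JSON object (illustrative example):

# CSV row, easy to mis-parse:
#   1,"a value with a ""quote"" and a, comma"
# JSONL row, unambiguous:
#   {"id":1,"text":"a value with a \"quote\" and a, comma"}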

This is what the bigrquery package implements to upload datasets via its insert_upload_job function. Specifically, bigrquery:::export_json() creates the appropriate file format.

This works:

library(readr)

x <- bigrquery:::export_json(mtcars)
x <- gsub("\n$", "", x)  # the solvebio parser fails if there is an extra newline at the end of the file
write_lines(x, path = "mtcars.json.gz")  # the .gz extension makes readr gzip-compress the output
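
For comparison, jsonlite can write the same newline-delimited format without relying on an unexported bigrquery function. A minimal sketch, assuming the jsonlite package; it may also leave a trailing newline, so the same gsub workaround could apply:

library(jsonlite)
# stream_out() writes one JSON object per row (JSONL / NDJSON);
# a gzfile() connection compresses the output as it is written
stream_out(mtcars, gzfile("mtcars.json.gz"))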

Note that exporting JSON directly from Google's BigQuery formats integers incorrectly:
https://issuetracker.google.com/issues/35906037

@davecap (Member) commented Jul 15, 2017

You can already upload in JSONL format. We'll look into the trailing-newline bug, though. When you write a json.gz file, you should be able to upload and import it. Does that not work?
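
The round trip looks roughly like this, as a sketch using the Upload.create() and DatasetImport.create() calls from the longer snippet further down in this thread (it assumes dataset is an existing Dataset object):

library(solvebio)

upload_file <- Upload.create(path = "mtcars.json.gz")  # upload the gzipped JSONL file

imp <- DatasetImport.create(
  dataset_id = dataset$id,      # assumes an existing Dataset object
  upload_id = upload_file$id,
  title = "mtcars import",
  auto_approve = TRUE)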

@iainmwallace (Author)

Yeah, I can upload this file.

I was suggesting a helper function that converts a CSV file to JSONL in the backend before uploading, since this might help reduce parsing mistakes. It would also let the solvebio package upload any file via the rio package (https://cran.r-project.org/web/packages/rio/index.html); see the sketch below.
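
A rough sketch of the kind of helper I mean (the name export_jsonl is illustrative, not an existing function; assumes the rio and jsonlite packages):

library(rio)
library(jsonlite)

# Hypothetical helper: read any format rio supports, write gzipped JSONL
export_jsonl <- function(infile, outfile = paste0(infile, ".json.gz")) {
  df <- rio::import(infile)  # rio auto-detects the input format
  stream_out(df, gzfile(outfile), verbose = FALSE)
  outfile
}

The resulting file could then go straight to Upload.create().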

@davecap (Member) commented Jul 16, 2017

Good idea! That would probably make it a lot friendlier for R users to get data in safely.

@iainmwallace (Author)

And here is a slightly longer code snippet for the entire process:

library(rio)
library(readr)
library(solvebio)  # provides Dataset.get_or_create_by_full_name(), Upload.create(), DatasetImport.create()

my_original_name <- input$file2$name  # from the Shiny file-upload input input$file2

my_data <- getData()  # getData() returns a data frame via rio's import()

# split the rows into chunks of at most 100,000 so each JSONL file stays a manageable size
row_idx <- seq_len(nrow(my_data))
chunks <- split(row_idx, ceiling(seq_along(row_idx) / 100000))

x <- list()
y <- list()
my_json_name <- list()
file_pattern <- paste0(my_original_name, "_")

for (i in seq_along(chunks)) {
  cat(i)
  x[[i]] <- bigrquery:::export_json(my_data[chunks[[i]], ])
  y[[i]] <- gsub("\n$", "", x[[i]])  # drop the trailing newline that trips up the solvebio parser
  my_json_name[[i]] <- tempfile(pattern = file_pattern, fileext = ".json.gz")
  write_lines(y[[i]], path = my_json_name[[i]])  # the .gz extension makes readr compress the output
}

files <- list.files(tempdir(), pattern = file_pattern, full.names = TRUE)  # note: pattern is a regex


## set up dataset
datasetName <- "Dataset name"
repository_name <- "Set repository name"  # should include any separator the full-name format expects
my_dataset <- paste0(repository_name, datasetName)
my_project <- "my_project"
my_description <- "dataset description"
dataset <- Dataset.get_or_create_by_full_name(my_dataset,
                                              description = my_description,
                                              metadata = list(details = "detail metadata",
                                                              project = my_project,
                                                              original_file_name = my_original_name),
                                              tags = c("tag1", "tag2"))

## loop over the chunk files: upload each one, then import it into the dataset
for (i in seq_along(files)) {
  upload_file <- Upload.create(path = files[i])

  imp <- DatasetImport.create(
    dataset_id = dataset$id,
    upload_id = upload_file$id,
    title = paste0(my_original_name, " import"),
    auto_approve = TRUE)
}

@davecap removed this from the v2.0.1 Release milestone on Aug 5, 2017