suggestion: Upload dataset with jsonl format #38 (Open)

iainmwallace opened this issue Jul 15, 2017 · 4 comments
@iainmwallace commented Jul 15, 2017

Enabling dataset upload via JSONL might improve performance and reduce parsing mistakes compared to CSV.
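
To make the parsing point concrete: a field with embedded quotes and commas is a classic CSV failure mode, while JSONL encodes each row as a self-contained JSON object (illustrative example):

# CSV row, easy to mis-parse:
#   1,"a value with a ""quote"" and a, comma"
# JSONL row, unambiguous:
#   {"id":1,"text":"a value with a \"quote\" and a, comma"}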

This is what the bigrquery package implements to upload datasets via its insert_upload_job function. Specifically, bigrquery:::export_json() creates the appropriate file format.

This works:

library(readr)

x <- bigrquery:::export_json(mtcars)
x <- gsub("\n$", "", x)  # the solvebio parser fails if there is an extra newline at the end of the file
write_lines(x, path = "mtcars.json.gz")  # the .gz extension makes readr gzip-compress the output
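
For comparison, jsonlite can write the same newline-delimited format without relying on an unexported bigrquery function. A minimal sketch, assuming the jsonlite package; it may also leave a trailing newline, so the same gsub workaround could apply:

library(jsonlite)
# stream_out() writes one JSON object per row (JSONL / NDJSON);
# a gzfile() connection compresses the output as it is written
stream_out(mtcars, gzfile("mtcars.json.gz"))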

Note that exporting JSON directly from Google's BigQuery formats integers incorrectly:
https://issuetracker.google.com/issues/35906037

@davecap (Member) commented Jul 15, 2017

You can already upload in JSONL format. We'll look into the trailing-newline bug, though. When you write a json.gz file, you should be able to upload and import it. Does that not work?
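
The round trip looks roughly like this, as a sketch using the Upload.create() and DatasetImport.create() calls from the longer snippet further down in this thread (it assumes dataset is an existing Dataset object):

library(solvebio)

upload_file <- Upload.create(path = "mtcars.json.gz")  # upload the gzipped JSONL file

imp <- DatasetImport.create(
  dataset_id = dataset$id,      # assumes an existing Dataset object
  upload_id = upload_file$id,
  title = "mtcars import",
  auto_approve = TRUE)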

@iainmwallace (Author)

Yeah, I can upload this file.

I was suggesting a helper function that converts a CSV file to JSONL in the backend before uploading, since this might help reduce parsing mistakes. It would also let the solvebio package upload any file via the rio package (https://cran.r-project.org/web/packages/rio/index.html); see the sketch below.
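
A rough sketch of the kind of helper I mean (the name export_jsonl is illustrative, not an existing function; assumes the rio and jsonlite packages):

library(rio)
library(jsonlite)

# Hypothetical helper: read any format rio supports, write gzipped JSONL
export_jsonl <- function(infile, outfile = paste0(infile, ".json.gz")) {
  df <- rio::import(infile)  # rio auto-detects the input format
  stream_out(df, gzfile(outfile), verbose = FALSE)
  outfile
}

The resulting file could then go straight to Upload.create().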

@davecap (Member) commented Jul 16, 2017

Good idea! That would probably make it a lot friendlier for R users to get data in safely.

@iainmwallace (Author)

And here is a slightly longer code snippet for the entire process:

library(rio)
library(readr)
library(solvebio)  # provides Dataset.get_or_create_by_full_name(), Upload.create(), DatasetImport.create()

my_original_name <- input$file2$name  # from the Shiny file-upload input input$file2

my_data <- getData()  # getData() returns a data frame via rio's import()

# split the rows into chunks of at most 100,000 so each JSONL file stays a manageable size
row_idx <- seq_len(nrow(my_data))
chunks <- split(row_idx, ceiling(seq_along(row_idx) / 100000))

x <- list()
y <- list()
my_json_name <- list()
file_pattern <- paste0(my_original_name, "_")

for (i in seq_along(chunks)) {
  cat(i)
  x[[i]] <- bigrquery:::export_json(my_data[chunks[[i]], ])
  y[[i]] <- gsub("\n$", "", x[[i]])  # drop the trailing newline that trips up the solvebio parser
  my_json_name[[i]] <- tempfile(pattern = file_pattern, fileext = ".json.gz")
  write_lines(y[[i]], path = my_json_name[[i]])  # the .gz extension makes readr compress the output
}

files <- list.files(tempdir(), pattern = file_pattern, full.names = TRUE)  # note: pattern is a regex


## set up dataset
datasetName <- "Dataset name"
repository_name <- "Set repository name"  # should include any separator the full-name format expects
my_dataset <- paste0(repository_name, datasetName)
my_project <- "my_project"
my_description <- "dataset description"
dataset <- Dataset.get_or_create_by_full_name(my_dataset,
                                              description = my_description,
                                              metadata = list(details = "detail metadata",
                                                              project = my_project,
                                                              original_file_name = my_original_name),
                                              tags = c("tag1", "tag2"))

## loop over the chunk files: upload each one, then import it into the dataset
for (i in seq_along(files)) {
  upload_file <- Upload.create(path = files[i])

  imp <- DatasetImport.create(
    dataset_id = dataset$id,
    upload_id = upload_file$id,
    title = paste0(my_original_name, " import"),
    auto_approve = TRUE)
}

@davecap removed this from the v2.0.1 Release milestone on Aug 5, 2017