Materials for RCC workshop, "Large-scale data analysis in R."

rcc-uchicago/R-large-scale

Large-scale data analysis in R

The R computing environment has become an important tool for quantitative research, from computational biology to financial modeling. In this hands-on workshop, we will explore commonly used strategies to efficiently analyze large-scale data sets in R. Participants will learn to automate their R analyses on a compute cluster, profile memory usage, call fast C++ routines in R, and implement simple parallelization strategies, including multithreaded and distributed computing. The aim is to learn these techniques through hands-on "live coding"; we will analyze several medium to large-scale data sets.

Objectives: Attendees will

  • learn how to automate R analyses on a compute cluster;

  • use simple techniques to profile memory usage in R;

  • learn how to make more effective use of memory in R;

  • use multithreading to speed up R computations;

  • learn how to call C++ code from R using Rcpp;

  • write scripts to distribute "embarrassingly parallel" R computations using the Slurm job scheduler on the RCC Midway compute cluster;

  • learn through "live coding."

Prerequisites

All participants are expected to bring a laptop with a Mac, Linux or Windows operating system. Further, participants should be comfortable interacting with the UNIX shell and programming in a non-graphical R environment (not RStudio). An RCC user account is recommended, but not required.

What's included

This git repository (the "workshop packet") includes:

  • README.md: This file.

  • conduct.md: Code of Conduct.

  • LICENSE.md: License information for the materials in this repository.

  • slides.pdf: The slides for the workshop.

  • slides.Rmd: R Markdown source used to generate these slides.

  • Makefile: GNU Makefile containing commands to generate the slides from the R Markdown source.

Other information

Credits

These materials were developed by Peter Carbonetto at the University of Chicago. Thank you to Matthew Stephens for his support and guidance. Thanks also to Gao Wang for sharing the Python script for profiling memory usage.
