This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

CUDA Scalar Mul (#17)
* First draft affine batch ops & wnaf

* changes to mutability and lifetimes

* delete superfluous files

* crazy direction: Passing a FnMut to generate an iterator locally

* unsuccessful further attempts

* compile success using index approach

* fixes for mutable borrows

* Successfully passed scalar mul test

* benchmarks + prefetching

* stash

* generic impl of batch arith for all affinecurves

* batched affine formulas for TE - too expensive

* improved TE affine

* cleanup batch inversion (see the batch-inversion sketch after this list)

* fmt...

* fix minor error

* remove debugging scaffolding

* fmt...

* delete batch arith bench as not suitable for criterion or bench

* fix bench removal errors

* fmt...

* added missing coeff_a

* refactor BatchGroupArithmetic to be separate trait

* Batch verification with radix sort

* Cache-locality & parallelisation

* Successfully impl batch verify

* added tests and bench for batch_ver, parallel_random_gen, & thread util

* fmt

* enabled missing test

* remove voracious_radix_sort

* commented unneeded Instant::now()

* Fixed batch_ver tests for curves of small or unit cofactor

* split recursive and non-recursive, tidy up shared functionality

* reduce max_logn

* adjust max_logn further

* Batch MSM, speedup only for bw6 due to poor cache performance

* fmt...

* GLV iBigInteger

* stash

* stash

* GLV with Parameter-based specialisation

* GLV lattice basis script success

* Successfully passed tests and benched

* Improvements to MSM and bucketed adds using lightweight index sort

* changed rng to be external parameter for non-parallel batch verification

* remove bench print scaffolding

* remove old batch_bucketed_add using vectors instead of fixed offsets

* retain parallel batch_add_split

* Comments for batch arith

* remove need for hashmap for no std for batch_bucketed_add

* minor changes

* cleanup

* cleanup

* fmt + use no_std Vec

* removed std::

* add scratch space

* Add GLV for non-batched SW mul

* fix for glv_scalar_decomposition when k == MODULUS (subgroup check)

* Fixed performance BUG: unnecessary table generation

* GLV -> has_glv(), bigint slice bounds check, refactor batch loops, u32 index

* clean remove of batch_verify

* fix mistake with elems indexing, unused arg for future recursion PR

* trivial errors

* more minor fixes

* fix issues with batch_ver (.is_zero(), TE affine->proj mul)

* fix issue with batch_bucketed_add_split

* fix misnaming

* Success in test and bench \(*v*)/

* tmp commit to cache experimental batch_add_write_shift_..

* remove batch_add_write_shift..

* optional dep, fmt...

* undo accidental deletion of dlsd sort

* fmt...

* cleanup batch bucket add, unify impl

* no std...

* fixed tests

* fixed unimplemented for TE, swapped wnaf table row/col for batchaddwrite

* wnaf table generation uses fewer copies, remove timing instrumentation

* Minor Cleanup

* Add feature-activated timing instrumentation, reduce code bloat (wnaf)

* unused var, no_std

* Make timing macros defined globally, instrument more code

* instrument w/ tid, better num_rounds est. f64, timing black/whitelisting

* Minor changes

* refactor tests, generic MSM test

* 2D test matrix :)

* batchaffine

* tests

* additive features

* big_n feature for test-benching

* prefetch unroll

* minor adjustments

* rename extensions_fields -> extension_fields

* remove artifacts, fix asm

* uncomment subgroup checks, glv param sources

* gpu scalar mul

* fix dependency issues

* Extend GPU scalar mul to all curves

* refactor

* CPU + GPU coprocessing

* With suboptimal BW6 assembly

* add static partitioning

* profiling-based static partitioning (see the partitioning sketch after this list)

* statically partition between multiple gpus

* comments

* BBaseField -> BaseFieldForBatch

* Outline of basic traits

* Remove sw_proj, add gpu support for all sw projective curves

* impl gpu kernels for all curves

* feature-gate with "cuda"

* rename curves/gpu directory to curves/cuda

* Fix merge errors

* Use github rather than local jon-chuang/accel

* again

* again

* update README

* feature = "cuda"

* gpu_standalone (good for non-generic), feature gate under cuda too

* fix merging errors

* make helpers a same-file module

* remove cancerous --all-features from github yml

* Use dummy accel_dummy crate for when not compiling as CUDA

* feature gate accel import

* fix no_std

* fix: gpu-standalone does not depend on algebra-core/cuda

* lazy static optional

* kernel-specific static profile data

* cuda test, cached profile data (in OS cache dir) for all curves

* rectify omission of NAMESPACE, minor errors

* fix no_std, group size in bits too large for 2 groups (mnt6, cp6 - Fq3)

* toml fixes

* update README

* remove extraneous file

* bake in check for oversized group elems

* typo

* remove boilerplate/compactify

* remove standalone

* fmt

* fix println and comments

* fix: typo

* Update README.md

Co-authored-by: Kobi Gurkan <[email protected]>

* Make GPUScalarMulInternal APIs, only expose two APIs (exposing more APIs is future work)

* add ci to test cuda compilation/link and cuda scalar mul when no gpu

* change kernel accel compile branch to master

* fix ci

* use unreachable!() instead of an empty implementation

* install required toolchain

* Empty commit to get CI working

* try to fix ci

* fmt

* fix ci

* safer error handling in gpu code

* fix ci

* handle dirs crate not available without cuda

* don't check early intermediate results

* fix no_std and nightly

* fix remaining errors

* No for_tests

* Feature gate clear profile data

* install cuda library to successfully link

* change the order of CI jobs

* change the order of CI again

* cd ..

* Get rid of caching

* Never all features

* Put back caching

* Remove cuda .deb to save disk space

* Increase max-parallel

* check examples with all features

Co-authored-by: Kobi Gurkan <[email protected]>
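Two of the techniques named in the list above are worth a sketch. The batch-affine arithmetic rests on Montgomery's batch-inversion trick: many field inversions are traded for a single inversion plus a linear number of multiplications. Below is a minimal illustration of the technique against the crate's `Field` trait; it is not the crate's actual implementation, the import path is an assumption, and it assumes no input is zero (the real code must skip zeros).

```rust
use algebra_core::Field; // assumed re-export path

// Montgomery's trick: invert n nonzero field elements with one
// inversion and ~3n multiplications. Illustrative sketch only.
fn batch_inverse<F: Field>(v: &mut [F]) {
    // prefix[i] = v[0] * v[1] * ... * v[i-1]
    let mut prefix = Vec::with_capacity(v.len());
    let mut acc = F::one();
    for f in v.iter() {
        prefix.push(acc);
        acc *= *f;
    }
    // One inversion of the total product serves the whole batch.
    let mut inv = acc.inverse().expect("inputs must be nonzero");
    // Sweep backwards: v[i]^{-1} = prefix[i] * (v[0..=i] product)^{-1},
    // then peel v[i]'s old value off the running inverse.
    for i in (0..v.len()).rev() {
        let original = v[i];
        v[i] = inv * prefix[i];
        inv *= original;
    }
}
```

Likewise, the CPU + GPU coprocessing bullets describe statically partitioning one workload between devices using profiled throughput ratios. A toy sketch of just the split (names are illustrative; per the later bullets, the real code derives the ratio from cached per-kernel profile data):

```rust
// Split a mutable workload at a profiled GPU share, e.g. 0.8 if the GPU
// is measured to handle ~80% of throughput. Illustrative sketch only.
fn partition_static<T>(items: &mut [T], gpu_share: f64) -> (&mut [T], &mut [T]) {
    let cut = ((items.len() as f64) * gpu_share).round() as usize;
    items.split_at_mut(cut.min(items.len())) // (gpu_chunk, cpu_chunk)
}
```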
jon-chuang and kobigurk authored Nov 10, 2020
1 parent c894564 commit 7c518fd
Showing 55 changed files with 1,565 additions and 682 deletions.
47 changes: 33 additions & 14 deletions .github/workflows/ci.yml
@@ -22,7 +22,7 @@ jobs:
          toolchain: stable
          override: true
          components: rustfmt
-
+          default: true
      - name: cargo fmt --check
        uses: actions-rs/cargo@v1
        with:
@@ -35,6 +35,7 @@
    env:
      RUSTFLAGS: -Dwarnings
    strategy:
+      max-parallel: 6
      matrix:
        rust:
          - stable
@@ -50,14 +51,38 @@
          toolchain: ${{ matrix.rust }}
          override: true

-      - uses: actions/cache@v2
-        with:
-          path: |
-            ~/.cargo/registry
-            ~/.cargo/git
-            target
+      - name: Install CUDA toolchains
+        run: |
+          wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
+          sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
+          wget -q https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.1-455.32.00-1_amd64.deb
+          sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.1-455.32.00-1_amd64.deb
+          sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
+          sudo apt-get update
+          sudo apt-get -y install cuda
+          rm cuda-repo-ubuntu*
+          curl -sSL https://github.com/jon-chuang/accel/raw/master/setup_nvptx_toolchain.sh | bash
+      - uses: actions/cache@v2
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            target
+          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

+      - name: Test algebra with CUDA
+        run: |
+          cd algebra
+          cargo test --features "all_curves cuda cuda_test"
+          cd ..
+      - name: Test algebra
+        run: |
+          cd algebra
+          cargo test --features full
+          cd ..
      - name: Check examples
        uses: actions-rs/cargo@v1
        with:
@@ -68,7 +93,7 @@
        uses: actions-rs/cargo@v1
        with:
          command: check
-          args: --examples --all-features --all
+          args: --all-features --examples --all
        if: matrix.rust == 'stable'

      - name: Check benchmarks on nightly
@@ -88,12 +113,6 @@
            --exclude ff-fft-benches \
            -- --skip dpc --skip integration_test"

-      - name: Test algebra
-        run: |
-          cd algebra
-          cargo test --features full
-          cd ..
      - name: Test algebra with assembly
        run: |
          cd algebra
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -15,7 +15,7 @@ members = [
    "r1cs-core",
    "r1cs-std",
    "algebra-core/algebra-core-derive",
-    "scripts/glv_lattice_basis"
+    "scripts/glv_lattice_basis",
]

[profile.release]
7 changes: 7 additions & 0 deletions README.md
@@ -87,6 +87,13 @@ To bench `algebra-benches` with greater accuracy, especially for functions with
cargo +nightly bench --features "n_fold bls12_381"
```

+CUDA support is available for a limited set of functions. To allow compilation for CUDA on Linux, first run the script
+```
+curl -sSL https://github.com/jon-chuang/accel/raw/master/setup_nvptx_toolchain.sh | bash
+```
+or run the equivalent commands for your OS. Then, pass the `cuda` feature to rustc or cargo when compiling, and import the relevant traits (e.g. `GPUScalarMulSlice`) wherever the functions are called.
+
+When the `cuda` feature is not activated, Zexe will still compile. However, if the `cuda` feature is not activated during compilation, or CUDA is not detected on your system at runtime, Zexe will default to a CPU-only implementation of the same functionality.

## License

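For a sense of how the README's instructions translate to code, here is a hedged sketch of calling a CUDA-accelerated scalar multiplication. Only the trait name `GPUScalarMulSlice` is taken from the README above; the import paths, method name, and signature are assumptions for illustration (the commit deliberately exposes only two public APIs):

```rust
// Hedged sketch: paths and the method name below are assumed, not the
// crate's confirmed API.
use algebra::bls12_381::{Fr, G1Projective};
use algebra_core::curves::cuda::scalar_mul::GPUScalarMulSlice; // assumed path

fn multiply_in_place(points: &mut [G1Projective], scalars: &[Fr]) {
    // With the `cuda` feature and a detected GPU, work is coprocessed
    // between CPU and GPU; otherwise this transparently falls back to
    // the CPU implementation, as the README describes.
    points.cpu_gpu_scalar_mul(scalars); // hypothetical method name
}
```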
3 changes: 2 additions & 1 deletion algebra-benches/Cargo.toml
@@ -31,9 +31,10 @@ rand_xorshift = { version = "0.2" }
paste = "1.0"

[features]
-bw6_asm = [ "algebra/bw6_asm"]
asm = [ "algebra/asm"]
prefetch = [ "algebra/prefetch"]
+bw6_asm = [ "algebra/bw6_asm"]
+cuda = [ "algebra/cuda" ]
n_fold = []
mnt4_298 = [ "algebra/mnt4_298"]
mnt6_298 = [ "algebra/mnt6_298"]
14 changes: 11 additions & 3 deletions algebra-core/Cargo.toml
@@ -27,32 +27,40 @@ algebra-core-derive = { path = "algebra-core-derive", optional = true }
derivative = { version = "2", features = ["use_core"] }
num-traits = { version = "0.2", default-features = false }
rand = { version = "0.7", default-features = false }
-rayon = { version = "1", optional = true }
+rayon = { version = "1.3.0", optional = true }
unroll = { version = "=0.1.4" }
itertools = { version = "0.9.0", default-features = false }
either = { version = "1.6.0", default-features = false }
thread-id = { version = "3.3.0", optional = true }
backtrace = { version = "0.3", optional = true }
+accel = { git = "https://github.com/jon-chuang/accel", package = "accel", optional = true }
peekmore = "0.5.6"
+closure = { version = "0.3.0", optional = true }
+lazy_static = { version = "1.4.0", optional = true }
+serde_json = { version = "1.0.58", optional = true }
+dirs = { version = "1.0.5", optional = true }
+log = { version = "0.4.11", optional = true }
paste = "0.1"

[build-dependencies]
field-assembly = { path = "./field-assembly", optional = true }
-cc = "1.0"
rustc_version = "0.2"
+cc = "1.0"

[dev-dependencies]
rand_xorshift = "0.2"

[features]
-bw6_asm = []
default = [ "std", "rand/default" ]
std = []
parallel = [ "std", "rayon", "rand/default" ]
derive = [ "algebra-core-derive" ]
prefetch = [ "std" ]
+cuda = [ "std", "parallel", "accel", "lazy_static", "serde_json", "dirs", "closure", "log" ]

timing = [ "std", "backtrace" ]
timing_detailed = [ "std", "backtrace" ]
timing_thread_id = [ "thread-id" ]

llvm_asm = [ "field-assembly" ]
+bw6_asm = []
2 changes: 1 addition & 1 deletion algebra-core/algebra-core-derive/Cargo.toml
@@ -27,4 +27,4 @@ proc-macro = true
[dependencies]
proc-macro2 = "1.0"
syn = "1.0"
-quote = "1.0"
+quote = "1.0.7"
2 changes: 1 addition & 1 deletion algebra-core/mince/Cargo.toml
@@ -7,7 +7,7 @@ edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
-quote = "1.0"
+quote = "1.0.7"
syn = {version = "1.0.17", features = ["full"]}

[lib]
2 changes: 1 addition & 1 deletion algebra-core/src/bytes.rs
@@ -316,7 +316,7 @@ mod test {
    fn test_macro_empty() {
        let array: Vec<u8> = vec![];
        let bytes: Vec<u8> = to_bytes![array].unwrap();
-        assert_eq!(&bytes, &[]);
+        assert_eq!(bytes, Vec::<u8>::new());
        assert_eq!(bytes.len(), 0);
    }
4 changes: 2 additions & 2 deletions algebra-core/src/curves/batch_arith.rs
@@ -25,7 +25,7 @@ pub trait BatchGroupArithmetic
where
    Self: Sized + Clone + Copy + Zero + Neg<Output = Self>,
{
-    type BBaseField: Field;
+    type BaseFieldForBatch: Field;

    // We use the w-NAF method, achieving point density of approximately 1/(w + 1)
    // and requiring storage of only 2^(w - 1).
@@ -136,7 +136,7 @@
    fn batch_double_in_place(
        bases: &mut [Self],
        index: &[u32],
-        scratch_space: Option<&mut Vec<Self::BBaseField>>,
+        scratch_space: Option<&mut Vec<Self::BaseFieldForBatch>>,
    );

    /// Mutates bases in place and stores result in the first operand.
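The trait comment above cites the w-NAF method's ~1/(w + 1) nonzero-digit density. For context, here is a self-contained recoding sketch on a machine-word scalar; the crate recodes big integers, so this toy version only illustrates the digit recurrence:

```rust
// Signed w-NAF recoding (illustration only; assumes k < 2^63 so the
// signed arithmetic below cannot overflow).
fn wnaf_recode(mut k: u64, w: u32) -> Vec<i64> {
    assert!((2..=32).contains(&w));
    let modulus = 1i64 << w;
    let mut digits = Vec::new();
    while k != 0 {
        if k & 1 == 1 {
            // Pick the odd digit d ≡ k (mod 2^w), centered around zero,
            // which forces the next w-1 digits to be zero.
            let mut d = (k & (modulus as u64 - 1)) as i64;
            if d >= modulus / 2 {
                d -= modulus;
            }
            digits.push(d);
            k = k.wrapping_sub(d as u64);
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits // k = sum(digits[i] * 2^i); nonzero digits are odd, |d| < 2^(w-1)
}
```

A scalar multiple is then assembled from a precomputed table of odd multiples of the point; the batched code builds such tables for many points at once.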
9 changes: 9 additions & 0 deletions algebra-core/src/curves/cuda/accel_dummy.rs
@@ -0,0 +1,9 @@
#[cfg(not(feature = "std"))]
use alloc::vec::Vec;
pub mod error {
pub type Result<T> = T;
}

pub struct Context {}

pub type DeviceMemory<T> = Vec<T>;
6 changes: 6 additions & 0 deletions algebra-core/src/curves/cuda/mod.rs
@@ -0,0 +1,6 @@
#[macro_use]
pub mod scalar_mul;
pub use scalar_mul::*;

#[cfg(not(feature = "cuda"))]
pub mod accel_dummy;
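Taken together with the "feature gate accel import" bullet above, the dummy module implies the following pattern (a sketch; the downstream module paths are assumptions): code can name `Context`, `DeviceMemory<T>`, and `error::Result<T>` unconditionally and still compile without CUDA.

```rust
// Sketch of the assumed import pattern: one set of names, two providers.
#[cfg(feature = "cuda")]
use accel::*;
#[cfg(not(feature = "cuda"))]
use crate::curves::cuda::accel_dummy::*;

// In the dummy build, DeviceMemory<T> is just Vec<T> and
// error::Result<T> is plain T, so this is an ordinary host allocation;
// with the `cuda` feature it would be device memory instead.
#[cfg(not(feature = "cuda"))]
fn scratch(_ctx: &Context, n: usize) -> error::Result<DeviceMemory<u64>> {
    vec![0u64; n]
}
```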
