This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

CUDA Scalar Mul (#17)
* First draft affine batch ops & wnaf

* changes to mutability and lifetimes

* delete superfluous files

* crazy direction: Passing a FnMut to generate an iterator locally

* unsuccessful further attempts

* compile success using index approach

* fixes for mutable borrows

* Successfully passed scalar mul test

* benchmarks + prefetching

* stash

* generic impl of batch arith for all affinecurves

* batched affine formulas for TE - too expensive

* improved TE affine

* cleanup batch inversion (see the batch-inversion sketch after this list)

* fmt...

* fix minor error

* remove debugging scaffolding

* fmt...

* delete batch arith bench as not suitable for criterion or bench

* fix bench removal errors

* fmt...

* added missing coeff_a

* refactor BatchGroupArithmetic to be separate trait

* Batch verification with radix sort

* Cache-locality & parallelisation

* Successfully impl batch verify

* added tests and bench for batch_ver, parallel_random_gen, & thread util

* fmt

* enabled missing test

* remove voracious_radix_sort

* commented unneeded Instant::now()

* Fixed batch_ver tests for curves of small or unit cofactor

* split recursive and non-recursive, tidy up shared functionality

* reduce max_logn

* adjust max_logn further

* Batch MSM, speedup only for bw6 due to poor cache performance

* fmt...

* GLV iBigInteger

* stash

* stash

* GLV with Parameter-based specialisation

* GLV lattice basis script success

* Successfully passed tests and benched

* Improvements to MSM and bucketed adds using lightweight index sort

* changed rng to be external parameter for non-parallel batch verification

* remove bench print scaffolding

* remove old batch_bucketed_add using vectors instead of fixed offsets

* retain parallel batch_add_split

* Comments for batch arith

* remove need for hashmap for no std for batch_bucketed_add

* minor changes

* cleanup

* cleanup

* fmt + use no_std Vec

* removed std::

* add scratch space

* Add GLV for non-batched SW mul

* fix for glv_scalar_decomposition when k == MODULUS (subgroup check)

* Fixed performance BUG: unnecessary table generation

* GLV -> has_glv(), bigint slice bounds check, refactor batch loops, u32 index

* clean remove of batch_verify

* fix mistake with elems indexing, unused arg for future recursion PR

* trivial errors

* more minor fixes

* fix issues with batch_ver (.is_zero(), TE affine->proj mul)

* fix issue with batch_bucketed_add_split

* fix misnaming

* Success in test and bench \(*v*)/

* tmp commit to cache experimental batch_add_write_shift_..

* remove batch_add_write_shift..

* optional dep, fmt...

* undo accidental deletion of dlsd sort

* fmt...

* cleanup batch bucket add, unify impl

* no std...

* fixed tests

* fixed unimplemented for TE, swapped wnaf table row/col for batchaddwrite

* wnaf table generation uses fewer copies, remove timing instrumentation

* Minor Cleanup

* Add feature-activated timing instrumentation, reduce code bloat (wnaf)

* unused var, no_std

* Make timing macros defined globally, instrument more code

* instrument w/ tid, better num_rounds est. f64, timing black/whitelisting

* Minor changes

* refactor tests, generic MSM test

* 2D test matrix :)

* batchaffine

* tests

* additive features

* big_n feature for test-benching

* prefetch unroll

* minor adjustments

* rename extensions_fields -> extension_fields

* remove artifacts, fix asm

* uncomment subgroup checks, glv param sources

* gpu scalar mul

* fix dependency issues

* Extend GPU scalar mul to all curves

* refactor

* CPU + GPU coprocessing

* With suboptimal BW6 assembly

* add static partitioning

* profiling-based static partitioning (see the partitioning sketch after this list)

* statically partition between multiple gpus

* comments

* BBaseField -> BaseFieldForBatch

* Outline of basic traits

* Remove sw_proj, add gpu support for all sw projective curves

* impl gpu kernels for all curves

* feature-gate with "cuda"

* rename curves/gpu directory to curves/cuda

* Fix merge errors

* Use github rather than local jon-chuang/accel

* again

* again

* update README

* feature = "cuda"

* gpu_standalone (good for non-generic), feature gate under cuda too

* fix merging errors

* make helpers a same-file module

* remove cancerous --all-features from github yml

* Use dummy accel_dummy crate for when not compiling as CUDA

* feature gate accel import

* fix no_std

* fix: gpu-standalone does not depend on algebra-core/cuda

* lazy static optional

* kernel-specific static profile data

* cuda test, cached profile data (in OS cache dir) for all curves

* rectify omission of NAMESPACE, minor errors

* fix no_std, group size in bits too large for 2 groups (mnt6, cp6 - Fq3)

* toml fixes

* update README

* remove extraneous file

* bake in check for oversized group elems

* typo

* remove boilerplate/compactify

* remove standalone

* fmt

* fix println and comments

* fix: typo

* Update README.md

Co-authored-by: Kobi Gurkan <[email protected]>

* Make GPUScalarMulInternal APIs, only expose two APIs (exposing more APIs is future work)

* add ci to test cuda compilation/link and cuda scalar mul when no gpu

* change kernel accel compile branch to master

* fix ci

* use unreachable!() instead of an empty implementation

* install required toolchain

* Empty commit to get CI working

* try to fix ci

* fmt

* fix ci

* safer error handling in gpu code

* fix ci

* handle dirs crate not available without cuda

* don't check early intermediate results

* fix no_std and nightly

* fix remaining errors

* No for_tests

* Feature gate clear profile data

* install cuda library to successfully link

* change the order of CI jobs

* change the order of CI again

* cd ..

* Get rid of caching

* Never all features

* Put back caching

* Remove cuda .deb to save disk space

* Increase max-parallel

* check examples with all features

Co-authored-by: Kobi Gurkan <[email protected]>
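Two of the techniques named in the list above are worth a sketch. The batch-affine arithmetic rests on Montgomery's batch-inversion trick: many field inversions are traded for a single inversion plus a linear number of multiplications. Below is a minimal illustration of the technique against the crate's `Field` trait; it is not the crate's actual implementation, the import path is an assumption, and it assumes no input is zero (the real code must skip zeros).

```rust
use algebra_core::Field; // assumed re-export path

// Montgomery's trick: invert n nonzero field elements with one
// inversion and ~3n multiplications. Illustrative sketch only.
fn batch_inverse<F: Field>(v: &mut [F]) {
    // prefix[i] = v[0] * v[1] * ... * v[i-1]
    let mut prefix = Vec::with_capacity(v.len());
    let mut acc = F::one();
    for f in v.iter() {
        prefix.push(acc);
        acc *= *f;
    }
    // One inversion of the total product serves the whole batch.
    let mut inv = acc.inverse().expect("inputs must be nonzero");
    // Sweep backwards: v[i]^{-1} = prefix[i] * (v[0..=i] product)^{-1},
    // then peel v[i]'s old value off the running inverse.
    for i in (0..v.len()).rev() {
        let original = v[i];
        v[i] = inv * prefix[i];
        inv *= original;
    }
}
```

Likewise, the CPU + GPU coprocessing bullets describe statically partitioning one workload between devices using profiled throughput ratios. A toy sketch of just the split (names are illustrative; per the later bullets, the real code derives the ratio from cached per-kernel profile data):

```rust
// Split a mutable workload at a profiled GPU share, e.g. 0.8 if the GPU
// is measured to handle ~80% of throughput. Illustrative sketch only.
fn partition_static<T>(items: &mut [T], gpu_share: f64) -> (&mut [T], &mut [T]) {
    let cut = ((items.len() as f64) * gpu_share).round() as usize;
    items.split_at_mut(cut.min(items.len())) // (gpu_chunk, cpu_chunk)
}
```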
jon-chuang and kobigurk authored Nov 10, 2020
1 parent c894564 commit 7c518fd
Showing 55 changed files with 1,565 additions and 682 deletions.
47 changes: 33 additions & 14 deletions .github/workflows/ci.yml
@@ -22,7 +22,7 @@ jobs:
          toolchain: stable
          override: true
          components: rustfmt
-
+          default: true
      - name: cargo fmt --check
        uses: actions-rs/cargo@v1
        with:
@@ -35,6 +35,7 @@
    env:
      RUSTFLAGS: -Dwarnings
    strategy:
+      max-parallel: 6
      matrix:
        rust:
          - stable
@@ -50,14 +51,38 @@
          toolchain: ${{ matrix.rust }}
          override: true

-      - uses: actions/cache@v2
-        with:
-          path: |
-            ~/.cargo/registry
-            ~/.cargo/git
-            target
+      - name: Install CUDA toolchains
+        run: |
+          wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
+          sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
+          wget -q https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.1-455.32.00-1_amd64.deb
+          sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.1-455.32.00-1_amd64.deb
+          sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
+          sudo apt-get update
+          sudo apt-get -y install cuda
+          rm cuda-repo-ubuntu*
+          curl -sSL https://github.com/jon-chuang/accel/raw/master/setup_nvptx_toolchain.sh | bash
+      - uses: actions/cache@v2
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            target
+          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

+      - name: Test algebra with CUDA
+        run: |
+          cd algebra
+          cargo test --features "all_curves cuda cuda_test"
+          cd ..
+      - name: Test algebra
+        run: |
+          cd algebra
+          cargo test --features full
+          cd ..
      - name: Check examples
        uses: actions-rs/cargo@v1
        with:
@@ -68,7 +93,7 @@
        uses: actions-rs/cargo@v1
        with:
          command: check
-          args: --examples --all-features --all
+          args: --all-features --examples --all
        if: matrix.rust == 'stable'

      - name: Check benchmarks on nightly
@@ -88,12 +113,6 @@
            --exclude ff-fft-benches \
            -- --skip dpc --skip integration_test"

-      - name: Test algebra
-        run: |
-          cd algebra
-          cargo test --features full
-          cd ..
      - name: Test algebra with assembly
        run: |
          cd algebra
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -15,7 +15,7 @@ members = [
    "r1cs-core",
    "r1cs-std",
    "algebra-core/algebra-core-derive",
-    "scripts/glv_lattice_basis"
+    "scripts/glv_lattice_basis",
]

[profile.release]
7 changes: 7 additions & 0 deletions README.md
@@ -87,6 +87,13 @@ To bench `algebra-benches` with greater accuracy, especially for functions with
cargo +nightly bench --features "n_fold bls12_381"
```

+CUDA support is available for a limited set of functions. To allow compilation for CUDA on Linux, first run the script
+```
+curl -sSL https://github.com/jon-chuang/accel/raw/master/setup_nvptx_toolchain.sh | bash
+```
+or run the equivalent commands for your OS. Then, pass the `cuda` feature to rustc or cargo when compiling, and import the relevant traits (e.g. `GPUScalarMulSlice`) wherever the functions are called.
+
+When the `cuda` feature is not activated, Zexe will still compile. However, if the `cuda` feature is not activated during compilation, or CUDA is not detected on your system at runtime, Zexe will default to a CPU-only implementation of the same functionality.

## License

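For a sense of how the README's instructions translate to code, here is a hedged sketch of calling a CUDA-accelerated scalar multiplication. Only the trait name `GPUScalarMulSlice` is taken from the README above; the import paths, method name, and signature are assumptions for illustration (the commit deliberately exposes only two public APIs):

```rust
// Hedged sketch: paths and the method name below are assumed, not the
// crate's confirmed API.
use algebra::bls12_381::{Fr, G1Projective};
use algebra_core::curves::cuda::scalar_mul::GPUScalarMulSlice; // assumed path

fn multiply_in_place(points: &mut [G1Projective], scalars: &[Fr]) {
    // With the `cuda` feature and a detected GPU, work is coprocessed
    // between CPU and GPU; otherwise this transparently falls back to
    // the CPU implementation, as the README describes.
    points.cpu_gpu_scalar_mul(scalars); // hypothetical method name
}
```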
3 changes: 2 additions & 1 deletion algebra-benches/Cargo.toml
@@ -31,9 +31,10 @@ rand_xorshift = { version = "0.2" }
paste = "1.0"

[features]
-bw6_asm = [ "algebra/bw6_asm"]
asm = [ "algebra/asm"]
prefetch = [ "algebra/prefetch"]
+bw6_asm = [ "algebra/bw6_asm"]
+cuda = [ "algebra/cuda" ]
n_fold = []
mnt4_298 = [ "algebra/mnt4_298"]
mnt6_298 = [ "algebra/mnt6_298"]
14 changes: 11 additions & 3 deletions algebra-core/Cargo.toml
@@ -27,32 +27,40 @@ algebra-core-derive = { path = "algebra-core-derive", optional = true }
derivative = { version = "2", features = ["use_core"] }
num-traits = { version = "0.2", default-features = false }
rand = { version = "0.7", default-features = false }
-rayon = { version = "1", optional = true }
+rayon = { version = "1.3.0", optional = true }
unroll = { version = "=0.1.4" }
itertools = { version = "0.9.0", default-features = false }
either = { version = "1.6.0", default-features = false }
thread-id = { version = "3.3.0", optional = true }
backtrace = { version = "0.3", optional = true }
+accel = { git = "https://github.com/jon-chuang/accel", package = "accel", optional = true }
peekmore = "0.5.6"
+closure = { version = "0.3.0", optional = true }
+lazy_static = { version = "1.4.0", optional = true }
+serde_json = { version = "1.0.58", optional = true }
+dirs = { version = "1.0.5", optional = true }
+log = { version = "0.4.11", optional = true }
paste = "0.1"

[build-dependencies]
field-assembly = { path = "./field-assembly", optional = true }
-cc = "1.0"
rustc_version = "0.2"
+cc = "1.0"

[dev-dependencies]
rand_xorshift = "0.2"

[features]
-bw6_asm = []
default = [ "std", "rand/default" ]
std = []
parallel = [ "std", "rayon", "rand/default" ]
derive = [ "algebra-core-derive" ]
prefetch = [ "std" ]
+cuda = [ "std", "parallel", "accel", "lazy_static", "serde_json", "dirs", "closure", "log" ]

timing = [ "std", "backtrace" ]
timing_detailed = [ "std", "backtrace" ]
timing_thread_id = [ "thread-id" ]

llvm_asm = [ "field-assembly" ]
+bw6_asm = []
2 changes: 1 addition & 1 deletion algebra-core/algebra-core-derive/Cargo.toml
@@ -27,4 +27,4 @@ proc-macro = true
[dependencies]
proc-macro2 = "1.0"
syn = "1.0"
-quote = "1.0"
+quote = "1.0.7"
2 changes: 1 addition & 1 deletion algebra-core/mince/Cargo.toml
@@ -7,7 +7,7 @@ edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
-quote = "1.0"
+quote = "1.0.7"
syn = {version = "1.0.17", features = ["full"]}

[lib]
2 changes: 1 addition & 1 deletion algebra-core/src/bytes.rs
@@ -316,7 +316,7 @@ mod test {
    fn test_macro_empty() {
        let array: Vec<u8> = vec![];
        let bytes: Vec<u8> = to_bytes![array].unwrap();
-        assert_eq!(&bytes, &[]);
+        assert_eq!(bytes, Vec::<u8>::new());
        assert_eq!(bytes.len(), 0);
    }
4 changes: 2 additions & 2 deletions algebra-core/src/curves/batch_arith.rs
@@ -25,7 +25,7 @@ pub trait BatchGroupArithmetic
where
    Self: Sized + Clone + Copy + Zero + Neg<Output = Self>,
{
-    type BBaseField: Field;
+    type BaseFieldForBatch: Field;

    // We use the w-NAF method, achieving point density of approximately 1/(w + 1)
    // and requiring storage of only 2^(w - 1).
@@ -136,7 +136,7 @@
    fn batch_double_in_place(
        bases: &mut [Self],
        index: &[u32],
-        scratch_space: Option<&mut Vec<Self::BBaseField>>,
+        scratch_space: Option<&mut Vec<Self::BaseFieldForBatch>>,
    );

    /// Mutates bases in place and stores result in the first operand.
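The trait comment above cites the w-NAF method's ~1/(w + 1) nonzero-digit density. For context, here is a self-contained recoding sketch on a machine-word scalar; the crate recodes big integers, so this toy version only illustrates the digit recurrence:

```rust
// Signed w-NAF recoding (illustration only; assumes k < 2^63 so the
// signed arithmetic below cannot overflow).
fn wnaf_recode(mut k: u64, w: u32) -> Vec<i64> {
    assert!((2..=32).contains(&w));
    let modulus = 1i64 << w;
    let mut digits = Vec::new();
    while k != 0 {
        if k & 1 == 1 {
            // Pick the odd digit d ≡ k (mod 2^w), centered around zero,
            // which forces the next w-1 digits to be zero.
            let mut d = (k & (modulus as u64 - 1)) as i64;
            if d >= modulus / 2 {
                d -= modulus;
            }
            digits.push(d);
            k = k.wrapping_sub(d as u64);
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits // k = sum(digits[i] * 2^i); nonzero digits are odd, |d| < 2^(w-1)
}
```

A scalar multiple is then assembled from a precomputed table of odd multiples of the point; the batched code builds such tables for many points at once.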
9 changes: 9 additions & 0 deletions algebra-core/src/curves/cuda/accel_dummy.rs
@@ -0,0 +1,9 @@
#[cfg(not(feature = "std"))]
use alloc::vec::Vec;
pub mod error {
pub type Result<T> = T;
}

pub struct Context {}

pub type DeviceMemory<T> = Vec<T>;
6 changes: 6 additions & 0 deletions algebra-core/src/curves/cuda/mod.rs
@@ -0,0 +1,6 @@
#[macro_use]
pub mod scalar_mul;
pub use scalar_mul::*;

#[cfg(not(feature = "cuda"))]
pub mod accel_dummy;
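Taken together with the "feature gate accel import" bullet above, the dummy module implies the following pattern (a sketch; the downstream module paths are assumptions): code can name `Context`, `DeviceMemory<T>`, and `error::Result<T>` unconditionally and still compile without CUDA.

```rust
// Sketch of the assumed import pattern: one set of names, two providers.
#[cfg(feature = "cuda")]
use accel::*;
#[cfg(not(feature = "cuda"))]
use crate::curves::cuda::accel_dummy::*;

// In the dummy build, DeviceMemory<T> is just Vec<T> and
// error::Result<T> is plain T, so this is an ordinary host allocation;
// with the `cuda` feature it would be device memory instead.
#[cfg(not(feature = "cuda"))]
fn scratch(_ctx: &Context, n: usize) -> error::Result<DeviceMemory<u64>> {
    vec![0u64; n]
}
```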
