
CoreNEURON error when running on NVIDIA Grace Hopper GPU and NVHPC 24.9 #3144

Open
iraikov opened this issue Oct 26, 2024 · 11 comments


iraikov commented Oct 26, 2024

Context

Overview of the issue

Hello, I am trying to get CoreNEURON to run on Grace Hopper nodes on TACC Vista. I have compiled NEURON from the master branch with NVHPC 24.9. Unfortunately, the following error occurs (detailed log below):

FATAL ERROR: data in update device clause was not found on device 1: name=_ZN10coreneuron10pas_globalE
 file:/home1/03320/iraikov/src/aarch64/corenrn/mod2c/passive.cpp _ZN10coreneuron12nrn_init_pasEPNS_9NrnThreadEPNS_9Memb_listEi line:268

Expected result/behavior

Successful invocation of psolve.

NEURON setup

  • Version: master branch
  • Installation method: cmake build
  • OS + Version: CentOS 9.3
  • Compiler + Version: nvhpc 24.9

Minimal working example - MWE

MWE that can be used for reproducing and testing the issue:

  • python script:
import sys, os, itertools, argparse, time
import numpy as np
from neuron import h, gui, coreneuron

h.nrnmpi_init()

cells = []
nclist = []
vrecs = []
stims = []

class MyCell:
    _ids = itertools.count(0)

    def __repr__(self):
        return 'MyCell[%d]' % self.id

    def __init__(self):
        self.id = next(self._ids)
        # create the morphology and connect it
        self.soma = h.Section(name='soma', cell=self)
        self.dend = h.Section(name='dend', cell=self)
        self.dend.connect(self.soma(0.5))
        self.soma.insert('pas')
        self.dend.insert('pas')
        self.dend(0.5).pas.e = -65
        self.soma(0.5).pas.e = -65
        self.synlist = []
        self.all = h.SectionList([self.soma, self.dend])

def mkcells(pc, ngids):
    nranks = int(pc.nhost())
    myrank = int(pc.id())

    for gid in range(ngids):

        if gid % nranks == myrank:

            cell = MyCell()
            nc = h.NetCon(cell.soma(0.5)._ref_v, None, sec=cell.soma)
            pc.set_gid2node(gid, myrank)
            pc.cell(gid, nc, 1)
            cells.append(cell)

            # Current injection into section
            stim = h.IClamp(cell.soma(0.5))
            if gid % 2 == 0:
                stim.delay = 10
            else:
                stim.delay = 20
            stim.dur = 20
            stim.amp = 10
            stims.append(stim)

            # Record membrane potential
            v = h.Vector()
            v.record(cell.dend(0.5)._ref_v)
            vrecs.append(v)

            if myrank == 0:
                print("Rank %i: created gid %i; stim delay = %.02f" % (myrank, gid, stim.delay))

## Create connections:
def connectcells(pc, ngids):
    nranks = int(pc.nhost())
    myrank = int(pc.id())

    for gid in range(0, ngids, 2):

        # source gid: all even gids
        src = gid
        # destination gid: all odd gids
        dst = gid + 1

        # place the synapse on the destination cell, which is local here
        if pc.gid_exists(dst) > 0:
            cell = pc.gid2cell(dst)
            sec = cell.dend
            syn = h.Exp2Syn(sec(0.5))
            nc = pc.gid_connect(src, syn)
            nc.delay = 0.5
            nclist.append(nc)
            cell.synlist.append(syn)

def main():
    parser = argparse.ArgumentParser(description='Parallel CoreNEURON test.')
    parser.add_argument('--result-prefix', default='.',
                        help='place output files in given directory (must exist before launch)')
    parser.add_argument('--ngids', default=2, type=int,
                        help='number of gids to create (must be even)')
    args, unknown = parser.parse_known_args()

    coreneuron.enable = True
    coreneuron.gpu = True
    coreneuron.verbose = 1

    pc = h.ParallelContext()
    myrank = int(pc.id())

    if myrank == 0:
        print("numprocs = %d" % int(pc.nhost()))

    mkcells(pc, args.ngids)
    pc.barrier()
    if myrank == 0:
        print("created cells")

    connectcells(pc, args.ngids)

    pc.barrier()
    if myrank == 0:
        print("created connections")

    h.cvode.use_fast_imem(1)
    h.cvode.cache_efficient(1)

    rec_t = h.Vector()
    rec_t.record(h._ref_t)

    wt = time.time()

    h.dt = 0.25
    pc.set_maxstep(10)

    h.finitialize(-65)
    pc.psolve(500)

    total_wt = time.time() - wt

    print('rank %d: total compute time: %.02f' % (myrank, total_wt))
    output = itertools.chain([np.asarray(rec_t.to_python())],
                             [np.asarray(vrec.to_python()) for vrec in vrecs])
    np.savetxt("%s/ParCoreNeuron_%04i.dat" % (args.result_prefix, myrank),
               np.column_stack(tuple(output)))

    pc.runworker()
    pc.done()

    h.quit()

main()
  • Environment
module list
Currently Loaded Modules:
  1) ucc/1.3.0    3) cmake/3.29.5   5) TACC          7) cuda/12.6            (g)
  2) ucx/1.17.0   4) xalt/3.1       6) nvidia/24.9   8) openmpi/5.0.5_nvc249

  Where:
   g:  built for GPU

  • CMake build commands:
git clone [email protected]:neuronsimulator/nrn.git
mkdir build && cd build
cmake ../nrn -DNRN_ENABLE_INTERVIEWS=OFF  -DNRN_ENABLE_MPI=ON -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_PYTHON=ON -DPYTHON_EXECUTABLE:FILEPATH=`which python3` -DCMAKE_INSTALL_PREFIX=$SCRATCH/bin/nrnpython -DReadline_ROOT_DIR:PATH=$HOME/bin/readline -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=ON -DCMAKE_C_COMPILER=nvc   -DCMAKE_CUDA_COMPILER=nvcc   -DCMAKE_CXX_COMPILER=nvc++
cmake --build . --parallel 8

Logs

nvidia-smi 
Fri Oct 25 19:23:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   26C    P0             89W /  900W |       4MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ ibrun python3 test_par_coreneuron.py
TACC:  Starting up job 57280 
TACC:  Setting up parallel environment for OpenMPI mpirun. 
TACC:  Starting parallel tasks... 
numprocs=1
numprocs = 1
Rank 0: created gid 0; stim delay = 10.00
Rank 0: created gid 1; stim delay = 20.00
created cells
created connections
 num_mpi=1

 Info : 1 GPUs shared by 1 ranks per node
 
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 9.0.0 510b9bc94 (2024-10-21 20:50:56 -0400)
 
 Additional mechanisms from files
 exp2syn.mod expsyn.mod ggap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 372.6875, Min 372.6875, Avg 372.6875 
 GPU Memory (MiBs) : Used = 559.062500, Free = 96720.937500, Total = 97280.000000
 Memory (MBs) :            After MPI_Init : Max 372.6875, Min 372.6875, Avg 372.6875 
 GPU Memory (MiBs) : Used = 559.062500, Free = 96720.937500, Total = 97280.000000
 Memory (MBs) :          Before nrn_setup : Max 372.6875, Min 372.6875, Avg 372.6875 
 GPU Memory (MiBs) : Used = 559.062500, Free = 96720.937500, Total = 97280.000000
 Setup Done   : 0.00 seconds 
 Model size   : 4.56 kB
 Memory (MBs) :          After nrn_setup  : Max 372.5000, Min 372.5000, Avg 372.5000 
 GPU Memory (MiBs) : Used = 561.062500, Free = 96718.937500, Total = 97280.000000
GENERAL PARAMETERS
--mpi=true
--mpi-lib=
--gpu=true
--dt=0.25
--tstop=500

GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false

INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=     

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 372.5000, Min 372.5000, Avg 372.5000 
 GPU Memory (MiBs) : Used = 561.062500, Free = 96718.937500, Total = 97280.000000
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 9.0, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x400012638e40 device:0x400279efa400 size:4 presentcount:0+1 line:-1 name:(null)
host:0x400012638e58 device:0x400279efa000 size:8 presentcount:0+1 line:-1 name:(null)
host:0x400012638e60 device:0x400279efa200 size:8 presentcount:0+1 line:-1 name:(null)
host:0xaaaadb0bba20 device:0x400279efa600 size:136 presentcount:0+1 line:-1 name:(null)
host:0xaaaadb42bd80 device:0x400279efa800 size:24 presentcount:0+1 line:-1 name:(null)
host:0xaaaadb42bf50 device:0x400279efaa00 size:24 presentcount:0+1 line:-1 name:(null)
host:0xaaaadb43b5c0 device:0x400279efac00 size:128 presentcount:0+1 line:-1 name:(null)
host:0xaaaadb43e4e0 device:0x400279efae00 size:48 presentcount:0+1 line:-1 name:(null)
allocated block device:0x400279efa000 size:512 thread:1
allocated block device:0x400279efa200 size:512 thread:1
allocated block device:0x400279efa400 size:512 thread:1
allocated block device:0x400279efa600 size:512 thread:1
allocated block device:0x400279efa800 size:512 thread:1
allocated block device:0x400279efaa00 size:512 thread:1
allocated block device:0x400279efac00 size:512 thread:1
allocated block device:0x400279efae00 size:512 thread:1
FATAL ERROR: data in update device clause was not found on device 1: name=_ZN10coreneuron10pas_globalE
 file:/home1/03320/iraikov/src/aarch64/corenrn/mod2c/passive.cpp _ZN10coreneuron12nrn_init_pasEPNS_9NrnThreadEPNS_9Memb_listEi line:268

@iraikov iraikov added the bug label Oct 26, 2024

pramodk commented Oct 28, 2024

@iraikov : could you turn off unified memory and try? i.e. -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=OFF
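For reference, a clean out-of-tree reconfigure with unified memory disabled might look like this (a sketch based on the build commands in the report; the source path and install prefix are illustrative, not verified):

```shell
# Hypothetical clean rebuild; paths/prefix are placeholders.
rm -rf build && mkdir build && cd build
cmake ../nrn \
  -DNRN_ENABLE_INTERVIEWS=OFF \
  -DNRN_ENABLE_MPI=ON \
  -DNRN_ENABLE_RX3D=OFF \
  -DNRN_ENABLE_CORENEURON=ON \
  -DCORENRN_ENABLE_GPU=ON \
  -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=OFF \
  -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc
cmake --build . --parallel 8
```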


iraikov commented Oct 28, 2024

@iraikov : could you turn off unified memory and try? i.e. -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=OFF

@pramodk It seems that I get the same error if I omit the unified memory option at build time. Does CoreNEURON indicate whether it is using the unified memory interface at run time?
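For what it's worth, the NVHPC OpenACC runtime can be made to report its data movement at run time through environment variables (these are NVHPC runtime knobs, not CoreNEURON options; the bit values below are illustrative):

```shell
# NVHPC OpenACC runtime diagnostics (not CoreNEURON-specific).
# NV_ACC_NOTIFY is a bitmask: 1 = kernel launches, 2 = data transfers.
export NV_ACC_NOTIFY=3
# Verbose present-table / data-mapping output; the FATAL ERROR message
# above hints at the 0x800 bit of NV_ACC_DEBUG.
export NV_ACC_DEBUG=0x800
ibrun python3 test_par_coreneuron.py
```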


pramodk commented Oct 28, 2024

Did you clean the old build directory? Just in case...!

I don't have access to Grace Hopper, but on my local Ubuntu box with NVHPC v24.9, my full CMake log is below. In the CoreNEURON section, we see -- Unified Memory | OFF:

$ cmake .. -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_MPI=ON -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DCMAKE_INSTALL_PREFIX=`pwd`/install -DNRN_ENABLE_TESTS=OFF -DCORENRN_ENABLE_GPU=ON
-- The C compiler identification is NVHPC 24.9.0
-- The CXX compiler identification is NVHPC 24.9.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting build type to 'RelWithDebInfo' as none was specified.
-- The compiler /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc++ has no support for OpenMP SIMD construct
-- 3rd party project: using Random123 from "external/Random123"
-- 3rd party project: using eigen from "external/eigen"
-- Sub-project : using fmt from from /home/kumbhar/workarena/repos/bbp/nrn/external/fmt
-- {fmt} version: 11.0.2
-- Build type: RelWithDebInfo
-- No python executable specified. Looking for `python3` in the PATH...
-- Checking if /usr/bin/python3 is a working python
-- Found BISON: /usr/bin/bison (found version "3.8.2")
-- Found FLEX: /usr/bin/flex (found suitable version "2.6.4", minimum required is "2.6")
-- Found Readline: /usr/include
-- Found MPI_C: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Detected OpenMPI 4.1.7
-- Sub-project : using nanobind from from /home/kumbhar/workarena/repos/bbp/nrn/external/nanobind
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- 3rd party project: using CLI11 from "external/CLI11"
-- Building CoreNEURON
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Setting default CUDA architectures to 70;80
-- The CUDA compiler identification is NVIDIA 12.6.20
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/cuda/12.6/include (found suitable version "12.6.20", minimum required is "9.0")
-- Could NOT find nmodl (missing: nmodl_BINARY)
-- Sub-project : using nmodl from from /home/kumbhar/workarena/repos/bbp/nrn/external/nmodl
-- CHECKING FOR FLEX/BISON
-- Found BISON: /usr/bin/bison (found suitable version "3.8.2", minimum required is "3.0")
-- NMODL_TEST_FORMATTING: OFF
-- NMODL_GIT_HOOKS: OFF
-- NMODL_GIT_COMMIT_HOOKS:
-- NMODL_GIT_PUSH_HOOKS: courtesy-msg
-- NMODL_STATIC_ANALYSIS: OFF
-- NMODL_TEST_STATIC_ANALYSIS: OFF
-- 3rd party project: using json from "ext/json"
-- Using the multi-header code from /home/kumbhar/workarena/repos/bbp/nrn/external/nmodl/ext/json/include/
-- 3rd party project: using pybind11 from "ext/pybind11"
-- pybind11 v2.12.0
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.10.12", minimum required is "3.6")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.10.so
-- 3rd party project: using spdlog from "ext/spdlog"
-- Build spdlog: 1.13.0
-- Build type: RelWithDebInfo
-- CHECKING FOR PYTHON
-- Found Python: /usr/bin/python3.10 (found suitable version "3.10.12", minimum required is "3.8") found components: Interpreter
--
-- Configured NMODL 0.6 (e6250014d 2024-09-10 09:00:35 -0400)
--
-- You can now build NMODL using:
--   cmake --build . --parallel 8 [--target TARGET]
-- You might want to adjust the number of parallel build jobs for your system.
-- Some non-default targets you might want to build:
-- --------------------+--------------------------------------------------------
--  Target             |   Description
-- --------------------+--------------------------------------------------------
-- test                | Run unit tests
-- install             | Will install NMODL to: /home/kumbhar/workarena/repos/bbp/nrn/build_gpu/install
-- --------------------+--------------------------------------------------------
--  Build option       | Status
-- --------------------+--------------------------------------------------------
-- CXX COMPILER        | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc++
-- COMPILE FLAGS       |  -mp  -g  -O2   -Wc,--pending_instantiations=0
-- Build Type          | RelWithDebInfo
-- Python Bindings     | OFF
-- Flex                | /usr/bin/flex
-- Bison               | /usr/bin/bison
-- Python              | /usr/bin/python3
--   Linked against    | ON
-- --------------------+--------------------------------------------------------
--  See documentation : https://github.com/BlueBrain/nmodl/
-- --------------------+--------------------------------------------------------
--
--
-- CoreNEURON is enabled with following build configuration:
-- --------------------+--------------------------------------------------------
--  Build option       | Status
-- --------------------+--------------------------------------------------------
-- CXX COMPILER        | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc++
-- COMPILE FLAGS       |  -mp  -g  -O2   --c++17 -cuda -gpu=cuda12.6,lineinfo,cc70,cc80 -mp=gpu -Mautoinline -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DCORENEURON_BUILD -DHAVE_MALLOC_H -DCORENRN_BUILD=1 -DEIGEN_DONT_PARALLELIZE -DEIGEN_DONT_VECTORIZE=1 -DNRNMPI=1 -DLAYOUT=0 -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING
-- Build Type          | SHARED
-- MPI                 | ON
--   DYNAMIC           | OFF
--   INC               | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/event/libevent2022/libevent;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include
-- OpenMP              | ON
-- NMODL PATH          | /home/kumbhar/workarena/repos/bbp/nrn/build_gpu/bin/nmodl
-- NMODL FLAGS         |
-- GPU Support         | ON
--   CUDA              | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/cuda/12.6/lib64
--   Offload           | OpenMP
--   Unified Memory    | OFF
-- Auto Timeout        | ON
-- Wrap exp()          | OFF
-- SplayTree Queue     | ON
-- NetReceive Buffer   | ON
-- Caliper             | OFF
-- Likwid              | OFF
-- Unit Tests          | OFF
-- Reporting           | OFF
-- --------------------+--------------------------------------------------------
--
Extracting link flags from target 'nrngnu', beware that this can be fragile. Got:
Extracting link flags from target 'sparse13', beware that this can be fragile. Got:
Extracting link flags from target 'fmt::fmt', beware that this can be fragile. Got:
For 'nrnpython' going to see TARGET 'fmt::fmt' recursively.
Extracting link flags from target 'fmt::fmt', beware that this can be fragile. Got:  /usr/lib/x86_64-linux-gnu/libpython3.10.so;fmt::fmt;nanobind
For 'nrnpython' going to see TARGET 'nanobind' recursively.
Extracting link flags from target 'nanobind', beware that this can be fragile. Got:  /usr/lib/x86_64-linux-gnu/libpython3.10.so;fmt::fmt;nanobind
Extracting link flags from target 'nrnpython', beware that this can be fragile. Got:  /usr/lib/x86_64-linux-gnu/libpython3.10.so;fmt::fmt;nanobind
Extracting link flags from target 'Threads::Threads', beware that this can be fragile. Got:  /usr/lib/x86_64-linux-gnu/libpython3.10.so;fmt::fmt;nanobind
Generating link flags from path /usr/lib/x86_64-linux-gnu/libreadline.so Got: /usr/lib/x86_64-linux-gnu/libreadline.so -Wl,-rpath,/usr/lib/x86_64-linux-gnu
Generating link flags from path /usr/lib/x86_64-linux-gnu/libcurses.so Got: /usr/lib/x86_64-linux-gnu/libcurses.so -Wl,-rpath,/usr/lib/x86_64-linux-gnu
Generating link flags from path /usr/lib/x86_64-linux-gnu/libform.so Got: /usr/lib/x86_64-linux-gnu/libform.so -Wl,-rpath,/usr/lib/x86_64-linux-gnu
Generating link flags from path /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib/libmpi.so Got: /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib/libmpi.so -Wl,-rpath,/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib
Generating link flags from name 'dl', beware that this can be fragile. Got: -ldl
--
-- Configured NEURON 9.0.0
--
-- You can now build NEURON using:
--   cmake --build . --parallel 8 [--target TARGET]
-- You might want to adjust the number of parallel build jobs for your system.
-- Some non-default targets you might want to build:
-- --------------+--------------------------------------------------------------
--  Target       |   Description
-- --------------+--------------------------------------------------------------
-- install       | Will install NEURON to: /home/kumbhar/workarena/repos/bbp/nrn/build_gpu/install
--               | Change the install location of NEURON using:
--               |   cmake <src_path> -DCMAKE_INSTALL_PREFIX=<install_path>
-- docs          | Build full docs. Calls targets: doxygen, notebooks, sphinx, notebooks-clean
-- uninstall     | Removes files installed by make install (todo)
-- --------------+--------------------------------------------------------------
--  Build option | Status
-- --------------+--------------------------------------------------------------
-- C COMPILER    | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc
-- CXX COMPILER  | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc++
-- BUILD_TYPE    | RelWithDebInfo (allowed: Custom;Debug;Release;RelWithDebInfo;Fast;FastDebug)
-- COMPILE FLAGS | -g  -O2   --diag_suppress=1,47,111,128,170,174,177,186,541,550,816,2465 -noswitcherror
-- Shared        | ON
-- MPI           | ON
--   DYNAMIC     | OFF
--   INC         | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/event/libevent2022/libevent;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include
--   LIB         | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib/libmpi.so
-- Python        | ON
--   DYNAMIC     | OFF
--   MODULE      | ON
--  python3.10 (default)
--   EXE         | /usr/bin/python3
--   INC         | /usr/include/python3.10
--   LIB         | /usr/lib/x86_64-linux-gnu/libpython3.10.so
-- Readline      | /usr/lib/x86_64-linux-gnu/libreadline.so
-- Curses        | /usr/lib/x86_64-linux-gnu/libcurses.so;/usr/lib/x86_64-linux-gnu/libform.so
-- RX3D          | OFF
-- Interviews    | OFF
-- CoreNEURON    | ON
--   PATH        | /home/kumbhar/workarena/repos/bbp/nrn/src/coreneuron
--   LINK FLAGS  | -cuda -gpu=cuda12.6,lineinfo,cc70,cc80 -mp=gpu -lcorenrnmech  -lcoreneuron-cuda -Wl,-rpath,/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib/libmpi.so -ldl
-- Tests         | OFF
-- --------------+--------------------------------------------------------------
--  See documentation : https://www.neuron.yale.edu/neuron/
-- --------------+--------------------------------------------------------------
--
-- Configuring done
-- Generating done
-- Build files have been written to: /home/kumbhar/workarena/repos/bbp/nrn/build_gpu

and then make -j8 && make install. I am then able to run your test on the GPU:

# nsys profile just for a sanity check

$ nsys nvprof /home/kumbhar/workarena/repos/bbp/nrn/build_gpu/install/bin/nrniv -python test.py
WARNING: nrniv and any of its children processes will be profiled.

Collecting data...
NEURON -- VERSION 9.0.dev-1246-g797e9b0a8+ HEAD (797e9b0a8+) 2023-01-04
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits

 Info : 1 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 9.0.0 2434192bc (2024-10-17 16:11:44 +0200)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 691.2656, Min 691.2656, Avg 691.2656
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Memory (MBs) :            After MPI_Init : Max 691.2656, Min 691.2656, Avg 691.2656
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Memory (MBs) :          Before nrn_setup : Max 691.2656, Min 691.2656, Avg 691.2656
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Setup Done   : 0.00 seconds
 Model size   : 4.56 kB
 Memory (MBs) :          After nrn_setup  : Max 691.7500, Min 691.7500, Avg 691.7500
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.25
--tstop=500

GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false

INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 691.7500, Min 691.7500, Avg 691.7500
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Memory (MBs) :     After nrn_finitialize : Max 691.9883, Min 691.9883, Avg 691.9883
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500

psolve |=========================================================| t: 500.00 ETA: 0h00m01s

Solver Time : 0.390506


 Simulation Statistics
 Number of cells: 2
 Number of compartments: 10
 Number of presyns: 2
 Number of input presyns: 0
 Number of synapses: 1
 Number of point processes: 3
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 0
 Number of spikes with non negative gid-s: 0
numprocs = 1
Rank 0: created gid 0; stim delay = 10.00
Rank 0: created gid 1; stim delay = 20.00
created cells
created connections
rank 0: total compute time: 0.63
Generating '/tmp/nsys-report-f933.qdstrm'
[1/7] [========================100%] report2.nsys-rep
[2/7] [========================100%] report2.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /home/kumbhar/workarena/repos/bbp/nrn/ivan/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)             Name
 --------  ---------------  ---------  ----------  ----------  --------  --------  -----------  --------------------------
     63.0        238542651      46161      5167.6      4620.0       177     33643       2306.9  cuStreamSynchronize
     13.4         50750148      42000      1208.3      1179.0      1041    156736        829.8  cuLaunchKernel
      8.4         31962472          2  15981236.0  15981236.0   1369231  30593241   20664495.6  cudaProfilerStop
      7.7         29251053       2044     14310.7     14484.0      5204     35330       1533.2  cuMemcpyDtoHAsync_v2
      5.4         20595070          1  20595070.0  20595070.0  20595070  20595070          0.0  cuMemAllocManaged
      1.1          4288147       4134      1037.3      1014.0       886      6575        169.1  cuMemcpyHtoDAsync_v2
      0.4          1644335       2006       819.7       785.0       677     15841        457.1  cuMemsetD32Async
      0.2           938734          1    938734.0    938734.0    938734    938734          0.0  cudaGetFuncBySymbol_v11000
      0.1           437050          1    437050.0    437050.0    437050    437050          0.0  cuMemAllocHost_v2
      0.0           129572         62      2089.9       864.0       710     59470       7438.3  cuMemAlloc_v2
      0.0            43339          6      7223.2      5951.5      5033     13387       3212.2  cudaMemGetInfo
      0.0            42526        412       103.2        77.0        42      2366        126.2  cuGetProcAddress_v2
      0.0             6084          2      3042.0      3042.0      1500      4584       2180.7  cudaProfilerStart
      0.0             4742          1      4742.0      4742.0      4742      4742          0.0  cudaFree
      0.0             1571          2       785.5       785.5       527      1044        365.6  cuInit
      0.0              746          4       186.5       127.5        46       445        188.4  cuCtxSetCurrent
      0.0              310          1       310.0       310.0       310       310          0.0  cuFuncGetModule
      0.0               82          1        82.0        82.0        82        82          0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     16.9         19083822       2000    9541.9    9536.0      7808     13632        795.0  nvkernel__ZN10coreneuron18solve_interleaved1Ei_F580L793_4
     10.5         11918583       2000    5959.3    6048.0      4928      6432        294.4  nvkernel__ZN10coreneuron17nrn_state_Exp2SynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L477_23
      9.0         10127902       2000    5064.0    5120.0      4160      5856        257.0  nvkernel__ZN10coreneuron14nrn_cur_IClampEPNS_9NrnThreadEPNS_9Memb_listEi_F1L321_9
      8.3          9413280       2000    4706.6    4768.0      3872      5344        233.1  nvkernel__ZN10coreneuron15nrn_cur_Exp2SynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L436_16
      5.5          6240543       2000    3120.3    3168.0      2560      3744        156.5  nvkernel__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_F1L305_9
      4.6          5204375       4000    1301.1    1312.0      1024      1793         70.8  nvkernel__ZN10coreneuron23net_buf_receive_Exp2SynEPNS_9NrnThreadE_F1L340_2
      4.3          4819276       2000    2409.6    2432.0      1952      2688        122.3  nvkernel__ZN95_INTERNAL_73__home_kumbhar_workarena_repos_bbp_nrn_src_coreneuron_sim_treeset_core_cp…
      4.1          4650301       2000    2325.2    2368.0      1888      2401        118.4  nvkernel__ZN95_INTERNAL_73__home_kumbhar_workarena_repos_bbp_nrn_src_coreneuron_sim_treeset_core_cp…
      3.9          4393077       2000    2196.5    2209.0      1791      2688        112.2  nvkernel__ZN10coreneuron23nrncore2nrn_send_valuesEPNS_9NrnThreadE_F1L295_18
....

Could you copy the output of the cmake configure step?

@iraikov
Contributor Author

iraikov commented Oct 28, 2024

Hello @pramodk, I did remove the old build directory, but thanks for checking! Below is my cmake configure log. It appears that in your CoreNEURON "GPU" section, the Offload setting is OpenMP, whereas mine is OpenACC. Could this be the culprit?

build_config.log

@pramodk
Member

pramodk commented Oct 28, 2024

I added -DCORENRN_ENABLE_OPENMP=OFF, and that enabled OpenACC:
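For reference, a configure invocation along these lines should reproduce that build (a sketch, not copied from my shell history — the source/install paths are placeholders, and only the CORENRN options are the ones discussed in this thread):

```shell
# Hypothetical OpenACC-offload configure for NEURON + CoreNEURON with NVHPC.
# Run from an empty build directory inside the nrn source checkout.
cmake .. \
  -DCMAKE_C_COMPILER=nvc \
  -DCMAKE_CXX_COMPILER=nvc++ \
  -DNRN_ENABLE_CORENEURON=ON \
  -DCORENRN_ENABLE_GPU=ON \
  -DCORENRN_ENABLE_OPENMP=OFF \
  -DCMAKE_INSTALL_PREFIX=../install
cmake --build . --parallel
cmake --build . --target install
```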

-- CoreNEURON is enabled with following build configuration:
-- --------------------+--------------------------------------------------------
--  Build option       | Status
-- --------------------+--------------------------------------------------------
-- CXX COMPILER        | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc++
-- COMPILE FLAGS       | -g  -O2   --c++17 -cuda -gpu=cuda12.6,lineinfo,cc70,cc80 -acc -Mautoinline -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_BUILD -DHAVE_MALLOC_H -DCORENRN_BUILD=1 -DEIGEN_DONT_PARALLELIZE -DEIGEN_DONT_VECTORIZE=1 -DNRNMPI=1 -DLAYOUT=0 -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING
-- Build Type          | SHARED
-- MPI                 | ON
--   DYNAMIC           | OFF
--   INC               | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/event/libevent2022/libevent;/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include
-- OpenMP              | OFF
-- NMODL PATH          | /home/kumbhar/workarena/repos/bbp/nrn/build_gpu_acc/bin/nmodl
-- NMODL FLAGS         |
-- GPU Support         | ON
--   CUDA              | /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/cuda/12.6/lib64
--   Offload           | OpenACC
--   Unified Memory    | OFF
-- Auto Timeout        | ON
-- Wrap exp()          | OFF
-- SplayTree Queue     | ON
-- NetReceive Buffer   | ON
-- Caliper             | OFF
-- Likwid              | OFF
-- Unit Tests          | OFF

and the run still finished without errors:

$ nsys nvprof /home/kumbhar/workarena/repos/bbp/nrn/build_gpu_acc/install/bin/nrniv -python test.py
WARNING: nrniv and any of its children processes will be profiled.

Collecting data...
NEURON -- VERSION 9.0.dev-1246-g797e9b0a8+ HEAD (797e9b0a8+) 2023-01-04
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits

 Info : 1 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 9.0.0 2434192bc (2024-10-17 16:11:44 +0200)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 690.1641, Min 690.1641, Avg 690.1641
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Memory (MBs) :            After MPI_Init : Max 690.1641, Min 690.1641, Avg 690.1641
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Memory (MBs) :          Before nrn_setup : Max 690.1641, Min 690.1641, Avg 690.1641
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Setup Done   : 0.00 seconds
 Model size   : 4.56 kB
 Memory (MBs) :          After nrn_setup  : Max 690.1641, Min 690.1641, Avg 690.1641
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.25
--tstop=500

GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false

INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 690.1641, Min 690.1641, Avg 690.1641
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500
 Memory (MBs) :     After nrn_finitialize : Max 690.7969, Min 690.7969, Avg 690.7969
 GPU Memory (MiBs) : Used = 245.000000, Free = 5678.812500, Total = 5923.812500

psolve |=========================================================| t: 500.00 ETA: 0h00m00s

Solver Time : 0.263252


 Simulation Statistics
 Number of cells: 2
 Number of compartments: 10
 Number of presyns: 2
 Number of input presyns: 0
 Number of synapses: 1
 Number of point processes: 3
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 0
 Number of spikes with non negative gid-s: 0
numprocs = 1
Rank 0: created gid 0; stim delay = 10.00
Rank 0: created gid 1; stim delay = 20.00
created cells
created connections
rank 0: total compute time: 0.46
Generating '/tmp/nsys-report-3cb0.qdstrm'
[1/7] [========================100%] report3.nsys-rep
[2/7] [========================100%] report3.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /home/kumbhar/workarena/repos/bbp/nrn/ivan/report3.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)             Name
 --------  ---------------  ---------  ----------  ----------  --------  --------  -----------  --------------------------
     55.7        108157789      36162      2990.9       202.0       150     37026       4569.9  cuStreamSynchronize
     26.8         52061729      42000      1239.6      1208.0      1028    114231        664.3  cuLaunchKernel
      5.8         11304352          1  11304352.0  11304352.0  11304352  11304352          0.0  cuMemHostAlloc
      5.4         10557036          2   5278518.0   5278518.0   1372485   9184551    5523964.8  cudaProfilerStop
      2.4          4696424       4139      1134.7      1125.0       874     16344        313.9  cuMemcpyHtoDAsync_v2
      1.3          2468031       2044      1207.5      1084.0       983     23143        803.0  cuMemcpyDtoHAsync_v2
      0.8          1582809       2001       791.0       772.0       679      2974        136.6  cuMemsetD32Async
      0.6          1217946       2000       609.0       593.0       543      4273        118.9  cuEventRecord
      0.5          1006975         30     33565.8      5287.5       532    783273     141747.0  cudaGetFuncBySymbol_v11000
      0.2           420253          2    210126.5    210126.5      2417    417836     293745.6  cuMemAllocHost_v2
      0.2           345595       2000       172.8       169.0       156      1514         38.2  cuEventSynchronize
      0.1           175334         62      2828.0       825.0       655     64889       9902.4  cuMemAlloc_v2
      0.0            45463          6      7577.2      6085.0      3472     13111       4281.6  cudaMemGetInfo
      0.0            42937        412       104.2        78.0        41      2569        136.1  cuGetProcAddress_v2
      0.0             4679          2      2339.5      2339.5      1257      3422       1530.9  cudaProfilerStart
      0.0             3828          4       957.0       464.0       233      2667       1156.9  cuEventCreate
      0.0             3050          1      3050.0      3050.0      3050      3050          0.0  cuStreamCreate
      0.0             1181          5       236.2        95.0        47       867        353.9  cuCtxSetCurrent
      0.0              476          1       476.0       476.0       476       476          0.0  cuInit
      0.0              464          1       464.0       464.0       464       464          0.0  cuFuncGetModule
      0.0               70          1        70.0        70.0        70        70          0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                             Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -------------------------------------------------------------------------------------------
     16.9         15945974       2000    7973.0    8672.0      5887     12544       1482.4  coreneuron::solve_interleaved1_793(int)
     11.1         10420669       2000    5210.3    5760.0      3904      6240        844.6  coreneuron::nrn_state_Exp2Syn_477(coreneuron::NrnThread *, coreneuron::Memb_list *, int)
      8.8          8310735       2000    4155.4    4576.0      3103      5280        682.6  coreneuron::nrn_cur_IClamp_321(coreneuron::NrnThread *, coreneuron::Memb_list *, int)
      8.3          7871696       2000    3935.8    4352.0      2943      5025        640.4  coreneuron::nrn_cur_Exp2Syn_436(coreneuron::NrnThread *, coreneuron::Memb_list *, int)
      5.5          5197502       2000    2598.8    2849.0      1920      3456        425.3  coreneuron::nrn_cur_pas_305(coreneuron::NrnThread *, coreneuron::Memb_list *, int)
      5.1          4836565       4000    1209.1    1312.0       863      1920        203.9  coreneuron::net_buf_receive_Exp2Syn_340(coreneuron::NrnThread *)
      4.5          4252522       2000    2126.3    2336.0      1567      2656        351.1  coreneuron::nrncore2nrn_send_values_295(coreneuron::NrnThread *)
      4.2          3973279       2000    1986.6    2177.0      1471      2848        329.6  coreneuron::NetCvode::check_thresh_536(coreneuron::NrnThread *)
      3.7          3472390       2000    1736.2    1920.0      1279      2176        285.9  coreneuron::nrn_rhs_83(coreneuron::NrnThread *)
      3.5          3284434       2000    1642.2    1792.0      1215      1888        270.8  coreneuron::nrn_lhs_160(coreneuron::NrnThread *)
      3.5          3261122       2000    1630.6    1792.0      1184      2016        269.8  coreneuron::nrn_jacob_capacitance_74(coren
...

Quickly skimming through the log, I don't see anything obvious 😕. I will be off for some time; we can revisit this later.

@iraikov
Contributor Author

iraikov commented Oct 28, 2024

@pramodk Thank you for checking! I confirmed that I still get the same error with the build settings above. Just for my reference, what NVIDIA platform are you running on?

@pramodk
Member

pramodk commented Oct 28, 2024

This is my local development machine with Intel i9-12900K and

$ nvidia-smi
Tue Oct 29 00:16:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A2000               Off | 00000000:01:00.0 Off |                  Off |
| 30%   28C    P8               5W /  70W |    130MiB /  6138MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

I didn't test on our HPC cluster with V100 (Volta) GPUs because we don't have NVHPC 24.9 there yet.

We will figure this out; I just need to look a bit into the details.
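In the meantime, one thing that may narrow it down on Vista: the NVHPC OpenACC runtime can trace every upload and kernel launch, which should show whether a device copy of pas_global is ever created before the failing update. A sketch (environment variable values per the NVHPC docs; the nrniv invocation is a placeholder — adjust it to your reproducer script):

```shell
# NVHPC OpenACC runtime tracing: 1 = kernel launches, 2 = data
# transfers; 3 logs both to stderr.
export NVCOMPILER_ACC_NOTIFY=3
# Verbose device data-management diagnostics (NVHPC-specific).
export NVCOMPILER_ACC_DEBUG=1
nrniv -python test.py 2>&1 | tee acc_notify.log
```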

@iraikov
Contributor Author

iraikov commented Oct 28, 2024

Ok, thank you. I wonder if the ARM architecture of Grace could somehow be the cause.

@iraikov
Contributor Author

iraikov commented Oct 30, 2024

@pramodk Would it be helpful if I added you to the allocation on TACC Vista so you can take a look?

@pramodk
Member

pramodk commented Nov 3, 2024

It would be helpful to have access. My account name on the TACC portal is kumbhar.
I have changed my email address from my EPFL one to a personal Gmail account, but I guess you can add me via the username?

(I am off this week and hence will be a bit late with responses.)

@iraikov
Contributor Author

iraikov commented Nov 4, 2024

@pramodk Thank you so much for responding during your off time. We have added you to our TACC allocation. ssh access to Vista from outside is not allowed yet, so you first have to connect to frontera.tacc.utexas.edu and from there to vista.tacc.utexas.edu. Thanks a lot for your help!

Just FYI, my module environment on Vista is:

module list
Currently Loaded Modules:
  1) ucc/1.3.0    3) cmake/3.29.5   5) TACC          7) cuda/12.6            (g)
  2) ucx/1.17.0   4) xalt/3.1       6) nvidia/24.9   8) openmpi/5.0.5_nvc249

  Where:
   g:  built for GPU

The Grace-Hopper queue is called gh, and the Grace-Grace queue is called gg.
To start an interactive job on a GH node with 1 GPU and 1 CPU, use, e.g.:

srun --pty -N 1 -n 1 -t 0:30:00 -p gh /bin/bash -l

The Vista user guide is here: https://docs.tacc.utexas.edu/hpc/vista/

