CARV-ICS-FORTH/GHNAVS

GHNAVS gem5 simulator

GHNAVS stands for Gem5 - HBM - NoC - ARM - V1 - Simulator

This is a patch for gem5 that enhances support for modeling ARM64 CPUs and ARM-CMN interconnects.

I. Features

  1. A multi-layered NoC implementation, in which each traffic class (VNET) uses a separate NoC. This models ARM CMN interconnects [7] more accurately than the existing gem5 NoC model (garnet2.0 [1]).

    • The number of NoC layers can be adjusted from the command line with a single parameter (--noc_layers).
  2. Changes to the Ruby CPU Sequencer [2, 3] that avoid Sequencer port blocking caused by aliased requests. Aliased requests are subsequent requests (read or write) to the same cache line as an earlier, still outstanding request. In the default Ruby CPU Sequencer, an aliased request blocks the CPU master port, so even later requests that target a different cache line cannot be issued. This unrealistic behavior keeps the number of outstanding requests low and results in very poor performance for workloads that access memory sequentially (e.g. STREAM [4]).

    • This new feature (called sequencer_port_block_bypass) can be turned on or off with a single configuration parameter.
  3. Processor models:

    • ARM Cortex-A76 CPU
    • R-CPU, which models an ARM-Neoverse-V1 CPU core with SVE support
  4. NoC topology: 4x4 mesh with 16 cores, 16 SLC slices, and 8 HBM2 controllers

  5. HBM2 memory model

  6. Very detailed NoC statistics

  7. MOESI_CMP_directory cache coherence (CC) protocol
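
The performance argument behind the Sequencer change (feature 2) can be illustrated with Little's law: sustained bandwidth is bounded by the number of bytes in flight divided by the memory round-trip latency. The sketch below is a standalone Python 3 estimate, not part of the patch. The 64-byte cache line and the limit of 128 outstanding requests come from the parameter table in this README; the 100 ns latency is a made-up illustrative value.

```python
def sustained_bandwidth_gbs(outstanding_requests, line_bytes=64, latency_ns=100.0):
    """Upper bound on bandwidth from Little's law.

    Bytes in flight divided by round-trip latency; bytes per nanosecond
    is numerically equal to GB/s (10^9 bytes per second).
    latency_ns=100.0 is a hypothetical value chosen for illustration.
    """
    return outstanding_requests * line_bytes / latency_ns


# A blocked port keeps few requests in flight; a bypassing one can keep many.
for n in (1, 8, 128):
    print(f"{n:3d} outstanding -> {sustained_bandwidth_gbs(n):6.2f} GB/s")
```

Under these assumptions, a core limited to a single outstanding request cannot exceed 0.64 GB/s, regardless of how fast the memory system is.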

II. Instructions

A. How to build

Software Requirements

  • g++ version 7
  • scons version 3.0.1
  • python version 2.7

Additionally, you are advised to install pydot and the graphviz-related libraries, in order to visualize the generated systems and cross-check that they are as expected. (When those libraries are installed, gem5 produces config.dot.pdf and config.system.ruby.dot.pdf files, which contain a high-level visual representation of the components of the simulated system.)

One way to satisfy all of these requirements is to build a Debian Buster LXC container.

The provided patch is based on public-gem5 source code [6], commit: 904784fb1e15f0c090fb1f1e5c5057e74b0b4ea8. In order to apply the patch use the following:

    git clone https://gem5.googlesource.com/public/gem5
    cd gem5
    git checkout 904784fb1e15f0c090fb1f1e5c5057e74b0b4ea8
    git apply *.patch 

Example build command:

   CXX=g++-7 scons NUMBER_BITS_PER_SET=128 PROTOCOL=MOESI_CMP_directory build/ARM/gem5.opt -j7 --force-lto

B. Important configuration parameters

The following parameters can be adjusted to more accurate values, depending on the system you want to set up.

| Param | File | Value | Description |
|---|---|---|---|
| max_outstanding_requests | src/mem/ruby/system/Sequencer.py | 128 | The number of outstanding requests for the Ruby CPU Sequencer |
| sequencer_port_block_bypass | src/mem/ruby/system/Sequencer.py | True | Use True to enable the FORTH Sequencer (False for the default Ruby Sequencer) |
| --sys-clock | command-line | 2GHz | System clock |
| --ruby-clock | command-line | 2GHz | Ruby subsystem clock |
| --cpu-clock | command-line | 2GHz | CPU clock |
| --topology | command-line | Mesh_EPI_quadrant_p1 | Mesh_XY with support for multiple NoC layers |
| --mesh-rows | command-line | 4 | Number of mesh rows (set to 4 for a 4x4 mesh) |
| --link-width-bits | command-line | 576 | Covers 8 bytes for control and 64 bytes for data (64 bits + 512 bits) |
| --noc_layers | command-line | 3 | The number of layers depends on the number of VNETs of the CC protocol in use. Use 3 for the MOESI_CMP_directory protocol. Use 6 to enable the multi-VNETs NoC feature |
| sve_vl_se | src/arch/arm/ArmISA.py | 1/2/4/8/16 | SVE length in quadwords (quadword = 128 bits). This only works for SE mode |
| sve_vl | arch/arm/ArmSystem.py | 1/2/4/8/16 | SVE length in quadwords (quadword = 128 bits). This only works for FS mode |
| --num-dirs | command-line | 4,8,... | On Ruby systems, the number of memory channels is directly related to the number of cache directories. Note that the mem-channels parameter is ignored when modeling Ruby systems |
| --mem-channels | command-line | 1,2,4,8,... | On classic memory systems, the number of memory channels is adjusted by mem-channels |
| --ports | command-line | 4 | For Ruby systems: CC transitions per cycle |
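
The --link-width-bits value of 576 follows directly from the sizes given in its description. The standalone Python 3 sketch below reproduces the arithmetic; the constant and function names are illustrative only. Reading the result as "control plus one full data line fits in a single link transfer" is our interpretation of the description, not something the patch documents.

```python
CONTROL_BYTES = 8   # control part of a message, per the parameter description
DATA_BYTES = 64     # data part of a message, per the parameter description


def link_width_bits(control_bytes=CONTROL_BYTES, data_bytes=DATA_BYTES):
    """Link width in bits needed to carry control and data side by side."""
    return (control_bytes + data_bytes) * 8


print(link_width_bits())  # 576 = 64 bits of control + 512 bits of data
```

If you model a platform with a different control or data size, the same calculation gives the matching --link-width-bits value.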

C. How to run

1. Syscall emulation (SE) case (se.py)

An example command for SE mode is the following:

    ./build/ARM/gem5.opt \
    --outdir=m5out/outdir_1 \
    configs/example/se.py \
    --cpu-type=R_CPU \
    --arm-iset=aarch64 \
    --num-cpus=4 \
    --num-dirs=4 \
    --caches \
    --l2cache \
    --num-l2caches=4 \
    --l1i_size=64kB \
    --l1d_size=64kB \
    --l2_size=1MB \
    --l1i_assoc=4 \
    --l1d_assoc=4 \
    --l2_assoc=8 \
    --mem-type=DDR4_2400_8x8 \
    --mem-size=2GB \
    --sys-clock=2GHz \
    --cpu-clock=2GHz \
    --ruby-clock=2GHz \
    --ruby \
    --topology=Mesh_EPI \
    --mesh-rows=2 \
    --network=garnet2.0 \
    --link-width-bits=576 \
    --noc_layers=3 \
    --vcs-per-vnet=4 \
    -c sve_stream_copy_se.exe

Note: One can use -c $executable1;$executable2;$executable3;$executable4 to launch the same or different applications on each core, since multi-thread support is limited in SE mode.
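
When the command is generated by a script, the per-core executable list can be assembled as in the standalone Python 3 sketch below. The helper name se_workload_args is hypothetical. The quoting step matters because an unquoted ';' would be interpreted by the shell as a command separator before gem5 ever sees it.

```python
import shlex


def se_workload_args(executables):
    """Build the -c option for se.py: one executable per core, ';'-separated."""
    joined = ";".join(executables)
    # shlex.quote protects the ';' characters from the shell.
    return f"-c {shlex.quote(joined)}"


# Same binary on all four cores, matching --num-cpus=4 in the example above.
print(se_workload_args(["sve_stream_copy_se.exe"] * 4))
```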

2. FullSystem (FS) mode (fs.py)

For Full System mode we use the fs.py simulation script, as it supports the gem5 Ruby System [2], and therefore NoC modeling.

  • Getting a checkpoint: We always take the checkpoint using a classic system setup. The number of CPU cores, the SVE length, and the memory size are fixed by the checkpoint: changing any of them requires a new checkpoint. Mounting the image file or adding files to it also requires a new checkpoint to be taken.

    ./build/ARM/gem5.opt \
    --outdir=m5out/checkpoint_outdir \
    configs/example/fs.py \
    --cpu-type=AtomicSimpleCPU \
    --cpu-clock=2GHz \
    --num-cpus=16 \
    --kernel=vmlinux_bsc \
    --disk=aarch64-ubuntu-armcl-sve.img \
    --machine-type=VExpress_GEM5_V1 \
    --mem-type=DDR4_2400_8x8 \
    --script=configs/boot/hack_back_ckpt.rcS \
    --mem-size=2GB

  • Restoring from a checkpoint: To restore from a checkpoint, the checkpoint directory (named cpt.*) must be copied (or symlinked) into the outdir of the new simulation (outdir_2 in the following examples).
  • If you don't want to restore from a checkpoint, remove -r 1 from the command line.
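
The symlinking step can be scripted. The following is a minimal standalone Python 3 sketch, assuming the checkpoint was written as a cpt.* directory directly under the checkpoint outdir; the function name link_checkpoints is hypothetical.

```python
import glob
import os


def link_checkpoints(checkpoint_outdir, new_outdir):
    """Symlink every cpt.* directory from checkpoint_outdir into new_outdir."""
    os.makedirs(new_outdir, exist_ok=True)
    linked = []
    for src in sorted(glob.glob(os.path.join(checkpoint_outdir, "cpt.*"))):
        dst = os.path.join(new_outdir, os.path.basename(src))
        # Leave an existing copy or link untouched.
        if not os.path.lexists(dst):
            os.symlink(os.path.abspath(src), dst)
        linked.append(dst)
    return linked
```

With the directory names used in this README, the call would be link_checkpoints("m5out/checkpoint_outdir", "m5out/outdir_2").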

2-a) Simple system with a 2x2 Mesh NoC

8 core - Ruby NoC System - MOESI_CMP_directory

2-b) 4x4 Mesh: 16 cores - 16 SLCs - 8 HBM2 controllers (Topology: Mesh_EPI_quadrant_p1)

    CXX=g++-7 scons NUMBER_BITS_PER_SET=128 PROTOCOL=MOESI_CMP_directory build/ARM/gem5.opt -j7 --force-lto

    ./build/ARM/gem5.opt \
    --listener-mode=off \
    --outdir=m5out/outdir_2 \
    configs/example/fs.py \
    -r 1 \
    --cpu-type=R_CPU \
    --restore-with-cpu=R_CPU \
    --arm-iset=aarch64 \
    --kernel=vmlinux_bsc \
    --disk=aarch64-ubuntu-armcl-sve.img \
    --machine-type=VExpress_GEM5_V1 \
    --script=configs/boot/bootscript1.rcS \
    --num-cpus=16 \
    --num-dirs=8 \
    --num-l2caches=16 \
    --caches \
    --l2cache \
    --l1i_size=64kB \
    --l1d_size=1MB \
    --l2_size=2MB \
    --l1i_assoc=4 \
    --l1d_assoc=4 \
    --l2_assoc=8 \
    --mem-type=HBM2_2000_4H_1x128 \
    --mem-size=2GB \
    --sys-clock=2GHz \
    --cpu-clock=2GHz \
    --ruby-clock=2GHz \
    --ruby \
    --topology=Mesh_EPI_quadrant_p1 \
    --mesh-rows=4 \
    --network=garnet2.0 \
    --link-width-bits=576 \
    --noc_layers=3 \
    --vcs-per-vnet=4

EPI Quadrant Topology

III. Advanced features

A. Setup with SLC approximation

MOESI_CMP_directory only supports two cache levels (L1 and L2). However, a 3-level cache hierarchy can be approximated by using the L2 cache controllers as the SLC cache slices: the L1D size is increased to match the combined L1+L2 size of the modeled platform, the L2 size is increased to match the size of the platform's SLC, and the cache latencies are adjusted accordingly.

The provided patch already adjusts the L1/L2 cache latencies of the model. You then only need to set the L1D size, the L2 size (which acts as the SLC), and the number of L2 controllers (--num-l2caches), which approximates the number of SLC slices.

B. Multi-VNETs feature (Double NoC bandwidth)

This release supports multiple VNETs per request/response traffic class, which means that each VNET can have 2 physical links. More specifically, a gem5 system using MOESI_CMP_directory can be set up with a total of 6 NoC layers instead of 3. Use --noc_layers=6, and make sure that use_offset_vnets = True in the file configs/network/Network.py.
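
The relation between VNETs and NoC layers can be written down as a one-line calculation. The standalone Python 3 sketch below uses a hypothetical helper, noc_layers_flag, and takes the VNET count of 3 for MOESI_CMP_directory from the parameter table in this README. Remember that use_offset_vnets must still be set by hand in configs/network/Network.py.

```python
MOESI_CMP_DIRECTORY_VNETS = 3  # per this README: use 3 layers for this protocol


def noc_layers_flag(vnets=MOESI_CMP_DIRECTORY_VNETS, multi_vnets=False):
    """Return the --noc_layers flag: one layer per VNET, two with multi-VNETs."""
    links_per_vnet = 2 if multi_vnets else 1
    return f"--noc_layers={vnets * links_per_vnet}"


print(noc_layers_flag())                  # --noc_layers=3
print(noc_layers_flag(multi_vnets=True))  # --noc_layers=6
```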

C. Detailed NoC Latencies (and CC message types)

One can print detailed queueing and network latencies of the NoC, per VNET and per source-destination pair. To enable this feature, use --debug-flags=RubyNetConnections. With this debug flag, the output contains two types of messages.

The first type reports the NoC latencies. For example:

    7207675622500: system.ruby.network: vnet:2, [L1Cache_Controller 3]->[L2Cache_Controller 5],
    NI[3->21], Rtr[46->37], hops[3],
    queue_AvgMinMaxMed[8.06/8/53/8.00],
    net_AvgMinMaxMed[9.86/9/91/9.00],
    flits:29393

The message is split into the following fields:

  1. gem5 simulation current tick
  2. VNET number
  3. Source Controller -> Destination Controller (Reminder: L1 represents L1+L2 and L2 represents SLC)
  4. Source NI -> Destination NI
  5. Source Router -> Destination Router
  6. Number of hops
  7. Queueing latencies (Avg, Min, Max, Median)
  8. Network latencies (Avg, Min, Max, Median)
  9. Number of packets (flits) sent from source to destination during the last region of interest (ROI)
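
For post-processing, the fields listed above can be extracted with a regular expression. The standalone Python 3 sketch below assumes the message layout is exactly as in the example printout (the real output may be on a single line, so whitespace is normalized first); the names LATENCY_RE and parse_latency_message are hypothetical and not part of gem5.

```python
import re

LATENCY_RE = re.compile(
    r"(?P<tick>\d+): system\.ruby\.network: vnet:(?P<vnet>\d+),\s*"
    r"\[(?P<src>\w+ \d+)\]->\[(?P<dst>\w+ \d+)\],\s*"
    r"NI\[(?P<src_ni>\d+)->(?P<dst_ni>\d+)\],\s*"
    r"Rtr\[(?P<src_rtr>\d+)->(?P<dst_rtr>\d+)\],\s*"
    r"hops\[(?P<hops>\d+)\],\s*"
    r"queue_AvgMinMaxMed\[(?P<queue>[\d./]+)\],\s*"
    r"net_AvgMinMaxMed\[(?P<net>[\d./]+)\],\s*"
    r"flits:(?P<flits>\d+)"
)
INT_FIELDS = ("tick", "vnet", "src_ni", "dst_ni", "src_rtr", "dst_rtr", "hops", "flits")


def parse_latency_message(text):
    """Parse one NoC latency message; return a dict, or None if it does not match."""
    # Collapse line breaks so that a wrapped message is matched as one line.
    match = LATENCY_RE.search(" ".join(text.split()))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("queue", "net"):
        values = map(float, fields[key].split("/"))
        fields[key] = dict(zip(("avg", "min", "max", "med"), values))
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


example = """7207675622500: system.ruby.network: vnet:2, [L1Cache_Controller 3]->[L2Cache_Controller 5],
NI[3->21], Rtr[46->37], hops[3],
queue_AvgMinMaxMed[8.06/8/53/8.00],
net_AvgMinMaxMed[9.86/9/91/9.00],
flits:29393"""
print(parse_latency_message(example)["net"])
```

Lines that do not match (for example, the CC message counts described next) return None and can simply be skipped.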

The second type of message describes which CC messages each source controller sends to each destination controller, per VNET and ROI. An example printout is the following:

    7207675622500: system.ruby.network: vnet:1, [L2Cache_Controller 2]->[Directory_Controller 0], REQ: [GETX : 19691][GETS : 39488][PUTX : 19723][WRITEBACK_DIRTY_DATA : 19723]
    7207675622500: system.ruby.network: vnet:2, [L1Cache_Controller 3]->[L2Cache_Controller 0], RESP: [UNBLOCK : 19713][UNBLOCK_EXCLUSIVE : 9781]

These messages contain the following fields:

  1. gem5 simulation current tick
  2. VNET number
  3. Source Controller -> Destination Controller (Reminder: L1 represents L1+L2 and L2 represents SLC)
  4. REQ / RESP (Request, Response Type)
  5. A sequence of: [CC Message type : Number of messages during last ROI]
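
The per-type message counts can be collected in the same way. This standalone Python 3 sketch again assumes the layout of the example printout; CC_RE, COUNT_RE and parse_cc_message are hypothetical names, not gem5 APIs.

```python
import re

CC_RE = re.compile(
    r"(?P<tick>\d+): system\.ruby\.network: vnet:(?P<vnet>\d+),\s*"
    r"\[(?P<src>\w+ \d+)\]->\[(?P<dst>\w+ \d+)\],\s*"
    r"(?P<kind>REQ|RESP):\s*(?P<counts>(?:\[\w+ : \d+\])+)"
)
COUNT_RE = re.compile(r"\[(\w+) : (\d+)\]")


def parse_cc_message(line):
    """Parse one CC message-count line; return a dict, or None if it does not match."""
    match = CC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["tick"] = int(fields["tick"])
    fields["vnet"] = int(fields["vnet"])
    # Turn "[GETX : 19691][GETS : 39488]" into {"GETX": 19691, "GETS": 39488}.
    fields["counts"] = {name: int(n) for name, n in COUNT_RE.findall(fields["counts"])}
    return fields


line = ("7207675622500: system.ruby.network: vnet:1, "
        "[L2Cache_Controller 2]->[Directory_Controller 0], "
        "REQ: [GETX : 19691][GETS : 39488][PUTX : 19723][WRITEBACK_DIRTY_DATA : 19723]")
parsed = parse_cc_message(line)
print(parsed["kind"], sum(parsed["counts"].values()))
```

Summing the counts per source-destination pair gives the total number of CC messages exchanged during the last ROI.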

IV. References

[1] http://old.gem5.org/Garnet2.0.html

[2] http://old.gem5.org/Ruby.html

[3] http://old.gem5.org/Coherence-Protocol-Independent_Memory_Components.html

[4] https://www.cs.virginia.edu/stream/

[5] http://old.gem5.org/Running_gem5.html

[6] https://gem5.googlesource.com/public/gem5

[7] Arm Neoverse CMN‑650 Coherent Mesh Network TRM

V. Acknowledgements

We thankfully acknowledge support for this research from the European High Performance Computing Joint Undertaking (EuroHPC JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI-SGA2). The EuroHPC JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. National contributions from the involved state members (including the Greek General Secretariat for Research and Innovation) match the EuroHPC funding.

License

BSD-3-Clause (see LICENSE); an additional COPYING file is also present.
