i#6949: Enable core-sharded by default for simulators (#7042)
Adds a new interface trace_analysis_tool::preferred_shard_type() to the
drmemtrace framework to allow tools to request core-sharded operation.

The cache simulator, TLB simulator, and schedule_stats tools override
the new interface to request core-sharded mode.
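A minimal sketch of such an override, using simplified stand-in declarations for `shard_type_t` and the tool base class (the real interface lives in clients/drcachesim/analysis_tool.h; the tool class name here is hypothetical):

```cpp
#include <cassert>

// Stand-ins for illustration: the real shard_type_t and analysis tool base
// class are declared in the drmemtrace headers.
enum shard_type_t { SHARD_BY_THREAD, SHARD_BY_CORE };

class analysis_tool_t {
public:
    virtual ~analysis_tool_t() = default;
    // Default preference: thread-sharded, matching the framework default.
    virtual shard_type_t
    preferred_shard_type()
    {
        return SHARD_BY_THREAD;
    }
};

// A core-sharded-preferring tool, as the cache/TLB simulators and
// schedule_stats now are.
class my_core_sharded_tool_t : public analysis_tool_t {
public:
    shard_type_t
    preferred_shard_type() override
    {
        return SHARD_BY_CORE;
    }
};
```

The launcher queries this method on every selected tool before deciding which sharding mode to use.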

In the launcher, if all tools prefer core-sharded, and the user did not
specify sharding (via -[no_]core_{sharded,serial} or -cpu_scheduling),
then core-sharded (or core-serial) mode is enabled, with a -verbose 1+
message.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool schedule_stats:cache_simulator
  Enabling -core_serial as all tools prefer it
  <...>
  Schedule stats tool results:
  Total counts:
             4 cores
             8 threads: 1257600, 1257602, 1257599, 1257603, 1257598, 1257604, 1257596, 1257601
  <...>
  Core #0 schedule: AEA_A_
  <...>
  Cache simulation results:
  Core #0 (traced CPU(s): #0)
    L1I0 (size=32768, assoc=8, block=64, LRU) stats:
      Hits:                          123,659
  <...>
```

If sharding is not specified and the tools do not agree on a preferred type
(the default preference is thread-sharded when a tool does not override the
new method), an error is raised to avoid confusion:
```
  $ bin64/drrun -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir -tool cache_simulator:basic_counts
  ERROR: failed to initialize analyzer: Selected tools differ in preferred sharding: please re-run with -[no_]core_sharded or -[no_]core_serial
```
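The launcher's agreement check can be modeled roughly as follows (an illustrative sketch, not the actual analyzer_multi code; the function name and string results are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

enum shard_type_t { SHARD_BY_THREAD, SHARD_BY_CORE };

// Returns "core" if every tool prefers core-sharded, "thread" if every
// tool prefers thread-sharded, and "error" when preferences disagree.
std::string
pick_sharding(const std::vector<shard_type_t> &prefs)
{
    bool all_thread = true, all_core = true;
    for (shard_type_t t : prefs) {
        if (t == SHARD_BY_THREAD)
            all_core = false;
        else
            all_thread = false;
    }
    if (all_core)
        return "core";   // Enable -core_sharded or -core_serial.
    if (all_thread)
        return "thread"; // Keep the thread-sharded default.
    return "error";      // Mixed preferences: require an explicit option.
}
```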

Unfortunately, it is not easy to detect core-sharded-on-disk traces in
the launcher, so the user must now pass `-no_core_sharded` when using
such traces with core-sharded-preferring tools to avoid the trace being
re-scheduled yet again. Documentation for this is added, and the re-scheduling
is turned into a fatal error, since it is almost certainly user error.

Reduces the frequency of the scheduler queue diagnostics by 5x, as they seem
too frequent in short simulator runs with the new defaults, which new users
will see.
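The 5x reduction amounts to raising a modulo cadence on a shared counter (a sketch with assumed names; the real check lives in scheduler.cpp and tolerates racy increments since the cadence is approximate):

```cpp
#include <cstdint>

// Cadence raised from every 10,000 calls to every 50,000 calls.
constexpr int64_t kHeartbeatCadence = 50000;

// Returns true when queue diagnostics should be printed for this call.
bool
heartbeat_due(int64_t &counter)
{
    return ++counter % kHeartbeatCadence == 0;
}
```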

Updates the documentation proper and the options documentation to
describe the new defaults.

Updates numerous drcachesim test output templates.

Keeps a couple of simulator tests using thread-sharded by passing
-no_core_serial.

Fixes #6949
derekbruening authored Oct 18, 2024
1 parent 1afe5eb commit e9a983a
Showing 24 changed files with 216 additions and 84 deletions.
6 changes: 6 additions & 0 deletions api/docs/release.dox
@@ -269,6 +269,12 @@ Further non-compatibility-affecting changes include:
the value of TRACE_MARKER_TYPE_ markers. This filter takes a list of
<TRACE_MARKER_TYPE_,new_value> and changes every listed marker in the trace to its
corresponding new_value.
- Added trace_analysis_tool::preferred_shard_type() to the drmemtrace framework to
allow switching to core-sharded by default if all tools prefer that mode.
- For the drmemtrace framework, if only core-sharded-preferring tools are enabled
(these include cache and TLB simulators and the schedule_stats tool), -core_sharded or
-core_serial is automatically turned on for offline analysis to enable more
representative simulated software thread scheduling onto virtual cores.

**************************************************
<hr>
13 changes: 13 additions & 0 deletions clients/drcachesim/analysis_tool.h
@@ -156,6 +156,19 @@ template <typename RecordType> class analysis_tool_tmpl_t {
{
return "";
}
/**
* Identifies the preferred shard type for this analysis. This only applies when
* the user does not specify a shard type for a run. In that case, if every tool
* being run prefers #SHARD_BY_CORE, the framework uses that mode. If tools
* disagree then an error is raised. This is ignored if the user specifies a
* shard type via one of -core_sharded, -core_serial, -no_core_sharded,
* -no_core_serial, or -cpu_scheduling.
*/
virtual shard_type_t
preferred_shard_type()
{
return SHARD_BY_THREAD;
}
/** Returns whether the tool was created successfully. */
virtual bool
operator!()
13 changes: 13 additions & 0 deletions clients/drcachesim/analyzer.cpp
@@ -339,6 +339,19 @@ analyzer_tmpl_t<RecordType, ReaderType>::init_scheduler_common(
uint64_t filetype = scheduler_.get_stream(i)->get_filetype();
VPRINT(this, 2, "Worker %d filetype %" PRIx64 "\n", i, filetype);
if (TESTANY(OFFLINE_FILE_TYPE_CORE_SHARDED, filetype)) {
if (i == 0 && shard_type_ == SHARD_BY_CORE) {
// This is almost certainly user error.
// Better to exit than risk user confusion.
// XXX i#7045: Ideally this could be reported as an error by the
// scheduler, and also detected early in analyzer_multi to auto-fix
// (when no mode is specified: if the user specifies core-sharding
// there could be config differences and this should be an error),
// but neither is simple so today the user has to re-run.
error_string_ =
"Re-scheduling a core-sharded-on-disk trace is generally a "
"mistake; re-run with -no_core_sharded.\n";
return false;
}
shard_type_ = SHARD_BY_CORE;
}
}
56 changes: 54 additions & 2 deletions clients/drcachesim/analyzer_multi.cpp
@@ -462,6 +462,7 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
if (!error.empty()) {
this->success_ = false;
this->error_string_ = "raw2trace failed: " + error;
return;
}
}
}
@@ -473,8 +474,54 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
return;
}

bool sharding_specified = op_core_sharded.specified() || op_core_serial.specified() ||
// -cpu_scheduling implies thread-sharded.
op_cpu_scheduling.get_value();
// TODO i#7040: Add core-sharded support for online tools.
bool offline = !op_indir.get_value().empty() || !op_infile.get_value().empty();
if (offline && !sharding_specified) {
bool all_prefer_thread_sharded = true;
bool all_prefer_core_sharded = true;
for (int i = 0; i < this->num_tools_; ++i) {
if (this->tools_[i]->preferred_shard_type() == SHARD_BY_THREAD) {
all_prefer_core_sharded = false;
} else if (this->tools_[i]->preferred_shard_type() == SHARD_BY_CORE) {
all_prefer_thread_sharded = false;
}
if (this->parallel_ && !this->tools_[i]->parallel_shard_supported()) {
this->parallel_ = false;
}
}
if (all_prefer_core_sharded) {
// XXX i#6949: Ideally we could detect a core-sharded-on-disk input
// here and avoid this but that's not simple so currently we have a
// fatal error from the analyzer and the user must re-run with
// -no_core_sharded for such inputs.
if (this->parallel_) {
if (op_verbose.get_value() > 0)
fprintf(stderr, "Enabling -core_sharded as all tools prefer it\n");
op_core_sharded.set_value(true);
} else {
if (op_verbose.get_value() > 0)
fprintf(stderr, "Enabling -core_serial as all tools prefer it\n");
op_core_serial.set_value(true);
}
} else if (!all_prefer_thread_sharded) {
this->success_ = false;
this->error_string_ = "Selected tools differ in preferred sharding: please "
"re-run with -[no_]core_sharded or -[no_]core_serial";
return;
}
}

typename sched_type_t::scheduler_options_t sched_ops;
if (op_core_sharded.get_value() || op_core_serial.get_value()) {
if (!offline) {
// TODO i#7040: Add core-sharded support for online tools.
this->success_ = false;
this->error_string_ = "Core-sharded is not yet supported for online analysis";
return;
}
if (op_core_serial.get_value()) {
this->parallel_ = false;
}
@@ -502,8 +549,10 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
return;
}
if (!this->init_scheduler(tracedir, only_threads, only_shards,
op_verbose.get_value(), std::move(sched_ops)))
op_verbose.get_value(), std::move(sched_ops))) {
this->success_ = false;
return;
}
} else if (op_infile.get_value().empty()) {
// XXX i#3323: Add parallel analysis support for online tools.
this->parallel_ = false;
@@ -520,12 +569,15 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
if (!this->init_scheduler(std::move(reader), std::move(end),
op_verbose.get_value(), std::move(sched_ops))) {
this->success_ = false;
return;
}
} else {
// Legacy file.
if (!this->init_scheduler(op_infile.get_value(), {}, {}, op_verbose.get_value(),
std::move(sched_ops)))
std::move(sched_ops))) {
this->success_ = false;
return;
}
}
if (!init_analysis_tools()) {
this->success_ = false;
26 changes: 20 additions & 6 deletions clients/drcachesim/common/options.cpp
@@ -299,13 +299,19 @@ droption_t<std::string> op_v2p_file(
droption_t<bool> op_cpu_scheduling(
DROPTION_SCOPE_CLIENT, "cpu_scheduling", false,
"Map threads to cores matching recorded cpu execution",
"By default, the simulator schedules threads to simulated cores in a static "
"By default for online analysis, the simulator schedules threads to simulated cores "
"in a static "
"round-robin fashion. This option causes the scheduler to instead use the recorded "
"cpu that each thread executed on (at a granularity of the trace buffer size) "
"for scheduling, mapping traced cpu's to cores and running each segment of each "
"thread on the core that owns the recorded cpu for that segment. "
"This option is not supported with -core_serial; use "
"-cpu_schedule_file with -core_serial instead.");
"-cpu_schedule_file with -core_serial instead. For offline analysis, the "
"recommendation is to not recreate the as-traced schedule (as it is not accurate due "
"to overhead) and instead use a dynamic schedule via -core_serial. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS
"), -core_serial is automatically turned on for offline analysis.");

droption_t<bytesize_t> op_max_trace_size(
DROPTION_SCOPE_CLIENT, "max_trace_size", 0,
@@ -890,19 +896,27 @@ droption_t<int> op_kernel_trace_buffer_size_shift(
// Core-oriented analysis.
droption_t<bool> op_core_sharded(
DROPTION_SCOPE_ALL, "core_sharded", false, "Analyze per-core in parallel.",
"By default, the input trace is analyzed in parallel across shards equal to "
"software threads. This option instead schedules those threads onto virtual cores "
"By default, the sharding mode is determined by the preferred shard type of the "
"tools selected (unless overridden, the default preferred type is thread-sharded). "
"This option enables core-sharded, overriding tool defaults. Core-sharded "
"analysis schedules the input software threads onto virtual cores "
"and analyzes each core in parallel. Thus, each shard consists of pieces from "
"many software threads. How the scheduling is performed is controlled by a set "
"of options with the prefix \"sched_\" along with -cores.");
"of options with the prefix \"sched_\" along with -cores. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS ") and they all support parallel operation, -core_sharded is "
"automatically turned on for offline analysis.");

droption_t<bool> op_core_serial(
DROPTION_SCOPE_ALL, "core_serial", false, "Analyze per-core in serial.",
"In this mode, scheduling is performed just like for -core_sharded. "
"However, the resulting schedule is acted upon by a single analysis thread "
"which walks the N cores in lockstep in round robin fashion. "
"How the scheduling is performed is controlled by a set "
"of options with the prefix \"sched_\" along with -cores.");
"of options with the prefix \"sched_\" along with -cores. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS ") and not all of them support parallel operation, "
"-core_serial is automatically turned on for offline analysis.");

droption_t<int64_t>
// We pick 10 million to match 2 instructions per nanosecond with a 5ms quantum.
19 changes: 13 additions & 6 deletions clients/drcachesim/docs/drcachesim.dox.in
@@ -1292,22 +1292,24 @@ Neither simulator has a simple way to know which core any particular thread
executed on for each of its instructions. The tracer records which core a
thread is on each time it writes out a full trace buffer, giving an
approximation of the actual scheduling: but this is not representative
due to overhead (see \ref sec_drcachesim_as_traced). By default, these cache and TLB
simulators ignore that
due to overhead (see \ref sec_drcachesim_as_traced). For online analysis, by default,
these cache and TLB simulators ignore that
information and schedule threads to simulated cores in a static round-robin
fashion with load balancing to fill in gaps with new threads after threads
exit. The option "-cpu_scheduling" (see \ref sec_drcachesim_ops) can be
used to instead map each physical cpu to a simulated core and use the
recorded cpu that each segment of thread execution occurred on to schedule
execution following the "as traced" schedule, but as just noted this is not
representative. Instead, we recommend using offline traces and dynamic
re-scheduling as explained in \ref sec_drcachesim_sched_dynamic using the
`-core_serial` parameter. Here is an example:
re-scheduling in core-sharded mode as explained in \ref sec_drcachesim_sched_dynamic
using the
`-core_serial` parameter. In offline mode, `-core_serial` is the default for
these simulators.

\code
$ bin64/drrun -t drmemtrace -offline -- ~/test/pi_estimator 8 20
Estimation of pi is 3.141592653798125
$ bin64/drrun -t drcachesim -core_serial -cores 3 -indir drmemtrace.pi_estimator.*.dir
$ bin64/drrun -t drcachesim -cores 3 -indir drmemtrace.pi_estimator.*.dir
Cache simulation results:
Core #0 (traced CPU(s): #0)
L1I0 (size=32768, assoc=8, block=64, LRU) stats:
@@ -1473,6 +1475,9 @@ The #dynamorio::drmemtrace::TRACE_MARKER_TYPE_TIMESTAMP and
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID markers are modified by the dynamic
scheduler to reflect the new schedule. The new timestamps maintain relative ordering
but should not be relied upon to indicate accurate durations between events.
When analyzing core-sharded-on-disk traces, `-no_core_sharded` must be passed when
using core-sharded-preferring tools to avoid an error from the framework attempting
to re-schedule the already-scheduled trace.

Traces also include markers indicating disruptions in user mode control
flow such as signal handler entry and exit.
@@ -1512,7 +1517,9 @@ the framework controls the iteration), to request the next trace
record for each output on its own. This scheduling is also available to any analysis tool
when the input traces are sharded by core (see the `-core_sharded` and `-core_serial`
and various `-sched_*` option documentation under \ref sec_drcachesim_ops as well as
core-sharded notes when \ref sec_drcachesim_newtool).
core-sharded notes when \ref sec_drcachesim_newtool), and in fact is the
default when all tools prefer core-sharded operation via
#dynamorio::drmemtrace::analysis_tool_t::preferred_shard_type().

********************
\section sec_drcachesim_as_traced As-Traced Schedule Limitations
9 changes: 7 additions & 2 deletions clients/drcachesim/scheduler/scheduler.cpp
@@ -3245,9 +3245,14 @@ scheduler_tmpl_t<RecordType, ReaderType>::pick_next_input(output_ordinal_t outpu
uint64_t blocked_time)
{
VDO(this, 1, {
static int global_heartbeat;
static int64_t global_heartbeat;
// 10K is too frequent for simple analyzer runs: it is too noisy with
// the new core-sharded-by-default for new users using defaults.
// 50K is a reasonable compromise.
// XXX: Add a runtime option to tweak this.
static constexpr int64_t GLOBAL_HEARTBEAT_CADENCE = 50000;
// We are ok with races as the cadence is approximate.
if (++global_heartbeat % 10000 == 0) {
if (++global_heartbeat % GLOBAL_HEARTBEAT_CADENCE == 0) {
print_queue_stats();
}
});
3 changes: 1 addition & 2 deletions clients/drcachesim/simulator/cache_simulator.cpp
@@ -632,8 +632,7 @@ cache_simulator_t::print_results()
std::cerr << "Cache simulation results:\n";
// Print core and associated L1 cache stats first.
for (unsigned int i = 0; i < knobs_.num_cores; i++) {
print_core(i);
if (shard_type_ == SHARD_BY_CORE || thread_ever_counts_[i] > 0) {
if (print_core(i)) {
if (l1_icaches_[i] != l1_dcaches_[i]) {
std::cerr << " " << l1_icaches_[i]->get_name() << " ("
<< l1_icaches_[i]->get_description() << ") stats:" << std::endl;
7 changes: 5 additions & 2 deletions clients/drcachesim/simulator/simulator.cpp
@@ -311,18 +311,19 @@ simulator_t::handle_thread_exit(memref_tid_t tid)
thread2core_.erase(tid);
}

void
bool
simulator_t::print_core(int core) const
{
if (!knob_cpu_scheduling_ && shard_type_ == SHARD_BY_THREAD) {
std::cerr << "Core #" << core << " (" << thread_ever_counts_[core]
<< " thread(s))" << std::endl;
return thread_ever_counts_[core] > 0;
} else {
std::cerr << "Core #" << core;
if (shard_type_ == SHARD_BY_THREAD && cpu_counts_[core] == 0) {
// We keep the "(s)" mainly to simplify test templates.
std::cerr << " (0 traced CPU(s))" << std::endl;
return;
return false;
}
std::cerr << " (";
if (shard_type_ == SHARD_BY_THREAD) // Always 1:1 for SHARD_BY_CORE.
@@ -338,6 +339,8 @@ simulator_t::print_core(int core) const
}
}
std::cerr << ")" << std::endl;
// If anything ran on this core, need_comma will be true.
return need_comma;
}
}

10 changes: 9 additions & 1 deletion clients/drcachesim/simulator/simulator.h
@@ -69,6 +69,13 @@ class simulator_t : public analysis_tool_t {
std::string
initialize_shard_type(shard_type_t shard_type) override;

shard_type_t
preferred_shard_type() override
{
// We prefer a dynamic schedule with more realistic thread interleavings.
return SHARD_BY_CORE;
}

bool
process_memref(const memref_t &memref) override;

@@ -83,7 +90,8 @@
double warmup_fraction, uint64_t sim_refs, bool cpu_scheduling,
bool use_physical, unsigned int verbose);

void
// Returns whether the core was ever non-empty.
bool
print_core(int core) const;

int
3 changes: 1 addition & 2 deletions clients/drcachesim/simulator/tlb_simulator.cpp
@@ -264,8 +264,7 @@ tlb_simulator_t::print_results()
{
std::cerr << "TLB simulation results:\n";
for (unsigned int i = 0; i < knobs_.num_cores; i++) {
print_core(i);
if (thread_ever_counts_[i] > 0) {
if (print_core(i)) {
std::cerr << " L1I stats:" << std::endl;
itlbs_[i]->get_stats()->print_stats(" ");
std::cerr << " L1D stats:" << std::endl;
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_client.templatex
@@ -23,7 +23,7 @@ DynamoRIO statistics:
.*
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -36,9 +36,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_maps.templatex
@@ -11,7 +11,7 @@ pre-DR start
pre-DR detach
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -24,9 +24,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*