i#6949: Enable core-sharded by default for simulators (#7042)
Adds a new interface trace_analysis_tool::preferred_shard_type() to the
drmemtrace framework to allow tools to request core-sharded operation.

The cache simulator, TLB simulator, and schedule_stats tools override
the new interface to request core-sharded mode.
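A minimal sketch of such an override, using simplified stand-in declarations for `shard_type_t` and the tool base class (the real interface lives in clients/drcachesim/analysis_tool.h; the tool class name here is hypothetical):

```cpp
#include <cassert>

// Stand-ins for illustration: the real shard_type_t and analysis tool base
// class are declared in the drmemtrace headers.
enum shard_type_t { SHARD_BY_THREAD, SHARD_BY_CORE };

class analysis_tool_t {
public:
    virtual ~analysis_tool_t() = default;
    // Default preference: thread-sharded, matching the framework default.
    virtual shard_type_t
    preferred_shard_type()
    {
        return SHARD_BY_THREAD;
    }
};

// A core-sharded-preferring tool, as the cache/TLB simulators and
// schedule_stats now are.
class my_core_sharded_tool_t : public analysis_tool_t {
public:
    shard_type_t
    preferred_shard_type() override
    {
        return SHARD_BY_CORE;
    }
};
```

The launcher queries this method on every selected tool before deciding which sharding mode to use.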

In the launcher, if all tools prefer core-sharded, and the user did not
specify sharding (via -[no_]core_{sharded,serial} or -cpu_scheduling),
then core-sharded (or core-serial) mode is enabled, with a -verbose 1+
message.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool schedule_stats:cache_simulator
  Enabling -core_serial as all tools prefer it
  <...>
  Schedule stats tool results:
  Total counts:
             4 cores
             8 threads: 1257600, 1257602, 1257599, 1257603, 1257598, 1257604, 1257596, 1257601
  <...>
  Core #0 schedule: AEA_A_
  <...>
  Cache simulation results:
  Core #0 (traced CPU(s): #0)
    L1I0 (size=32768, assoc=8, block=64, LRU) stats:
      Hits:                          123,659
  <...>
```

If sharding is not specified and the tools do not agree on a preferred type
(the default preference is thread-sharded when a tool does not override the
new method), an error is raised to avoid confusion:
```
  $ bin64/drrun -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir -tool cache_simulator:basic_counts
  ERROR: failed to initialize analyzer: Selected tools differ in preferred sharding: please re-run with -[no_]core_sharded or -[no_]core_serial
```
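The launcher's agreement check can be modeled roughly as follows (an illustrative sketch, not the actual analyzer_multi code; the function name and string results are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

enum shard_type_t { SHARD_BY_THREAD, SHARD_BY_CORE };

// Returns "core" if every tool prefers core-sharded, "thread" if every
// tool prefers thread-sharded, and "error" when preferences disagree.
std::string
pick_sharding(const std::vector<shard_type_t> &prefs)
{
    bool all_thread = true, all_core = true;
    for (shard_type_t t : prefs) {
        if (t == SHARD_BY_THREAD)
            all_core = false;
        else
            all_thread = false;
    }
    if (all_core)
        return "core";   // Enable -core_sharded or -core_serial.
    if (all_thread)
        return "thread"; // Keep the thread-sharded default.
    return "error";      // Mixed preferences: require an explicit option.
}
```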

Unfortunately, it is not easy to detect core-sharded-on-disk traces in
the launcher, so the user must now pass `-no_core_sharded` when using
such traces with core-sharded-preferring tools to avoid the trace being
re-scheduled yet again. Documentation for this is added, and the re-scheduling
is turned into a fatal error, since it is almost certainly user error.

Reduces the frequency of the scheduler queue diagnostics by 5x, as they seem
too frequent in short simulator runs with the new defaults, which new users
will see.
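The 5x reduction amounts to raising a modulo cadence on a shared counter (a sketch with assumed names; the real check lives in scheduler.cpp and tolerates racy increments since the cadence is approximate):

```cpp
#include <cstdint>

// Cadence raised from every 10,000 calls to every 50,000 calls.
constexpr int64_t kHeartbeatCadence = 50000;

// Returns true when queue diagnostics should be printed for this call.
bool
heartbeat_due(int64_t &counter)
{
    return ++counter % kHeartbeatCadence == 0;
}
```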

Updates the documentation proper and the options documentation to
describe the new defaults.

Updates numerous drcachesim test output templates.

Keeps a couple of simulator tests using thread-sharded by passing
-no_core_serial.

Fixes #6949
derekbruening authored Oct 18, 2024
1 parent 1afe5eb commit e9a983a
Showing 24 changed files with 216 additions and 84 deletions.
6 changes: 6 additions & 0 deletions api/docs/release.dox
@@ -269,6 +269,12 @@ Further non-compatibility-affecting changes include:
the value of TRACE_MARKER_TYPE_ markers. This filter takes a list of
<TRACE_MARKER_TYPE_,new_value> and changes every listed marker in the trace to its
corresponding new_value.
- Added trace_analysis_tool::preferred_shard_type() to the drmemtrace framework to
allow switching to core-sharded by default if all tools prefer that mode.
- For the drmemtrace framework, if only core-sharded-preferring tools are enabled
(these include cache and TLB simulators and the schedule_stats tool), -core_sharded or
-core_serial is automatically turned on for offline analysis to enable more
representative simulated software thread scheduling onto virtual cores.

**************************************************
<hr>
13 changes: 13 additions & 0 deletions clients/drcachesim/analysis_tool.h
@@ -156,6 +156,19 @@ template <typename RecordType> class analysis_tool_tmpl_t {
{
return "";
}
/**
* Identifies the preferred shard type for this analysis. This only applies when
* the user does not specify a shard type for a run. In that case, if every tool
* being run prefers #SHARD_BY_CORE, the framework uses that mode. If tools
* disagree then an error is raised. This is ignored if the user specifies a
* shard type via one of -core_sharded, -core_serial, -no_core_sharded,
* -no_core_serial, or -cpu_scheduling.
*/
virtual shard_type_t
preferred_shard_type()
{
return SHARD_BY_THREAD;
}
/** Returns whether the tool was created successfully. */
virtual bool
operator!()
13 changes: 13 additions & 0 deletions clients/drcachesim/analyzer.cpp
@@ -339,6 +339,19 @@ analyzer_tmpl_t<RecordType, ReaderType>::init_scheduler_common(
uint64_t filetype = scheduler_.get_stream(i)->get_filetype();
VPRINT(this, 2, "Worker %d filetype %" PRIx64 "\n", i, filetype);
if (TESTANY(OFFLINE_FILE_TYPE_CORE_SHARDED, filetype)) {
if (i == 0 && shard_type_ == SHARD_BY_CORE) {
// This is almost certainly user error.
// Better to exit than risk user confusion.
// XXX i#7045: Ideally this could be reported as an error by the
// scheduler, and also detected early in analyzer_multi to auto-fix
// (when no mode is specified: if the user specifies core-sharding
// there could be config differences and this should be an error),
// but neither is simple so today the user has to re-run.
error_string_ =
"Re-scheduling a core-sharded-on-disk trace is generally a "
"mistake; re-run with -no_core_sharded.\n";
return false;
}
shard_type_ = SHARD_BY_CORE;
}
}
56 changes: 54 additions & 2 deletions clients/drcachesim/analyzer_multi.cpp
@@ -462,6 +462,7 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
if (!error.empty()) {
this->success_ = false;
this->error_string_ = "raw2trace failed: " + error;
return;
}
}
}
@@ -473,8 +474,54 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
return;
}

bool sharding_specified = op_core_sharded.specified() || op_core_serial.specified() ||
// -cpu_scheduling implies thread-sharded.
op_cpu_scheduling.get_value();
// TODO i#7040: Add core-sharded support for online tools.
bool offline = !op_indir.get_value().empty() || !op_infile.get_value().empty();
if (offline && !sharding_specified) {
bool all_prefer_thread_sharded = true;
bool all_prefer_core_sharded = true;
for (int i = 0; i < this->num_tools_; ++i) {
if (this->tools_[i]->preferred_shard_type() == SHARD_BY_THREAD) {
all_prefer_core_sharded = false;
} else if (this->tools_[i]->preferred_shard_type() == SHARD_BY_CORE) {
all_prefer_thread_sharded = false;
}
if (this->parallel_ && !this->tools_[i]->parallel_shard_supported()) {
this->parallel_ = false;
}
}
if (all_prefer_core_sharded) {
// XXX i#6949: Ideally we could detect a core-sharded-on-disk input
// here and avoid this but that's not simple so currently we have a
// fatal error from the analyzer and the user must re-run with
// -no_core_sharded for such inputs.
if (this->parallel_) {
if (op_verbose.get_value() > 0)
fprintf(stderr, "Enabling -core_sharded as all tools prefer it\n");
op_core_sharded.set_value(true);
} else {
if (op_verbose.get_value() > 0)
fprintf(stderr, "Enabling -core_serial as all tools prefer it\n");
op_core_serial.set_value(true);
}
} else if (!all_prefer_thread_sharded) {
this->success_ = false;
this->error_string_ = "Selected tools differ in preferred sharding: please "
"re-run with -[no_]core_sharded or -[no_]core_serial";
return;
}
}

typename sched_type_t::scheduler_options_t sched_ops;
if (op_core_sharded.get_value() || op_core_serial.get_value()) {
if (!offline) {
// TODO i#7040: Add core-sharded support for online tools.
this->success_ = false;
this->error_string_ = "Core-sharded is not yet supported for online analysis";
return;
}
if (op_core_serial.get_value()) {
this->parallel_ = false;
}
@@ -502,8 +549,10 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
return;
}
if (!this->init_scheduler(tracedir, only_threads, only_shards,
op_verbose.get_value(), std::move(sched_ops)))
op_verbose.get_value(), std::move(sched_ops))) {
this->success_ = false;
return;
}
} else if (op_infile.get_value().empty()) {
// XXX i#3323: Add parallel analysis support for online tools.
this->parallel_ = false;
@@ -520,12 +569,15 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
if (!this->init_scheduler(std::move(reader), std::move(end),
op_verbose.get_value(), std::move(sched_ops))) {
this->success_ = false;
return;
}
} else {
// Legacy file.
if (!this->init_scheduler(op_infile.get_value(), {}, {}, op_verbose.get_value(),
std::move(sched_ops)))
std::move(sched_ops))) {
this->success_ = false;
return;
}
}
if (!init_analysis_tools()) {
this->success_ = false;
26 changes: 20 additions & 6 deletions clients/drcachesim/common/options.cpp
@@ -299,13 +299,19 @@ droption_t<std::string> op_v2p_file(
droption_t<bool> op_cpu_scheduling(
DROPTION_SCOPE_CLIENT, "cpu_scheduling", false,
"Map threads to cores matching recorded cpu execution",
"By default, the simulator schedules threads to simulated cores in a static "
"By default for online analysis, the simulator schedules threads to simulated cores "
"in a static "
"round-robin fashion. This option causes the scheduler to instead use the recorded "
"cpu that each thread executed on (at a granularity of the trace buffer size) "
"for scheduling, mapping traced cpu's to cores and running each segment of each "
"thread on the core that owns the recorded cpu for that segment. "
"This option is not supported with -core_serial; use "
"-cpu_schedule_file with -core_serial instead.");
"-cpu_schedule_file with -core_serial instead. For offline analysis, the "
"recommendation is to not recreate the as-traced schedule (as it is not accurate due "
"to overhead) and instead use a dynamic schedule via -core_serial. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS
"), -core_serial is automatically turned on for offline analysis.");

droption_t<bytesize_t> op_max_trace_size(
DROPTION_SCOPE_CLIENT, "max_trace_size", 0,
@@ -890,19 +896,27 @@ droption_t<int> op_kernel_trace_buffer_size_shift(
// Core-oriented analysis.
droption_t<bool> op_core_sharded(
DROPTION_SCOPE_ALL, "core_sharded", false, "Analyze per-core in parallel.",
"By default, the input trace is analyzed in parallel across shards equal to "
"software threads. This option instead schedules those threads onto virtual cores "
"By default, the sharding mode is determined by the preferred shard type of the "
"tools selected (unless overridden, the default preferred type is thread-sharded). "
"This option enables core-sharded, overriding tool defaults. Core-sharded "
"analysis schedules the input software threads onto virtual cores "
"and analyzes each core in parallel. Thus, each shard consists of pieces from "
"many software threads. How the scheduling is performed is controlled by a set "
"of options with the prefix \"sched_\" along with -cores.");
"of options with the prefix \"sched_\" along with -cores. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS ") and they all support parallel operation, -core_sharded is "
"automatically turned on for offline analysis.");

droption_t<bool> op_core_serial(
DROPTION_SCOPE_ALL, "core_serial", false, "Analyze per-core in serial.",
"In this mode, scheduling is performed just like for -core_sharded. "
"However, the resulting schedule is acted upon by a single analysis thread "
"which walks the N cores in lockstep in round robin fashion. "
"How the scheduling is performed is controlled by a set "
"of options with the prefix \"sched_\" along with -cores.");
"of options with the prefix \"sched_\" along with -cores. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS ") and not all of them support parallel operation, "
"-core_serial is automatically turned on for offline analysis.");

droption_t<int64_t>
// We pick 10 million to match 2 instructions per nanosecond with a 5ms quantum.
19 changes: 13 additions & 6 deletions clients/drcachesim/docs/drcachesim.dox.in
@@ -1292,22 +1292,24 @@ Neither simulator has a simple way to know which core any particular thread
executed on for each of its instructions. The tracer records which core a
thread is on each time it writes out a full trace buffer, giving an
approximation of the actual scheduling: but this is not representative
due to overhead (see \ref sec_drcachesim_as_traced). By default, these cache and TLB
simulators ignore that
due to overhead (see \ref sec_drcachesim_as_traced). For online analysis, by default,
these cache and TLB simulators ignore that
information and schedule threads to simulated cores in a static round-robin
fashion with load balancing to fill in gaps with new threads after threads
exit. The option "-cpu_scheduling" (see \ref sec_drcachesim_ops) can be
used to instead map each physical cpu to a simulated core and use the
recorded cpu that each segment of thread execution occurred on to schedule
execution following the "as traced" schedule, but as just noted this is not
representative. Instead, we recommend using offline traces and dynamic
re-scheduling as explained in \ref sec_drcachesim_sched_dynamic using the
`-core_serial` parameter. Here is an example:
re-scheduling in core-sharded mode as explained in \ref sec_drcachesim_sched_dynamic
using the
`-core_serial` parameter. In offline mode, `-core_serial` is the default for
these simulators.

\code
$ bin64/drrun -t drmemtrace -offline -- ~/test/pi_estimator 8 20
Estimation of pi is 3.141592653798125
$ bin64/drrun -t drcachesim -core_serial -cores 3 -indir drmemtrace.pi_estimator.*.dir
$ bin64/drrun -t drcachesim -cores 3 -indir drmemtrace.pi_estimator.*.dir
Cache simulation results:
Core #0 (traced CPU(s): #0)
L1I0 (size=32768, assoc=8, block=64, LRU) stats:
@@ -1473,6 +1475,9 @@ The #dynamorio::drmemtrace::TRACE_MARKER_TYPE_TIMESTAMP and
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID markers are modified by the dynamic
scheduler to reflect the new schedule. The new timestamps maintain relative ordering
but should not be relied upon to indicate accurate durations between events.
When analyzing core-sharded-on-disk traces, `-no_core_sharded` must be passed when
using core-sharded-preferring tools to avoid an error from the framework attempting
to re-schedule the already-scheduled trace.

Traces also include markers indicating disruptions in user mode control
flow such as signal handler entry and exit.
@@ -1512,7 +1517,9 @@ the framework controls the iteration), to request the next trace
record for each output on its own. This scheduling is also available to any analysis tool
when the input traces are sharded by core (see the `-core_sharded` and `-core_serial`
and various `-sched_*` option documentation under \ref sec_drcachesim_ops as well as
core-sharded notes when \ref sec_drcachesim_newtool).
core-sharded notes when \ref sec_drcachesim_newtool), and in fact is the
default when all tools prefer core-sharded operation via
#dynamorio::drmemtrace::analysis_tool_t::preferred_shard_type().

********************
\section sec_drcachesim_as_traced As-Traced Schedule Limitations
9 changes: 7 additions & 2 deletions clients/drcachesim/scheduler/scheduler.cpp
@@ -3245,9 +3245,14 @@ scheduler_tmpl_t<RecordType, ReaderType>::pick_next_input(output_ordinal_t outpu
uint64_t blocked_time)
{
VDO(this, 1, {
static int global_heartbeat;
static int64_t global_heartbeat;
// 10K is too frequent for simple analyzer runs: it is too noisy with
// the new core-sharded-by-default for new users using defaults.
// 50K is a reasonable compromise.
// XXX: Add a runtime option to tweak this.
static constexpr int64_t GLOBAL_HEARTBEAT_CADENCE = 50000;
// We are ok with races as the cadence is approximate.
if (++global_heartbeat % 10000 == 0) {
if (++global_heartbeat % GLOBAL_HEARTBEAT_CADENCE == 0) {
print_queue_stats();
}
});
3 changes: 1 addition & 2 deletions clients/drcachesim/simulator/cache_simulator.cpp
@@ -632,8 +632,7 @@ cache_simulator_t::print_results()
std::cerr << "Cache simulation results:\n";
// Print core and associated L1 cache stats first.
for (unsigned int i = 0; i < knobs_.num_cores; i++) {
print_core(i);
if (shard_type_ == SHARD_BY_CORE || thread_ever_counts_[i] > 0) {
if (print_core(i)) {
if (l1_icaches_[i] != l1_dcaches_[i]) {
std::cerr << " " << l1_icaches_[i]->get_name() << " ("
<< l1_icaches_[i]->get_description() << ") stats:" << std::endl;
7 changes: 5 additions & 2 deletions clients/drcachesim/simulator/simulator.cpp
@@ -311,18 +311,19 @@ simulator_t::handle_thread_exit(memref_tid_t tid)
thread2core_.erase(tid);
}

void
bool
simulator_t::print_core(int core) const
{
if (!knob_cpu_scheduling_ && shard_type_ == SHARD_BY_THREAD) {
std::cerr << "Core #" << core << " (" << thread_ever_counts_[core]
<< " thread(s))" << std::endl;
return thread_ever_counts_[core] > 0;
} else {
std::cerr << "Core #" << core;
if (shard_type_ == SHARD_BY_THREAD && cpu_counts_[core] == 0) {
// We keep the "(s)" mainly to simplify test templates.
std::cerr << " (0 traced CPU(s))" << std::endl;
return;
return false;
}
std::cerr << " (";
if (shard_type_ == SHARD_BY_THREAD) // Always 1:1 for SHARD_BY_CORE.
@@ -338,6 +339,8 @@ simulator_t::print_core(int core) const
}
}
std::cerr << ")" << std::endl;
// If anything ran on this core, need_comma will be true.
return need_comma;
}
}

10 changes: 9 additions & 1 deletion clients/drcachesim/simulator/simulator.h
@@ -69,6 +69,13 @@ class simulator_t : public analysis_tool_t {
std::string
initialize_shard_type(shard_type_t shard_type) override;

shard_type_t
preferred_shard_type() override
{
// We prefer a dynamic schedule with more realistic thread interleavings.
return SHARD_BY_CORE;
}

bool
process_memref(const memref_t &memref) override;

@@ -83,7 +90,8 @@
double warmup_fraction, uint64_t sim_refs, bool cpu_scheduling,
bool use_physical, unsigned int verbose);

void
// Returns whether the core was ever non-empty.
bool
print_core(int core) const;

int
3 changes: 1 addition & 2 deletions clients/drcachesim/simulator/tlb_simulator.cpp
@@ -264,8 +264,7 @@ tlb_simulator_t::print_results()
{
std::cerr << "TLB simulation results:\n";
for (unsigned int i = 0; i < knobs_.num_cores; i++) {
print_core(i);
if (thread_ever_counts_[i] > 0) {
if (print_core(i)) {
std::cerr << " L1I stats:" << std::endl;
itlbs_[i]->get_stats()->print_stats(" ");
std::cerr << " L1D stats:" << std::endl;
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_client.templatex
@@ -23,7 +23,7 @@ DynamoRIO statistics:
.*
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -36,9 +36,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_maps.templatex
@@ -11,7 +11,7 @@ pre-DR start
pre-DR detach
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -24,9 +24,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*