vk: rt: add very experimental shader clock/timing output #692

Draft: wants to merge 3 commits into base: vulkan
89 changes: 89 additions & 0 deletions ref/vk/NOTES.md
@@ -1011,3 +1011,92 @@ This would need the same as above, plus:
- A: probably should still do it on GPU lol

This would also allow passing arbitrary per-pixel data from shaders, which would make shader debugging much much easier.

# 2023-12-07 E343
## What do we really need for shader profiling
### Optimizing polygon light sampling
- Per-pixel numbers:
  - Total shader time
  - Sampling selection time (Σ)
    - Selecting lights to sample (+count)
    - Selecting light point to sample
    - Vertices count
  - Ray tracing time (Σ, +count)
- Aggregate numbers:
  - TODO: what does VK_KHR_performance_query give us? Regs usage, etc.

# 2023-12-08 E344
## Experiment one
some save in test_brush2, default PROJECTED sampling
0-5%: s=0(0, 0.00%) r=133495(133495, 13.18%)
5-10%: s=0(0, 0.00%) r=204348(337843, 33.35%)
10-15%: s=0(0, 0.00%) r=92673(430516, 42.49%)
15-20%: s=0(0, 0.00%) r=25196(455712, 44.98%)
20-25%: s=67(67, 0.01%) r=3529(459241, 45.33%)
25-30%: s=253(320, 0.03%) r=12018(471259, 46.52%)
30-35%: s=319(639, 0.06%) r=39805(511064, 50.45%)
35-40%: s=2096(2735, 0.27%) r=178843(689907, 68.10%)
40-45%: s=8753(11488, 1.13%) r=270099(960006, 94.76%)
45-50%: s=53958(65446, 6.46%) r=44000(1004006, 99.10%)
50-55%: s=275150(340596, 33.62%) r=8256(1012262, 99.92%)
55-60%: s=183343(523939, 51.72%) r=838(1013100, 100.00%)
60-65%: s=38922(562861, 55.56%) r=0(1013100, 100.00%)
65-70%: s=6603(569464, 56.21%) r=0(1013100, 100.00%)
70-75%: s=14101(583565, 57.60%) r=0(1013100, 100.00%)
75-80%: s=85481(669046, 66.04%) r=0(1013100, 100.00%)
80-85%: s=150723(819769, 80.92%) r=0(1013100, 100.00%)
85-90%: s=107863(927632, 91.56%) r=0(1013100, 100.00%)
90-95%: s=85468(1013100, 100.00%) r=0(1013100, 100.00%)
95-100%: s=0(1013100, 100.00%) r=0(1013100, 100.00%)
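
The buckets above appear to follow the pattern `count(cumulative count, cumulative %)`, with `s` and `r` presumably standing for the sampling and ray timings. Below is a minimal sketch in C of how a 5%-wide histogram in that shape could be produced from per-pixel values. The names `histogram20` and `print_histogram`, the normalization by a caller-supplied `max_value`, and the output label `n` are assumptions for illustration; this is not the code that produced the numbers in these notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BUCKETS 20 /* 5% per bucket */

/* Hypothetical helper: counts how many values fall into each 5% slice of
 * [0, max_value]. Values at or above max_value land in the last bucket. */
void histogram20(const uint32_t *values, size_t count,
                 uint32_t max_value, uint32_t out[BUCKETS]) {
    assert(max_value > 0);
    for (int i = 0; i < BUCKETS; ++i)
        out[i] = 0;
    for (size_t i = 0; i < count; ++i) {
        uint64_t bucket = (uint64_t)values[i] * BUCKETS / max_value;
        if (bucket >= BUCKETS)
            bucket = BUCKETS - 1;
        ++out[bucket];
    }
}

/* Prints each bucket as "lo-hi%: n=count(cumulative, cumulative%)". */
void print_histogram(const uint32_t hist[BUCKETS], size_t total) {
    assert(total > 0);
    uint32_t cumulative = 0;
    for (int i = 0; i < BUCKETS; ++i) {
        cumulative += hist[i];
        printf("%d-%d%%: n=%u(%u, %.2f%%)\n", i * 5, (i + 1) * 5,
               (unsigned)hist[i], (unsigned)cumulative,
               100.0 * (double)cumulative / (double)total);
    }
}
```

Which value the percentages are relative to (largest observed delta, the 20-bit counter range, or something else) is not stated in the notes, so `max_value` is left to the caller here.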

(shader clock) percentiles:
99%: ray=393223 sampling=496965
95%: ray=321478 sampling=365558
90%: ray=276235 sampling=356515
75%: ray=250132 sampling=349746
50%: ray=184418 sampling=337416

PROJECTED + REALTIME clock:
percentiles:
99%: ray=45580 sampling=62860
95%: ray=30324 sampling=60136
90%: ray=19736 sampling=58868
75%: ray=13688 sampling=48068
50%: ray=10784 sampling=22500
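
For reference, percentile tables like the ones above can be derived by sorting the per-pixel deltas and picking a rank. The sketch below uses the nearest-rank definition; that choice and the helper name `percentile_u32` are assumptions, since the notes do not say how the percentiles were computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t, written to avoid overflow on subtraction. */
int cmp_u32(const void *a, const void *b) {
    const uint32_t x = *(const uint32_t *)a;
    const uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sorts values in place and returns the nearest-rank
 * percentile, i.e. the smallest value with at least pct% of samples <= it. */
uint32_t percentile_u32(uint32_t *values, size_t count, unsigned pct) {
    assert(count > 0 && pct > 0 && pct <= 100);
    qsort(values, count, sizeof(values[0]), cmp_u32);
    /* 1-based rank = ceil(pct / 100 * count) */
    const size_t rank = (count * pct + 99) / 100;
    return values[rank - 1];
}
```

Sorting once and reading several ranks from the sorted array would be cheaper for a full table; the helper re-sorts on every call only to stay short.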

## Two
SOLID sampling
0-5%: s=6(6, 0.00%) r=0(0, 0.00%)
5-10%: s=3361(3367, 0.33%) r=0(0, 0.00%)
10-15%: s=89901(93268, 9.21%) r=0(0, 0.00%)
15-20%: s=681106(774374, 76.44%) r=0(0, 0.00%)
20-25%: s=125185(899559, 88.79%) r=0(0, 0.00%)
25-30%: s=111237(1010796, 99.77%) r=222(222, 0.02%)
30-35%: s=2080(1012876, 99.98%) r=667(889, 0.09%)
35-40%: s=224(1013100, 100.00%) r=621(1510, 0.15%)
40-45%: s=0(1013100, 100.00%) r=1049(2559, 0.25%)
45-50%: s=0(1013100, 100.00%) r=2071(4630, 0.46%)
50-55%: s=0(1013100, 100.00%) r=1494(6124, 0.60%)
55-60%: s=0(1013100, 100.00%) r=7099(13223, 1.31%)
60-65%: s=0(1013100, 100.00%) r=80026(93249, 9.20%)
65-70%: s=0(1013100, 100.00%) r=152210(245459, 24.23%)
70-75%: s=0(1013100, 100.00%) r=466794(712253, 70.30%)
75-80%: s=0(1013100, 100.00%) r=276878(989131, 97.63%)
80-85%: s=0(1013100, 100.00%) r=22822(1011953, 99.89%)
85-90%: s=0(1013100, 100.00%) r=1147(1013100, 100.00%)
90-95%: s=0(1013100, 100.00%) r=0(1013100, 100.00%)
95-100%: s=0(1013100, 100.00%) r=0(1013100, 100.00%)

(shader clock) percentiles:
99%: ray=565588 sampling=105982
95%: ray=422982 sampling=102555
90%: ray=391055 sampling=100877
75%: ray=350036 sampling=97007
50%: ray=290830 sampling=76574

SOLID + REALTIME clock
99%: ray=97056 sampling=20328
95%: ray=78988 sampling=19532
90%: ray=74236 sampling=19076
75%: ray=51580 sampling=17512
50%: ray=29268 sampling=6648
15 changes: 15 additions & 0 deletions ref/vk/TODO.md
@@ -1,3 +1,18 @@
# Next
- [ ] performance query

# 2023-12-08 E344
- [x] measure percentage of direct light shader phases:
  - [x] how long does sampling take -- ~~supposedly ALU-bound, need clockARB()?~~
  - [x] how long does ray tracing take -- ~~supposedly mem-bound, clockRealtimeEXT()?~~
- [ ] try mapping shader realtime clock values to calibrated vk compute dispatch timestamps
  Similar to what RGP draws, but not quite the same. Not sure if this is valuable, but it would be neat

# 2023-12-07 E343
- [x] extract raw shader clock
- [-] display times as scopes somewhere
  → tried chrome trace in ff profiler, it broke (200k scopes in 1k threads is too much)

# 2023-12-05 E342
- [x] tone down the specular indirect blur
- [-] try func_wall static light opt, #687
6 changes: 6 additions & 0 deletions ref/vk/shaders/denoiser.comp
@@ -220,6 +220,12 @@ void main() {
readNormals(pix, geometry_normal, shading_normal);
imageStore(out_dest, pix, vec4(.5 + geometry_normal * .5, 0.));
return;
} else if (ubo.ubo.debug_display_only == DEBUG_DISPLAY_TIME_DIRECT_POLY) {
imageStore(out_dest, pix, vec4(imageLoad(light_poly_diffuse, pix).a)); return;
return;
} else if (ubo.ubo.debug_display_only == DEBUG_DISPLAY_TIME_DIRECT_POINT) {
imageStore(out_dest, pix, vec4(imageLoad(light_point_diffuse, pix).a)); return;
return;
}

#ifdef DEBUG_NOISE
17 changes: 16 additions & 1 deletion ref/vk/shaders/light_polygon.glsl
@@ -5,6 +5,10 @@

#include "noise.glsl"
#include "utils.glsl"
#include "time.glsl"

uint prof_sampling = 0, prof_ray = 0;
uint prof_lights = 0, prof_samples = 0, prof_rays = 0;

#define DO_ALL_IN_CLUSTER 1

@@ -232,9 +236,14 @@ void sampleEmissiveSurfaces(vec3 P, vec3 N, vec3 throughput, vec3 view_dir, Mate

const float plane_dist = dot(poly.plane, vec4(P, 1.f));

++prof_lights;

if (plane_dist < 0.)
continue;

++prof_samples;

const time_t prof_sampling_begin = timeNow();
#ifdef PROJECTED
const vec4 light_sample_dir = getPolygonLightSampleProjected(view_dir, ctx, poly);
#elif defined(SOLID)
@@ -244,14 +253,20 @@ void sampleEmissiveSurfaces(vec3 P, vec3 N, vec3 throughput, vec3 view_dir, Mate
#else
const vec4 light_sample_dir = getPolygonLightSampleSimple(P, view_dir, poly);
#endif
prof_sampling += timeDelta(prof_sampling_begin, timeNow());

if (light_sample_dir.w <= 0.)
continue;

const float dist = - plane_dist / dot(light_sample_dir.xyz, poly.plane.xyz);
const vec3 emissive = poly.emissive;

-if (!shadowed(P, light_sample_dir.xyz, dist)) {
+const time_t prof_ray_begin = timeNow();
+const bool shadow = shadowed(P, light_sample_dir.xyz, dist);
+prof_ray += timeDelta(prof_ray_begin, timeNow());
+++prof_rays;
+
+if (!shadow) {
//const float estimate = total_contrib;
const float estimate = light_sample_dir.w;
vec3 poly_diffuse = vec3(0.), poly_specular = vec3(0.);
14 changes: 11 additions & 3 deletions ref/vk/shaders/ray_interop.h
@@ -21,13 +21,15 @@
#define vec3 vec3_t
#define vec4 vec4_t
#define mat4 matrix4x4
-typedef int ivec3[3];
-typedef int ivec2[2];
+typedef int32_t ivec3[3];
+typedef int32_t ivec2[2];
typedef uint32_t uvec2[2];
typedef uint32_t uvec3[3];
typedef uint32_t uvec4[4];
#define TOKENPASTE(x, y) x ## y
#define TOKENPASTE2(x, y) TOKENPASTE(x, y)
#define PAD(x) float TOKENPASTE2(pad_, __LINE__)[x];
#define STRUCT struct

enum {
#define DECLARE_SPECIALIZATION_CONSTANT(index, type, name, default_value) \
SPEC_##name##_INDEX = index,
@@ -192,6 +194,8 @@ struct PushConstants {
#define DEBUG_DISPLAY_INDIRECT_SPEC 10
#define DEBUG_DISPLAY_INDIRECT_DIFF 11
#define DEBUG_DISPLAY_TRIHASH 12
#define DEBUG_DISPLAY_TIME_DIRECT_POLY 13
#define DEBUG_DISPLAY_TIME_DIRECT_POINT 14
// add more when needed

struct UniformBuffer {
@@ -205,6 +209,10 @@ struct UniformBuffer {
uint debug_display_only;
};

struct ProfilingStruct {
uvec4 data[4];
};

#undef PAD
#undef STRUCT

25 changes: 23 additions & 2 deletions ref/vk/shaders/ray_light_direct.glsl
@@ -1,8 +1,11 @@
#include "utils.glsl"
#include "noise.glsl"
#include "time.glsl"

#include "ray_kusochki.glsl"

vec4 profile = vec4(0.);

#include "light.glsl"

void readNormals(ivec2 uv, out vec3 geometry_normal, out vec3 shading_normal) {
@@ -12,6 +15,8 @@ void readNormals(ivec2 uv, out vec3 geometry_normal, out vec3 shading_normal) {
}

void main() {
const time_t time_begin = timeNow();

#ifdef RAY_TRACE
const vec2 uv = (gl_LaunchIDEXT.xy + .5) / gl_LaunchSizeEXT.xy * 2. - 1.;
const ivec2 pix = ivec2(gl_LaunchIDEXT.xy);
@@ -49,13 +54,29 @@ void main() {
vec3 diffuse = vec3(0.), specular = vec3(0.);
computeLighting(pos + geometry_normal * .001, shading_normal, throughput, -direction, material, diffuse, specular);

const time_t time_end = timeNow();
//const uint64_t time_diff = time_end - time_begin;
//const uint time_diff = time_begin.x - time_end.x;

const uint time_diff = timeDelta(time_begin, time_end);
const float time_diff_f = float(time_diff) / 1e6;

const uint prof_index = pix.x + pix.y * ubo.ubo.res.x;
#if LIGHT_POINT
-imageStore(out_light_point_diffuse, pix, vec4(diffuse, 0.f));
+imageStore(out_light_point_diffuse, pix, vec4(diffuse, time_diff_f));
imageStore(out_light_point_specular, pix, vec4(specular, 0.f));
//imageStore(out_light_point_profile, pix, profile);
//prof_direct_point[prof_index].data[0] = vec4(time_begin, time_end);
#endif

#if LIGHT_POLYGON
-imageStore(out_light_poly_diffuse, pix, vec4(diffuse, 0.f));
+imageStore(out_light_poly_diffuse, pix, vec4(diffuse, time_diff_f));
imageStore(out_light_poly_specular, pix, vec4(specular, 0.f));
//imageStore(out_light_poly_profile, pix, profile);
prof_direct_poly.a[prof_index].data[0] = uvec4(time_begin, time_end);
prof_direct_poly.a[prof_index].data[1] = uvec4(timeDelta(time_begin, time_end), prof_sampling, prof_ray, 0);
prof_direct_poly.a[prof_index].data[2] = uvec4(prof_lights, prof_samples, prof_rays, 0);
//prof_direct_poly.a[prof_index].data[2] = uvec4(prof_ray_begin, prof_ray_end);
//prof_direct_poly.a[prof_index].data[3] = uvec4(prof_light_count, 0, 0, 0);
#endif
}
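
For readers who want to consume the `prof_direct_poly` buffer on the host, here is a hedged C sketch of decoding one pixel's record. The slot layout is taken from the shader writes above (`data[0]` holds the raw begin/end clocks, `data[1]` the total, sampling and ray deltas, `data[2]` the light, sample and ray counters). The `PixelProfile` type, its field names and `readPixelProfile` are hypothetical and are not part of this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors `struct ProfilingStruct { uvec4 data[4]; }`: 64 bytes per pixel. */
typedef struct {
    uint32_t data[4][4];
} ProfilingStruct;

/* Hypothetical decoded view; field names are illustrative only. */
typedef struct {
    uint32_t begin_lo, begin_hi, end_lo, end_hi; /* data[0]: raw clock values */
    uint32_t total, sampling, ray;               /* data[1]: accumulated deltas */
    uint32_t lights, samples, rays;              /* data[2]: counters */
} PixelProfile;

/* Reads the record for pixel (x, y), using the same indexing as the shader's
 * prof_index = pix.x + pix.y * res.x. */
PixelProfile readPixelProfile(const ProfilingStruct *buf, uint32_t width,
                              uint32_t x, uint32_t y) {
    assert(x < width);
    const ProfilingStruct *p = &buf[(size_t)x + (size_t)y * width];
    PixelProfile out;
    out.begin_lo = p->data[0][0];
    out.begin_hi = p->data[0][1];
    out.end_lo   = p->data[0][2];
    out.end_hi   = p->data[0][3];
    out.total    = p->data[1][0];
    out.sampling = p->data[1][1];
    out.ray      = p->data[1][2];
    out.lights   = p->data[2][0];
    out.samples  = p->data[2][1];
    out.rays     = p->data[2][2];
    return out;
}
```

`data[3]` is not written by the shader in this revision, so the sketch ignores it.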
4 changes: 4 additions & 0 deletions ref/vk/shaders/ray_light_direct_point.comp
@@ -31,6 +31,10 @@ layout(set = 0, binding = 31, std430) readonly buffer Kusochki { Kusok a[]; } ku
layout(set = 0, binding = 32, std430) readonly buffer Indices { uint16_t a[]; } indices;
layout(set = 0, binding = 33, std430) readonly buffer Vertices { Vertex a[]; } vertices;

//layout(set = 0, binding = 34, rgba16f) uniform writeonly image2D out_light_point_profile;
//layout(set = 0, binding = 34, std430) writeonly buffer Profiling { ProfilingStruct prof_direct_point[]; };


#define RAY_QUERY
#define LIGHT_POINT 1
#include "ray_light_direct.glsl"
3 changes: 3 additions & 0 deletions ref/vk/shaders/ray_light_direct_poly.comp
@@ -31,6 +31,9 @@ layout(set = 0, binding = 31, std430) readonly buffer Kusochki { Kusok a[]; } ku
layout(set = 0, binding = 32, std430) readonly buffer Indices { uint16_t a[]; } indices;
layout(set = 0, binding = 33, std430) readonly buffer Vertices { Vertex a[]; } vertices;

//layout(set = 0, binding = 34, rgba16f) uniform writeonly image2D out_light_poly_profile;
layout(set = 0, binding = 34, std430) writeonly restrict buffer Profiling { ProfilingStruct a[]; } prof_direct_poly;

#define RAY_QUERY
#define LIGHT_POLYGON 1
#include "ray_light_direct.glsl"
53 changes: 53 additions & 0 deletions ref/vk/shaders/time.glsl
@@ -0,0 +1,53 @@
#ifndef TIME_GLSL_INCLUDED
#define TIME_GLSL_INCLUDED

//#define PROF_USE_REALTIME
#ifdef PROF_USE_REALTIME
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_EXT_shader_realtime_clock: enable
#else
#extension GL_ARB_shader_clock: enable
#endif

#define time_t uvec2
#define T0 uvec2(0)
#define clockRealtime clockRealtime2x32EXT
#define clockShader clock2x32ARB

#define clockRealtime64 clockRealtimeEXT
#define clockShader64 clockARB

#ifdef PROF_USE_REALTIME
uint clockRealtimeDelta(time_t begin, time_t end) {
const uint64_t begin64 = begin.x | (uint64_t(begin.y) << 32);
const uint64_t end64 = end.x | (uint64_t(end.y) << 32);
const uint64_t time_diff = end64 - begin64;
return uint(time_diff);
}
#endif

uint clockShaderDelta(time_t begin, time_t end) {
// AMD RDNA2 SHADER_CYCLES reg is limited to 20 bits
return (end.x - begin.x) & 0xfffffu;
}

#ifdef PROF_USE_REALTIME
// On mesa+amdgpu there's a clear gradient: pixels on top of screen take 2-3x longer to compute than bottom ones. Also,
// it does flicker a lot.
// Deltas are about 30000-100000 parrots
#define timeNow clockRealtime
#define timeDelta clockRealtimeDelta
#else
// clockARB doesn't give directly usable time values on mesa+amdgpu;
// even deltas between them are not meaningful enough.
// On AMD clockARB() values are limited to the lower 20 bits (see RDNA 2 ISA SHADER_CYCLES reg), and they wrap around a lot.
// Absolute difference values are often 30-50% of the available range, so they are not that far off from wrapping around
// multiple times, which would render the value completely useless.
// Deltas are around 300000-500000 parrots.
// Other than that, the values seem uniform across the screen (as compared to the realtime clock, which has a clearly
// visible gradient: top differences are larger than bottom ones).
#define timeNow clockShader
#define timeDelta clockShaderDelta
#endif

#endif //ifndef TIME_GLSL_INCLUDED
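
The wrap-around concern in the comments above can be checked on the CPU. This C sketch repeats the arithmetic of `clockShaderDelta` and `clockRealtimeDelta` with plain integers. The function names and test values are illustrative assumptions, and nothing here models real GPU counter behavior beyond the 20-bit mask.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as clockShaderDelta(): difference modulo 2^20. */
uint32_t clock_shader_delta(uint32_t begin_lo, uint32_t end_lo) {
    return (end_lo - begin_lo) & 0xfffffu;
}

/* Same arithmetic as clockRealtimeDelta(): join the 32-bit halves,
 * subtract as 64-bit, then truncate to 32 bits. */
uint32_t clock_realtime_delta(uint32_t begin_lo, uint32_t begin_hi,
                              uint32_t end_lo, uint32_t end_hi) {
    const uint64_t begin64 = begin_lo | ((uint64_t)begin_hi << 32);
    const uint64_t end64 = end_lo | ((uint64_t)end_hi << 32);
    return (uint32_t)(end64 - begin64);
}
```

A single wrap is harmless because the masked subtraction is modular: a counter going from `0xffff0` to `0x00010` still yields 32. Once the true interval reaches 2^20 ticks the result aliases, so an interval of 2^20 + 5 ticks reads as 5, which is the failure mode the comment warns about.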
15 changes: 13 additions & 2 deletions ref/vk/vk_core.c
@@ -98,6 +98,9 @@ static const char* device_extensions_rt[] = {
VK_KHR_RAY_TRACING_PIPELINE_EXTENSION_NAME,
VK_KHR_DEFERRED_HOST_OPERATIONS_EXTENSION_NAME,
VK_KHR_RAY_QUERY_EXTENSION_NAME,

// TODO optional
VK_KHR_SHADER_CLOCK_EXTENSION_NAME,
};

static const char* device_extensions_nv_checkpoint[] = {
@@ -526,9 +529,16 @@ static qboolean createDevice( void ) {
.pNext = head,
.rayQuery = VK_TRUE,
};
head = &ray_query_pipeline_feature;
VkPhysicalDeviceShaderClockFeaturesKHR shader_clock_feature = {
.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_CLOCK_FEATURES_KHR,
.pNext = head,
.shaderDeviceClock = VK_TRUE,
.shaderSubgroupClock = VK_TRUE,
};

if (vk_core.rtx) {
-head = &ray_query_pipeline_feature;
+head = &shader_clock_feature;
} else {
head = NULL;
}
@@ -537,7 +547,8 @@ static qboolean createDevice( void ) {
.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
.pNext = head,
.features.samplerAnisotropy = candidate_device->features.features.samplerAnisotropy,
-.features.shaderInt16 = true,
+.features.shaderInt16 = VK_TRUE,
+.features.shaderInt64 = VK_TRUE,
};
head = &features;
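
Regarding the `// TODO optional` note on the shader clock extension: one possible shape for making a feature struct optional is to link it into the `pNext` chain only when the device reports support. The C sketch below shows just that chaining pattern with a stand-in struct. `FeatureNode` and `chainIf` are hypothetical and deliberately do not use the real Vulkan types, so this is an illustration of the idea under those assumptions rather than a drop-in change for `createDevice`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the common prefix of Vulkan feature structs (sType, pNext).
 * NOT the real Vulkan types; just enough to show the chaining pattern. */
typedef struct FeatureNode {
    int sType;
    void *pNext;
} FeatureNode;

/* Hypothetical helper: prepends node to the chain only when enabled and
 * returns the new head; a disabled node leaves the chain untouched. */
void *chainIf(bool enabled, FeatureNode *node, void *head) {
    if (!enabled)
        return head;
    node->pNext = head;
    return node;
}
```

With a helper like this, each feature struct would be declared without a preset `pNext`, and the extension name would be appended to the device extension list under the same condition.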
