Sync drm/panthor and drm/sched with 6.12-rc2 #264

Merged: 14 commits, Oct 18, 2024

Commits on Oct 14, 2024

  1. drm/panthor: Sync commit from drm-tip with Linux tree

    Minor change: in drm-tip, one comment was in a different place than
    in the mainline branch. Fix this to potentially simplify the merge.
    ginkage committed Oct 14, 2024
    Commit: 2b67c59
  2. drm/sched: Re-queue run job worker when drm_sched_entity_pop_job() returns NULL
    
    Rather than looping over entities until one with a ready job is found,
    re-queue the run job worker when drm_sched_entity_pop_job() returns NULL.
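
    A minimal sketch of the resulting worker shape (names follow the drm
    scheduler internals, but this is illustrative, not the verbatim patch):

        static void run_job_work(struct work_struct *w)
        {
            struct drm_gpu_scheduler *sched =
                container_of(w, struct drm_gpu_scheduler, work_run_job);
            struct drm_sched_entity *entity;
            struct drm_sched_job *job;

            entity = drm_sched_select_entity(sched);
            if (!entity)
                return;

            job = drm_sched_entity_pop_job(entity);
            if (!job) {
                complete_all(&entity->entity_idle);
                /* Don't loop over further entities here; re-queue this
                 * worker so it picks another entity next time around. */
                drm_sched_run_job_queue(sched);
                return;
            }

            /* ... hand the job to the hardware ... */
        }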
    
    Signed-off-by: Matthew Brost <[email protected]>
    Reviewed-by: Christian König <[email protected]>
    Fixes: 66dbd90 ("drm/sched: Drain all entities in DRM sched run job worker")
    Reviewed-by: Luben Tuikov <[email protected]>
    Signed-off-by: Dave Airlie <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    mbrost05 authored and ginkage committed Oct 14, 2024
    Commit: 13a04e9
  3. drm/scheduler: Simplify the allocation of slab caches in drm_sched_fence_slab_init
    
    Use the new KMEM_CACHE() macro instead of calling kmem_cache_create()
    directly, to simplify the creation of slab caches.
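
    For illustration, the before/after shape of the conversion (flags as
    in the scheduler's fence slab; this is a sketch, not the exact diff):

        /* Before: open-coded cache creation. */
        sched_fence_slab = kmem_cache_create("drm_sched_fence",
                                             sizeof(struct drm_sched_fence),
                                             0, SLAB_HWCACHE_ALIGN, NULL);

        /* After: KMEM_CACHE() derives the name, size and alignment from
         * the struct type itself. */
        sched_fence_slab = KMEM_CACHE(drm_sched_fence, SLAB_HWCACHE_ALIGN);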
    
    Signed-off-by: Kunwu Chan <[email protected]>
    Signed-off-by: Daniel Vetter <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    KunWuChan authored and ginkage committed Oct 14, 2024
    Commit: e305ff9
  4. drm/sched: fix null-ptr-deref in init entity

    The bug can be triggered by sending an amdgpu_cs_wait_ioctl
    to the AMDGPU DRM driver on any ASIC with a valid context.
    The bug was reported by Joonkyo Jung <[email protected]>.
    For example, the following code:
    
        static void Syzkaller2(int fd)
        {
            union drm_amdgpu_ctx arg1;
            union drm_amdgpu_wait_cs arg2;
            int ret;

            arg1.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
            ret = drmIoctl(fd, 0x140106442 /* amdgpu_ctx_ioctl */, &arg1);

            arg2.in.handle = 0x0;
            arg2.in.timeout = 0x2000000000000;
            arg2.in.ip_type = AMD_IP_VPE /* 0x9 */;
            arg2.in.ip_instance = 0x0;
            arg2.in.ring = 0x0;
            arg2.in.ctx_id = arg1.out.alloc.ctx_id;

            drmIoctl(fd, 0xc0206449 /* AMDGPU_WAIT_CS */, &arg2);
        }
    
    One might assume that the AMDGPU_WAIT_CS ioctl should return an error
    when no job has previously been submitted, but commit 1decbf6 modified
    the logic and allowed sched_rq to be NULL.

    As a result, when there is no job, the AMDGPU_WAIT_CS ioctl returns
    success. This change fixes the null-ptr-deref in entity init; the stack
    trace below demonstrates the error condition:
    
    [  +0.000007] BUG: kernel NULL pointer dereference, address: 0000000000000028
    [  +0.007086] #PF: supervisor read access in kernel mode
    [  +0.005234] #PF: error_code(0x0000) - not-present page
    [  +0.005232] PGD 0 P4D 0
    [  +0.002501] Oops: 0000 [armbian#1] PREEMPT SMP KASAN NOPTI
    [  +0.005034] CPU: 10 PID: 9229 Comm: amd_basic Tainted: G    B   W    L     6.7.0+ armbian#4
    [  +0.007797] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
    [  +0.009798] RIP: 0010:drm_sched_entity_init+0x2d3/0x420 [gpu_sched]
    [  +0.006426] Code: 80 00 00 00 00 00 00 00 e8 1a 81 82 e0 49 89 9c 24 c0 00 00 00 4c 89 ef e8 4a 80 82 e0 49 8b 5d 00 48 8d 7b 28 e8 3d 80 82 e0 <48> 83 7b 28 00 0f 84 28 01 00 00 4d 8d ac 24 98 00 00 00 49 8d 5c
    [  +0.019094] RSP: 0018:ffffc90014c1fa40 EFLAGS: 00010282
    [  +0.005237] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffffff8113f3fa
    [  +0.007326] RDX: fffffbfff0a7889d RSI: 0000000000000008 RDI: ffffffff853c44e0
    [  +0.007264] RBP: ffffc90014c1fa80 R08: 0000000000000001 R09: fffffbfff0a7889c
    [  +0.007266] R10: ffffffff853c44e7 R11: 0000000000000001 R12: ffff8881a719b010
    [  +0.007263] R13: ffff88810d412748 R14: 0000000000000002 R15: 0000000000000000
    [  +0.007264] FS:  00007ffff7045540(0000) GS:ffff8883cc900000(0000) knlGS:0000000000000000
    [  +0.008236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  +0.005851] CR2: 0000000000000028 CR3: 000000011912e000 CR4: 0000000000350ef0
    [  +0.007175] Call Trace:
    [  +0.002561]  <TASK>
    [  +0.002141]  ? show_regs+0x6a/0x80
    [  +0.003473]  ? __die+0x25/0x70
    [  +0.003124]  ? page_fault_oops+0x214/0x720
    [  +0.004179]  ? preempt_count_sub+0x18/0xc0
    [  +0.004093]  ? __pfx_page_fault_oops+0x10/0x10
    [  +0.004590]  ? srso_return_thunk+0x5/0x5f
    [  +0.004000]  ? vprintk_default+0x1d/0x30
    [  +0.004063]  ? srso_return_thunk+0x5/0x5f
    [  +0.004087]  ? vprintk+0x5c/0x90
    [  +0.003296]  ? drm_sched_entity_init+0x2d3/0x420 [gpu_sched]
    [  +0.005807]  ? srso_return_thunk+0x5/0x5f
    [  +0.004090]  ? _printk+0xb3/0xe0
    [  +0.003293]  ? __pfx__printk+0x10/0x10
    [  +0.003735]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
    [  +0.005482]  ? do_user_addr_fault+0x345/0x770
    [  +0.004361]  ? exc_page_fault+0x64/0xf0
    [  +0.003972]  ? asm_exc_page_fault+0x27/0x30
    [  +0.004271]  ? add_taint+0x2a/0xa0
    [  +0.003476]  ? drm_sched_entity_init+0x2d3/0x420 [gpu_sched]
    [  +0.005812]  amdgpu_ctx_get_entity+0x3f9/0x770 [amdgpu]
    [  +0.009530]  ? finish_task_switch.isra.0+0x129/0x470
    [  +0.005068]  ? __pfx_amdgpu_ctx_get_entity+0x10/0x10 [amdgpu]
    [  +0.010063]  ? __kasan_check_write+0x14/0x20
    [  +0.004356]  ? srso_return_thunk+0x5/0x5f
    [  +0.004001]  ? mutex_unlock+0x81/0xd0
    [  +0.003802]  ? srso_return_thunk+0x5/0x5f
    [  +0.004096]  amdgpu_cs_wait_ioctl+0xf6/0x270 [amdgpu]
    [  +0.009355]  ? __pfx_amdgpu_cs_wait_ioctl+0x10/0x10 [amdgpu]
    [  +0.009981]  ? srso_return_thunk+0x5/0x5f
    [  +0.004089]  ? srso_return_thunk+0x5/0x5f
    [  +0.004090]  ? __srcu_read_lock+0x20/0x50
    [  +0.004096]  drm_ioctl_kernel+0x140/0x1f0 [drm]
    [  +0.005080]  ? __pfx_amdgpu_cs_wait_ioctl+0x10/0x10 [amdgpu]
    [  +0.009974]  ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
    [  +0.005618]  ? srso_return_thunk+0x5/0x5f
    [  +0.004088]  ? __kasan_check_write+0x14/0x20
    [  +0.004357]  drm_ioctl+0x3da/0x730 [drm]
    [  +0.004461]  ? __pfx_amdgpu_cs_wait_ioctl+0x10/0x10 [amdgpu]
    [  +0.009979]  ? __pfx_drm_ioctl+0x10/0x10 [drm]
    [  +0.004993]  ? srso_return_thunk+0x5/0x5f
    [  +0.004090]  ? __kasan_check_write+0x14/0x20
    [  +0.004356]  ? srso_return_thunk+0x5/0x5f
    [  +0.004090]  ? _raw_spin_lock_irqsave+0x99/0x100
    [  +0.004712]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
    [  +0.005063]  ? __pfx_arch_do_signal_or_restart+0x10/0x10
    [  +0.005477]  ? srso_return_thunk+0x5/0x5f
    [  +0.004000]  ? preempt_count_sub+0x18/0xc0
    [  +0.004237]  ? srso_return_thunk+0x5/0x5f
    [  +0.004090]  ? _raw_spin_unlock_irqrestore+0x27/0x50
    [  +0.005069]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
    [  +0.008912]  __x64_sys_ioctl+0xcd/0x110
    [  +0.003918]  do_syscall_64+0x5f/0xe0
    [  +0.003649]  ? noist_exc_debug+0xe6/0x120
    [  +0.004095]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
    [  +0.005150] RIP: 0033:0x7ffff7b1a94f
    [  +0.003647] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
    [  +0.019097] RSP: 002b:00007fffffffe0a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [  +0.007708] RAX: ffffffffffffffda RBX: 000055555558b360 RCX: 00007ffff7b1a94f
    [  +0.007176] RDX: 000055555558b360 RSI: 00000000c0206449 RDI: 0000000000000003
    [  +0.007326] RBP: 00000000c0206449 R08: 000055555556ded0 R09: 000000007fffffff
    [  +0.007176] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5d8
    [  +0.007238] R13: 0000000000000003 R14: 000055555555cba8 R15: 00007ffff7ffd040
    [  +0.007250]  </TASK>
    
    v2: Reworked check to guard against null ptr deref and added helpful comments
        (Christian)
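
    A hedged sketch of the reworked check (illustrative, not the verbatim
    fix): with a variable number of run-queues, sched_rq may legitimately
    be NULL, so guard before deriving the entity's run-queue from it.

        if (num_sched_list && sched_list[0]->sched_rq) {
            entity->rq = sched_list[0]->sched_rq[entity->priority];
        } else {
            /* Warn drivers to fix their DRM calling order instead of
             * dereferencing a NULL sched_rq. */
            pr_warn("%s: called with uninitialized scheduler\n", __func__);
        }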
    
    Cc: Christian Koenig <[email protected]>
    Cc: Alex Deucher <[email protected]>
    Cc: Luben Tuikov <[email protected]>
    Cc: Bas Nieuwenhuizen <[email protected]>
    Cc: Joonkyo Jung <[email protected]>
    Cc: Dokyung Song <[email protected]>
    Cc: <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Vitaly Prosyak <[email protected]>
    Reviewed-by: Christian König <[email protected]>
    Fixes: 56e4496 ("drm/sched: Convert the GPU scheduler to variable number of run-queues")
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    Signed-off-by: Christian König <[email protected]>
    vprosyak authored and ginkage committed Oct 14, 2024
    Commit: e33a0f8
  5. drm/scheduler: remove full_recover from drm_sched_start

    This was basically just another one of amdgpu's hacks. The parameter
    allowed restarting the scheduler without turning fence signaling back
    on.

    That this is absolutely not a good idea should be obvious by now,
    since the fences will then just sit there and never signal.

    While at it, clean up the code a bit.
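
    Illustrative before/after at a driver call site (the call site itself
    is an example, not taken from this change):

        /* Before: the bool decided whether fence signaling came back on. */
        drm_sched_start(&ring->sched, true);

        /* After: restarting the scheduler always re-enables signaling. */
        drm_sched_start(&ring->sched);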
    
    Signed-off-by: Christian König <[email protected]>
    Reviewed-by: Matthew Brost <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    ChristianKoenigAMD authored and ginkage committed Oct 14, 2024
    Commit: 2abb6f4
  6. drm/panthor: Fix race when converting group handle to group object

    XArray provides its own internal lock, which protects the internal
    array when entries are being simultaneously added and removed. However,
    there is still a race between retrieving the pointer from the XArray
    and incrementing the reference count.

    To avoid this race, simply hold the internal XArray lock while
    incrementing the reference count; this ensures there cannot be a
    racing call to xa_erase().
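
    A minimal sketch of the pattern, modeled on the panthor group pool
    (names are illustrative):

        static struct panthor_group *
        group_from_handle(struct panthor_group_pool *pool, u32 handle)
        {
            struct panthor_group *group;

            /* Take the ref under the XArray lock so xa_erase() cannot
             * race with the refcount bump. */
            xa_lock(&pool->xa);
            group = group_get(xa_load(&pool->xa, handle));
            xa_unlock(&pool->xa);

            return group;
        }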
    
    Fixes: de85488 ("drm/panthor: Add the scheduler logical block")
    Signed-off-by: Steven Price <[email protected]>
    Reviewed-by: Boris Brezillon <[email protected]>
    Reviewed-by: Liviu Dudau <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    Steven Price authored and ginkage committed Oct 14, 2024
    Commit: 90e0e0b
  7. drm/sched: Fix dynamic job-flow control race

    Fixes a race condition reported here: AsahiLinux/linux#309 (comment)
    
    The whole premise of lockless access to a single-producer-single-
    consumer queue is that there is just a single producer and single
    consumer.  That means we can't call drm_sched_can_queue() (which is
    about queueing more work to the hw, not to the spsc queue) from
    anywhere other than the consumer (wq).
    
    This call in the producer is just an optimization to avoid scheduling
    the consuming worker if it cannot yet queue more work to the hw.  It
    is safe to drop this optimization to avoid the race condition.
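
    In other words, the producer-side wake-up stops asking a consumer-side
    question; a hedged sketch of the resulting shape:

        /* Producer: just kick the worker. The worker, as sole consumer,
         * checks hardware credit itself before popping the spsc queue. */
        void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
        {
            drm_sched_run_job_queue(sched);
        }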
    
    Suggested-by: Asahi Lina <[email protected]>
    Fixes: a78422e ("drm/sched: implement dynamic job-flow control")
    Closes: AsahiLinux/linux#309
    Cc: [email protected]
    Signed-off-by: Rob Clark <[email protected]>
    Reviewed-by: Danilo Krummrich <[email protected]>
    Tested-by: Janne Grunau <[email protected]>
    Signed-off-by: Danilo Krummrich <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    robclark authored and ginkage committed Oct 14, 2024
    Commit: 830795c
  8. drm/sched: Add locking to drm_sched_entity_modify_sched

    Without the locking, amdgpu can currently race between
    amdgpu_ctx_set_entity_priority() (via drm_sched_entity_modify_sched()) and
    drm_sched_job_arm(), leading to the latter accessing a potentially
    inconsistent entity->sched_list and entity->num_sched_list pair.
    
    v2:
     * Improve commit message. (Philipp)
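
    A sketch of the added locking (close to, though not necessarily
    verbatim, the actual change):

        void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
                                           struct drm_gpu_scheduler **sched_list,
                                           unsigned int num_sched_list)
        {
            WARN_ON(!num_sched_list || !sched_list);

            /* Update the pair atomically w.r.t. drm_sched_job_arm(),
             * which reads both while holding the same lock. */
            spin_lock(&entity->rq_lock);
            entity->sched_list = sched_list;
            entity->num_sched_list = num_sched_list;
            spin_unlock(&entity->rq_lock);
        }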
    
    Signed-off-by: Tvrtko Ursulin <[email protected]>
    Fixes: b37aced ("drm/scheduler: implement a function to modify sched list")
    Cc: Christian König <[email protected]>
    Cc: Alex Deucher <[email protected]>
    Cc: Luben Tuikov <[email protected]>
    Cc: Matthew Brost <[email protected]>
    Cc: David Airlie <[email protected]>
    Cc: Daniel Vetter <[email protected]>
    Cc: [email protected]
    Cc: Philipp Stanner <[email protected]>
    Cc: <[email protected]> # v5.7+
    Reviewed-by: Christian König <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    Signed-off-by: Christian König <[email protected]>
    Tvrtko Ursulin authored and ginkage committed Oct 14, 2024
    Commit: 9820361
  9. drm/sched: Always wake up correct scheduler in drm_sched_entity_push_job

    Since drm_sched_entity_modify_sched() can modify the entity's run queue,
    let's make sure to only dereference the pointer once so both adding and
    waking up are guaranteed to be consistent.

    The alternative of moving the spin_unlock to after the wake-up would,
    for now, be more problematic, since the same lock is taken inside
    drm_sched_rq_update_fifo().
    
    v2:
     * Improve commit message. (Philipp)
     * Cache the scheduler pointer directly. (Christian)
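
    A hedged, simplified sketch of the "dereference once" pattern inside
    drm_sched_entity_push_job():

        struct drm_gpu_scheduler *sched;
        struct drm_sched_rq *rq;

        /* Read entity->rq once, under the lock, and derive the scheduler
         * from that snapshot so the add and the wake-up agree. */
        spin_lock(&entity->rq_lock);
        rq = entity->rq;
        sched = rq->sched;
        drm_sched_rq_add_entity(rq, entity);
        spin_unlock(&entity->rq_lock);

        drm_sched_wakeup(sched);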
    
    Signed-off-by: Tvrtko Ursulin <[email protected]>
    Fixes: b37aced ("drm/scheduler: implement a function to modify sched list")
    Cc: Christian König <[email protected]>
    Cc: Alex Deucher <[email protected]>
    Cc: Luben Tuikov <[email protected]>
    Cc: Matthew Brost <[email protected]>
    Cc: David Airlie <[email protected]>
    Cc: Daniel Vetter <[email protected]>
    Cc: Philipp Stanner <[email protected]>
    Cc: [email protected]
    Cc: <[email protected]> # v5.7+
    Reviewed-by: Christian König <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    Signed-off-by: Christian König <[email protected]>
    Tvrtko Ursulin authored and ginkage committed Oct 14, 2024
    Commit: 6af15ae
  10. drm/panthor: Lock the VM resv before calling drm_gpuvm_bo_obtain_prealloc()
    
    drm_gpuvm_bo_obtain_prealloc() will call drm_gpuvm_bo_put() on our
    pre-allocated BO if the <BO,VM> association exists. Given we
    only have one ref on preallocated_vm_bo, drm_gpuvm_bo_destroy() will
    be called immediately, and we have to hold the VM resv lock when
    calling this function.
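
    A minimal sketch of the resulting call pattern (simplified from the
    panthor MMU code):

        /* Hold the VM resv across the call so a drm_gpuvm_bo_put() ->
         * drm_gpuvm_bo_destroy() triggered inside it runs locked. */
        dma_resv_lock(panthor_vm_resv(vm), NULL);
        vm_bo = drm_gpuvm_bo_obtain_prealloc(preallocated_vm_bo);
        dma_resv_unlock(panthor_vm_resv(vm));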
    
    Fixes: 647810e ("drm/panthor: Add the MMU/VM logical block")
    Signed-off-by: Boris Brezillon <[email protected]>
    Reviewed-by: Liviu Dudau <[email protected]>
    Reviewed-by: Steven Price <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    bbrezillon authored and ginkage committed Oct 14, 2024
    Commit: 2096c77
  11. drm/panthor: Fix access to uninitialized variable in tick_ctx_cleanup()

    The group variable can't be used to retrieve ptdev in our second loop,
    because it points to the previously iterated list_head, not a valid
    group. Get the ptdev object from the scheduler instead.
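
    The pitfall in miniature: once a list_for_each_entry() loop has run to
    completion, the iterator no longer points at a valid element, so the
    second loop takes the device from the scheduler instead:

        struct panthor_device *ptdev = sched->ptdev; /* not group->ptdev */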
    
    Cc: <[email protected]>
    Fixes: d72f049 ("drm/panthor: Allow driver compilation")
    Reported-by: kernel test robot <[email protected]>
    Reported-by: Julia Lawall <[email protected]>
    Closes: https://lore.kernel.org/r/[email protected]/
    Signed-off-by: Boris Brezillon <[email protected]>
    Reviewed-by: Liviu Dudau <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    bbrezillon authored and ginkage committed Oct 14, 2024
    Commit: 0eaab07
  12. drm/panthor: Don't declare a queue blocked if deferred operations are pending
    
    If deferred operations are pending, we want to wait for those to
    land before declaring the queue blocked on a SYNC_WAIT. We need
    this to deal with the case where the sync object is signalled through
    a deferred SYNC_{ADD,SET} from the same queue. If we don't do that
    and the group gets scheduled out before the deferred SYNC_{SET,ADD}
    is executed, we'll end up with a timeout, because no external
    SYNC_{SET,ADD} will make the scheduler reconsider the group for
    execution.
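
    As a hedged pseudo-condition (helper and field names here are
    illustrative, not panthor's actual ones):

        /* Only declare the queue blocked on its SYNC_WAIT once no deferred
         * SYNC_SET/SYNC_ADD from this same queue can still signal it. */
        if (sync_wait_unsignaled(queue) && !has_pending_deferred_ops(queue))
            group->blocked_queues |= BIT(queue_idx);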
    
    Fixes: de85488 ("drm/panthor: Add the scheduler logical block")
    Cc: <[email protected]>
    Signed-off-by: Boris Brezillon <[email protected]>
    Reviewed-by: Steven Price <[email protected]>
    Reviewed-by: Liviu Dudau <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    bbrezillon authored and ginkage committed Oct 14, 2024
    Commit: 62884fe
  13. drm/panthor: Don't add write fences to the shared BOs

    The only user (the mesa gallium driver) is already assuming explicit
    synchronization and doing the export/import dance on shared BOs. The
    only reason we were registering ourselves as writers on external BOs
    is because Xe, which was the reference back when we developed Panthor,
    was doing so. Turns out Xe was wrong, and we really want bookkeeping
    on all registered fences, so userspace can explicitly upgrade those to
    read/write when needed.
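
    Illustratively, the registration moves from write to bookkeep usage
    (simplified; panthor actually goes through the drm_gpuvm resv helpers):

        /* Before: shared BOs advertised our job fence as a write fence. */
        dma_resv_add_fence(bo->resv, job_fence, DMA_RESV_USAGE_WRITE);

        /* After: register with bookkeep usage; userspace promotes fences
         * to read/write explicitly when it needs implicit sync. */
        dma_resv_add_fence(bo->resv, job_fence, DMA_RESV_USAGE_BOOKKEEP);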
    
    Fixes: 4bdca11 ("drm/panthor: Add the driver frontend block")
    Cc: Matthew Brost <[email protected]>
    Cc: Simona Vetter <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Boris Brezillon <[email protected]>
    Reviewed-by: Steven Price <[email protected]>
    Reviewed-by: Liviu Dudau <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    bbrezillon authored and ginkage committed Oct 14, 2024
    Commit: 4a47ff1
  14. drm/sched: Use drm sched lockdep map for submit_wq

    Avoid leaking a lockdep map on each drm sched creation and destruction
    by using a single lockdep map for all drm sched allocated submit_wq.
    
    v2:
     - Use alloc_ordered_workqueue_lockdep_map (Tejun)
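
    A sketch of the approach (very close to the shape of the change):

        /* One static lockdep map shared by every scheduler-allocated
         * submit_wq, instead of a fresh map leaked per workqueue. */
        static struct lockdep_map drm_sched_lockdep_map = {
            .name = "drm_sched_lockdep_map",
        };

        sched->submit_wq = alloc_ordered_workqueue_lockdep_map(
                name, WQ_MEM_RECLAIM, &drm_sched_lockdep_map);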
    
    Cc: Luben Tuikov <[email protected]>
    Cc: Christian König <[email protected]>
    Signed-off-by: Matthew Brost <[email protected]>
    Reviewed-by: Nirmoy Das <[email protected]>
    Acked-by: Danilo Krummrich <[email protected]>
    Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    Signed-off-by: Maarten Lankhorst <[email protected]>
    mbrost05 authored and ginkage committed Oct 14, 2024
    Commit: 3fe3d46