
Reconcilers should rely on knative/pkg to Update TaskRuns and PipelineRuns #5146

Open

imjasonh opened this issue Jul 16, 2022 · 7 comments

Labels: kind/bug, kind/cleanup, lifecycle/frozen

@imjasonh (Member)

Background

The TaskRun and PipelineRun controllers are invoked by Knative controller code that calls ReconcileKind(ctx, tr) with the result of a K8s watch on those resources, then determines whether the resources changed during reconciliation, and if so, calls Update on them.

This is good, because Knative code handles the watching and updating for us; all it assumes is that any changes Tekton wants to make to a resource are expressed as updates to the tr object passed to ReconcileKind.

knative/pkg expects this:

knative/pkg calls Watch
  -> ReconcileKind(tr)
    -> if tr has changed, knative/pkg calls Update
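
For illustration, here is a tiny, self-contained sketch of that contract (the TaskRun and client types below are stand-ins, not Tekton's or knative/pkg's real APIs): the reconciler only mutates the in-memory object, and the framework owns the single Update call.

// Stand-in types, for illustration only; not Tekton's real TaskRun or client.
type TaskRun struct {
	Annotations map[string]string
	Status      string
}

type fakeClient struct{ stored TaskRun }

func (c *fakeClient) Update(_ context.Context, tr TaskRun) { c.stored = tr }

// reconcileKind mutates only the object it was handed; it never calls Update.
func reconcileKind(_ context.Context, tr *TaskRun) {
	if tr.Annotations == nil {
		tr.Annotations = make(map[string]string, 1)
	}
	tr.Annotations["tekton.dev/release"] = "v0.37.2"
	tr.Status = "Succeeded"
}

// frameworkLoop stands in for the part knative/pkg owns: hand a copy of the
// cached object to the reconciler, diff it afterwards, and issue the one
// Update call itself if anything changed.
func frameworkLoop(ctx context.Context, client *fakeClient) {
	cached := client.stored
	working := cached
	reconcileKind(ctx, &working)
	if !reflect.DeepEqual(cached, working) {
		client.Update(ctx, working)
	}
}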

However, our ReconcileKinds for TaskRuns and PipelineRuns (notably, not Runs) also make their own calls to Update, as part of updateLabelsAndAnnotations:

func (c *Reconciler) updateLabelsAndAnnotations(ctx context.Context, pr *v1beta1.PipelineRun) (*v1beta1.PipelineRun, error) {
	newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name)
	if err != nil {
		return nil, fmt.Errorf("error getting PipelineRun %s when updating labels/annotations: %w", pr.Name, err)
	}
	if !reflect.DeepEqual(pr.ObjectMeta.Labels, newPr.ObjectMeta.Labels) || !reflect.DeepEqual(pr.ObjectMeta.Annotations, newPr.ObjectMeta.Annotations) {
		// Note that this uses Update vs. Patch because the former is significantly easier to test.
		// If we want to switch this to Patch, then we will need to teach the utilities in test/controller.go
		// to deal with Patch (setting resourceVersion, and optimistic concurrency checks).
		newPr = newPr.DeepCopy()
		newPr.Labels = pr.Labels
		newPr.Annotations = pr.Annotations
		return c.PipelineClientSet.TektonV1beta1().PipelineRuns(pr.Namespace).Update(ctx, newPr, metav1.UpdateOptions{})
	}
	return newPr, nil
}

func (c *Reconciler) updateLabelsAndAnnotations(ctx context.Context, tr *v1beta1.TaskRun) (*v1beta1.TaskRun, error) {
	// Ensure the TaskRun is properly decorated with the version of the Tekton controller processing it.
	version, err := changeset.Get()
	if err != nil {
		return nil, err
	}
	if tr.Annotations == nil {
		tr.Annotations = make(map[string]string, 1)
	}
	tr.Annotations[podconvert.ReleaseAnnotation] = version
	newTr, err := c.taskRunLister.TaskRuns(tr.Namespace).Get(tr.Name)
	if err != nil {
		return nil, fmt.Errorf("error getting TaskRun %s when updating labels/annotations: %w", tr.Name, err)
	}
	if !reflect.DeepEqual(tr.ObjectMeta.Labels, newTr.ObjectMeta.Labels) || !reflect.DeepEqual(tr.ObjectMeta.Annotations, newTr.ObjectMeta.Annotations) {
		// Note that this uses Update vs. Patch because the former is significantly easier to test.
		// If we want to switch this to Patch, then we will need to teach the utilities in test/controller.go
		// to deal with Patch (setting resourceVersion, and optimistic concurrency checks).
		newTr = newTr.DeepCopy()
		newTr.Labels = tr.Labels
		newTr.Annotations = tr.Annotations
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(tr.Namespace).Update(ctx, newTr, metav1.UpdateOptions{})
	}
	return newTr, nil
}

Effectively, what we're doing is:

knative/pkg calls Watch
  -> ReconcileKind(pr)
    -> if pr has changed, Tekton calls Update
      -> if pr has changed (unlikely), knative/pkg calls Update

This is an unnecessary and unexpected shift in control of the update lifecycle for both PipelineRuns and TaskRuns. A method innocuously named updateLabelsAndAnnotations is actually responsible for persisting all the changes we make during a reconciliation.

This isn't a problem per se; it's just unexpected, and not how Knative reconciler code expects to be used, which may cause confusion and bugs later. It may also lead to duplicate calls to Update if diffs sneak in after our call to updateLabelsAndAnnotations, which can create cluster-destabilizing write load on the K8s API server for heavy users of Tekton.

This may cause extra problems if Knative controller logic starts making load-bearing assumptions that it is solely responsible for the update lifecycle of objects it passes to ReconcileKind. knative/pkg thankfully has downstream tests to warn about behavior changes that might break Tekton, but that also means Knative may be prevented from making otherwise helpful changes and optimizations because of how Tekton is (mis-)using its packages.

Expected Behavior

I'd expect ReconcileKind not to make calls to Update itself, and to rely instead on Knative's own Update calls.
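
As a rough sketch of what that could look like (not actual Tekton code; applyReleaseAnnotation is a hypothetical helper, and this assumes knative/pkg is, or is made, responsible for persisting the resulting diff), the release annotation would simply be set on the tr object that knative/pkg handed us, with no clientset call from the reconciler:

// Sketch only: set the release annotation on the tr passed to ReconcileKind
// and return, instead of fetching a fresh copy and calling Update ourselves.
func (c *Reconciler) applyReleaseAnnotation(tr *v1beta1.TaskRun) error {
	version, err := changeset.Get()
	if err != nil {
		return err
	}
	if tr.Annotations == nil {
		tr.Annotations = make(map[string]string, 1)
	}
	tr.Annotations[podconvert.ReleaseAnnotation] = version
	return nil
}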

Additional Info

  • Kubernetes version: N/A, just reading code
  • Tekton Pipeline version: 0.37.2

While we're at it, updateLabelsAndAnnotations is called from another innocuous-sounding method, finishReconcileUpdateEmitEvents, which should probably also be refactored so it's less load-bearing in the overall reconciliation cycle:

func (c *Reconciler) finishReconcileUpdateEmitEvents(ctx context.Context, tr *v1beta1.TaskRun, beforeCondition *apis.Condition, previousError error) error {

@vdemeester @afrittoli WDYT?

@imjasonh added the kind/bug and kind/cleanup labels on Jul 16, 2022
@imjasonh (Member, Author)

Possibly related:

I think at least part of the reason we update directly is so that when we publish the cloud event we can be sure the object matches the published state -- if updating fails, we shouldn't publish a different cloud event.

If cloud event publishing were a separate offline process that happened in response to the update, we'd have the same guarantee that the cloud event state matches the updated state, we'd get the benefits of a separate configurable/monitorable/scalable deployment just for publishing events, and reconciliation updates could be managed by knative/pkg.

Seems like a win-win-win, we just need to write some code.

@afrittoli (Member)

I'm pretty sure the call to Update (or Patch, before that) predates cloud events. I suspect it may even predate ReconcileKind, and perhaps it's a legacy of the days when it was actually needed?
We should definitely try removing it, run unit and e2e tests, and see if anything breaks :)

About cloud events: moving publishing to a separate controller has pros and cons, but I think the pros definitely outweigh the cons. The main cons are:

  • Some events are not matched by a dedicated status update. If a PipelineRun is created, an external controller could send a new "queued" event (not a "started" one), because it doesn't know when the main controller will pick the run up from the queue and start processing it. The external controller could then send the "started" event when it detects that the start time has been set, and a "running" event when it detects that a TaskRun or Run exists in the status.
  • Avoiding duplicate events requires a local cache of events. The cache is required for Run events anyway, so it's in the codebase already; it was initially developed in the external cloud events controller and then ported in. The cache is in-memory only at the moment, which means a controller restart could trigger duplicate events (a minimal sketch of the caching idea follows at the end of this comment).

The external controller exists in the experimental repo, and @waveywaves and I drafted a roadmap to make it replace the current built-in support, but we both ran out of bandwidth. I will try to pick it back up this summer.
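
For the in-memory cache mentioned above, a minimal sketch of the idea (illustrative only, not the actual cache used for Run events) could be as small as remembering which (object, event type) pairs have already been sent:

// eventCache is an illustrative in-memory dedup cache. Because the map lives
// in memory, a controller restart loses it and duplicates become possible
// again, as noted above.
type eventCache struct {
	mu   sync.Mutex
	seen map[string]bool
}

// ShouldSend records the (namespace/name, event type) pair and reports whether
// this is the first time it has been seen.
func (c *eventCache) ShouldSend(objectKey, eventType string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen == nil {
		c.seen = map[string]bool{}
	}
	key := objectKey + "/" + eventType
	if c.seen[key] {
		return false
	}
	c.seen[key] = true
	return true
}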

@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label on Oct 16, 2022
@imjasonh (Member, Author)

/lifecycle frozen
/remove-lifecycle stale

I think this is still something we should do. I think @vdemeester was looking into it.

@tekton-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Oct 17, 2022
@vdemeester (Member)

I think this is still something we should do. I think @vdemeester was looking into it.

Yes, I was looking into it recently, and it might "break" some assumptions users make (about certain labels being present on the PipelineRun or TaskRun), so I think we might have to document and deprecate the current behavior so that we can remove it later on.

@vdemeester (Member)

/assign

@vdemeester (Member)

/unassign
