
Reconcilers should rely on knative/pkg to Update TaskRuns and PipelineRuns #5146

Open

imjasonh opened this issue Jul 16, 2022 · 7 comments

Labels: kind/bug, kind/cleanup, lifecycle/frozen

@imjasonh (Member)

Background

The TaskRun and PipelineRun controllers are invoked by Knative controller code that calls ReconcileKind(ctx, tr) with the result of a K8s watch on those resources, then determines whether the resources changed during reconciliation, and if so, calls Update on them.

This is good, because Knative code handles the watching and updating for us; all it assumes is that any changes Tekton wants to make to a resource are expressed as updates to the tr object passed to ReconcileKind.

knative/pkg expects this:

knative/pkg calls Watch
  -> ReconcileKind(tr)
    -> if tr has changed, knative/pkg calls Update
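
For illustration, here is a tiny, self-contained sketch of that contract (the TaskRun and client types below are stand-ins, not Tekton's or knative/pkg's real APIs): the reconciler only mutates the in-memory object, and the framework owns the single Update call.

// Stand-in types, for illustration only; not Tekton's real TaskRun or client.
type TaskRun struct {
	Annotations map[string]string
	Status      string
}

type fakeClient struct{ stored TaskRun }

func (c *fakeClient) Update(_ context.Context, tr TaskRun) { c.stored = tr }

// reconcileKind mutates only the object it was handed; it never calls Update.
func reconcileKind(_ context.Context, tr *TaskRun) {
	if tr.Annotations == nil {
		tr.Annotations = make(map[string]string, 1)
	}
	tr.Annotations["tekton.dev/release"] = "v0.37.2"
	tr.Status = "Succeeded"
}

// frameworkLoop stands in for the part knative/pkg owns: hand a copy of the
// cached object to the reconciler, diff it afterwards, and issue the one
// Update call itself if anything changed.
func frameworkLoop(ctx context.Context, client *fakeClient) {
	cached := client.stored
	working := cached
	reconcileKind(ctx, &working)
	if !reflect.DeepEqual(cached, working) {
		client.Update(ctx, working)
	}
}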

However, our ReconcileKinds for TaskRuns and PipelineRuns (notably, not Runs) also make their own calls to Update, as part of updateLabelsAndAnnotations:

func (c *Reconciler) updateLabelsAndAnnotations(ctx context.Context, pr *v1beta1.PipelineRun) (*v1beta1.PipelineRun, error) {
	newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name)
	if err != nil {
		return nil, fmt.Errorf("error getting PipelineRun %s when updating labels/annotations: %w", pr.Name, err)
	}
	if !reflect.DeepEqual(pr.ObjectMeta.Labels, newPr.ObjectMeta.Labels) || !reflect.DeepEqual(pr.ObjectMeta.Annotations, newPr.ObjectMeta.Annotations) {
		// Note that this uses Update vs. Patch because the former is significantly easier to test.
		// If we want to switch this to Patch, then we will need to teach the utilities in test/controller.go
		// to deal with Patch (setting resourceVersion, and optimistic concurrency checks).
		newPr = newPr.DeepCopy()
		newPr.Labels = pr.Labels
		newPr.Annotations = pr.Annotations
		return c.PipelineClientSet.TektonV1beta1().PipelineRuns(pr.Namespace).Update(ctx, newPr, metav1.UpdateOptions{})
	}
	return newPr, nil
}

func (c *Reconciler) updateLabelsAndAnnotations(ctx context.Context, tr *v1beta1.TaskRun) (*v1beta1.TaskRun, error) {
	// Ensure the TaskRun is properly decorated with the version of the Tekton controller processing it.
	version, err := changeset.Get()
	if err != nil {
		return nil, err
	}
	if tr.Annotations == nil {
		tr.Annotations = make(map[string]string, 1)
	}
	tr.Annotations[podconvert.ReleaseAnnotation] = version
	newTr, err := c.taskRunLister.TaskRuns(tr.Namespace).Get(tr.Name)
	if err != nil {
		return nil, fmt.Errorf("error getting TaskRun %s when updating labels/annotations: %w", tr.Name, err)
	}
	if !reflect.DeepEqual(tr.ObjectMeta.Labels, newTr.ObjectMeta.Labels) || !reflect.DeepEqual(tr.ObjectMeta.Annotations, newTr.ObjectMeta.Annotations) {
		// Note that this uses Update vs. Patch because the former is significantly easier to test.
		// If we want to switch this to Patch, then we will need to teach the utilities in test/controller.go
		// to deal with Patch (setting resourceVersion, and optimistic concurrency checks).
		newTr = newTr.DeepCopy()
		newTr.Labels = tr.Labels
		newTr.Annotations = tr.Annotations
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(tr.Namespace).Update(ctx, newTr, metav1.UpdateOptions{})
	}
	return newTr, nil
}

Effectively, what we're doing is:

knative/pkg calls Watch
  -> ReconcileKind(pr)
    -> if pr has changed, Tekton calls Update
      -> if pr has changed (unlikely), knative/pkg calls Update

This is an unnecessary and unexpected shift in control of the update lifecycle for both PipelineRuns and TaskRuns. A method innocuously named updateLabelsAndAnnotations is actually responsible for persisting all the changes we make during a reconciliation.

This isn't a problem per se; it's just unexpected, and not how Knative reconciler code expects to be used, which may cause confusion and bugs later. It may also lead to duplicate calls to Update if diffs sneak in after our call to updateLabelsAndAnnotations, which can create cluster-destabilizing write load on the K8s API server for heavy users of Tekton.

This may cause extra problems if Knative controller logic starts making load-bearing assumptions that it is solely responsible for the update lifecycle of objects it passes to ReconcileKind. knative/pkg thankfully has downstream tests to warn about behavior changes that might break Tekton, but that also means Knative may be prevented from making otherwise helpful changes and optimizations because of how Tekton is (mis-)using its packages.

Expected Behavior

I'd expect ReconcileKind not to make calls to Update itself, and to rely instead on Knative's own Update calls.
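
As a rough sketch of what that could look like (not actual Tekton code; applyReleaseAnnotation is a hypothetical helper, and this assumes knative/pkg is, or is made, responsible for persisting the resulting diff), the release annotation would simply be set on the tr object that knative/pkg handed us, with no clientset call from the reconciler:

// Sketch only: set the release annotation on the tr passed to ReconcileKind
// and return, instead of fetching a fresh copy and calling Update ourselves.
func (c *Reconciler) applyReleaseAnnotation(tr *v1beta1.TaskRun) error {
	version, err := changeset.Get()
	if err != nil {
		return err
	}
	if tr.Annotations == nil {
		tr.Annotations = make(map[string]string, 1)
	}
	tr.Annotations[podconvert.ReleaseAnnotation] = version
	return nil
}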

Additional Info

  • Kubernetes version: N/A, just reading code
  • Tekton Pipeline version: 0.37.2

While we're at it, updateLabelsAndAnnotations is called from another innocuous-sounding method, finishReconcileUpdateEmitEvents, which should probably also be refactored so it's less load-bearing in the overall reconciliation cycle:

func (c *Reconciler) finishReconcileUpdateEmitEvents(ctx context.Context, tr *v1beta1.TaskRun, beforeCondition *apis.Condition, previousError error) error {

@vdemeester @afrittoli WDYT?

@imjasonh added the kind/bug and kind/cleanup labels on Jul 16, 2022
@imjasonh (Member, Author)

Possibly related:

I think at least part of the reason we update directly is so that when we publish the cloud event we can be sure the object matches the published state -- if updating fails, we shouldn't publish a different cloud event.

If cloud event publishing were a separate offline process that happened in response to the update, we'd have the same guarantee that the cloud event state matches the updated state, we'd get the benefits of a separate configurable/monitorable/scalable deployment just for publishing events, and reconciliation updates could be managed by knative/pkg.

Seems like a win-win-win, we just need to write some code.

@afrittoli (Member)

I'm pretty sure the call to Update (or Patch, before that) predates cloud events. I suspect it may even predate ReconcileKind, and perhaps it's a legacy of the days when it was actually needed?
We should definitely try removing it, run unit and e2e tests, and see if anything breaks :)

About cloud events: moving publishing to a separate controller has pros and cons, but I think the pros definitely outweigh the cons. The main cons are:

  • Some events are not matched by a dedicated status update. If a PipelineRun is created, an external controller could send a new "queued" event (not a "started" one), because it doesn't know when the main controller will pick the run up from the queue and start processing it. The external controller could then send the "started" event when it detects that the start time has been set, and a "running" event when it detects that a TaskRun or Run exists in the status.
  • Avoiding duplicate events requires a local cache of events. The cache is required for Run events anyway, so it's in the codebase already; it was initially developed in the external cloud events controller and then ported in. The cache is in-memory only at the moment, which means a controller restart could trigger duplicate events (a minimal sketch of the caching idea follows at the end of this comment).

The external controller exists in the experimental repo, and @waveywaves and I drafted a roadmap to make it replace the current built-in support, but we both ran out of bandwidth. I will try to pick it back up this summer.
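
For the in-memory cache mentioned above, a minimal sketch of the idea (illustrative only, not the actual cache used for Run events) could be as small as remembering which (object, event type) pairs have already been sent:

// eventCache is an illustrative in-memory dedup cache. Because the map lives
// in memory, a controller restart loses it and duplicates become possible
// again, as noted above.
type eventCache struct {
	mu   sync.Mutex
	seen map[string]bool
}

// ShouldSend records the (namespace/name, event type) pair and reports whether
// this is the first time it has been seen.
func (c *eventCache) ShouldSend(objectKey, eventType string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen == nil {
		c.seen = map[string]bool{}
	}
	key := objectKey + "/" + eventType
	if c.seen[key] {
		return false
	}
	c.seen[key] = true
	return true
}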

@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label on Oct 16, 2022
@imjasonh (Member, Author)

/lifecycle frozen
/remove-lifecycle stale

I think this is still something we should do. I think @vdemeester was looking into it.

@tekton-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Oct 17, 2022
@vdemeester (Member)

I think this is still something we should do. I think @vdemeester was looking into it.

Yes, I was looking into it recently, and it might "break" some assumptions users make (about certain labels being present on the PipelineRun or TaskRun), so I think we might have to document and deprecate the current behavior so that we can remove it later on.

@vdemeester (Member)

/assign

@vdemeester (Member)

/unassign
