Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gloo fails after upgrade from 1.14.27 to 1.15.x #8968

Closed
ndtmdsma opened this issue Dec 12, 2023 · 13 comments
Closed

Gloo fails after upgrade from 1.14.27 to 1.15.x #8968

ndtmdsma opened this issue Dec 12, 2023 · 13 comments
Assignees
Labels
Area: Envoy activities related to envoy proxy Good First Issue Good issue for newbies release/1.15 Type: Bug Something isn't working

Comments

@ndtmdsma
Copy link

ndtmdsma commented Dec 12, 2023

Gloo Edge Product

Open Source

Gloo Edge Version

v1.15.x

Kubernetes Version

v1.24

Describe the bug

Upgrading from Gloo 1.14.x to 1.15.x causes Gloo to fail entirely. Everything starts, glooctl check says everything is OK, but Gloo doesn't route anything. I get warning as below for all upstreams:

[warning][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:128] gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) app1_gloo-system: Didn't find a registered config subscription factory implementation for name: 'envoy.config_subscription.rest', kube-svc:app2-0-80_ns2: Didn't find a registered config subscription factory implementation for name: 'envoy.config_subscription.rest', consul-svc:app3_gloo-system: Didn't find a registered config subscription factory implementation for name: 'envoy.config_subscription.rest'

It's a very long list.

Expected Behavior

Gloo working after upgrade

Steps to reproduce the bug

Upgrade Gloo from 1.14.27 to 1.15.x (I tried all patch versions).

Additional Environment Detail

Using consul upstreams discovery (but gloo fails for both consul and kube upstreams).

Running in EKS.

Helm values:

    crds:
      create: true
    discovery:
      deployment:
        floatingUserId: false
        fsGroup: 10101
        image:
          extended: false
          repository: discovery
          tag: 1.15.17
        replicas: 1
        runAsUser: 10101
        stats:
          enabled: true
          serviceMonitorEnabled: true
          podMonitorEnabled: true
      enabled: true
      fdsMode: WHITELIST
      serviceAccount:
        disableAutomount: false
      logLevel: "info"
      udsOptions:
        enabled: true
    gateway:
      enabled: true
      certGenJob:
        enabled: true
        image:
          repository: certgen
        restartPolicy: OnFailure
        setTtlAfterFinished: true
        ttlSecondsAfterFinished: 60
        runAsUser: 10101
        cron:
          enabled: false
          schedule: "* * * * *"
        runOnUpdate: false
      rolloutJob:
        enabled: true
        image:
          repository: kubectl
        restartPolicy: OnFailure
        runAsUser: 10101
        ttlSecondsAfterFinished: 60
      cleanupJob:
        enabled: true
        image:
          repository: kubectl
        restartPolicy: OnFailure
        runAsUser: 10101
        ttlSecondsAfterFinished: 60
      proxyServiceAccount:
        disableAutomount: false
      readGatewaysFromAllNamespaces: false
      isolateVirtualHostsBySslConfig: false
      serviceAccount:
        disableAutomount: false
      updateValues: false
      validation:
        enabled: true
        failurePolicy: "Ignore"
        secretName: gateway-validation-certs
        alwaysAcceptResources: true
        allowWarnings: true
        serverEnabled: true
        disableTransformationValidation: false
        warnRouteShortCircuiting: false
        validationServerGrpcMaxSizeBytes: 104857600
        webhook:
          enabled: true
          disableHelmHook: false
          extraAnnotations: {}
      persistProxySpec: true
    gatewayProxies:
      gatewayProxy:
        antiAffinity: false
        configMap:
          data: null
        envoyApiVersion: V3
        envoyBootstrapExtensions: null
        envoyStaticClusters: null
        failover:
          enabled: false
        gatewaySettings:
          disableGeneratedGateways: false
          options: {}
          useProxyProto: false
        globalDownstreamMaxConnections: 250000
        healthyPanicThreshold: 50
        kind:
          deployment:
            replicas: 1
        loopBackAddress: 127.0.0.1
        podTemplate:
          customReadinessProbe: {}
          disableNetBind: true
          floatingUserId: false
          fsGroup: 10101
          gracefulShutdown:
            enabled: false
            sleepTimeSeconds: 25
          httpPort: 8080
          httpsPort: 8443
          image:
            extended: false
            repository: gloo-envoy-wrapper
            tag: 1.15.17
          probes: false
          runAsUser: 10101
          runUnprivileged: true
          terminationGracePeriodSeconds: 0
          tolerations: null
        readConfig: false
        readConfigMulticluster: false
        service:
          httpPort: 80
          httpsFirst: false
          httpsPort: 443
          type: LoadBalancer
    global:
      glooRbac:
        create: true
        nameSuffix: ""
        namespaced: false
      glooStats:
        enabled: true
        routePrefixRewrite: /stats/prometheus
      image:
        extended: false
        pullPolicy: IfNotPresent
        registry: quay.io/solo-io
      istioSDS: {}
    gloo:
      deployment:
        disableUsageStatistics: false
        floatingUserId: false
        image:
          extended: false
          repository: gloo
          tag: 1.15.17
        replicas: 1
        restXdsPort: 9976
        runAsUser: 10101
        validationPort: 9988
        xdsPort: 9977
      serviceAccount:
        disableAutomount: false
    k8s:
      clusterName: cluster.local
    namespace:
      create: false
    settings:
      aws:
        enableCredentialsDiscovery: false
        enableServiceAccountCredentials: false
        stsCredentialsRegion: ""
      create: true
      disableKubernetesDestinations: false
      disableProxyGarbageCollection: false
      enableRestEds: true
      integrations:
        consul:
          insecureSkipVerify: true
          httpAddress: https://consul-server.consul.svc.cluster.local:8501
          serviceDiscovery: {}
        consulUpstreamDiscovery: null
        knative:
          enabled: false
      invalidConfigPolicy:
        invalidRouteResponseBody: Gateway has invalid configuration.
        invalidRouteResponseCode: 404
      regexMaxProgramSize: 150
      linkerd: false
      singleNamespace: false

Additional Context

No response

@ndtmdsma ndtmdsma added the Type: Bug Something isn't working label Dec 12, 2023
@sam-heilbron
Copy link
Contributor

sam-heilbron commented Dec 12, 2023

Related issue: istio/istio#46321

We introduced Envoy 1.26 as part of: #8454

Internal slack context: https://solo-io-corp.slack.com/archives/G01EERAK3KJ/p1702392220039389

@sam-heilbron
Copy link
Contributor

Thanks for writing up this issue.

This is indeed a bug due to the changes in the Envoy 1.26 release, however it will only occur when using enableRestEds. You can see more details about how EDS can be configured in Gloo Edge here: https://github.com/solo-io/gloo/blob/main/projects/gloo/pkg/xds/utils.go#L13.

Our recommended approach is to rely on gRPC EDS instead of REST.

Is that a feasible solution for now? If so, can you confirm that when using gRCP EDS, that this NACK no longer occurs?

@ndtmdsma
Copy link
Author

Our recommended approach is to rely on gRPC EDS instead of REST.

Is that a feasible solution for now? If so, can you confirm that when using gRCP EDS, that this NACK no longer occurs?

Hi, I can confirm setting enableRestEds: false does allow successful Gloo upgrade to 1.15.17 and is a viable workaround in my case. Thanks @sam-heilbron !

Shall I close this issue, or do you want to keep it open until the root cause is fixed?

@nfuden
Copy link
Contributor

nfuden commented Dec 12, 2023

Glad to hear that it solves your issue. We also have moved to generally recommending against using resteds so hopefully you will also get a nice performance gain with this change!

That being said please do leave this issue open as we would like to fix it and update our tests to catch this proactively in the future.

@nfuden nfuden added Area: Envoy activities related to envoy proxy Good First Issue Good issue for newbies labels Dec 12, 2023
@nfuden
Copy link
Contributor

nfuden commented Dec 13, 2023

Alright we will be fixing this issue, updating docs around when to use useresteds (we have some comments on history and why we dont use it on the function but not in docs), and then will scope out what testing effort we would need to have confidence in this feature going forwards.

@ben-taussig-solo
Copy link
Contributor

ben-taussig-solo commented Dec 14, 2023

Update:

Remaining work

  • Create a PR to update extensions_build_config.bzl in envoy-gloo-ee 1.26
    • This should be essentially identical to the PR against envoy-gloo linked above
  • Pull envoy-gloo update into Gloo OSS
  • Pull envoy-gloo-ee update into Solo-projects
  • Update documentation to indicate that we do not recommend the use of REST EDS
  • Refactor existing tests against REST EDS so that we have confidence in the functionality
    • more on this in the next comment

@ben-taussig-solo
Copy link
Contributor

Scoping test refactoring

  • The happypath tests currently attempt to:
    • construct a slice of test cases
      testCases = []struct {
      Title string
      RestEdsEnabled *wrappers.BoolValue
      TransportApiVersion envoy_config_core_v3.ApiVersion
      }{
      {
      Title: "Rest Eds Enabled",
      RestEdsEnabled: &wrappers.BoolValue{
      Value: true,
      },
      TransportApiVersion: envoy_config_core_v3.ApiVersion_V3,
      },
      {
      Title: "Rest Eds Disabled",
      RestEdsEnabled: &wrappers.BoolValue{
      Value: false,
      },
      TransportApiVersion: envoy_config_core_v3.ApiVersion_V3,
      },
      }
    • create a ginkgo Describe block for each test case in the slice
      for _, testCase := range testCases {
      Describe(fmt.Sprintf("%s: (%s)", testCase.Title, testCase.TransportApiVersion.String()), func() {
  • This does not work as intended; by debugging I am able to confirm that the tests run twice, but the second test case (Rest EDS Disabled) is used each time
  • I think It should be straightforward to refactor these tests to use a DescribeTable structure in order to achieve the goal of running the same suite of tests with and without REST EDS enabled
  • However, it is possible that some complications may arise given that we have not successfully run these tests with REST EDS enabled in quite some time.
    • That being said, I expect that the envoy-gloo bump above should resolve most of these issues
  • I expect that this refactor should take around 1 day of work (I.E. T-shirt size S).

@ben-taussig-solo
Copy link
Contributor

ben-taussig-solo commented Jan 8, 2024

Update

  • I have gone through and refactored the 1.15.x happypath e2e tests in This PR: [1.15.x] Resolve rest EDS upgrade failure #8982
  • However, even after updating these tests to successfully evaluate the REST EDS enabled case, we still do not have a test that fails when using the pre-bump version of envoy
    • I have a section below diving further into the techincal/implementation details of this.
  • Moving forward, I need to:
    • Understand why the refactored tests do not fail when using the pre-bump version of envoy with REST EDS
    • Do one of the following:
      1. Write another test that fails when using the pre-bump version of envoy with REST EDS
        • This may potentially be a kube2e test if it is not feasible to evaluate REST EDS in e2e tests
      2. Update e2e test setup so that we see the expected failure when running the refactored tests with the pre-bump version of envoy
        • I am not yet sure what this would entail or if it is feasible

Implementation Details

	// set Type = EDS if we have endpoints for the upstream
	if eps, ok := upstreamRefKeyToEndpoints[upstream.GetMetadata().Ref().Key()]; ok && len(eps) > 0 {
		xds.SetEdsOnCluster(out, t.settings)
	}
						ClusterNames:        []string{defaults.GlooRestXdsName},
  • When I try to call xds.SetEdsOnCluster despite not having an endpoint mapped to the upstream ref from our e2e tests, the tests fail due to an unexpected error
  • My current suspicion is that for some reason the XDS cluster is not created during e2e tests.
  • Currently, I am looking into whether this cluster is created during kube2e tests to see if those are better suited for evaluating this behavior

@ben-taussig-solo
Copy link
Contributor

Update

Because of the complications with automated testing around this functionality, we have decided to proceed with the following two PRs:

REST EDS is not recommended for use, as it is less performant and does not offer any benefits over the default GRPC EDS. As we do not expect further enhancements to the way Gloo Edge handles REST EDS, we are proceeding with manual validation for the 1.15.x change

@sam-heilbron
Copy link
Contributor

sam-heilbron commented Jan 12, 2024

We're time boxing the remaining work on this issue to be completed by EOD Jan 12. The goal is to complete as much of this checklist as possible:

cc @ben-taussig-solo @SantoDE @nfuden

@ben-taussig-solo
Copy link
Contributor

Please note that Gloo Edge 1.16 and 1.17 are not susceptible to this issue, as they depend on envoy-gloo/envoy-gloo-ee 1.27, each of which has included the REST config subscription extension in their extensions build config for some time:

@ben-taussig-solo
Copy link
Contributor

Closing this issue as we have merged the relevant work. The fix will be available in Gloo Edge OSS 1.15.20 and Gloo Edge Enterprise 1.15.11

@ndtmdsma
Copy link
Author

ndtmdsma commented Mar 6, 2024

Thank you for the awesome work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Envoy activities related to envoy proxy Good First Issue Good issue for newbies release/1.15 Type: Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants