feat: Option to never delete group (upsert only) #70
Yes please... this has happened to us 2-3 times in the last 45 days or so.
hi @guilhermeblanco @jdonohoo, sorry to hear about that. I'll see if I can implement some workaround for this Google API problem. The thing here is the
But anyway, let me see if I can do something to mitigate this one.
@christiangda I understand the concept of the state file and workspaces being the source of truth, but APIs do go down. When the IdP sync Lambda runs during a Google outage/error, instead of doing nothing it wipes out the group that errored. So the source of truth fails to be the source of truth, and it deletes groups from AWS SSO. When Google stops failing, it recreates the group, but all those AWS account assignments still carry the old group principalId that no longer exists, so your users lose all permissions. The newly created group then has zero account assignments, so you have to go reattach everything. I ended up writing a PowerShell script that runs as a GitHub Actions cron job to clean up orphaned account assignments and rebuild them for all the AWS accounts. I removed obvious secrets, but you should get the general idea.
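The PowerShell script itself is omitted above, but the core of the cleanup can be sketched. The following is a hypothetical reconstruction of the idea in Go, not the actual script: given the set of group principal IDs that currently exist and the account assignments recorded before the outage, find the assignments whose principal no longer exists and would need to be deleted and rebuilt against the new group IDs. The type and function names (`Assignment`, `findOrphans`) are illustrative, not part of idp-scim-sync.

```go
package main

import "fmt"

// Assignment ties a group principal to an account and permission set,
// mirroring an AWS SSO account assignment.
type Assignment struct {
	PrincipalID   string
	AccountID     string
	PermissionSet string
}

// findOrphans returns the assignments whose group principal no longer
// exists, i.e. the ones that must be recreated against the new group ID.
func findOrphans(assignments []Assignment, liveGroupIDs map[string]bool) []Assignment {
	var orphans []Assignment
	for _, a := range assignments {
		if !liveGroupIDs[a.PrincipalID] {
			orphans = append(orphans, a)
		}
	}
	return orphans
}

func main() {
	// "g-old-999" was deleted and recreated as "g-new-123" during the outage.
	live := map[string]bool{"g-new-123": true}
	recorded := []Assignment{
		{PrincipalID: "g-old-999", AccountID: "111111111111", PermissionSet: "AdministratorAccess"},
		{PrincipalID: "g-new-123", AccountID: "111111111111", PermissionSet: "ReadOnlyAccess"},
	}
	for _, o := range findOrphans(recorded, live) {
		fmt.Printf("orphaned: %s on %s (%s)\n", o.PrincipalID, o.AccountID, o.PermissionSet)
	}
}
```

In the real script the live IDs and recorded assignments would come from the Identity Store and SSO Admin APIs rather than in-memory literals.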
Hi @christiangda, I completely understand your point of view. Google is the source of truth, and in an ideal world it could be trusted 100%. But unfortunately, system designs cannot guarantee 100% uptime and accuracy, which is why they are advertised as 99.99% reliable (the Google Directory API offers four nines, not five). I want to illustrate exactly what happens during a Google outage, so you understand how painful it is to recover and why a flag to prevent this chaos wouldn't hurt. Apologies for this one being so far in the past, but it was the incident we investigated most thoroughly. Here are the internal incident report notes I want to share:
This continues for several minutes, until eventually the API comes back. This is what we experience:
Subsequent calls keep alternating the groups and users returned, further deleting groups (even newly created ones) and users.

Now the challenge comes when we talk about AWS SSO groups. In AWS SSO, we assign the group, account and permission set together, saying that Administrator has AdministratorAccess over account XXXXXXXX. It also goes further, as:

To resolve this issue, we updated our IaC logic to rely on group names instead of their IDs: fetch the group by name, then use it to perform the account assignment. This lets us re-assign the permissions just by running terraform apply again. That step is quick (the same as @jdonohoo illustrated) and takes roughly 5 minutes. After that, we need to go into every account, re-map all the IAM roles that AWS SSO reprovisioned into the assigned accounts, and run IaC again to update the mapped roles in EKS. Finally, we need to go into the Reporting account, remap all the reprovisioned IAM roles there too, and run IaC again to grant our data staff access to the Data Catalog components.

As you might have seen, if this had happened once it wouldn't be so concerning, but since it is now happening almost every week, it's a call-to-action issue on our end. It takes someone from the infrastructure team roughly 4 hours to address everything, and it gets more painful as we keep adding new job functions (read: groups), assignments (new accounts for developers or other teams) and users (we are still rolling out VPN across the organization), rapidly expanding our usage of and reliance on this portion of the system. This sprint we have a task to ensure this is resolved. We would appreciate it if you could help us resolve it sooner. Our only other alternative is to fork and maintain the project indefinitely, which I don't think is best for either of us.
=) Sorry for the long post; I wanted to show you exactly how painful it is, so you understand the motivation behind this ask.
@guilhermeblanco we also use EKS, and are living through the pain in real time. Right now our users have no cluster access to EKS until someone in our ops group goes and redoes the mappings like you described. This is a total nightmare; we are at the same point of considering forking, which seems like the wrong approach for everyone.
hi @guilhermeblanco, @jdonohoo thank you for the detailed information and explanation. Now I understand your problem very well. I'm checking the code to figure out how to mitigate this, but I will need your help. Let me first explain a little about the main workflow - png / main workflow - html of the program
So, having said that:
Please check the main workflow - png and let me know if you understand me. IMPORTANT:
Maybe the version of the program is the problem! I don't have any of these issues in my implementation in production, and I manage more than
Hi @christiangda! I completely understand you. My ask is to keep the ticket open for now, until I either come back in 30 days to close it, or the issue happens again and we can continue the conversation. In the meantime, rest assured I'll have all eyes on this. As for numbers, we have reached 91 groups and 125 users at the moment. I am holding off rolling this out organization-wide until we ensure my reported issue no longer occurs. The expectation is to have a similar number of groups, but over 500 users. Thanks for your help and dedication to this project! =)
This feature would have been useful to have for another reason - I just tried to use this project on an SSO installation that had a bunch of manually created groups. It ended up deleting all of them, which I did not expect. In a situation where you are transitioning to using groups provisioned by SCIM it would be nice to be able to keep any manually-created groups until they could be fully deprecated. |
Hi @obscurerichard! You are absolutely right, this feature would fit nicely. |
Hi @christiangda! As I mentioned, I'd closely monitor these executions and report back on the next Google failure. Today at 16:40 EST we had another episode. Personally, I think it is time to have this supported, and I'll gladly review your flowchart again and dedicate some time to getting this implemented. A PR should come your way within the next few days, but please be patient, as I am not a Go developer. All I'd ask is that you help me ensure the solution is robust and pick the best flag name to add. So far the best I could think of is
I'm happy to share the logs below of how it happened:
Adding to the issue: even though the Google API seems to be back to normal, the groups are not being created anymore. Once the Google API recovered, the program attempted to reconcile the groups, and since the state hash code is the same as the Google hash code, it never attempts to create the removed groups. My only resort was to remove the state file; it then imported the groups and users, but it is now hitting the Lambda execution time limit of 15 min. If we broke the state file down into groups, users and members, it would likely work. Here is where it gets stuck for a long time...
Thinking about this problem overnight: the state should be split into groups, users and one file per group membership. I managed to get this back to an operational state at the 14-minute mark, but I had to deactivate (comment out code, compile, upload, execute) each one of the steps; the group membership sync alone took 14 minutes. If we split the files as mentioned, it would be possible to paginate and sync group members on a per-page basis, storing the state per group member. The sync could then easily skip the ones already mapped, and consequently resume in case an execution hits the time limit.
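The skip-and-resume idea above can be sketched in Go. This is a hypothetical illustration, not idp-scim-sync's actual state handling: an in-memory map stands in for the proposed per-group state files, and checkpointing after each group means a timed-out run loses at most one group's progress. The names (`groupState`, `syncMemberships`) are made up for the example.

```go
package main

import "fmt"

// groupState records the last synced membership hash for each group,
// standing in for the proposed one-state-file-per-group layout.
type groupState map[string]string

// syncMemberships walks the groups, skips any whose membership hash is
// unchanged since the last run, and returns how many actually needed work.
// Checkpointing state after each group lets a later run resume where a
// timed-out one stopped.
func syncMemberships(groups map[string]string, state groupState) int {
	synced := 0
	for id, hash := range groups {
		if state[id] == hash {
			continue // already up to date; skipped cheaply on resume
		}
		// ... perform the membership sync for this group here ...
		state[id] = hash // checkpoint now, so a timeout loses at most one group
		synced++
	}
	return synced
}

func main() {
	state := groupState{"devs": "h1"} // "devs" was synced by a previous run
	groups := map[string]string{"devs": "h1", "ops": "h2"}
	fmt.Println(syncMemberships(groups, state)) // only "ops" needs syncing
}
```

In a real Lambda the state map would be loaded from and persisted to S3 between invocations rather than held in memory.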
If the batch of changes grows too large, this will inevitably exceed the 15-minute Lambda execution time. It might also be worth exploring an async architecture, where a batch of changes gets chunked into messages and sent via SQS to a Lambda function that processes them in batches.
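The producer side of that fan-out can be sketched. This is a minimal illustration of the chunking step only, with made-up names; sending via the AWS SQS SDK and the consumer Lambda are left out.

```go
package main

import "fmt"

// chunk splits a batch of change IDs into fixed-size message bodies, the way
// a producer Lambda might fan work out over SQS to a consumer Lambda.
func chunk(changes []string, size int) [][]string {
	var batches [][]string
	for size > 0 && len(changes) > 0 {
		n := size
		if n > len(changes) {
			n = len(changes)
		}
		batches = append(batches, changes[:n])
		changes = changes[n:]
	}
	return batches
}

func main() {
	changes := []string{"c1", "c2", "c3", "c4", "c5"}
	for i, b := range chunk(changes, 2) {
		// Each batch would become one SQS message; the consumer Lambda then
		// processes a bounded amount of work per invocation.
		fmt.Println("message", i, b)
	}
}
```

Each message then stays comfortably inside one consumer invocation's time budget, regardless of how large the overall diff is.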
I'd be very excited to see this feature come through. I've had it happen a couple times in the last 2 weeks. |
Just a drive-by comment from someone who is evaluating solutions for the lack of SCIM support between AWS and Google Workspace: maybe instead of deleting immediately, you could mark a group as a candidate for deletion with a due date (say, 3 days out), remove the mark when/if the Google API gets back to its old self, and actually delete only after the mark expires.
Hello, we experienced the same issue with Google on May 5th. One group got deleted and recreated from scratch. However, since the Terraform was not re-applied, no role was attached to the new group and employees were not able to get in. Fortunately, this did not affect the people who were on call. Since we delegate role attribution to Terraform, we would also prefer delegating group creation/deletion to Terraform. In "--sync-method", there is only one value, "groups"; could we add "users" too (so we can synchronize only the users that belong to the groups specified in the configuration)? Best,
Is your feature request related to a problem? Please describe.
I'm always frustrated when Google APIs decide to barf and exclude several groups on AWS SSO side.
Google APIs have failed multiple times in the last 5 weeks. When this happens, the API consistently returns 502 for several minutes (idp-scim-sync handles that nicely!), and once it comes back, for another minute it wildly returns only a subset of the existing groups before returning to normal.
When the recovery happens, all AWS SSO groups are re-created, but the permission sets in AWS SSO still reference the previously existing groups, forcing us to remap the group-to-account permission assignments all over again. This process takes ~20 minutes via Terraform, since each permission set also creates new IAM roles that need to be correlated to EKS roles.
Describe the solution you'd like
Introduce a new flag that prevents previously created groups from being removed, allowing them only to be added or modified.
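Mechanically, such a flag could simply filter delete operations out of the computed sync plan before it is applied. The sketch below is hypothetical (the thread never settled on a flag name, and `op`/`filterPlan` are illustrative, not idp-scim-sync internals):

```go
package main

import "fmt"

// op is one planned sync action against AWS SSO.
type op struct {
	Kind  string // "create", "update", or "delete"
	Group string
}

// filterPlan drops delete operations when group deletion is disabled,
// leaving the sync upsert-only: groups are created or updated, never removed.
func filterPlan(plan []op, allowDelete bool) []op {
	if allowDelete {
		return plan
	}
	kept := plan[:0:0]
	for _, o := range plan {
		if o.Kind != "delete" {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	plan := []op{{"create", "devs"}, {"delete", "ops"}, {"update", "sre"}}
	// With deletion disabled, the spurious delete from a flaky API response
	// never reaches AWS SSO.
	for _, o := range filterPlan(plan, false) {
		fmt.Println(o.Kind, o.Group)
	}
}
```

The cost of this approach is that genuinely removed groups accumulate in AWS SSO until cleaned up manually, which matches the trade-off discussed above.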
Describe alternatives you've considered
An alternative solution: when the number of groups diverges sharply (comparing the state file's groups to the groups Google returns), attempt to re-fetch the groups up to 3 times, and then proceed with the operation.
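That retry heuristic can be sketched as follows. Everything here is illustrative: the "shrank by more than half" divergence threshold is an assumption, not something specified above, and `tooDivergent`/`fetchWithRetry` are made-up names.

```go
package main

import "fmt"

// tooDivergent reports whether the fetched group set shrank suspiciously
// compared to the state file; here, by more than half (an assumed threshold).
func tooDivergent(stateCount, fetchedCount int) bool {
	return fetchedCount*2 < stateCount
}

// fetchWithRetry re-fetches the groups up to maxRetries times while the
// result diverges wildly from the recorded state, then returns the last
// result along with the number of attempts made.
func fetchWithRetry(fetch func() []string, stateCount, maxRetries int) ([]string, int) {
	var groups []string
	attempts := 0
	for attempts < maxRetries {
		groups = fetch()
		attempts++
		if !tooDivergent(stateCount, len(groups)) {
			break // the response looks plausible; proceed with the operation
		}
	}
	return groups, attempts
}

func main() {
	// Simulate Google returning a partial list twice, then recovering.
	responses := [][]string{{"devs"}, {"devs"}, {"devs", "ops", "sre", "data"}}
	i := 0
	fetch := func() []string { g := responses[i]; i++; return g }
	groups, attempts := fetchWithRetry(fetch, 4, 3)
	fmt.Println(len(groups), attempts) // recovered on the third attempt
}
```

If all retries still return a divergent list, the sync proceeds with the last result, so this only mitigates short flapping windows; it does not replace the upsert-only flag for longer outages.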