Issue: GID for group already in use by a different group #509

Open
adombeck opened this issue Sep 2, 2024 · 18 comments
Labels
bug Something isn't working high High importance issue jira

Comments

@adombeck
Contributor

adombeck commented Sep 2, 2024

Describe the issue

Reported by @jhaar on #496:

Secondly, once I went through the whole initialization process again for "[email protected]", it still fails - this time this shows up in the logs

authd[900]: 2024/08/26 13:31:57 ERROR GID for group "group1" already in use by group "group2"
authd[900]: 2024/08/26 13:31:57 WARN can't check authentication: failed to update user "[email protected]": GID for group "group1" already in use by a different group

Now that I can't help you with. Yes, those are two AD groups (if it matters, we are in hybrid mode, i.e. old enterprise AD hooked into Entra ID) that I am a member of - why they show up with the same GID is beyond me - they certainly show no evidence of problems outside of this authd event.

@adombeck adombeck added bug Something isn't working jira labels Sep 2, 2024
@adombeck
Contributor Author

adombeck commented Sep 2, 2024

I did some initial analysis. The error originates here:

// If a group with the same GID exists, we need to ensure that it's the same group or fail the update otherwise.
if existingGroup.Name != "" && existingGroup.Name != groupContent.Name {
slog.Error(fmt.Sprintf("GID for group %q already in use by group %q", groupContent.Name, existingGroup.Name))
return fmt.Errorf("GID for group %q already in use by a different group", groupContent.Name)
}

We set the GID here:

gidv := users.GenerateID(g.UGID)

and here:

defaultGroup := []users.GroupInfo{{Name: uInfo.Name, GID: &uInfo.UID}}

Both of these GIDs are generated by GenerateID as a number between 65536 and 100000. There is no check that the generated GID is not already in use, so it is to be expected that there will sometimes be a conflict. We could fix this by checking that the generated GID is not in use.

I wonder why we store the GID in our database at all. Wouldn't it be enough to store the group name and use addgroup/groupadd to add the group to /etc/group and let that pick a GID which isn't already in use?
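For illustration, the collision-prone scheme described above (an ID derived purely from a string and mapped into the 65536-100000 range, with no in-use check) can be sketched like this. The FNV-1a hash is an assumption for the sketch; it is not necessarily what authd's GenerateID uses:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	idMin = 65536
	idMax = 100000
)

// generateID sketches the scheme described above: the ID is derived purely
// from the input string and mapped into the 65536-100000 range, with no
// check against IDs already in use. FNV-1a is an assumption here; it is
// not necessarily the hash authd's GenerateID uses.
func generateID(ugid string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(ugid))
	return idMin + h.Sum32()%(idMax-idMin)
}

func main() {
	// Deterministic: the same UGID always maps to the same ID.
	fmt.Println(generateID("group1") == generateID("group1"))
	// But with only ~34k possible values, two different UGIDs can land on
	// the same ID, and nothing here detects that.
	fmt.Println(generateID("group1"), generateID("group2"))
}
```

Determinism is what makes the ID reproducible across machines, but it is also exactly why two distinct groups can collide.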

@didrocks
Member

didrocks commented Sep 2, 2024

There is no check that the generated GID is not already in use

Are you sure? The log message GID for group "group1" already in use by group "group2" is exactly this check, which can be done without any race (meaning, when we insert it into the database). This is the error that is reported, to avoid overwriting a group with another one.

I wonder why we store the GID in our database at all. Wouldn't it be enough to store the group name and use addgroup groupadd to add the group to /etc/group and let that pick a GID which isn't in use already?

So, adduser and addgroup are used for local users (storing in /etc/passwd, /etc/shadow and /etc/group). They can also delegate, with some patches, to other processes to insert into their databases. However, all "external directories", like LDAP (with samba bindings) or sssd for AD for instance, own and store their own info as we do. Those components (as authd does) then answer NSS requests for their respective users and groups.

Also, relying on adduser/addgroup for storing a user and group won't solve one of the use cases for directory handling: shared resources. If you have a disk shared with NFS, for instance, you need to map your UID/GID to the ones on disk for access permissions. This is why you need predictable UIDs and GIDs, ones that are shared across multiple machines and not generated per machine. (In a previous implementation, we just added one increment after another in case of collision, but we decided against that in this one and wanted to see if the collision rate was high.)

Also, we need to predict the UID even before it's created on disk for first SSH user access, which does a pre-flight check even before authentication. Returning a different UID would then puzzle sshd and result in various errors on first login.

It is for those reasons that we didn't go for +1 increments right away.

Out of curiosity, can you share the names of the groups with conflicting IDs? We can probably rework the generation algorithm to ensure we end up with fewer conflicts.

@adombeck
Contributor Author

adombeck commented Sep 2, 2024

Are you sure? The log message GID for group "group1" already in use by group "group2" is exactly this check, which can be done without any race (meaning, when we insert it into the database). This is the error that is reported, to avoid overwriting a group with another one.

What I meant is that GenerateID (or the function calling it) doesn't check if the generated ID is already in use on the system, in which case it should generate another one. The current check results in an unsuccessful login for reasons not caused by the user. We should be able to avoid that.

So, adduser and addgroup are used for local users (storing in /etc/passwd, /etc/shadow and /etc/group). They can also delegate, with some patches, to other processes to insert into their databases. However, all "external directories", like LDAP (with samba bindings) or sssd for AD for instance, own and store their own info as we do. Those components (as authd does) then answer NSS requests for their respective users and groups.

But don't we add the group to /etc/group here?

if err := runGPasswd(opts.gpasswdCmd[0], args...); err != nil {

Also, relying on adduser/addgroup for storing a user and group won't solve one of the use cases for directory handling: shared resources. If you have a disk shared with NFS, for instance, you need to map your UID/GID to the ones on disk for access permissions. This is why you need predictable UIDs and GIDs, ones that are shared across multiple machines and not generated per machine.

How does the current implementation get us predictable GIDs across multiple machines? IIUC, we generate the GIDs per authd instance, and we don't support using an authd instance from a different machine. What am I missing?

Also, we need to predict the UID even before it's created on disk for first SSH user access, which does a pre-flight check even before authentication. Returning a different UID would then puzzle sshd and result in various errors on first login.

What do you mean by "pre-flight even before authentication"?

Out of curiosity, can you share the names of the groups with conflicting IDs? We can probably rework the generation algorithm to ensure we end up with fewer conflicts.

I couldn't reproduce the error myself; it was reported in #496.

@didrocks
Member

didrocks commented Sep 2, 2024

But don't we add the group to /etc/group here?

No, we add remote users to local groups; those are groups already handled by the local group utilities. There is an option to make NSS still go through the other group membership sources even after it finds a matching one, but this option is not enabled on Ubuntu by default, and it's something we should explore (though we shouldn't do that without distro support).

How does the current implementation get us predictable GIDs across multiple machines? IIUC, we generate the GIDs per authd instance, and we don't support using an authd instance from a different machine. What am I missing?

The implementation computes the GIDs based only on the string passed to it: https://github.com/ubuntu/authd/blob/main/internal/users/manager.go#L418. There is nothing random. The proof is that the tests even assert the generated IDs, despite CI running on different machines.

What do you mean with "pre-flight even before authentication"?

sshd needs to ensure that a user potentially exists before going on with authentication. Otherwise, it raises a mock authentication (so that a remote attacker doesn't learn whether the user exists on the machine) with an unmatchable password.

To do this pre-check, it makes an NSS request even before the PAM authentication starts and expects a real passwd entry for the user, with the UID and user name matching the ones that will be returned later, after authentication.

However, on first login, we don't have such user information (we don't even know if the user exists). This is why we need a predictable ID, so that if authentication later succeeds, we can return matching user information.

@adombeck
Contributor Author

adombeck commented Sep 2, 2024

The implementation computes the GIDs based only on the string passed to it: https://github.com/ubuntu/authd/blob/main/internal/users/manager.go#L418. There is nothing random.

Ah, I didn't actually read the implementation and assumed it was generating a random ID 😅

This approach will inherently lead to UID/GID conflicts though. I don't see a way to both have a predictable UID/GID and avoid conflicts, so we should weigh up the benefits of a predictable UID/GID (the shared disk use case you mentioned - is there anything else?) and the drawbacks (unsuccessful logins).

@didrocks
Member

didrocks commented Sep 2, 2024

I don't see a way to both have a predictable UID/GID and avoid conflicts, so we should weigh up the benefits of a predictable UID/GID (the shared disk use case you mentioned - is there anything else?) and the drawbacks (unsuccessful logins).

Indeed, there is none (the shared disk and SSH are the two reasons). I think, though, that maybe we can come up with an algorithm that leads to fewer collisions? I wouldn't have expected to see one so fast, and that worries me, so there is probably a better possible implementation (which will require a transition plan).

@adombeck
Contributor Author

adombeck commented Sep 2, 2024

I'm not sure I correctly understood the SSH case, but the way I understand it, we should be able to solve it by generating a random UID ourselves until we generate one that doesn't yet exist on the system, no? That's still prone to a race, because the UID could have been taken by the time the user successfully authenticates and we actually try to create the user with that UID on the system, but that should be rare enough that it's fine to fail with a descriptive error message when it happens.
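The generate-and-retry idea could be sketched like this (a hypothetical illustration; isUsed, the ID range, and the retry limit are assumptions, not authd code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickFreeUID sketches the proposal above: draw random UIDs from the range
// until we find one that is not in use. isUsed stands in for a real lookup
// (authd's database plus the other NSS sources on the system); the range
// and the retry limit are illustrative assumptions.
func pickFreeUID(isUsed func(uint32) bool) (uint32, error) {
	const (
		idMin    = 65536
		idMax    = 100000
		maxTries = 1000
	)
	for i := 0; i < maxTries; i++ {
		uid := idMin + uint32(rand.Intn(idMax-idMin))
		if !isUsed(uid) {
			return uid, nil
		}
	}
	return 0, fmt.Errorf("no free UID found after %d attempts", maxTries)
}

func main() {
	// Simulate one already-taken UID; the retry loop skips past it.
	taken := map[uint32]bool{70000: true}
	uid, err := pickFreeUID(func(u uint32) bool { return taken[u] })
	fmt.Println(err == nil, uid != 70000)
}
```

This is still subject to the race described in the comment: the chosen UID could be taken by something else between generation and the actual creation of the user on the system.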

I think though that maybe we can come up with an algorithm that would lead to less collisions?

Given the small range of IDs we generate, I don't think there is any way to avoid collisions with an acceptable probability.

I wouldn’t have expected to see one so fast and that worries me

The fact that surprisingly few elements drawn from a relatively large set suffice to make a collision likely is known as the birthday paradox.
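A quick back-of-the-envelope check with the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)), using N = 100000 - 65536 possible IDs (this assumes the hash distributes inputs uniformly over the range):

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability applies the birthday-paradox approximation
// p ≈ 1 - exp(-n(n-1)/(2N)) for n IDs drawn uniformly from N values.
func collisionProbability(n int, space float64) float64 {
	return 1 - math.Exp(-float64(n)*float64(n-1)/(2*space))
}

func main() {
	const space = 100000 - 65536 // 34464 possible IDs, per the range above
	for _, n := range []int{50, 100, 220, 500} {
		fmt.Printf("%4d users/groups: %.0f%% chance of at least one collision\n",
			n, 100*collisionProbability(n, space))
	}
}
```

With roughly 220 users and groups combined, the chance of at least one collision already reaches about 50%, which is consistent with collisions showing up "so fast" in practice.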

@3v1n0
Collaborator

3v1n0 commented Sep 2, 2024

That's still prone to a race, because the UID could have been taken by the time the user successfully authenticated

Yeah, I feel that's a potential issue, but indeed we could potentially hold that UID until we're done, so that it's returned by NSS.

@martinKacenga

Experiencing the same problem. The logs:
Sep 09 15:41:44 ***** authd[1548]: 2024/09/09 15:41:44 ERROR GID for group "advascope users sg" already in use by group "všichni interní sg"
Sep 09 15:41:44 ***** authd[1548]: 2024/09/09 15:41:44 WARN can't check authentication: failed to update user "******@*****.**": GID for group "advascope users sg" already in use by a different group

@jibel jibel added the high High importance issue label Sep 17, 2024
@adombeck
Contributor Author

There is no check that the generated GID is not already in use

Are you sure? The log message GID for group "group1" already in use by group "group2" is exactly this check, which can be done without any race (meaning, when we insert it into the database). This is the error that is reported, to avoid overwriting a group with another one.

After taking a closer look, I don't think this check is enough to avoid UID/GID clashes, because we only check if the UID/GID is already stored in our database, but we don't check if it's in use by any other NSS sources on the system, right?
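A system-wide check would have to go through the standard NSS lookup rather than only authd's own database. As an illustrative sketch (not authd code), Go's os/user package can do such a lookup, consulting the system's configured NSS sources when cgo is available (and falling back to reading /etc/group otherwise):

```go
package main

import (
	"fmt"
	"os/user"
	"strconv"
)

// gidInUse asks the system (via os/user's group lookup) whether a GID is
// already taken by any group source, not just one database. This is an
// illustrative sketch of the kind of check discussed above.
func gidInUse(gid uint32) bool {
	_, err := user.LookupGroupId(strconv.FormatUint(uint64(gid), 10))
	return err == nil
}

func main() {
	// GID 0 is the root group on essentially every Linux system.
	fmt.Println(gidInUse(0))
}
```

Even this is not fully race-free, since another process could claim the GID between the check and the database insert, but it would catch conflicts with groups managed by other NSS sources.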

@mazer-ai

mazer-ai commented Oct 9, 2024

Am I understanding this conversation correctly to mean that different Linux cloud instances might end up with different assigned UIDs and GIDs for the same verified AAD user? If so, what happens when the home dir lives on an NFS-mounted file system, where the ACLs are based on the UID and GID? Couldn't the user end up not owning their own home directory?

@adombeck
Contributor Author

adombeck commented Oct 9, 2024

Am I understanding this conversation correctly to mean that different Linux cloud instances might end up with different assigned UIDs and GIDs for the same verified AAD user? If so, what happens when the home dir lives on an NFS-mounted file system, where the ACLs are based on the UID and GID? Couldn't the user end up not owning their own home directory?

Yes, that's correct. We're exploring ways to support shared network resources like NFS which rely on UIDs and GIDs for access, but for now that's a known limitation. We plan to add that to the limitations section of the authd wiki.

@mazer-ai

Thanks -- that's what I was afraid of... Any suggestions or advice on temporary workarounds? This is a killer for trying to use authd to facilitate centralized logins with a bunch of Azure Linux instances. The only solution we've come up with so far is including a list of manually maintained adduser commands in the Packer setup when building the VM images. But this is not really sustainable and requires a rebuild each time a new user is added...

adombeck added a commit that referenced this issue Oct 10, 2024
Removing users from the database allows other users for which the same
UID is generated (which can happen, see #509) to log in and gain access to
the deleted user’s home directory.

UDENG-4658
@jhaar

jhaar commented Oct 11, 2024

If authd randomly generates UIDs & GIDs, why not instead have some function that takes fully qualified Azure AD usernames and groups as inputs and generates UIDs/GIDs from their (numeric) hash? That way all authd instances will always generate the same UIDs/GIDs without needing a central database.

For security/privacy reasons, an improvement would be for each org to set an "org_seed" value - which gets prepended to the username/groupname - thus generating the same UID/GID across all installs with the same "org_seed" - but being undiscoverable by anyone else? Can't think of an attack vector - but that just means I don't have the imagination ;-)

@adombeck
Contributor Author

why not instead have some function that takes fully qualified Azure AD usernames and groups as inputs and generates UIDs/GIDs from their (numeric) hash? That way all authd instances will always generate the same UIDs/GIDs without needing a central database.

That's what authd is currently doing (see the implementation of the generateID function), but it leads to the problem this GitHub issue is about: collisions, because UIDs and GIDs can only be in the range 1 to 2**32, which is too small. Currently, when authd notices a UID/GID collision, it aborts the login attempt. That's clearly not ideal, so we plan to change it in one of the next releases. It's not clear yet how. One approach is to generate random IDs until we find one that's unique; another approach we're exploring is to store the IDs in the Entra directory.

@mazer-ai

I think random IDs are going to be problematic, even if the computation is deterministic. Say Sally and Bob both hash to the same UID; then on the same machine, who gets assigned the next random unique UID will vary depending on login order, and on two different machines both could potentially get the same UID assigned, since the machines don't communicate that a UID has already been taken...

I think you'd have to scan the entire list of users and generate a complete list of UIDs in order of account creation to be sure that this was completely deterministic and UIDs never change once the account is created... I guess if you can access that data, you could also just assign UIDs to everyone, starting at some fixed number, based on the order in which users were added to the org...

Storing a UID in the Entra directory, if possible (I don't know enough about what's possible with AAD), seems like it would be safer. Then it's on the org to avoid collisions when adding new users. I guess you'd have to do the same for groups...

@bill-taut

Agree with @mazer-ai. If you resolve collisions the way you describe, @adombeck, then the same user (or group, re the topic of this ticket) on different machines could have different IDs, which will lead to permission inconsistencies on shared network-mounted drives. Storing IDs in Entra could work.

@adombeck
Contributor Author

Agree with @mazer-ai. If you resolve collisions the way you describe, @adombeck, then the same user (or group, re the topic of this ticket) on different machines could have different IDs, which will lead to permission inconsistencies on shared network-mounted drives.

We're aware of that. That's why I wrote above that we don't support shared network resources like NFS at the moment. It's now also mentioned in the limitations section of the wiki.


8 participants