Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uuid #382

Open
qurbonzoda opened this issue Jul 23, 2024 · 31 comments
Open

Uuid #382

qurbonzoda opened this issue Jul 23, 2024 · 31 comments

Comments

@qurbonzoda
Copy link
Contributor

qurbonzoda commented Jul 23, 2024

This issue is for discussion of the proposed Uuid API. The full text of the proposal is here.

The API is already available in Kotlin 2.0.20-Beta2 and later, and is marked with @ExperimentalStdlibApi.

@axelfontaine
Copy link

As Uuid really is a base building block of data models, the package dependencies really matter. For the JVM, instead of having uuid depend on nio, it should really be the other way around!

@joffrey-bion
Copy link

joffrey-bion commented Jul 23, 2024

I do like that, by default, Uuid.toString() and Uuid.parse() rely on the RFC hex-and-dash format.

However, toString is initially meant for human readability. From the Javadoc of Object.toString(): "The result should be a concise but informative representation that is easy for a person to read". For this reason, I generally don't trust toString() as a reliable and stable way to serialize data (for any class), because it's not part of its contract.

I would find it nice to have Uuid.toRfc9562() or Uuid.toRfc9562String() in the API for the purpose of actually serializing a UUID for program consumption (and not human). toString() would of course use that behind the scenes because the RFC format is also nice for humans. As a side note: we do have Duration.toIsoString() for Kotlin durations. (Admittedly, durations have a different "natural" representation for humans, which means toString() was already "taken", but it kinda proves the point).

@lukellmann
Copy link
Contributor

As Uuid really is a base building block of data models, the package dependencies really matter. For the JVM, instead of having uuid depend on nio, it should really be the other way around!

If I understood this correctly, only the convenience extension functions for reading from / writing to ByteBuffers depend on NIO, not the Uuid class itself.

@joffrey-bion
Copy link

joffrey-bion commented Jul 23, 2024

@axelfontaine also, what exactly do you mean by "package dependencies"? It's not like there is a maven dependency (nio is in the Java runtime). Are you talking about Java 9 modules and the kotlin.uuid package requiring things? What exact problem are you worried about?

@adam-enko
Copy link
Member

adam-enko commented Jul 23, 2024

I would prefer the name to be UUID over Uuid.

It is consistent with existing names Base64.UrlSafe and Base64.Mime names.

Mime and UUID are both abbreviations, but the former is an acronym, and the later is an initialism.

  • 'Mime' is an acronym, and is pronounced as a single word. JSON is another acronym.
  • 'UUID' is an initialism, because each letter is pronounced individually. So is URL.
    (Although maybe I'm wrong and people do say 'oooo-eye-dee' and 'uhrrrll'(!)).

I can see how UrlSafe could be a counterpoint (should it be URLSafe?), but it follows the official Kotlin naming conventions.

When using an acronym as part of a declaration name, capitalize it if it consists of two letters (IOStream); capitalize only the first letter if it is longer (XmlFormatter, HttpInputStream).

The visual difference between UUID/Uuid is confusing when looking at the Kotlin/Java converter helper:

public fun Uuid.toJavaUuid(): java.util.UUID

Shouldn't the extension fun Uuid.toJavaUuid() instead be named fun Uuid.toJavaUUID(), to match the Java naming scheme?

  • It helps disambiguate whether the Java or Kotlin UUID type is being used without looking at imports.

This same confusion happens with other types (I've encountered this Random/Random). But those cases are exceedingly rare, and Kotlin provides tools to easily deal with them (named imports, or typealiases, or just using the FQN.)

Given that the confusion only happens in Kotlin/JVM, should other Kotlin targets have an unusual UUID name because of some other Kotlin target?

Maybe the confusion be helped by introducing official type aliases?

// kotlin-stdlib/src/jvmMain/kotlin/uuid/uuid-helpers.kt

typealias UuidKt = kotlin.uuid.UUID
typealias UuidJdk = java.util.UUID

@JakeWharton
Copy link

JakeWharton commented Jul 23, 2024

Mime and UUID are both abbreviations, but the former is an acronym, and the later is an initialism.

  • 'Mime' is an acronym, and is pronounced as a single word. JSON is another acronym.

  • 'UUID' is an initialism, because each letter is pronounced individually. So is URL.
    (Although maybe I'm wrong and people do say 'oooo-eye-dee' and 'uhrrrll'(!)).

I am definitely opposed to using pronunciation as a guiding principle on casing for a few reasons:

  1. Initialisms are a subset of acronyms and not disjoint, so by eliminating their special casing (heh) we eliminate an extra rule to remember.
  2. Many acronyms do not have universal agreement or canonical definition as whether they're initialisms or not (e.g., SQL is "S-Q-L" and "Sequel"). This also can vary as to whether English is your first language or not.
  3. It is impossible to tool this differentiation for static analysis except by continually chasing an exhaustive list of initialisms.
  4. Sibling and related acronyms may disagree on initialism, and thus would require different naming conventions (e.g., GUID is typically "goo-id" and so by your rules would be required to be named Guid despite both being rooted in "unique identifier").
  5. Adjacent initialisms in a name offer no visual delineation between them creating confusion for those unfamiliar with one or both terms. Casing them like words eliminates this problem.

@OliverO2
Copy link

I suggest to replace

  • Uuid.parse with Uuid.fromHexDashString,
  • Uuid.parseHex with Uuid.fromHexString.

This would result in

Consistency example:

val uuid2 = Uuid.fromULongs(0x550E8400E29B41D4uL, 0xA716446655440000uL)
val uuid3 = Uuid.fromHexDashString("550e8400-e29b-41d4-a716-446655440000")

I'd also support replacing

  • Uuid.toString with Uuid.toHexDashString

for the variant guaranteeing a specific format.

(Uuid.toString can still have the same implementation, just without the stability guarantees for machine readability.)

@joffrey-bion
Copy link

Admittedly I prefer toHexDashString() over my initial toRfc9562String() proposal 😆

@Amejonah1200
Copy link

I'm against using from for parsing strings, because UUIDs are 126bit long numbers which have a representation as a string (serialized as) which can be parsed later on.

@lppedd
Copy link

lppedd commented Jul 23, 2024

I was looking at the compiled JS code, and I think using Long is a pretty bad choice for the JS target, especially when performing bit operations. This is the code for fromByteArray, for example:

fromByteArray_d3r0u1_k$(byteArray) {
  if (!(byteArray.length === 16)) {
    var message = 'Expected exactly 16 bytes';
    throw IllegalArgumentException.new_kotlin_IllegalArgumentException_f8t9r5_k$(toString_1(message));
  }
  return this.fromLongs_uu1aj7_k$(toLong_1(byteArray, 0), toLong_1(byteArray, 8));
}

The toLong_1(byteArray, index) function, which corresponds to:

private fun ByteArray.toLong(startIndex: Int): Long {
    return ((this[startIndex + 0].toLong() and 0xFF) shl 56) or
            ((this[startIndex + 1].toLong() and 0xFF) shl 48) or
            ((this[startIndex + 2].toLong() and 0xFF) shl 40) or
            ((this[startIndex + 3].toLong() and 0xFF) shl 32) or
            ((this[startIndex + 4].toLong() and 0xFF) shl 24) or
            ((this[startIndex + 5].toLong() and 0xFF) shl 16) or
            ((this[startIndex + 6].toLong() and 0xFF) shl 8) or
            (this[startIndex + 7].toLong() and 0xFF)
}

will create more than 50 Long instances to perform those bit operations.
Not to mention the amount of function calls and those 50+ undefined checks because of the lazy initialization of the companion object and its properties.

@bcmedeiros
Copy link

Curiosity question, is there anything preventing Uuid to be a value class?

@CLOVIS-AI
Copy link

Curiosity question, is there anything preventing Uuid to be a value class?

Currently, value classes can only have a single field (see #340). Based on, what would its field be?

  • No primitives in Kotlin are 128 bits long.
  • Using a String would be wasteful of memory (36 bytes in hex-and-dash instead of the optimal 16 bytes)
  • Using an array of longs will have higher overhead than a regular class with two fields

Also, with a value class, the representation must be exactly the same on all platforms and is part of the public API, whereas with regular classes, it can be kept as an implementation detail of each platform, and change if performance problems are found.

@JakeWharton
Copy link

Using an array of longs will have higher overhead than a regular class with two fields

A small nit: on the JVM, at least, this is not true. Let's assume a 64-bit JVM running with compressed OOPs (the default).

  • An object with two long fields will have a 12-byte header plus two 8-byte fields for 28 bytes. This will be padded to 32 bytes total, as a multiple of 8 is required.
  • A two-element long array will have a 16-byte header plus 2*8 bytes for the elements for 32 bytes.
  • A sixteen-element byte array will have a 12-byte header 16 bytes for the elements for 28 bytes. This will be padded to 32 bytes total, as a multiple of 8 is required.

(I included a byte array version because you wouldn't want to use a long array in general due to its abysmal representation on JS)

So really no matter what you do you're allocating 32 bytes.

A value class, however, will require boxing at certain points (such as when used in a generic context) which will be a 16-byte allocation each time (12-byte header + 4-byte object reference).

You can validate all of this yourself using JOL.

Although having written all this maybe you were referring to runtime bounds checking on the array? There's usually ways to convince the JVM compilers and ART on Android to eliminate/minimize them, but for that I'd need to break out JMH and other tools. I don't think a value class is really being considered here, so I'm not going to bother right now.

@CLOVIS-AI
Copy link

I did mean runtime memory usage, but I forgot about padding, so you're completely right that it doesn't make a difference here. Thanks for reminding me!

@bcmedeiros
Copy link

I don't think a value class is really being considered here, so I'm not going to bother right now.

I wonder why that is, besides having to settle on a common backing field for all platforms, which might take some effort, a value class makes a lot of sense to me.

@JakeWharton
Copy link

I think the backing property type is the largest part of that discussion. It would probably mean this waiting for ByteString to move from kotlinx.io to the stdlib and to be stable itself. But ultimately you likely wind up with an object whose memory footprint matches that of Uuid being a normal class with 16 bytes of data in some storage format. And having a full class means you never have deal with the boxing problem. If Kotlin had 128-bit primitive that had efficient representation on every platform I think the argument would be much more strong.

@qurbonzoda
Copy link
Contributor Author

qurbonzoda commented Aug 2, 2024

Shouldn't the extension fun Uuid.toJavaUuid() instead be named fun Uuid.toJavaUUID(), to match the Java naming scheme?

This does make sense to me, given that the emphasis is on conversion to the specific type. When considering the conversion to Java's representation of the Universally Unique Identifier (UUID), the name toJavaUuid() works as well.
However, I think it is better to rename it to toJavaUUID().

@bcmedeiros
Copy link

Sorry I got a little bit late to this KEEP discussion, but another thing I just raised in https://youtrack.jetbrains.com/issue/KT-31880#focus=Comments-27-10312473.0-0 is the name choice for the random() function.

I found out about this new type looking for a UUID v7 Kotlin implementation, actually, and I'm willing to eventually even contribute with some code as I had to roll my own too many times already. If/when we do have a v7 generator function, how are we calling it? Why don't we just use the vX() notation as the spec has very clearly defined versions?

The KEEP mentions:

The popularity of each UUID version was also explored. Our findings indicate that in approximately 90 percent of cases users generate a UUID version 4 (random). Excluding time-based UUIDs, this figure rises to over 97 percent.

While this is true, I think we are getting things backwards here. In my experience with UUIDs, I think the lack of a standard trustworthy way of generating other versions of UUIDs other than v4 is the reason many people end up using v4 more than they "should".

For databases, for example, the impact of using v4 instead of v7 on index re-balancing is at least non-negligible (quick example, there are many more).

Not saying v7 or any other version should be a blocker for a initial release, but I think we should aim for excellence in the standard library (as the team has been doing, I think the Kotlin standard library is great) and that would include 2 things in regards to UUIDs:

  • use a naming convention that would allow other versions to be implemented without creating confusion (v7 also has a random component, so one sees a random() function and it might be confusing)
  • create some tickets to plan the next versions, I think v7 and v5 are the most important ones after v4 (I can help with this one if there is enough support)

@qurbonzoda
Copy link
Contributor Author

qurbonzoda commented Aug 14, 2024

From the Javadoc of Object.toString(): "The result should be a concise but informative representation that is easy for a person to read". For this reason, I generally don't trust toString() as a reliable and stable way to serialize data (for any class), because it's not part of its contract.

I would find it nice to have Uuid.toRfc9562() or Uuid.toRfc9562String() in the API for the purpose of actually serializing a UUID for program consumption (and not human). toString() would of course use that behind the scenes because the RFC format is also nice for humans. As a side note: we do have Duration.toIsoString() for Kotlin durations. (Admittedly, durations have a different "natural" representation for humans, which means toString() was already "taken", but it kinda proves the point).

For Uuid.toString(), the documentation explicitly specifies that the returned string is in the standard hex-and-dash format. Thus it becomes part of its contract.
I am not sure if adding a new function that duplicates toString() functionality won't cause confusion, both in writing and reviewing code. Theoretically, it might cause development teams to have a guideline about whether to use Uuid.toHexDashString() or Uuid.toString() in their projects, which adds an additional complexity to development process.

@qurbonzoda
Copy link
Contributor Author

qurbonzoda commented Aug 14, 2024

If/when we do have a v7 generator function, how are we calling it? Why don't we just use the vX() notation as the spec has very clearly defined versions?

I understand the concern about adopting a consistent naming convention for the future generator functions.
It's important to recognize, however, that not every developer using UUID has read the RFC or understands how each version differs from the others. Since v4 is the most popular UUID, it was decided to generate it with a name that is familiar to users.

While this is true, I think we are getting things backwards here. In my experience with UUIDs, I think the lack of a standard trustworthy way of generating other versions of UUIDs other than v4 is the reason many people end up using v4 more than they "should".

For databases, for example, the impact of using v4 instead of v7 on index re-balancing is at least non-negligible (quick example, there are many more).

Python, Go, and Rust can generate v1-v5 UUIDs, and we have observed a similar trend regarding UUID version popularity in these languages as well.

While I share your enthusiasm for v7 and believe it will gain wider adoption, as described in the KEEP, we see technical limitations in correctly generating it within the stdlib. I believe that database systems and frameworks are the appropriate places for generating v7, as they can handle clock rollbacks and guarantee the monotonicity and consistency of generated UUIDs within the system.

Could you please elaborate on how you would handle clock rollbacks and system restarts within the stdlib ? Or do you think ensuring monotonicity within a single program session would be enough?

@OliverO2
Copy link

For Uuid.toString(), the documentation explicitly specifies that the returned string is in the standard hex-and-dash format.

In the KEEP, it does not specify in-place what "standard" means, but just: "Returns the standard string representation of this uuid."

Thus it becomes part of its contract.

This contract would deviate from learned expectations. It seems unusual to use toString() for what's basically a form of serialization.

I am not sure if adding a new function that duplicates toString() functionality won't cause confusion, both in writing and reviewing code.

Given:

  • toHexDashString() would have the strong contract, express this via its name and nicely mirror fromHexDashString().
  • The duplication in toString() would be just an implementation coincidence. Its documentation could even state explicitly that its rendering makes no guarantees.
  • The IDE would by default always prioritize toHexDashString() in suggestions.

Why would there be a higher risk compared to the level of possible confusion we could always presume?

I'd expect even more confusion with toString() as we might see code using "$uuid".

@qurbonzoda
Copy link
Contributor Author

Please find the full documentation for Uuid on GitHub. The full documentation was not included in the KEEP document to keep it concise.

@bcmedeiros
Copy link

Python, Go, and Rust can generate v1-v5 UUIDs, and we have observed a similar trend regarding UUID version popularity in these languages as well.

I'm not trying to convince us to implement v7 right now, I'd like that but I understand we shouldn't spend time now if there's not enough demand. I just want to make sure we use an extensible and coherent naming convention and those links you posted are just another argument against random().

Python: uuid.uuid4()
Rust: Uuid::new_v4()

Go created a NewRandom(), but then used NewV7() for the v7 which is inconsistent. I guess they wanted to avoid deprecation, which is exactly what I'm trying to avoid here by not picking random() to begin with.

If I recall correctly, JS and PostgreSQL also have v4 in their method names.

I understand the concern about adopting a consistent naming convention for the future generator functions. It's important to recognize, however, that not every developer using UUID has read the RFC or understands how each version differs from the others. Since v4 is the most popular UUID, it was decided to generate it with a name that is familiar to users.

Allow me to politely disagree on this, @qurbonzoda. If a developer decides to use UUIDs without basic knowledge about what that type is, there's very little we can do for them. Also, if anything, that's an argument against random(), as this may trick users into thinking that they know what they are doing instead of pushing them towards the documentation.

If you by any chance don't know how to generate a UUID and you do Uuid.[Ctrl + Space], you see a newV4() or anything similar, you can just read the docs or even do a quick google search and figure that out. That is better than seeing random() (v4) and timeWithRandom() (v7) and having no idea what they are.

While I share your enthusiasm for v7 and believe it will gain wider adoption, as described in the KEEP, we see technical limitations in correctly generating it within the stdlib. I believe that database systems and frameworks are the appropriate places for generating v7, as they can handle clock rollbacks and guarantee the monotonicity and consistency of generated UUIDs within the system.

The entire point of using UUIDs is to avoid (or at least having the option of not) generating them inside the database. I understand there are challenges, but as we just saw in the links above many major languages already provide them.

Could you please elaborate on how you would handle clock rollbacks and system restarts within the stdlib? Or do you think ensuring monotonicity within a single program session would be enough?

In my experience clock rollbacks are not critical because we still have a fair bit of the payload being random. Even without monotonicity and rollback avoidance guarantees, v7 is very useful to avoid major and frequent index re-balances in the database.
In the future, we can provide multiple v7 factories that allow some kind of time source to be passed in so the user can sync if they want, but I'd pick having any implementation of v7 over holding on things to have a perfect one.

@hfhbd
Copy link

hfhbd commented Aug 18, 2024

Is there any reason why there are no fun NSUUID.toKotlinUuid(): Uuid = Uuid.parse(UUIDString) and fun Uuid.toNsUUID(): NSUUID = NSUUID(toString()) matching the JVM converter functions on iOS/macOS etc.?

@qurbonzoda
Copy link
Contributor Author

qurbonzoda commented Aug 19, 2024

Is there any reason why there are no fun NSUUID.toKotlinUuid(): Uuid = Uuid.parse(UUIDString) and fun Uuid.toNsUUID(): NSUUID = NSUUID(toString()) matching the JVM converter functions on iOS/macOS etc.?

Currently, the stdlib project does not have separate source sets for different Kotlin/Native targets. This is due to how the Kotlin/Native stdlib is built and published. When it becomes possible to have an Apple-specific source set, we will include such conversion functions as well.

@hfhbd
Copy link

hfhbd commented Sep 1, 2024

I just read the rfc9562 and I found these sections.

Regarding sorting:

UUID formats created by this specification are intended to be lexicographically sortable while in the textual representation.

https://www.rfc-editor.org/rfc/rfc9562.html#section-6.11

In Java/Kotlin, the interface to support sorting is Comparable, so implementing the interface sounds correct to me according to the rfc.

Regarding the function/factory names for different versions:
https://www.rfc-editor.org/rfc/rfc9562.html#section-7.1 contains a table with names registered by the IANA, so random() may be the preferred name for Uuidv4, and unixTimebased() for Uuidv7. (On the other hand reorderedGregorianTimebased() for Uuidv6 sounds very "interesting".)

@Amejonah1200
Copy link

I support @bcmedeiros argument of implementing various vX(...) methods (tbh, I would also go with a mix of name and ID of it (ex. v4Random(...) or v7UnixTimeBased(...))) to generate those UUIDs (randomly).

Where I do disagree tho, it is the removal of standard non-parametrized random() function, as it is wildly used as a way of getting a random UUID without much thought what exactly to use. But yes, underneath it could be any of those implementations and may or may not be changed over time.

@hfhbd
Copy link

hfhbd commented Sep 1, 2024

Another thing after I switched a project to kotlin.uuid.Uuid is the parsing function in combination with kotlinx-serialization (here Json). Uuid provides two parsing functions: parse requires a 36 length hex and dash string, while parseHex requires a 32 hex string. The default UuidSerializer of kotlinx-serialization uses the parse method, so you can't use the default serializer and pass a 32 hex string.
While both implementations are well documented, including its expected input, I would expect a generic parse method that is able to handle both inputs and specific parseHex, parseHexAndDash functions supporting only a subset.
My current workaround is a custom parse function that checks the size first, if the size is 36, it uses the parse function, otherwise the parseHex function.

Also, the KEEP mentions possible bracket parsing support in the future. Do you plan to add another parseX function (including all the combinations of hex, hex and dashes and brackets), a UuidFormat and builder function or "just" one parse function to rule them all?

@UnknownJoe796
Copy link

I don't want to switch to this new implementation of KMP UUID until JS efficiency is addressed. In addition, I second @hfhbd 's recommendation to implement Comparable as per the RFC. I think the convenience outweighs the Java transition risks, personally.

@qurbonzoda
Copy link
Contributor Author

While both implementations are well documented, including its expected input, I would expect a generic parse method that is able to handle both inputs and specific parseHex, parseHexAndDash functions supporting only a subset.
My current workaround is a custom parse function that checks the size first, if the size is 36, it uses the parse function, otherwise the parseHex function.
Also, the KEEP mentions possible bracket parsing support in the future. Do you plan to add another parseX function (including all the combinations of hex, hex and dashes and brackets), a UuidFormat and builder function or "just" one parse function to rule them all?

I have added a paragraph about introducing parseHexDash()/toHexDashString() and a making parse() support multiple formats to the KEEP document. Please let me know your thoughts on this approach. We will consider options since the Uuid API is still experimental.

cc: @hfhbd, @OliverO2, @joffrey-bion

@qurbonzoda
Copy link
Contributor Author

I have also added a paragraph about making Kotlin Uuid a mapped type to the Java UUID in Kotlin/JVM. This would help greatly in integrating with existing APIs that work with Java UUID. However, this approach has its own downsides. Please let me know your thoughts about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests