New legacy id generator #4875

Open
wants to merge 14 commits into
base: master

Conversation

fmarco76
Member

The current implementation of the legacy id generator, which uses sequential serial numbers, leaves gaps when new ranges are allocated because of a conversion problem between decimal and hexadecimal range values. The gaps could cause problems for third-party tools or services that expect the serial numbers to be contiguous.

A new id generator is introduced to solve the problem. To use it, set the following values in CS.cfg:

dbs.cert.id.generator=newLegacy
dbs.request.id.generator=newLegacy

This can also be specified in pkispawn with the parameters:

pki_request_id_generator=newLegacy
pki_cert_id_generator=newLegacy

The remaining configuration is similar to the legacy generator. The main difference is that hexadecimal values must have the prefix 0x; if the prefix is not present, the value is read as decimal.

The serial numbers written to CS.cfg during a range update can be forced to be decimal or hexadecimal by setting dbs.numberRangeRadix to 10 or 16 respectively. If it is not set or is negative, the default for the specific generator is used.
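As an illustration of the prefix rule described above, here is a minimal Python sketch. It is not the PKI implementation (which is in Java); the function names `parse_id` and `format_id` are hypothetical and only show how a value with a 0x prefix could be read as hex and anything else as decimal.

```python
def parse_id(value: str) -> int:
    """Read a CS.cfg number: hex if it carries a 0x prefix, otherwise decimal."""
    text = value.strip()
    if text.lower().startswith("0x"):
        return int(text[2:], 16)
    return int(text, 10)


def format_id(number: int, radix: int) -> str:
    """Write a number back so that parse_id() reads the same value again."""
    if radix == 16:
        return hex(number)  # hex() always emits the 0x prefix
    if radix == 10:
        return str(number)
    raise ValueError(f"unsupported radix: {radix}")
```

With this convention a value survives a write and a read regardless of the radix chosen for writing, for example `parse_id(format_id(256, 16))` and `parse_id(format_id(256, 10))` both give 256.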

@fmarco76
Member Author

fmarco76 commented Oct 11, 2024

To upgrade the id generator from legacy to newLegacy when there are no clones, the CS.cfg values for certificate serial numbers should be updated to hex. In particular:

  • the beginSerialNumber can be converted directly by adding the 0x prefix;
  • the endSerialNumber is calculated by adding the increment, read as hex.

Finally, the value of nextRange in the DS node ou=certificateRepository,ou=ca,<DN suffix> has to be converted from hex to decimal (legacy uses hex without a prefix). If the previous values have been updated and the id generator changed, then running this command could be enough: pki-server ca-range-update
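The conversion steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name `migrate_legacy_cert_range` is hypothetical, and the end of the range is taken as begin + increment - 1, which is how the test script in this PR computes it.

```python
def migrate_legacy_cert_range(begin: str, increment: str, next_range: str) -> dict:
    """Convert legacy values (hex text without prefix) to the newLegacy form."""
    begin_n = int(begin, 16)
    increment_n = int(increment, 16)
    # Assumption: the range is inclusive, so its last id is begin + increment - 1.
    end_n = begin_n + increment_n - 1
    return {
        "dbs.beginSerialNumber": hex(begin_n),     # same value, now with 0x
        "dbs.endSerialNumber": hex(end_n),
        "nextRange": str(int(next_range, 16)),     # DS value becomes decimal
    }


result = migrate_legacy_cert_range(begin="13", increment="12", next_range="31")
```

For the sample inputs, the begin value stays 0x13, the end becomes 0x24, and the nextRange value 31 (hex) is written as 49.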

@fmarco76
Member Author

@parrjd testing the fix for the next range gaps in PR #4840 created problems with updates. If all the radix issues are fixed (there are a few other operation errors), then the ranges do not match properly and generate errors in deployed instances (it is OK for a new CA). To avoid problems we are implementing this fix, which provides a new generator, so instances that have no problem with gaps do not need to be modified.
Is this approach working in your case?

Contributor

@edewata edewata left a comment

@fmarco76 I'm still reviewing this, but could you add a test here (before switching to RSNv3) to show how to fix an existing CA that has a sequential serial number gap?

So basically the test might look like this:

  1. make the necessary changes to the CA
  2. allocate new ranges
  3. enroll certs to exhaust the current range
  4. repeat step 2 & 3 if necessary
  5. verify that no new gap is created

For simplicity it's not necessary to verify the config params or LDAP entries. The verification in step 5 should be sufficient. Thanks!
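The verification in step 5 could be as simple as checking that the issued serial numbers are contiguous. A minimal sketch, assuming the serials have already been collected as integers (the helper name `find_gaps` is hypothetical):

```python
def find_gaps(serials):
    """Return the missing (first, last) intervals between issued serials."""
    ordered = sorted(set(serials))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


contiguous = find_gaps(range(1, 10))
with_gap = find_gaps([1, 2, 3, 7, 8])
```

An empty result means no gap was created; a non-empty result lists each missing interval.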

@parrjd
Contributor

parrjd commented Oct 12, 2024

@fmarco76 sorry I did not see this earlier

I think forcing the values in the CS.cfg to be read as base 10 is a bad idea and will only cause more problems for existing installs. What we currently have in our prod environment is a hex number that includes the characters a-f, which are allowed in a hex string; if it is forced to be read as base 10, that will likely break on startup or whenever that value needs to be read.

Using 100 as an example for both end and increment
100 = 0x64
256 = 0x100

If you force it to be read and used as decimal and the CA has already issued 0x80, then it will jump to the next range starting at 200, which will overlap the previous range. The only safe way to switch to decimal would be to convert the increment and end values to decimal, so increment and end would need to be changed in the CS.cfg:

dbs.beginSerialNumber=1
dbs.endSerialNumber=10000000
dbs.serialCloneTransferNumber=10000

dbs.serialIncrement=10000000
dbs.serialLowWaterMark=2000000
dbs.serialCloneTransferNumber=10000
dbs.serialDN=ou=certificateRepository, ou=ca
dbs.serialRangeDN=ou=certificateRepository, ou=ranges
dbs.beginReplicaNumber=[pki_replica_number_range_start]
dbs.endReplicaNumber=[pki_replica_number_range_end]
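To make the radix ambiguity concrete, the sketch below shows that the same text yields different numbers depending on the base, and how hex text could be rewritten as decimal text so that a decimal reader sees the same value. The helper name `hex_text_to_decimal_text` and the sample dictionary are illustrative assumptions, not part of PKI.

```python
def hex_text_to_decimal_text(config: dict, keys: list) -> dict:
    """Return a copy of config where the listed hex values are rewritten in decimal."""
    converted = dict(config)
    for key in keys:
        converted[key] = str(int(config[key], 16))
    return converted


# The same text is a different number depending on the radix used to read it.
as_hex = int("100", 16)
as_decimal = int("100", 10)

legacy = {"dbs.endSerialNumber": "100", "dbs.serialIncrement": "100"}
decimal = hex_text_to_decimal_text(
    legacy, ["dbs.endSerialNumber", "dbs.serialIncrement"]
)
```

Here "100" is 256 when read as hex and 100 when read as decimal, so both converted entries are written as 256.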

As a side note:
dbs.beginRequestNumber=[pki_request_number_range_start]
dbs.endRequestNumber=[pki_request_number_range_end]
dbs.requestIncrement=10000000
dbs.requestLowWaterMark=2000000
dbs.requestCloneTransferNumber=10000
dbs.requestDN=ou=ca, ou=requests
dbs.requestRangeDN=ou=requests, ou=ranges

pkispawn values
pki_serial_number_range_start=
pki_serial_number_range_end=
pki_request_number_range_start=
pki_request_number_range_end=
pki_replica_number_range_start=
pki_replica_number_range_end=

For an install that did not override the begin and end serial numbers in CS.cfg with the pkispawn vars, the first range has 268,435,456 possible serial numbers, and the increment has the same amount, while the request range only has 10,000,000 since requests are handled in base 10. As long as range management is enabled it should not be an issue, since the requests will roll to the next range separately from the serials.
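The two sizes quoted above follow from reading the same text, 10000000, in different bases. A small sketch of the arithmetic, assuming both ranges begin at 1 (the request begin value is a pkispawn placeholder in the listing, so that is an assumption) and using a hypothetical helper `range_size`:

```python
def range_size(begin: str, end: str, radix: int) -> int:
    """Number of ids in the inclusive range [begin, end] read in the given radix."""
    return int(end, radix) - int(begin, radix) + 1


cert_ids = range_size("1", "10000000", 16)      # serial numbers are read as hex
request_ids = range_size("1", "10000000", 10)   # request ids are read as decimal
```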

The nextRange DS values do not get populated for some time. We do not have them on most of our prod systems, as they are populated well after the CA is stood up, once serial issuance reaches a certain point.

For a standalone CA, the default endSerialNumber value will result in most installs of a Dogtag CA never rolling over to a new range. Active clones are more likely to roll sooner, since they only get a very small fraction of the top end of the existing CA's serial range for a sequential CA. For a random serial CA, I believe it pulls the new nextRange when the CA is built, but I may be wrong since I have not deployed a random serial CA yet.

LDAP output from a serial CA with a clone that has not yet reached the point where a range has been added to the DS instance:

dn: ou=certificateRepository,ou=ranges,dc=*****
objectClass: top
objectClass: organizationalUnit
ou: certificateRepository

I think the best solution would be to set a couple of extra values in the CS.cfg so that different subsystems can be controlled differently, e.g. the TPS, TKS, or OCSP, where I believe everything is in base 10, and to treat all of the values for a type the same way: start, end, and increment for serial would be base 16, and for request and replica they would be base 10.

dbs.serialRadix=16
dbs.requestRadix=10
dbs.replicaRadix=10

From DOGTAG_10_5_BRANCH
base/server/cms/src/com/netscape/cms/servlet/csadmin/UpdateNumberRange.java


            if (type.equals("request")) {
                radix = 10;
                endNumConfig = "dbs.endRequestNumber";
                cloneNumConfig = "dbs.requestCloneTransferNumber";
                nextEndConfig = "dbs.nextEndRequestNumber";
            } else if (type.equals("serialNo")) {
                radix = 16;
                endNumConfig = "dbs.endSerialNumber";
                cloneNumConfig = "dbs.serialCloneTransferNumber";
                nextEndConfig = "dbs.nextEndSerialNumber";
            } else if (type.equals("replicaId")) {
                radix = 10;
                endNumConfig = "dbs.endReplicaNumber";
                cloneNumConfig = "dbs.replicaCloneTransferNumber";
                nextEndConfig = "dbs.nextEndReplicaNumber";
            }

For RHEL 7 and RHEL 8 instances it can be assumed that serial values should be interpreted as base 16, and request and replica values as base 10.

Since the math error in the code results in the value recorded in the DS being higher than the value stored in the CS.cfg (taking the base into account), I do not see a risk in fixing it by fixing how the math is done.

For dealing with existing systems I see a couple of paths.

Handle at subsystem startup

If nextRange values are not in DS, no issue since the problem code has not been executed.

If nextRange exists in the DS, read the existing range from cn=<start_of_current_range> in the DS and compare it to what is in CS.cfg. If they match, there is nothing to do; if they do not match, then write the end of the range back to CS.cfg so that they do match. This will prevent a gap from being created. Fix the code so that reads of the DS range values and the CS.cfg values always construct a BigInteger with the radix and write with toString(radix), so that the math is all done using the proper base for the values, and the values are stored in the same manner in both the DS and the CS.cfg and can easily be compared.

Switch to all decimal, but manually convert and update the increment and end from hex to decimal values, so that 100 would be written back to the CS.cfg as 256. The challenge here is that I think this is a bigger lift: anywhere in the code that currently uses the value as hex would have to be found, and anywhere it needs to be hex would need an Integer.toHexString() to convert it.

Basically I am looking for the best way to prevent this from happening when a CA rolls to the next range. I am not looking to go back and try to fill in where we have ended up with gaps. We are already having to update other systems that pull data from the CAs to handle the gap, and we have had to do that before, when we did not use the provided serial management and did manual clones with completely different ranges and had huge gaps, so most of the other systems have a method to deal with gaps.

If we read from DS and CS.cfg with a radix, then the value comes in as a BigInteger whether it is hex or decimal, and when writing back to either with toString(radix), it is put back in the expected format.
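A sketch of that read-with-radix, write-with-radix idea, in Python rather than Java for brevity. The per-type table mirrors the radixes quoted from UpdateNumberRange.java above; the function names are hypothetical and the sketch only illustrates doing all arithmetic on numbers and converting at the boundaries.

```python
# Radix per id type, as in the legacy code quoted above.
RADIX = {"serialNo": 16, "request": 10, "replicaId": 10}


def read_number(text: str, id_type: str) -> int:
    """Parse stored text using the radix that belongs to the id type."""
    return int(text, RADIX[id_type])


def write_number(number: int, id_type: str) -> str:
    """Format a number in the same radix it would later be read with."""
    return format(number, "x") if RADIX[id_type] == 16 else str(number)


def next_range_end(end_text: str, increment_text: str, id_type: str) -> str:
    """Add the increment to the end of the range without mixing bases."""
    end = read_number(end_text, id_type)
    increment = read_number(increment_text, id_type)
    return write_number(end + increment, id_type)
```

Because both the DS value and the CS.cfg value would pass through the same pair of functions, they stay comparable as plain text.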

I think adding 0x to designate a value as hex is also likely a bad idea, as most of the code in the repository class is used by different subsystems that have different bases, and then every value would need to be tested to see whether the 0x needs to be stripped off before constructing a BigInteger with radix 16. Whereas if every subsystem provides the radix that it uses, it is just used, without having to add logic to determine whether the value is decimal or hex.

I don't think a new id generator is really required, since the existing one has no potential for serial overlap, which would be really bad; it can only result in gaps, since the math is not done the same way for the value written to the DS and the value written to the CS.cfg.

Comment on lines 1454 to 1455
docker exec primary pki-server ca-config-set dbs.beginSerialNumber $BEGIN_SERIAL
docker exec primary pki-server ca-config-set dbs.endSerialNumber $END_SERIAL

Contributor

I think these operations should be identical to Repository.switchToNextRange(), but the problem is the switchToNextRange() uses mNextMinSerialNo and mNextMaxSerialNo variables instead of beginRange and endRange attributes. I think they are supposed to be equivalent but computed separately so they might become out of sync because of the bug.

Now the question is should we use the next range defined in mNextMinSerialNo and mNextMaxSerialNo or in beginRange and endRange?

Member Author

Actually, we are not switching ranges, so the operation cannot be identical. When the generator is updated to the new version, the current range could still have space, and that space has to be used. We have to add a check and, if needed, update the next range configuration.

Contributor

I suppose you're converting this code into a tool? That would make it easier to implement the logic and reuse it in other cases.

Contributor

Could you also describe how the migration process works in this page? Thanks!
https://github.com/dogtagpki/pki/wiki/Sequential-Serial-Numbers-v2

Member Author

Yes, I have updated the code, but it is not yet complete. Besides the KRA update, which still has to be included (they are the same, so the update will work for both), I have a couple of doubts I am still testing.

Additionally, since the ranges are switching to decimal, like the next range (and the certificate serials), should we update the existing ranges that are already completed, to be consistent?

Member Author

The migration also works for RSNv3, but I have not modified the tests. We can do that later.

Member Author

I have updated the page with the steps performed by the update script. If it is OK, then we could add an example of the state before and after the update from the ca-sequential action.

Comment on lines 1474 to 1484
END_RANGE=0x$(cat output | sed -n 's/endRange: \(.*\)/\1/p' | sort -n | tail -1)
NEW_NEXT=$((END_RANGE + 1))
docker exec -i primaryds ldapmodify \
-H ldap://primaryds.example.com:3389 \
-D "cn=Directory Manager" \
-w Secret.123 << EOF
dn: ou=certificateRepository,ou=ca,dc=ca,dc=pki,dc=example,dc=com
changeType: modify
replace: nextRange
nextRange: $NEW_NEXT
EOF
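The calculation in this excerpt (take the largest endRange and add one) can be sketched in Python as follows. It assumes, as the script does by prepending 0x, that the endRange values are hex text without a prefix; the helper name `next_range_after` is hypothetical. Parsing before comparing also avoids relying on `sort -n`, which compares only the leading decimal digits of each line and so may misorder hex strings containing a-f.

```python
def next_range_after(end_ranges_hex):
    """Return the decimal nextRange that follows the largest endRange."""
    largest = max(int(value, 16) for value in end_ranges_hex)
    return str(largest + 1)


sample = next_range_after(["24", "1e", "a"])
```

For the sample values the largest endRange is 0x24, so the result is the decimal text 37.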

Contributor

This probably should be identical to Repository.getNextRange() too, but in getNextRange() it allocates a new range in addition to updating the nextRange. If we don't allocate a new range here would it create a gap?

Member Author

As in the previous comment, the change should not modify the ranges but only fix the values, so it differs from the method, where the update happens because a new range is allocated.

Contributor

I guess the question is, how will other servers that have not been upgraded yet see the new nextRange, and how will they be affected by it? If it works like a regular new range allocation then we don't need to worry about it, but this one seems to be different.

Member Author

nextRange is shared among all the clones. The migration should be all or nothing. It does not have to be synchronized immediately, only before a new range is requested.

Contributor

@edewata edewata Oct 17, 2024

Ideally the migration procedure should allow it to be done incrementally (i.e. one server at a time) so the service can continue running without interruption. If that's the case, we might even be able to run the migration automatically as part of package upgrade.

If that's not possible, then the migration can only be done manually during maintenance window, and that might require shutting down the entire service, or making the entire service read-only until it's done. We need to make sure this is clearly stated in the documentation so people won't accidentally run the migration outside of the maintenance window.

Member Author

The problem arises only if a clone asks for a new range while the master has already migrated. This could create problems, although I have not tested it yet. I'll test to verify whether it is possible to handle this.

# Since there is a range, the values are retrieved from the range. In general, if the range is not
# present, the values from CS.cfg are used.
BEGIN_SERIAL=0x$(cat output | sed -n 's/beginRange: \(.*\)/\1/p' | sort -n | tail -1)
END_SERIAL=$(printf "0x%x\n" $(( BEGIN_SERIAL + 0x12 -1 )) )

Contributor

We probably cannot assume that the range size will be constant. It looks like the ranges are smaller for clones.

Contributor

As shown by the test I just updated, the size does change under certain circumstances.

Member Author

The problem is not specific to clones. The first ranges are bigger, and afterwards the size becomes constant. This is also evident with a single sequential CA. It is caused by the initial error in nextRange.

@@ -673,6 +673,7 @@ dbs.replicaDN=ou=replica
dbs.replicaRangeDN=ou=replica, ou=ranges
dbs.ldap=internaldb
dbs.newSchemaEntryAdded=true
dbs.numberRangeRadix=-1

Contributor

Is dbs.numberRangeRadix needed for migration? If not I'd suggest we drop it from the PR or move it into a separate PR. I also think that we probably need separate params such as dbs.request.id.radix and dbs.cert.id.radix to replace the hard-coded radixes in RequestRepository and CertificateRepository.

Member Author

Yes, I am implementing the CLI for the update, and I also thought of removing dbs.numberRangeRadix. I added it because I was thinking of fixing the problem without a new generator, but with the new generator it is better to remove it.

Currently the hard-coded radix is only used to write the numbers back to CS.cfg; for reading, the format is used: with the 0x prefix the value is hex, otherwise it is decimal. This should not cause problems, so I am inclined not to add the option to write in a specific base. Should the radix be configurable?

Contributor

It's not required to be configurable, but it could help to show what radix it's using now, and in case we want to change the radix the admin will have the full control of when to change it.

Member Author

Done! I have replaced it with the two parameters, but I have not added them to the defaults yet. Not sure if we should include them.

With the legacy serial ID generator, request and replica ranges are expressed in decimal while serial ranges are in hex.
However, there is a problem in the hex management which creates gaps in
the sequences every time a new range is allocated.

Since gaps could create problems for third-party software, a new
parameter has been introduced in CS.cfg to set the same format for all ranges:

dbs.numberRangeRadix

If this is not present or is negative, the current default is adopted
and nothing changes.

If this is set to 10, all the values are handled as decimal and the
ranges work properly. If it is set to a value other than 10 (e.g. 16 for
hex), the gap problem is still present.

This new parameter is not updated during the upgrade, to avoid creating
problems for running instances which have no problem with range gaps.

To move the value to 10 and solve the gap problem, the
CS.cfg serial range has to be fixed, considering the current value as
hex. The nextRange in the following DS node also has to be fixed
accordingly:

ou=certificateRepository,ou=ca,<prefix>

Range information in DS will be stored in decimal format, so the
related operations do not need a radix to read it.

The test converts from the legacy generator to newLegacy and verifies that no
gaps are present when new ranges are created.

This includes 2 commands:
- "show" to get the configured id generator
- "update" to change the generator. It is possible to move from legacy
  to legacy2 and from legacy or legacy2 to random

The generators are updated with the new command:

pki-server ca-range-generator-update --type <generator_type> <generator_name>

The recently introduced option `dbs.numberRangeRadix` has been split into:
- `dbs.cert.id.radix`
- `dbs.request.id.radix`

If they are not present, the default values of 16 for cert and 10 for
request will be used.
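
The fallback to defaults described in this commit message could look like the sketch below. It is an illustration only: the helper name `get_radix` is hypothetical, the configuration is modeled as a plain dictionary, and restricting the accepted values to 10 and 16 is an assumption not stated in the PR.

```python
DEFAULT_RADIX = {
    "dbs.cert.id.radix": 16,
    "dbs.request.id.radix": 10,
}


def get_radix(config: dict, key: str) -> int:
    """Return the configured radix, or the documented default when absent."""
    raw = config.get(key)
    if raw is None:
        return DEFAULT_RADIX[key]
    radix = int(raw)
    # Assumption: only decimal and hexadecimal are meaningful here.
    if radix not in (10, 16):
        raise ValueError(f"{key} must be 10 or 16, got {radix}")
    return radix
```

For example, an empty configuration yields 16 for `dbs.cert.id.radix`, while a configuration containing `dbs.cert.id.radix=10` yields 10.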

In legacy, serial number ranges are stored as hex for cert and decimal
for request. legacy2 uses decimal for all the values stored in
DS. The update command converts stored ranges to decimal to match
the new format and avoid problems.

sonarcloud bot commented Oct 18, 2024

endRange: 48
host: primary.example.com

EOF

Contributor

Could you update the comment to indicate that the allocated cert range for the primary CA changed from 13-30 to 19-48?

Could you add a similar check for the request range too since we're migrating it to SSNv2 too?

Comment on lines +1023 to +1026
# cert nextRange should remain the same
cat > expected << EOF
nextRange: 49
EOF

Contributor

Could you add a similar check for the request next range?


- name: Switch secondary to legacy2
run: |
docker exec secondary pki-server stop

Contributor

Do we allow enrollments to happen in the secondary CA (which could change the ranges) during the migration of the primary CA? If we don't, we probably should stop both servers at the beginning of the migration, or somehow make them read-only. That will mark the beginning of the maintenance window.

Comment on lines +1065 to +1091
- name: Check cert range config in primary CA
run: |
tests/ca/bin/ca-cert-range-config.sh primary | tee output

cat > expected << EOF
dbs.beginSerialNumber=0x13
dbs.endSerialNumber=0x30
dbs.serialCloneTransferNumber=0x9
dbs.serialIncrement=0x12
dbs.serialLowWaterMark=0x9
EOF

diff expected output

- name: Check cert range config in secondary CA
run: |
tests/ca/bin/ca-cert-range-config.sh secondary | tee output

cat > expected << EOF
dbs.beginSerialNumber=0x31
dbs.endSerialNumber=0x48
dbs.serialCloneTransferNumber=0x9
dbs.serialIncrement=0x12
dbs.serialLowWaterMark=0x9
EOF

diff expected output

Contributor

So the ranges changed from 0x13-0x24 to 0x13-0x30 in the primary CA and from 0x31-0x42 to 0x31-0x48 in the secondary CA. Is there any possibility that the new ranges would conflict with other existing ranges?

Member Author

If I get the math correct, the ranges in CS.cfg are smaller than those in DS and not fully used, but the start of each range is always correct even when there is a gap. The IDs between 0x42 and 0x48 are lost because, when a new range is computed, it starts from the value recorded in nextRange and does not take into account the range objects or the configured range, so overlap should not be possible. By using the full recorded range, we are asking to utilise the whole range.
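The overlap question can be checked mechanically for the values in this test. The sketch below is only an illustration with a hypothetical helper `find_overlaps`; the two ranges are the migrated primary and secondary ranges quoted in this thread.

```python
from itertools import combinations


def find_overlaps(ranges: dict) -> list:
    """Return pairs of owners whose inclusive (begin, end) ranges intersect."""
    overlaps = []
    for (name_a, (begin_a, end_a)), (name_b, (begin_b, end_b)) in combinations(
        ranges.items(), 2
    ):
        if begin_a <= end_b and begin_b <= end_a:
            overlaps.append((name_a, name_b))
    return overlaps


migrated = {"primary": (0x13, 0x30), "secondary": (0x31, 0x48)}
conflicts = find_overlaps(migrated)
```

For these two ranges the result is empty, since the primary range ends at 0x30 and the secondary range starts at 0x31. The check says nothing about ranges held by servers not listed in the input.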

Comment on lines +52 to +58
Option option = new Option("d", true, "NSS database location");
option.setArgName("database");
options.addOption(option);

option = new Option("f", true, "NSS database password configuration");
option.setArgName("password config");
options.addOption(option);

Contributor

These options don't seem to be needed.

Comment on lines 122 to +123
  // generate nextRange in decimal
- String nextSerialNumber = endSerialNumber.add(BigInteger.ONE).toString();
+ BigInteger nextSerialNumber = endSerialNumber.add(BigInteger.ONE);

Contributor

Could you remove "in decimal" since the radix is irrelevant for BigInteger?

Comment on lines 168 to +169
  // generate nextRange in decimal
- String nextRequestNumber = endRequestNumber.add(BigInteger.ONE).toString();
+ BigInteger nextRequestNumber = endRequestNumber.add(BigInteger.ONE);

Contributor

Could you remove "in decimal" since the radix is irrelevant for BigInteger?

@@ -258,3 +259,165 @@ def execute(self, argv):
sys.exit(1)

subsystem.update_ranges()


class RangeGeneratorCLI(pki.cli.CLI):

Contributor

The pki-server <subsystem>-range-generator-* name is probably not quite accurate since this command can be used to show RSNv3 config or migrate to RSNv3 which doesn't have ranges. How about something like pki-server <subsystem>-id-generator-*?

Comment on lines +178 to +189
String nextBeginSerial = dbConfig.getNextBeginSerialNumber();
String nextEndSerial = dbConfig.getNextEndSerialNumber();
if (nextBeginSerial != null && !nextBeginSerial.equals("-1")) {
dbConfig.setNextBeginSerialNumber("0x" + nextBeginSerial);

LDAPEntry entryNextSerial = conn.read("cn=" + nextBeginSerial + "," + rangeDN);
LDAPAttribute attrNextEnd = entryNextSerial.getAttribute("endRange");
if (attrNextEnd != null) {
nextEndSerial = attrNextEnd.getStringValues().nextElement();
}
dbConfig.setNextEndSerialNumber("0x" + nextEndSerial);
}

Contributor

I'm not familiar with how the "next" params work, let me try to add some tests.


BigInteger lastUsedSerial = BigInteger.ZERO;
boolean nextRangeToUpdate = true;
while (ranges.hasMoreElements()) {

Contributor

Could you add a note describing what this loop is doing? Is it trying to find the biggest endRange for the server?
