Merge upstream/release/2.6 into upstream/google/2.6 #15460

Merged 35 commits on Nov 7, 2024

Conversation

mjmac
Contributor

@mjmac mjmac commented Nov 6, 2024

wiliamhuang and others added 30 commits October 15, 2024 10:18
#15004) (#15098)

In HDF5, DFS, MPIIO, or POSIX, object class and container properties are defined
during the container create. If it’s DFS, object class is also set to the IOR
parameter. However, in HDF5-VOL, object class and container properties are
defined with the following environment variables of mpirun.

HDF5_DAOS_OBJ_CLASS (Object class)
HDF5_DAOS_FILE_PROP (Container properties)

The infrastructure to set these variables is already there in run_ior_with_pool().
In file_count_test_base.py, pass in the env vars to run_ior_with_pool(env=env) as a
dictionary. Object class is the oclass variable. Container properties can be
obtained from self.container.properties.value.
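
As a rough illustration of the paragraph above, the environment dictionary might be assembled as follows. The helper name `build_hdf5_vol_env` and the sample values are hypothetical; only the two variable names and the `env=` keyword of run_ior_with_pool() come from the commit message.

```python
def build_hdf5_vol_env(oclass, container_properties):
    """Return the env vars HDF5-VOL reads for object class and container properties.

    Hypothetical helper for illustration; not part of the DAOS test framework.
    """
    env = {}
    if oclass:
        env["HDF5_DAOS_OBJ_CLASS"] = str(oclass)
    if container_properties:
        env["HDF5_DAOS_FILE_PROP"] = str(container_properties)
    return env


# Sample values are assumptions chosen for the example.
env = build_hdf5_vol_env("SX", "rd_fac:0")
print(env)
```

In the test, the resulting dictionary would be passed as run_ior_with_pool(env=env), with oclass and self.container.properties.value as the inputs.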

This fix is discussed in PR #14964.

Signed-off-by: Makito Kano <[email protected]>
Hadoop 3.4.0 has resolved a few CVE issues but introduces new

+ enable Trivy scans on release branch
+ enable on demand scan and scan on final PR merge.

Signed-off-by: Tomasz Gromadzki <[email protected]>
)

* DAOS-14408 common: enable NDCTL for DCPM

This PR prepares DAOS to be used with NDCTL enabled in PMDK, which means:
- NDCTL must not be used when non-DCPM (simulated PMem) storage, i.e. `storage class: "ram"`, is used:
`PMEMOBJ_CONF=sds.at_create=0` env variable disables NDCTL features in PMDK.
This change affects all tests run on simulated PMem (e.g. inside VMs).
Some DAOS utility applications may also require `PMEMOBJ_CONF=sds.at_create=0` to be set.

- The default ULT stack size must be at least 20KiB, to avoid stack overuse by PMDK with NDCTL enabled, and must be aligned with the Linux page size.
`ABT_THREAD_STACKSIZE=20480` env variable is used to increase the default ULT stack size.
This env variable is set by the control/server module just before the engine is started.
A much bigger stack is used for pmempool open/create-related tasks, e.g. `tgt_vos_create_one`, to avoid stack overuse.
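
The size constraint in the second bullet (at least 20 KiB, aligned to the page size) can be checked with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and the 4 KiB default page size is an assumption that holds on typical x86-64 Linux systems.

```python
MIN_ULT_STACK_BYTES = 20 * 1024  # 20 KiB minimum from the commit message


def aligned_stack_size(min_bytes=MIN_ULT_STACK_BYTES, page_size=4096):
    """Round min_bytes up to the next multiple of page_size."""
    pages = -(-min_bytes // page_size)  # ceiling division
    return pages * page_size


print(aligned_stack_size())                 # 4 KiB pages
print(aligned_stack_size(page_size=65536))  # 64 KiB pages
```

With 4 KiB pages the result is 20480, which matches the `ABT_THREAD_STACKSIZE=20480` value above.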

This modification shall not affect md-on-ssd mode as long as `storage class: "ram"` is used for the first tier in the `storage` configuration.
This change does not require any configuration changes to existing systems.

The new PMDK package with NDCTL enabled (daos-stack/pmdk#38) will land as soon as this PR is merged.

Signed-off-by: Jan Michalski <[email protected]>
* DAOS-16653 pool: Batch crt events

When multiple engines become unavailable around the same time, if a pool
cannot tolerate the unavailability of those engines, it is sometimes
desired that the pool would not exclude any of the engines. Hence, this
patch introduces a CaRT event delay, tunable via the server-side
environment variable, CRT_EVENT_DELAY, so that the events signaling the
unavailability of those engines will be handled in hopefully one batch,
giving pool_svc_update_map_internal a chance to reject the pool map
update based on the RF check.

When the RF check rejects a pool map change, we should revisit the
corresponding events later, rather than simply throwing them away. This
patch improves this case by returning the events to the event
queue and pausing the queue handling until the next new event or pool map
update.

  - Introduce event sets: pool_svc_event_set. Now the event queue can be
    simplified to just one event set.

  - Add the ability to pause and resume the event handling: pse_paused.

  - Track the time when the latest event was queued: pse_time.
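
The batching, pause, and resume behavior described above can be modeled with a small toy class. This is a simplified sketch, not the actual pool service code: the class and method names are hypothetical, only the field names `pse_time` and `pse_paused` come from the commit message, and measuring the delay from the latest queued event is an assumption.

```python
class EventSet:
    """Toy stand-in for pool_svc_event_set; field names mirror the commit message."""

    def __init__(self, delay):
        self.delay = delay        # plays the role of CRT_EVENT_DELAY (seconds)
        self.events = []          # pending rank events
        self.pse_time = None      # time the latest event was queued
        self.pse_paused = False   # True after the RF check rejected a batch

    def queue(self, rank, now):
        """Queue an event; a new event also resumes a paused set."""
        self.events.append(rank)
        self.pse_time = now
        self.pse_paused = False

    def resume(self):
        """Resume handling, e.g. after a pool map update."""
        self.pse_paused = False

    def take_batch(self, now):
        """Return all pending events once the delay has elapsed, else []."""
        if self.pse_paused or not self.events:
            return []
        if now - self.pse_time < self.delay:
            return []
        batch, self.events = self.events, []
        return batch

    def reject(self, batch):
        """RF check failed: put the events back and pause handling."""
        self.events = batch + self.events
        self.pse_paused = True


es = EventSet(delay=10)
es.queue(rank=3, now=0)
es.queue(rank=4, now=2)
print(es.take_batch(now=5))    # delay not yet elapsed since the latest event
batch = es.take_batch(now=12)  # both events are handled as one batch
es.reject(batch)               # the RF check rejected the map update
print(es.take_batch(now=30))   # paused, so nothing is handled
es.resume()
print(es.take_batch(now=30))   # the returned events are revisited
```

Handing both events over in one batch is what gives the RF check a chance to reject the whole map update instead of excluding the engines one at a time.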

Signed-off-by: Li Wei <[email protected]>
Pin isort to v1.1.0 to avoid surprise changes and because v1.1.1 is not
working for us.

Signed-off-by: Dalton Bohning <[email protected]>
…15324)

Rarely, this test will produce timings that exceed the failure
threshold. Local and PR/CI experiments have shown that increasing
the test's NUM_OPS to more than 200 iterations greatly reduces
or may eliminate such intermittent timing failures, by "spreading out"
the magnitude of the time spent in the 3 main loops of the test
(the loops perform fault injections at rates of 0%, 33%, and 50%, respectively).


Signed-off-by: Kenneth Cain <[email protected]>
… b26 (#15186)

In rebuild_iv_ent_refresh(), when refreshing the DTX resync version,
rt_global_dtx_resync_version must be assigned before waking up the
related rebuild_scan_leader.

Signed-off-by: Fan Yong <[email protected]>
Release branch PRs should use the release branch build
instead of master branch build for NLT reference

Signed-off-by: Jeff Olivier <[email protected]>
#14985)

Sometimes, after an unexpected system shutdown, users may want to
check their critical data under some kind of maintenance mode.
In such a mode, no user data can be modified, moved, or aggregated.
That guarantees that no further damage (caused by DAOS logic) can
happen during the check.

For this purpose, we enhance the current DAOS CR logic with a --dryrun
option that allows the pool (after check) to be opened as immutable,
disabling mechanisms that may cause data modification or movement
(such as rebuild or aggregation).

In such a mode, a client that wants to connect to the pool must specify
the read-only option. The same applies to opening a container in such a pool.

Signed-off-by: Fan Yong <[email protected]>
…#15340)

Prevent accumulating large server log files caused by temporarily
enabling the DEBUG log mask while creating or destroying pools.

Signed-off-by: Phil Henderson <[email protected]>
In rare cases, a reader may attempt to access a telemetry
node after it has been added to the tree, but before it
has been fully initialized. Use an atomic to prevent
reads before the initialization has completed. Unlucky
readers will get a -DER_AGAIN instead of crashing.
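
The publish-after-init pattern described above can be sketched in Python. This is only an analogue of the fix, not the telemetry code itself: a `threading.Event` stands in for the C atomic, the `Node` class is hypothetical, and `DER_AGAIN` is a symbolic placeholder rather than the real DAOS error value.

```python
import threading

DER_AGAIN = "DER_AGAIN"  # symbolic placeholder, not the real DAOS error value


class Node:
    """Toy telemetry node that becomes visible before it is fully initialized."""

    def __init__(self):
        self.value = None
        self._ready = threading.Event()  # stands in for the C atomic flag

    def finish_init(self, value):
        self.value = value
        self._ready.set()  # publish only after all fields are written

    def read(self):
        if not self._ready.is_set():
            return DER_AGAIN, None  # unlucky reader: ask the caller to retry
        return 0, self.value


node = Node()          # the node is "in the tree" but not initialized yet
print(node.read())     # the reader is told to try again
node.finish_init(42)
print(node.read())     # the reader now sees the initialized value
```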

Signed-off-by: Michael MacDonald <[email protected]>
- Fix rc in error path during ivo_on_update failure

Required-githooks: true

Signed-off-by: Alexander A Oganezov <[email protected]>
Use a 4KB blob for the committed DTX table and a 16KB blob for the active DTX table.
This is more efficient for the lower-level allocator and reduces the possibility
of space allocation failure under space pressure.

Simplify vos_dtx_commit logic and code cleanup.

Signed-off-by: Fan Yong <[email protected]>
Fix the description of the CRT_EVENT_DELAY environment variable in
docs/admin/env_variables.md.

Signed-off-by: Li Wei <[email protected]>
…#15349)

With this change, when a DAOS administrator runs dmg system exclude
for a given set of engines, the system map version / CaRT primary group
version will be updated. In turn, daos_engines will more immediately
detect the "loss" of the administratively excluded engines, update
pool maps and perform rebuild. This change supports a use case of
a proactive exclusion of ranks that are expected to be impacted by
planned maintenance that would cut off connectivity to certain
engines.

Signed-off-by: Kenneth Cain <[email protected]>
The VOS sys_db may have multiple users, such as SMD and CHK.
It is the caller's duty to take the lock on the VOS sys_db before
accessing it, in order to handle concurrent operations from multiple XS.

Signed-off-by: Fan Yong <[email protected]>
If there are a lot of committable DTX entries in the DTX CoS cache,
it may be inefficient to locate a DTX entry in the CoS cache by
oid + dkey_hash. That may happen when DTX batched commit is blocked
(for example, because of network trouble), which triggers DTX refresh
(for DTX cleanup) on other related engines. If that happens, it
increases the system load on such engines and slows down DTX commit
even further. The patch reduces unnecessary search operations inside
the CoS cache.

Other changes:

1. Metrics (io/dtx/async_cmt_lat/tgt_id) for DTX asynchronous
   commit latency (in ms).

2. Fix a bug in sched_ult2xs() with multiple NUMA sockets for
   the DSS_XS_OFFLOAD case.

3. Delay commit (or abort) of collective DTX on the leader target
   to handle the resend race.

4. Avoid blocking dtx_req_wait() if the chore failed to send out
   some DTX RPC.

5. Some cleanup for error handling.

Signed-off-by: Fan Yong <[email protected]>
…#14893) (#14917)

Add SWIM_TRAFFIC_CLASS env var (default is unspec)

Signed-off-by: Jerome Soumagne <[email protected]>
…5283)

Reduce the side effects caused by frequent logging of -DER_INPROGRESS.

Signed-off-by: Fan Yong <[email protected]>
Release branch PRs should use the release branch build
instead of master branch build for Fault Injection reference

Signed-off-by: Jeff Olivier <[email protected]>
- New crt_reply_send_input_free() API added which releases input buffer right
  after HG_Respond() instead of waiting until the handle is destroyed.
- srv_obj.c calls changed to use new crt_reply_send_input_free()
- I/O context takes refcount on RPC
- only release input buffer for target update

Signed-off-by: Alexander A Oganezov <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Co-authored-by: Liang Zhen <[email protected]>
)

The old implementation did not correctly calculate the size of some
collective object RPCs, which may cause trouble when bulk data transfer
is needed for a large collective object RPC. It also potentially affects
how collective RPCs are dispatched from the leader to other engines.

The patch also adds more sanity checks for the coll-punch RPC to detect
potential DRAM corruption.

Signed-off-by: Fan Yong <[email protected]>
…15296) (#15392)

Missing PCIe capabilities when querying an NVMe SSD's configuration
space is unusual but should be handled gracefully by the control-plane
and shouldn't cause a failure to return usage statistics when calling
dmg storage query usage.

Update so that the pciutils lib is only called when attempting to display
health stats via dmg and not when fetching usage info. Improve clarity
of workflow to ease maintenance and add test coverage for updates.
Enable continued functionality when an NVMe device doesn't return any
extended capabilities in PCIe configuration space data by adding
a sentinel error to the library for such a case.

Signed-off-by: Tom Nabarro <[email protected]>
Intercept PMPI_Init() to avoid calling daos_init() if MPI_Init() is intercepted by another library (such as darshan or mpip).

Signed-off-by: Lei Huang <[email protected]>
…5069) (#15111)

Unlike fetch, we return DER_CSUM on update (turned into EIO by dfs) without any retry.
We should retry a few times in case it is a transient error.

The patch also prints more information about the actual checksum mismatch.
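
The retry policy described above might look like the following sketch. It is not the DAOS client code: `ChecksumError` is a stand-in for an update failing with DER_CSUM, `update_with_retry` and `flaky_update` are hypothetical names, and the limit of 3 retries is an assumption (the commit message only says "a few times").

```python
class ChecksumError(Exception):
    """Stand-in for an update that fails with DER_CSUM."""


def update_with_retry(do_update, max_retries=3):
    """Call do_update(); retry on ChecksumError up to max_retries extra times."""
    attempt = 0
    while True:
        try:
            return do_update()
        except ChecksumError:
            if attempt >= max_retries:
                raise  # persistent error: surface it to the caller
            attempt += 1


calls = {"n": 0}


def flaky_update():
    """Fail twice with a transient checksum error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ChecksumError("transient mismatch")
    return "ok"


print(update_with_retry(flaky_update), calls["n"])
```

A transient mismatch is absorbed by the loop, while a persistent one is still raised once the retry budget is exhausted.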

Signed-off-by: Johann Lombardi <[email protected]>
Co-authored-by: Dalton Bohning <[email protected]>
* DAOS-16211 vos: Avoid race condition with discard (#15370)

There is a possible race between aggregation deleting
the object tree and discard working on the same
object tree. Add a check to avoid this race.

Signed-off-by: Jeff Olivier <[email protected]>
Nasf-Fan and others added 5 commits November 5, 2024 17:49
The patch contains the following improvements:

1. When VOS level logic returns -DER_TX_RESTART, the object level RPC
   handler should set the 'RESEND' flag and then restart the transaction
   with a newer epoch. The dtx_abort() logic cannot guarantee that all
   formerly prepared DTX entries (on all related participants) are
   aborted, especially if the former attempt failed because of network
   trouble, so the restarted transaction may hit -DER_TX_ID_REUSED
   unexpectedly.

2. Compare the epochs of DTX entries with the same transaction ID to
   distinguish a potentially reused TX ID more accurately.

3. Add the DTX entry to the DTX CoS cache if it cannot be committed
   synchronously, so that subsequent batched commit logic can handle it.

4. If the server reports a suspected TX ID reuse, return -EIO to the
   related application instead of asserting on the client.

5. Control the frequency of DTX-related warning messages to avoid log floods.

6. Collect more information when generating some error/warning messages.

Signed-off-by: Fan Yong <[email protected]>
#14932) (#15446)

libfabric loads libze_loader.so, which calls zeInit(). We observed a deadlock due to nested calls when daos_init() is called inside zeInit(). We intercept dlsym() and zeInit() to avoid calling daos_init() inside zeInit(). dlsym(RTLD_NEXT, ) checks the return address to determine the caller's module. To maintain the expected behavior of dlsym(RTLD_NEXT, ) with our interception, new_dlsym() is implemented in assembly code and uses a jmp instruction instead of call. dlsym() has been moved from libdl.so to libc.so since glibc 2.34.

Signed-off-by: Lei Huang <[email protected]>
Tag first test build for 2.6.2.

Signed-off-by: Phil Henderson <[email protected]>
…/2.6

Required-githooks: true

Change-Id: I9a00365e096ca292a5530b9d9f2ed99ba0e50f2f

github-actions bot commented Nov 6, 2024

Errors are component not formatted correctly,Ticket number prefix incorrect,PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments,Unable to load ticket data
https://daosio.atlassian.net/browse/Merge

@mjmac mjmac merged commit 2fe57fc into google/2.6 Nov 7, 2024
67 of 70 checks passed
@mjmac mjmac deleted the mjmac/google/2.6 branch November 7, 2024 14:35