Merge upstream/release/2.6 into upstream/google/2.6 #15414

Closed
wants to merge 28 commits into from

Commits on Oct 15, 2024

  1. Commit 76cfb41

Commits on Oct 16, 2024

  1. DAOS-16446 test: HDF5-VOL test - Set object class and container prope… (#15004) (#15098)
    
    In HDF5, DFS, MPIIO, or POSIX, object class and container properties are defined
    during container creation. For DFS, the object class is also passed as an IOR
    parameter. In HDF5-VOL, however, object class and container properties are
    defined through the following mpirun environment variables:
    
    HDF5_DAOS_OBJ_CLASS (Object class)
    HDF5_DAOS_FILE_PROP (Container properties)
    
    The infrastructure to set these variables is already in place in run_ior_with_pool().
    In file_count_test_base.py, pass the env vars to run_ior_with_pool(env=env) as a
    dictionary. The object class is the oclass variable. Container properties can be
    obtained from self.container.properties.value.
    
    This fix is discussed in PR #14964.
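A minimal sketch of the dictionary described above, for illustration only. The helper name `build_hdf5_vol_env` and the sample values are hypothetical and not taken from the test code; only the two environment variable names come from the commit message.

```python
def build_hdf5_vol_env(oclass, container_properties):
    """Build the mpirun environment for an HDF5-VOL run (hypothetical helper).

    Only variables with a non-empty value are set, so an unset object class
    or property string leaves the corresponding default untouched.
    """
    env = {}
    if oclass:
        env["HDF5_DAOS_OBJ_CLASS"] = oclass
    if container_properties:
        env["HDF5_DAOS_FILE_PROP"] = container_properties
    return env


# Example values only; in the test they would come from the oclass variable
# and from self.container.properties.value.
env = build_hdf5_vol_env("EC_2P1GX", "rd_fac:1")
print(env)
```

In the test itself, a dictionary of this shape would be the value passed as run_ior_with_pool(env=env).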
    
    Signed-off-by: Makito Kano <[email protected]>
    shimizukko authored Oct 16, 2024
    Commit b4eb689
  2. DAOS-16673 common: ignore Hadoop 3.4.0 related CVE (#15320)

    Hadoop 3.4.0 has resolved a few CVE issues but introduces new ones.
    
    + enable Trivy scans on release branch
    + enable on demand scan and scan on final PR merge.
    
    Signed-off-by: Tomasz Gromadzki <[email protected]>
    grom72 authored Oct 16, 2024
    Commit 6e16c8e
  3. DAOS-14408 common: ensure NDCTL not used for storage class ram (#15203)
    
    * DAOS-14408 common: enable NDCTL for DCPM
    
    This PR prepares DAOS to be used with NDCTL enabled in PMDK, which means:
    - NDCTL must not be used with non-DCPM (simulated PMem) storage, i.e. when `storage class: "ram"` is used.
    The `PMEMOBJ_CONF=sds.at_create=0` env variable disables NDCTL features in PMDK.
    This change affects all tests run on simulated PMem (e.g. inside VMs).
    Some DAOS utility applications may also require `PMEMOBJ_CONF=sds.at_create=0` to be set.
    
    - The default ULT stack size must be at least 20 KiB, and aligned with the Linux page size, to avoid stack overuse by PMDK with NDCTL enabled.
    The `ABT_THREAD_STACKSIZE=20480` env variable is used to increase the default ULT stack size.
    This env variable is set by the control/server module just before the engine is started.
    A much bigger stack is used for pmempool open/create-related tasks, e.g. `tgt_vos_create_one`, to avoid stack overuse.
    
    This modification shall not affect md-on-ssd mode as long as `storage class: "ram"` is used for the first tier in the `storage` configuration.
    This change does not require any configuration changes to existing systems.
    
    The new PMDK package with NDCTL enabled (daos-stack/pmdk#38) will land as soon as this PR is merged.
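The stack-size rule above (at least 20 KiB, rounded up to a whole number of pages) can be sketched as follows. This is an illustration under stated assumptions, not the control-plane code: the function names are hypothetical, and combining both variables into one dictionary is done here only for compactness.

```python
import os

MIN_ULT_STACK_BYTES = 20 * 1024  # 20 KiB minimum quoted in the commit message


def aligned_stack_size(min_bytes=MIN_ULT_STACK_BYTES, page_size=None):
    """Round min_bytes up to a multiple of the page size (hypothetical helper)."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")  # page size of the running system
    pages = -(-min_bytes // page_size)  # ceiling division
    return pages * page_size


def example_engine_env(page_size=None):
    """Environment settings named in the commit message, as a plain dict."""
    return {
        # Disables NDCTL-dependent features in PMDK for simulated PMem.
        "PMEMOBJ_CONF": "sds.at_create=0",
        # Default ULT stack size, page-aligned.
        "ABT_THREAD_STACKSIZE": str(aligned_stack_size(page_size=page_size)),
    }


print(example_engine_env(page_size=4096))
```

With 4 KiB pages the result is exactly 20480 bytes (five pages), which matches the value in the commit message; with larger pages the sketch rounds up to one full page.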
    
    Signed-off-by: Jan Michalski <[email protected]>
    grom72 authored Oct 16, 2024
    Commit d9f16a1

Commits on Oct 18, 2024

  1. DAOS-16653 pool: Batch crt events (#15230) (#15302)

    * DAOS-16653 pool: Batch crt events
    
    When multiple engines become unavailable around the same time, if a pool
    cannot tolerate the unavailability of those engines, it is sometimes
    desired that the pool would not exclude any of the engines. Hence, this
    patch introduces a CaRT event delay, tunable via the server-side
    environment variable, CRT_EVENT_DELAY, so that the events signaling the
    unavailability of those engines will be handled in hopefully one batch,
    giving pool_svc_update_map_internal a chance to reject the pool map
    update based on the RF check.
    
    When the RF check rejects a pool map change, we should revisit the
    corresponding events later, rather than simply throwing them away. This
    patch improves this case by returning the events to the event
    queue and pausing the queue handling until the next new event or pool map
    update.
    
      - Introduce event sets: pool_svc_event_set. Now the event queue can be
        simplified to just one event set.
    
      - Add the ability to pause and resume the event handling: pse_paused.
    
      - Track the time when the latest event was queued: pse_time.
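The batching, pause, and resume behaviour described above can be sketched as a small model. This is not the pool service code: the class and method names are hypothetical, the delay is assumed to be measured from the latest queued event, and the RF check is reduced to "more failed ranks than the redundancy factor" as a stand-in for the real check in pool_svc_update_map_internal.

```python
class EventSet:
    """Toy model of a batched event set with pause/resume (hypothetical)."""

    def __init__(self, delay):
        self.delay = delay        # plays the role of CRT_EVENT_DELAY
        self.ranks = set()        # ranks reported as unavailable
        self.last_queued = None   # analogous to pse_time
        self.paused = False       # analogous to pse_paused

    def queue(self, rank, now):
        """Record a new event; a new event always resumes handling."""
        self.ranks.add(rank)
        self.last_queued = now
        self.paused = False

    def resume(self):
        """Resume handling, e.g. after a pool map update."""
        self.paused = False

    def handle(self, now, redundancy_factor):
        """Return the ranks to exclude, or an empty set if nothing is done."""
        if self.paused or not self.ranks:
            return set()
        if now - self.last_queued < self.delay:
            return set()  # still inside the batching window
        if len(self.ranks) > redundancy_factor:
            # Stand-in RF check failed: keep the events and pause instead
            # of throwing them away.
            self.paused = True
            return set()
        excluded, self.ranks = self.ranks, set()
        return excluded


events = EventSet(delay=10)
events.queue(1, now=0)
events.queue(2, now=1)
print(events.handle(now=11, redundancy_factor=1), events.paused)
```

In this model, two ranks failing one second apart are evaluated together once the delay has passed, so a pool that tolerates only one failure excludes neither of them.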
    
    Signed-off-by: Li Wei <[email protected]>
    liw authored Oct 18, 2024
    Commit 60d4b5d
  2. DAOS-16720 cq: pin isort to v1.1.0 (#15338) (#15339)

    Pin isort to v1.1.0 to avoid surprise changes and because v1.1.1 is not
    working for us.
    
    Signed-off-by: Dalton Bohning <[email protected]>
    daltonbohning authored Oct 18, 2024
    Commit e0f5883

Commits on Oct 19, 2024

  1. DAOS-15852 test: more timing samples for co_op_dup_timing() (#14497) (#15324)
    
    Rarely, this test will produce timings that exceed the failure
    threshold. Local and PR/CI experiments have shown that increasing
    the test's NUM_OPS to more than 200 iterations greatly reduces
    or may eliminate such intermittent timing failures, by "spreading out"
    the magnitude of the time spent in the 3 main loops of the test
    (0% loops perform fault injections, 33%, and 50%).
    
    
    Signed-off-by: Kenneth Cain <[email protected]>
    kccain authored Oct 19, 2024
    Commit cb9d278

Commits on Oct 20, 2024

  1. DAOS-16572 rebuild: properly assign global_dtx_resync_version in IV - b26 (#15186)
    
    In rebuild_iv_ent_refresh(), when refreshing the DTX resync version,
    rt_global_dtx_resync_version must be assigned before waking up the
    related rebuild_scan_leader.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 20, 2024
    Commit f8682fb

Commits on Oct 21, 2024

  1. DAOS-16716 ci: Set reference build for PRs (#15337)

    Release branch PRs should use the release branch build
    instead of master branch build for NLT reference
    
    Signed-off-by: Jeff Olivier <[email protected]>
    jolivier23 authored Oct 21, 2024
    Commit 81e57d0
  2. DAOS-16329 chk: maintenance mode after checking pool with dryrun - b26 (#14985)
    
    Sometimes, after an unexpected system shutdown, users may want
    to check their critical data under some kind of maintenance mode.
    Under such a mode, no user data can be modified, moved, or aggregated.
    That guarantees that no further potential damage (caused by DAOS logic)
    can happen during the check.
    
    For this purpose, we enhance the current DAOS CR logic with a --dryrun
    option that allows the pool (after check) to be opened as immutable,
    disabling mechanisms that may cause data modification
    or movement (such as rebuild or aggregation).
    
    Under such a mode, a client that wants to connect to the pool must specify
    the read-only option. The same applies to opening a container in such a pool.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 21, 2024
    Commit c821379
  3. DAOS-16265 test: Fix erasurecode/rebuild_fio.py out of space (#15020) (#15340)
    
    Prevent accumulating large server log files caused by temporarily
    enabling the DEBUG log mask while creating or destroying pools.
    
    Signed-off-by: Phil Henderson <[email protected]>
    phender authored Oct 21, 2024
    Commit b913d3e

Commits on Oct 22, 2024

  1. DAOS-16693 telemetry: Avoid race between init/read (#15306) (#15322)

    In rare cases, a reader may attempt to access a telemetry
    node after it has been added to the tree, but before it
    has been fully initialized. Use an atomic to prevent
    reads before the initialization has completed. Unlucky
    readers will get a -DER_AGAIN instead of crashing.
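The guard described above can be illustrated with a small model in which a flag is published only after the node is fully set up. This is a sketch, not the telemetry library: the real code uses a C atomic, whereas here `threading.Event` stands in for it, and `DER_AGAIN` is a placeholder sentinel rather than the real error value.

```python
import threading

DER_AGAIN = "DER_AGAIN"  # placeholder for the real -DER_AGAIN error code


class TelemetryNode:
    """Toy telemetry node that can be visible before it is initialized."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self._ready = threading.Event()  # stand-in for the atomic flag

    def initialize(self, value):
        """Fill in the node, then publish it as readable."""
        self.value = value
        self._ready.set()  # set last, after every field is populated

    def read(self):
        """Return (error, value); early readers are asked to retry."""
        if not self._ready.is_set():
            return DER_AGAIN, None
        return None, self.value


node = TelemetryNode("example_metric")
print(node.read())   # node is in the tree but not initialized yet
node.initialize(42)
print(node.read())
```

The point of the design is ordering: the flag is set as the final step of initialization, so a reader that sees it set can rely on every other field.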
    
    Signed-off-by: Michael MacDonald <[email protected]>
    mjmac authored Oct 22, 2024
    Commit ffa1c9d
  2. DAOS-16696 cart: Fix rc in error path (#15313) (#15357)

    - Fix rc in error path during ivo_on_update failure
    
    Required-githooks: true
    
    Signed-off-by: Alexander A Oganezov <[email protected]>
    frostedcmos authored Oct 22, 2024
    Commit 42a0d35
  3. DAOS-16574 vos: shrink DTX table blob size - b26 (#15220) (#15221)

    Use a 4KB blob for the committed DTX table and a 16KB blob for the active
    DTX table. This is more efficient for the lower-level allocator and reduces
    the possibility of space allocation failure under space pressure.
    
    Simplify vos_dtx_commit logic and code cleanup.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 22, 2024
    Commit 1ae3f29

Commits on Oct 23, 2024

  1. DAOS-16653 doc: Fix CRT_EVENT_DELAY description (#15351) (#15371)

    Fix the description of the CRT_EVENT_DELAY environment variable in
    docs/admin/env_variables.md.
    
    Signed-off-by: Li Wei <[email protected]>
    liw authored Oct 23, 2024
    Commit dcf8419
  2. DAOS-16650 control: dmg system exclude, update group version (#15288) (#15349)
    
    With this change, when a DAOS administrator runs dmg system exclude
    for a given set of engines, the system map version / cart primary group
    version will be updated. In turn, daos_engines will more immediately
    detect the "loss" of the administratively excluded engines, update
    pool maps and perform rebuild. This change supports a use case of
    a proactive exclusion of ranks that are expected to be impacted by
    planned maintenance that would cut off connectivity to certain
    engines.
    
    Signed-off-by: Kenneth Cain <[email protected]>
    kccain authored Oct 23, 2024
    Commit 4c49f36

Commits on Oct 24, 2024

  1. DAOS-16488 chk: take sd_lock before accessing VOS sys_db - b26 (#15269)

    The VOS sys_db may have multiple users, such as SMD and CHK.
    It is the caller's duty to take the lock on the VOS sys_db before
    accessing it, to handle concurrent operations from multiple XS.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 24, 2024
    Commit 2819d45
  2. DAOS-16469 dtx: optimize DTX CoS cache - b26 (#15085)

    If there are many committable DTX entries in the DTX CoS cache,
    locating a DTX entry in the cache by oid + dkey_hash may be
    inefficient. This can happen when DTX batched commit is blocked
    (for example, by network trouble), which triggers DTX refresh
    (for DTX cleanup) on other related engines. That increases the
    system load on such engines and slows down DTX commit even further.
    The patch reduces unnecessary search operations inside the CoS cache.
    
    Other changes:
    
    1. Metrics (io/dtx/async_cmt_lat/tgt_id) for asynchronous DTX
       commit latency (in ms).
    
    2. Fix a bug in sched_ult2xs() with multiple NUMA sockets for
       the DSS_XS_OFFLOAD case.
    
    3. Delay commit (or abort) of collective DTX on the leader target
       to handle resent race.
    
    4. Avoid blocking dtx_req_wait() if the chore failed to send
       some DTX RPC.
    
    5. Some cleanup for error handling.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 24, 2024
    Commit ec3aa1c
  3. DAOS-14262 cart: add ability to select traffic class for SWIM context (#14893) (#14917)
    
    Add SWIM_TRAFFIC_CLASS env var (default is unspec)
    
    Signed-off-by: Jerome Soumagne <[email protected]>
    soumagne authored Oct 24, 2024
    Commit 2b5620b
  4. DAOS-16469 container: Lower log level for cont_aggregate_interval (#15283)
    
    Reduce the side effects caused by frequent logging with -DER_INPROGRESS.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 24, 2024
    Commit 70b12e3
  5. DAOS-16716 ci: Set reference build for PRs (#15379)

    Release branch PRs should use the release branch build
    instead of master branch build for Fault Injection reference
    
    Signed-off-by: Jeff Olivier <[email protected]>
    jolivier23 authored Oct 24, 2024
    Commit 2a1892f

Commits on Oct 25, 2024

  1. DAOS-15914: crt_reply_send_input_free() (#14817)

    - New crt_reply_send_input_free() API added which releases input buffer right
      after HG_Respond() instead of waiting until the handle is destroyed.
    - srv_obj.c calls changed to use new crt_reply_send_input_free()
    - I/O context takes refcount on RPC
    - only release input buffer for target update
    
    Signed-off-by: Alexander A Oganezov <[email protected]>
    Signed-off-by: Liang Zhen <[email protected]>
    Co-authored-by: Liang Zhen <[email protected]>
    frostedcmos and gnailzenh authored Oct 25, 2024
    Commit 23f0787
  2. DAOS-16721 object: fix coll RPC for obj with sparse layout - b26 (#15376)
    
    The old implementation did not correctly calculate the size of some
    collective object RPCs, which may cause trouble when bulk data transfer
    is needed for a large collective object RPC. It also potentially affects
    how collective RPCs are dispatched from the leader to other engines.
    
    The patch also adds more sanity checks for the coll-punch RPC to detect
    potential DRAM corruption.
    
    Signed-off-by: Fan Yong <[email protected]>
    Nasf-Fan authored Oct 25, 2024
    Commit 67da3b9

Commits on Oct 28, 2024

  1. DAOS-16687 control: Handle missing PCIe caps in storage query usage (#15296) (#15392)
    
    Missing PCIe capabilities when querying an NVMe SSD's configuration
    space is unusual but should be handled gracefully by the control plane
    and shouldn't cause a failure to return usage statistics when calling
    dmg storage query usage.
    
    Update so that the pciutils lib is only called when attempting to display
    health stats via dmg and not when fetching usage info. Improve clarity
    of the workflow to ease maintenance and add test coverage for the updates.
    Enable continued functionality when an NVMe device doesn't return any
    extended capabilities in its PCIe configuration space data by adding a
    sentinel error to the library for such a case.
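The sentinel-error approach described above can be sketched as follows. The actual control plane is not written in Python and its function names are not given in the commit message, so every name here (`NoPCIeCapsError`, `read_pcie_details`, `query_health`, `query_usage`) is hypothetical; the sketch only shows the shape of the idea.

```python
class NoPCIeCapsError(Exception):
    """Sentinel: the device returned no extended PCIe capabilities."""


def read_pcie_details(config_space):
    """Stand-in for the pciutils-backed parser (hypothetical)."""
    if not config_space:
        raise NoPCIeCapsError("no extended capabilities in config space")
    return {"config_bytes": len(config_space)}


def query_health(device):
    """Health path: parses PCIe data but tolerates the sentinel error."""
    stats = {"name": device["name"]}
    try:
        stats["pcie"] = read_pcie_details(device.get("config_space", b""))
    except NoPCIeCapsError:
        stats["pcie"] = None  # still report health, without PCIe details
    return stats


def query_usage(device):
    """Usage path: never touches PCIe parsing at all."""
    return {"name": device["name"], "used_bytes": device["used_bytes"]}


device = {"name": "nvme0", "used_bytes": 1024, "config_space": b""}
print(query_usage(device))
print(query_health(device))
```

A dedicated exception type lets the caller distinguish "capabilities missing" from genuine parse failures, which would still propagate.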
    
    Signed-off-by: Tom Nabarro <[email protected]>
    tanabarr authored Oct 28, 2024
    Commit c4cf4f7
  2. DAOS-16722 client: to intercept PMPI_Init() in libpil4dfs (#15387)

    Intercept PMPI_Init() to avoid calling daos_init() if MPI_Init() is intercepted by another library (such as darshan or mpip).
    
    Signed-off-by: Lei Huang <[email protected]>
    wiliamhuang authored Oct 28, 2024
    Commit eb95b55
  3. Commit bde13c3
  4. Merge remote-tracking branch 'origin/release/2.6' into juszhan/google/2.6
    
    Required-githooks: true
    Change-Id: I7fc290f325a3831c0508580a33dbc585f6c80fda
    juszhan1 committed Oct 28, 2024
    Commit e13e18d
  5. Revert "DAOS-15914: crt_reply_send_input_free() (#14817)"

    This reverts commit 23f0787.
    juszhan1 committed Oct 28, 2024
    Commit 145b549