Mschaara/dfs dcache merge #15114
Merged
…15049) When the local host is included in the default interface selection, a difference in ib0 speeds causes the logic to select eth0 and then the tcp provider. Signed-off-by: Phil Henderson <[email protected]>
CaRT has added the ability to select the network interface on context creation. The daos_agent has also added a NUMA-fabric map that can be queried at init time. Update the DAOS client to query the agent for a map of NUMA node to network interface in daos_init() and, on EQ creation, select the best interface for the network context based on the NUMA node of the calling thread. Signed-off-by: Mohamad Chaarawi <[email protected]>
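A minimal sketch of the selection step described above, assuming the agent's map can be modeled as a dictionary from NUMA node to a list of interface names. The function name `pick_interface`, the map layout, and the fallback order are hypothetical and are not taken from the DAOS client code.

```python
def pick_interface(numa_map, numa_node, default=None):
    """Return an interface for the calling thread's NUMA node.

    numa_map: hypothetical {numa_node: [interface_name, ...]} mapping.
    Falls back to the lowest-numbered node that has an interface,
    then to `default` when the map is empty.
    """
    candidates = numa_map.get(numa_node)
    if candidates:
        return candidates[0]
    for node in sorted(numa_map):
        if numa_map[node]:
            return numa_map[node][0]
    return default


# Example: a two-socket host with one fabric interface per socket.
example_map = {0: ["ib0"], 1: ["ib1"]}
print(pick_interface(example_map, 1))
```

The fallback to another node's interface is one possible policy; the actual client may prefer to fail or to use a configured default instead.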
We've noticed that with sequential order, object placement is poor. We get 40% fill for 8GiB files with 25 ranks and 16 targets per rank with EC_2P1G8. With this patch, we get a much better distribution. This patch adds the following: 1. A function for cycling oid.hi incrementing by a large prime 2. For DFS, randomize the starting value 3. Modify DFS to cycle OIDs using the new function. Signed-off-by: Jeff Olivier <[email protected]>
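The cycling idea in this commit can be sketched as follows. This is an illustration only: the 32-bit field width, the step constant 999999937, and the function names are assumptions, not the values used by the patch. The property relied on is that any odd step is coprime to a power-of-two modulus, so the sequence visits every value once before repeating.

```python
import random


def cycle_oid_hi(hi, step=999_999_937, bits=32):
    """Advance hi by a fixed prime step, wrapping at 2**bits.

    step and bits are illustrative assumptions, not DAOS constants.
    """
    return (hi + step) % (1 << bits)


def oid_hi_sequence(count, start=None, step=999_999_937, bits=32):
    """Generate `count` hi values, starting from a random point by default."""
    if start is None:
        start = random.randrange(1 << bits)  # randomized starting value
    hi = start
    values = []
    for _ in range(count):
        values.append(hi)
        hi = cycle_oid_hi(hi, step, bits)
    return values


# Small demonstration with an 8-bit field and the prime step 251:
# all 256 values appear exactly once.
demo = oid_hi_sequence(256, start=7, step=251, bits=8)
print(len(set(demo)))
```

Compared with incrementing by one, consecutive values land far apart in the field, which is the effect the commit relies on to spread placement.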
In dtx_req_send, since the crt_req_send releases the req reference, din may have been freed when dereferenced for the DL_CDEBUG call. Signed-off-by: Li Wei <[email protected]>
Wrap self_test to provide a simplified network test to detect obvious client/server connectivity and performance problems. Signed-off-by: Michael MacDonald <[email protected]>
If a user creates a container without --file-oclass, the get_info call was returning the default oclass of a directory on daos fs get-attr. Fix that to properly use the enum types for the default scenario. Signed-off-by: Mohamad Chaarawi <[email protected]>
* DAOS-15863 container: fix a race for container cache. While destroying a container, cont_child_destroy_one() releases its own refcount before waiting. If another ULT then releases its refcount, which is the last one, it wakes up the waiting ULT and frees the ds_cont_child straightaway, because no one else holds a refcount. When the waiting ULT is woken up, it tries to change the already freed ds_cont_child. This patch changes the LRU eviction logic and fixes this race. Signed-off-by: Liang Zhen <[email protected]> Signed-off-by: Jeff Olivier <[email protected]> Co-authored-by: Jeff Olivier <[email protected]>
The dfuse/ioctl_pool_handles.py test is overloading the VM, so reduce the number of engine targets. Signed-off-by: Phil Henderson <[email protected]>
It is possible that the DTX modified nothing when stopping the current backend transaction. In that case, we may not generate a persistent DTX entry, so we need to bypass this case before checking the on-disk DTX entry status. The patch also does some cleanup and removes redundant metrics for committed DTX entries. Enhance vos_dtx_deregister_record() to handle the GC case. Signed-off-by: Fan Yong <[email protected]>
) Signed-off-by: Joseph Moore <[email protected]>
…ace (#15050) Allow selecting a default interface that is running at a different speed on different hosts. Primarily this is to support selecting the ib0 interface by default when the launch node has a slower ib0 interface than the cluster hosts. Signed-off-by: Phil Henderson <[email protected]>
#15004) In HDF5, DFS, MPIIO, or POSIX, object class and container properties are defined during container creation. For DFS, the object class is also set to the IOR parameter. In HDF5-VOL, however, object class and container properties are defined with the following mpirun environment variables: HDF5_DAOS_OBJ_CLASS (object class) and HDF5_DAOS_FILE_PROP (container properties). The infrastructure to set these variables is already there in run_ior_with_pool(). In file_count_test_base.py, pass the env vars to run_ior_with_pool(env=env) as a dictionary. Object class is the oclass variable. Container properties can be obtained from the container -> properties field in the test yaml. This fix is discussed in PR #14964. Signed-off-by: Makito Kano <[email protected]>
Set D_IL_REPORT per test instead of setting default values in utilities. This allows running without it set. Signed-off-by: Dalton Bohning <[email protected]>
Automatically include dfs tests when dfs files are modified in PRs. Signed-off-by: Dalton Bohning <[email protected]>
update pylint to 3.2.7 Signed-off-by: Dalton Bohning <[email protected]>
) replace usage of IorTestBase.execute_cmd with run_remote Signed-off-by: Dalton Bohning <[email protected]>
For an EC object update via CPD RPC, when calculating the bitmap used to skip some iods for the current EC data shard, we may pass NULL for the "*skips" parameter. This may cause the old logic in obj_get_iods_offs_by_oid() to use some undefined DRAM for the "skips" bitmap. Such a bitmap may be overwritten by others, so the subsequent obj_bulk_transfer() may be misguided. The patch also fixes a bug inside obj_bulk_transfer() that forcibly cast any input RPC as UPDATE/FETCH. Signed-off-by: Fan Yong <[email protected]>
A client with a stale pool map may try to send an RPC to a DOWN target. If the target was brought DOWN due to a faulty NVMe device, the ds_pool_child could have been stopped by the NVMe faulty reaction. Ensure a proper error code is returned in this case. Signed-off-by: Niu Yawei <[email protected]>
Fix coverity 2555843: explicit null dereferenced. Signed-off-by: Niu Yawei <[email protected]>
…5037) * DAOS-16467 rebuild: add DAOS_PW_RF ENV for massive failure case. Allow the user to set DAOS_PW_RF as pw_rf (pool-wise RF). If an engine failure detected by SWIM would break pw_rf, don't change the pool map and don't trigger rebuild. A critical log message asks the administrator to bring back those engines as top priority (just "system start --ranks=xxx"; there is no need to reintegrate those engines). A few functions are renamed to avoid confusion: pool_map_find_nodes() -> pool_map_find_ranks(), pool_map_find_node_by_rank() -> pool_map_find_dom_by_rank(), pool_map_node_nr() -> pool_map_rank_nr(). Signed-off-by: Xuezhao Liu <[email protected]>
…5069) Unlike fetch, we return DER_CSUM on update (turned into EIO by dfs) without any retry. We should retry a few times in case it is a transient error. The patch also prints more information about the actual checksum mismatch. Signed-off-by: Johann Lombardi <[email protected]>
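The retry behavior described in this commit can be sketched as a bounded retry loop. The exception type `ChecksumError`, the helper `update_with_retry`, and the retry count of 3 are hypothetical stand-ins; the commit only says "a few times" and does not give the actual limit.

```python
class ChecksumError(Exception):
    """Hypothetical stand-in for a checksum mismatch reported on update."""


def update_with_retry(do_update, max_retries=3):
    """Call do_update(), retrying on ChecksumError up to max_retries times.

    The error is re-raised once the retries are exhausted, so a
    persistent mismatch still reaches the caller.
    """
    failures = 0
    while True:
        try:
            return do_update()
        except ChecksumError:
            failures += 1
            if failures > max_retries:
                raise


# Example: an update that fails twice with a transient error, then succeeds.
state = {"calls": 0}

def flaky_update():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ChecksumError("transient mismatch")
    return "ok"

print(update_with_retry(flaky_update))
```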
To avoid allocation failure on a fragmented system, huge SV allocation will be split into multiple smaller allocations, each allocation size is capped to 8MB (the DMA chunk size, that could avoid huge DMA buffer allocation). The address of such scattered SV payload is represented by 'gang address'. Removed io_allocbuf_failure() vos unit test, it's not applicable in gang SV mode now. Signed-off-by: Niu Yawei <[email protected]>
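As a rough illustration of the splitting described above, the sketch below divides one large request into pieces no larger than a cap. It treats the 8MB cap as 8 MiB and uses the hypothetical name `split_allocation`; it models only the size arithmetic, not the gang address representation or the VOS allocator.

```python
CHUNK_CAP = 8 * 1024 * 1024  # assumed cap: 8 MiB per allocation


def split_allocation(total_size, cap=CHUNK_CAP):
    """Return the list of chunk sizes covering total_size bytes."""
    if total_size <= 0 or cap <= 0:
        raise ValueError("sizes must be positive")
    sizes = []
    remaining = total_size
    while remaining > 0:
        size = min(remaining, cap)  # last chunk may be smaller than the cap
        sizes.append(size)
        remaining -= size
    return sizes


# Example: a 20 MiB single value becomes three allocations.
print(split_allocation(20 * 1024 * 1024))
```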
The previous default of 1MiB isn't helpful at large scales. Use a default of 1KiB to get faster results and a better balance between raw latency and bandwidth. Also include calculated rpc throughput and bandwidth in JSON output. Signed-off-by: Michael MacDonald <[email protected]>
) Signed-off-by: Tomasz Gromadzki <[email protected]>
It has been seen that obj_ec_singv_split may read beyond the end of sgl->sg_iovs[0].iov_buf: iod_size=8569 c_bytes=4288 id_shard=0 tgt_off=1 iov_len=8569 iov_buf_len=8569 The memmove read 4288 bytes from offset 4288, whereas the buffer only had 8569 - 4288 = 4281 bytes from offset 4288. This patch fixes the problem by adding the min(...) expression. Signed-off-by: Li Wei <[email protected]>
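The arithmetic in this report can be checked with a short sketch. It assumes the read offset is tgt_off * c_bytes, which matches the numbers quoted above (offset 4288 for tgt_off=1); the function name `shard_copy_len` is hypothetical and the real fix lives in obj_ec_singv_split.

```python
def shard_copy_len(iov_len, c_bytes, tgt_off):
    """Bytes that can safely be copied for one shard.

    Clamps the nominal cell size c_bytes to what is left in the
    buffer after the shard's offset (assumed to be tgt_off * c_bytes).
    """
    offset = tgt_off * c_bytes
    return max(0, min(c_bytes, iov_len - offset))


# Values from the report: only 4281 bytes remain at offset 4288.
print(shard_copy_len(iov_len=8569, c_bytes=4288, tgt_off=1))
```

Without the min(...) clamp, the copy length would be the full 4288 bytes, 7 bytes past the end of the 8569-byte buffer.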
Required-githooks: true Signed-off-by: Mohamad Chaarawi <[email protected]>
Errors are: component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments. Unable to load ticket data.
mchaarawi added a commit that referenced this pull request on Sep 11, 2024:
This reverts commit b24a38c.
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: