Dev reducedlock #1551

Merged

Commits on Apr 23, 2024

  1. PXB-3113 : Improve debug sync framework to allow PXB to pause and resume threads
    
    https://perconadev.atlassian.net/browse/PXB-3113
    
    The current debug-sync option in PXB suspends the entire PXB process; the user resumes it by sending a SIGCONT signal.
    This is useful for scenarios where PXB is paused, certain operations are done on the server, and PXB is then resumed to complete.
    
    But many bugs we found during testing involve multiple threads in PXB. The goal of this work is to be able to
    pause and resume an individual thread.
    
    Since many tests use the existing debug-sync option, I don't want to disturb these tests. We can convert them to
    the new mechanism later.
    
    How to use?
    -----------
    The new mechanism is used with option --debug-sync-thread="sync_point_name"
    
    In the code, place a debug_sync_thread("debug_point_1") call to stop the thread at that place.
    
    You can pass the debug_sync point via the command line: --debug-sync-thread="debug_sync_point1"
    
    PXB will create a file named after the debug_sync point in the backup directory. It is suffixed with a thread number.
    Please ensure that no two debug_sync points use the same name (it doesn't make sense to have two sync points with the same name)
    
    ```
    2024-03-28T15:58:23.310386-00:00 0 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: sleeping 1sec.  Resume this thread by deleting file /home/satya/WORK/pxb/bld/backup//xb_before_file_copy_4860396430306702017
    ```
    In the test, after activating the sync point, you can wait for it with wait_for_debug_sync_thread_point <syncpoint_name>
    
    Do some stuff now. This thread is sleeping.
    
    Once you are done and want the thread to resume, delete the file named in the log message, e.g. `rm backup_dir/xb_sync_point_name_*`
    (the file name carries an xb_ prefix, as the log line above shows).
    Preferably use resume_debug_sync_thread_point <syncpoint_name> <backup_dir>. It deletes the sync point file and additionally checks
    that the sync point has indeed resumed.
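    
    A rough Python model of the file-based pause point described above, offered as a sketch and not as PXB's actual C++ code. The
    xb_<name>_<thread id> file naming follows the log line above; active_point, poll_seconds and the helper wait_for_sync_point()
    are hypothetical names introduced only for this illustration.

```python
import threading
import time
from pathlib import Path


def debug_sync_thread(backup_dir, sync_point, active_point, poll_seconds=0.05):
    """Pause the calling thread at `sync_point` if it is the active one.

    Creates a marker file xb_<sync_point>_<thread id> in backup_dir and sleeps
    until that file is deleted. Returns the marker path, or None when this
    sync point is not the active one.
    """
    if sync_point != active_point:
        return None
    marker = Path(backup_dir) / f"xb_{sync_point}_{threading.get_ident()}"
    marker.touch()
    while marker.exists():      # the thread resumes once the marker is deleted
        time.sleep(poll_seconds)
    return marker


def wait_for_sync_point(backup_dir, sync_point, timeout=5.0):
    """Block until a marker file for `sync_point` appears; return its path."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        found = list(Path(backup_dir).glob(f"xb_{sync_point}_*"))
        if found:
            return found[0]
        time.sleep(0.01)
    raise TimeoutError(f"sync point {sync_point!r} was not reached")
```

    Only the thread that reaches the sync point sleeps; every other thread keeps running, which is the difference from the
    process-wide SIGSTOP/SIGCONT approach.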
    
    More common/complicated scenario:
    ----------------------------------
    The scenario is to signal another thread to stop after the first sync point has been reached. To achieve this, do steps 1 to 3 (above)
    and continue as follows.
    
    Echo the debug_sync point name into a file named "xb_debug_sync_thread" in the backup directory. Example:
    
    4. echo "xtrabackup_copy_logfile_pause" > backup/xb_debug_sync_thread
    
    5. Send the SIGUSR1 signal to the PXB process: kill -SIGUSR1 496102
    
    6. Wait for the sync point to be reached: wait_for_debug_sync_thread <syncpoint_name>
    
    PXB acknowledges the signal:
    2024-03-28T16:05:07.849926-00:00 0 [Note] [MY-011825] [Xtrabackup] SIGUSR1 received. Reading debug_sync point from xb_debug_sync_thread file in backup directory
    2024-03-28T16:05:07.850004-00:00 0 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: Deleting  file/home/satya/WORK/pxb/bld/backup//xb_debug_sync_thread
    
    and then prints this once the sync point is reached:
    2024-03-28T16:05:08.508830-00:00 1 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: sleeping 1sec.  Resume this thread by deleting file /home/satya/WORK/pxb/bld/backup//xb_xtrabackup_copy_logfile_pause_10389933572825668634
    
    At this point, we have two threads sleeping at two sync points. Either of them can be resumed by deleting its file, as named in the error log.
    (Or use resume_debug_sync_thread())
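    
    The signal-driven activation in steps 4 to 6 could look roughly like this in Python. It is a sketch under assumptions, not the
    PXB implementation: the control file name comes from this message, while active_sync_points, load_sync_point() and
    install_sigusr1_handler() are hypothetical names.

```python
import signal
from pathlib import Path

CONTROL_FILE = "xb_debug_sync_thread"   # file name taken from the commit message
active_sync_points = set()              # hypothetical in-process registry


def load_sync_point(backup_dir):
    """Read a sync point name from the control file, register it, delete the file."""
    control = Path(backup_dir) / CONTROL_FILE
    try:
        name = control.read_text().strip()
    except FileNotFoundError:
        return None                     # signal arrived without a control file
    control.unlink(missing_ok=True)
    if name:
        active_sync_points.add(name)
    return name or None


def install_sigusr1_handler(backup_dir):
    """Arm SIGUSR1 so each signal activates one more sync point (POSIX only)."""
    signal.signal(signal.SIGUSR1,
                  lambda signum, frame: load_sync_point(backup_dir))
```

    Deleting the control file after reading it mirrors the "Deleting file" log line above and lets the test write a new sync point
    name before sending the next signal.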
    satya-bodapati committed Apr 23, 2024
    294685a
  2. PXB-3252 : Xtrabackup failed to read page after 10 retries. File ./mysql.ibd seems to be corrupted.
    
    https://perconadev.atlassian.net/browse/PXB-3252
    
    Problem:
    --------
    With lock-ddl=REDUCED, ALTER ENCRYPTION='Y'/'N' can run during the backup. On general tablespaces, this is done in place,
    i.e. the space_id of the tablespace does not change and the pages are encrypted or decrypted.
    
    For file-per-table tablespaces, a new tablespace is created with the encryption key and data is copied from
    the old tablespace to the new tablespace.
    
    In xtrabackup, the files are first discovered and then copied. Between these two operations, the encryption of a
    tablespace can change. For example, PXB sees that ts1.ibd is encrypted with key1 and loads it into the cache.
    
    Then the server does ENCRYPTION='N' and then back to ENCRYPTION='Y'; now the tablespace is encrypted with a different key.
    
    Now a PXB copy thread tries to copy this tablespace and cannot decrypt a page. Page 0 is always unencrypted, so the
    problem is typically detected at page 1, but it can happen on any page.
    
    Since PXB cannot decrypt the page, it reports corruption and aborts the backup.
    
    Fix:
    ----
    On decryption errors, we track such tablespaces in a separate corrupted list. We also add them to the recopy tables list.
    Under lock, these tablespaces are copied again. A .new extension is used.
    Then we process the corrupted list under lock and create .corrupted files for the tablespaces from the corrupted list.
    For example, if the affected tablespace is ts1.ibd, the file will be ts1.ibd.corrupted.
    
    On prepare, we delete the corresponding ts1.ibd if ts1.ibd.corrupted is present. This has to be done before the
    *.ibd scan because tablespace loading aborts on processing such half-written tablespaces.
    If the .corrupted file is present in the incremental directory, we delete the ts1.ibd.meta and ts1.ibd.delta files from the incremental
    backup directory.
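    
    The prepare-time cleanup can be pictured with the following Python sketch. It only illustrates the file handling described
    above; remove_corrupted_tablespaces() and its incremental flag are hypothetical, and whether the marker files themselves are
    removed afterwards is not stated here, so the sketch leaves them in place.

```python
from pathlib import Path

MARKER_SUFFIX = ".corrupted"


def remove_corrupted_tablespaces(backup_dir, incremental=False):
    """Delete the data files that have a <name>.ibd.corrupted marker.

    Full backup: removes <name>.ibd. Incremental backup: removes
    <name>.ibd.meta and <name>.ibd.delta. Returns the removed paths, sorted.
    """
    removed = []
    # Materialise the listing first so deletions do not disturb the scan.
    markers = list(Path(backup_dir).rglob("*.ibd" + MARKER_SUFFIX))
    for marker in markers:
        tablespace = marker.with_name(marker.name[: -len(MARKER_SUFFIX)])
        if incremental:
            victims = [tablespace.with_name(tablespace.name + ext)
                       for ext in (".meta", ".delta")]
        else:
            victims = [tablespace]
        for victim in victims:
            if victim.exists():
                victim.unlink()
                removed.append(victim)
    return sorted(removed)
```

    Running this before any *.ibd scan matches the ordering requirement above: the half-written tablespace is gone before
    tablespace loading can trip over it.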
    satya-bodapati committed Apr 23, 2024
    88db4a4
  3. PXB-3246 : Assertion failure: log0recv.cc:2141:!page || fil_page_type_is_index(page_type)
    
    Problem:
    --------
    A redo log record cannot be applied because the page is in the wrong state. It was observed that
    the tablespace was created by the incremental backup.
    
    How did this happen?
    --------------------
    
    Let's say the tablespace is t1.ibd and it is present in the full backup.
    Before the incremental backup, it gets renamed to t2.ibd.
    The incremental backup creates t2.ibd.delta and t2.ibd.meta files in the incremental backup directory.
    Later t2.ibd is dropped, so we have a space_id.del file in the incremental backup directory.
    Some redo is also generated on this table before it is dropped.
    
    During prepare of the incremental backup, when we process a space_id.del file, we look up the tablespace with that space_id.
    Let's say it is 2.del. To process 2.del, we first look up the tablespace with space_id 2.
    Since the tablespace name is t1.ibd in the full backup directory, we delete it. Additionally,
    we delete the .delta and .meta files, so we try to delete the t1.ibd.meta and t1.ibd.delta files.
    They never existed, so we ignore the errors from deleting them.
    
    But in the incremental backup directory, we still have the t2.ibd.delta and t2.ibd.meta files. So the incremental backup prepare
    creates a tablespace with space_id 2 and applies the delta file changes. This tablespace is wrong
    because we are recreating a dropped tablespace and we don't have all the changes. The incremental backup
    creates this tablespace with 7 all-zero pages. Later, when we apply MLOG_INSERT to the index page,
    we find out that the page is NOT in the correct state.
    
    Fix:
    ----
    We have to delete the right incremental files based on space_id. So we build a meta map by scanning
    *.meta files, with the space_id (found in the meta file) as the key.
    
    Later, when we process the space_id.del file, after removing the tablespace with that space_id,
    we ask the meta map cache for the .delta and .meta files belonging to the deleted space_id.
    By deleting the unnecessary .meta and .delta files, the tablespace is considered dropped by redo
    and the corresponding redo entries are not applied.
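    
    A small Python sketch of the meta map idea, for illustration only. It assumes that a .meta file is plain text containing a
    line of the form "space_id = N"; the function names read_space_id(), build_meta_map() and drop_incremental_files() are
    hypothetical and do not come from the PXB sources.

```python
from pathlib import Path


def read_space_id(meta_path):
    """Return the space_id recorded in a .meta file, or None if absent."""
    for line in Path(meta_path).read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "space_id":
            return int(value)
    return None


def build_meta_map(inc_dir):
    """Map space_id -> path of the .meta file found under inc_dir."""
    meta_map = {}
    for meta in Path(inc_dir).rglob("*.meta"):
        space_id = read_space_id(meta)
        if space_id is not None:
            meta_map[space_id] = meta
    return meta_map


def drop_incremental_files(meta_map, space_id):
    """Delete the .meta/.delta pair of a dropped space_id; return removed paths."""
    meta = meta_map.pop(space_id, None)
    if meta is None:
        return []
    delta = meta.with_suffix(".delta")      # t2.ibd.meta -> t2.ibd.delta
    removed = []
    for path in (meta, delta):
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

    Keying the lookup on space_id rather than on the file name is what makes the rename from t1.ibd to t2.ibd harmless: processing
    2.del finds t2.ibd.meta even though the full backup still knows the tablespace as t1.ibd.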
    satya-bodapati committed Apr 23, 2024
    56802d4
  4. PXB-3253 : [ERROR] [MY-012592] [InnoDB] Operating system error number 2 in a file operation
    
    https://perconadev.atlassian.net/browse/PXB-3253
    
    Problem:
    --------
    Files disappear during backup with --lock-ddl=reduced
    
    Analysis:
    ---------
    PXB opens server files using os_file_create_simple_no_error_handling() via Fil_shard::open_file(),
    Fil_shard::get_file_size(), Datafile::open_read_only(). This API doesn't tolerate file open errors.
    
    This particular bug occurs when the file disappears after get_file_size() in Fil_shard::open_file().
    (See the testcase for more details).
    
    Fix:
    ----
    If lock-ddl is REDUCED and we have not yet entered the copy-under-lock phase,
    i.e. is_server_locked() is false, we can tolerate file open errors. So we use the function/API
    os_file_create() instead of the other variants. Within it, based on the lock-ddl reduced mode, we
    tolerate file opening errors.
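    
    The decision rule in this fix can be sketched in Python as follows. This models only the condition described above, not the
    InnoDB file API; open_server_file() and its parameters are hypothetical stand-ins for the lock-ddl mode and is_server_locked().

```python
def open_server_file(path, lock_ddl_reduced, server_locked):
    """Open a server data file for reading.

    Returns an open binary file object, or None when the file is missing and
    that is tolerable (reduced lock mode, copy-under-lock phase not entered).
    """
    tolerate_missing = lock_ddl_reduced and not server_locked
    try:
        return open(path, "rb")
    except FileNotFoundError:
        if tolerate_missing:
            return None     # caller skips the file; it was dropped or renamed
        raise               # once the server is locked, a missing file is a real error
```

    Once the server is locked, DDL can no longer remove files, so a missing file at that stage still has to surface as an error.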
    satya-bodapati committed Apr 23, 2024
    0e56e03
  5. PXB-3223 : PXB must not allow --lock-ddl=REDUCED when pagetracking is enabled
    
    Problem:
    --------
    We cannot allow page tracking with lock-ddl=REDUCED. This is because page tracking gives
    us a set of page_ids (space_id, page_no). PXB should copy these pages, and while we copy
    them, tablespaces can disappear, get renamed, get encrypted, etc.
    
    We will enable it if there is a need or use case for this. For now, we will disable it.
    
    Fix:
    ----
    Disable the combination of --page-tracking and --lock-ddl=REDUCED
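    
    As an illustration, the check amounts to rejecting one option combination up front. The function name and the error text in
    this Python sketch are hypothetical and are not PXB's actual message.

```python
def validate_options(lock_ddl, page_tracking):
    """Raise ValueError for the unsupported combination; otherwise return None."""
    if page_tracking and lock_ddl.upper() == "REDUCED":
        raise ValueError(
            "--page-tracking cannot be used with --lock-ddl=REDUCED")
```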
    satya-bodapati committed Apr 23, 2024
    5c33f6c
  6. PXB-3120 : Assertion failure: Dir_Walker::is_directory

    Problem:
    --------
    InnoDB assumes directories or files do not disappear. That is true
    for the engine because it is in startup mode and no operations are allowed
    at that point in time.
    
    Analysis:
    ---------
    With lock-ddl=REDUCED, tables can be dropped concurrently while PXB does the *.ibd scan,
    and subdirectories can disappear too.
    
    Fix:
    ----
    Make walk_posix() handle missing files/directories. The scan should continue and skip
    these deleted files or directories.
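    
    The intended behaviour can be sketched in Python with os.scandir(). This is an analogy for the walk_posix() change, not a
    translation of it; walk_tolerant() and the visit callback are hypothetical names.

```python
import os


def walk_tolerant(root, visit):
    """Depth-first walk calling visit(path) for every non-directory under root.

    Directories and files that vanish while the walk is running are skipped
    instead of aborting the scan.
    """
    try:
        with os.scandir(root) as it:
            entries = list(it)
    except FileNotFoundError:
        return                          # the directory itself was removed
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            walk_tolerant(entry.path, visit)
        else:
            try:
                visit(entry.path)
            except FileNotFoundError:
                continue                # file was dropped after it was listed
```

    A directory removed between listing and descent is caught by the first handler on the recursive call, so both cases named in
    the analysis are covered.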
    satya-bodapati committed Apr 23, 2024
    141d30c