Tuesday, 21 January 2020

ASM rebalancing time to complete

The disk group rebalance operation has three phases:
  • Planning
  • File extents relocation
  • Disk compacting
As far as the overall time to complete is concerned, the planning phase is insignificant, so there is no need to worry about it. The file extents relocation phase will take most of the time, so the main focus will be on that. The disk compacting phase may also take a significant amount of time, in particular on a disk add, so we will have a closer look at that as well.
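Note that on Grid Infrastructure 12.1 and later, V$ASM_OPERATION also exposes a PASS column that shows directly which phase is running (RESYNC, REBALANCE or COMPACT), so the phase does not have to be inferred; a minimal check:

SQL> select INST_ID, OPERATION, PASS, STATE, EST_MINUTES from GV$ASM_OPERATION;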

It is important to understand why the rebalance is running. If we are adding a new disk, say to increase the available disk group space, it doesn't really matter how long it will take for the rebalance to complete. Similarly if we are resizing or dropping disk(s), to adjust the disk group space, we are generally not concerned with the time it takes for the rebalance to complete.

But if a disk has failed and ASM has initiated the rebalance, there may be a legitimate reason for concern. If the disk group is normal redundancy, and another disk that is a partner of the already failed disk also fails, the disk group will be dismounted, all the databases that use that disk group will crash, and there may be loss of data. In such cases it may be important to have an idea when the rebalance operation will complete. Actually, we want to see the file extents relocation phase complete, as once it does, all the data is fully redundant again (in the case where the rebalance was initiated by a disk failure).
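Disk partnerships can be inspected through the undocumented X$KFDPARTNER structure, to get an idea of the exposure. A sketch, assuming disk group number 1 (X$ structures are undocumented and version specific, so treat this as illustrative only):

SQL> select DISK, count(NUMBER_KFDPART) PARTNER_COUNT from X$KFDPARTNER where GRP=1 group by DISK order by DISK;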

File extents relocation

To have a closer look at the file extents relocation phase, I drop one of the disks with the default rebalance power. I then query GV$ASM_OPERATION to check the estimated completion time (EST_MINUTES):
SQL> show parameter power

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------
asm_power_limit                      integer     1

SQL> set time on
16:40:57 SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06;

Diskgroup altered.

Initial estimated time to complete is 121 minutes:

SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION ;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         1 REBAL DONE         11          0          0          0           0
         1 REBAL RUN          11    1846647    2336971       4043         121
         1 REBAL WAIT         11          0          0          0           0

About 10 minutes into the rebalance, the estimate to complete is 69 minutes:
16:50:25 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         1 REBAL DONE         11          0          0          0           0
         1 REBAL RUN          11    1851014    2336971       7030          69
         1 REBAL WAIT         11          0          0          0           0


While that EST_MINUTES value doesn't give me much confidence, I see that SOFAR (the number of allocation units moved so far) is going up, which is a good sign.
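Those columns can also be combined into a rough progress figure; a simple sketch (EST_RATE is in allocation units per minute, so the second expression is just a recomputed EST_MINUTES):

SQL> select INST_ID, round(100*SOFAR/EST_WORK) PCT_DONE, round((EST_WORK-SOFAR)/EST_RATE) MIN_LEFT from GV$ASM_OPERATION where STATE='RUN' and EST_RATE > 0;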

The ASM alert log shows the time of the disk drop, the OS process ID of the ARB0 process doing all the work and, most importantly, that there are no errors:
Tue Jan 21 09:19:38 2020
SUCCESS: ALTER DISKGROUP DATA DROP DISK
DATA_0003,
DATA_0001,
DATA_0006,
DATA_0005,
DATA_00018,
DATA_00019,
DATA_00020,
DATA_00021,
DATA_00022,
DATA_00023
REBALANCE POWER 11
Starting background process ARB0
Tue Jan 21 09:19:38 2020
ARB0 started with pid=22, OS id=2841 
NOTE: assigning ARB0 to group 1/0x39c42279 (DATA) with 11 parallel I/Os
NOTE: Attempting voting file refresh on diskgroup DATA
Tue Jan 21 09:19:44 2020
NOTE: F1X0 on disk 3 (fmt 2) relocated at fcn 0.95799708: AU 143462 -> AU 0
NOTE: F1X0 on disk 21 (fmt 2) relocated at fcn 0.95799708: AU 143458 -> AU 0
NOTE: F1X0 on disk 22 (fmt 2) relocated at fcn 0.95799708: AU 153322 -> AU 0
NOTE: header on disk 26 advanced to format #2 using fcn 0.0
...

The ARB0 trace file should show which file extents are being relocated. It does, and that is how I know that ARB0 is doing what it is supposed to do:
-bash-4.1$ tail -f /d01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_arb0_2841.trc 
ARB0 relocating file +DATA.454.1015698267 (120 entries)

*** 2020-01-21 14:57:49.438
ARB0 relocating file +DATA.454.1015698267 (120 entries)

*** 2020-01-21 14:57:53.686
ARB0 relocating file +DATA.454.1015698267 (120 entries)

*** 2020-01-21 14:57:57.496
ARB0 relocating file +DATA.454.1015698267 (120 entries)

*** 2020-01-21 14:58:01.679
ARB0 relocating file +DATA.454.1015698267 (120 entries)

Note that there may be a lot of ARB0 trace files in the trace directory, which is why we need to know the OS process ID of the ARB0 process actually doing the rebalance. That information is in the alert log of the ASM instance performing the rebalance.
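The ARB0 OS process ID can also be looked up from the ASM instance itself; a sketch using V$PROCESS (the PNAME column is available in recent releases):

SQL> select SPID, PNAME from V$PROCESS where PNAME like 'ARB%';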

After about 35 minutes the EST_MINUTES drops to 0:
17:16:54 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN           1      74581      75825       2129           0
         3 REBAL WAIT          1
         4 REBAL WAIT          1

And soon after that, the ASM alert log shows:
  • Disk emptied
  • Disk header erased
  • PST update completed successfully
  • Disk closed
  • Rebalance completed

The initial estimate was 121 minutes, while the rebalance actually took about 36 minutes (in this particular case the disk compacting took less than a minute, so I have ignored it). That is why it is more important to understand what is going on than to know exactly when the rebalance will complete.

Note that the estimated time may also increase. If the system is under heavy load, the rebalance will take longer, especially with rebalance power 1. For a large disk group (many TB) with a large number of files, the rebalance can take hours, possibly days.

If you want to get an idea how long a disk drop will take in your environment, you need to test it. Just drop one of the disks while your system is under a normal/typical load, as sketched below. Your data is fully redundant during such a disk drop, so you are not exposed to a disk group dismount if the partner disk fails during the rebalance.
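A sketch of such a test; the disk name here is hypothetical. Note that a drop that is still being rebalanced can be cancelled with UNDROP DISKS (that does not work once the drop has completed, nor for DROP DISK FORCE):

SQL> alter diskgroup DATA1 drop disk DATA1_CD_05_CELL05;
SQL> select OPERATION, STATE, SOFAR, EST_WORK, EST_MINUTES from V$ASM_OPERATION;

-- to cancel the still-pending drop, if needed:
SQL> alter diskgroup DATA1 undrop disks;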

Disk compacting

To look at the disk compacting phase, I add the same disk back, with rebalance power 10:
17:26:48 SQL> alter diskgroup DATA1 add disk '/o/*/DATA1_CD_06_cell06' rebalance power 10;

Diskgroup altered.

Initial estimated time to complete is 6 minutes:
17:27:22 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN          10        489      53851       7920           6
         3 REBAL WAIT         10
         4 REBAL WAIT         10

After about 10 minutes, the EST_MINUTES drops to 0:
17:39:05 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         2 REBAL RUN          10      92407      97874       8716           0
         4 REBAL WAIT         10

And I see the following in the ASM alert log:
Wed Jul 11 17:39:49 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:39:58 2012
GMON updating for reconfiguration, group 1 at 31 for pid 43, osid 115117
NOTE: group 1 PST updated.
Wed Jul 11 17:39:58 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 32 for pid 19, osid 38421
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

That means that ASM has completed the file extents relocation phase of the rebalance and has started the disk compacting phase. If that is true, we should see the kfdCompact() function on the stack. And we do:
# pstack 103326
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003c4b737 in kfdCompact ()
#19 0x0000000003c4c6d0 in kfdExecute ()
#20 0x0000000003d4bf0e in kfgbRebalExecute ()
#21 0x0000000003d39627 in kfgbDriver ()
#22 0x00000000020e8d23 in ksbabs ()
#23 0x0000000003d4faae in kfgbRun ()
#24 0x00000000020ed95d in ksbrdp ()
#25 0x0000000002322343 in opirip ()
#26 0x0000000001618571 in opidrv ()
#27 0x0000000001c13be7 in sou2o ()
#28 0x000000000083ceba in opimai_real ()
#29 0x0000000001c19b58 in ssthrdmain ()
#30 0x000000000083cda1 in main ()

The tail of the current ARB0 trace file now shows that it is relocating just one allocation unit at a time ("1 entries"), another sign of the disk compacting phase:
$ tail -f /+ASM2_arb0_103326.trc
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
...

V$ASM_OPERATION keeps showing EST_MINUTES=0 for the whole duration of the disk compacting phase (while not helpful, this is normal and expected):
17:42:39 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         4 REBAL WAIT         10
         2 REBAL RUN          10      98271      98305       7919           0

The X$KFGMG view shows REBALST_KFGMG=2 (yet another confirmation of the disk compacting phase):
17:42:50 SQL> select NUMBER_KFGMG, OP_KFGMG, ACTUAL_KFGMG, REBALST_KFGMG from X$KFGMG;

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           1          1           10             2

Once the compacting phase completes, the alert log shows "stopping process ARB0" and "rebalance completed":
Wed Jul 11 17:43:48 2012
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)

In this case, the file extents relocation phase took about 12 minutes and the disk compacting phase took about 4 minutes.

The compacting phase can actually take a significant amount of time. In one case I have seen the file extents relocation run for 60 minutes, with the disk compacting after that taking another 30 minutes. But it doesn't really matter how long the compacting takes, because as soon as the file extents relocation completes, all data is fully redundant and we are not exposed to a disk group dismount due to a partner disk failure.

Adjusting the rebalance power


The rebalance power can be adjusted dynamically, i.e. during the rebalance. If the rebalance with the default power is 'too slow', the power can be increased. By how much? To answer that question, we need to understand the I/O load, the I/O throughput and, most importantly, the I/O limits the system can take. If we don't know that, the power can be increased to 5 (with 'ALTER DISKGROUP ... REBALANCE POWER 5;') and we can then check if that makes a difference. Should we go any higher with the rebalance power? As long as we are not adversely impacting the database I/O performance, we can keep increasing the power, but I haven't seen much improvement beyond power 30. Note that the power can go up to 11 for disk groups with COMPATIBLE.ASM lower than 11.2.0.2, and up to 1024 for disk groups with COMPATIBLE.ASM set to 11.2.0.2 or higher.
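For example, to bump the power of an ongoing rebalance and confirm the new value took effect (using the DATA1 disk group from above):

SQL> alter diskgroup DATA1 rebalance power 5;

Diskgroup altered.

SQL> select OPERATION, STATE, POWER, EST_MINUTES from V$ASM_OPERATION;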

Testing is the key here. We really need to test the rebalance under the regular production load, with different values for the power. There is no point testing with no databases running, or on a system with different storage characteristics.

Tuesday, 7 January 2020

ORA-01031 insufficient privileges with Oracle APEX after a 12cR2 upgrade

Error: the following error appears after upgrading the database from Oracle 12c R1 to Oracle 12c R2:

ORA-01031: insufficient privileges

Solution

SQL> show parameter sql

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
optimizer_capture_sql_plan_baselines boolean     FALSE
optimizer_use_sql_plan_baselines     boolean     TRUE
plsql_ccflags                        string
plsql_code_type                      string      INTERPRETED
plsql_debug                          boolean     FALSE
plsql_optimize_level                 integer     2
plsql_v2_compatibility               boolean     FALSE
plsql_warnings                       string      DISABLE:ALL
sql92_security                       boolean     TRUE
sql_trace                            boolean     FALSE
sqltune_category                     string      DEFAULT
SQL>


SQL> alter system set sql92_security=FALSE scope=spfile;

System altered.

SQL>

If the SQL92_SECURITY initialization parameter is set to TRUE and the DELETE operation references table columns, such as the columns in a where_clause, then you must also have the SELECT object privilege on the object from which you want to delete rows.

https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/DELETE.html#GUID-156845A5-B626-412B-9F95-8869B988ABD7

So if you only grant DELETE, a plain delete with no WHERE clause will work. But the user also needs SELECT to use a WHERE clause (assuming SQL92_SECURITY is set to TRUE, which is the default in 12.2 and later).
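A minimal demonstration of that behavior, with a hypothetical user DEMO_USR and the classic SCOTT.EMP table standing in for the application objects:

SQL> create user demo_usr identified by demo_pwd;
SQL> grant create session to demo_usr;
SQL> grant delete on scott.emp to demo_usr;

SQL> connect demo_usr/demo_pwd
SQL> delete from scott.emp;
-- succeeds: the statement references no table columns

SQL> delete from scott.emp where empno = 7369;
-- fails with ORA-01031 while SQL92_SECURITY=TRUE, because the WHERE
-- clause references a column and DEMO_USR has no SELECT privilege

Granting SELECT on SCOTT.EMP to DEMO_USR (or setting SQL92_SECURITY=FALSE, as above) makes the second statement work.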


Monday, 6 January 2020

Oracle 12c SEC_CASE_SENSITIVE_LOGON and ORA-01017 in 12c and Later Releases


Error: Oracle 12c SEC_CASE_SENSITIVE_LOGON and ORA-01017

Cause:

This issue is caused by the default setting for the allowed logon version in the 12c database.
Note that the SQLNET.ALLOWED_LOGON_VERSION parameter has been deprecated in 12c.
That parameter has been replaced by these:
SQLNET.ALLOWED_LOGON_VERSION_SERVER=n
SQLNET.ALLOWED_LOGON_VERSION_CLIENT=n
Version 12.1: The default setting for the new parameters is 11. Any client that attempts to connect must
be at version 11 or higher, unless these parameters are explicitly set in the server-side sqlnet.ora file.
Version 12.2 note:
The default for the SQLNET.ALLOWED_LOGON_VERSION_SERVER setting has changed in 12.2 from 11 to 12.
See:  https://docs.oracle.com/database/122/DBSEG/configuring-authentication.htm#DBSEG33223
Important note for 12.2: if your client is not at least 11.2.0.3, or does not include the CPUOCT2012 patch, you will not
be able to use the 12 setting.
Typically, the sqlnet.ora file that would be referenced by the database is located in RDBMS_HOME/network/admin.

Solution

The init.ora/spfile parameter SEC_CASE_SENSITIVE_LOGON has been deprecated since Oracle Database 12.1.0.1.
This means it receives no further development; you shouldn't change it from its default of TRUE, and if you still do, you'll receive a warning during STARTUP of your database:
SQL> alter system set sec_case_sensitive_logon=false scope=spfile;
System altered.
SQL> startup force
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Recently a user asked me if we'd changed the behavior of this parameter in Oracle Database 12c Release 2, as he now receives an ORA-01017: invalid username or password error with SEC_CASE_SENSITIVE_LOGON=FALSE for every user except SYSDBA connections.
In which situations may you receive an ORA-01017?
This is outlined in the Oracle 12.1 documentation already:
•Ensure that the SEC_CASE_SENSITIVE_LOGON parameter is not set to FALSE if the SQLNET.ALLOWED_LOGON_VERSION_SERVER parameter is set to 12 or 12a.
This is because the more secure password versions used for this mode only support case-sensitive password checking.
For compatibility reasons, Oracle Database does not prevent the use of FALSE for SEC_CASE_SENSITIVE_LOGON when SQLNET.ALLOWED_LOGON_VERSION_SERVER is set to 12 or 12a.
However, setting SEC_CASE_SENSITIVE_LOGON to FALSE when SQLNET.ALLOWED_LOGON_VERSION_SERVER is set to 12 or 12a causes all accounts to become inaccessible (a quick way to check which accounts would be affected is shown below).
The key is the sqlnet.ora parameter SQLNET.ALLOWED_LOGON_VERSION_SERVER and here’s the difference between Oracle Database 12.1 and Oracle Database 12.2:
•Oracle Database 12.1: SQLNET.ALLOWED_LOGON_VERSION_SERVER defaults to 11 out of the box
•Oracle Database 12.2: SQLNET.ALLOWED_LOGON_VERSION_SERVER defaults to 12 out of the box
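Before setting SEC_CASE_SENSITIVE_LOGON=FALSE on 12.2, it is worth checking which accounts would be affected. Case-insensitive logons rely on the 10G password verifier, so a quick check is:

SQL> select USERNAME, PASSWORD_VERSIONS from DBA_USERS where PASSWORD_VERSIONS not like '%10G%';

Any account returned by this query will need its password reset before it can log on case-insensitively (as shown at the end of this post).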
Behavior difference Oracle 12.1 vs Oracle 12.2
See this simple example after switching SEC_CASE_SENSITIVE_LOGON=FALSE in both databases (as shown above):
•Oracle Database 12.1.0.2:
SQL> alter user system identified by oracle;
User altered.
SQL> connect system/oracle
Connected.
•Oracle Database 12.2.0.1:
SQL> alter user system identified by oracle;
User altered.
SQL> connect system/oracle
ERROR:
ORA-01017: invalid username/password; logon denied
Warning: You are no longer connected to ORACLE
How to work around the ORA-01017?
First of all you need to edit your sqlnet.ora, adding (or lowering) the parameter SQLNET.ALLOWED_LOGON_VERSION_SERVER to a value below 12.
But if you try to connect directly after restarting your listener, you will receive the same ORA-01017 again.
The secret is mentioned in the above documentation link as well: you will have to reset the user's password if you need the logon process to work as it did before Oracle Database 12.2.

-bash-4.4$ cd $ORACLE_HOME/network/admin/
-bash-4.4$ cat sqlnet.ora
# sqlnet.ora Network Configuration File: /oraclehome/app/oracle/product/12.2.0.1/db_1/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.

NAMES.DIRECTORY_PATH= (TNSNAMES, ONAMES, HOSTNAME)
SQLNET.ALLOWED_LOGON_VERSION_SERVER=11

-bash-4.4$



Check in DBA_USERS:

SQL> select username, password_versions
from DBA_USERS
where username='SYSTEM';

USERNAME      PASSWORD_VERSIONS
------------- ---------------------
SYSTEM        11G 12C

There's no "10G" mentioned. This will prevent the connection.
Solution: you will have to set the password again, i.e. ALTER the user(s):

SQL> alter user SYSTEM identified by said;
User altered.

SQL> select username, password_versions
from DBA_USERS
where username='SYSTEM';

USERNAME       PASSWORD_VERSIONS
-------------- ----------------------
SYSTEM         10G 11G 12C
SQL> connect system/said

Connected.