RECURSIVE SUBROUTINE PDLAQR1( WANTT, WANTZ, N, ILO, IHI, A,
$ DESCA, WR, WI, ILOZ, IHIZ, Z,
$ DESCZ, WORK, LWORK, IWORK,
$ ILWORK, INFO )
*
* Contribution from the Department of Computing Science and HPC2N,
* Umea University, Sweden
*
* -- ScaLAPACK auxiliary routine (version 2.0.1) --
* University of Tennessee, Knoxville, Oak Ridge National Laboratory,
* Univ. of Colorado Denver and University of California, Berkeley.
* January, 2012
*
IMPLICIT NONE
*
* .. Scalar Arguments ..
LOGICAL WANTT, WANTZ
INTEGER IHI, IHIZ, ILO, ILOZ, ILWORK, INFO, LWORK, N
* ..
* .. Array Arguments ..
INTEGER DESCA( * ), DESCZ( * ), IWORK( * )
DOUBLE PRECISION A( * ), WI( * ), WORK( * ), WR( * ), Z( * )
* ..
*
* Purpose
* =======
*
* PDLAQR1 is an auxiliary routine used to find the Schur decomposition
* and or eigenvalues of a matrix already in Hessenberg form from
* cols ILO to IHI.
*
* This is a modified version of PDLAHQR from ScaLAPACK version 1.7.3.
* The following modifications were made:
* o Recently removed workspace query functionality was added.
* o Aggressive early deflation is implemented.
* o Aggressive deflation (looking for two consecutive small
* subdiagonal elements by PDLACONSB) is abandoned.
* o The returned Schur form is now in canonical form, i.e., the
* returned 2-by-2 blocks really correspond to complex conjugate
* pairs of eigenvalues.
* o For some reason, the original version of PDLAHQR sometimes did
* not read out the converged eigenvalues correclty. This is now
* fixed.
*
* Notes
* =====
*
* Each global data object is described by an associated description
* vector. This vector stores the information required to establish
* the mapping between an object element and its corresponding process
* and memory location.
*
* Let A be a generic term for any 2D block cyclicly distributed array.
* Such a global array has an associated description vector DESCA.
* In the following comments, the character _ should be read as
* "of the global array".
*
* NOTATION STORED IN EXPLANATION
* --------------- -------------- --------------------------------------
* DTYPE_A(global) DESCA( DTYPE_ )The descriptor type. In this case,
* DTYPE_A = 1.
* CTXT_A (global) DESCA( CTXT_ ) The BLACS context handle, indicating
* the BLACS process grid A is distribu-
* ted over. The context itself is glo-
* bal, but the handle (the integer
* value) may vary.
* M_A (global) DESCA( M_ ) The number of rows in the global
* array A.
* N_A (global) DESCA( N_ ) The number of columns in the global
* array A.
* MB_A (global) DESCA( MB_ ) The blocking factor used to distribute
* the rows of the array.
* NB_A (global) DESCA( NB_ ) The blocking factor used to distribute
* the columns of the array.
* RSRC_A (global) DESCA( RSRC_ ) The process row over which the first
* row of the array A is distributed.
* CSRC_A (global) DESCA( CSRC_ ) The process column over which the
* first column of the array A is
* distributed.
* LLD_A (local) DESCA( LLD_ ) The leading dimension of the local
* array. LLD_A >= MAX(1,LOCr(M_A)).
*
* Let K be the number of rows or columns of a distributed matrix,
* and assume that its process grid has dimension p x q.
* LOCr( K ) denotes the number of elements of K that a process
* would receive if K were distributed over the p processes of its
* process column.
* Similarly, LOCc( K ) denotes the number of elements of K that a
* process would receive if K were distributed over the q processes of
* its process row.
* The values of LOCr() and LOCc() may be determined via a call to the
* ScaLAPACK tool function, NUMROC:
* LOCr( M ) = NUMROC( M, MB_A, MYROW, RSRC_A, NPROW ),
* LOCc( N ) = NUMROC( N, NB_A, MYCOL, CSRC_A, NPCOL ).
* An upper bound for these quantities may be computed by:
* LOCr( M ) <= ceil( ceil(M/MB_A)/NPROW )*MB_A
* LOCc( N ) <= ceil( ceil(N/NB_A)/NPCOL )*NB_A
*
* Arguments
* =========
*
* WANTT (global input) LOGICAL
* = .TRUE. : the full Schur form T is required;
* = .FALSE.: only eigenvalues are required.
*
* WANTZ (global input) LOGICAL
* = .TRUE. : the matrix of Schur vectors Z is required;
* = .FALSE.: Schur vectors are not required.
*
* N (global input) INTEGER
* The order of the Hessenberg matrix A (and Z if WANTZ).
* N >= 0.
*
* ILO (global input) INTEGER
* IHI (global input) INTEGER
* It is assumed that A is already upper quasi-triangular in
* rows and columns IHI+1:N, and that A(ILO,ILO-1) = 0 (unless
* ILO = 1). PDLAQR1 works primarily with the Hessenberg
* submatrix in rows and columns ILO to IHI, but applies
* transformations to all of H if WANTT is .TRUE..
* 1 <= ILO <= max(1,IHI); IHI <= N.
*
* A (global input/output) DOUBLE PRECISION array, dimension
* (DESCA(LLD_),*)
* On entry, the upper Hessenberg matrix A.
* On exit, if WANTT is .TRUE., A is upper quasi-triangular in
* rows and columns ILO:IHI, with any 2-by-2 or larger diagonal
* blocks not yet in standard form. If WANTT is .FALSE., the
* contents of A are unspecified on exit.
*
* DESCA (global and local input) INTEGER array of dimension DLEN_.
* The array descriptor for the distributed matrix A.
*
* WR (global replicated output) DOUBLE PRECISION array,
* dimension (N)
* WI (global replicated output) DOUBLE PRECISION array,
* dimension (N)
* The real and imaginary parts, respectively, of the computed
* eigenvalues ILO to IHI are stored in the corresponding
* elements of WR and WI. If two eigenvalues are computed as a
* complex conjugate pair, they are stored in consecutive
* elements of WR and WI, say the i-th and (i+1)th, with
* WI(i) > 0 and WI(i+1) < 0. If WANTT is .TRUE., the
* eigenvalues are stored in the same order as on the diagonal
* of the Schur form returned in A. A may be returned with
* larger diagonal blocks until the next release.
*
* ILOZ (global input) INTEGER
* IHIZ (global input) INTEGER
* Specify the rows of Z to which transformations must be
* applied if WANTZ is .TRUE..
* 1 <= ILOZ <= ILO; IHI <= IHIZ <= N.
*
* Z (global input/output) DOUBLE PRECISION array.
* If WANTZ is .TRUE., on entry Z must contain the current
* matrix Z of transformations accumulated by PDHSEQR, and on
* exit Z has been updated; transformations are applied only to
* the submatrix Z(ILOZ:IHIZ,ILO:IHI).
* If WANTZ is .FALSE., Z is not referenced.
*
* DESCZ (global and local input) INTEGER array of dimension DLEN_.
* The array descriptor for the distributed matrix Z.
*
* WORK (local output) DOUBLE PRECISION array of size LWORK
*
* LWORK (local input) INTEGER
* WORK(LWORK) is a local array and LWORK is assumed big enough
* so that LWORK >= 6*N + 6*385*385 +
* MAX( 2*MAX(DESCZ(LLD_),DESCA(LLD_)) + 2*LOCc(N),
* 7*Ceil(N/HBL)/LCM(NPROW,NPCOL)) )
*
* IWORK (global and local input) INTEGER array of size ILWORK
*
* ILWORK (local input) INTEGER
* This holds the some of the IBLK integer arrays. This is held
* as a place holder for the next release.
*
* INFO (global output) INTEGER
* < 0: parameter number -INFO incorrect or inconsistent
* = 0: successful exit
* > 0: PDLAQR1 failed to compute all the eigenvalues ILO to IHI
* in a total of 30*(IHI-ILO+1) iterations; if INFO = i,
* elements i+1:ihi of WR and WI contain those eigenvalues
* which have been successfully computed.
*
* Logic:
* This algorithm is very similar to _LAHQR. Unlike _LAHQR,
* instead of sending one double shift through the largest
* unreduced submatrix, this algorithm sends multiple double shifts
* and spaces them apart so that there can be parallelism across
* several processor row/columns. Another critical difference is
* that this algorithm aggregrates multiple transforms together in
* order to apply them in a block fashion.
*
* Important Local Variables:
* IBLK = The maximum number of bulges that can be computed.
* Currently fixed. Future releases this won't be fixed.
* HBL = The square block size (HBL=DESCA(MB_)=DESCA(NB_))
* ROTN = The number of transforms to block together
* NBULGE = The number of bulges that will be attempted on the
* current submatrix.
* IBULGE = The current number of bulges started.
* K1(*),K2(*) = The current bulge loops from K1(*) to K2(*).
*
* Subroutines:
* This routine calls:
* PDLAWIL -> Given the shift, get the transformation
* DLASORTE -> Pair up eigenvalues so that reals are paired.
* PDLACP3 -> Parallel array to local replicated array copy &
* back.
* DLAREF -> Row/column reflector applier. Core routine here.
* PDLASMSUB -> Finds negligible subdiagonal elements.
*
* Current Notes and/or Restrictions:
* 1.) This code requires the distributed block size to be square
* and at least six (6); unlike simpler codes like LU, this
* algorithm is extremely sensitive to block size. Unwise
* choices of too small a block size can lead to bad
* performance.
* 2.) This code requires A and Z to be distributed identically
* and have identical contxts.
* 3.) This release currently does not have a routine for
* resolving the Schur blocks into regular 2x2 form after
* this code is completed. Because of this, a significant
* performance impact is required while the deflation is done
* by sometimes a single column of processors.
* 4.) This code does not currently block the initial transforms
* so that none of the rows or columns for any bulge are
* completed until all are started. To offset pipeline
* start-up it is recommended that at least 2*LCM(NPROW,NPCOL)
* bulges are used (if possible)
* 5.) The maximum number of bulges currently supported is fixed at
* 32. In future versions this will be limited only by the
* incoming WORK array.
* 6.) The matrix A must be in upper Hessenberg form. If elements
* below the subdiagonal are nonzero, the resulting transforms
* may be nonsimilar. This is also true with the LAPACK
* routine.
* 7.) For this release, it is assumed RSRC_=CSRC_=0
* 8.) Currently, all the eigenvalues are distributed to all the
* nodes. Future releases will probably distribute the
* eigenvalues by the column partitioning.
* 9.) The internals of this routine are subject to change.
*
* Implemented by: G. Henry, November 17, 1996
*
* Modified by Robert Granat and Meiyue Shao, Department of Computing
* Science and HPC2N, Umea University, Sweden
*
* =====================================================================
*
* .. Parameters ..
INTEGER BLOCK_CYCLIC_2D, CSRC_, CTXT_, DLEN_, DTYPE_,
$ LLD_, MB_, M_, NB_, N_, RSRC_
PARAMETER ( BLOCK_CYCLIC_2D = 1, DLEN_ = 9, DTYPE_ = 1,
$ CTXT_ = 2, M_ = 3, N_ = 4, MB_ = 5, NB_ = 6,
$ RSRC_ = 7, CSRC_ = 8, LLD_ = 9 )
DOUBLE PRECISION ZERO, ONE, HALF
PARAMETER ( ZERO = 0.0D+0, ONE = 1.0D+0, HALF = 0.5D+0 )
DOUBLE PRECISION CONST
PARAMETER ( CONST = 1.50D+0 )
INTEGER IBLK, LDS
PARAMETER ( IBLK = 32, LDS = 12*IBLK+1 )
* ..
* .. Local Scalars ..
INTEGER CONTXT, DOWN, HBL, I, I1, I2, IAFIRST, IBULGE,
$ ICBUF, ICOL, ICOL1, ICOL2, IERR, II,
$ IRBUF, IROW, IROW1, IROW2, ISPEC, ISTART,
$ ISTARTCOL, ISTARTROW, ISTOP, ISUB,
$ ITERMAX, ITMP1, ITMP2, ITN, ITS, J, JAFIRST,
$ JBLK, JJ, K, KI, L, LCMRC, LDA, LDZ, LEFT,
$ LIHIH, LIHIZ, LILOH, LILOZ, LOCALI1, LOCALI2,
$ LOCALK, LOCALM, M, MODKM1, MYCOL, MYROW,
$ NBULGE, NH, NODE, NPCOL, NPROW, NR, NUM, NZ,
$ RIGHT, ROTN, UP, VECSIDX, TOTIT, TOTNS, TOTSW,
$ DBLK, NIBBLE, ND, NS, LTOP, LWKOPT, S1, S2, S3
DOUBLE PRECISION AVE, DISC, H00, H10, H11, H12, H21, H22, H33,
$ H43H34, H44, OVFL, S, SMLNUM, SUM, T1, T1COPY,
$ T2, T3, ULP, UNFL, V1SAVE, V2, V2SAVE, V3,
$ V3SAVE, SN, CS, SWAP
LOGICAL AED
* ..
* .. Local Arrays ..
INTEGER ICURCOL( IBLK ), ICURROW( IBLK ), K1( IBLK ),
$ K2( IBLK ), KCOL( IBLK ), KP2COL( IBLK ),
$ KP2ROW( IBLK ), KROW( IBLK ), LOCALK2( IBLK )
DOUBLE PRECISION SMALLA( 6, 6, IBLK ), VCOPY( 3 )
* ..
* .. External Functions ..
INTEGER ILCM, NUMROC, ILAENV
DOUBLE PRECISION PDLAMCH
EXTERNAL ILCM, NUMROC, ILAENV, PDLAMCH
* ..
* .. External Subroutines ..
EXTERNAL BLACS_GRIDINFO, DCOPY, DGEBR2D, DGEBS2D,
$ DGERV2D, DGESD2D, DGSUM2D, DLAHQR, DLAREF,
$ DLARFG, DLASORTE, IGAMN2D, INFOG1L, INFOG2L,
$ PDLABAD, PDLACP3, PDLASMSUB,
$ PDLAWIL, PXERBLA, DLANV2, PDLAQR2, PDLAQR4
* ..
* .. Intrinsic Functions ..
INTRINSIC ABS, DBLE, MAX, MIN, MOD, SIGN, SQRT
* ..
* .. Executable Statements ..
*
INFO = 0
*
ITERMAX = 30*( IHI-ILO+1 )
IF( N.EQ.0 )
$ RETURN
*
* NODE (IAFIRST,JAFIRST) OWNS A(1,1)
*
HBL = DESCA( MB_ )
CONTXT = DESCA( CTXT_ )
LDA = DESCA( LLD_ )
IAFIRST = DESCA( RSRC_ )
JAFIRST = DESCA( CSRC_ )
LDZ = DESCZ( LLD_ )
CALL BLACS_GRIDINFO( CONTXT, NPROW, NPCOL, MYROW, MYCOL )
NODE = MYROW*NPCOL + MYCOL
NUM = NPROW*NPCOL
LEFT = MOD( MYCOL+NPCOL-1, NPCOL )
RIGHT = MOD( MYCOL+1, NPCOL )
UP = MOD( MYROW+NPROW-1, NPROW )
DOWN = MOD( MYROW+1, NPROW )
LCMRC = ILCM( NPROW, NPCOL )
TOTIT = 0
TOTNS = 0
TOTSW = 0
*
* Determine the number of columns we have so we can check workspace
*
LOCALK = NUMROC( N, HBL, MYCOL, JAFIRST, NPCOL )
JJ = N / HBL
IF( JJ*HBL.LT.N )
$ JJ = JJ + 1
JJ = 7*JJ / LCMRC
LWKOPT = INT( 6*N+MAX( 3*MAX( LDA, LDZ )+2*LOCALK, JJ )
$ +6*LDS*LDS )
IF( LWORK.EQ.-1 .OR. ILWORK.EQ.-1 ) THEN
WORK( 1 ) = DBLE( LWKOPT )
IWORK( 1 ) = 3
RETURN
ELSEIF( LWORK.LT.LWKOPT ) THEN
INFO = -15
END IF
IF( DESCZ( CTXT_ ).NE.DESCA( CTXT_ ) ) THEN
INFO = -( 1300+CTXT_ )
END IF
IF( DESCA( MB_ ).NE.DESCA( NB_ ) ) THEN
INFO = -( 700+NB_ )
END IF
IF( DESCZ( MB_ ).NE.DESCZ( NB_ ) ) THEN
INFO = -( 1300+NB_ )
END IF
IF( DESCA( MB_ ).NE.DESCZ( MB_ ) ) THEN
INFO = -( 1300+MB_ )
END IF
IF( ( ILO.GT.N ) .OR. ( ILO.LT.1 ) ) THEN
INFO = -4
END IF
IF( ( IHI.GT.N ) .OR. ( IHI.LT.1 ) ) THEN
INFO = -5
END IF
IF( HBL.LT.5 ) THEN
INFO = -( 700+MB_ )
END IF
CALL IGAMN2D( CONTXT, 'ALL', ' ', 1, 1, INFO, 1, ITMP1, ITMP2, -1,
$ -1, -1 )
IF( INFO.LT.0 ) THEN
CALL PXERBLA( CONTXT, 'PDLAQR1', -INFO )
WORK( 1 ) = DBLE( LWKOPT )
RETURN
END IF
*
* Set work array indices
*
S1 = 0
S2 = S1+LDS*LDS
S3 = S2+LDS*LDS
VECSIDX = S3+4*LDS*LDS
ISUB = VECSIDX+3*N
IRBUF = ISUB+N
ICBUF = IRBUF+N
*
* Find a value for ROTN
*
ROTN = HBL / 3
ROTN = MAX( ROTN, HBL-2 )
ROTN = MIN( ROTN, 1 )
*
IF( ILO.EQ.IHI ) THEN
CALL INFOG2L( ILO, ILO, DESCA, NPROW, NPCOL, MYROW, MYCOL,
$ IROW, ICOL, II, JJ )
IF( ( MYROW.EQ.II ) .AND. ( MYCOL.EQ.JJ ) ) THEN
WR( ILO ) = A( ( ICOL-1 )*LDA+IROW )
ELSE
WR( ILO ) = ZERO
END IF
WI( ILO ) = ZERO
WORK( 1 ) = DBLE( LWKOPT )
RETURN
END IF
*
NH = IHI - ILO + 1
NZ = IHIZ - ILOZ + 1
*
* If the diagonal block is small enough, copy it to local memory and
* call DLAHQR directly.
*
IF( NH .LE. LDS ) THEN
CALL PDLAQR4( WANTT, WANTZ, N, ILO, IHI, A, DESCA, WR, WI,
$ ILOZ, IHIZ, Z, DESCZ, WORK( S1+1 ), NH,
$ WORK( S2+1 ), NH, WORK( S3+1 ), 4*LDS*LDS,
$ INFO )
WORK( 1 ) = DBLE( LWKOPT )
RETURN
END IF
*
CALL INFOG1L( ILOZ, HBL, NPROW, MYROW, DESCZ(RSRC_), LILOZ, LIHIZ)
LIHIZ = NUMROC( IHIZ, HBL, MYROW, DESCZ(RSRC_), NPROW )
*
* Set machine-dependent constants for the stopping criterion.
* If NORM(H) <= SQRT(OVFL), overflow should not occur.
*
UNFL = PDLAMCH( CONTXT, 'SAFE MINIMUM' )
OVFL = ONE / UNFL
CALL PDLABAD( CONTXT, UNFL, OVFL )
ULP = PDLAMCH( CONTXT, 'PRECISION' )
SMLNUM = UNFL*( NH / ULP )
*
* I1 and I2 are the indices of the first row and last column of H
* to which transformations must be applied. If eigenvalues only are
* being computed, I1 and I2 are set inside the main loop.
*
IF( WANTT ) THEN
I1 = 1
I2 = N
END IF
*
* ITN is the total number of QR iterations allowed.
*
ITN = ITERMAX
*
* The main loop begins here. I is the loop index and decreases from
* IHI to ILO in steps of our schur block size (<=2*IBLK). Each
* iteration of the loop works with the active submatrix in rows
* and columns L to I. Eigenvalues I+1 to IHI have already
* converged. Either L = ILO or the global A(L,L-1) is negligible
* so that the matrix splits.
*
I = IHI
10 CONTINUE
L = ILO
IF( I.LT.ILO )
$ GO TO 450
*
* Perform QR iterations on rows and columns ILO to I until a
* submatrix of order 1 or 2 splits off at the bottom because a
* subdiagonal element has become negligible.
*
DO 420 ITS = 0, ITN
TOTIT = TOTIT + 1
*
* Look for a single small subdiagonal element.
*
CALL PDLASMSUB( A, DESCA, I, L, K, SMLNUM, WORK( IRBUF+1 ),
$ LWORK-IRBUF )
L = K
*
IF( L.GT.ILO ) THEN
*
* H(L,L-1) is negligible
*
CALL INFOG2L( L, L-1, DESCA, NPROW, NPCOL, MYROW, MYCOL,
$ IROW, ICOL, ITMP1, ITMP2 )
IF( ( MYROW.EQ.ITMP1 ) .AND. ( MYCOL.EQ.ITMP2 ) ) THEN
A( ( ICOL-1 )*LDA+IROW ) = ZERO
END IF
WORK( ISUB+L-1 ) = ZERO
END IF
*
* Exit from loop if a small submatrix has split off.
*
M = L - 10
IF ( L .GT. I - LDS )
$ GO TO 430
*
* Now the active submatrix is in rows and columns L to I. If
* eigenvalues only are being computed, only the active submatrix
* need be transformed.
*
IF( .NOT.WANTT ) THEN
I1 = L
I2 = I
END IF
*
* Copy submatrix of size 2*JBLK and prepare to do generalized
* Wilkinson shift or an exceptional shift
*
NH = I-L+1
AED = .TRUE.
JBLK = MIN( IBLK, ( NH / 2 )-1 )
IF( JBLK.GT.LCMRC ) THEN
*
* Make sure it's divisible by LCM (we want even workloads!)
*
JBLK = JBLK - MOD( JBLK, LCMRC )
END IF
JBLK = MIN( JBLK, 2*LCMRC )
JBLK = MAX( JBLK, 1 )
*
IF( ITS.EQ.20 .OR. ITS.EQ.40 ) THEN
*
* Exceptional shift.
*
CALL PDLACP3( 2*JBLK, I-2*JBLK+1, A, DESCA, WORK( S1+1 ),
$ LDS, -1, -1, 0 )
DO 20 II = 2*JBLK, 2, -1
WORK( S1+II+(II-1)*LDS ) = CONST*(
$ ABS( WORK( S1+II+(II-1)*LDS ) )+
$ ABS( WORK( S1+II+(II-2)*LDS ) ) )
WORK( S1+II+(II-2)*LDS ) = ZERO
WORK( S1+II-1+(II-1)*LDS ) = ZERO
20 CONTINUE
WORK( S1+1 ) = CONST*ABS( WORK( S1+1 ) )
ELSE
*
* Aggressive early deflation.
*
IF( AED ) THEN
DBLK = ILAENV( 13, 'DLAQR0', 'SV', N, L, I, 4*LDS*LDS )
DBLK = MAX( 2*JBLK, DBLK ) + 1
DBLK = MIN( NH, LDS, DBLK )
CALL PDLAQR2( WANTT, WANTZ, N, L, I, DBLK, A, DESCA,
$ ILOZ, IHIZ, Z, DESCZ, NS, ND, WR, WI,
$ WORK( S1+1 ), LDS, WORK( S2+1 ), DBLK,
$ WORK( IRBUF+1 ), WORK( ICBUF+1 ),
$ WORK( S3+1 ), 4*LDS*LDS )
*
* Skip a QR sweep if enough eigenvalues are deflated.
*
NIBBLE = ILAENV( 14, 'DLAQR0', 'SV', N, L, I, 4*LDS*LDS )
NIBBLE = MAX( 0, NIBBLE )
I = I - ND
DBLK = DBLK - ND
IF( 100*ND .GT. NIBBLE*NH .OR. DBLK .LT. 2*JBLK ) GOTO 10
*
* Use unconverged eigenvalues as shifts for the QR sweep.
* (This option is turned off because of the quality of
* these shifts are not so good.)
*
* IF( ND.GE.0 .AND. ND+DBLK.GE.64 ) THEN
IF( .FALSE. ) THEN
CALL DLASET( 'L', DBLK-1, DBLK-1, ZERO, ZERO,
$ WORK( S1+2 ), LDS )
WORK( IRBUF+1 ) = WORK( S1+1 )
WORK( ICBUF+1 ) = ZERO
*
* Shuffle shifts into pairs of real shifts and pairs of
* complex conjugate shifts assuming complex conjugate
* shifts are already adjacent to one another.
*
DO 21 II = DBLK, 3, -2
IF( WORK( ICBUF+II ).NE.-WORK( ICBUF+II-1 ) ) THEN
SWAP = WORK( IRBUF+II )
WORK( IRBUF+II ) = WORK( IRBUF+II-1 )
WORK( IRBUF+II-1 ) = WORK( IRBUF+II-2 )
WORK( IRBUF+II-2 ) = SWAP
SWAP = WORK( ICBUF+II )
WORK( ICBUF+II ) = WORK( ICBUF+II-1 )
WORK( ICBUF+II-1 ) = WORK( ICBUF+II-2 )
WORK( ICBUF+II-2 ) = SWAP
END IF
21 CONTINUE
*
* Copy undeflatable eigenvalues to the diagonal of S1.
*
II = 2
22 CONTINUE
IF( WORK( ICBUF+II ) .EQ. ZERO ) THEN
WORK( S1+II+(II-1)*LDS ) = WORK( IRBUF+II )
WORK( S1+II+(II-2)*LDS ) = ZERO
II = II + 1
ELSE
WORK( S1+II+(II-1)*LDS ) = WORK( IRBUF+II )
WORK( S1+II+1+II*LDS ) = WORK( IRBUF+II )
WORK( S1+II+1+(II-1)*LDS ) = WORK( ICBUF+II )
WORK( S1+II+II*LDS ) = -WORK( ICBUF+II )
II = II + 2
END IF
IF( II .LE. DBLK ) GOTO 22
ELSE
CALL DLAHQR( .FALSE., .FALSE., DBLK, 1, DBLK,
$ WORK( S1+1 ), LDS, WORK( IRBUF+1 ),
$ WORK( ICBUF+1 ), 1, DBLK, Z, LDZ, IERR )
END IF
ELSE
DBLK = 2*JBLK
CALL PDLACP3( DBLK, I-DBLK+1, A, DESCA, WORK( S1+1 ),
$ LDS, -1, -1, 0 )
CALL DLAHQR( .FALSE., .FALSE., DBLK, 1, DBLK,
$ WORK( S1+1 ), LDS, WORK( IRBUF+1 ),
$ WORK( ICBUF+1 ), 1, DBLK, Z, LDZ, IERR )
END IF
TOTSW = TOTSW + 1
*
* Prepare to use Wilkinson's double shift
*
H44 = WORK( S1+DBLK+(DBLK-1)*LDS )
H33 = WORK( S1+DBLK-1+(DBLK-2)*LDS )
H43H34 = WORK( S1+DBLK-1+(DBLK-1)*LDS )*
$ WORK( S1+DBLK+(DBLK-2)*LDS )
IF( ( JBLK.GT.1 ) .AND. ( ITS.GT.30 ) ) THEN
S = WORK( S1+DBLK-1+(DBLK-3)*LDS )
DISC = ( H33-H44 )*HALF
DISC = DISC*DISC + H43H34
IF( DISC.GT.ZERO ) THEN
*
* Real roots: Use Wilkinson's shift twice
*
DISC = SQRT( DISC )
AVE = HALF*( H33+H44 )
IF( ABS( H33 )-ABS( H44 ).GT.ZERO ) THEN
H33 = H33*H44 - H43H34
H44 = H33 / ( SIGN( DISC, AVE )+AVE )
ELSE
H44 = SIGN( DISC, AVE ) + AVE
END IF
H33 = H44
H43H34 = ZERO
END IF
END IF
END IF
*
* Look for two consecutive small subdiagonal elements:
* PDLACONSB is the routine that does this.
*
* CALL PDLACONSB( A, DESCA, I, L, M, H44, H33, H43H34,
* $ WORK( IRBUF+1 ), LWORK-IRBUF )
*
* Skip small submatrices
*
* IF ( M .GE. I - 5 )
* $ GO TO 80
*
* In principle PDLACONSB needs to check all shifts to decide
* whether two consecutive small subdiagonal entries are suitable
* as the starting position of the bulge chasing phase. It can be
* dangerous to check the first pair of shifts only. Moreover it
* is quite rare to obtain an M which is much larger than L. This
* process is a bit expensive compared with the benefit.
* Therefore it is sensible to abandon this routine. Total amount
* of communications is saved in average.
*
M = L
*
* Double-shift QR step
*
* NBULGE is the number of bulges that will be attempted
*
ISTOP = MIN( M+ROTN-MOD( M, ROTN ), I-2 )
ISTOP = MIN( ISTOP, M+HBL-3-MOD( M-1, HBL ) )
ISTOP = MIN( ISTOP, I2-2 )
ISTOP = MAX( ISTOP, M )
NBULGE = ( I-1-ISTOP ) / HBL
*
* Do not exceed maximum determined.
*
NBULGE = MIN( NBULGE, JBLK )
IF( NBULGE.GT.LCMRC ) THEN
*
* Make sure it's divisible by LCM (we want even workloads!)
*
NBULGE = NBULGE - MOD( NBULGE, LCMRC )
END IF
NBULGE = MAX( NBULGE, 1 )
*
TOTNS = TOTNS + NBULGE*2
*
IF( ( ITS.NE.20 ) .AND. ( ITS.NE.40 ) .AND. ( NBULGE.GT.1 ) )
$ THEN
*
* sort the eigenpairs so that they are in twos for double
* shifts. only call if several need sorting
*
* CALL DLASORTE( S1( 2*( JBLK-NBULGE )+1,
* $ 2*( JBLK-NBULGE )+1 ), 3*IBLK, 2*NBULGE,
* $ WORK( IRBUF+1 ), IERR )
CALL DLASORTE( WORK(S1+DBLK-2*NBULGE+1+(DBLK-2*NBULGE)*LDS),
$ LDS, 2*NBULGE, WORK( IRBUF+1 ), IERR )
END IF
*
* IBULGE is the number of bulges going so far
*
IBULGE = 1
*
* "A" row defs : main row transforms from LOCALK to LOCALI2
*
CALL INFOG1L( M, HBL, NPCOL, MYCOL, DESCA(CSRC_),ITMP1,LOCALK )
LOCALK = NUMROC( N, HBL, MYCOL, DESCA(CSRC_), NPCOL )
CALL INFOG1L( 1, HBL, NPCOL, MYCOL,DESCA(CSRC_),ICOL1,LOCALI2 )
LOCALI2 = NUMROC( I2, HBL, MYCOL, DESCA(CSRC_), NPCOL )
*
* "A" col defs : main col transforms from LOCALI1 to LOCALM
*
CALL INFOG1L( I1, HBL, NPROW,MYROW,DESCA(RSRC_),LOCALI1,ICOL1 )
ICOL1 = NUMROC( N, HBL, MYROW, DESCA(RSRC_), NPROW )
CALL INFOG1L( 1, HBL, NPROW, MYROW, DESCA(RSRC_),LOCALM,ICOL1 )
ICOL1 = NUMROC( MIN( M+3, I ), HBL, MYROW, DESCA(RSRC_),NPROW )
*
* Which row & column will start the bulges
*
ISTARTROW = MOD( ( M+1 ) / HBL + IAFIRST, NPROW )
ISTARTCOL = MOD( ( M+1 ) / HBL + JAFIRST, NPCOL )
*
CALL INFOG1L( M, HBL, NPROW, MYROW, DESCA(RSRC_), II, ITMP2 )
ITMP2 = NUMROC( N, HBL, MYROW, DESCA(RSRC_), NPROW )
CALL INFOG1L( M, HBL, NPCOL, MYCOL, DESCA(CSRC_), JJ, ITMP2 )
ITMP2 = NUMROC( N, HBL, MYCOL, DESCA(CSRC_), NPCOL )
CALL INFOG1L(1,HBL,NPROW,MYROW,DESCA(RSRC_),ISTOP,KP2ROW( 1 ) )
KP2ROW( 1 ) = NUMROC( M+2, HBL, MYROW, DESCA(RSRC_), NPROW )
CALL INFOG1L(1,HBL,NPCOL,MYCOL,DESCA(CSRC_),ISTOP,KP2COL( 1 ) )
KP2COL( 1 ) = NUMROC( M+2, HBL, MYCOL, DESCA(CSRC_), NPCOL )
*
* Set all values for bulges. All bulges are stored in
* intermediate steps as loops over KI. Their current "task"
* over the global M to I-1 values is always K1(KI) to K2(KI).
* However, because there are many bulges, K1(KI) & K2(KI) might
* go past that range while later bulges (KI+1,KI+2,etc..) are
* finishing up.
*
* Rules:
* If MOD(K1(KI)-1,HBL) < HBL-2 then MOD(K2(KI)-1,HBL)