Sub-tile decoding: speed up vertical pass in IDWT5x3 by processing 4 cols at a time
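A minimal sketch of the idea, not the actual patch. It assumes three things: the reversible 5/3 lifting steps of JPEG 2000 Part 1, whole-sample symmetric extension at the edges, and a row-major buffer in which the vertical low-pass rows come before the high-pass rows. The names `NB_COLS`, `idwt53_v_group`, `idwt53_v`, `fdwt53_v_col` and `roundtrip_ok` are hypothetical. The sketch copies four adjacent columns into a small interleaved buffer, so each lifting step reads and writes 4 contiguous ints per row. A one-column pass instead touches a single int per stride-separated row. Better cache-line use, plus a fixed 4-wide inner loop the compiler can vectorize, is the likely reason for the speedup, but that is an inference rather than something the change states.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NB_COLS 4 /* columns processed together (hypothetical constant) */

/* floor(a / b) for b > 0; C's '/' truncates toward zero, so adjust negatives. */
static int floor_div(int a, int b) {
    int q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* Whole-sample symmetric extension on [0, h): x[-1] = x[1], x[h] = x[h-2]. */
static int mirror(int i, int h) {
    if (i < 0) return -i;
    if (i >= h) return 2 * (h - 1) - i;
    return i;
}

/* Inverse 5/3 vertical pass on columns [x0, x0 + ncols). Assumed input layout:
 * rows [0, sn) = low-pass, rows [sn, h) = high-pass. tmp holds h * NB_COLS ints. */
static void idwt53_v_group(int *img, int stride, int h, int x0, int ncols, int *tmp) {
    assert(ncols >= 1 && ncols <= NB_COLS);
    if (h < 2) return; /* a single low-pass sample is left unchanged */
    int sn = (h + 1) / 2;
    /* Interleave: low sample n -> row 2n, high sample n -> row 2n + 1. */
    for (int r = 0; r < h; r++) {
        int dst = r < sn ? 2 * r : 2 * (r - sn) + 1;
        for (int c = 0; c < ncols; c++)
            tmp[dst * NB_COLS + c] = img[r * stride + x0 + c];
    }
    /* Step 1: undo the update of even rows using the neighbouring odd rows. */
    for (int i = 0; i < h; i += 2) {
        const int *a = tmp + mirror(i - 1, h) * NB_COLS, *b = tmp + mirror(i + 1, h) * NB_COLS;
        for (int c = 0; c < ncols; c++)
            tmp[i * NB_COLS + c] -= floor_div(a[c] + b[c] + 2, 4);
    }
    /* Step 2: undo the prediction of odd rows using the reconstructed even rows. */
    for (int i = 1; i < h; i += 2) {
        const int *a = tmp + (i - 1) * NB_COLS, *b = tmp + mirror(i + 1, h) * NB_COLS;
        for (int c = 0; c < ncols; c++)
            tmp[i * NB_COLS + c] += floor_div(a[c] + b[c], 2);
    }
    for (int r = 0; r < h; r++)
        for (int c = 0; c < ncols; c++)
            img[r * stride + x0 + c] = tmp[r * NB_COLS + c];
}

/* Walk the image in groups of NB_COLS columns; the last group may be narrower. */
static int idwt53_v(int *img, int w, int h, int stride) {
    int *tmp = malloc((size_t)h * NB_COLS * sizeof *tmp);
    if (!tmp) return -1;
    for (int x0 = 0; x0 < w; x0 += NB_COLS)
        idwt53_v_group(img, stride, h, x0, w - x0 < NB_COLS ? w - x0 : NB_COLS, tmp);
    free(tmp);
    return 0;
}

/* Reference forward 5/3 on one column, producing the layout the inverse expects. */
static void fdwt53_v_col(int *img, int stride, int h, int x, int *col) {
    if (h < 2) return;
    for (int r = 0; r < h; r++) col[r] = img[r * stride + x];
    for (int i = 1; i < h; i += 2) /* predict odd samples */
        col[i] -= floor_div(col[i - 1] + col[mirror(i + 1, h)], 2);
    for (int i = 0; i < h; i += 2) /* update even samples */
        col[i] += floor_div(col[mirror(i - 1, h)] + col[mirror(i + 1, h)] + 2, 4);
    int sn = (h + 1) / 2;
    for (int r = 0; r < h; r++)
        img[(r % 2 == 0 ? r / 2 : sn + r / 2) * stride + x] = col[r];
}

/* Forward one column at a time, inverse four at a time; 1 if lossless. */
static int roundtrip_ok(int w, int h) {
    size_t n = (size_t)w * h;
    int *img = malloc(n * sizeof *img), *orig = malloc(n * sizeof *orig);
    int *col = malloc((size_t)h * sizeof *col);
    int ok = 0;
    if (img && orig && col) {
        for (size_t i = 0; i < n; i++)
            orig[i] = img[i] = (int)(i * 37 % 101) - 50; /* includes negatives */
        for (int x = 0; x < w; x++) fdwt53_v_col(img, w, h, x, col);
        ok = idwt53_v(img, w, h, w) == 0 && memcmp(img, orig, n * sizeof *img) == 0;
    }
    free(img); free(orig); free(col);
    return ok;
}
```

`roundtrip_ok(7, 9)` runs the forward transform one column at a time, then the grouped inverse. It returns 1 when the image is reconstructed exactly. A width of 7 also exercises a final group of only 3 columns. A real implementation would likely replace the inner `c` loops with SIMD intrinsics, but plain C keeps the sketch portable.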