(Originally a PS2Dev forum post)
Drawing large prims as strips is much faster because of the way the GS's cache and rasterizer work. The GS rasterizes a maximum of 8x2 pixels per clock, or 4x2 when texturing. This is the fixed pixel pipeline layout: it cannot, for example, render 2x8 or 4x4 pixels per clock. It ALWAYS scans left-to-right, then top-to-bottom, and it can only operate on one page at a time (a page is 64x32 pixels at 32bpp). When switching from one page to another you get a "page break" penalty while the memory unit loads the new page from RAM into the internal cache.
So knowing this we can estimate how many page breaks will occur when we render...
For a 640x512 single prim at 0,0 without texturing the GS will render 8x2 pixels at the following locations:
0,0 ; 8,0 ; 16,0 ; 24,0 ; 32,0 ; 40,0 ; 48,0 ; 56,0 <Load a new page>
64,0 ; 72,0 ; 80,0 ; 88,0 ; 96,0 ; 104,0 ; 112,0 ; 120,0 <Load a new page>
[...]
When it gets to the end of the scanline it'll start at 0,2 and proceed in the same manner as above until all scanlines are rendered.
So for this 640x512 prim, you end up with (640/64)*(512/2) = 2560 page breaks, which is very expensive. For every eight cycles of drawing you have a page break, which might take dozens to hundreds of cycles.
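To sanity-check that estimate, here is a rough Python sketch (a model of the description above, not real GS code). It walks a sprite in 8x2 steps, left-to-right then top-to-bottom, and counts every time the page under the rasterizer changes. The function name and the assumption that the prim is aligned to the 8x2 grid are mine, and texturing is ignored.

```python
PAGE_W, PAGE_H = 64, 32   # page size in pixels at 32bpp (from the post)
BLOCK_W, BLOCK_H = 8, 2   # pixels rasterized per clock without texturing


def count_page_breaks(x0, y0, width, height, last_page=None):
    """Walk a sprite in scan order and count page changes.

    Returns (breaks, last_page) so that several prims can be chained.
    The very first page load is counted as a break.
    """
    breaks = 0
    for y in range(y0, y0 + height, BLOCK_H):
        for x in range(x0, x0 + width, BLOCK_W):
            page = (x // PAGE_W, y // PAGE_H)
            if page != last_page:
                breaks += 1
                last_page = page
    return breaks, last_page


breaks, _ = count_page_breaks(0, 0, 640, 512)
print(breaks)  # 2560
```

Each pair of scanlines crosses all ten page columns, and the jump back to x=0 also changes page, so every one of the 256 scanline pairs costs ten page loads.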
If you cut the prim up into ten 64x512 strips and properly align them to page boundaries you end up with a page break every 32nd scanline for each strip...
0,0 ; 8,0 ; 16,0 ; 24,0 ; 32,0 ; 40,0 ; 48,0 ; 56,0
0,2 ; 8,2 ; 16,2 ; 24,2 ; 32,2 ; 40,2 ; 48,2 ; 56,2
[...]
0,30 ; 8,30 ; 16,30 ; 24,30 ; 32,30 ; 40,30 ; 48,30 ; 56,30 <Load a new page>
So (512/32)*10 = 160 page breaks, which is sixteen times fewer than with a single primitive. This time we get to draw for 128 cycles (a full 64x32 page at 8x2 pixels per clock) before each page break. This improves performance A LOT.
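As a rough illustration of the strip approach, the Python sketch below cuts a sprite into vertical strips whose edges land on page boundaries and counts the page loads per strip. The helper names are hypothetical, and the count assumes untextured 32bpp drawing with no z-buffer and strips that each stay inside one page column.

```python
PAGE_W, PAGE_H = 64, 32  # page size in pixels at 32bpp (from the post)


def split_into_strips(x, y, width, height, strip_w=PAGE_W):
    """Split a sprite into vertical strips cut at multiples of strip_w."""
    strips = []
    left = x
    right = x + width
    while left < right:
        # Next strip boundary strictly to the right of `left`.
        boundary = (left // strip_w + 1) * strip_w
        strip_right = min(boundary, right)
        strips.append((left, y, strip_right - left, height))
        left = strip_right
    return strips


def pages_touched(strip):
    """Page loads for a strip that stays inside a single page column."""
    _, y, _, height = strip
    first_row = y // PAGE_H
    last_row = (y + height - 1) // PAGE_H
    return last_row - first_row + 1


strips = split_into_strips(0, 0, 640, 512)
total = sum(pages_touched(s) for s in strips)
print(len(strips), total)  # 10 160
```

A sprite that does not start on a page boundary simply gets narrower first and last strips, for example `split_into_strips(10, 0, 100, 32)` gives a 54-pixel strip followed by a 46-pixel strip.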
The extra command overhead is negligible for the GS and the GIF, since the setup engine will happily chew through one command every cycle at 150 MHz and can overlap this with drawing. Generating the strips may be slower on the EE side, but it shouldn't cost that much.
Unfortunately, I don't have my PS2 set up so I can't measure the different prim times, but I seem to remember a full-screen 640x512 prim taking about 4 milliseconds to draw, and ten 64x512 strips taking less than half a millisecond.
Also note that this doesn't just apply to sprites. If you render large triangles that cross page boundaries you will get huge performance penalties, because the GS was designed to render many small primitives, not a few large ones. A lot of "later generation" PS2 games will actually sub-divide triangles if they get too large (trading VU time for GS time).
References:
http://www.technology.scee.net/files/presentations/agdc2000/ThePowerOfPS2.pdf
"Keep polygons small - many small polys are much quicker than one large polygon."
http://www.technology.scee.net/files/presentations/agdc2002/PS2Optimisations.pdf
"Wide Primitives will cause page misses, Use 32 Pixel wide strips to reduce page misses"
The reason the Sony guys recommend 32 pixel wide strips is that the GS only has a single page buffer, shared by the colour buffer and the zbuffer. If you aren't zbuffering, then 64 pixel wide strips will not cause any unnecessary page breaks. But if you are zbuffering with 64 pixel wide strips, then due to the way the page buffer is structured you will get one page break every two scanlines.
A basic overview of why this is the case (the ordering shown is simplified; the real layout is much more complex, but this gives a good idea).

Colour pages are ordered like:
<- 64 pixels ->
C0 C1 C4 C5
C2 C3 C6 C7

Depth pages are ordered like:
<- 64 pixels ->
Z6 Z7 Z2 Z3
Z4 Z5 Z0 Z1

So if you're drawing to blocks 0->3 the page cache will look like:
C0 C1 Z2 Z3
C2 C3 Z0 Z1
This means the page buffer holds the first 32x32 pixels of both the colour buffer and the depth buffer, so if you draw a single primitive that is wider than 32 pixels it will page break every other scanline.
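The simplified layout above can be turned into a small Python check. This is only a toy model of the diagrams in this post, not the real GS block arrangement: it assumes each diagram is two rows of four blocks, that each position is a slot in the shared page buffer, and that a slot can hold either a colour block or a depth block. The names are made up for illustration.

```python
# Simplified block positions from the post, read as 2 rows by 4 columns.
COLOUR_LAYOUT = [[0, 1, 4, 5],
                 [2, 3, 6, 7]]
DEPTH_LAYOUT = [[6, 7, 2, 3],
                [4, 5, 0, 1]]


def slots(layout, blocks):
    """Return the (row, col) buffer slots occupied by the given block numbers."""
    return {(r, c)
            for r, row in enumerate(layout)
            for c, block in enumerate(row)
            if block in blocks}


def conflicts(blocks):
    """Slots that colour and depth would both need for the same block numbers."""
    return slots(COLOUR_LAYOUT, blocks) & slots(DEPTH_LAYOUT, blocks)


print(conflicts({0, 1, 2, 3}))        # set()
print(len(conflicts(set(range(8)))))  # 8
```

Blocks 0->3 (the left 32 pixels) fit for both buffers at once, while touching all eight blocks makes colour and depth compete for every slot, which is the situation a primitive wider than 32 pixels runs into.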