
Introduction to the Streaming SIMD Extensions in the Pentium III: Part III

By Bipin Patwardhan


1. Data Swizzling

The speedup that the Pentium III SSE achieves on floating-point operations comes at a price. The data operated on by SSE instructions has to be stored in the new data type defined by SSE. If the application stores the data in its own format, the data has to be converted into the new data type before the SSE instructions can operate on it, and has to be converted back afterward.

This conversion of data from one format into another is termed "data swizzling."

This conversion takes time and machine cycles. If an application converts data from one format to another too often, the machine cycles saved by executing SSE instructions may well be lost. Hence, care is needed.

1.1 Data Organization

Usually, 3D applications store the coordinates of a point in one structure. When handling multiple points, applications use an array of such structures, also called AoS. Typical geometric operations operate differently on the x, y and z coordinates of the point. The code given below lists the typical declaration used by applications processing 3D data. When handling large data sets, this declaration amounts to an array of structures, as illustrated in Figure 1.

struct point {
	float x, y, z;
};
...
point dataset[...];


Figure 1: Array of structures.

To exploit the advantages of SSE, it is better to operate on several points simultaneously, which means operating on the same coordinate of several points at once. This is possible if we collect together the x-, the y- and the z-coordinates of the points, so that the application can process multiple x-, y- and z-coordinates separately. For this, the application must rearrange the data into either three separate arrays or a structure of arrays, with one array for each coordinate. This arrangement is called the SoA arrangement.

The code given below lists the declaration of the structure of arrays, while Figure 2 is the diagrammatic representation of the structure of arrays.

struct point {
	float *x, *y, *z;
};


Figure 2: Structure of arrays.
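
As a rough illustration of the swizzling step, the sketch below converts an array of structures into a structure of arrays and then processes four points per iteration. The names Point, PointsSoA, toSoA and squaredLengths are hypothetical and not part of the original article, std::vector stands in for the raw pointers of the declaration above, and the sketch assumes the number of points is a multiple of four.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>

// AoS element, mirroring the article's point structure (hypothetical name).
struct Point {
    float x, y, z;
};

// SoA layout: one contiguous array per coordinate (hypothetical name).
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Swizzle: gather each coordinate of every point into its own array.
PointsSoA toSoA(const std::vector<Point>& pts) {
    PointsSoA out;
    out.x.reserve(pts.size());
    out.y.reserve(pts.size());
    out.z.reserve(pts.size());
    for (const Point& p : pts) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    assert(out.x.size() == pts.size());
    return out;
}

// Computes x*x + y*y + z*z for every point, four points per iteration.
// Assumption for this sketch: the point count is a multiple of 4.
std::vector<float> squaredLengths(const PointsSoA& p) {
    const std::size_t n = p.x.size();
    assert(n % 4 == 0);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; i += 4) {
        // Unaligned loads: std::vector does not promise 16-byte alignment.
        __m128 x = _mm_loadu_ps(&p.x[i]);
        __m128 y = _mm_loadu_ps(&p.y[i]);
        __m128 z = _mm_loadu_ps(&p.z[i]);
        __m128 sum = _mm_add_ps(_mm_mul_ps(x, x),
                                _mm_add_ps(_mm_mul_ps(y, y), _mm_mul_ps(z, z)));
        _mm_storeu_ps(&out[i], sum);
    }
    return out;
}
```

With the data in SoA form, each SSE register holds the same coordinate of four different points, so no lane has to be treated specially. The cost is the one-time copy in toSoA, which is the swizzling overhead discussed above.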

2. Memory Issues

2.1 Alignment

Handling and manipulating simple variables of the new data type does not create problems. However, it is recommended that variables of the new data type be aligned to 16-byte boundaries. This alignment can be enforced either by setting the appropriate compiler flags or by explicitly using align commands in the program, during variable declaration.

A variable can be aligned to a 16-byte boundary using the __declspec compiler directive, as illustrated in the following example, in which the array myVar is aligned to a 16-byte boundary by the align directive. Variables declared with the new data type itself (__m128) need no explicit directive, because the compiler aligns them automatically when it comes across their declarations. The alignment directive is issued as shown:

__declspec(align(16)) float myVar[4];
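
The __declspec form is specific to the Microsoft and Intel compilers. As a hedged, portable sketch, the standard C++11 alignas specifier can express the same request, and the alignment can be checked at run time before the aligned load and store intrinsics are used. The helper names isAligned16 and addAligned are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Returns true if p lies on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Adds two blocks of four floats using the aligned load/store intrinsics.
// _mm_load_ps and _mm_store_ps require 16-byte aligned addresses, so the
// precondition is checked first (in debug builds).
void addAligned(const float* a, const float* b, float* out) {
    assert(isAligned16(a) && isAligned16(b) && isAligned16(out));
    _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
}
```

A caller would declare its buffers as, for example, `alignas(16) float myVar[4];` before passing them to addAligned.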

2.2 Dynamic Memory

The requirement that pointers used to access the new data types be aligned to 16-byte boundaries creates problems when memory is allocated dynamically or when allocated arrays are accessed through a pointer.

When accessing arrays through pointers, we have to ensure that the pointer is aligned to a 16-byte boundary.

To allocate memory at run time we use either the malloc function or the new operator. By default, neither guarantees that the returned address is aligned to a 16-byte boundary. Hence, we have to either allocate memory and then adjust the pointer to a 16-byte boundary, or allocate the memory using the _mm_malloc function, which takes the required alignment as its second argument and returns a memory block aligned to that boundary.

Just as malloc has a free, the _mm_malloc function has the function _mm_free. Memory blocks allocated using _mm_malloc have to be freed using _mm_free.
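
A minimal sketch of the allocation pattern is given below. The wrapper names allocFloats16 and fillAligned are hypothetical. The sketch assumes that xmmintrin.h makes _mm_malloc and _mm_free visible, which holds for GCC and Clang; other compilers may require malloc.h as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // assumed to declare _mm_malloc and _mm_free

// Allocates n floats on a 16-byte boundary; returns a null pointer on failure.
// The block must later be released with _mm_free, not free or delete.
float* allocFloats16(std::size_t n) {
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

// Fills an aligned buffer of n floats with the value v, four at a time.
// Assumption for this sketch: n is a multiple of 4.
void fillAligned(float* p, std::size_t n, float v) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    assert(n % 4 == 0);
    const __m128 vv = _mm_set1_ps(v);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_ps(p + i, vv);  // aligned store is safe here
}
```

Typical usage is `float* p = allocFloats16(400);`, followed by the computation, followed by `_mm_free(p);`.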

2.3 Custom Datatype

The restriction that pointers be aligned to 16-byte boundaries can be troublesome. It would be much better to be able to ignore the alignment of pointers.

When operating on 128-bit data types, it may be necessary to access the individual floats stored in the data type. In assembly language there is not much choice but to use assembly language constructs. Using C or C++ and the intrinsics library, however, the data will be stored in the data type __m128. In this data type, once the value is set, it is not possible to access the individual floating-point numbers directly. One way to access them is to transfer all floating-point numbers into an array of floats, change the values and load the array of floats back into the data type. The second method is to cast the data type into a float array and then access the required element. The first method is time consuming and the second method may cause problems if not used properly.

Defining a custom data type can overcome these problems. The custom data type is defined as a union of the data type (__m128) and an array of four floats. The declaration of the new data type, called sse4 for now, is given below.

union sse4 {
	__m128 m;
	float f[4];
};

Because the union contains an __m128 member, the compiler aligns every variable of this type to a 16-byte boundary, so it is no longer necessary to align such variables explicitly. An added advantage of this data type is that the individual floating-point numbers stored in the 128-bit data can be accessed directly.
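
The following sketch shows how the union might be used: the __m128 member feeds the intrinsics, and the float array reads or changes a single element. The function name addAndClearW is hypothetical. Note the assumption that the compiler permits reading a union member other than the one last written; ISO C++ does not guarantee this, although GCC documents it as a supported extension.

```cpp
#include <xmmintrin.h>

// The custom data type from the article.
union sse4 {
    __m128 m;
    float f[4];
};

// Adds two vectors with SSE, then overwrites element 3 through the
// float view of the union (relies on union-based type punning).
sse4 addAndClearW(const sse4& a, const sse4& b) {
    sse4 r;
    r.m = _mm_add_ps(a.m, b.m);  // four additions in one instruction
    r.f[3] = 0.0f;               // direct access to a single float
    return r;
}
```

Remember that _mm_set_ps takes its arguments from the highest element to the lowest, so `_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)` places 1.0f in f[0].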

2.4 Detecting the CPU

As the usage of SSE depends on the presence of a Pentium III, it is important that applications be able to detect the Pentium III chip. This is done using the cpuid instruction.

For the cpuid instruction to work as desired, the eax register has to be set to the appropriate value. As we are interested only in the CPU ID, we need to set the eax register to 1 before invoking the cpuid instruction.

The source code to detect the presence of the Pentium III CPU is given below. To be able to compile the code, the file fvec.h has to be included.

BOOL CheckP3HW()
{
	BOOL SSEHW = FALSE;
	_asm {
		// Move the number 1 into eax - a CPUID issued with
		// eax = 1 returns the processor feature bits in EDX
		mov eax, 1

		// Perform CPUID (puts processor feature info in EDX)
		cpuid

		// Does this processor have SSE support?
		// Shift the bits in edx to the right by 26, thus bit 25
		// (SSE bit) is now in CF bit in EFLAGS register.
		shr edx,0x1A

		// If CF is not set, jump over next instruction
		jnc nocarryflag

		// set the return value to TRUE if the CF flag is set
		mov [SSEHW], 1

		nocarryflag:
	}
	
	return SSEHW;
}
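
The listing above relies on Microsoft-style inline assembly. As a hedged alternative, the same feature bit can be read without inline assembly through the compiler's CPUID helpers: __cpuid from intrin.h on the Microsoft compiler, or __get_cpuid from cpuid.h on GCC and Clang. The function name hasSSE is hypothetical. The feature bit only says that the processor implements SSE; operating system support for saving the SSE registers is a separate matter.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Returns true if CPUID function 1 reports SSE support (EDX bit 25).
bool hasSSE() {
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 25)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;  // CPUID function 1 is not available
    return (edx & (1u << 25)) != 0;
#endif
}
```

An application would call hasSSE() once at start-up and select either its SSE code path or a scalar fallback.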

The SSE SDK also has an SSE emulation mode that emulates the Pentium III and the SSE registers. The code given below can be used to detect this emulation. To be able to compile the code, the file fvec.h has to be included.

// Checking for SSE emulation support
BOOL CheckP3Emu()
{
	BOOL SSEEmu = TRUE;
	F32vec4 pNormal(1.0f, 2.0f, 3.0f, 4.0f);
	F32vec4 pZero(0.0f);
	// Checking for SSE hardware support or emulation
	__try {
		_asm {
			// Issue a move instruction that will cause an exception
			// without hardware support or emulation
			movups xmm1, [pNormal]
			
			// Issue a computational instruction that will cause an
			// exception without hardware support or emulation
			divps xmm1, [pZero]
		}
	}
	// If there's an exception, set emulation variable to false
	__except(EXCEPTION_EXECUTE_HANDLER) {
		SSEEmu = FALSE;
	}
	
	return SSEEmu;
}

3. Additional References

For more details about the architecture of the Pentium III, refer to [10], [11], [12] and [13].

For more information about processor identification and CPUID, refer to [7] and [15].

For more information about the Streaming SIMD Extensions, refer to [19]. For more information about the programming issues, the software conventions and the software development strategies, refer to [9], [16] and [17] respectively.

For more about application tuning for SSE and the VTune performance enhancement application, refer to [1] and [8] respectively. For more details about VTune and the Intel C/C++ Compiler, refer to [3] and [2] respectively.

For additional information about the Pentium III processor and its capabilities, refer to [4], [5], [6], [14], [18] and [20].

4. Additional Examples

In this section, we present additional examples to illustrate the usage of the Streaming SIMD Extensions.

4.1 Array Manipulation

In this example, we take two arrays, each with 400 floats, and multiply them element by element. The two operand arrays are named a and b, and the result of the multiplication is stored in a third array, c. In all the sources given below, the following declarations are assumed:

#include <fvec.h>

#define ARRSIZE 400

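__declspec(align(16)) float a[ARRSIZE], b[ARRSIZE], c[ARRSIZE];

4.1.1 Assembly Language

_asm {
	push esi;
	push edi;
	lea edi, a;		// address of the first operand array
	lea esi, b;		// address of the second operand array
	lea edx, c;		// address of the result array
	mov ecx, 100;		// ARRSIZE / 4 iterations, four floats each
mulloop:
	movaps xmm0, [edi];	// the arrays are 16-byte aligned,
	movaps xmm1, [esi];	// so aligned moves can be used
	mulps xmm0, xmm1;
	movaps [edx], xmm0;
	add edi, 16;
	add esi, 16;
	add edx, 16;
	dec ecx;
	jnz mulloop;
	pop edi;
	pop esi;
}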

4.1.2 Intrinsics

__m128 m1, m2, m3;

for ( int i = 0; i < ARRSIZE; i += 4 ) {
    m1 = _mm_loadu_ps(a+i);
    m2 = _mm_loadu_ps(b+i);
    m3 = _mm_mul_ps(m1, m2);
    _mm_storeu_ps(c+i, m3);
}

4.1.3 C++

F32vec4 f1, f2, f3;

for ( int i = 0; i < ARRSIZE; i += 4 ) {
    loadu(f1, a+i); 
    loadu(f2, b+i); 
    f3 = f1 * f2;
    storeu(c+i, f3); 
}
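
All three versions above depend on ARRSIZE being a multiple of four. When the element count is arbitrary, one common approach, sketched below under that assumption, is to process full groups of four with SSE and finish the remaining elements with scalar code. The function name mulArrays is hypothetical.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Computes c[i] = a[i] * b[i] for any element count n.
void mulArrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    // Main loop: four multiplications per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned loads accept any address
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }
    // Tail loop: at most three leftover elements, handled one at a time.
    for (; i < n; ++i)
        c[i] = a[i] * b[i];
}
```

If the arrays are known to be 16-byte aligned, as in the declaration above, the aligned intrinsics _mm_load_ps and _mm_store_ps can replace the unaligned ones in the main loop.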

4.2 Vector for 3D

This example presents a vector in 3D. The vector is implemented as a class. The functionality of the class is implemented using the intrinsics library.

The class declaration is given below.

union sse4 {
	__m128 m;
	float f[4];
};

class sVector3 {
protected:
	sse4 val;
public:
	sVector3(float, float, float);
	float& operator [](int);
	sVector3& operator +=(const sVector3&);
	float length() const;
friend float dot(const sVector3&, const sVector3&);
};

The class implementation is given below.

sVector3::sVector3(float x, float y, float z) {
	val.m = _mm_set_ps(0, z, y, x);
}
float& sVector3::operator [](int i) {
	return val.f[i];
}
sVector3& sVector3::operator +=(const sVector3& v) {
	val.m = _mm_add_ps(val.m, v.val.m);
	return *this;
}
float sVector3::length() const {
	// Euclidean length: square root of the sum of the squares
	sse4 sq, len;
	sq.m = _mm_mul_ps(val.m, val.m);
	len.m = _mm_sqrt_ss(_mm_set_ss(sq.f[0] + sq.f[1] + sq.f[2]));
	return len.f[0];
}
float dot(const sVector3& v1, const sVector3& v2) {
	sVector3 v(v1);
	v.val.m = _mm_mul_ps(v.val.m, v2.val.m);
	return v.val.f[0] + v.val.f[1] + v.val.f[2];
}
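
The dot function above adds the three products through the float view of the union. An alternative sketch, shown below with the hypothetical names dot3 and length3, keeps the sum in SSE registers by using shuffles. It is not taken from the article, and whether it is faster depends on the processor and compiler.

```cpp
#include <xmmintrin.h>

// Dot product of elements 0, 1 and 2 (x, y, z); element 3 is ignored.
float dot3(__m128 a, __m128 b) {
    __m128 p = _mm_mul_ps(a, b);                               // four products
    __m128 y = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast element 1
    __m128 z = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2));  // broadcast element 2
    __m128 s = _mm_add_ss(_mm_add_ss(p, y), z);                // element 0 = x + y + z
    return _mm_cvtss_f32(s);
}

// Euclidean length of the x, y, z part of v.
float length3(__m128 v) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(dot3(v, v))));
}
```

Inside the sVector3 class these helpers would be called with the m member of the union, for example `dot3(v1.val.m, v2.val.m)`.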

4.3 4x4 Matrix

This example presents a 4x4 matrix. The matrix is implemented as a class. The functionality of the class is implemented using the intrinsics library.

The class declaration is given below.

float const sEPSILON = 1.0e-10f;

union sse16 {
	__m128 m[4];
	float f[4][4];
};

class sMatrix4 {
protected:
	sse16 val;
	sse4 sFuzzy;
public:
	sMatrix4(float*);
	float& operator()(int, int);
	sMatrix4& operator +=(const sMatrix4&);
	bool operator ==(const sMatrix4&) const;
	sVector4 operator *(const sVector4&) const;
private:
	float RCD(const sMatrix4& B, int i, int j) const;
};

The class implementation is given below.

sMatrix4::sMatrix4(float* fv) {
	val.m[0] = _mm_set_ps(fv[3], fv[2], fv[1], fv[0]);
	val.m[1] = _mm_set_ps(fv[7], fv[6], fv[5], fv[4]);
	val.m[2] = _mm_set_ps(fv[11], fv[10], fv[9], fv[8]);
	val.m[3] = _mm_set_ps(fv[15], fv[14], fv[13], fv[12]);
	float f = sEPSILON;
	sFuzzy.m = _mm_set_ps(f, f, f, f);
}
float& sMatrix4::operator()(int i, int j) {
	return val.f[i][j];
}
sMatrix4& sMatrix4::operator +=(const sMatrix4& M) {
	val.m[0] = _mm_add_ps(val.m[0], M.val.m[0]);
	val.m[1] = _mm_add_ps(val.m[1], M.val.m[1]);
	val.m[2] = _mm_add_ps(val.m[2], M.val.m[2]);
	val.m[3] = _mm_add_ps(val.m[3], M.val.m[3]);
	return *this;
}
bool sMatrix4::operator ==(const sMatrix4& M) const {
	int res[4];
	res[0] = res[1] = res[2] = res[3] = 0;
	res[0] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
		_mm_max_ps(val.m[0], M.val.m[0]),
		_mm_min_ps(val.m[0], M.val.m[0])), sFuzzy.m));
	res[1] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
		_mm_max_ps(val.m[1], M.val.m[1]),
		_mm_min_ps(val.m[1], M.val.m[1])), sFuzzy.m));
	res[2] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
		_mm_max_ps(val.m[2], M.val.m[2]),
		_mm_min_ps(val.m[2], M.val.m[2])), sFuzzy.m));
	res[3] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
		_mm_max_ps(val.m[3], M.val.m[3]),
		_mm_min_ps(val.m[3], M.val.m[3])), sFuzzy.m));
	// A mask of 15 means all four elements of a row compared as equal
	if ( (15 == res[0]) && (15 == res[1])
			&& (15 == res[2]) && (15 == res[3]) )
		return true;
	return false;
}
sVector4 sMatrix4::operator *(const sVector4& v) const {
	return sVector4(
		val.f[0][0] * v[0] + val.f[0][1] * v[1]
			+ val.f[0][2] * v[2] + val.f[0][3] * v[3],
		val.f[1][0] * v[0] + val.f[1][1] * v[1]
			+ val.f[1][2] * v[2] + val.f[1][3] * v[3],
		val.f[2][0] * v[0] + val.f[2][1] * v[1]
			+ val.f[2][2] * v[2] + val.f[2][3] * v[3],
		val.f[3][0] * v[0] + val.f[3][1] * v[1]
			+ val.f[3][2] * v[2] + val.f[3][3] * v[3]);
}
float sMatrix4::RCD(const sMatrix4& B, int i, int j) const {
	return val.f[i][0] * B.val.f[0][j] + val.f[i][1] * B.val.f[1][j]
		+ val.f[i][2] * B.val.f[2][j] + val.f[i][3] * B.val.f[3][j];
}
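
The matrix-vector product above is written with scalar arithmetic. As a hedged sketch of how the same operation might be expressed with intrinsics, the code below multiplies each row by the vector and then sums the four products. The names hsum and mulMatVec are hypothetical, and the rows are passed as plain __m128 values rather than through the sMatrix4 class.

```cpp
#include <xmmintrin.h>

// Returns the sum of the four elements of v.
static float hsum(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);   // (v2, v3, v2, v3)
    __m128 s = _mm_add_ps(v, hi);      // element 0 = v0 + v2, element 1 = v1 + v3
    __m128 t = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(s, t));
}

// Computes M * v, where rows[i] holds row i of the 4x4 matrix M.
__m128 mulMatVec(const __m128 rows[4], __m128 v) {
    // _mm_set_ps takes its arguments from the highest element to the lowest.
    return _mm_set_ps(hsum(_mm_mul_ps(rows[3], v)),
                      hsum(_mm_mul_ps(rows[2], v)),
                      hsum(_mm_mul_ps(rows[1], v)),
                      hsum(_mm_mul_ps(rows[0], v)));
}
```

The horizontal sums are the expensive part of this layout. Storing the matrix by columns, so that each column can be scaled by one broadcast vector element and the results added, avoids them, at the price of a different data organization.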

 


References

[1] James Abel, Kumar Balasubramanian, Mike Bargeron, Tom Craver, and Mike Phlipot. Applications Tuning for Streaming SIMD Extensions. Technical report, Intel Corporation, 1999.

[2] Intel Corporation. Intel C/C++ Compiler Web Site. http://developer.intel.com/vtune/icl.

[3] Intel Corporation. VTune Performance Analyzer Web Site. http://developer.intel.com/vtune/performance.

[4] Intel Corporation. Developer Relations Group Web Site. http://developer.intel.com/drg.

[5] Intel Corporation. Intel Developer Web Site. http://developer.intel.com.

[6] Intel Corporation. Web Site. http://www.intel.com.

[7] Stephan Fischer, James Mi, and Albert Tang. Pentium III Processor Serial Number Feature and Applications. Technical report, Intel Corporation, 1999.

[8] Joe Wolf III. Programming Methods for the Pentium III Processor Streaming SIMD Extensions Using the VTune Performance Enhancement Environment. Technical report, Intel Corporation, 1999.

[9] Intel Corporation. Data Alignment and Programming Issues for the Streaming SIMD Extensions with the Intel C/C++ Compiler, 1999. App Note AP-833.

[10] Intel Corporation. Intel Architecture Optimization Reference Manual, 1999.

[11] Intel Corporation. Intel Architecture Software Development Manual. Volume 1: Basic Architecture, 1999.

[12] Intel Corporation. Intel Architecture Software Development Manual. Volume 2: Instruction Set Reference, 1999.

[13] Intel Corporation. Intel Architecture Software Development Manual. Volume 3: Systems Programming Guide, 1999.

[14] Intel Corporation. Intel Pentium III Processor Performance Brief, 1999.

[15] Intel Corporation. Intel Processor Identification and CPUID Instruction, 1999. App Note AP-485.

[16] Intel Corporation. Software Conventions for Streaming SIMD Extensions, 1999. App Note AP-589.

[17] Intel Corporation. Software Development Strategies for Streaming SIMD Extensions, 1999. App Note AP-814.

[18] Jagannath Keshavan and Vladimir Pentkovski. Pentium III Processor Implementation Trade-offs. Technical report, Intel Corporation, 1999.

[19] Shreekant Thakkar and Tom Huff. Internet Streaming SIMD Extensions. Technical report, Intel Corporation, 1999.

[20] Paul Zagacki, Deep Duch, Emil Hsiech, Daniel Melaku, and Vladimir Pentkovski. Architecture of 3D Software Stack for Peak Pentium III Processor Performance. Technical report, Intel Corporation, 1999.


Bipin Patwardhan

National Centre for Software Technology, Mumbai.
email: bipin@ncst.ernet.in


Copyright © 2008 Dr. Dobb's Journal