C vs C++ style of interface for hardware abstraction layer

CAR145 · Post by **CAR145** » Sat Jan 12, 2013 9:28 pm

Hey,

I've recently started a new game project and I've begun working on the hardware abstraction layer. I cannot decide between a C or a C++ (OOP) style of interface for the rendering part. (The engine is in C++ and will be using DX11 for rendering on PC.)

When writing the vector library for the engine I came across this article that shows when writing a SIMD optimized vector library you can gain a 62.74ms speed increase (Microsoft compiler, the one I'm using) on a cloth simulation just by giving the library a C style interface instead of encapsulating the vector in a class!

Will giving my rendering engine a C style interface give a notable speed increase? I prefer using OOP but I don't want to throw away performance I may need later when working on much less powerful consoles.

Thanks.

Pornomag · Post by **Pornomag** » Mon Jan 14, 2013 11:11 pm

If I were you, I would not worry about performance. Just get it working, and then later you can optimize if you need to. You should not optimize prematurely; optimize when you have working code that you think could be faster. Do note that this doesn't mean write dumb code. Read this: http://prog21.dadgum.com/106.html

Also... why re-invent the wheel if there are plenty of math libraries already written for computer graphics? For example: GLM, http://glm.g-truc.net/. Although it's aimed at OpenGL, it can be used with DirectX. Unless you're doing this for an educational purpose? (Although vector/matrix math is a pretty easy topic, so I don't see why you'd want to make your own classes/structures/operations if there are portable solutions.)

qpHalcy0n · Post by **qpHalcy0n** » Tue Jan 15, 2013 12:08 am

Well I see the point and I always advocate learning from the ground up. I think the analytic approach imparts a knowledge necessary to build the technology of 5 years from now, not the technology of a few years ago to now. On some systems, this is pretty crucial. On the PS2 this was exceptionally evident as they had built in types for aligned 128bit float types even before SSE2 was a "thing". It really was a new form of optimization that wasn't new for architects and people who developed systems but WAS a different thing for game programmers accustomed to developing on prototypical Intel 32-bit architecture. Learn all you can, I say.

That said, there is a level of practicality that you have to be willing to approach and it's completely application dependent. For real-time graphics, you're probably not going to be doing THAT much CPU-level computation unless you're running pure Monte-Carlo simulations (which there is definitely application for). The thing about these numbers is this: first off, it's on the order of 62ms for the entire data set to be computed. Not per tick. Processors execute instructions in NANO-seconds. 6 orders of magnitude faster than milliseconds. That said, for simulations that may be relevant. In the real-time domain with a ~16ms frame time budget, if you spent 1ms on CPU floating point ops you're abusing the CPU. At that point, the time slice is so fine that the two numbers start to converge. IE: When you force the simulation to run at that fine of a granularity, the differences tend to disappear. There are cases on some architectures where this can be a BIG stumbling block and it really requires a more intimate knowledge of that architecture and that toolchain. I can say for typical Intel platforms, it really comes down to just "writing better C++" code. I always say, "What runs faster than C++ code?".....well "Better C++ code". Similarly, "How can I make Java code run faster?"....."Write better Java".

In this case, the author misses a few things that the compiler (especially in the VS 2012 case) is not doing that it probably should still not be doing. By inlining the method, the intrinsic can be passed to an XMM register directly and avoid needless overhead. The compiler has no reason to inline these methods, and you're really enforcing that point if they're overloaded. It's pretty critical to provide a discrete pathway there or inline the method. I think a few comments alluded to this fact and they were spot on. Of course, you wouldn't know this unless you profiled your code and identified it as a bottleneck. In a pipelined architecture, you will only run as fast as your slowest code...so it goes without saying there are things that can be "optimized" that would never make a difference anyways.
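For the "you wouldn't know this unless you profiled" part, here is a minimal timing sketch, assuming a C++11 compiler. `timeMs` and `sumOfSquares` are made-up names for illustration, and the workload just stands in for whatever math pass you suspect; a real profiler will tell you far more than a stopwatch like this.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` once and returns elapsed wall time in ms.
template <typename Work>
double timeMs(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical workload standing in for a per-frame math pass.
float sumOfSquares(const std::vector<float> &values) {
    float total = 0.0f;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i] * values[i];
    }
    return total;
}
```

Calling something like `timeMs([&] { result = sumOfSquares(data); })` and comparing the number against the ~16ms frame budget is usually enough to tell whether the code is worth touching at all.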

I'm also not sure what "hardware abstraction" you're referring to. DX11 *IS* a hardware abstraction layer.

Post by **Falco Girgis** » Tue Jan 15, 2013 1:41 am

qpHalcy0n wrote:Well I see the point and I always advocate learning from the ground up. I think the analytic approach imparts a knowledge necessary to build the technology of 5 years from now, not the technology of a few years ago to now. On some systems, this is pretty crucial. On the PS2 this was exceptionally evident as they had built in types for aligned 128bit float types even before SSE2 was a "thing". It really was a new form of optimization that wasn't new for architects and people who developed systems but WAS a different thing for game programmers accustomed to developing on prototypical Intel 32-bit architecture. Learn all you can, I say.

Same goes for the Dreamcast. The SH4 has SIMD instructions for loading and storing 4x4 matrices and transforming 4x4 matrices and 4x1 vectors. That's why they called it the first 128-bit console.

I think qp really covered the majority of it from an optimization standpoint, but I will at least mention the C vs C++ performance part. I am all about writing low-level drivery shit in C, but that's admittedly my own personal taste (especially from my fucking around with drivers as Linux kernel modules). There is no reason that a C++ driver could not be as fast as C. There is no magic intrinsic to the C language that makes it faster or to the C++ language that makes it slower. With just a little bit of understanding of how C++ works under the hood, you can easily get C-like performance with the OO organization C++ offers.

Probably the most common overhead from C++ is associated with run-time polymorphism and vtable lookups from invoking virtual member functions. I would highly recommend not using the "virtual" keyword for anything time-critical. I get pissy when static drivers implement virtual interfaces just because it looks pretty, when they will never ever be polymorphed to or from any other datatype at runtime. Don't EVER use virtual inheritance for a performance-critical class either.

Templates have overhead associated with them, but that usually tends to be with code size/bloat rather than execution time (templates used for compile-time polymorphism can be much faster than using virtuals for run-time polymorphism and even sometimes faster than C-style function pointers).
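A rough sketch of the difference between the two kinds of polymorphism, using the curiously recurring template pattern (CRTP) for the compile-time side. Every name here (`VirtualRenderer`, `StaticRenderer`, `submitFrame`, and so on) is hypothetical, and `meshes * 2` is a placeholder body, not anything DX11 actually does.

```cpp
// Run-time polymorphism: each call through the base goes via the vtable.
struct VirtualRenderer {
    virtual ~VirtualRenderer() {}
    virtual int drawCalls(int meshes) const = 0;
};

struct VirtualDx11 : VirtualRenderer {
    int drawCalls(int meshes) const { return meshes * 2; }
};

// Compile-time polymorphism (CRTP): the call target is known statically,
// so the compiler is free to inline it.
template <typename Backend>
struct StaticRenderer {
    int drawCalls(int meshes) const {
        return static_cast<const Backend *>(this)->drawCallsImpl(meshes);
    }
};

struct StaticDx11 : StaticRenderer<StaticDx11> {
    int drawCallsImpl(int meshes) const { return meshes * 2; }
};

// Generic code written against the static interface; no vtable involved.
template <typename Backend>
int submitFrame(const StaticRenderer<Backend> &renderer, int meshes) {
    return renderer.drawCalls(meshes);
}
```

The trade-off is the one mentioned above: `submitFrame` gets instantiated once per backend type, so you pay in code size instead of in indirect calls.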

Accessor methods (that aren't inlined) invoked on multiple layers of encapsulated C++ objects are another great way to quickly add function call overhead in C++. When calling set on object A calls a set on the object B it is encapsulating, which calls a set on object C that is encapsulated by B, you are adding a shitload of pushes/pops to the stack to achieve almost nothing... That's why simple accessor methods should ALWAYS be inlined in C++.
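To make the A/B/C case concrete, here is a small sketch with hypothetical classes named after the objects in the paragraph above. Member functions defined inside the class body are implicitly inline, which gives the compiler the chance to collapse the whole chain into a single store; whether it actually does is up to the optimizer.

```cpp
// Hypothetical three-layer nesting: A wraps B, B wraps C.
class C {
    int _value;
public:
    C() : _value(0) {}
    // Defined in the class body, so implicitly inline.
    void set(int value) { _value = value; }
    int get() const { return _value; }
};

class B {
    C _c;
public:
    void set(int value) { _c.set(value); }
    int get() const { return _c.get(); }
};

class A {
    B _b;
public:
    void set(int value) { _b.set(value); }
    int get() const { return _b.get(); }
};
```

If these bodies lived in separate .cpp files instead, each `set` would be a real call unless link-time optimization steps in.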

But anyway, not to get into too much gory detail... C makes it easier to write very fast code by the nature of the language. You can still achieve these speeds with C++ with a little bit of understanding of the language. It's up to you whether you prefer to write in C or C++. There is no "better" option based solely on speed.

I prefer C-style drivers because 1) they can be used in C or C++ 2) It is easier for me to write fast C code 3) I feel like a static, global, C-style API more adequately represents something like a driver. Hardware truly IS a global state.
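One way such a static, global, C-style API might look when the implementation is still compiled as C++. This is only a sketch: `GfxState`, `gfxInit`, `gfxGetState` and `gfxShutdown` are invented names, and the header and source parts are shown in one block for brevity.

```cpp
// --- what would live in a header, callable from C or C++ ---
#ifdef __cplusplus
extern "C" {
#endif

typedef struct GfxState {
    int width;
    int height;
    int initialized;
} GfxState;

int gfxInit(int width, int height);
const GfxState *gfxGetState(void);
void gfxShutdown(void);

#ifdef __cplusplus
}
#endif

// --- what would live in the .cpp implementation file ---
namespace {
GfxState g_state = {0, 0, 0};  // single global state, mirroring the hardware
}

extern "C" int gfxInit(int width, int height) {
    if (width <= 0 || height <= 0) return 0;  // reject nonsense sizes
    g_state.width = width;
    g_state.height = height;
    g_state.initialized = 1;
    return 1;
}

extern "C" const GfxState *gfxGetState(void) { return &g_state; }

extern "C" void gfxShutdown(void) { g_state.initialized = 0; }
```

The `extern "C"` wrapper only changes linkage, so the function bodies can still use any C++ feature internally.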

Post by **Falco Girgis** » Tue Jan 15, 2013 2:16 am

CAR145 wrote:When writing the vector library for the engine I came across this article that shows when writing a SIMD optimized vector library you can gain a 62.74ms per tick speed increase (Microsoft compiler, the one I'm using) on a cloth simulation just by giving the library a C style interface instead of encapsulating the vector in a class!

Let me comment directly on this.

Using overloaded operators in C++ is a VERY rookie mistake for time-critical code. You can look all around game development boards like this and see Vector2/3/4 classes and Matrix classes with overloaded operators. These are going to introduce a good amount of overhead by nature of temporary variable creation.

Let's use the common example:

class Vector2 {
private:
    int _x, _y;
public:
    inline Vector2(const int x, const int y): _x(x), _y(y) {}

    inline Vector2 operator+(const Vector2 &rhs) const {
        return Vector2(_x+rhs._x, _y+rhs._y);
     }
};

Now this sneaky little fucker may look absolutely adorable to all of you newbies out there. Look, you can do this!

Vector2 vec = vec1 + vec2 + vec3 + vec4;

It looks like real math and shit, brah!

But what you don't realize is the completely unnecessary run-time overhead you just introduced just so you can do cute shit like that. If you notice, the overloaded '+' operator is actually returning a temporary object by value. For every addition you make, you have just created a new, temporary variable that must be stored intermediately on the stack by the compiler. For the above code, you created and initialized 3 temporary variables on the stack behind the scenes just in that one line! Absolutely unjustifiable for a math class like that. Note: Actual overhead from that statement may vary slightly based on how smart your compiler is, but it's still not justifiable.

The most efficient way to do that is definitely C-style:

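// Assumes Vector2 is a plain struct with public _x and _y (or that this function is a friend of it).
inline void Vec2Add(Vector2 *const dest, const Vector2 *const src1, const Vector2 *const src2) {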
    dest->_x = src1->_x + src2->_x;
    dest->_y = src1->_y + src2->_y;
}

There is no intermediate value created on the stack and copied over to the destination. Everything is now stored directly into the destination object. Your pretty little (inefficient as shit) line of code would become:

Vector2 vec;
Vec2Add(&vec, &vec1, &vec2);
Vec2Add(&vec, &vec, &vec3);
Vec2Add(&vec, &vec, &vec4);

Uglier? Yes. But that's the trade-off if you want performance.

You can still achieve that same thing in (uglier) C++ too:

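class Vector2 {
private:
    int _x, _y;
public:
    inline Vector2(): _x(0), _y(0) {}   // needed so "Vector2 vec;" compiles
    inline Vector2(const int x, const int y): _x(x), _y(y) {}

    // Writes src1 + src2 straight into this object; nothing is returned by value.
    inline void add(const Vector2 &src1, const Vector2 &src2) {
        _x = src1._x + src2._x;
        _y = src1._y + src2._y;
    }
};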

Now you can use this bitch like:

Vector2 vec;
vec.add(vec1, vec2);
vec.add(vec, vec3);
vec.add(vec, vec4);

That's C++ with member functions, and I can guarantee you it will be just as fast as the C version... It's basically the same code to the compiler (with an implicit "this" ptr as the destination address).

Now this is not "pretty" C++. This is not the way of doing things that the OOphiles and Javacunts like. But then again, those guys aren't reaping the benefits of efficiency or hardware acceleration... A more important question to ask yourself is would you rather write a "pretty" C API or an "ugly" C++ API?

The fast approach is "just the way you do it" in C.
The fast approach is "an ugly way of doing it" in C++.

Post by **Falco Girgis** » Tue Jan 15, 2013 9:56 am

Pornomag wrote:If I were you, I would not worry about performance. Just get it working, and then later you can optimize if you need to. You should not optimize prematurely; optimize when you have working code that you think could be faster. Do note that this doesn't mean write dumb code. Read this: http://prog21.dadgum.com/106.html

The truth is that what he is asking is not nitty-gritty details that are encapsulated somewhere and can later be fixed. This is a question of high-level API design, which needs to be addressed before he even begins writing the code.

CAR145 · Post by **CAR145** » Tue Jan 15, 2013 1:48 pm

Awesome, thanks Falco!
I think I'll take the C route, because I know that if I write ugly code, I will end up re-writing it later when I have to change something in that section of code (and ~~I'm a bit~~ very OCD with my code).
I'll definitely read everything else posted here too.

Thanks guys! Maybe I'll post my progress when I get something worth showing

Rebornxeno · Post by **Rebornxeno** » Tue Jan 22, 2013 6:37 pm

Falco Girgis wrote:These are going to introduce a good amount of overhead by nature of temporary variable creation.

Why does this happen and is this done by compilers? What is causing the variables to be created?

Post by **dandymcgee** » Tue Jan 22, 2013 6:47 pm

Rebornxeno wrote:
Falco Girgis wrote:These are going to introduce a good amount of overhead by nature of temporary variable creation.
Why does this happen and is this done by compilers? What is causing the variables to be created?

class Vector2 {
private:
    int _x, _y;
public:
    inline Vector2(const int x, const int y): _x(x), _y(y) {}

    inline Vector2 operator+(const Vector2 &rhs) {
        return Vector2(_x+rhs._x, _y+rhs._y); //This line creates a new instance of the Vector2 class
     }
};

Pornomag · Post by **Pornomag** » Wed Jan 23, 2013 2:21 am

dandymcgee wrote:
Rebornxeno wrote:
Falco Girgis wrote:These are going to introduce a good amount of overhead by nature of temporary variable creation.
Why does this happen and is this done by compilers? What is causing the variables to be created?
class Vector2 {
private:
    int _x, _y;
public:
    inline Vector2(const int x, const int y): _x(x), _y(y) {}

    inline Vector2 operator+(const Vector2 &rhs) {
        return Vector2(_x+rhs._x, _y+rhs._y); //This line creates a new instance of the Vector2 class
     }
};

To be fair, the compiler can optimize away any temporary variables from ever being created, this isn't a guarantee but any sane compiler will do so (if it can).

Just in case anyone doesn't believe me:
http://en.wikipedia.org/wiki/Return_value_optimization
http://stackoverflow.com/questions/6658 ... 854#665854

Post by **Falco Girgis** » Wed Jan 23, 2013 12:34 pm

Pornomag wrote:To be fair, the compiler can optimize away any temporary variables from ever being created, this isn't a guarantee but any sane compiler will do so (if it can).

Just in case anyone doesn't believe me:
http://en.wikipedia.org/wiki/Return_value_optimization
http://stackoverflow.com/questions/6658 ... 854#665854

Which is exactly why I said this:

Falco Girgis wrote:Note: Actual overhead from that statement may vary slightly based on how smart your compiler is, but it's still not justifiable.

The compiler "may" use the Return Value Optimization (RVO), but with a scenario as complex as this:

Vector2 vec = vec1 + vec2 + vec3;

Will it?

This is far more than a simple "get" method initializing a single object.

Well, let's see then, shall we? The following program should test the compiler's optimization. Note that I have been extremely strict with my consting, and that everything is inlined. The members _x and _y couldn't even be made const in the real world, but I wanted to see the compiler's best shot. Build and run the following program with GCC on any optimization setting.

#include <QDebug>	// qDebug() comes from Qt

struct Vector2 {
	const int _x, _y;

	Vector2(const int x, const int y): _x(x), _y(y) {
		qDebug() << "Constructing <" << x << ", " << y << "> at " << this;
	}

	Vector2(const Vector2 &vec): _x(vec._x), _y(vec._y) {
		qDebug() << "Copy Constructing <" << _x << ", " << _y << "> from " << &vec;
	}

	Vector2 operator+(const Vector2 &rhs) const {
		return Vector2(_x+rhs._x, _y+rhs._y);
	}

};


int main(int argc, char *argv[]) {

	const Vector2 vec1(1, 1);
	const Vector2 vec2(2, 2);
	const Vector2 vec3(3, 3);

	const Vector2 vec = vec1 + vec2 + vec3;

	return 0;
}

I am getting this output:

Constructing < 1 ,  1 > at  0x28fe30 
Constructing < 2 ,  2 > at  0x28fe28 
Constructing < 3 ,  3 > at  0x28fe20 
Constructing < 3 ,  3 > at  0x28fe10 
Constructing < 6 ,  6 > at  0x28fe18

Since we see no copy constructor being invoked anywhere, we can assume the compiler has optimized away copying the return value (rvalue) to the lvalue.

But even then, we see 5 objects being constructed here, rather than just the three we explicitly constructed. This statement still requires temporary storage allocated (and initialized) on the stack for every addition that is not immediately assigned to an lvalue. For n additions, that's n-1 temporary variables allocated on the stack.
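The same count can be reproduced without Qt by bumping counters instead of printing. This is just a sketch with made-up names (`Counts`, `CountedVec2`, `runExperiment`), and note that the counters are side effects too, so the compiler is not allowed to drop these constructor calls either.

```cpp
// Counters instead of qDebug(), so the sketch needs no Qt.
struct Counts {
    int valueCtor;
    int copyCtor;
};

Counts g_counts = {0, 0};

struct CountedVec2 {
    int _x, _y;

    CountedVec2(int x, int y) : _x(x), _y(y) { ++g_counts.valueCtor; }
    CountedVec2(const CountedVec2 &v) : _x(v._x), _y(v._y) { ++g_counts.copyCtor; }

    CountedVec2 operator+(const CountedVec2 &rhs) const {
        return CountedVec2(_x + rhs._x, _y + rhs._y);
    }
};

// Rebuilds the experiment: three named vectors, one chained sum.
Counts runExperiment() {
    g_counts.valueCtor = 0;
    g_counts.copyCtor = 0;
    const CountedVec2 a(1, 1), b(2, 2), c(3, 3);
    const CountedVec2 sum = a + b + c;
    (void)sum;  // silence unused-variable warnings
    return g_counts;
}
```

The value constructor runs five times here no matter what (three named objects plus one per `operator+`); how many copy constructions show up depends on the language standard and on whether the compiler elides them.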

So no, even the RVO with overloaded operators cannot defeat the C-style math API. Assuming any compiler out there could have the audacity to completely optimize-away the temporary storage here, what have you achieved? You have code that at best case will run as fast as the C code, but will usually fall short on other compilers.

For some high-level C++ API that is not being called thousands of times per frame, go ahead and return by value. For very time-critical math operations like this, it is simply poor coding practice or ignorance to not embrace a C-style API.

Pornomag · Post by **Pornomag** » Wed Jan 23, 2013 8:28 pm

Ah yes, I see, my bad.

Post by **Fillius** » Fri Feb 01, 2013 8:41 am

Hi,

there is a problem with the example provided. You effectively prevent the compiler from optimizing away all temporaries by making your constructors non-trivial (because of the debug output).
The last published draft of the current C++ Standard dictates that a non-trivial constructor of temporaries must be called:

In Section 12.2, Page 259:

When an implementation introduces a temporary object of a class that has a non-trivial constructor (12.1,
12.8), it shall ensure that a constructor is called for the temporary object.

The only exceptions are non-trivial copy or move constructors, which may be omitted for copy elision:

In Section 12.8, Page 283:

When certain criteria are met, an implementation is allowed to omit the copy/move construction of a class
object, even if the copy/move constructor and/or destructor for the object have side effects. In such cases,
the implementation treats the source and target of the omitted copy/move operation as simply two different
ways of referring to the same object, and the destruction of that object occurs at the later of the times
when the two objects would have been destroyed without the optimization.

This code (essentially the same as above, but without most of the const and without any debug output):

struct Vector2 {
   int _x, _y;

   Vector2(){};
   
   Vector2(int x, int y): _x(x), _y(y) {}

   Vector2(const Vector2 &vec): _x(vec._x), _y(vec._y) {}

   Vector2 operator+(const Vector2 &rhs) const {
      return Vector2(_x+rhs._x, _y+rhs._y);
   }
   
   void add(const Vector2 &src1, const Vector2 &src2) {
        _x = src1._x + src2._x;
        _y = src1._y + src2._y;
    }

};


int main(int argc, char *argv[]) {

   Vector2 vec1(1, 1);
   Vector2 vec2(2, 2);
   Vector2 vec3(3, 3);

   Vector2 vec;//=vec1 + vec2 + vec3;
   vec.add(vec1,vec2);
   vec.add(vec,vec3);

   return 0;
}

Produces (on my laptop, g++ version 4.4.5, lowest optimization level) the exact same assembly as the version using the overloaded operator+:

struct Vector2 {
   int _x, _y;

   Vector2(){};
   
   Vector2(int x, int y): _x(x), _y(y) {}

   Vector2(const Vector2 &vec): _x(vec._x), _y(vec._y) {}

   Vector2 operator+(const Vector2 &rhs) const {
      return Vector2(_x+rhs._x, _y+rhs._y);
   }
   
   void add(const Vector2 &src1, const Vector2 &src2) {
        _x = src1._x + src2._x;
        _y = src1._y + src2._y;
    }

};


int main(int argc, char *argv[]) {

   Vector2 vec1(1, 1);
   Vector2 vec2(2, 2);
   Vector2 vec3(3, 3);

   Vector2 vec=vec1 + vec2 + vec3;/*
   vec.add(vec1,vec2);
   vec.add(vec,vec3);*/

   return 0;
}

The assembly (generated by calling "g++ -O -S main.cpp") is the following:

	.file	"main.cpp"
	.text
.globl main
	.type	main, @function
main:
.LFB11:
	.cfi_startproc
	.cfi_personality 0x0,__gxx_personality_v0
	pushl	%ebp
	.cfi_def_cfa_offset 8
	movl	%esp, %ebp
	.cfi_offset 5, -8
	.cfi_def_cfa_register 5
	movl	$0, %eax
	popl	%ebp
	ret
	.cfi_endproc
.LFE11:
	.size	main, .-main
	.ident	"GCC: (Debian 4.4.5-8) 4.4.5"
	.section	.note.GNU-stack,"",@progbits

Of course, having side effects in the constructor significantly reduce the efficiency of a simple addition is still pretty unintuitive and quite undesirable behaviour. But instead of falling back to a C-like method, one could simply overload operator+=, which would retain a bit of the beauty of the C++ version whilst being as efficient as the C-like add method in pretty much all cases.
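A minimal sketch of that operator+= idea, using a hypothetical `Vec2` rather than the exact class from the earlier posts. The free `operator+` built on top of it is a common idiom, added here only to show that both spellings can coexist.

```cpp
struct Vec2 {
    int _x, _y;

    Vec2(int x, int y) : _x(x), _y(y) {}

    // Accumulates in place: the call itself creates no new object.
    Vec2 &operator+=(const Vec2 &rhs) {
        _x += rhs._x;
        _y += rhs._y;
        return *this;
    }
};

// operator+ written in terms of operator+=, for when the expression
// form is wanted. lhs is taken by value on purpose so it can be reused
// as the result.
inline Vec2 operator+(Vec2 lhs, const Vec2 &rhs) {
    lhs += rhs;
    return lhs;
}
```

With this, `vec += vec2; vec += vec3;` does the same job as the chained `add` calls, writing straight into `vec`.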

I am sorry if I offended anyone; that is not my intention. I merely wanted to point out that, if you know what you are doing, overloading operators doesn't necessarily slow down your code (and imho it really does make the code look more beautiful).

Fillius

moreson · Post by **moreson** » Mon Feb 11, 2013 6:22 pm

Falco Girgis wrote:I am getting this output:
Constructing < 1 ,  1 > at  0x28fe30 
Constructing < 2 ,  2 > at  0x28fe28 
Constructing < 3 ,  3 > at  0x28fe20 
Constructing < 3 ,  3 > at  0x28fe10 
Constructing < 6 ,  6 > at  0x28fe18 
Since we see no copy constructor being invoked anywhere, we can assume the compiler has optimized away copying the return value (rvalue) to the lvalue.

But even then, we see 5 objects being constructed here, rather than just the three we explicitly constructed. This statement still requires temporary storage allocated (and initialized) on the stack for every addition that is not immediately assigned to an lvalue. For n additions, that's n-1 temporary variables allocated on the stack.

Well, that's because you use a non-trivial constructor with side effects (printing something), so those constructor calls had to be generated. An actual vector class wouldn't have any side effects in its constructor.

#include <cstdio>	// printf
#include <cstdlib>	// rand

struct vec2 {
	int x, y;

	vec2(int xx, int yy) : x(xx), y(yy) {}
	vec2(const vec2 &v) : x(v.x), y(v.y) {}

	vec2 operator+(const vec2 &v) {
		return vec2(x+v.x, y+v.y);
	}
};

int main()
{
	vec2 a = vec2(rand(), rand());
	vec2 b = vec2(rand(), rand());
	vec2 c = vec2(rand(), rand());

	vec2 d = a + b + c;

	printf("%d,%d\n", d.x, d.y);

	return 0;
}

Generates this assembly (g++ 4.7 at -O3):

	call	rand
	movl	%eax, %r14d
	call	rand
	movl	%eax, %ebp
	call	rand
	movl	%eax, %r13d
	call	rand
	movl	%eax, %ebx
	addl	%r14d, %r13d
	call	rand
	addl	%ebp, %ebx
	movl	%eax, %r12d
	call	rand
	leal	(%r12,%r13), %edx
	leal	(%rax,%rbx), %esi
	movl	$.LC0, %edi
	xorl	%eax, %eax
	call	printf

Which as you can see has no overhead at all, it's just straight up plain data in registers being added, no temporaries, no calls, no bloat.

Post by **Falco Girgis** » Fri Apr 19, 2013 5:13 pm

Fillius wrote:there is a problem with the example provided. You effectively prevent the compiler from optimizing away all temporaries by making your constructors non-trivial (because of the debug output).
The last published draft of the current C++ Standard dictates that a non-trivial constructor of temporaries must be called:

In Section 12.2, Page 259:
When an implementation introduces a temporary object of a class that has a non-trivial constructor (12.1,
12.8), it shall ensure that a constructor is called for the temporary object.

Well goddamn. Thanks, Fillius and moreson. I was unaware of this.

Is this just new to C++11, or did the old standards also work in such a manner?
