using normal_distribution in a loop - c++

I'm wondering if there could be a problem with putting normal_distribution in a loop.
Here is the code that uses normal_distribution in this strange way:
std::default_random_engine generator;
//std::normal_distribution<double> distribution(5.0,2.0);
for (int i = 0; i < nrolls; ++i) {
    std::normal_distribution<double> distribution(5.0, 2.0);
    float x = distribution(generator);
}

Putting the normal_distribution object outside the loop may be slightly more efficient than putting it inside the loop. When it's inside the loop, the object is re-constructed on every iteration, whereas if it's outside the loop it's only constructed once.
Comparison of the assembly.
Based on an analysis of the assembly, declaring distribution outside the loop is more efficient.
Let's look at two different functions, along with the corresponding assembly. One of them declares distribution inside the loop, and the other one declares it outside the loop. To simplify the analysis, they're declared const in both cases, so we (and the compiler) know that the distribution doesn't get modified.
You can see the complete assembly here.
// This function is here to prevent the compiler from optimizing out the
// loop entirely
void foo(std::normal_distribution<double> const& d) noexcept;

void inside_loop(double mean, double sd, int n) {
    for (int i = 0; i < n; i++) {
        const std::normal_distribution<double> d(mean, sd);
        foo(d);
    }
}

void outside_loop(double mean, double sd, int n) {
    const std::normal_distribution<double> d(mean, sd);
    for (int i = 0; i < n; i++) {
        foo(d);
    }
}
inside_loop assembly
The assembly for the loop looks like this (compiled with gcc 8.3 at O3 optimization).
.L3:
movapd xmm2, XMMWORD PTR [rsp]
lea rdi, [rsp+16]
add ebx, 1
mov BYTE PTR [rsp+40], 0
movaps XMMWORD PTR [rsp+16], xmm2
call foo(std::normal_distribution<double> const&)
cmp ebp, ebx
jne .L3
Basically, it
- constructs the distribution
- invokes foo with the distribution
- tests to see if it should exit the loop
outside_loop assembly
Using the same compilation options, outside_loop just calls foo repeatedly without re-constructing the distribution. There are fewer instructions, and everything stays in registers (so no need to touch the stack).
.L12:
mov rdi, rsp
add ebx, 1
call foo(std::normal_distribution<double> const&)
cmp ebp, ebx
jne .L12
Are there ever any reasons to declare variables inside a loop?
Yes. There are definitely good times to declare variables inside a loop. If you were modifying distribution somehow inside the loop, then it would make sense to reset it every time just by constructing it again.
Furthermore, if you don't ever use a variable outside of a loop, it makes sense to declare it inside the loop just for the purposes of readability.
Types that fit inside a CPU's registers (so floats, ints, doubles, and small user-defined types) oftentimes have no overhead associated with their construction, and declaring them inside a loop can actually lead to better assembly by simplifying compiler analysis of register allocation.

Looking at the interface of the normal distribution, there is a member called reset, which:
resets the internal state of the distribution
This implies that the distribution may have an internal state. If it does, then you definitely reset that state when you recreate the object at each iteration. Not using the object as intended may, depending on the implementation, interfere with the generated sequence, and at the very least throws work away and is inefficient.
What state could it be? That is certainly implementation defined. Looking at one implementation from LLVM, the normal distribution is defined around here. More specifically, the operator() is here. Looking at the code, there is certainly some state shared between subsequent calls. More specifically, at each subsequent call, the state of the boolean variable _V_hot_ is flipped. If it is true, significantly fewer computations are performed and the value of the stored _V_ is used. If it is false, then _V_ is computed from scratch.
I did not look very deeply into why they chose to do this, but presumably it is because the polar method used there produces two normal variates per round, so the second one can be cached for the next call. Looking only at the computations performed, it should be much faster to rely on the internal state. While this is only one implementation, it shows that the standard allows the usage of internal state, and in some cases it is beneficial.
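To make the caching concrete, here is a minimal sketch of a polar-method distribution that keeps a second variate between calls. This is not the libc++ code and the names are made up; it only illustrates the kind of state that reset() throws away.

#include <cmath>
#include <random>

class sketch_normal {
    double mean_, stddev_;
    double saved_ = 0.0;            // the cached second variate (a standard normal value)
    bool saved_available_ = false;
public:
    sketch_normal(double mean, double stddev) : mean_(mean), stddev_(stddev) {}

    template <class Engine>
    double operator()(Engine& eng) {
        if (saved_available_) {     // cheap path: reuse the cached variate
            saved_available_ = false;
            return saved_ * stddev_ + mean_;
        }
        std::uniform_real_distribution<double> unif(-1.0, 1.0);
        double u, v, s;
        do {                        // Marsaglia polar method: pick a point inside the unit circle
            u = unif(eng);
            v = unif(eng);
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        double factor = std::sqrt(-2.0 * std::log(s) / s);
        saved_ = v * factor;        // cache the second variate for the next call
        saved_available_ = true;
        return (u * factor) * stddev_ + mean_;
    }

    void reset() { saved_available_ = false; }  // analogous to std::normal_distribution::reset()
};

Recreating the object on every iteration is equivalent to calling reset() on every iteration: the cached value is discarded and the expensive path runs every time.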
Later edit:
The GCC libstdc++ implementation of std::normal_distribution can be found here. Note that the operator() calls another function, __generate_impl, which is defined in a separate file here. While different, this implementation has the same flag, here named _M_saved_available that speeds up every other call.

Related

Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

I'm working my arse off trying to implement an alternative for vtables using enums and a ton of macro magic that's really starting to mess with my brain. I'm starting to think I'm not walking the right path since the code is getting uglier and uglier, and will not be fit for production by any means.
How can the pattern of the following code be implemented with the least amount of redirection/operations?
It has to be done in standard C++, up to C++17.
class A {
public:
    virtual void Update() = 0; // A is so pure *¬*
    virtual ~A() = default;    // needed if objects are deleted through an A*
};

class B : public A
{
public:
    void Update() override final
    {
        // DO B STUFF
    }
};

class C : public A
{
public:
    void Update() override final
    {
        // DO C STUFF
    }
};
// class...
int main()
{
std::vector<A*> vecA{};
// Insert instances of B, C, ..., into vecA
for(auto a: vecA) // This for will be inside a main loop
a->Update(); // Ridiculous amount of calls per unit of time
// Free memory
}
PS: If enum, switch and macros are indeed the best option, I guess I'll simply try to freshen up my caches and come up with a better design.
PPS: I know this is micro-optimization... Heck, I need to nano or even pico optimize this (figuratively speaking), so I will simply ignore any utilitarian responses that might come up.
As the first comment said, you have an XY problem here. Reordering the objects is OK, you have many objects but not a huge number of different classes, and there's no need to support types your code doesn't know about at compile time. Polymorphism + virtual functions is the wrong choice.
Instead, use N different containers, one for each type of object, with no indirection. Letting the compiler inline B::Update() into a loop over all the B objects is much better. (For the trivial example below of incrementing one member int, my static performance analysis from looking at the asm puts it at about 24x faster on Skylake with the data hot in L1D cache. AVX2 auto-vectorization vs. call in a loop is really that huge.)
If there was some required order between the objects, including between different types of objects, then some kind of polymorphism or manual dispatching would be appropriate. (e.g. if it mattered what order you processed vecA in, keeping all the B objects separate from all the C objects wouldn't be equivalent.)
If you care about performance, you have to realize that making the source larger might simplify things for the compiler / in the asm output. Checking / dispatching based on the type of each object inside the inner loop is expensive. Using any kind of function pointer or enum to dispatch on a per-object basis can easily suffer from branch mispredicts when you have a mix of different objects.
Looping separately over multiple containers effectively hoists that type check out of the inner loop and lets the compiler devirtualize. (Or better, shrinks each object to not need a vtable pointer, enum, or function pointer in the first place, because its type is statically known.)
Writing out a separate loop for each container with a different type is sort of like fully unrolling a loop over different types after hoisting the type dispatching out of the inner loop. This is necessary for the compiler to inline the calls, which you want if there are a lot of objects of each type. Inlining lets it keep constants in registers across objects, enables SIMD auto-vectorization across multiple objects, and simply avoids the overhead of an actual function call. (Both the call itself and spill/reload of registers.)
You were right that if you did need per-object dispatch, C++ virtual functions are an expensive way to get it when you're using final overrides. You're paying the same runtime cost that would let your code support new derived classes of arbitrary size that it didn't know about at compile time, but not gaining any benefit from that.
Virtual dispatch only works with a level of indirection (e.g. a vector of pointers like you're using), which means you need to manage the pointed-to objects somehow, e.g. by allocating them from vector<B> poolB and vector<C> poolC. I'm not sure whether most implementations of vector<> use realloc() when they need to grow; the new/delete API doesn't have a realloc, so vector may actually copy every time it grows, instead of trying to extend the existing allocation in place. Check what your C++ implementation does, since it might suck compared to what you can do with malloc/realloc.
And BTW, it should be possible to do the new/delete with RAII with no extra overhead for allocation/deallocation, as long as all your classes are trivially destructible. (But note that unique_ptr may defeat other optimizations for using the vector of pointers). std::unique_ptr warns that it's UB to destroy it via a pointer to the base class, so you might have to roll your own. Still, on gcc on x86-64, sizeof(unique_ptr<class C>) is only 8, so it only has a single pointer member. But whatever, separately allocating zillions of tiny objects sucks so don't do it that way in the first place.
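A minimal sketch of the pool idea, reusing the question's A, B and C and assuming the counts are known up front. The reserve() calls matter: if a pool reallocated, the A* pointers into it would dangle.

#include <cstddef>
#include <vector>

void build_pools(std::size_t nB, std::size_t nC,
                 std::vector<B>& poolB, std::vector<C>& poolC,
                 std::vector<A*>& vecA)
{
    poolB.reserve(nB);   // reserve up front: growth would invalidate the pointers taken below
    poolC.reserve(nC);
    for (std::size_t i = 0; i < nB; i++) {
        poolB.emplace_back();
        vecA.push_back(&poolB.back());
    }
    for (std::size_t i = 0; i < nC; i++) {
        poolC.emplace_back();
        vecA.push_back(&poolC.back());
    }
}

The objects now live contiguously per type, so cleanup is two vectors going out of scope instead of a zillion deletes, and vecA is only a view for the (slow) polymorphic loop.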
If you did need some kind of dispatch like the title asks for
If the objects are all similar sizes, then you really want to loop over objects, not pointers-to-objects. That would avoid the extra cache footprint of a vector of pointers, and it avoids the extra pointer-chasing latency that out-of-order execution has to hide to keep the execution units busy. But C++ virtual inheritance doesn't provide any standards-compliant way to get polymorphism for union upoly { B b; C c; } poly_array[1024]; You can hack this up yourself with reinterpret_cast<> in a way that probably works on x86-64 gcc, but you probably shouldn't. See @BeeOnRope's followup: Contiguous storage of polymorphic types. (Also an older Q&A: C++ polymorphism of a object in an array).
If you need that, the highest-performance way would probably be to build it yourself with an enum to index a table of function pointers (or use a switch() if your functions can inline). If your functions don't inline, switch() to a bunch of function-call cases doesn't usually optimize down to a table of function pointers even if they all have the same args (or no args). You usually get a jump table to a block of call instructions, rather than actually doing an indirect call. So there's an extra jump in every dispatch.
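As a rough sketch of the enum-plus-table idea (all names here are made up for illustration; a one-byte tag replaces the vtable pointer and indexes the table):

#include <cstdint>
#include <vector>

enum class Kind : uint8_t { B, C };

struct Obj {                 // common header; real code would keep per-type state after it
    Kind kind;
    int value = 0;
};

void update_b(Obj* o) { o->value += 1; }   // stands in for B::Update()
void update_c(Obj* o) { o->value += 2; }   // stands in for C::Update()

using UpdateFn = void (*)(Obj*);
constexpr UpdateFn update_table[] = { update_b, update_c };   // indexed by Kind

void update_all(std::vector<Obj>& objs) {
    for (Obj& o : objs)
        update_table[static_cast<uint8_t>(o.kind)](&o);
    // With a switch on o.kind instead of the table, the compiler can inline both bodies.
}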
C++17 std::visit with std::variant<B, C> (using non-virtual inheritance for B and C) seems to give you dispatch based on an internal enum. std::visit uses its own jump table to dispatch, even with only 2 possible types, instead of inlining them both and using a conditional branch. It also has to check for the "uninitialized" state all the time. You can get good code if you manually work around that with B *tmp = std::get_if<B>(&my_variant), and a __builtin_unreachable() to tell gcc that nullptr isn't a possibility. But at that point you might as well just roll your own struct polymorph { enum type; union { B b; C c; }; }; (with non-virtual functions) if you don't need an "uninitialized" state. Related: C++ polymorphism of a object in an array.
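Here is a sketch of the variant approach and of the get_if workaround described above; B and C are stand-ins with non-virtual Update(), and __builtin_unreachable() is the GCC/clang builtin:

#include <variant>
#include <vector>

struct B { int m_b = 0; void Update() { m_b++; } };   // non-virtual
struct C { int m_c = 0; void Update() { m_c++; } };

using BorC = std::variant<B, C>;

void update_visit(std::vector<BorC>& v) {
    for (BorC& x : v)
        std::visit([](auto& obj) { obj.Update(); }, x);   // typically dispatches through a jump table
}

void update_get_if(std::vector<BorC>& v) {
    for (BorC& x : v) {
        if (B* b = std::get_if<B>(&x))      b->Update();
        else if (C* c = std::get_if<C>(&x)) c->Update();
        else __builtin_unreachable();   // promise the compiler that valueless_by_exception can't happen
    }
}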
In this case you only have one function, so you can put a function pointer inside each object as a member. Like void (*m_update)(A* this_object). In the caller, pass a pointer to the object as a void* or A*, since it's a non-member function. The implementation of the function will reinterpret_cast<C*>(this_object). (Not dynamic_cast: we're doing our own dispatching, not using C++'s).
If you want to use B and C in other contexts where the function-pointer member would be taking up space for no benefit, you can keep the function pointers in a separate container instead of in the base class. So it would be for(i=0..n) funcptrs[i]( &objects[i] );. As long as your containers don't get out of sync, you're always passing a pointer to a function that knows what to do with it. Use that with union {B b; C c} objects[] (or a vector<union>).
You can use void* if you want, especially if you make a separate array of function pointers. Then the union members don't need to inherit from a common base.
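A sketch of keeping the function pointers in a separate container, parallel to an array of unions (again with stand-in B and C types; the caller is responsible for keeping the two containers in sync):

#include <cstddef>
#include <vector>

struct B { int m_b = 0; };
struct C { int m_c = 0; };

union BorC { B b; C c; BorC() : b() {} };   // one slot big enough for either type

void update_b(void* p) { static_cast<B*>(p)->m_b++; }
void update_c(void* p) { static_cast<C*>(p)->m_c++; }

using UpdateFn = void (*)(void*);

void update_all(std::vector<BorC>& objects, const std::vector<UpdateFn>& funcptrs) {
    // funcptrs[i] knows the real type of objects[i], so no tag is stored in the object itself.
    for (std::size_t i = 0; i < objects.size(); i++)
        funcptrs[i](&objects[i]);
}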
You could use std::function<> to store pointers to instance member functions, but on x86-64 gcc that's a 32-byte object. It's better for your cache footprint to only use 8-byte regular function pointers and write code that knows to pass an explicit pointer equivalent to the this pointer.
Putting a function pointer in each object may take more space than an enum or uint8_t, depending on current size/alignment. A small integer index into a table of function pointers might reduce the size of each instance of your objects vs. a pointer member, especially for 64-bit targets. Smaller objects could easily be worth the couple extra instructions to index an array of function pointers, and the possibly higher mispredict penalty from an extra pointer dereference. Memory / cache misses are often a bottleneck.
I'm assuming you do have some per-instance state, even though you don't show any. If not, then a vector of ordinary function pointers to non-member functions will be much cheaper!
Overhead comparison:
I had a look at the compiler-generated asm (gcc and clang targeting x86-64) for a few ways of doing this.
Source for multiple ways of doing this + asm from x86-64 clang 5.0 on the Godbolt compiler explorer. You can flip it over to gcc, or non-x86 architectures.
class A{
public:
virtual void Update() = 0; // A is so pure *¬*
};
struct C : public A {
int m_c = 0;
public:
void Update() override final
{ m_c++; }
};
int SC = sizeof(C); // 16 bytes because of the vtable pointer
C global_c; // to instantiate a definition for C::Update();
// not inheriting at all gives equivalent asm to making Update non-virtual
struct nonvirt_B //: public A
{
int m_b = 0;
void Update() //override final
{ m_b++; }
};
int SB = sizeof(nonvirt_B); // only 4 bytes per object with no vtable pointer
void separate_containers(std::vector<nonvirt_B> &vecB, std::vector<C> &vecC)
{
for(auto &b: vecB) b.Update();
for(auto &c: vecC) c.Update();
}
clang and gcc auto-vectorize the loop over vecB with AVX2 to process 8 int elements in parallel, so if you don't bottleneck on memory bandwidth (i.e. hot in L1D cache), this loop can increment 8 elements per clock cycle. This loop runs as fast as a loop over a vector<int>; everything inlines and optimizes away and it's just a pointer increment.
The loop over vecC can only do 1 element per clock cycle, because each object is 16 bytes (8 byte vtable pointer, 4 byte int m_c, 4 bytes of padding to the next alignment boundary because the pointer has an 8B alignment requirement.) Without final, the compiler also checks the vtable pointer to see if it's actually a C before using the inlined C::Update(), otherwise it dispatches. It's like what you'd get for a loop over struct { int a,b,c,d; } vecC[SIZE]; doing vecC[i].c++;
final allowed full devirtualization, but our data is mixed with vtable pointers, so compilers just do a scalar add [mem], 1 which can only run at 1 per clock (bottlenecked on 1 per clock store throughput, regardless of the size of the store if it's hot in L1D cache). This mostly defeats SIMD for this example. (With -march=skylake-avx512, gcc and clang do some ridiculous shuffling or gather/scatter that's even slower than scalar, instead of just loading/storing the whole object and adding with a vector that only changes the int member. That's allowed because it doesn't contain any volatile or atomic members, and would run at 2 per clock with AVX2, or 4 per clock with AVX512.) Having your objects up to 12 bytes larger is a major downside if they're small and you have a lot of them.
With multiple members per object, this doesn't necessarily defeat SIMD, but it still costs space in each object, just like an enum or function pointer would.
Since you mentioned the separating axis theorem, I hope you're not planning on storing float x,y pairs in each object. Array-of-structs basically sucks for SIMD, because it needs a lot of shuffling to use the x with the y for the same object. What you want is std::vector<float> x, y or similar, so your CPU can load 4 x values into a register and 4 y values into another register. (Or 8 at a time with AVX).
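For example, a hypothetical struct-of-arrays layout for positions keeps all the x values contiguous and all the y values contiguous, so the compiler can load 8 of each into AVX2 registers without any shuffling:

#include <vector>

struct Positions {          // SoA: one array per field, instead of an array of {x, y} structs
    std::vector<float> x;
    std::vector<float> y;
};

void move_right(Positions& p, float dx) {
    for (float& xi : p.x)   // contiguous floats: trivially auto-vectorizable
        xi += dx;
}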
See Slides: SIMD at Insomniac Games (GDC 2015) for an intro to how to structure your data for SIMD, and some more advanced stuff. See also the sse tag wiki for more guides. Also, the x86 tag wiki has lots of low-level x86 optimization material. Even if you don't manually vectorize anything, with separate arrays for x and y there's a good chance that the compiler can auto-vectorize for you. (Look at the asm output, or benchmark gcc -O3 -march=native vs. gcc -O3 -march=native -fno-tree-vectorize). You may need -ffast-math for some kinds of FP vectorization.
C++ virtual dispatch:
Writing it the way you do in the question, with virtual functions and
std::vector<A*> vecA{};
void vec_virtual_pointers() {
for(auto a: vecA)
a->Update();
}
We get this loop from clang5.0 -O3 -march=skylake
# rbx = &vecA[0]
.LBB2_1: # do{
mov rdi, qword ptr [rbx] # load a pointer from the vector (will be the this pointer for Update())
mov rax, qword ptr [rdi] # load the vtable pointer
call qword ptr [rax] # memory-indirect call using the first entry in the vtable
add rbx, 8 # pointers are 8 bytes
cmp r14, rbx
jne .LBB2_1 # }while(p != vecA.end())
So the final function pointer is at the end of a chain of three dependent loads. Out-of-order execution lets this overlap between iterations (if the branch predicts correctly), but that's a lot of overhead just in total instructions for the front-end, as well as in mispredict penalty. (call [m] is 3 uops, so just the loop itself is 8 uops, and can only issue at best one iteration per 2 cycles on Skylake.) Call/return has overhead too. If the callee is not totally trivial, we probably don't bottleneck on store-forwarding for pushing / popping the return address. See also: Loop with function call faster than an empty loop. (I'm not sure about the throughput of independent store/reload operations on the same address. That would normally require memory renaming, which Skylake doesn't do, to not bottleneck on that if the callee is tiny like here.)
Clang's definition for C::Update() is
C::Update(): # #C::Update()
inc dword ptr [rdi + 8]
ret
If this needed to set up any constants before calculating something, it would be even more expensive to not have it inlined. So, with virtual dispatch, this probably runs at about one per 3 to 5 clocks, instead of about 1 member per clock, on Skylake. (Or 8 members per clock with AVX2 for non-virtual class B which doesn't waste space, and makes auto-vectorization work well.) http://agner.org/optimize/ says Skylake has one per 3 clock call throughput, so let's say a 24x performance loss when the data is hot in L1D cache. Different microarchitectures will be different, of course. See the x86 tag wiki for more x86 perf info.
Union hack:
Probably you should never use this, but you can see from the asm that it will work on x86-64 with clang and gcc. I made an array of unions, and looped over it:
union upoly {
upoly() {} // needs an explicit constructor for compilers not to choke
B b;
C c;
} poly_array[1024];
void union_polymorph() {
upoly *p = &poly_array[0];
upoly *endp = &poly_array[1024];
for ( ; p != endp ; p++) {
A *base = reinterpret_cast<A*>(p);
base->Update(); // virtual dispatch
}
}
A, B and C all have their vtable pointer at the start, so I think this will generally work. We get asm that's basically the same, with one less step of pointer-chasing. (I used a static array instead of a vector, since I was keeping things simple and C-like while sorting out what to cast.)
lea rdi, [rbx + poly_array] ; this pointer
mov rax, qword ptr [rbx + poly_array] ; load it too, first "member" is the vtable pointer
call qword ptr [rax]
add rbx, 16 ; stride is 16 bytes per object
cmp rbx, 16384 ; 16 * 1024
jne .LBB4_1
This is better, and touches less memory, but it's only slightly better for overhead.
std::function from #include <functional>
It can hold any kind of callable thing. But it has even more overhead than vtable dispatch, because it's allowed to be in an error-if-used state. So the inner loop has to check every instance for that, and trap if it is. Also, sizeof(std::function<void()>) is 32 bytes (on the x86-64 System V ABI).
#include <functional>
// pretty crappy: checks for being possibly unset to see if it should throw().
std::vector<std::function<void()>> vecF{};
void vec_functional() {
for(auto f: vecF) f();
}
# do {
.LBB6_2: # =>This Inner Loop Header: Depth=1
mov qword ptr [rsp + 16], 0 # store a 0 to a local on the stack?
mov rax, qword ptr [rbx + 16]
test rax, rax
je .LBB6_5 # throw on pointer==0 (nullptr)
mov edx, 2 # third arg: 2
mov rdi, r14 # first arg: pointer to local stack memory (r14 = rsp outside the loop)
mov rsi, rbx # second arg: point to current object in the vector
call rax # otherwise call into it with 2 args
mov rax, qword ptr [rbx + 24] # another pointer from the std::function<>
mov qword ptr [rsp + 24], rax # store it to a local
mov rcx, qword ptr [rbx + 16] # load the first pointer again
mov qword ptr [rsp + 16], rcx
test rcx, rcx
je .LBB6_5 # check the first pointer for null again (and throw if null)
mov rdi, r14
call rax # call through the 2nd pointer
mov rax, qword ptr [rsp + 16]
test rax, rax
je .LBB6_12 # optionally skip a final call
mov edx, 3
mov rdi, r14
mov rsi, r14
call rax
.LBB6_12: # in Loop: Header=BB6_2 Depth=1
add rbx, 32
cmp r15, rbx
jne .LBB6_2
.LBB6_13: # return
add rsp, 32
pop rbx
pop r14
pop r15
ret
.LBB6_5:
call std::__throw_bad_function_call()
jmp .LBB6_16
mov rdi, rax
call __clang_call_terminate
So there are up to three call instructions unless the pointer is nullptr. This looks far worse than virtual dispatch.
It looks a bit different with clang -stdlib=libc++, instead of the default libstdc++. (https://libcxx.llvm.org/). But still three call instructions in the inner loop, with conditionals to skip them or throw.
Unless the code-gen is very different for different kinds of function<T>, it's probably not even worth looking at it for pointers to member functions if you can write a more efficient alternative.
If you really need virtual dispatch, one method to speed up the dispatch for the same virtual method on a list of objects of varying derived types is to use what I'll call type-unswitching.
Somewhat analogously to loop unswitching, this transforms the single loop calling the method on every object in order into N loops (for N supported types) which each call the method on all objects of a specific type. This avoids the primary cost of unpredictable virtual dispatch: the branch mis-predictions implied by the indirect call of an unknown, unpredictable function in the vtable.
The generic implementation of this technique involves a first pass to partition the objects by type: information about this partition is used by the second pass, which has a separate loop for each type1, calling the method. This generally doesn't involve any unpredictable branches at all, if implemented carefully.
In the case of two derived classes B and C you can simply use a bitmap to store the type information. Here's an example implementation, using the types A, B, C from the code in the question:
void virtual_call_unswitch(std::vector<A*>& vec) {
// first create a bitmap which specifies whether each element is B or C type
std::vector<uint64_t> bitmap(vec.size() / 64);
for (size_t block = 0; block < bitmap.size(); block++) {
uint64_t blockmap = 0;
for (size_t idx = block * 64; idx < block * 64 + 64; idx++) {
blockmap >>= 1;
blockmap |= (uint64_t)vec[idx + 0]->typecode_ << 63;
}
bitmap[block] = blockmap;
}
// now loop over the bitmap handling all the B elements, and then again for all the C elements
size_t blockidx;
// B loop
blockidx = 0;
for (uint64_t block : bitmap) {
block = ~block;
while (block) {
size_t idx = blockidx + __builtin_ctzl(block);
B* obj = static_cast<B*>(vec[idx]);
obj->Update();
block &= (block - 1);
}
blockidx += 64;
}
// C loop
blockidx = 0;
for (uint64_t block : bitmap) {
while (block) {
size_t idx = blockidx + __builtin_ctzl(block);
C* obj = static_cast<C*>(vec[idx]);
obj->Update();
block &= (block - 1);
}
blockidx += 64;
}
}
Here, typecode_ is a common field in A which identifies the object type, 0 for B and 1 for C. Something similar is needed to make the categorization by type feasible (it can't be a virtual call, since making an unpredictable call is what we're trying to avoid in the first place).
A slightly optimized version of the above shows about a 3.5x speedup for the unswitched version over the plain virtually dispatched loop, with the virtual version clocking in about 19 cycles per dispatch, and the unswitched version at around 5.5. Full results:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BenchWithFixture/VirtualDispatchTrue 30392 ns 30364 ns 23033 128.646M items/s
BenchWithFixture/VirtualDispatchFakeB 3564 ns 3560 ns 196712 1097.34M items/s
BenchWithFixture/StaticBPtr 3496 ns 3495 ns 200506 1117.6M items/s
BenchWithFixture/UnswitchTypes 8573 ns 8571 ns 80437 455.744M items/s
BenchWithFixture/StaticB 1981 ns 1981 ns 352397 1.9259G items/s
VirtualDispatchTrue is the simple loop calling Update() on a pointer of type A:
for (A *a : vecA) {
a->Update();
}
VirtualDispatchFakeB casts the pointer to B* (regardless of what the underlying type is) before calling Update(). Since B::Update() is final, the compiler can fully de-virtualize and inline the call. Of course, this isn't doing the right thing at all: it's treating any C objects as B and so calling the wrong method (and is totally UB) - but it's here to estimate how fast you could call methods on a vector of pointers if every object was the same statically known type.
for (A *a : vecA) {
((B *)a)->Update();
}
StaticBPtr iterates over a std::vector<B*> rather than a std::vector<A*>. As expected the performance is the same as the "fake B" code above, since the target for Update() is statically known and fully inlinable. It's here as a sanity check.
UnswitchTypes is the type unswitching trick described above.
StaticB iterates over a std::vector<B>. That is, contiguously allocated B objects rather than a vector of pointers to B objects. This removes one level of indirection and shows something like the best case for this object layout2.
The full source is available.
Limitations
Side-effects and Order
The key limitation with this technique is that the order of Update() calls shouldn't matter. While Update() is still called once on each object, the order has clearly changed. As long as the object doesn't update any mutable global or shared state, this should be easy to satisfy.
Supports Only Two Types
The code above supports only two types, based on the use of a single bitmap to record type information.
This restriction is fairly easy to remove. First, the bitmap approach can be extended. E.g., to support 4 types, two similar bitmaps can be created, where the corresponding bits of each bitmap essentially form a 2-bit field encoding the type. The loops are similar, except that in the outer loop they & and ~ the bitmaps together in ways that cover all 4 types. E.g.:
// type 1 (code 11)
for (size_t i = 0; i < bitmap1.size(); i++) {
block = bitmap1[i] & bitmap2[i];
...
}
// type 2 (code 01)
for (size_t i = 0; i < bitmap1.size(); i++) {
block = ~bitmap1[i] & bitmap2[i];
...
}
...
Another approach is to not use bitmaps at all, but simply store an array of indexes per type. Each index in an array points to an object of that type in the master array. Essentially it's a 1-pass radix sort on the type code. This probably makes the type sorting part a bit slower, but potentially speeds up the loop iteration logic (the x & (x - 1) and ctz stuff disappears, at the cost of another indirection).
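A sketch of that alternative, assuming the same A, B, C and typecode_ field as the bitmap version above:

#include <cstdint>
#include <vector>

void virtual_call_partition(std::vector<A*>& vec) {
    // First pass: bucket indices by type (a 1-pass radix sort on the type code).
    std::vector<uint32_t> b_idx, c_idx;
    for (uint32_t i = 0; i < vec.size(); i++) {
        if (vec[i]->typecode_ == 0) b_idx.push_back(i);
        else                        c_idx.push_back(i);
    }
    // Second pass: one tight loop per type, each call target statically known and inlinable.
    for (uint32_t i : b_idx) static_cast<B*>(vec[i])->Update();
    for (uint32_t i : c_idx) static_cast<C*>(vec[i])->Update();
}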
Fixed Number of Supported Types
The code above supports a fixed number of compile-time known types (namely, B and C). If a new type is introduced, the code above will either break or, at the very least, fail to call Update() on these new types.
However, it is straightforward to add support for unknown types. Simply group all unknown types, and then for those types only, do a full virtual dispatch within the loop (i.e., call Update() directly on the A*). You'll pay the full price, but only for types which you didn't explicitly support! In this way, the technique retains the generality of the virtual dispatch mechanism.
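For example, extending the index-bucket sketch above (a typecode_ of 0 for B, 1 for C, and anything else for types this code doesn't know about):

#include <vector>

void update_with_fallback(std::vector<A*>& vec) {
    std::vector<A*> bs, cs, unknown;
    for (A* a : vec) {
        if      (a->typecode_ == 0) bs.push_back(a);
        else if (a->typecode_ == 1) cs.push_back(a);
        else                        unknown.push_back(a);   // types added after this code was written
    }
    for (A* a : bs) static_cast<B*>(a)->Update();   // devirtualized and inlinable
    for (A* a : cs) static_cast<C*>(a)->Update();
    for (A* a : unknown) a->Update();               // full virtual dispatch, only for unsupported types
}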
1 Actually, you need only one loop per group of types that all share the same implementation of the virtual method, although this might be hard to implement in a generic manner since this information isn't readily available. For example if classes Y and Z both derive from X, but neither overrides the implementation of some virtual method from X, then all of X, Y and Z can be handled by the same loop.
2 By "object layout" I mean B objects that still have virtual methods and hence a vtable. If you remove all the virtual methods and get rid of the vtable, things go much faster since the compiler then vectorizes the addition to the compactly arranged fields. The vtable messes that up.

Will a C/C++ compiler optimise code by reusing a recently calculated function result?

Suppose I have a function double F(double x) and let's assume for the sake of this example that calls to F are costly.
Suppose I write a function f that calculates the square root of F:
double f(double x){
return sqrt(F(x));
}
and in a third function sum I calculate the sum of f and F:
double sum(double x){
return F(x) + f(x);
}
Since I want to minimise calls to F the above code is inefficient compared to e.g.
double sum_2(double x){
double y = F(x);
return y + sqrt(y);
}
But since I am lazy, or stupid, or want to make my code as clear as possible, I opted for the first definition instead.
Would a C/C++ compiler optimise my code anyway by realizing that the value of F(x) can be reused to calculate f(x), as it is done in sum_2?
Many thanks.
Would a C/C++ compiler optimise my code anyway by realizing that the value of F(x) can be reused to calculate f(x), as it is done in sum_2?
Maybe. Neither language requires such an optimization, and whether they allow it depends on details of the implementation of F(). Generally speaking, different compilers behave differently with respect to this sort of thing.
It is entirely plausible that a compiler would inline function f() into function sum(), which would give it the opportunity to recognize that there are two calls to F(x) contributing to the same result. In that case, if F() has no side effects then it is conceivable that the compiler would emit only a single call to F(), reusing the result.
Particular implementations may have extensions that can be employed to help the compiler come to such a conclusion. Without such an extension being applied to the problem, however, I rate it unlikely that a compiler would emit code that performs just one call to F() and reuses the result.
What you're describing is called memoization, a form of (usually) run-time caching. While it's possible for this to be implemented in compilers, it's most often not performed in C compilers.
C++ does have a clever workaround for this, using the STL, detailed at this blog post from a few years ago; there's also a slightly more recent SO answer here. It's worth noting that with this approach, the compiler isn't "smartly" inferring that a function's multiple identical results will be reused, but the effect is largely the same.
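A minimal sketch of that kind of hand-rolled memoization, as a hypothetical wrapper around the question's F (this is something you write yourself, not something the compiler does for you):

#include <cmath>
#include <map>

double F(double x);                          // the expensive function, defined elsewhere

double F_memo(double x) {
    static std::map<double, double> cache;   // exact-match keys; not thread-safe - a sketch only
    auto it = cache.find(x);
    if (it != cache.end()) return it->second;
    double r = F(x);
    cache.emplace(x, r);
    return r;
}

double f(double x)   { return std::sqrt(F_memo(x)); }
double sum(double x) { return F_memo(x) + f(x); }   // the second lookup hits the cache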
Some languages, like Haskell, do feature baked-in support for compile-time memoization, but the compiler architecture is fairly different from Clang or GCC/G++.
Many compilers use hints to figure out if a result of a previous function call may be reused. A classical example is:
for (int i=0; i < strlen(str); i++)
Without optimizing this, the complexity of this loop is at least O(n^2), but after optimization it can be O(n).
The hints that gcc, clang, and many others can take are __attribute__((pure)) and __attribute__((const)) which are described here. For example, GNU strlen is declared as a pure function.
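For example (GCC/clang attribute syntax, with F assumed to be defined elsewhere and genuinely side-effect free):

#include <cmath>

// 'pure' promises that F's result depends only on its arguments and on memory it does not modify.
// After inlining f into sum, the compiler is then allowed to fold the two F(x) calls into one.
double F(double x) __attribute__((pure));

double f(double x)   { return std::sqrt(F(x)); }
double sum(double x) { return F(x) + f(x); }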
GCC can detect pure functions and suggest to the programmer which functions should be marked pure. In fact, it does that automatically for the following simplistic example:
unsigned my_strlen(const char* str)
{
int i=0;
while (str[i])
++i;
return i;
}
unsigned word_len(const char *str)
{
for (unsigned i=0 ; i < my_strlen(str); ++i) {
if (str[i] == ' ')
return i;
}
return my_strlen(str);
}
You can see the compilation result for gcc with -O3 -fno-inline. It calls my_strlen(str) only once in the whole word_len function. Clang 7.0.0 does not seem to perform this optimization.
word_len:
mov rcx, rdi
call my_strlen ; <-- this is the only call (outside any loop)
test eax, eax
je .L7
xor edx, edx
cmp BYTE PTR [rcx], 32
lea rdi, [rdi+1]
jne .L11
jmp .L19
.L12:
add rdi, 1
cmp BYTE PTR [rdi-1], 32
je .L9
.L11:
add edx, 1
cmp eax, edx
jne .L12
.L7:
ret
.L19:
xor edx, edx
.L9:
mov eax, edx
ret

No performance gain with SOA and AOS with SIMD [duplicate]

EDIT: As Cody Gray pointed out in his comment, profiling with optimization disabled is a complete waste of time. How then should I approach this test?
Microsoft's XMVectorZero uses _mm_setzero_ps when _XM_SSE_INTRINSICS_ is defined, and {0.0f,0.0f,0.0f,0.0f} when it isn't. I decided to check how big the win is. So I used the following program in Release x86, with Configuration Properties > C/C++ > Optimization > Optimization set to Disabled (/Od).
constexpr __int64 loops = 1e9;
inline void fooSSE() {
for (__int64 i = 0; i < loops; ++i) {
XMVECTOR zero1 = _mm_setzero_ps();
//XMVECTOR zero2 = _mm_setzero_ps();
//XMVECTOR zero3 = _mm_setzero_ps();
//XMVECTOR zero4 = _mm_setzero_ps();
}
}
inline void fooNoIntrinsic() {
for (__int64 i = 0; i < loops; ++i) {
XMVECTOR zero1 = { 0.f,0.f,0.f,0.f };
//XMVECTOR zero2 = { 0.f,0.f,0.f,0.f };
//XMVECTOR zero3 = { 0.f,0.f,0.f,0.f };
//XMVECTOR zero4 = { 0.f,0.f,0.f,0.f };
}
}
int main() {
fooNoIntrinsic();
fooSSE();
}
I ran the program twice: first with only zero1, and a second time with all lines uncommented. In the first case the intrinsic loses, in the second the intrinsic is the clear winner. So, my questions are:
Why doesn't the intrinsic always win?
Is the profiler I used a proper tool for such measurements?
Profiling things with optimization disabled gives you meaningless results and is a complete waste of time. If you are disabling optimization because otherwise the optimizer notices that your benchmark actually does nothing useful and is removing it entirely, then welcome to the difficulties of microbenchmarking!
It is often very difficult to concoct a test case that actually does enough real work that it will not be removed by a sufficiently smart optimizer, yet the cost of that work does not overwhelm and render meaningless your results. For example, a lot of people's first instinct is to print out the incremental results using something like printf, but that's a non-starter because printf is incredibly slow and will absolutely ruin your benchmark. Making the variable that collects the intermediate values volatile will sometimes work because it effectively disables load/store optimizations for that particular variable. Although this relies on ill-defined semantics, that's not important for a benchmark. Another option is to perform some pointless yet relatively cheap operation on the intermediate results, like adding them together. This relies on the optimizer not outsmarting you, and in order to verify that your benchmark results are meaningful, you'll have to examine the object code emitted by the compiler and ensure that the code is actually doing the thing. There is no magic bullet for crafting a microbenchmark, unfortunately.
The best trick is usually to isolate the relevant portion of the code inside of a function, parameterize it on one or more unpredictable input values, arrange for the result to be returned, and then put this function in an external module such that the optimizer can't get its grubby paws on it.
Since you'll need to look at the disassembly anyway to confirm that your microbenchmark case is suitable, this is often a good place to start. If you are sufficiently competent in reading assembly language, and you have sufficiently distilled the code in question, this may even be enough for you to make a judgment about the efficiency of the code. If you can't make heads or tails of the code, then it is probably sufficiently complicated that you can go ahead and benchmark it.
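As a rough sketch of that advice (hypothetical file and function names; the essential parts are that the test article lives in a separate translation unit, so the optimizer can't see through the call, and that the result is consumed):

// article.cpp - the test article, compiled separately with release optimizations
#include <xmmintrin.h>

__m128 zero_intrinsic() { return _mm_setzero_ps(); }

// driver.cpp - the driver; the volatile sink keeps the loop from being deleted
#include <cstdio>
#include <xmmintrin.h>

__m128 zero_intrinsic();   // defined in the other translation unit

int main() {
    volatile float sink = 0.0f;
    for (long long i = 0; i < 1000000000LL; ++i)
        sink = sink + _mm_cvtss_f32(zero_intrinsic());
    std::printf("%f\n", static_cast<double>(sink));
}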
This is a good example of when a cursory examination of the generated object code is sufficient to answer the question without even needing to craft a benchmark.
Following my advice above, let's write a simple function to test out the intrinsic. In this case, we don't have any input to parameterize upon because the code literally just sets a register to 0. So let's just return the zeroed structure from the function:
DirectX::XMVECTOR ZeroTest_Intrinsic()
{
return _mm_setzero_ps();
}
And here is the other candidate that performs the initialization the seemingly-naïve way:
DirectX::XMVECTOR ZeroTest_Naive()
{
return { 0.0f, 0.0f, 0.0f, 0.0f };
}
Here is the object code generated by the compiler for these two functions (it doesn't matter which version, whether you compile for x86-32 or x86-64, or whether you optimize for size or speed; the results are the same):
ZeroTest_Intrinsic
xorps xmm0, xmm0
ret
ZeroTest_Naive
xorps xmm0, xmm0
ret
(If AVX or AVX2 instructions are supported, then these will both be vxorps xmm0, xmm0, xmm0.)
That is pretty obvious, even to someone who cannot read assembly code. They are both identical! I'd say that pretty definitively answers the question of which one will be faster: they will be identical because the optimizer recognizes the seemingly-naïve initializer and translates it into a single, optimized assembly-language instruction for clearing a register.
Now, it is certainly possible that there are cases where this is embedded deep within various complicated code constructs, preventing the optimizer from recognizing it and performing its magic. In other words, the "your test function is too simple!" objection. And that is most likely why the library's implementer chose to explicitly use the intrinsic whenever it is available. Its use guarantees that the code-gen will emit the desired instruction, and therefore the code will be as optimized as possible.
Another possible benefit of explicitly using the intrinsic is to ensure that you get the desired instruction, even if the code is being compiled without SSE/SSE2 support. This isn't a particularly compelling use-case, as I imagine it, because you wouldn't be compiling without SSE/SSE2 support if it was acceptable to be using these instructions. And if you were explicitly trying to disable the generation of SSE/SSE2 instructions so that you could run on legacy systems, the intrinsic would ruin your day because it would force an xorps instruction to be emitted, and the legacy system would throw an invalid operation exception immediately upon hitting this instruction.
I did see one interesting case, though. xorps is the single-precision version of this instruction, and requires only SSE support. However, if I compile the functions shown above with only SSE support (no SSE2), I get the following:
ZeroTest_Intrinsic
xorps xmm0, xmm0
ret
ZeroTest_Naive
push ebp
mov ebp, esp
and esp, -16
sub esp, 16
mov DWORD PTR [esp], 0
mov DWORD PTR [esp+4], 0
mov DWORD PTR [esp+8], 0
mov DWORD PTR [esp+12], 0
movaps xmm0, XMMWORD PTR [esp]
mov esp, ebp
pop ebp
ret
Clearly, for some reason, the optimizer is unable to apply the optimization to the use of the initializer when SSE2 instruction support is not available, even though the xorps instruction that it would be using does not require SSE2 instruction support! This is arguably a bug in the optimizer, but explicit use of the intrinsic works around it.

Busy polling std::atomic - msvc optimizes loop away - why, and how to prevent?

I'm trying to implement a simple busy loop function.
This should keep polling a std::atomic variable for a maximum number of times (spinCount), and return true if the status did change (to anything other than NOT_AVAILABLE) within the given tries, or false otherwise:
// noinline is just to be able to inspect the resulting ASM a bit easier - in final code, this function SHOULD be inlined!
__declspec(noinline) static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
int iSpinCount = 0;
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
return iSpinCount == spinCount;
}
However, it seems that MSVC just optimizes the loop away in Release mode for Win64. I'm pretty bad with assembly, but it doesn't look to me like it's ever even trying to read the value of statusPtr at all:
int iSpinCount = 0;
000000013F7E2040 xor eax,eax
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
000000013F7E2042 inc eax
000000013F7E2044 cmp eax,edx
000000013F7E2046 jge trySpinWait+12h (013F7E2052h)
000000013F7E2048 mov r8d,dword ptr [rcx]
000000013F7E204B test r8d,r8d
000000013F7E204E je trySpinWait+2h (013F7E2042h)
return iSpinCount == spinCount;
000000013F7E2050 cmp eax,edx
000000013F7E2052 sete al
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this, but it seems that's not the case (or rather, my understanding was probably wrong).
What am I doing wrong here, or rather - how can I best implement that loop without having it optimized away, with least impact on overall performance?
I know I could use #pragma optimize( "", off ), but (other than in the example above), in my final code I'd very much like to have this call inlined into a larger function for performance reasons. seems that this #pragma will generally prevent inlining though.
Appreciate any thoughts!
Thanks
but doesn't look to me like it's ever even trying to read the value of statusPtr at all
It does reload it on every iteration of the loop:
000000013F7E2048 mov r8d,dword ptr [rcx] # rcx is statusPtr
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this,
You do not need anything more than std::memory_order_relaxed here because there is only one variable shared between threads (what's more, this code doesn't change the value of the atomic variable). There are no reordering concerns.
In other words, this function works as expected.
You may like to use the PAUSE instruction; see Benefitting Power and Performance Sleep Loops.
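Putting both suggestions together, a sketch of the loop with relaxed loads and a pause hint might look like this (the Status enum is assumed to resemble the question's, and _mm_pause() is the x86 intrinsic for PAUSE):

#include <atomic>
#include <emmintrin.h>   // _mm_pause

enum class Status { NOT_AVAILABLE, AVAILABLE };   // assumption: something like the question's enum

static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
    for (int i = 0; i < spinCount; ++i) {
        // relaxed is enough if the flag itself is the only shared data;
        // use memory_order_acquire if the status guards other memory.
        if (statusPtr->load(std::memory_order_relaxed) != Status::NOT_AVAILABLE)
            return true;             // the status changed within the budget
        _mm_pause();                 // PAUSE: friendlier to the sibling hyperthread, saves power
    }
    return false;
}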

Performance issue C++ - searching through an array

I have two versions of searching through an int array for a specific value.
The first version is the straightforward one
int FindWithoutBlock(int * Arr, int ArrLen, int Val)
{
for ( int i = 0; i < ArrLen; i++ )
if ( Arr[i] == Val )
return i;
return ArrLen;
}
The second version should be faster. The passed array needs to be one element larger than in the previous case. Say for an array with 5 values, you allocate six ints and then do the following
int FindWithBlock(int * Arr, int LastCellIndex, int Val)
{
Arr[LastCellIndex] = Val;
int i;
for ( i = 0 ; Arr[i] != Val; i++ );
return i;
}
This version should be faster - you don't need to check array boundaries on each iteration through Arr.
Now the "issue". When running these functions 100K times on an array of 100K elements in Debug, the second version is roughly 2x faster. In Release however, the first version is approximately 6000x faster. And the question is why.
A program that demonstrates this is to be found at http://eubie.sweb.cz/main.cpp
Any insight is much appreciated.
Daniel
Here are my results using DevStudio 2005:
Debug:
Without block: 25.109
With block: 19.703
Release:
Without block: 0
With block: 6.046
It is very important to run this from the command line and not from within DevStudio, DevStudio does something to affect the performance of the app.
The only way to know what's really happening is to look at the assembler code. Here's the assembler generated in release:-
FindWithoutBlock:
00401000 xor eax,eax
00401002 cmp dword ptr [ecx+eax*4],0F4240h
00401009 je FindWithoutBlock+1Ah (40101Ah)
0040100B add eax,1
0040100E cmp eax,186A0h
00401013 jl FindWithoutBlock+2 (401002h)
00401015 mov eax,186A0h
0040101A ret
Note that the compiler has removed the ArrLen parameter and replaced it with a constant! It has also kept it as a function.
Here's what the compiler did with the other function (FindWithBlock):-
004010E0 mov dword ptr [esp+38h],186A0h
004010E8 mov ebx,0F4240h
004010ED mov dword ptr [esi+61A80h],ebx
004010F3 xor eax,eax
004010F5 cmp dword ptr [esi],ebx
004010F7 je main+0EFh (40110Fh)
004010F9 lea esp,[esp]
00401100 add eax,1
00401103 cmp dword ptr [esi+eax*4],ebx
00401106 jne main+0E0h (401100h)
00401108 cmp eax,186A0h
0040110D je main+0F5h (401115h)
0040110F call dword ptr [__imp__getchar (4020D0h)]
00401115 sub dword ptr [esp+38h],1
0040111A jne main+0CDh (4010EDh)
Here, the function has been in-lined. The lea esp,[esp] is just a 7 byte nop to align the next instruction. The code checks index 0 separately to all the other indices, but the main loop is definitely tighter than the FindWithoutBlock version.
Hmmm. Here's the code that calls FindWithoutBlock:-
0040106F mov ecx,edi
00401071 mov ebx,eax
00401073 call FindWithoutBlock (401000h)
00401078 mov ebp,eax
0040107A mov edi,186A0h
0040107F cmp ebp,186A0h
00401085 je main+6Dh (40108Dh)
00401087 call dword ptr [__imp__getchar (4020D0h)]
0040108D sub edi,1
00401090 jne main+5Fh (40107Fh)
Aha! The FindWithoutBlock function is only being called once! The compiler has spotted that the function will return the same value every time and has optimised it to a single call. In the FindWithBlock case, the compiler can't make the same assumption because you write to the array before the search, thus the array is (potentially) different for each call.
To test this, add the volatile keyword like this:-
int FindWithoutBlock(volatile int * Arr, int ArrLen, int Val)
{
for ( int i = 0; i < ArrLen; i++ )
if ( Arr[i] == Val )
return i;
return ArrLen;
}
int FindWithBlock(volatile int * Arr, int LastCellIndex, int Val)
{
Arr[LastCellIndex] = Val;
int i;
for ( i = 0 ; Arr[i] != Val; i++ );
return i;
}
Doing this, both versions run in similar time (6.040). Seeing as the memory access is the major bottleneck, the more complex test in FindWithoutBlock doesn't impact the overall speed.
First, ewwwwww disgusting C garbage. std::find and iterators?
But secondly, the compiler's optimizer is written to recognize the first form - not the second. It may be, for example, inlined, unrolled, or vectorized, whereas the second cannot be.
In the general case, consider the cache issue. You are touching the end of the array and then going back to the beginning - this could be a cache miss. In the first version, however, you are cheerily going only sequentially through the array - more cache-friendly.
This is more of an extended comment than an answer. Skizz already answered the question with "Aha! The FindWithoutBlock function is only being called once!"
Test driver
I typically tend to put the code for the test driver and the test article in separate files. For one thing, you aren't going to deliver the test driver. For another, combining them like you did lets the optimizer do things you really do not want to be done such as calling the function once rather than 100,000 times. Separating them lets you use different optimization levels for the driver and test article. I tend to compile the driver unoptimized so that the loop that does the same thing 100K times truly is executed 100K times. The test article on the other hand is compiled with the optimization expected for the release.
Use of getchar()
It's usually a bad idea to use any I/O inside the test loop when testing for CPU utilization. Your test code is calling getchar when the item to be found is not in the array. [Rest of faulty analysis elided.] Update: Your test code calls getchar when the item to be found is in the array. Even though your test code ensures the item will not be found (and hence getchar won't be called) it's still not a good idea to have that call. Do something fast and benign instead.
C versus C++
Your code looks more like C± rather than C++. You are using malloc rather than new, you are intermingling C and C++ I/O, and you aren't using the C++ library such as std::find. This is typical for someone moving from C to C++. It's good to be aware of things like std::find. This allows you to completely eliminate your FindWithoutBlock function.
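For example, a sketch of the std::find equivalent of FindWithoutBlock (it returns ArrLen when Val is not present, just like the original):

#include <algorithm>

int FindWithStdFind(const int* Arr, int ArrLen, int Val)
{
    return static_cast<int>(std::find(Arr, Arr + ArrLen, Val) - Arr);
}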
Premature optimization
The only reason to use that FindWithBlock formulation is because this search is a bottleneck. So is this truly a bottleneck? The FindWithoutBlock formulation (and even better, std::find) is arguably a better way to go because you do not need to modify the array and hence the array argument can be marked as const. The array cannot be marked as such with FindWithBlock because you are modifying the array.
What I observe is that in the first case, the trip count of the loop is known up front (i < ArrLen). In the second case, it cannot be known.
Each iteration of the first for loop checks two conditions, while each iteration of the second checks only one. For a large number of iterations, this difference should show, because there is a RAW dependency between the second condition and the iterator increment. But I still don't think that the speedup should be so high.
In the first example there are two conditions checked at every iteration: i < ArrLen and Arr[i] == Val. In the second example there's only one condition to check. That's why the first loop is twice as slow.
I can't observe the same behavior using GCC: the first loop is still slower.
With -O0:
Without block: 25.83
With block: 20.35
With -O3:
Without block: 6.33
With block: 4.75
I guess that the compiler somehow deduced that there is no SearchVal in the array and thus there's no reason to call a function which searches for it.
Your compiler is smart.
If you use the LLVM Try Out page, you will obtain the following IR generated:
define i32 @FindWithoutBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable readonly
define i32 @FindWithBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable
The only difference is the presence of the readonly attribute on the first function:
From the Language Reference page:
readonly
This attribute indicates that the function does not write through any pointer arguments (including byval arguments) or otherwise modify any state (e.g. memory, control registers, etc) visible to caller functions. It may dereference pointer arguments and read state that may be set in the caller. A readonly function always returns the same value (or unwinds an exception identically) when called with the same set of arguments and global state. It cannot unwind an exception by calling the C++ exception throwing methods.
It means that, potentially, the optimizer may realise that the function will always return the same computation (for a given loop) and hoist it outside the loop.
