Redundant computations in attempt to parallelize recursive function with OpenMP - c++

I have a recursive function that calls itself twice. My attempt to parallelize the function eventually produces the correct result, but it performs a lot of redundant computation along the way, wiping out any gains from parallelism.
The main program is trying to compute an auxiliary graph, which is an intermediate data structure required in computing all k-edge connected components of a graph.
I've been having a go at this problem for months now and I only decided to ask for help here as a last resort. I will appreciate any comments or suggestions pointing me in the right direction; I'm not necessarily looking for a solution on a plate.
I tried using the #pragma omp single nowait, but that only resulted in sequential execution of the code.
I also tried cilk_spawn at one point, but that only resulted in my computer running out of memory; I suspect far too many tasks were spawned.
I extracted the spirit of the problem into a minimum working example that I paste below.
The code posted below repeats each computation about eight times. I suspect that my eight hardware threads each run a separate copy of the whole recursion instead of dividing the work between them.
#include <iostream>
#include <omp.h>
#include <numeric>
#include <vector>
#include <random>
#include <algorithm>

using namespace std;

int foo(std::vector<int> V, int s){
    int n = V.size();
    if (n > 1){
        std::cout << n << " ";
        std::random_device rd;                              // obtain a random number from hardware
        std::mt19937 eng(rd());                             // seed the generator
        std::uniform_int_distribution<int> distr(0, n - 1); // define the range
        int t = 1;
        auto first = V.begin();
        auto mid   = V.begin() + (t);
        auto mid_1 = V.begin() + (t);
        std::vector<int> S(first, mid);
        std::vector<int> T(mid_1, V.end());
        #pragma omp parallel
        {
            #pragma omp task
            foo(S, s);
            #pragma omp task
            foo(T, t);
        }
    }
    return 0;
}

int main(){
    std::vector<int> N(100);
    iota(N.begin(), N.end(), 0);
    int p = foo(N, 0);
    return 0;
}
My aim is to have all processes/threads work together to complete the recursion.

The correct way to apply task parallelism with OpenMP for your example would be as follows.
int foo(std::vector<int> V, int s)
{
    int n = V.size();
    if (n > 1)
    {
        std::cout << n << " ";
        std::random_device rd;                              // obtain a random number from hardware
        std::mt19937 eng(rd());                             // seed the generator
        std::uniform_int_distribution<int> distr(0, n - 1); // define the range
        int t = 1;
        auto first = V.begin();
        auto mid   = V.begin() + (t);
        auto mid_1 = V.begin() + (t);
        std::vector<int> S(first, mid);
        std::vector<int> T(mid_1, V.end());
        #pragma omp task
        foo(S, s);
        #pragma omp task
        foo(T, t);
    }
    return 0;
}

int main()
{
    std::vector<int> N(10000);
    std::iota(N.begin(), N.end(), 0);
    #pragma omp parallel
    #pragma omp single
    {
        int p = foo(N, 0);
    }
    return 0;
}
That said, this particular example won't show a performance improvement, because it is very fast on its own and dominated by memory allocation. So if you do not see a benefit from applying this, feel free to update or post a new question with a more specific example.
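One refinement worth knowing about once the recursion does real work (a sketch of my own, not part of the original answer; the 1000 threshold is an arbitrary placeholder to tune): an if clause on the task constructs serializes small subproblems so you don't drown in task-management overhead.

int foo(std::vector<int> V, int s)
{
    int n = V.size();
    if (n > 1)
    {
        int t = 1;
        std::vector<int> S(V.begin(), V.begin() + t);
        std::vector<int> T(V.begin() + t, V.end());
        // Below the cutoff, the task is executed immediately by the
        // encountering thread instead of being deferred.
        #pragma omp task if (n > 1000)
        foo(S, s);
        #pragma omp task if (n > 1000)
        foo(T, t);
        // Join the children here if the caller needs their results.
        #pragma omp taskwait
    }
    return 0;
}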

Related

Why is unordered_map and map giving the same performance?

Here is my code; my unordered_map and map are behaving the same and taking the same time to execute. Am I missing something about these data structures?
Update: I've changed my code based on the answers and comments below. I've removed the string operations to reduce their impact on the profile, and I now measure only find(), which takes almost 40% of the CPU time in my code. The profile shows that unordered_map is 3 times faster; still, is there any other way to make this code faster?
#include <map>
#include <unordered_map>
#include <cstdlib>   // rand
#include <ctime>     // clock
#include <stdio.h>

struct Property {
    int a;
};

int main() {
    printf("Performance Summary:\n");
    static const unsigned long num_iter = 999999;

    std::unordered_map<int, Property> myumap;
    for (int i = 0; i < 10000; i++) {
        int ind = rand() % 1000;
        Property p;
        p.a = i;
        myumap.insert(std::pair<int, Property>(ind, p));
    }

    clock_t tStart = clock();
    for (int i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        std::unordered_map<int, Property>::iterator itr = myumap.find(ind);
    }
    printf("Time taken unordered_map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);

    std::map<int, Property> mymap;
    for (int i = 0; i < 10000; i++) {
        int ind = rand() % 1000;
        Property p;
        p.a = i;
        mymap.insert(std::pair<int, Property>(ind, p));
    }

    tStart = clock();
    for (int i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        std::map<int, Property>::iterator itr = mymap.find(ind);
    }
    printf("Time taken map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
}
The output is:
Performance Summary:
Time taken unordered_map: 0.12s
Time taken map: 0.36s
Without going into your code, I would make a few general comments.
1. What exactly are you measuring? Your profiling includes both populating and scanning the data structures. Given that (presumably) populating an ordered map takes longer, measuring both works against seeing the gains (or otherwise) of one map over the other. Figure out what you want to measure and measure just that.
2. You also have a lot going on in the code that is probably incidental to what you are profiling: there is a lot of object creation, string concatenation, and so on. This is probably what you are actually measuring. Profile only what you want to measure (see point 1).
3. 10,000 cases is far too small. At this scale, other considerations can overwhelm what you are measuring, particularly when you are measuring everything.
There is a reason we like getting minimal, complete and verifiable examples. Here's my code:
#include <map>
#include <unordered_map>
#include <cstdlib>   // rand
#include <ctime>     // clock
#include <stdio.h>

struct Property {
    int a;
};

static const unsigned long num_iter = 100000;

int main() {
    printf("Performance Summary:\n");
    clock_t tStart = clock();

    std::unordered_map<int, Property> myumap;
    for (int i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        Property p;
        //p.fileName = "hello" + to_string(i) + "world!";
        p.a = i;
        myumap.insert(std::pair<int, Property>(ind, p));
    }
    for (int i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        myumap.find(ind);
    }
    printf("Time taken unordered_map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);

    tStart = clock();
    std::map<int, Property> mymap;
    for (int i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        Property p;
        //p.fileName = "hello" + to_string(i) + "world!";
        p.a = i;
        mymap.insert(std::pair<int, Property>(ind, p));
    }
    for (int i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        mymap.find(ind);
    }
    printf("Time taken map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
}
Run time is:
Performance Summary:
Time taken unordered_map: 0.04s
Time taken map: 0.07s
Please note that I am running 10 times the number of iterations you were running.
I suspect there are two problems with your version. The first is that you are running too few iterations for the difference to show. The second is that you are doing expensive string operations inside the counted loop. The time spent on the string operations is greater than the time saved by using the unordered map, so you don't see the difference in performance.
Whether a tree (std::map) or a hash map (std::unordered_map) is faster really depends on the number of entries and the characteristics of the key (the variability of the values, the compare and hashing functions, etc.)
But in theory, a tree is slower than a hash map because insertion and searching inside a binary tree is O(log2(N)) complexity while insertion and searching inside a hash map is roughly O(1) complexity.
Your test didn't show it because:
- You call rand() in a loop. That takes ages in comparison with the map insertion, and it generates different values for the two maps you're testing, skewing the results even further. Use a lighter-weight generator, e.g. a minstd LCG.
- You need a higher-resolution clock and more iterations, so that each test run takes at least a few hundred milliseconds.
- You need to make sure the compiler does not reorder your code, so that the timing calls happen where they should. This is not always easy. A memory fence around the timed test usually helps.
- Your find() calls have a high probability of being optimized away since you're not using their value (I just happen to know that at least GCC in -O2 mode doesn't do that, so I leave it as is).
- String concatenation is also very slow in comparison.
Here's my updated version:
#include <atomic>
#include <chrono>
#include <cstdio>    // printf
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <unordered_map>

using namespace std;
using namespace std::chrono;

struct Property {
    string fileName;
};

const int nIter = 1000000;

template<typename MAP_TYPE>
long testMap() {
    std::minstd_rand rnd(12345);
    std::uniform_int_distribution<int> testDist(0, 1000);

    auto tm1 = high_resolution_clock::now();
    atomic_thread_fence(memory_order_seq_cst);

    MAP_TYPE mymap;
    for (int i = 0; i < nIter; i++) {
        int ind = testDist(rnd);
        Property p;
        p.fileName = "hello" + to_string(i) + "world!";
        mymap.insert(pair<int, Property>(ind, p));
    }

    atomic_thread_fence(memory_order_seq_cst);

    for (int i = 0; i < nIter; i++) {
        int ind = testDist(rnd);
        mymap.find(ind);
    }

    atomic_thread_fence(memory_order_seq_cst);
    auto tm2 = high_resolution_clock::now();

    return (long)duration_cast<milliseconds>(tm2 - tm1).count();
}

int main()
{
    printf("Performance Summary:\n");
    printf("Time taken unordered_map: %ldms\n", testMap<unordered_map<int, Property>>());
    printf("Time taken map: %ldms\n", testMap<map<int, Property>>());
}
Compiled with -O2, it gives the following results:
Performance Summary:
Time taken unordered_map: 348ms
Time taken map: 450ms
So using unordered_map in this particular case is faster by ~20-25%.
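One further knob worth trying (my own note, not from the answers above): since the keys here fall in [0, 1000), you can reserve buckets up front and lower the load factor, which cuts collisions and avoids rehashing during the fill. A minimal sketch:

std::unordered_map<int, Property> myumap;
myumap.max_load_factor(0.5f); // fewer entries per bucket, at some memory cost
myumap.reserve(1000);         // pre-allocate buckets for the ~1000 distinct keys
// ... fill and query as before ...

Whether this helps measurably depends on the hash quality and entry count, so treat it as something to benchmark rather than a guaranteed win.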
It's not just the lookup that's faster with an unordered_map. This slightly modified test also compares the fill times.
I have made a couple of modifications:
- increased the sample size
- both maps now use the same sequence of random numbers
#include <algorithm>   // std::generate_n
#include <cstdlib>     // rand
#include <ctime>       // clock
#include <map>
#include <unordered_map>
#include <vector>
#include <stdio.h>

struct Property {
    int a;
};

// An iterator adaptor that turns a sequence of ints into a sequence of
// map entries, so both maps can be filled with a single range insert.
struct make_property : std::vector<int>::const_iterator
{
    using base_class = std::vector<int>::const_iterator;
    using value_type = std::pair<const base_class::value_type, Property>;
    using base_class::base_class;

    decltype(auto) get() const {
        return base_class::operator*();
    }

    value_type operator*() const
    {
        return std::pair<const int, Property>(get(), Property());
    }
};

int main() {
    printf("Performance Summary:\n");
    static const unsigned long num_iter = 9999999;

    std::vector<int> keys;
    keys.reserve(num_iter);
    std::generate_n(std::back_inserter(keys), num_iter, [](){ return rand() / 10000; });

    auto time = [](const char* message, auto&& func)
    {
        clock_t tStart = clock();
        func();
        clock_t tEnd = clock();
        printf("%s: %.2gs\n", message, double(tEnd - tStart) / CLOCKS_PER_SEC);
    };

    std::unordered_map<int, Property> myumap;
    time("fill unordered map", [&]
    {
        myumap.insert(make_property(keys.cbegin()),
                      make_property(keys.cend()));
    });

    std::map<int, Property> mymap;
    time("fill ordered map", [&]
    {
        mymap.insert(make_property(keys.cbegin()),
                     make_property(keys.cend()));
    });

    time("find in unordered map", [&]
    {
        for (auto k : keys) { myumap.find(k); }
    });

    time("find in ordered map", [&]
    {
        for (auto k : keys) { mymap.find(k); }
    });
}
example output:
Performance Summary:
fill unordered map: 3.5s
fill ordered map: 7.1s
find in unordered map: 1.7s
find in ordered map: 5s

C++ parallel average matrix with OpenMP

I have this code in C++ that calculates the average of each column of a matrix. I want to parallelize the code using OpenMP.
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

using namespace std;

vector<double> average(const vector<vector<unsigned char>>& original){
    vector<vector<double>> result(original.size(), vector<double>(original[0].size()));
    vector<double> average(original[0].size(), 0.0);
    for (int i = 0; i < original.size(); i++) {
        const vector<unsigned char>& vector = original[i];
        for (int k = 0; k < vector.size(); ++k) {
            average[k] += vector[k];
        }
    }
    for (double& val : average) {
        val /= original.size();
    }
    return average;
}
Adding #pragma omp parallel for before the outer for loop gets me bogus results. Do you have any pointers? I thought I would find tons of examples of this online but wasn't able to find much. This is my first time using OpenMP.
Frank is right in saying that your immediate problem may be that you're using a non-atomic operation:
average[k] += vector[k];
You can fix that by using:
#pragma omp atomic
average[k] += vector[k];
But a larger conceptual problem is that this probably will not speed up your code. The operations you are doing are very simple, and your memory (at least within each row) is contiguous.
Indeed, I have made a Minimum Working Example for your code (you should have done this for your question):
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

using namespace std;

vector<float> average(const vector<vector<unsigned char>>& original){
    vector<float> average(original[0].size(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < original.size(); i++) {
        const vector<unsigned char>& vector = original[i];
        for (int k = 0; k < vector.size(); ++k) {
            #pragma omp atomic
            average[k] += vector[k];
        }
    }
    for (float& val : average) {
        val /= original.size();
    }
    return average;
}

int main(){
    vector<vector<unsigned char>> mat(1000);
    for(int y = 0; y < mat.size(); y++)
        for(int x = 0; x < mat.size(); x++)
            mat.at(y).emplace_back(rand() % 255);

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    double dont_optimize = 0;
    for(int i = 0; i < 100; i++){
        auto ret = average(mat);
        dont_optimize += ret[0];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time = " << (std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100) << std::endl;
    return 0;
}
Compiling this with g++ -O3 temp.cpp -fopenmp enables OpenMP. Run-times on my four-core machine are consistently about 10,247 microseconds. When I disable OpenMP, run-times are about 2,561 microseconds.
Starting and managing a thread team is expensive.
But there is a real way to speed up your code: improve your memory layout.
Using a std::vector< std::vector<T> > design means that each vector<T> can be located anywhere in memory. Rather, we would like all our memory nice and contiguous. We can achieve this by using flat-array indexing, like so:
(Note that an earlier version of the below code used, e.g., mat.at(y*width+x). The range checking this implies results in a significant loss of speed versus using mat[y*width+x], as the code now does. Times have been updated appropriately.)
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

using namespace std;

class Matrix {
public:
    vector<unsigned char> mat;
    int width;
    int height;

    Matrix(int width0, int height0){
        width  = width0;
        height = height0;
        for(int i = 0; i < width*height; i++)
            mat.emplace_back(rand() % 255);
    }

    unsigned char& operator()(int x, int y){
        return mat[y*width + x];
    }

    unsigned char operator()(int x, int y) const {
        return mat[y*width + x];
    }

    unsigned char& operator()(int i){
        return mat[i];
    }

    unsigned char operator()(int i) const {
        return mat[i];
    }
};

vector<float> average(const Matrix& original){
    vector<float> average(original.width, 0.0);
    #pragma omp parallel for
    for(int y = 0; y < original.height; y++)
        for(int x = 0; x < original.width; x++)
            #pragma omp atomic
            average[x] += original(x, y);
    for (float& val : average)
        val /= original.height;
    return average;
}

int main(){
    Matrix mat(1000, 1000);
    std::cerr << mat.width << " " << mat.height << std::endl;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    double dont_optimize = 0;
    for(int i = 0; i < 100; i++){
        auto ret = average(mat);
        dont_optimize += ret[0];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time = " << (std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100) << std::endl;
    return 0;
}
Note that I am also using float instead of double: you can cram twice the numbers into the same amount of space this way, which is good for caching.
This gives run-times of 292 microseconds without OpenMP and 9426 microseconds with OpenMP.
In conclusion, using OpenMP/parallelism slows down your code because the work being done in parallel takes less time than setting up the parallelism, but using a better memory layout cuts the run time by roughly 90% (2,561 microseconds down to 292). In addition, using the handy Matrix class I created improves your code's readability and maintainability.
Edit:
Running this on matrices that are 10,000x10,000 instead of 1,000x1,000 gives similar results. For the vector of vectors: 7,449 microseconds without OpenMP and 156,316 microseconds with OpenMP. For flat-array indexing: 32,668 microseconds without OpenMP and 145,470 microseconds with OpenMP.
The performance may be related to the hardware instruction set available on my machine (particularly, if my machine lacks atomic instructions then OpenMP will have to simulate them with mutexes and such). Indeed, in the flat-array example compiling with -march=native gives improved, though still not-great, performance for OpenMP: 33,079 microseconds without OpenMP and 127,841 microseconds with OpenMP. I will experiment with a more powerful machine later.
Edit
While the aforementioned testing was performed on the Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz, I have compiled this code (with -O3 -march=native) on the badass Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. The results are similar:
1000x1000, Vector of vectors, without OpenMP: 145μs
1000x1000, Vector of vectors, with OpenMP: 2,941μs
10000x10000, Vector of vectors, without OpenMP: 20,254μs
10000x10000, Vector of vectors, with OpenMP: 85,703μs
1000x1000, Flat array, without OpenMP: 139μs
1000x1000, Flat array, with OpenMP: 3,171μs
10000x10000, Flat array, without OpenMP: 18,712μs
10000x10000, Flat array, with OpenMP: 89,097μs
This confirms our earlier result: using OpenMP for this task tends to slow things down, even if your hardware is amazing. In fact, most of the speed-up between the two processors is likely due to the Xeon's large L3 cache size: at 30,720K it's 10x larger than the 3,720K cache on the i5.
Edit
Incorporating Zulan's reduction strategy from their answer below allows us to efficiently leverage parallelism:
vector<float> average(const Matrix& original){
    vector<float> average(original.width, 0.0);
    auto average_data = average.data();
    #pragma omp parallel for reduction(+ : average_data[ : original.width])
    for(int y = 0; y < original.height; y++){
        for(int x = 0; x < original.width; x++)
            average_data[x] += original(x, y);
    }
    for (float& val : average)
        val /= original.height;
    return average;
}
For 24 threads, this gives 2629 microseconds on the 10,000x10,000 arrays: a 7.1x improvement over the serial version. Using Zulan's strategy on your original code (without the flat-array indexing) gives 3529 microseconds, so we're still getting a 25% speed-up by using better layouts.
Frank and Richard have the basic issue right. The hint about memory layout is also true. However, it is possible to do much better than using atomic. Not only is atomic data access quite expensive, but by making all threads write to one completely shared region of memory, cache performance goes way down. So a parallel loop with nothing but an atomic increment is unlikely to scale well.
Reduction
The basic idea is to first compute a local sum vector, and then sum those vectors up safely later. That way, most of the work can be done independently and efficiently. Recent OpenMP versions make it quite convenient to do that.
Here is the example code, based on Richard's example. I do, however, swap the indexes and fix the operator() efficiency.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <vector>

class Matrix {
public:
    std::vector<unsigned char> mat;
    int width;
    int height;

    Matrix(int width0, int height0) {
        srand(0);
        width = width0;
        height = height0;
        for (int i = 0; i < width * height; i++)
            mat.emplace_back(rand() % 255);
    }

    unsigned char &operator()(int row, int col) { return mat[row * width + col]; }

    unsigned char operator()(int row, int col) const {
        // do not use at() here; the extra check is too expensive for the tight loop
        return mat[row * width + col];
    }
};

std::vector<float> __attribute__((noinline)) average(const Matrix &original) {
    std::vector<float> average(original.width, 0.0);
    // We can't do array reduction directly on vectors
    auto average_data = average.data();
    #pragma omp parallel reduction(+ : average_data[ : original.width])
    {
        #pragma omp for
        for (int row = 0; row < original.height; row++) {
            for (int col = 0; col < original.width; col++) {
                average_data[col] += original(row, col);
            }
        }
    }
    for (float &val : average) {
        val /= original.height;
    }
    return average;
}

int main() {
    Matrix mat(500, 20000);
    std::cerr << mat.width << " " << mat.height << std::endl;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    double dont_optimize = 0;
    for (int i = 0; i < 100; i++) {
        auto ret = average(mat);
        dont_optimize += ret[0];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time = "
              << (std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() / 100.)
              << "\n" << dont_optimize << std::endl;
    return 0;
}
For your given matrix size, this reduces the time from ~1.8 ms to ~0.3 ms with 12 threads on an Intel Xeon E5-2680 v3 at 2.5 GHz nominal frequency.
Switching the loop
Alternatively, you could parallelize the inner loop, as its iterations are independent of each other. However, that would be slow due to the small chunk of work each thread gets. You could instead swap the inner and outer loops, but then memory access becomes non-contiguous, which is also bad for performance. So the best option for this approach is to split the inner loop into chunks, like so:
constexpr size_t chunksize = 128;
#pragma omp parallel for
for (size_t col_chunk = 0; col_chunk < (size_t)original.width; col_chunk += chunksize) {
    const auto col_end = std::min(col_chunk + chunksize, (size_t)original.width);
    for (size_t row = 0; row < (size_t)original.height; row++) {
        for (size_t col = col_chunk; col < col_end; col++) {
            average_data[col] += original(row, col);
        }
    }
}
This gives you reasonably contiguous memory access while avoiding all interaction between threads. However, there may still be some false sharing at the borders between threads. I haven't been able to easily get very good performance out of this, but it is still faster than serial with a sufficient number of threads.
average[k] += vector[k]; is not an atomic operation.
Each thread might (and probably will) read the current value of average[k] (possibly at the same time as another thread), add to it, and write the result back.
These kinds of cross-thread data races are undefined behavior.
Edit:
An easy fix would be to invert the order of the loops and parallelize the k loop. That way, each thread writes to only one value. But then you multiply the number of lookups on the top-level vector by K, so you may not get that big a performance gain, as you will start thrashing the cache pretty hard.
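A minimal sketch of that inversion (my own illustration, reusing the original and average names from the question's function):

// Outer loop over columns: each thread owns a disjoint slice of average[],
// so no atomics are needed, at the cost of K traversals of the outer vector.
#pragma omp parallel for
for (int k = 0; k < (int)original[0].size(); ++k) {
    double sum = 0.0;
    for (size_t i = 0; i < original.size(); ++i)
        sum += original[i][k];
    average[k] = sum / original.size();
}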

How to generate random numbers in a thread safe way with OpenMP

Before parallelization, I was creating one default_random_engine object outside the loop since creating such objects isn't cheap. And I was reusing it inside the loop.
When parallelizing with OpenMP, I noticed that uniform_dist(engine) takes a mutable reference to the random engine which I assume isn't thread safe.
The program isn't crashing but I worry about its correctness.
I assume that random_device is thread safe, so I could just move the definition of default_random_engine inside the loop, but I don't want to create a random engine object on every iteration, since I have read that this isn't cheap.
I think another way would be to create an array (of size: the number of threads) of default_random_engine objects and use OpenMP functions to select the right object at the start of each iteration, based on the thread id.
Is there a better way?
#include <iostream>
#include <random>
#include <vector>

using namespace std;

int main() {
    int N = 1000;
    vector<int> v(N);
    random_device r;
    default_random_engine engine(r());
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        uniform_int_distribution<int> uniform_dist(1, 100);
        // Perform heavy calculations
        v[i] = uniform_dist(engine); // I assume this is thread unsafe
    }
    return 0;
}
Since the actual code passes the random engine to many functions (each generating integers and real numbers from different distributions), I went with the array of generators, one per thread, because it imposes the fewest changes on the code base:
#include <iostream>
#include <omp.h>
#include <vector>
#include <random>

using namespace std;

int main() {
    random_device r;
    std::vector<std::default_random_engine> generators;
    for (int i = 0, N = omp_get_max_threads(); i < N; ++i) {
        generators.emplace_back(default_random_engine(r()));
    }

    int N = 1000;
    vector<int> v(N);
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        // Get the generator based on the thread id
        default_random_engine& engine = generators[omp_get_thread_num()];
        // Perform heavy calculations
        uniform_int_distribution<int> uniform_dist(1, 100);
        v[i] = uniform_dist(engine); // safe now: each thread uses its own engine
    }
    return 0;
}
Keep in mind that this code assumes that omp_set_num_threads never gets called in the program. If it were, it would be possible for threads to get numbers (omp_get_thread_num()) at or above the old omp_get_max_threads(), which would cause out-of-bounds accesses into the generator array.
And sadly, this solution assumes an implementation detail which is not required by the standard as explained in this other comment.
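A variant that sidesteps that caveat (my own sketch, not from the original post) is to construct each thread's engine inside the parallel region, seeded from a thread-local temporary random_device, so no shared array has to be sized up front. The engine is still created once per thread, not once per iteration:

#include <random>
#include <vector>

int main() {
    const int N = 1000;
    std::vector<int> v(N);
    #pragma omp parallel
    {
        // One engine per thread; std::random_device{}() constructs a
        // distinct temporary in each thread, so seeding does not race.
        std::default_random_engine engine(std::random_device{}());
        std::uniform_int_distribution<int> uniform_dist(1, 100);
        #pragma omp for
        for (int i = 0; i < N; ++i)
            v[i] = uniform_dist(engine);
    }
    return 0;
}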

Use std::chrono::high_resolution_clock to measure std::lower_bound execution time?

I used std::chrono::high_resolution_clock to measure std::lower_bound execution time. Here is my test code:
#include <iostream>
#include <algorithm>
#include <chrono>
#include <random>
#include <vector>

const long SIZE = 1000000;

using namespace std::chrono;
using namespace std;

int sum_stl(const std::vector<double>& array, const std::vector<double>& vals)
{
    long temp = 0;
    auto t0 = high_resolution_clock::now();
    for (const auto& val : vals) {
        temp += lower_bound(array.begin(), array.end(), val) - array.begin();
    }
    auto t1 = high_resolution_clock::now();
    cout << duration_cast<duration<double>>(t1 - t0).count() / vals.size()
         << endl;
    return temp;
}

int main() {
    const int N = 1000;
    vector<double> array(N);
    auto&& seed = high_resolution_clock::now().time_since_epoch().count();
    mt19937 rng(move(seed));
    uniform_real_distribution<float> r_dist(0.f, 1.f);
    generate(array.begin(), array.end(), [&](){ return r_dist(rng); });
    sort(array.begin(), array.end());
    vector<double> vals;
    for (int i = 0; i < SIZE; ++i) {
        vals.push_back(r_dist(rng));
    }
    int index = sum_stl(array, vals);
    return 0;
}
array is a sorted vector of 1000 uniformly distributed random numbers. vals has a size of 1 million. At first I set the timer inside the loop to measure every single std::lower_bound execution; the timing result was around 1.4e-7 seconds. Then I tested other operations like +, -, sqrt, and exp, and they all gave the same result as std::lower_bound.
In a former topic, resolution of std::chrono::high_resolution_clock doesn't correspond to measurements, it was said that the chrono resolution might not be enough to represent a duration of less than 100 nanoseconds. So I set a timer around the whole loop and got an average by dividing by the iteration count. Here is the output:
1.343e-14
There must be something wrong, since it reports a duration even shorter than a single CPU cycle, but I just can't figure out what.
To make the question more general, how can I measure accurate execution time for a short function?
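The thread above leaves this question open. As a general technique (my own sketch, not an answer from the original thread), batch timing is sound as long as the measured work cannot be optimized away: accumulate every result into a sink the compiler must keep, time the whole batch, and divide by the iteration count.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> array(1000);
    std::iota(array.begin(), array.end(), 0.0);  // placeholder sorted data
    std::vector<double> vals(1000000, 500.5);    // placeholder probe values

    volatile long sink = 0;  // forces the compiler to keep each result
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& val : vals)
        sink = sink + (std::lower_bound(array.begin(), array.end(), val) - array.begin());
    auto t1 = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
                     / static_cast<double>(vals.size())
              << " ns per lower_bound call (sink=" << sink << ")\n";
    return 0;
}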

Generating random numbers in parallel with identical engines fails

I am using the RNG provided by C++11 and I am also toying around with OpenMP. I have assigned an engine to each thread and as a test I give the same seed to each engine. This means that I would expect both threads to yield the exact same sequence of randomly generated numbers. Here is a MWE:
#include <iostream>
#include <random>
#include <vector>

using namespace std;

uniform_real_distribution<double> uni(0, 1);
normal_distribution<double> nor(0, 1);

int main()
{
    #pragma omp parallel
    {
        mt19937 eng(0); // GIVE EACH THREAD ITS OWN ENGINE
        vector<double> vec;
        #pragma omp for
        for (int i = 0; i < 5; i++)
        {
            nor(eng);
            vec.push_back(uni(eng));
        }
        #pragma omp critical
        cout << vec[0] << endl;
    }
    return 0;
}
Most often I get the output 0.857946 0.857946, but a few times I get 0.857946 0.592845. How is the latter result possible, when the two threads have identical, uncorrelated engines?!
You have to put nor and uni inside the omp parallel region too. Like this:
#pragma omp parallel
{
    uniform_real_distribution<double> uni(0, 1);
    normal_distribution<double> nor(0, 1);
    mt19937 eng(0); // GIVE EACH THREAD ITS OWN ENGINE
    vector<double> vec;
    // ... rest of the loop as before ...
}
Otherwise there will be only one shared copy of each, when in fact every thread needs its own copy. This matters in particular for normal_distribution, which typically keeps internal state (implementations commonly generate normal deviates in pairs and cache the second value), so concurrent calls on a shared instance race on that state.
Updated to add: I now see that exactly the same problem is discussed in
this stackoverflow thread.
