Can Hadoop reduce runtime of SIFT? - object-recognition

Can we use Hadoop to run SIFT on multiple images?
SIFT takes roughly 1 s per image to extract keypoints and their descriptors. Given that each run is independent of the others and the runtime of a single run cannot be reduced, can we reduce the overall runtime in some other way?
Multithreading reduces runtime roughly by a factor of the number of cores you have, since each image can be processed on a separate core.
Can Hadoop be used to parallelize the runs over multiple images?
If so, by what factor can it reduce runtime, supposing we have a cluster of 3 nodes?

Could you give some good references for mappers? What kinds of mappers would be relevant for this job?
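Independent of Hadoop, the multithreading point above is easy to try locally. Below is a minimal sketch, assuming OpenCV 4.4+ (where cv::SIFT is part of the main API) and placeholder image paths, that extracts SIFT descriptors from several images concurrently with std::async, one task per image:

    #include <opencv2/opencv.hpp>
    #include <future>
    #include <iostream>
    #include <string>
    #include <vector>

    // Extract SIFT keypoints and descriptors for one image.
    static cv::Mat extractSift(const std::string& path) {
        cv::Mat img = cv::imread(path, cv::IMREAD_GRAYSCALE);
        if (img.empty()) return cv::Mat();           // unreadable file
        cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat descriptors;
        sift->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
        return descriptors;
    }

    int main() {
        // Placeholder image list; replace with your own paths.
        std::vector<std::string> images = {"img0.png", "img1.png", "img2.png", "img3.png"};

        // One asynchronous task per image; the OS schedules them across the available cores.
        std::vector<std::future<cv::Mat>> jobs;
        for (const auto& path : images)
            jobs.push_back(std::async(std::launch::async, extractSift, path));

        for (std::size_t i = 0; i < jobs.size(); ++i)
            std::cout << images[i] << ": " << jobs[i].get().rows << " descriptors\n";
        return 0;
    }

Each task is independent, so the local speed-up is roughly the number of physical cores, which is the same ceiling a per-image Hadoop mapper would give you on each node.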

Related

Does OpenCV's Cascade classifier use multiple cores?

Recently I wrote my own modified version (single-threaded CPU code) of the cascade classifier, which uses OpenCV's XML cascade file.
I want to compare my bare Viola-Jones implementation with OpenCV's, so I disabled OpenCL. OpenCV's version takes 19-23 ms to process the whole image, while my code takes 39-49 ms, about 2 times slower.
I suspect this is because my CPU has 2 cores and OpenCV uses parallel for loops to increase efficiency. Am I right?
If not, how much impact do the parallel loops in OpenCV's code have on overall performance?
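No answer was posted for this question, but one way to test the hypothesis is to restrict OpenCV's internal parallel loops with cv::setNumThreads and compare timings. A minimal sketch, assuming a standard frontal-face cascade XML and a placeholder test image:

    #include <opencv2/opencv.hpp>
    #include <chrono>
    #include <iostream>
    #include <vector>

    // Time one detectMultiScale call with a given number of OpenCV worker threads.
    static double timeDetect(cv::CascadeClassifier& cascade, const cv::Mat& gray, int threads) {
        cv::setNumThreads(threads);          // 1 = run OpenCV's parallel_for_ loops sequentially
        std::vector<cv::Rect> faces;
        auto t0 = std::chrono::steady_clock::now();
        cascade.detectMultiScale(gray, faces);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        cv::ocl::setUseOpenCL(false);        // keep OpenCL disabled, as in the question
        // Placeholder paths; use your own cascade file and test image.
        cv::CascadeClassifier cascade("haarcascade_frontalface_default.xml");
        cv::Mat gray = cv::imread("test.png", cv::IMREAD_GRAYSCALE);

        std::cout << "1 thread: " << timeDetect(cascade, gray, 1) << " ms\n";
        std::cout << "default : " << timeDetect(cascade, gray, -1) << " ms\n"; // negative restores the default
        return 0;
    }

If the single-threaded time roughly matches your implementation, the gap is mostly the parallel loops; if not, the remaining difference comes from other optimizations in OpenCV's code.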

Is cv::cuda::SURF_CUDA faster than cv::xfeatures2d::SURF?

I'm trying to build OpenCV with CUDA support to compare cv::cuda::SURF_CUDA with cv::xfeatures2d::SURF, but it's challenging.
Suppose that I want to get SURF descriptors for a high-performance, real-time application. I know that FAST or ORB are more suitable in that setting, but they are binary descriptors and I need Euclidean (floating-point) descriptors.
Anyway, the point is that I want to know which of these two implementations is faster given only one (query) image. I think this is important because someone told me that CUDA is only worth using when a lot of images have to be processed, since then the time to load them in the GPU memory becomes small compared to the time for computing descriptors, but I don't know if this is true.
Another reason I post this is that I only have an NVIDIA GT 755M, which is not a high-end GPU, so my results could be poor for that reason alone. On the other hand, I'm trying to improve the parallel section of cv::xfeatures2d::SURF (and test it on a Xeon Phi with 64 cores).
"the time to load them in the GPU memory becomes small compared to the time for computing descriptors" - OP
Yes, you are correct. See here and here for explanations of why CUDA kernels seem slow on their first runs.
For your application, it will depend entirely on the CPU and GPU you're running the code on and on how well the CPU and GPU code is written. As @NAmorim said, it depends on how much overhead your code creates and how much parallelism it is able to exploit.
Note that it can also depend on how many features you are processing, since this factors into both CPU and GPU computation time, along with a large portion of the GPU overhead (think of uploading/downloading the descriptors to/from the GPU).
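To make the overhead concrete, you can time the host-to-device upload separately from the descriptor computation. A minimal sketch, assuming OpenCV was built with CUDA and the xfeatures2d contrib module, and a placeholder query image:

    #include <opencv2/opencv.hpp>
    #include <opencv2/xfeatures2d.hpp>
    #include <opencv2/xfeatures2d/cuda.hpp>
    #include <chrono>
    #include <iostream>
    #include <vector>

    using Clock = std::chrono::steady_clock;
    static double ms(Clock::time_point a, Clock::time_point b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    }

    int main() {
        cv::Mat img = cv::imread("query.png", cv::IMREAD_GRAYSCALE); // placeholder image

        // GPU path: time the upload and the computation separately.
        cv::cuda::SURF_CUDA surfGpu;
        cv::cuda::GpuMat imgGpu, kpGpu, descGpu;
        auto t0 = Clock::now();
        imgGpu.upload(img);                                  // host -> device copy
        auto t1 = Clock::now();
        surfGpu(imgGpu, cv::cuda::GpuMat(), kpGpu, descGpu); // detect + describe on the GPU
        auto t2 = Clock::now();

        // CPU path.
        cv::Ptr<cv::xfeatures2d::SURF> surfCpu = cv::xfeatures2d::SURF::create();
        std::vector<cv::KeyPoint> kp;
        cv::Mat desc;
        auto t3 = Clock::now();
        surfCpu->detectAndCompute(img, cv::noArray(), kp, desc);
        auto t4 = Clock::now();

        std::cout << "GPU upload : " << ms(t0, t1) << " ms\n"
                  << "GPU compute: " << ms(t1, t2) << " ms\n"
                  << "CPU compute: " << ms(t3, t4) << " ms\n";
        return 0;
    }

As noted above, the very first GPU call also pays CUDA context and kernel initialization, so run the GPU section twice (or in a loop) and time the later passes for a fair comparison.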

How to measure latency of low latency c++ application

I need to measure the message-decoding latency (3 to 5 us) of a low-latency application.
I used the following method:
1. Get time T1
2. Decode the data
3. Get time T2
4. L1 = T2 - T1
5. Store L1 in an array (size = 100000)
6. Repeat the same steps 100000 times
7. Print the array
8. Compute the 99th and 95th percentiles of the data set
But I got fluctuations between tests. Can someone explain the reason for this?
Could you suggest an alternative method?
Note: the application runs in a tight loop (using 100% of a CPU core) and is bound to a CPU via the taskset command.
There are a number of different ways that performance metrics can be gathered either using code profilers or by using existing system calls.
NC State University has a good resource on the different types of timers and profilers that are available as well as the appropriate case for using each and some examples on their HPC website here.
Fluctuations will inevitably occur on most modern systems. Certain BIOS settings related to hyper-threading and frequency scaling can have a significant impact on the performance of certain applications, as can power-consumption and cooling/environmental settings.
Looking at the distribution of results as a histogram and/or fitting them to a Gaussian will also help determine how normal the distribution is and whether the fluctuations are ordinary statistical noise or serious outliers. Running additional tests would also be beneficial.
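For reference, a minimal sketch of the procedure described in the question, using std::chrono::steady_clock and reading the percentiles from the sorted samples (decode() is a placeholder for the real message-decoding routine):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Placeholder for the real message-decoding routine.
    static void decode() { /* ... */ }

    int main() {
        constexpr std::size_t kRuns = 100000;
        std::vector<std::int64_t> latencies;
        latencies.reserve(kRuns);               // avoid reallocations inside the measured loop

        for (std::size_t i = 0; i < kRuns; ++i) {
            auto t1 = std::chrono::steady_clock::now();
            decode();
            auto t2 = std::chrono::steady_clock::now();
            latencies.push_back(
                std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
        }

        // Sort once, then read percentiles directly from the sorted samples.
        std::sort(latencies.begin(), latencies.end());
        auto pct = [&](double p) { return latencies[static_cast<std::size_t>(p * (kRuns - 1))]; };

        std::cout << "p95: " << pct(0.95) << " ns\n"
                  << "p99: " << pct(0.99) << " ns\n"
                  << "max: " << latencies.back() << " ns\n";
        return 0;
    }

At the 3-5 us scale the clock-read overhead is small but not zero, so discarding a few warm-up iterations, in addition to the CPU pinning you already do with taskset, helps reduce the spread.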

CUDA processing image

Thanks for reading my thread.
Here is what I want to do:
I have many images on my hard drive, say 100000 of them, mostly 512x512 in size.
I'd like to load them one by one and calculate statistics such as the mean intensity, variance, min, and max of each single image.
I am wondering, can I use CUDA to accelerate this process? Is it going to be faster than CPU processing?
I am brand new to CUDA, but I am thinking of using a C++ project to do the image file I/O (libtiff, for example) and then using CUDA to do the calculation. In general, what would be a reasonable/fast way to implement this project?
Any comment is highly appreciated. Thanks a lot.
The other answers are right, but I would still say go ahead and do it. Doing such operations on images is perfect for learning CUDA and parallel programming, even though it is not as efficient as a multithreaded CPU implementation.
My advice:
1- Use a single thread to do the operations and time each operation.
2- Then use OpenMP to do the operations on the CPU, using 2, 4, or more threads depending on the number of cores you have (see the sketch after this answer).
3- Try to program it in CUDA; you will learn a lot of parallel programming primitives in the process (e.g. a parallel reduction for the min and mean operations).
Later on, when you move to more complex operations like histograms, deformable registration, or smoothing, you will start to appreciate the speed of parallel programming.
Hint: the I/O operations, like reading the images, will use the same code in all versions.
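For step 2, a minimal OpenMP sketch of the per-image statistics (mean, variance, min, max) on an 8-bit grayscale image might look like the following; the image path is a placeholder, OpenCV's imread is assumed only for loading, and the code must be compiled with -fopenmp (OpenMP 3.1+ for the min/max reductions):

    #include <opencv2/opencv.hpp>
    #include <cstdint>
    #include <iostream>

    int main() {
        // Placeholder path; an 8-bit grayscale image is assumed.
        cv::Mat img = cv::imread("image0000.tif", cv::IMREAD_GRAYSCALE);
        const int n = img.rows * img.cols;
        const std::uint8_t* p = img.ptr<std::uint8_t>(0); // imread returns a continuous Mat here

        long long sum = 0, sumSq = 0;
        int mn = 255, mx = 0;

        #pragma omp parallel for reduction(+:sum,sumSq) reduction(min:mn) reduction(max:mx)
        for (int i = 0; i < n; ++i) {
            const int v = p[i];
            sum += v;
            sumSq += static_cast<long long>(v) * v;
            if (v < mn) mn = v;
            if (v > mx) mx = v;
        }

        const double mean = static_cast<double>(sum) / n;
        const double var  = static_cast<double>(sumSq) / n - mean * mean;
        std::cout << "mean=" << mean << " var=" << var
                  << " min=" << mn << " max=" << mx << "\n";
        return 0;
    }

The same reductions map directly onto CUDA parallel-reduce kernels in step 3.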
I believe that having CUDA will not help you because:
1. Image data must be transferred to Graphics Controller.
2. Results must be transferred from Graphics Controller to CPU.
Data paths from the GPU to the CPU are not optimized for speed, since most of the traffic is from CPU to GPU.
You would get better performance by optimizing your operations for the data cache. Search the web for "C++ data cache optimization".
Also search for "loop unrolling" and "double buffer read".
I predict your bottleneck is not in analyzing the image data but in reading the data into memory.
Also look at distributing the computation among CPU cores, for example Core 1 works on the odd-numbered images and Core 2 works on the even-numbered images.
CUDA will probably not help you very much for such cheap statistical evaluations, since host-to-device copies are quite expensive. If you want to try it anyway, take a look at parallel reduction for your tasks.
I would first implement a simple CPU-threaded version and check its performance: just use one thread to fill a buffer with files (input) and a few workers (depending on the configuration of your machine) to do the calculations.
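A minimal sketch of that layout, using a shared atomic index over the file list instead of an explicit buffer (the file names and the choice of statistics calls are placeholders):

    #include <opencv2/opencv.hpp>
    #include <atomic>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        // Placeholder file list; in practice, enumerate the directory.
        std::vector<std::string> files;
        for (int i = 0; i < 1000; ++i)
            files.push_back("img_" + std::to_string(i) + ".tif");

        std::atomic<std::size_t> next{0};   // next file index to be claimed by a worker
        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

        auto work = [&]() {
            for (std::size_t i = next++; i < files.size(); i = next++) {
                cv::Mat img = cv::imread(files[i], cv::IMREAD_GRAYSCALE);
                if (img.empty()) continue;                  // skip unreadable files
                cv::Scalar mean, stddev;
                cv::meanStdDev(img, mean, stddev);          // per-image mean / standard deviation
                double mn, mx;
                cv::minMaxLoc(img, &mn, &mx);               // per-image min / max
                // ... store or print the results as needed ...
            }
        };

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < workers; ++t) pool.emplace_back(work);
        for (auto& th : pool) th.join();
        std::cout << "processed " << files.size() << " images with " << workers << " threads\n";
        return 0;
    }

With images this small, the threads will most likely be limited by disk I/O rather than by the statistics themselves, which is consistent with the bottleneck prediction above.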

Android Renderscript for CPU computation

Firstly, I read that there's a possibility of using RenderScript for compute tasks on the Nexus 10 at http://android-developers.blogspot.sg/2013/01/evolution-of-renderscript-performance.html
I was wondering if anyone has tried it out: does it help with computationally intensive algorithms such as N-Queens? Or does it only work on algorithms that can be split into many small tasks to make use of the GPU cores?
Secondly, regarding RenderScript Allocations, are they usable mainly for graphics only?
API at http://developer.android.com/reference/android/renderscript/Allocation.html
Is there any chance that I can pass an array of integers over to the script?
It probably depends on how you implement n-queens. We support recursion, but you'd need to split your task into some reasonable number of subtasks such that we could parallelize it over multiple cores (or on a GPU).
To pass an array of integers to RenderScript, create an Allocation of the appropriate size with Element.I32, then copy the array to the Allocation.
