What is the Rmax/Rpeak ratio in terms of supercomputers?

I am working with the Top500 supercomputer database (http://www.top500.org/).
Rmax is the maximum achieved performance.
Rpeak is the theoretical maximum performance.
Does the ratio of Rmax to Rpeak amount to something? Like, say, efficiency? Or anything that could say something about a supercomputer?
Could it be something like a lie factor?

Rmax is determined by the HPL benchmark. Details aren't always published, unfortunately, but in most cases the problem dimension requires a decent fraction of total memory.
Rpeak is determined by multiplying the number of floating-point units (usually vector) per processor by the processor count and by the number of floating-point instructions that can be issued per second. This is a bit hard today because of frequency variation.
The ratio can be viewed as an efficiency factor, although it may not be productive to use the result for assigning value to systems. 75% of 1000 is the same as 100% of 750, and if they have the same dollar and power costs, what difference does it make?
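As a rough sketch of how the two numbers relate (all figures below are hypothetical, not taken from any real Top500 entry):

#include <stdio.h>

int main(void) {
    /* Hypothetical system: illustrative numbers, not a real Top500 entry. */
    double cores         = 100000.0;  /* total core count                       */
    double flops_per_cyc = 16.0;      /* FP operations per core per clock cycle */
    double clock_ghz     = 2.0;       /* nominal clock frequency in GHz         */

    /* Rpeak: theoretical peak, in GFLOP/s */
    double rpeak = cores * flops_per_cyc * clock_ghz;

    /* Rmax would come from an actual HPL run; assume a measured value here. */
    double rmax = 0.75 * rpeak;

    printf("Rpeak      = %.0f GFLOP/s\n", rpeak);
    printf("Rmax       = %.0f GFLOP/s\n", rmax);
    printf("Efficiency = %.1f %%\n", 100.0 * rmax / rpeak);
    return 0;
}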
I tend to view the combination of Top500, Graph500 and HPCG results as a more robust way to compare systems, but one cannot ignore power and dollar costs if one pays for systems (most users do not, at least directly).

Related

Estimating the efficiency of GPU in FLOPS (CUDA SAMPLES)

It seems to me that I don't completely understand the concept of FLOPS. In CUDA SAMPLES, there is a Matrix Multiplication example (0_Simple/matrixMul). In this example, the number of FLOPs (floating-point operations) per matrix multiplication is calculated via the formula:
double flopsPerMatrixMul = 2.0 * (double)dimsA.x * (double)dimsA.y * (double)dimsB.x;
So this means that in order to multiply matrix A (n x m) by B (m x k), we need to do 2*n*m*k floating-point operations.
However, in order to calculate 1 element of the resulting matrix C (n x k), one has to perform m multiplications and (m-1) additions. So the total number of operations (to calculate n x k elements) is m*n*k multiplications and (m-1)*n*k additions.
Of course, we could set the number of additions to m*n*k as well, and the total number of operations would be 2*n*m*k, half of them multiplications and half additions.
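To make that counting concrete, here is a minimal C sketch of the naive triple loop with explicit counters (the dimensions are arbitrary small values chosen purely for illustration):

#include <stdio.h>

#define N 4  /* rows of A / rows of C */
#define M 3  /* cols of A / rows of B */
#define K 5  /* cols of B / cols of C */

int main(void) {
    double A[N][M] = {{0}}, B[M][K] = {{0}}, C[N][K];
    long muls = 0, adds = 0;

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < K; ++j) {
            double sum = A[i][0] * B[0][j];   /* first product          */
            ++muls;
            for (int p = 1; p < M; ++p) {
                sum += A[i][p] * B[p][j];     /* one mul plus one add   */
                ++muls;
                ++adds;
            }
            C[i][j] = sum;
        }
    }
    printf("muls = %ld (m*n*k = %d)\n", muls, M * N * K);
    printf("adds = %ld ((m-1)*n*k = %d)\n", adds, (M - 1) * N * K);
    return 0;
}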
But, I guess, multiplication is more computationally expensive than addition. Why are these two types of operations lumped together? Is it always the case in computer science? How can one take two different types of operations into account?
Sorry for my English)
The short answer is that yes, they count both the multiplications and the additions. Even though most floating point processors have a fused multiply/add operation, they still count the multiply and add as two separate floating point operations.
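For example (a trivial C sketch; fma() here is just the standard <math.h> routine, not anything taken from HPL): even when the multiply and add execute as a single fused instruction, the conventional accounting still charges two floating-point operations.

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.5, b = 2.0, c = 0.25;

    /* One fused multiply-add: a*b + c, rounded once. */
    double r = fma(a, b, c);

    /* Conventional FLOP accounting still counts 2 operations here
       (1 multiply + 1 add); peak figures for FMA-capable hardware
       are usually quoted the same way.                             */
    printf("r = %g (counted as 2 FLOPs)\n", r);
    return 0;
}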
This is part of why people have been complaining for decades that FLOPs is basically a meaningless measurement. For it to mean even a little, you nearly always need to specify some particular body of code for which you're measuring the FLOPs (e.g., "Linpack gigaflops"). Even then, you sometimes need fairly tight control over things like which compiler optimizations are allowed, to ensure that what you're measuring is really machine speed rather than the compiler's ability to simply eliminate some operations.
Ultimately, it's concerns like these that have led to organizations being formed to set up benchmarks and rules about how those benchmarks must be run and results reported (e.g., SPEC). Otherwise, it can be difficult to be at all certain that the results you see reported for two different processors are really comparable in any meaningful way. Even with it, comparisons can be difficult, but without such things they can border on meaningless.

Generating 'random' numbers from time?

I have read about many random number generators and the problems most of them have (repeatability, non-uniform distribution, floating-point precision, modulus issues, whatever).
I'm a game developer and I'm thinking: why not generate 'random' numbers from time? I know they won't be truly 'random', but at least they can't be predicted, and I'm happy for them to just feel random to the players.
For example, let's say that at every frame we can take 5 digits out of the current time and use them to generate random numbers.
Let's say we have the time as a float ss.mmmuuunnn, where ss = seconds, mmm = milliseconds, uuu = microseconds and nnn = nanoseconds; we can take only the part muuun and use this to generate our very own random numbers. I have investigated them a bit, and they seem and feel pretty random. I can come up with so many formulas to play around with those 5 digits and get new numbers.
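For concreteness, here is a rough C sketch of the idea as I understand it (clock_gettime() is POSIX; the actual clock resolution varies by platform, which matters here):

#include <stdio.h>
#include <time.h>

/* Rough sketch: pull five digits (last millisecond digit, the microseconds,
   and the first nanosecond digit) out of the current time and treat them as
   a "random" value. Other platforms need a different high-resolution clock. */
static unsigned time_digits(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    /* ts.tv_nsec holds the 9 digits mmmuuunnn; drop the last two and
       keep the next five, i.e. the "muuun" slice from the question.  */
    return (unsigned)((ts.tv_nsec / 100) % 100000);   /* 0..99999 */
}

int main(void) {
    for (int i = 0; i < 5; ++i)
        printf("%05u\n", time_digits());
    return 0;
}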
Does anyone here see anything wrong with this, or anything that could perform miserably?
Reminder, I'm just looking for an easy way to generate numbers that 'FEEL' randomly distributed and are unpredictable.
Is this an easy and decent way to give players the sense of randomness ?
Let us assume for the sake of the argument that you are making, on average, one call to your random function every 0.1 milliseconds (since you need it to be fast, you are calling it often, right?) and that it is equally probable to fall anywhere into that time range. In other words, the uun part is assumed to be completely random, but everything higher only changes slowly from call to call and is thus mostly predictable.
That is 1000 possible outcomes, or ~10 bits of randomness. There are 1,056,964,608 normal floats between 0 and 1 - not equally distributed, of course. That's six orders of magnitude more, which sounds like "poor randomness" to me. Similarly, spreading your 10 bits over the 32 bits of an int (no matter how fancy your function) won't actually improve the randomness.
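A quick back-of-the-envelope check of those figures (a tiny C sketch, nothing more):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Normal single-precision floats in (0, 1): 126 usable exponents,
       each with 2^23 distinct mantissa patterns.                      */
    long floats_in_unit_interval = 126L * (1L << 23);   /* 1,056,964,608 */

    printf("normal floats in (0,1): %ld (~%.0f bits)\n",
           floats_in_unit_interval, log2((double)floats_in_unit_interval));
    printf("time-based outcomes:    1000 (~%.1f bits)\n", log2(1000.0));
    return 0;
}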
Also note that none of this deals with the very likely scenario that your calls will be extremely periodic and/or come in short bursts, nor with the fact that your system time function might not have high enough resolution (or might significantly increase the power consumption of the system). Both further reduce the randomness of the obtained time, and the side effect of the latter can be very undesirable.
Reminder, I'm just looking for an easy way to generate numbers that 'FEEL' randomly distributed and are unpredictable.
That is extremely unspecific. Humans are terrible at judging randomness and will likely "feel" a close-to-uniform distribution to be more random than a true, fully random one - especially when it comes to streaks.
Unpredictability also exists on way too many levels, from "the player can't manually predict what the enemy will do" to "cryptographically secure until the end of time". For example, given the above assumptions, it might be possible to predict the result of the second of two random calls that happen in quick succession with a success rate of anywhere from 0.1% to 100%. How this pattern emerges in a game is hard to tell, but humans are exceedingly good at spotting patterns.

Definition of number of steps

According to this
Let n be a parameter that measures the size of the problem, and let
R(n) be the amount of resources the process requires for a problem of
size n. In our previous examples we took n to be the number for which
a given function is to be computed, but there are other possibilities.
For instance, if our goal is to compute an approximation to the square
root of a number, we might take n to be the number of digits accuracy
required. For matrix multiplication we might take n to be the number
of rows in the matrices. In general there are a number of properties
of the problem with respect to which it will be desirable to analyze a
given process. Similarly, R(n) might measure the number of internal
storage registers used, the number of elementary machine operations
performed, and so on. In computers that do only a fixed number of
operations at a time, the time required will be proportional to the
number of elementary machine operations performed.
and
... For instance, if we count process steps as “machine operations” we
are making the assumption that the number of machine operations needed
to perform, say, a multiplication is independent of the size of the
numbers to be multiplied, which is false if the numbers are
sufficiently large. Similar remarks hold for the estimates of space.
Like the design and description of a process, the analysis of a
process can be carried out at various levels of abstraction.
I usually try to count the number of steps by counting how many times each single operation is executed and totalling them (e.g. 1 if statement, n multiplications, and n additions gives 2n+1 operations, which I simplify asymptotically to n, where n is the input). This book makes it appear there are different things one could count when counting the number of steps. That is, what steps to count might vary. My questions are:
Other than registers and elementary machine operations, are there other examples of things that can be counted when counting the number of steps?
I believe I am counting elementary machine operations when determining order of growth. Are there example programs where the steps we're counting should not be elementary machine operations (e.g. registers)?
This statement is confusing: "In computers that do only a fixed number of operations at a time, the time required will be proportional to the number of elementary machine operations performed." Does this talk of all computers in general? Are there computers that do different numbers of operations at a time?
It is up to you what you count.
What you count defines the "model" you're using in your analysis, like e.g. RAM model.
With a model you get some predictions as to your program's behaviour. Whether these predictions are any good - which we can actually measure by running our actual programs at different problem sizes - tells us whether the execution model we used is any good, whether it is in good correspondence with what is actually going on inside our computer, whether it is a useful one.
Analyzing empirical run-time behavior can be done by plotting the run times versus the problem sizes on a log-log plot. We'd get some curve; any curve can be approximated by a sequence of straight lines; straight line on a log-log plot expresses a power rule of growth:
t ~ n^a  <==>  log(t) ~ a*log(n).
Empirical orders of growth for the win!
cf. this Q&A entry of mine on this site. Also, great examples of log-log plots in answers by Todd Lehman here.
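As a rough sketch of estimating an empirical order of growth from two timings (the quadratic workload below is just a hypothetical stand-in for whatever program is being analyzed):

#include <math.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical O(n^2) workload, standing in for the program under analysis. */
static volatile double sink;

static void work(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            s += (double)i * j;
    sink = s;
}

static double seconds(int n) {
    clock_t t0 = clock();
    work(n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    int    n1 = 4000,        n2 = 8000;
    double t1 = seconds(n1), t2 = seconds(n2);

    /* A straight line on a log-log plot means t ~ n^a, so the slope
       between two measurements estimates the empirical order a.     */
    double a = log(t2 / t1) / log((double)n2 / (double)n1);

    printf("t(%d) = %.3fs, t(%d) = %.3fs, empirical order a = %.2f\n",
           n1, t1, n2, t2, a);
    return 0;
}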
update: to your last bullet point, parallel computers could be doing different numbers of operations at a time, depending on available resources and/or the specifics of the stage of the computational task currently being executed. And, as the quoted material says, multiplying some numbers early on in the process (say), while they are still small enough, takes O(1) machine operations; not so once they become bigger.
Also, modern CPUs can (will?) perform several operations at once, depending on the specifics of the instructions currently in the pipeline(*) (some combinations are more amenable to this, some less).
(*) Wikipedia says: "Instruction pipelining is a technique for implementing instruction-level parallelism within a single processor."

Can I assume that a bitwise AND operation is O(1)? If so why?

My processor can only do arithmetic operations on 8-bit or 16-bit unsigned integers.
1) This means the word size for this processor is 16 bits, correct?
Operations on words are O(1).
The reason for this has to do with the actual circuit-level implementation of how the processor works, correct?
If I were to add two words, and the result were a number with more than 16-bits, could I state the following?
1) the processor can add the numbers but would only report the 16 least significant bits.
2) to also report more than 16 bits, the processor must have software that permits these operations with large numbers (numbers that don't fit in a word).
Finally,
Let's say I have a 16-bit word w and I want the eight least significant bits. I can do w & 0xFF. What is the time complexity of this operation? Is this O(1) because of the circuit-level implementation of the processor as well?
Short Answer:
Yes, a single bitwise AND would be considered O(1).
A Little More Detail:
Even if you look at the number of operations on each bit, it is still O(1). The actual number of bit operations may vary with the variable type, e.g. 8 bits vs. 16 bits vs. 32 bits vs. 64 bits (even 128 bits or more). The key is that no matter what the underlying machine uses, it will still do a constant number of operations to perform the AND. So even as computers develop over time, a bitwise AND will still be O(1).
Another Example to Help Add Clarification
The following block of code is O(1):
print('Hello World');
print('Hello World');
print('Hello World');
Although we print hello world 3 times, every time we run it, it will take a constant amount of time to run, and it doesn't take longer if someone feeds a large data set into the program. It'll simply print 3 things, no matter the input.
In the case of the bitwise AND, we perform a specified number of sub-operations that is always the same, e.g. 8, 16, 32, etc. for the one operation, but it's always the same, i.e. constant.
In your example, it sounds like you are trying to show that you have some operations that would not require all of the bits to perform. Even if these smaller operations only considered 4 bits out of, say, 8, your code would always do a constant number of operations when it hits that code. It's like printing 4 hello world statements instead of 8 hello world statements. Either way, 4 or 8 prints, it's still constant.
This is why a single bitwise AND operation is O(1).
First, about the add: most CPUs have what is called a carry flag in a processor status register that lets you detect an addition or subtraction overflowing the bit size of the register. So find out the size of the registers on your specific CPU to determine the data bus width, then find out how you can check the status register flags. Most CPUs will have SUB and ADD instructions with and without carry for that purpose.
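In plain C, where the carry flag isn't directly visible, the same check can be sketched by comparing the truncated 16-bit sum against one of the operands (a minimal sketch, assuming uint16_t exists on the target):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t a = 0xFFF0, b = 0x0020;

    uint16_t low   = (uint16_t)(a + b);  /* the 16 least significant bits */
    int      carry = low < a;            /* wrapped around => carry out   */

    /* The full 17-bit result, reconstructed in software as the question
       describes for numbers that don't fit in a word.                    */
    uint32_t full = ((uint32_t)carry << 16) | low;

    printf("low 16 bits = 0x%04X, carry = %d, full sum = 0x%05X\n",
           (unsigned)low, carry, (unsigned)full);
    return 0;
}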
Next, about time complexity: you can't really use big-O notation for this. You need to find out how many cycles the CPU takes to carry out the operation and convert that to absolute time (cycles / frequency), then you need to take into consideration other things like memory access vs. L1 and L2 cache access to figure out the total time an operation will take on that CPU.
Finally, accessing memory from assembly code (as you seem to imply) lets you be much more efficient than with higher-level languages like Python. CPUs include instructions that can adjust their memory addressing to fit the size of what you are looking for. C-like languages also expose such abilities, but Python doesn't. JavaScript doesn't even have integers, but I digress...
If your goal is to understand low-level programming, something that will always help you understand the machine better, especially around pointers and debugging, I would encourage you to take a video class on Arduino. You may even enjoy it and start a new hobby.
You apparently have some confusion about what O(...) means.
Big-O notation requires a variable; for example, sorting an array of n elements using compare-and-swap is known to take at least on the order of n*log(n) operations, and there n is the variable: as n increases, the sorting time also increases, even faster. When saying that x & 0xFF is O(1), what is the variable you're talking about?
big-O is about abstract algorithms where n can grow to infinity. If n and the computer are limited by an upper bound, then any algorithm either doesn't terminate or is bounded by a constant (it doesn't make sense to discuss the limit of something as n increases toward infinity if n cannot increase past a specific limit).
When talking about low-level hardware operations, all operations are O(1). Some are faster and require just one "step" (like clearing a register), some require more steps (like an integer division). Even the division, however, will take at most k steps, where k is a small integer.
When discussing the performance of different algorithms on a concrete CPU, it doesn't make sense to use big-O notation; what you can do is count the number of machine cycles required to complete, with an explicit formula, possibly dependent on the input size n, but where n cannot grow to infinity.
This is what Knuth does in TAOCP.
PS: unfortunately, CPUs today are so complex that cycle counting doesn't work any more in the real world. They can, for example, split instructions into micro-instructions rescheduled to run in parallel, and they support speculative execution with backtracking, branch prediction, and other hard-to-analyze techniques. On top of all that there is the caching issue, which is of extreme importance today, and different but compatible models can have vastly different approaches. The only way to really know how long it takes to execute a piece of code on modern CPUs is to run and measure it over real data on the specific CPU model and hardware.
In order:
No. If your processor can do 8- or 16-bit operations, it could be either an 8- or a 16-bit processor. In fact, it's more likely to be 8 bits, since most processors also try to handle double-size operations.
Yes, O(1). But not simply because it's in hardware; rather, because it's implemented as O(1) in hardware. Also, keep in mind that any O(x) with constant x is really "O(1) times a constant": if something is "O(16)", that's really O(1) with a constant factor of 16.
Finally, if you have a 16-bit word and you want the low bits, and your processor really does support 8 bits operations, you can probably access the low bits with a MOV instruction. Something like:
mov16 ax, (memory)
mov8 (othermemory), al
If that isn't available, and you have to do an AND, then yes, the AND is going to be O(1) because it's almost certainly in hardware that way. Even if not, it's probably O(8) at worst, and that's really an alias for O(1).
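In C, the two routes described above look roughly like this (whether the compiler emits a byte move or an AND for each is up to the target and the optimizer):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t w = 0xABCD;

    uint8_t low_by_cast = (uint8_t)w;   /* just take the low byte (often a plain byte move)   */
    uint8_t low_by_mask = w & 0xFF;     /* mask off the low byte (an AND, also constant time) */

    printf("0x%02X 0x%02X\n", low_by_cast, low_by_mask);  /* both print 0xCD */
    return 0;
}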

Large doubles/float/numbers

Say I have a huge floating-point number, say a trillion decimal places out. Obviously a long double can't hold this. Let's also assume I have a computer with more than enough memory to hold it. How do you do something like this?
You need arbitrary-precision arithmetic.
Arbitrary-precision math.
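A minimal sketch of the idea using the GMP library (assuming GMP is installed and the program is linked with -lgmp; the precision requested here is tiny compared to a trillion digits, purely for illustration):

#include <gmp.h>
#include <stdio.h>

int main(void) {
    /* Ask for ~1000 bits (~300 decimal digits) of working precision;
       in principle this can be raised as far as memory allows.       */
    mpf_set_default_prec(1000);

    mpf_t x;
    mpf_init_set_ui(x, 2);   /* x = 2                                    */
    mpf_sqrt(x, x);          /* x = sqrt(2), far beyond double precision */

    gmp_printf("sqrt(2) ~= %.80Ff\n", x);

    mpf_clear(x);
    return 0;
}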
It's easy to say "arbitrary precision arithmetic" (or something similar), but I think it's worth adding that it's difficult to conceive of ways to put numbers anywhere close to this size to use.
Just for example: the current estimates of the size of the universe are somewhere in the vicinity of 150-200 billion light years. At the opposite end of the spectrum, the diameter of a single electron is estimated at a little less than 1 attometer. 1 light year is roughly 9.46x10^15 meters (for simplicity, we'll treat it as 10^16 meters).
So, let's take 1 attometer as our unit, and figure out the size of the number for the diameter of the universe in that unit: 10^18 units/meter * 10^16 meters/light year * 10^11 light years/universe diameter = about a 45-digit number to express the diameter of the universe in units of roughly the diameter of an electron.
Even if we went the next step, and expressed it in terms of the theorized size of a superstring, and added a few extra digits just in case the current estimates are off by a couple orders of magnitude, we'd still end up with a number around 65 digits or so.
This means, for example, that if we knew the diameter of the universe to the size of a single superstring, and we wanted to compute something like volume of the universe in terms of superstring diameters, our largest intermediate result would be something like 600-700 digits or so.
Consider another salient point: if you were to program a 64-bit computer running at, say, 10 GHz to do nothing but count -- increment a register once per clock cycle -- it would take roughly 58 years (2^64 cycles at 10^10 cycles per second is about 1.8x10^9 seconds) for it just to cycle through the 64-bit numbers and wrap around to 0 again.
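A quick sanity check of that arithmetic (again just a throwaway C sketch):

#include <math.h>
#include <stdio.h>

int main(void) {
    double counter_states = pow(2.0, 64);       /* values of a 64-bit register */
    double hz             = 10e9;               /* 10 GHz, one increment/cycle */
    double secs_per_year  = 365.25 * 24 * 3600;

    printf("~%.0f years to wrap a 64-bit counter at 10 GHz\n",
           counter_states / hz / secs_per_year);   /* roughly 58 years */
    return 0;
}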
The bottom line is that it's incredibly difficult to come up with excuses (much less real reasons) to carry out calculations to anywhere close to millions, billions/milliards or trillions/billions of digits. The universe isn't that big, doesn't contain that many atoms, etc.
Sounds like what logarithms were invented for.
Without knowing what you intend to do with the number, it's impossible to accurately say how to represent it.
