CPM and APM in supercomputing? - supercomputers

I am doing research for a paper on supercomputers, specifically on Tianhe-2. I was reading a report by Professor Jack Dongarra, and he mentions the CPM and APM halves of the board: "The compute board has two compute nodes and is composed of two halves, the CPM and the APM halves. The CPM portion of the compute board contains the 4 Ivy Bridge processors, memory, and 1 Xeon Phi board and the APM half contains the 5 Xeon Phi boards".
The first thing I have a problem with is the term "compute", because I don't know how to translate "compute board" if there are Xeon Phi boards on that compute board.
The second thing is about CPM and APM. What are CPM and APM? What are their full names? And how do they function?
Please help me, I'm stuck and can't find an explanation anywhere.
Thanks.
Tami

Tianhe-2 is a cluster: a set of computers (called 'nodes') linked together with a fast interconnect (network) and a distributed storage system. Most nodes are dedicated to computing (the 'compute nodes'), while some others are dedicated to management ('management nodes'). Dongarra's document also mentions blades as a synonym for nodes. A blade is a kind of node form factor that works somewhat like a docking station for a laptop.
Traditionally, a node is a full computer, with a main circuit board (the 'board', or 'motherboard') into which the processors and memory modules are plugged, a network interface, possibly a local hard disk, and an operating system.
On Tianhe-2, things are a bit different. A single board is made of two distinct parts (modules) plugged together (the CPM and the APU), and that single board hosts two separate nodes. Rather than having two identical boards for two distinct nodes, Tianhe-2 uses one two-part board for two distinct nodes.
One of the parts (the CPM) hosts the CPUs (Intel Ivy Bridge) and the memory, plus one accelerator (Intel Xeon Phi) and two network connections, while the other (the APU) hosts 5 accelerators. Plugged together, they offer two nodes, each with 2 CPUs, 3 accelerators, and one network connection.
The Intel Xeon Phi is an extension card that plugs into the main board. On that extension card lives a fully featured mini-computer with a CPU, some memory, and ... a tiny motherboard.
The exact expansion of CPM and APU (also referred to as APM in Dongarra's document, which looks more like a typo, though it has been quoted in many, many places) is nowhere to be found; one could guess they stand for Central Processing Module and Accelerated Processing Unit, or some variant thereof.
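To see how the two halves add up, here is a quick back-of-the-envelope sketch in Python using only the counts quoted from Dongarra's description above (nothing else is assumed):

    # Back-of-the-envelope view of one Tianhe-2 compute board,
    # using only the counts given in Dongarra's description.
    CPM = {"ivy_bridge_cpus": 4, "xeon_phi": 1, "network_connections": 2}
    APM = {"ivy_bridge_cpus": 0, "xeon_phi": 5, "network_connections": 0}

    # The two halves plugged together form one board hosting two nodes.
    board = {part: CPM[part] + APM[part] for part in CPM}
    nodes_per_board = 2
    per_node = {part: count // nodes_per_board for part, count in board.items()}

    print(board)     # {'ivy_bridge_cpus': 4, 'xeon_phi': 6, 'network_connections': 2}
    print(per_node)  # {'ivy_bridge_cpus': 2, 'xeon_phi': 3, 'network_connections': 1}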

Related

Best way to approach FPGA Device Requirements

When designing FPGA systems how can I estimate roughly the number of logic blocks a given task would require?
Does anyone have a rough order of magnitude for what I should expect for these common devices?
UART
packet deframer with CRC32
8-bit micro core
I've looked at www.opencores.org; however, they don't give a gate-count figure for each project.
UART: 3200 gates.
8-bit uC: 10k gates.
Check http://www.design-reuse.com/ for others.
An entire Amiga can fit in 400k gates, excluding the CPU. See the Minimig project; it's open source and should include some useful reference files. There's also an FPGA 68k core somewhere online, written by tobiflex, that you can check. Also look at the Commodore One machine and the C64/CPC cores (Z80, 6845, SID, 6502, etc.) to see how they compare.
I'd avoid gate counts with FPGAs; here are some 4-input look-up-table (LUT) estimates instead (most of my experience is with Xilinx, but it'll be similar for Altera and others):
A raw UART is a few dozen LUTs/FFs; if it has a bus interface to a micro, it'll be more (likely still <100), and if it has 16550-style FIFOs, even more (and maybe some RAM blocks as well).
8-bit micro: in Xilinx, see PicoBlaze (113 slices; each slice is two LUTs and two FFs, but not all of them are used in every slice).
Packet deframer: no idea, it depends on the framer spec, sorry :)
I'd recommend going to OpenCores.org, finding a design similar to yours, and synthesizing it. That's the most accurate way to estimate logic utilization.
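If it helps, the ballpark figures quoted in this thread can be turned into a quick budget check against a candidate device. Everything below is an order-of-magnitude guess lifted from the answers above; the deframer entry and the device capacity are pure placeholders you would replace with your own numbers:

    # Rough logic-budget check using the ballpark figures from this thread.
    # These are order-of-magnitude guesses, not synthesis results.
    estimated_luts = {
        "raw_uart": 50,                  # "a few dozen LUT/FFs"
        "uart_bus_interface_extra": 50,  # still "<100" total with a micro bus interface
        "picoblaze_8bit_micro": 113 * 2, # 113 slices, roughly 2 LUTs per slice
        "crc32_packet_deframer": 300,    # pure placeholder: depends on the framer spec
    }
    device_luts = 4000  # placeholder: use your target FPGA's datasheet figure

    total = sum(estimated_luts.values())
    print(f"Estimated {total} LUTs of {device_luts} available "
          f"({100 * total / device_luts:.0f}% utilization)")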

Genetic algorithm for workstation assignments

Background
I have a project where I need to assign roughly 200 full time employees (FTEs) to their workstations. FTEs have the following properties:
ID (numeric string)
Level (Admins, Seniors, Mid-Level, Juniors, Associates)
Manager ID
Workstation Type (office, premium, new, average, old, in descending order of quality).
Department
All FTEs have been assigned a type of workstation based on their level, so Admin-levels are more likely to get office types. Mid-Levels are likely to get new types, but if there aren't enough, they might get bumped up to a premium if there are spares. There will likely be empty workstations after everyone is assigned.
Workstations are also divided into separate sections, numbered 0-9. Each section has a different number of workstations.
Objective
Assignments are done with the following priorities (highest first):
FTEs in the Marketing department must sit in Section 2. In case there are not enough workstations there, they can "overflow" to an adjacent section.
Group members (those who share the same manager) must be within 5 workstation units of each other.
Managers must be within 10 units of the closest group member.
Those in the same department must be within 5 units of each other.
Distances from one workstation to another are represented as a weighted graph. Each vertex is a workstation, and each edge carries the distance to a neighboring workstation.
What is the best way to encode this problem for a genetic algorithm, and what would be suitable choices for crossover, mutation, and fitness functions?
I currently have an algorithm using OX1 crossover and random swapping, but in general it performs poorly and slowly. The chromosome is encoded as an array of objects representing FTEs, but since I'm green to all this, that probably isn't the best way to do things.
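One common starting point, sketched below with purely illustrative names and penalty weights (not tuned to your data), is to encode the chromosome as a permutation over seats rather than an array of FTE objects: position i holds the FTE sitting at the i-th workstation in a fixed ordering, so OX1 crossover and swap mutation always produce valid assignments, and the fitness becomes a weighted sum of constraint violations whose weights mirror the priority order.

    import random

    # Sketch only: permutation encoding for the seating problem.
    # workstations is a fixed list of workstation IDs; chromosome[i] is the FTE
    # seated at workstations[i] (or None for an empty seat), so OX1 crossover and
    # swap mutation always yield a valid assignment. dist(a, b) is assumed to
    # return the shortest-path distance between two workstations in the weighted
    # graph from the question (precomputed, e.g. with Floyd-Warshall).

    WEIGHTS = {"marketing": 1000, "group": 100, "manager": 50, "department": 10}

    def fitness(chromosome, ftes, workstations, dist, section_of):
        """Lower is better: weighted sum of constraint violations."""
        seat = {fte_id: workstations[i]
                for i, fte_id in enumerate(chromosome) if fte_id is not None}
        penalty = 0
        for f in ftes.values():
            # Marketing must sit in section 2 (treating sections 1 and 3 as the
            # adjacent overflow sections is an assumption; use your floor plan).
            if f.department == "Marketing" and section_of[seat[f.id]] not in (1, 2, 3):
                penalty += WEIGHTS["marketing"]
            # Group members (same manager) must be within 5 units of each other.
            for g in ftes.values():
                if g.id != f.id and g.manager_id == f.manager_id:
                    if dist(seat[f.id], seat[g.id]) > 5:
                        penalty += WEIGHTS["group"]
            # The manager (<=10 units) and department (<=5 units) rules follow
            # the same pattern and are omitted here for brevity.
        return penalty

    def mutate(chromosome, rate=0.02):
        """Swap mutation: exchange two seats with a small probability."""
        c = list(chromosome)
        for i in range(len(c)):
            if random.random() < rate:
                j = random.randrange(len(c))
                c[i], c[j] = c[j], c[i]
        return c

A slow fitness function is often the real bottleneck: precomputing all-pairs shortest paths once, instead of searching the graph inside every evaluation, usually helps more than changing the crossover operator.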

Looking to get started with FPGAs — speed up? [closed]

I'm very interested in learning FPGA development. I've found a bunch of "getting started with FPGA" questions here, and other tutorials and resources on the internet. But I'm primarily interested in using FPGAs as an accelerator, and I can't figure out what devices will actually offer a speed up over a desktop CPU (say a recent i7).
My particular interest at the moment is cellular automata (and other parallel environments like neural networks and agent-based modeling). I'd like to experiment with 3D or higher-dimensional cellular automata. My question is: will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speed-up over a desktop CPU? Or would I need to spend more and get a higher-end FPGA?
An FPGA can be a very good accelerator, but (and this is a big BUT) it is usually very expensive. We have machines here like a BEEcube, a Convey, or the Dini Group's "Godzilla's Part-Time Nanny", and they are all very expensive (>$10k), and even with these machines, many applications can be accelerated better with a standard CPU cluster or GPUs. The FPGA looks a bit better when the total cost of ownership is considered, since it usually gives better energy efficiency.
But there are applications that you can accelerate. On the lower end of the scale you can (and should) do a rough estimate of whether it's worth it for your application, but for that you need more concrete numbers about your application. Consider a standard desktop CPU: it usually has at least 4 cores (or two cores with hyper-threading, not to mention the vector units) and clocks at, say, 3 GHz. That gives 12 Gcycles per second of computational power. With cheap FPGAs you can get to 250 MHz (better ones can reach up to 500 MHz, but that requires very friendly designs and very good speed grades), so you need approximately 50 operations in parallel to compete with the CPU (actually it's a bit better than that, because the CPU's operations are usually not single-cycle, but then the CPU also has vector operations, so call it even).
50 operations in parallel sounds like a lot, and it is hard, but it is doable (the magic word here is pipelining). So you should know exactly how you are going to implement your design in hardware and what degree of parallelism you can exploit.
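Just to make the break-even arithmetic explicit, here is the same rough estimate written out (the numbers are the approximate ones used above, not measurements):

    # Rough break-even estimate using the figures from this answer.
    cpu_cores = 4
    cpu_clock_hz = 3e9        # ~3 GHz desktop CPU
    fpga_clock_hz = 250e6     # ~250 MHz for a cheap FPGA design

    cpu_cycles_per_second = cpu_cores * cpu_clock_hz            # ~12 Gcycles/s
    parallel_ops_needed = cpu_cycles_per_second / fpga_clock_hz
    print(f"~{parallel_ops_needed:.0f} operations per FPGA cycle to keep up")
    # -> ~48, i.e. the "approximately 50 operations in parallel" mentioned above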
Even if you solve the parallelism problem, we come to the real problem: memory.
The accelerators mentioned above have so much compute capacity that they could do thousands of things in parallel, but the real problem with that much computational power is getting the data into and out of them. And you have this problem at your small scale too. In your desktop PC the CPU transfers more than 20 GB/s to and from memory (good GPU cards manage 100 GB/s and more), while your small $100-$200 accelerator gets at most (if you are lucky) 1-2 GB/s over PCI Express.
Whether it is worth it for you depends completely on your application (and here you need far more detail than "3D cellular automata": you must know the neighbourhoods, the required precision (double, single-precision float, integer, or fixed point?), and your use case (do you transfer the initial cell values, let the machine compute for two days, and then transfer the cell values back, or do you need the cell values after every step? That makes a huge difference in the bandwidth required during computation)).
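To make the bandwidth point concrete, here is a toy estimate for one possible 3D automaton; the grid size, cell width, and update rate are made-up numbers you would replace with your own:

    # Toy bandwidth estimate for streaming cell values out after every step.
    # All parameters below are illustrative assumptions, not measurements.
    cells = 256 ** 3           # a 256 x 256 x 256 automaton
    bytes_per_cell = 4         # e.g. one 32-bit state or single-precision float
    steps_per_second = 100     # desired update rate if results leave the board

    needed_bw = cells * bytes_per_cell * steps_per_second  # bytes/s, one direction
    pcie_bw = 1.5e9            # the ~1-2 GB/s quoted above for a cheap board

    print(f"needed: {needed_bw / 1e9:.1f} GB/s, available: {pcie_bw / 1e9:.1f} GB/s")
    # -> ~6.7 GB/s needed vs ~1.5 GB/s available: per-step transfers will not fit,
    #    which is why keeping the data on the card for many steps matters so much.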
But overall, without knowing more, I would say: it is worth the $100-$200.
Not because you will be able to compute your cellular automata faster (which I don't believe), but because you will learn. You will not only learn hardware design and FPGA development; I see with our students here that, along with hardware design knowledge, they always gain a far better understanding of how hardware actually looks and behaves. Sure, nothing you do on your FPGA is directly related to the interior of a CPU, but many get a better feeling for what hardware in general is capable of, which in turn makes them even more effective software developers.
But I also have to admit: you are going to pay a much higher price than just the $100-$200: you will have to spend a great deal of time on it.
Disclaimer: I work for a reconfigurable system developer/manufacturer.
A short answer to your question "will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speed up over a desktop CPU" is: probably not.
A longer answer:
A microprocessor is a set of fixed, shared functional units tuned to perform reasonably well across a broad range of applications. The operating system and compilers do a good job of making sure that these fixed, shared functional units are utilized appropriately.
FPGA based systems get their performance from dedicated, dense, computational efficiency. You create exactly what you need to execute your application, no more, no less - and whatever you create is not shared with any other user, process, operating system, whatever. If you need 80 floating point units, you create 80 dedicated floating point units that run in parallel. Compare that to a microprocessor scheduling floating point operations across some smaller number of floating point units. To get performance faster than a microprocessor, you have to instantiate enough dedicated FPGA-based functional units to make a performance difference vs. a microprocessor. This often requires the resources in the larger FPGA devices.
An FPGA alone is not enough. If you create a large number of efficient computational engines in an FPGA you -have- to keep these engines fed with data. This requires some number of high-bandwidth connections to large amounts of data memory around the FPGA. What you often see with I/O-based FPGA cards is that some of the potential performance gain is lost in moving data back and forth across the I/O bus.
As a data point, my company uses the '530 Stratix IV FPGA from Altera. We surround it with several directly coupled memories and tie this subsystem directly into the microprocessor memory. We get several advantages over microprocessor systems for many applications, but this is not a $100-$200 starter kit, this is a full-blown integrated system.

MPI and cluster

I have just started learning to program on a supercomputer consisting of ~100 nodes; each node has 4 Xeon CPUs and 64 GB of RAM.
What I want to do is assign jobs to each node and then create local multi-threaded programs on each node. What I want to know is: by default, when MPI creates a group of processes, is there a 1-to-1 mapping between each task process and one particular local node or not? (In my case, a node has 4 Xeon CPUs with 24 cores in total and 64 GB of RAM.)
MPI will run M processes on N nodes where M may be less than, equal to, or greater than N.
This site describes the setup.
I can't find a direct answer to your question, but there are a number of sites on the internet discussing process migration and checkpointing. The general theme of these sites seems to be that this is still very much a work in progress, so I wouldn't expect it to happen automagically in your MPI implementation.
This site discusses the MPI_GET_PROCESSOR_NAME call, which can be used in process migration, but states that "nothing in MPI requires or defines process migration; this definition of MPI_GET_PROCESSOR_NAME simply allows such an implementation". With this call, you can at least check whether your code is being actively migrated.
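A quick way to see which mapping your MPI implementation actually gives you is to have every rank report its processor name. Here is a minimal sketch with mpi4py (the equivalent C calls are MPI_Comm_rank and MPI_Get_processor_name):

    # Print which node each MPI rank landed on.
    # Run with, e.g.: mpirun -np 8 python where_am_i.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    node = MPI.Get_processor_name()   # typically the hostname of the node

    placements = comm.gather((rank, node), root=0)
    if rank == 0:
        for r, n in sorted(placements):
            print(f"rank {r} -> {n}")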

Cluster computer

I am trying to build a parallel processing computer.
I have
10 Windows 7 64-bit machines
3 Ubuntu Linux machines
1 Windows 2008 server
around 1 km of network cable
3-4 switches
My need:
Make my animation rendering faster by clustering these computers.
I am using 3ds Max for my project, and I am doing medical animations/videos. What is the best way to achieve this? I am not that good at networking, but I know the basics.
And one more question:
Suppose I build a cluster of Windows PCs; if I then connect a Linux machine to it, will that be any good?
Thanks in Advance
I think your question is interesting but off-topic here. However, some fragments of an answer:
These two books cover most of the topics necessary for building a home-brew (or even much more sophisticated) cluster out of a pile of PCs: this one is for Linux and this one is for Windows. They're both a bit out of date, which is probably more serious for the Windows version, but they give good coverage of the necessary topics. Do some of your own research too; Google topics such as Beowulf clustering and Condor, the latter being a system for scavenging spare cycles from networked computers.
I think heterogeneous clusters, with machines running different OSes, will be a little more difficult to build and configure than homogeneous ones, but the degree of difficulty will be in proportion to the degree of integration you seek.
Your topic, rendering movie frames, falls into the class of embarrassingly parallel programs, and there are two general approaches:
you simply pass frames one at a time to computers that work independently of each other; the difficulty (if there is any) comes in ensuring load balancing, that is, getting each computer to work as hard as every other computer (see the sketch after this list). This could look like a network of PCs independently reading frames from networked storage, and scarcely like a cluster at all.
you build a rendering pipeline: computer 1 does render operation 1 on frame 1, then passes the frame to computer 2, which does operation 2 on frame 1, while computer 1 starts render operation 1 on frame 2, and so on; again, you have to pay attention to keeping the pipeline full and busy.
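The first approach is essentially a job queue. Here is a minimal sketch of that idea in Python; the hostnames and frame range are placeholders, and in practice a render-farm manager such as Autodesk Backburner (which ships with 3ds Max) already does this kind of frame distribution for you:

    # Minimal master/worker frame queue: each worker pulls the next unrendered
    # frame, so faster machines naturally take more frames (load balancing).
    import queue
    import threading

    frames = queue.Queue()
    for frame in range(1, 301):                 # placeholder frame range
        frames.put(frame)

    def worker(host):
        while True:
            try:
                frame = frames.get_nowait()
            except queue.Empty:
                return
            # Placeholder: a real version would launch the renderer for this
            # frame on `host` (e.g. a command-line render over the network).
            print(f"render frame {frame} on {host}")
            frames.task_done()

    hosts = [f"render{i}" for i in range(1, 11)]   # e.g. your 10 Windows machines
    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()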
