Is it a given that utilizing multicores will result in a speedup of an
application? Amdahl’s Law is not the only thing that plays a role in
the speedup of an application.
In general, if speedup is the sole objective when adding
multiprocessors, the following must hold true: (1) the processor is
overloaded and is not processing the work available in a satisfactory
time frame; (2) the workload contains elements that can be divided and
worked on in parallel; and (3) a suitably faster processor cannot
provide the processing power needed to handle the workload in a
satisfactory time.
Part 1 in this series examined the “classic” reasons why one does
not get a proportional increase in performance by adding processors to
a computing machine. Most, if not all, of them were based in some form
or fashion on Amdahl’s Law.
Basically, Amdahl’s Law states that the upper limit on the speedup
gained by adding processors is determined by the amount of serial code
contained in the application. Code ends up serialized for more than one
reason. Sometimes the serialization is written explicitly into the
code. In other cases code becomes serialized because it shares
resources, including data: only one processor or core can access
shared data at a time.
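To make the bound concrete, here is a minimal Python sketch of the usual
statement of Amdahl's Law, speedup = 1 / (s + (1 - s) / n), where s is
the serial fraction of the work and n is the number of cores. The
function name and the 10% serial fraction are illustrative assumptions,
not figures from this series.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup for a workload with the given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part always runs on one core; only the rest is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed example: 10% of the application is serial.
    for n in (1, 2, 4, 8, 64):
        print(f"{n:3d} cores -> at most {amdahl_speedup(0.10, n):.2f}x")
```

With a 10% serial fraction the bound approaches, but never reaches, 10x
no matter how many cores are added.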
The next step in exploring multicore processing, and whether or not it
will benefit your application, is the hardware. Most embedded designs
use shared memory (all cores are able to access some or all of the
memory on the chip), and the cores can communicate with each other in
some fashion. For most applications, the addition of more cores does
not lead to a proportional increase in performance.
It is how the cores are combined and how they utilize memory and
other resources and their communication topologies that differentiate
the architectures into the various classes of multicore architectures.
From a hardware perspective, there are several different standard
approaches used in designing multicore hardware.
I use the term “standard approaches” because there have been many
different ways that cores, memory, resource access and communication
mechanisms have been combined. Luckily, for the embedded space, most of
that experimentation was addressed long before embedded designs
incorporated multiple cores.
This article looks at two of the most commonly used multicore
designs for the embedded space: Symmetric Multi-Processing (SMP) and
Asymmetric Multi-Processing (AMP).
SMP Hardware Architectures
SMPs are characterized by the symmetrical nature of their organization
and layout. They utilize two or more identical (homogeneous) cores that
have access to a common shared memory. Another attribute of SMP
architectures is that not only is each core identical, but each core
has identical access to all the resources of the system: memory, disk,
UARTs, Ethernet controllers, etc. Analog Devices’ Blackfin 561 and
ARM’s MPCore are just two examples of SMPs.
SMPs are a cost-effective way of increasing performance in a system,
compared with replicating the entire system: core, memory, IO and other
resources. Multicore designers realize that the most heavily used
component of a processor is the core. The other resources of a
processor are idle much of the time relative to the core, since a
single core cannot utilize multiple resources concurrently. As a
result, the odds are that different cores can operate in parallel by
utilizing different resources and data streams.
Figure 1: Symmetric Multiprocessing Hardware
Asymmetric Hardware Architectures
AMPs are characterized by the non-symmetrical nature of their
architecture. In other words, AMP architectures are not bound by the
SMP rules that all cores be identical and that every core have equal
access to shared resources. Similar to SMP-based architectures, AMPs
seek additional performance gains by adding multiple cores.
Unlike SMP architectures, AMP architectures seek additional
performance gains by utilizing different cores or hardware
configurations that are optimized to do very specific activities. Texas
Instruments’ OMAP and Freescale’s i.MX family of application processors
are two examples of a heterogeneous AMP architecture. The OMAP and i.MX
families combine a general purpose MCU with a digital signal processing
(DSP) core.
DSPs are capable of doing highly specialized mathematical operations
very efficiently when compared to a general purpose MCU. What may take
a general purpose MCU hundreds of cycles to do, a DSP can accomplish in
only a few cycles. By combining the cores in this fashion, designers
not only provide for the division of labor among cores, but also, like
Adam Smith’s pin manufacturing example given in the first article,
practice the concept of labor specialization. Due to the specialized
nature of an AMP architecture, its area of application is narrower
than that of a more general purpose SMP architecture.
Because of the complex nature of the devices for which multicores
are an appealing solution, many multicore devices use a real-time
operating system (RTOS) to provide operating system services such as
scheduling, communication and task management. RTOSs suitable for
multi-core architectures are as varied and specialized as the
architectures they are geared toward. In general, they can be segmented
into two camps, those suitable for SMP architectures, and those
suitable for AMP-based architectures.
Figure 2: Asymmetric Multiprocessing Hardware
Real-time operating systems are regarded as AMP or SMP because they
exploit the different hardware attributes of AMP and SMP architectures.
Since all the cores or processors that make up the SMP architecture are
exactly the same, SMP RTOSs are characterized by the use of a single
instance of an RTOS image that runs on all the different cores at the
same time. SMP RTOSs perform what is called “load balancing.” Load
balancing involves parceling out ready-to-run tasks to available (idle)
processors.
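As a rough illustration of load balancing, the Python sketch below hands
each ready-to-run task to whichever core becomes idle first. It is a toy
model under stated assumptions: task durations are known in advance,
there are no priorities or preemption, and the names balance and
idle_at are hypothetical. A real SMP scheduler is considerably more
involved.

```python
import heapq


def balance(task_durations, num_cores):
    """Dispatch ready tasks, in order, to whichever core becomes idle first.

    Returns (assignments, makespan): assignments[i] lists the task indices
    run on core i, and makespan is the time at which the last core finishes.
    """
    if num_cores < 1:
        raise ValueError("need at least one core")
    # Min-heap of (time the core becomes idle, core id); all cores start idle.
    idle_at = [(0, core) for core in range(num_cores)]
    heapq.heapify(idle_at)
    assignments = [[] for _ in range(num_cores)]
    for task, duration in enumerate(task_durations):
        free_time, core = heapq.heappop(idle_at)   # earliest idle core
        assignments[core].append(task)
        heapq.heappush(idle_at, (free_time + duration, core))
    makespan = max(finish for finish, _ in idle_at)
    return assignments, makespan


if __name__ == "__main__":
    print(balance([4, 2, 2, 3], num_cores=2))
```

In this model no task is tied to a core ahead of time; the placement
falls out of which core happens to be free when the task becomes ready.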
Spin locks and SMP: a review
Since SMP architectures are based on equal access to resources,
resource protection is a fundamental requirement in SMP systems. As
will be demonstrated in the next few examples, spin lock granularity
has a major effect on the efficient operation of an SMP system and is
worthy of further examination. The primary mechanism for maintaining
coherency in an operating system, as well as in its applications and
data, is the spin lock.
Spin locks are a logical abstraction that acts as a gatekeeper to
resources such as shared data and tables, controllers and kernel
services such as the scheduler. They are typically implemented as
“atomic” test and set locks. In other words, the first task to gain
access to the spin lock gets it and keeps the spin lock until the task
is finished with it. Special hardware arbitrates between two or more
tasks that attempt to gain access to a resource simultaneously. Tasks
that do not get access will “spin” until the resource becomes
available.
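The behavior described above can be sketched in a few lines of Python.
This is only a model: threads stand in for tasks on different cores, and
a non-blocking acquire on a threading.Lock stands in for the hardware's
atomic test-and-set instruction. The SpinLock class and the helper names
are hypothetical.

```python
import threading
import time


class SpinLock:
    """Test-and-set spin lock sketch (threads stand in for cores)."""

    def __init__(self):
        # Stand-in for the lock bit that hardware tests and sets atomically.
        self._flag = threading.Lock()

    def acquire(self):
        # acquire(blocking=False) plays the role of "test and set": it
        # returns True only for the one caller that found the flag clear.
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # keep spinning, but let the lock holder run

    def release(self):
        self._flag.release()


def hammer(lock, counter, iterations):
    for _ in range(iterations):
        lock.acquire()
        try:
            counter[0] += 1  # the protected shared resource
        finally:
            lock.release()


def run_demo(num_tasks=4, iterations=2000):
    lock, counter = SpinLock(), [0]
    tasks = [threading.Thread(target=hammer, args=(lock, counter, iterations))
             for _ in range(num_tasks)]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return counter[0]


if __name__ == "__main__":
    print(run_demo())
```

Every cycle a task spends in the while loop is a cycle in which its core
does no useful work, which is why the length of time a lock is held
matters so much.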
System and application designers have to balance two conflicting
goals when determining the granularity of spin locks. From one
perspective, the more spin locks a designer uses to protect resources,
the more these resources may be utilized in parallel. The downside is
that as the locking scheme becomes finer-grained, the overhead
associated with maintaining the locks grows. Furthermore, as the
locking scheme gets finer, the opportunity increases for a deadlock to
occur. The addition of anti-deadlock algorithms also adds to the
overhead.
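One widely used way to rule out a common class of deadlocks with
fine-grained locks is to always take them in a single global order. The
article does not name a specific anti-deadlock algorithm, so the Python
sketch below is an assumption chosen for illustration; Resource, rank
and the helper functions are hypothetical names.

```python
import threading


class Resource:
    """A shared resource guarded by its own lock."""

    def __init__(self, rank, name):
        self.rank = rank              # position in the global lock order
        self.name = name
        self.lock = threading.Lock()


def acquire_in_order(*resources):
    """Take every lock in ascending rank, whatever order the caller used."""
    ordered = sorted(resources, key=lambda r: r.rank)
    for resource in ordered:
        resource.lock.acquire()
    return ordered


def release_all(held):
    for resource in reversed(held):
        resource.lock.release()


def worker(first, second, rounds):
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        try:
            pass  # use both resources here
        finally:
            release_all(held)


def run_demo(rounds=1000):
    table, uart = Resource(0, "routing_table"), Resource(1, "uart")
    # The two tasks name the resources in opposite order, the classic
    # recipe for deadlock if each simply locked its first argument first.
    tasks = [
        threading.Thread(target=worker, args=(table, uart, rounds), daemon=True),
        threading.Thread(target=worker, args=(uart, table, rounds), daemon=True),
    ]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join(timeout=10)
    return not any(task.is_alive() for task in tasks)


if __name__ == "__main__":
    print("finished without deadlock:", run_demo())
```

The sort on every acquisition is a small example of the extra overhead
that finer-grained locking brings with it.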
Since lock contention tends to serialize execution, it is
intuitively obvious that the shorter the amount of time a lock is held,
the less the potential for serializing to occur. The following
techniques are used in both RTOSs and multicore programming to reduce
the time that a lock is held.
Table 1
Symmetric OSes: master/slave and the alternatives
Primitive SMP RTOS implementations usually have a single processor that
is designated as the master, and activities like interrupt handling,
resource arbitration and scheduling are performed on it. The master
processor is statically defined or determined at boot. In more advanced
SMP implementations, load balancing is used for interrupt and
scheduling tasks, and responsibility floats among the various
processors. Load balancing RTOS-related tasks provides an ideal
platform for “high availability” or “hot swap” capabilities.
SMP RTOSs that use the master/slave approach are the easiest to
implement but provide the least performance increase of all the
different SMP implementations. One reason for this is that the master
acts as a bottleneck to providing kernel services. As with any other
resource in an SMP RTOS, the kernel services must be protected from
simultaneous access by different processes.
One solution involves running the kernel in spin-lock mode, where
only one kernel service at a time is serviced (for example, a memory
create). Other processors seeking concurrent access to kernel services
(for example, to the file system) have to wait until the first process
is finished. Although the system services are protected from competing
processes, the system operates inefficiently.
An improvement on the basic master/slave approach would be to
allocate the different kernel services to different spin locks. For
example, the scheduler, interrupt controller and file system would each
have a spin lock. Increasing the granularity of the kernel protection
data structures in this way allows the kernel to service more tasks
concurrently. Now, only processes that compete for identical kernel
services will spin, waiting on a lock to be released.
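The difference between one kernel-wide lock and one lock per service can
be modeled in Python as follows. The Kernel class, its service names and
the try_enter/leave methods are hypothetical, and a non-blocking acquire
is used so the example reports whether a caller would have had to spin
instead of actually spinning.

```python
import threading


class Kernel:
    """Toy kernel with either one giant lock or one lock per service."""

    SERVICES = ("scheduler", "interrupt_controller", "file_system")

    def __init__(self, fine_grained):
        if fine_grained:
            # One spin lock per kernel service.
            self._locks = {name: threading.Lock() for name in self.SERVICES}
        else:
            # Spin-lock mode: every service shares the same lock.
            giant = threading.Lock()
            self._locks = {name: giant for name in self.SERVICES}

    def try_enter(self, service):
        """Return True if the service can be entered without spinning."""
        return self._locks[service].acquire(blocking=False)

    def leave(self, service):
        self._locks[service].release()


if __name__ == "__main__":
    for fine in (False, True):
        kernel = Kernel(fine_grained=fine)
        kernel.try_enter("scheduler")                # one core is in the scheduler
        free = kernel.try_enter("file_system")       # another wants the file system
        print(f"fine_grained={fine}: file system available -> {free}")
```

With the single lock, the file system request is refused while the
scheduler is busy; with per-service locks, only a second scheduler
request would be.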
A superior solution to the two implementations presented above is an
SMP RTOS that, in addition to “load balancing” different tasks, also
load balances kernel services. By threading the kernel services and
allowing them to run in parallel on different microprocessors, the
possibility of an unavailable kernel service decreases dramatically.
Table 2
There are other less obvious advantages to using SMP. Design teams
typically focus on a single problem domain (task) at a time. Segmenting
the problem domain into tasks is not only a natural approach from the
human problem solving perspective, it is the same one used in the
uni-processor world. Since this is how most companies’ design teams
work, there is no need to re-organize an engineering team. Designing a
threaded program is also a familiar technique used in the uni-processor
world. No new paradigms are needed to utilize SMP effectively.
Compiler technology can also be applied to problems that benefit
from decomposing the problem into smaller discrete parts that can be
processed independently from each other. An example of this would be
array processing where the array can be decomposed so that parts may be
attacked independently of each other. It is unlikely that the embedded
arena will benefit from this type of solution anytime soon as it is
typically applied to architectures that have dozens if not hundreds of
processors.
Table 3
RTOS strategies for asymmetric multiprocessing
Unlike SMP RTOSs, RTOSs for AMP architectures do not require that the
hardware be symmetric or asymmetric. The primary characteristic of an
AMP RTOS is that it only runs on a single core or a single processor.
Put simply, an AMP solution requires one RTOS image for each core in
the design. An AMP RTOS is the one that everyone is familiar with and
has used in the embedded space for the last 50 years. It could be a
“roll your own” or a commercial RTOS. Regardless, the RTOS image is
compiled for that core and only sees the resources that the designer
dictates.
The fact that AMP-targeted RTOSs don’t require a symmetrical
hardware architecture, and that each core may or may not have access to
different resources, makes AMP very flexible in terms of how the total
system can be put together. This singular characteristic makes it a
very good candidate for use in a number of situations.
For heterogeneous architectures like the OMAP platform where one
core is a DSP and the other is an ARM microcontroller unit (MCU) core,
the AMP solution is the only one possible. One RTOS image is compiled
for and run on the DSP and the other is compiled for and run on the
MCU. Just like the hardware designers’ ability to provide cores that
have specific functionality, software developers deploy a mixture of
RTOS functionality for each core.
In the case of the OMAP platform, the DSP cannot see most of the
system resources like network connections and storage devices; it only
exists to crunch numbers. Therefore, developers of DSP applications
rarely require anything beyond rudimentary RTOS services such as a
scheduler and the ability to create and delete tasks.
The MCU developer, on the other hand, may need support for a
man-machine interface, GUI, file system, memory management and
networking and communication protocols. To minimize the combined
footprint of the RTOSes, each one may be scaled so that it provides
only the functionality needed by the application software running on
its core.
There is no requirement that AMP-based RTOSs have to run on
heterogeneous architectures. There are a number of situations where a
developer chooses to use an AMP operating system over an SMP
implementation.
One example would be in the case where deterministic real-time
deadlines have to be met. Unlike SMP-based solutions where the
scheduler and interrupt handling mechanism may be shared amongst
multiple cores, an AMP implementation, with its own scheduler and
interrupt handling mechanism on each core, can respond to interrupts
and deadlines without waiting to acquire a lock for a resource.
The developer can mix and match their RTOSs and hardware to provide
an optimal mix of performance and features. For example, the developer
can choose an optimized, deterministic, real-time RTOS for the cores
that respond to the real-time needs of the application, and RTOSs that
offer a rich palette of services for the other cores. Developers can
also choose third-party software for the non-real-time aspects of the
system.
In an effort to reduce costs associated with software safety
certifications, developers can partition the system so that only the
bare minimum code pertaining to the safe operation of the device runs
on one or more cores, and all the “non-safe” software runs on the
other(s). This may not make a difference for most applications, but for
those that must be certified, the final cost of developing code to meet
FAA, FDA and industrial standards starts at $60 to $100 per line of
code.
Partitioning and legacy code issues
Partitioning the system is a viable way of reducing overall costs of
delivering a safety certified system. Another reason to choose AMP is
the case where large amounts of legacy code were developed with a
specific RTOS in mind. In this case, it is easier to reuse the total
application intact, rather than port it to a new RTOS. Furthermore, it
would take a significant investment in resources to determine the
different interactions between the reused legacy code and the new code
in a load-balancing system.
AMP solutions provide the designer with total control over how their
system functions. AMP solutions leave nothing to chance. Each process
is tasked to a core and has a priority assigned to it. Since no kernel
resources are shared, AMP provides very deterministic behavior. If one
word could describe an AMP solution, it would be “predictable.” The
designer binds each task to a processor at compile time. If tasks
running on different cores need to interact, the designer governs how,
when and where they will interact.
One thing that SMP software developers take for granted is the
ability to communicate between the different cores. Typically,
AMP-based solutions have no built-in mechanism to provide a method for
communication and synchronization between the two cores.
Since many of the multicore hardware solutions for the embedded
space use shared memory, a very efficient inter-processor communication
(IPC) mechanism can be built between the cores. A basic implementation
would provide for data marshaling if needed, a table and set of message
buffers, a mechanism to protect and synchronize access to the message
buffers and a method to signal that the other core can access a message
buffer.
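The pieces listed above can be modeled in Python as a small mailbox.
Everything here is an assumption made for illustration: a bytearray
stands in for the shared RAM, a lock for the hardware protection
mechanism, a semaphore for the inter-core signal (for example a mailbox
interrupt), and the header layout, slot size and Mailbox API are
hypothetical.

```python
import struct
import threading
from collections import deque


class Mailbox:
    """One-directional mailbox in a region both 'cores' can address."""

    HEADER = struct.Struct("<HH")   # marshaled header: message id, payload length
    SLOT_SIZE = 32                  # bytes per message buffer (assumed)

    def __init__(self, slots=4):
        self._memory = bytearray(slots * self.SLOT_SIZE)  # stands in for shared RAM
        self._in_use = [False] * slots                    # the buffer table
        self._pending = deque()                           # filled slots, oldest first
        self._guard = threading.Lock()                    # protects table and queue
        self._doorbell = threading.Semaphore(0)           # "message ready" signal

    def send(self, msg_id, payload):
        """Copy a message into a free buffer; return False if none is free."""
        if len(payload) > self.SLOT_SIZE - self.HEADER.size:
            raise ValueError("payload too large for one buffer")
        with self._guard:
            if False not in self._in_use:
                return False                              # caller may retry later
            slot = self._in_use.index(False)
            self._in_use[slot] = True
            base = slot * self.SLOT_SIZE
            self.HEADER.pack_into(self._memory, base, msg_id, len(payload))
            start = base + self.HEADER.size
            self._memory[start:start + len(payload)] = payload
            self._pending.append(slot)
        self._doorbell.release()                          # signal the other core
        return True

    def receive(self, timeout=None):
        """Wait for the signal, then copy the oldest message out of its buffer."""
        if not self._doorbell.acquire(timeout=timeout):
            return None
        with self._guard:
            slot = self._pending.popleft()
            base = slot * self.SLOT_SIZE
            msg_id, length = self.HEADER.unpack_from(self._memory, base)
            start = base + self.HEADER.size
            payload = bytes(self._memory[start:start + length])
            self._in_use[slot] = False                    # hand the buffer back
        return msg_id, payload
```

On real hardware the two sides would run different RTOS images, so the
header layout and byte order would have to be agreed on by both builds;
that agreement is the data marshaling step.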
A last word about tools
One final aspect of multicore development is the availability of tools
for the RTOS and the hardware target. Debuggers, like multicore
hardware and software solutions, vary in sophistication. SMP debuggers
need to be able to determine what tasks have been allocated to
different cores to run. The inability to set and trigger breakpoints
across multiple cores makes finding and solving some bugs very
difficult. Choosing the development environment used for multicore
development can be as important a decision in the success of a project
as the RTOS and hardware platform. The next article will address the
topic of multicore development tools in detail.
To read Part 1 in this series, go to “The Pros and Cons of Multicore
Architectures.”
Todd Brian is product marketing manager in the Accelerated Technology
group at Mentor Graphics Inc.