ARM ARCHITECTURE

The 'ARM' architecture (previously, the 'Advanced RISC Machine', and prior to that 'Acorn RISC Machine') is a 32-bit RISC processor architecture developed by ARM Limited that is widely used in a number of embedded designs. Because of their power saving features, ARM CPUs are dominant in the mobile electronics market, where low power consumption is a critical design goal.
Today, the ARM family accounts for approximately 75% of all embedded 32-bit RISC CPUs,[1] making it one of the most prolific 32-bit architectures in the world. ARM CPUs are found in all corners of consumer electronics, from portable devices (PDAs, mobile phones, media players, handheld gaming units, and calculators) to computer peripherals (hard drives, desktop routers). Important branches in this family include Marvell's XScale and the Texas Instruments OMAP series.

Contents
History
The cores
Design notes
Thumb
Jazelle
Thumb-2
Thumb Execution Environment (ThumbEE)
Advanced SIMD (NEON)
VFP
Security Extensions (TrustZone)
ARM licensees
Approximate licensing costs
See also
References
External links

History


A Conexant ARM processor used mainly in routers

The ARM design was started in 1983 as a development project at Acorn Computers Ltd.
The team, led by Roger Wilson and Steve Furber, started development of what in some ways resembles an advanced MOS Technology 6502. Acorn had a long line of computers based on the 6502, so a chip that was similar to program could represent a significant advantage for the company.
The team completed development samples called 'ARM1' by April 1985[2], and the first "real" production systems as 'ARM2' the following year. The ARM2 featured a 32-bit data bus, a 26-bit address space giving a 64 Mbyte address range, and sixteen 32-bit registers. One of these registers served as the (word aligned) program counter, with its top 6 bits and lowest 2 bits holding the processor status flags. The ARM2 was possibly the simplest useful 32-bit microprocessor in the world, with only 30,000 transistors (compare with Motorola's 68000, six years older, with around 70,000 transistors). Much of this simplicity comes from not having microcode (which represents about one-fourth to one-third of the 68000) and, like most CPUs of the day, not including any cache. This simplicity led to its low power usage, while performing better than the Intel 80286. A successor, 'ARM3', was produced with a 4KB cache, which further improved performance.
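The combined program counter and status register described above can be unpacked with a few masks. The following is a minimal sketch in C, not taken from any Acorn source; the struct and function names are hypothetical, and the bit assignment (N, Z, C, V in bits 31-28, interrupt disable bits in 27-26, processor mode in bits 1-0) is the layout assumed here for the 26-bit cores.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of R15 on a 26-bit ARM core (layout assumed for this sketch):
   bits 31-28 = N, Z, C, V flags
   bit  27    = IRQ disable, bit 26 = FIQ disable
   bits 25-2  = word-aligned program counter
   bits 1-0   = processor mode */
typedef struct {
    uint32_t pc;            /* byte address, below 64 Mbyte */
    unsigned n, z, c, v;    /* condition flags */
    unsigned irq_disabled;
    unsigned fiq_disabled;
    unsigned mode;          /* 0..3 */
} r15_fields;

r15_fields decode_r15(uint32_t r15)
{
    r15_fields f;
    f.pc           = r15 & 0x03FFFFFCu;   /* keep bits 25-2 only */
    f.n            = (r15 >> 31) & 1u;
    f.z            = (r15 >> 30) & 1u;
    f.c            = (r15 >> 29) & 1u;
    f.v            = (r15 >> 28) & 1u;
    f.irq_disabled = (r15 >> 27) & 1u;
    f.fiq_disabled = (r15 >> 26) & 1u;
    f.mode         = r15 & 0x3u;
    assert((f.pc & 3u) == 0);             /* the PC is always word aligned */
    return f;
}
```

Because the address field is only 24 bits wide and addresses whole words, the largest program counter value is 0x03FFFFFC, which is where the 64 Mbyte limit comes from.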
In the late 1980s Apple Computer started working with Acorn on newer versions of the ARM core. The work was so important that Acorn spun off the design team in 1990 into a new company called Advanced RISC Machines Ltd. For this reason, ARM is sometimes expanded as 'Advanced RISC Machine' instead of Acorn RISC Machine. Advanced RISC Machines became ARM Ltd when its parent company, ARM Holdings plc, floated on the London Stock Exchange and NASDAQ in 1998.[1]
This work would eventually turn into the 'ARM6'. The first models were released in 1991, and Apple used the ARM6-based ARM 610 as the basis for their Apple Newton PDA. In 1994, Acorn used the ARM 610 as the main CPU in their Risc PC computers.
The core has remained largely the same size throughout these changes. ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. The idea is that the Original Design Manufacturer combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on old semiconductor fabs and still deliver lots of performance at a low cost.
ARM's business has always been to sell IP cores, which licensees use to create microcontrollers and CPUs based on this core. The most successful implementation has been the ARM7TDMI with hundreds of millions sold in almost every kind of microcontroller equipped device.
DEC licensed the architecture (which caused some confusion because they also produced the DEC Alpha) and produced the 'StrongARM'. At 233 MHz this CPU drew only 1 watt of power (more recent versions draw far less). This work was later passed to Intel as a part of a lawsuit settlement, and Intel took the opportunity to supplement their aging i960 line with the StrongARM. Intel later developed its own high performance implementation known as 'XScale' which it has since sold to Marvell.
The common architecture supported on smartphones, Personal Digital Assistants and other handheld devices is 'ARMv4'. XScale and ARM926 processors are 'ARMv5TE', and are now more numerous in high-end devices than the StrongARM, ARM925T and ARM7TDMI based ARMv4 processors. The architecture version is shown in the 'Arch' column below.

The cores


| Family | Arch | Core | Feature | Cache (I/D)/MMU | Typical MIPS @ MHz | In application |
(An empty Family or Arch cell means the value from the row above applies.)
| ARM1 | ARMv1 | ARM1 | | None | | ARM Evaluation System second processor for BBC Micro |
| ARM2 | ARMv2 | ARM2 | Architecture 2 added the MUL (multiply) instruction | None | 4 MIPS @ 8MHz | Acorn Archimedes, Chessmachine |
| | ARMv2a | ARM250 | Integrated MEMC (MMU), Graphics and IO processor. Architecture 2a added the SWP and SWPB (swap) instructions. | None, MEMC1a | 7 MIPS @ 12MHz | Acorn Archimedes |
| ARM3 | ARMv2a | ARM2a | First use of a processor cache on the ARM. | 4K unified | 12 MIPS @ 25MHz | Acorn Archimedes |
| ARM6 | ARMv3 | ARM60 | v3 architecture first to support addressing 32 bits of memory (as opposed to 26 bits) | None | 10 MIPS @ 12MHz | 3DO Interactive Multiplayer |
| | | ARM600 | Cache and coprocessor bus (for FPA10 floating-point unit). | 4K unified | 28 MIPS @ 33MHz | |
| | | ARM610 | Cache, no coprocessor bus. | 4K unified | 17 MIPS @ 20MHz | Acorn Risc PC 600, Apple Newton 100 series |
| ARM7 | ARMv3 | ARM700 | | 8KB unified | 40MHz | |
| | | ARM710a | | 8KB unified | 40MHz | Acorn Risc PC 700, Apple eMate 300 |
| | | ARM7100 | Integrated SoC. | 8KB unified | 18MHz | Psion Series 5 |
| | | ARM7500 | Integrated SoC. | 4KB unified | 40MHz | Acorn A7000 |
| | | ARM7500FE | Integrated SoC. "FE" Added FPA and EDO memory controller. | 4KB unified | 56MHz | Acorn A7000+ |
| ARM7TDMI | ARMv4T | ARM7TDMI(-S) | 3-stage pipeline | none | 15 MIPS @ 16.8 MHz | Game Boy Advance, Nintendo DS, iPod, Lego NXT |
| | | ARM710T | | 8KB unified, MMU | 36 MIPS @ 40 MHz | |
| | | ARM720T | | 8KB unified, MMU | 60 MIPS @ 59.8 MHz | Zipit |
| | | ARM740T | | MPU | | |
| | ARMv5TEJ | ARM7EJ-S | Jazelle DBX, Enhanced DSP instructions, 5-stage pipeline | none | | |
| StrongARM | ARMv4 | SA-110 | | 16KB/16KB | 200MHz | Apple Newton 2x00 series |
| ARM9TDMI | ARMv4T | ARM9TDMI | 5-stage pipeline | none | | |
| | | ARM920T | | 16KB/16KB, MMU | 200 MIPS @ 180 MHz | Armadillo, GP32, GP2X (first core), Tapwave Zodiac (Motorola i.MX1), HP 49g+, Sun SPOT |
| | | ARM922T | | 8KB/8KB, MMU | | |
| | | ARM940T | | 4KB/4KB, MPU | | GP2X (second core) |
| ARM9E | ARMv5TE | ARM946E-S | Enhanced DSP instructions | variable, tightly coupled memories, MPU | | Nintendo DS, Nokia N-Gage, Conexant 802.11 chips |
| | | ARM966E-S | | no cache, TCMs | | ST Micro STR91xF, includes Ethernet [2] |
| | | ARM968E-S | | no cache, TCMs | | |
| | ARMv5TEJ | ARM926EJ-S | Jazelle DBX, Enhanced DSP instructions | variable, TCMs, MMU | 220 MIPS @ 200 MHz | Mobile phones: Sony Ericsson (K, W series), Siemens and Benq (x65 series and newer), Texas Instruments OMAP1710 |
| | ARMv5TE | ARM996HS | Clockless processor, Enhanced DSP instructions | no caches, TCMs, MPU | | |
| ARM10E | ARMv5TE | ARM1020E | (VFP), 6-stage pipeline, Enhanced DSP instructions | 32KB/32KB, MMU | | |
| | | ARM1022E | (VFP) | 16KB/16KB, MMU | | |
| | ARMv5TEJ | ARM1026EJ-S | Jazelle DBX, Enhanced DSP instructions | variable, MMU or MPU | | |
| XScale | ARMv5TE | 80200/IOP310/IOP315 | I/O Processor, Enhanced DSP instructions | | | |
| | | 80219 | | | 400/600MHz | Thecus N2100 |
| | | IOP321 | | | 600 BogoMips @ 600 MHz | Iyonix |
| | | IOP33x | | | | |
| | | IOP34x | 1-2 core, RAID Acceleration | 32K/32K L1, 512K L2, MMU | | |
| | | PXA210/PXA250 | Applications processor, 7-stage pipeline | | | Zaurus SL-5600, iPAQ H3900 |
| | | PXA255 | | 32KB/32KB, MMU | 400 BogoMips @ 400 MHz | Gumstix, Palm Tungsten E2, Mentor Ranger & Stryder |
| | | PXA26x | | | default 400 MHz, up to 624 MHz | Palm Tungsten T3 |
| | | PXA27x | | | 800 MIPS @ 624 MHz | HTC Universal, Zaurus SL-C1000, 3000, 3100, 3200, Dell Axim x30, x50, and x51 series, Motorola Q |
| | | PXA800(E)F | | | | |
| | | Monahans | | | 1000 MIPS @ 1.25 GHz | |
| | | PXA900 | | | | Blackberry 8700, Blackberry Pearl (8100) |
| | | IXC1100 | Control Plane Processor | | | |
| | | IXP2400/IXP2800 | | | | |
| | | IXP2850 | | | | |
| | | IXP2325/IXP2350 | | | | |
| | | IXP42x | | | | NSLU2 |
| | | IXP460/IXP465 | | | | |
| ARM11 | ARMv6 | ARM1136J(F)-S | SIMD, Jazelle DBX, (VFP), 8-stage pipeline | variable, MMU | 740 @ 532-665MHz (i.MX31 SoC) | Nokia N95, Nokia N93, Zune, Nokia N800, Texas Instruments OMAP2 |
| | ARMv6T2 | ARM1156T2(F)-S | SIMD, Thumb-2, (VFP), 9-stage pipeline | variable, MPU | | |
| | ARMv6KZ | ARM1176JZ(F)-S | SIMD, Jazelle DBX, (VFP) | variable, MMU+TrustZone | | Apple iPhone |
| | ARMv6K | ARM11 MPCore | 1-4 core SMP, SIMD, Jazelle DBX, (VFP) | variable, MMU | | |
| Cortex | ARMv7-A | Cortex-A8 | Application profile, VFP, NEON, Jazelle RCT, Thumb-2, 13-stage pipeline | variable (L1+L2), MMU+TrustZone | up to 2000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz) | Texas Instruments OMAP3 |
| | ARMv7-R | Cortex-R4(F) | Embedded profile, (FPU) | variable cache, MPU optional | 600 DMIPS | Broadcom is a user |
| | ARMv7-M | Cortex-M3 | Microcontroller profile, Thumb-2 only. | no cache, (MPU) | 125 DMIPS @ 100MHz | Luminary Micro[3] microcontroller family, ST Microelectronics STM32[4] |
| | ARMv6-M | Cortex-M1 | FPGA targeted, Microcontroller profile, Thumb-2 (BL, MRS, MSR, ISB, DSB, and DMB). | None, tightly coupled memory optional. | Up to 136 DMIPS @ 170MHz[3] (0.8 DMIPS/MHz[4], MHz achievable FPGA-dependent) | "Actel ProASIC3 and Actel Fusion PSC devices will sample in Q3 2007"[5] |

Design notes


To keep the design clean, simple and fast, it was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.
The ARM architecture includes the following RISC features:

★ Load/store architecture

★ No support for misaligned memory accesses (now supported in ARMv6 cores)

★ Orthogonal instruction set

★ Large 16 × 32-bit register file

★ Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density

★ Mostly single-cycle execution
To compensate for the simpler design, compared with contemporary processors like the Intel 80286 and Motorola 68020, some unique design features were used:

★ Conditional execution of most instructions, reducing branch overhead and compensating for the lack of a branch predictor

★ Arithmetic instructions alter condition codes only when desired

★ 32-bit barrel shifter which can be used without performance penalty with most arithmetic instructions and address calculations

★ Powerful indexed addressing modes

★ Simple, but fast, 2-priority-level interrupt subsystem with switched register banks
An interesting addition to the ARM design is the use of a 4-bit condition code on the front of every instruction, meaning that execution of every instruction is optionally conditional.
This cuts down significantly on the encoding bits available for displacements in memory access instructions, but on the other hand it avoids branch instructions when generating code for small if statements. The standard example of this is Euclid's algorithm for the greatest common divisor:
In the C programming language, the loop is:
    int gcd (int i, int j)
    {
        while (i != j)
        {
            if (i > j)
                i -= j;
            else
                j -= i;
        }
        return i;
    }
In ARM assembly, the loop is:
    loop    CMP    Ri, Rj       ; set condition "NE" if (i != j)
                                ;               "GT" if (i > j),
                                ;            or "LT" if (i < j)
            SUBGT  Ri, Ri, Rj   ; if "GT", i = i-j;
            SUBLT  Rj, Rj, Ri   ; if "LT", j = j-i;
            BNE    loop         ; if "NE", then loop
which avoids the branches around the then and else clauses.
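How the 4-bit condition field gates an instruction can be modelled in a few lines of C. This is an illustrative sketch only: the type and function names are hypothetical, the flag computation models a CMP as a 32-bit subtraction, and condition value 0xF is simply treated as "never", which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } flags;

/* The condition field occupies bits 31-28 of a 32-bit ARM instruction. */
unsigned condition_field(uint32_t insn) { return insn >> 28; }

/* Flags produced by CMP a, b, modelled as the subtraction a - b. */
flags cmp_flags(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b, r = ua - ub;
    flags f;
    f.n = (r >> 31) & 1u;                        /* result is negative */
    f.z = (r == 0);                              /* operands are equal */
    f.c = (ua >= ub);                            /* carry set means no borrow */
    f.v = (((ua ^ ub) & (ua ^ r)) >> 31) & 1u;   /* signed overflow */
    return f;
}

/* Decide whether an instruction with the given condition would execute. */
bool condition_passes(unsigned cond, flags f)
{
    assert(cond <= 0xFu);
    switch (cond) {
    case 0x0: return f.z;                  /* EQ */
    case 0x1: return !f.z;                 /* NE */
    case 0x2: return f.c;                  /* CS */
    case 0x3: return !f.c;                 /* CC */
    case 0x4: return f.n;                  /* MI */
    case 0x5: return !f.n;                 /* PL */
    case 0x6: return f.v;                  /* VS */
    case 0x7: return !f.v;                 /* VC */
    case 0x8: return f.c && !f.z;          /* HI */
    case 0x9: return !f.c || f.z;          /* LS */
    case 0xA: return f.n == f.v;           /* GE */
    case 0xB: return f.n != f.v;           /* LT */
    case 0xC: return !f.z && f.n == f.v;   /* GT */
    case 0xD: return f.z || f.n != f.v;    /* LE */
    case 0xE: return true;                 /* AL (always) */
    default:  return false;                /* 0xF: treated as never here */
    }
}
```

In the loop above, after CMP with i = 5 and j = 3 only the GT and NE tests pass, so SUBGT and BNE execute while SUBLT is skipped without any branch around it.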
Another unique feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement
    a += (j << 2);
could be rendered as a single word, single cycle instruction on the ARM:
    ADD Ra, Ra, Rj, LSL #2
As a result, the typical ARM program is denser than expected and makes fewer memory accesses, so the pipeline is used more efficiently. Even though the ARM runs at what many would consider to be low speeds, it nevertheless competes quite well with much more complex CPU designs.
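That the shift really fits inside the same 32-bit word can be seen by assembling the instruction by hand. The sketch below packs the fields of a data-processing instruction whose second operand is a register shifted by a constant; the function and enumerator names are hypothetical, and only the register-with-immediate-shift form is covered.

```c
#include <assert.h>
#include <stdint.h>

enum { SHIFT_LSL = 0, SHIFT_LSR = 1, SHIFT_ASR = 2, SHIFT_ROR = 3 };
enum { OPCODE_ADD = 0x4 };
enum { COND_AL = 0xE };

/* Pack: cond[31:28] 00 0 opcode[24:21] S[20] Rn[19:16] Rd[15:12]
         shift_amount[11:7] shift_type[6:5] 0 Rm[3:0] */
uint32_t encode_dp_shifted_reg(unsigned cond, unsigned opcode, unsigned set_flags,
                               unsigned rd, unsigned rn, unsigned rm,
                               unsigned shift_type, unsigned shift_amount)
{
    assert(cond < 16 && opcode < 16 && set_flags < 2);
    assert(rd < 16 && rn < 16 && rm < 16);
    assert(shift_type < 4 && shift_amount < 32);
    return ((uint32_t)cond << 28)
         | ((uint32_t)opcode << 21)
         | ((uint32_t)set_flags << 20)    /* S bit: update condition codes */
         | ((uint32_t)rn << 16)
         | ((uint32_t)rd << 12)
         | ((uint32_t)shift_amount << 7)
         | ((uint32_t)shift_type << 5)
         | (uint32_t)rm;
}
```

With Ra in r0 and Rj in r1, `encode_dp_shifted_reg(COND_AL, OPCODE_ADD, 0, 0, 0, 1, SHIFT_LSL, 2)` yields 0xE0800101, a single word holding both the add and the shift. Setting `set_flags` to 1 gives the flag-updating form, which is how arithmetic instructions alter the condition codes only when desired.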
The ARM processor also has some features rarely seen on other RISC architectures, such as PC-relative addressing (indeed, on the ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.
Another item of note is that the ARM has been around for a while, with the instruction set increasing somewhat over time. Some early ARM processors (prior to ARM7TDMI), for example, have no instruction to load a two-byte quantity, thus, strictly speaking, for them it's not possible to generate code that would behave the way one would expect for C objects of type "volatile short".
The ARM7 and most earlier designs have a three stage pipeline; the stages being fetch, decode, and execute. Higher performance designs, such as the ARM9, have a five stage pipeline. Additional changes for higher performance include a faster adder, and more extensive branch prediction logic.
The architecture provides a non-intrusive way of extending the instruction set using "coprocessors" which can be addressed using MCR, MRC, MRRC and MCRR instructions from software. The coprocessor space is divided logically into 16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for some typical control functions like managing the caches and MMU operation (on processors that have one).
In ARM based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space or into the coprocessor space or connecting to another device (a bus) which in turn attaches to the processor. Coprocessor accesses have lower latency so some peripherals (for example XScale interrupt controller) are designed to be accessible in both ways (through memory and through coprocessors).
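The coprocessor number is simply a 4-bit field inside the MRC and MCR instruction words, which is why there are exactly 16 coprocessors. The following sketch packs such an instruction by hand; the function name and parameter names are hypothetical, and the two-register MRRC/MCRR forms are not covered.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a single-register coprocessor transfer.
   to_arm != 0 gives MRC (coprocessor register -> ARM register),
   to_arm == 0 gives MCR (ARM register -> coprocessor register).
   Pack: cond[31:28] 1110 opc1[23:21] L[20] CRn[19:16] Rt[15:12]
         coproc[11:8] opc2[7:5] 1 CRm[3:0] */
uint32_t encode_cp_transfer(unsigned cond, int to_arm, unsigned coproc,
                            unsigned opc1, unsigned rt,
                            unsigned crn, unsigned crm, unsigned opc2)
{
    assert(cond < 16 && coproc < 16 && rt < 16 && crn < 16 && crm < 16);
    assert(opc1 < 8 && opc2 < 8);
    return ((uint32_t)cond << 28)
         | ((uint32_t)0xEu << 24)
         | ((uint32_t)opc1 << 21)
         | ((uint32_t)(to_arm ? 1u : 0u) << 20)
         | ((uint32_t)crn << 16)
         | ((uint32_t)rt << 12)
         | ((uint32_t)coproc << 8)     /* 4 bits: coprocessors 0 to 15 */
         | ((uint32_t)opc2 << 5)
         | ((uint32_t)1u << 4)
         | (uint32_t)crm;
}
```

For example, `MRC p15, 0, r0, c0, c0, 0` (an unconditional read from cp15 into r0) corresponds to `encode_cp_transfer(0xE, 1, 15, 0, 0, 0, 0, 0)`, which gives 0xEE100F10.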
Thumb

Newer ARM processors have a compressed instruction set, called 'Thumb', that uses a 16-bit-wide instruction encoding (but still processes 32-bit data). In Thumb, the smaller opcodes have less functionality. For example, only branches can be conditional, and many opcodes cannot access all of the CPU's registers. However, the shorter opcodes give improved code density overall, even though some operations require more instructions. Particularly in situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow greater performance than with 32-bit code because of the more efficient use of the limited memory bandwidth. Typically embedded hardware has a small range of addresses with a 32-bit datapath and the rest are 16 bits or narrower (e.g. the Game Boy Advance). In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using the (non-Thumb) 32-bit instruction set, placing them in the limited 32-bit bus width memory.
The first processor with a Thumb instruction decoder was the ARM7TDMI. All ARM9 and later families, including XScale, have included a Thumb instruction decoder.
Jazelle

A technology called Jazelle DBX (Direct Bytecode eXecution) allows some ARM architectures to execute Java bytecode in hardware as another execution state alongside the existing ARM and Thumb states. It provides acceleration for some bytecodes while calling out to special software for others.
The first processor with Jazelle technology was the 'ARM926EJ-S'[6]: Jazelle being denoted by the 'J' in the CPU name. It is used by mobile phone manufacturers to speed up execution of Java ME games and applications.
Thumb-2

'Thumb-2' technology made its debut in the 'ARM1156 core', announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth. The resulting stated aim for Thumb-2 is to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
Thumb-2 also extends both the ARM and Thumb instruction set with yet more instructions, including bit-field manipulation, table branches, and conditional execution.
All ARMv7 chips support the Thumb-2 instruction set.
Some chips, such as the Cortex-M3, support only the Thumb-2 instruction set.
Other chips in the Cortex and ARM11 series support both "ARM instruction set mode" and "Thumb-2 instruction set mode".[5][6][7]
Thumb Execution Environment (ThumbEE)

'ThumbEE', also known as 'Thumb-2EE', and marketed as Jazelle RCT, was announced in 2005, first appearing in the 'Cortex-A8' processor. ThumbEE provides a small extension to the Thumb-2 extended Thumb instruction set, making the instruction set particularly suited to code generated at runtime (e.g. by JIT compilation) in managed 'Execution Environments'. ThumbEE is a target for languages such as Limbo, Java, C#, Perl and Python, and allows JIT compilers to output smaller compiled code without impacting performance.
New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, access to registers r8-r15 (where the Jazelle/DBX Java VM state is held), and the ability to branch to handlers: small sections of frequently called code, commonly used to implement a feature of a high level language, such as allocating memory for a new object.
Advanced SIMD (NEON)

The 'Advanced SIMD' extension, marketed as 'NEON' technology, is a combined 64- and 128-bit SIMD (Single Instruction Multiple Data) instruction set that provides standardized acceleration for media and signal processing applications. NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM AMR (Adaptive Multi-Rate) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single precision floating-point data, and applies SIMD operations to audio and video processing as well as graphics and gaming workloads. A single NEON instruction can perform up to 16 operations at the same time.
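The figure of 16 simultaneous operations follows from dividing a 128-bit register into 8-bit lanes. The sketch below is a plain scalar model of what one such lane-wise addition computes; it does not use NEON intrinsics or any ARM header, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 };   /* 128 bits divided into 8-bit lanes */

/* Scalar stand-in for a 128-bit vector of sixteen unsigned bytes. */
typedef struct { uint8_t lane[LANES]; } vec128_u8;

/* Model of a single SIMD add: every lane is added independently,
   and each 8-bit result wraps around modulo 256. */
vec128_u8 add_u8x16(vec128_u8 a, vec128_u8 b)
{
    assert(sizeof(vec128_u8) == 16);   /* 16 bytes = 128 bits */
    vec128_u8 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    return r;
}
```

On real hardware the sixteen additions in the loop are performed by one instruction; with 16-bit or 32-bit lanes the same register holds 8 or 4 elements instead.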
VFP

'VFP' technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions allowing SIMD (Single Instruction Multiple Data) parallelism. This is useful in graphics and signal-processing applications by reducing code size and increasing throughput.
Other floating-point and/or SIMD coprocessors found in ARM-based processors include FPA, FPE and iwMMXt. They provide some of the same functionality as VFP but are not opcode-compatible with it.
Security Extensions (TrustZone)

The 'Security Extensions', marketed as 'TrustZone'(TM) Technology, are found in ARMv6KZ and later application profile architectures. They provide a low cost alternative to adding an additional dedicated security core to a SoC, by providing two virtual processors backed by hardware based access control. This enables the application core to switch between two states (referred to as worlds to reduce confusion with other names for capability domains) in a manner such that information can be prevented from leaking from the more trusted world to the less trusted world. This world switch is generally orthogonal to all other capabilities of the processor and so each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device. A typical application of TrustZone Technology is to run a rich operating system in the less trusted world, and smaller security-specialized code in the more trusted world.

ARM licensees


ARM Ltd does not manufacture and sell CPU devices based on its own designs, but rather licenses the processor architecture to interested parties. ARM offers a variety of licensing terms, varying in cost and deliverables. To all licensees, ARM provides an integratable hardware description of the ARM core, as well as a complete software development toolset (compiler, debugger, SDK), and the right to sell manufactured silicon containing the ARM CPU. Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified IP core. For these customers, ARM delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable RTL (Verilog) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimizations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.). While ARM does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products (chip devices, evaluation boards, complete systems, etc.). Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM cores, they generally hold the right to remanufacture ARM cores for other customers.
Like most IP vendors, ARM prices its IP based on perceived value. In architectural terms, the lower performance ARM cores command a lower license cost than the higher performance cores. In terms of silicon implementation, a synthesizable core is more expensive than a hard macro (blackbox) core. Complicating price matters, a merchant foundry that holds an ARM license (such as Samsung and Fujitsu) can offer reduced licensing costs to its fab customers. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront license fee. Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu/Samsung charge 2 to 3 times more per manufactured wafer. For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidization of the license fee). For high volume mass produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE (Non-Recurring Engineering) costs, making the dedicated foundry a better choice.
Many semiconductor or IC design firms hold ARM licenses. Analog Devices, Atmel, Broadcom, Cirrus Logic, Faraday Technology, Freescale (spun off from Motorola in 2004), Fujitsu, Intel (through its settlement with DEC), IBM, Infineon Technologies, Nintendo, NXP Semiconductors (spun off from Philips in 2006), OKI, Samsung, Sharp, STMicroelectronics, Texas Instruments and VLSI are some of the many companies that have licensed the ARM in one form or another. Although ARM's license terms are covered by NDA, within the IP industry ARM is widely known to be among the most expensive CPU cores. A single customer product containing a basic ARM core can incur a one-time license fee in excess of (USD) $200,000. Where significant quantity and architectural modification are involved, the license fee can exceed $10M.
Approximate licensing costs

ARM's 2006 annual report and accounts state that royalties totalling £88.7 million ($164.1 million) were the result of licensees shipping 2.45 billion units[7]. This is equivalent to 3.6 pence (6.7 cents) per unit shipped. However, this is averaged across all cores, including expensive new cores and inexpensive older cores.
In the same year ARM's licensing revenues for processor cores were £65.2 million ($119.5 million)[8], in a year when 65 processor licenses were signed[9], an average of £1 million ($1.84 million) per license. Again, this is averaged across both new and old cores.
Given that ARM's 2006 income from processor cores was approximately 60% from royalties and 40% from licenses, ARM makes the equivalent of 6 pence (11 cents) per unit shipped including both royalties and licenses. However, as one-off licenses are typically bought for new technologies, unit sales (and hence royalties) are dominated by more established products. Hence, the figures above do not reflect the true costs of any single ARM product.
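The per-unit figures above are simple ratios of the quoted sterling totals, and can be rechecked with a short calculation. The helper names below are hypothetical; the inputs are the 2006 figures quoted in this section.

```c
#include <assert.h>

/* Revenue per unit shipped, in pence, for a revenue figure in pounds. */
double pence_per_unit(double revenue_gbp, double units)
{
    assert(units > 0.0);
    return 100.0 * revenue_gbp / units;
}

/* Fraction of the combined income contributed by the first source. */
double share_of_total(double part, double other)
{
    assert(part + other > 0.0);
    return part / (part + other);
}
```

With royalties of £88.7 million, licensing revenue of £65.2 million and 2.45 billion units shipped, `pence_per_unit(88.7e6, 2.45e9)` comes to about 3.6 pence, `share_of_total(88.7e6, 65.2e6)` to about 0.58 (rounded to 60% in the text), and `pence_per_unit(88.7e6 + 65.2e6, 2.45e9)` to about 6.3 pence, which the text rounds to 6 pence. Dividing £65.2 million by the 65 licenses signed gives the average of roughly £1 million per license.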

See also



Inferno

DirectBand

AMULET - a family of asynchronous ARMs

Philips LPC2000 ARM7TDMI-S Microcontrollers

Texas Instruments OMAP - an ARM core plus DSP and application acceleration cores

Armulator, ARM Instruction Set Simulator

References


1. http://www.arm.com/miscPDFs/3823.pdf
2. "Some facts about the Acorn RISC Machine" Roger Wilson posting to comp.arch, Nov 2 1988, Accessed 25 May 2007.
3. "ARM Extends Cortex Family with First Processor Optimized for FPGA", ARM press release, March 19 2007. Accessed April 11, 2007.
4. "ARM Cortex-M1", ARM product website. Accessed April 11, 2007.
5. http://www.arm.com/news/17017.html
6. http://www.us.design-reuse.com/news/news6919.html
7. "Business review/Financial review/IFRS", p. 10, ARM annual report and accounts, 2006. Retrieved May 7 2007
8. Based on total £110.6 million (2.5 million) divided by "License revenues by product"; "Business review/Financial review/IFRS" and "Key performance indicators" respectively, p. 10 / p. 3, ARM annual report and accounts, 2006. Retrieved May 7 2007
9. "Key performance indicators", p. 3, ARM annual report and accounts, 2006. Retrieved May 7 2007

External links



ARM Ltd.

ARM Assembler Programming; tutorial, resources, and examples

TrustZone(TM) Technology

ARM Microcontroller Development Resources - header files, schematics, CAD files, etc.

Programming the ARM Microprocessor for Embedded Systems Slides

Arm Architecture

This article provided by Wikipedia.
