The Itanium processor, part 1: Warming up (msdn.com)
144 points by ingve on July 27, 2015 | 58 comments


One of the biggest benefits of Itanium is its security features:

http://www.intel.com/content/dam/www/public/us/en/documents/...

These are rarely discussed, but Secure64's SourceT OS got plenty of mileage out of them. You might not be able to argue for Itanium on cost, performance, or ease of use; you could, however, argue for it as a better start for secure OSes or appliances. Unlike academic prototypes, it's also in production with high speed and reliability. Let's also not forget you can do reverse stacks and other bug-prevention tricks with less performance hit or clunkiness on a RISC architecture vs. x86.

Personally, I'd rather they had just modernized the i960MX, minus the other BiiN stuff:

https://en.wikipedia.org/wiki/BiiN

Targeting a robust, UNIX-compatible OS and C toolchain to it might have let it survive and get continually updated. Then, when HLLs got popular, we'd have had a good hardware target for them that supported POLA inside applications. As usual, I'm speaking of great stuff in a past or never-happened tense. At least Itanium made it this far.


Here is an interesting (not) security feature: https://news.ycombinator.com/item?id=9956044


Not doing it fully during a procedure call is an optimization. I responded to the comment with links from 2002 and 2005 that showed this, along with the proper way to save everything to memory. It shouldn't have been a surprise to someone reading up on it.


So they used a securely designed component and got bit anyway.


Those weren't the security features. It was the features not designed for security, aimed at an audience wanting performance enhancements. The kind of thing that often leads to unexpected problems. ;)


Yeah...the i960MX was a missed opportunity. It was certainly oversold as "Intel's RISC replacement for x86" (it was never going to be that in the real world), but it would have been interesting to see how far it would have gone if the decision hadn't been "we'll just neuter the little bastard and try and recoup some development dollars by using it as an embedded processor".


""we'll just neuter the little bastard and try and recoup some development dollars by using it as an embedded processor"."

Lmao, well put. Maybe if they'd seen it worded that way on the drawing board they might have seen the folly. I keep thinking of checking on the Alpha ISA licensing situation to get PALcode, etc. back. Last I checked, Intel and Samsung had control of the licensing, but it kinda disappeared.

Might just stick with SPARC or RISC-V, modified to borrow clever ideas from the past. Might even try to design a knock-off of the i960's better features, updated based on lessons learned over time.


I've been impressed with the RISC-V folks. Seem to have solid talent and realistic goals.


Did you mean the main university group(s) that defined and implemented the ISA? If so, I agree they've done well. I especially loved that they open-sourced a 1+GHz 48nm core. Awesome stuff.

I posted a link in Schneier's squid thread today to a bunch of asynchronous chip work. One was a 180nm FPGA with several times Xilinx's performance, and one a 40nm microcontroller. I imagine a combination of RISC-V with that async flow would produce one drool-worthy processor in price, performance, and NRE cost.


Can you point out where the links are?



Thank you very much!


Yeah...that 1GHz core could be the game changer as far as open-source hardware goes. It seems pretty much anyone with above-average undergrad digital design competency has cooked up a core/ISA in the few-tens to few-hundred MHz range, but a real 1GHz core on a real process potentially puts it in range of the ARM/MIPS crowd performance-wise (that is, out of "isn't that a cute toy" territory).

I confess to not being particularly conversant with async design (it became mainstream after my time), but I'll definitely check out your links and see what I can learn.

Interesting times and all that.


More interesting than you think: we have an open-source FPGA architecture out there now and cheap[ish] processes to implement it.

http://pastebin.com/bakX4RRZ


I thank you, sir, for this month's reading list.


You're welcome. I have over ten thousand in all, and someone just asked me for some on making C programs memory safe. So if you like that too, check back on my profile's comments in a day or so.


That was me asking about memory safety, actually. You ever get to Atlanta, GA, I owe you a dinner and a beer.


Oh must have lost track lol. Appreciate it and might take you up on it.


You can buy old Itanium hardware super-cheap now on eBay. I got an HP Integrity RX2620 for £58 which included tax and delivery.

As hardware goes it's .. interesting. It's fast. But it uses huge amounts of power and requires massive cooling (if you disable any of the 4 or 5 fans in my 2U machine, it overheats in 5 minutes). It has early EFI which should be quite familiar if you've used UEFI on the command line. And it has excellent iLO / remote / serial support so it's great practice for learning about enterprise ops.

It's getting hard to find software that runs on it. The last Debian (Wheezy) runs, but current Debian has dropped ia64 support. RHEL dropped support years ago. You'll find there are lots of strange bugs because people no longer test their software on this platform.

https://rwmj.wordpress.com/2015/05/03/raise-the-itanic-part-...

https://rwmj.wordpress.com/2014/09/08/raise-the-itanic/#cont...


>> You can buy old Itanium hardware super-cheap now on eBay. I got an HP Integrity RX2620 for £58

My last gig had an Itanium running some HP-UX software that had been cobbled together over the past 20 (?) years. eBay provides cheap parts to upgrade the server as well as put together a machine for the failover datacenter.

CPU and RAM were both cheap. However, you're stuck with commodity SCSI drives, which are still expensive in the largest sizes. 300GB drives, NIB and matching, are like $150 USD apiece.


Heh, HP donated two rx5670s to Snakebite, you can see the one I managed to rack here: https://youtu.be/X_IfRHgJubM?t=60.

Huge power hogs. Even more than my ES40.

The iLO stuff is phenomenal. As is all the hot-swap support ... DIMMs, CPUs, fans, you name it. This goes back to 2003, too.


This brings back memories. At Progeny Linux Systems we helped HP get Debian packages working properly on ia64.

We had one of HP's first generation Itanium servers to test on, and man, was that SLOW.


> if you disable any of the 4 or 5 fans in my 2U machine, it overheats in 5 minutes

Huh, I would have expected better redundancy given how expensive the hardware must have been at the time.


I agree it's strange. If you pull any fan, the machine hard shuts down a few minutes later (it even warns you of this in a note printed inside the case). Presumably if a fan fails and you don't manage to get to it in a few minutes, then you're out of luck. The only good thing is that it is possible to hot-swap a fan in a few seconds if the fan is failing-but-not-failed (if that ever happens - it seems unlikely).

Edit: Would love to know what the list price of my machine was back in 2006. Probably thousands ...


Google turns up http://www.openpa.net/systems/hp_rx2600_rx2620.html with prices for the rx2600 which was released a few years before the rx2620.

"Time of introduction: 2002-2003 (rx2600)/December 2004 (rx2620) with prices at the time starting at $7,300 (entry rx2600), $16,000 (average rx2600) to $33,000 (large rx2600)."


> Probably thousands ...

Try 15K euros!


Ouch! Thanks. I wonder if anyone ever paid list price for these? AIUI many were given away for educational use and the like.


I never got further than a quote. The most exotic hardware I actually bought were a DEC Alpha and a whole bunch of SGI gear (at considerable discount).


Brings back memories.. When running SpiderMonkey (interpreter, not JIT) on IA64 it would randomly crash and burn with what looked like GC issues. Values were being collected even though they were still in use. It turned out the mark-and-sweep collector that would scan the registers and stack was working properly, but was not aware that IA64 would not write out all of its registers to the buffer provided to `setjmp()`. You would have to send the processor a `flushrs` instruction to tell it to flush all stacked general registers in the "dirty" partition (not yet written to the backing store) of the register stack to the backing store. After that, you'd need to get the exact pointers to the register backing store and then scan those. Fun times.
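Roughly, the fix looked something like this; a simplified sketch only, not the actual SpiderMonkey code. It assumes GCC on ia64; scan_range(), stack_bottom, and backing_store_base are hypothetical stand-ins for whatever the collector records, and the flushrs inline asm plus __builtin_ia64_bsp() are how GCC-based collectors typically reach the register backing store.

    #include <setjmp.h>

    /* Hypothetical helpers: conservative pointer scan and per-thread bases. */
    extern void scan_range(void *lo, void *hi);
    extern void *stack_bottom;          /* recorded at thread start */
    extern void *backing_store_base;    /* recorded at thread start */

    static void scan_roots(void)
    {
        jmp_buf regs;
        setjmp(regs);   /* spills callee-saved registers... on most ISAs */

    #if defined(__ia64__) && defined(__GNUC__)
        /* On IA-64, stacked general registers may still sit in the "dirty"
           partition of the register stack, not in memory.  Force them out
           to the backing store, then scan that region too. */
        __asm__ __volatile__("flushrs" ::: "memory");
        scan_range(backing_store_base, __builtin_ia64_bsp());
    #endif

        scan_range(&regs, (char *)&regs + sizeof regs);  /* register snapshot */
        scan_range(&regs, stack_bottom);                 /* the ordinary memory stack */
    }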


That comes with the RSE (Register Stack Engine) it uses for performance enhancement. The flushrs requirement, etc., was mentioned here:

Microsoft on RSE (2002) https://web.archive.org/web/20021018050724/http://portals.de...

Smotherman's notes (2002) http://people.cs.clemson.edu/~mark/subroutines/itanium.html

USENIX presentation on Itanium (2005) https://www.usenix.org/legacy/events/usenix05/tech/general/g...

I could see how you wouldn't expect it coming from another ISA, and they could have been a bit more explicit. It was weird. It was documented, though, by different people building on Itanium. Different workloads use it in different ways for efficiency.


Sure.. it was straightforward to track down once you see that a value is in a register but is being collected anyway. I seem to remember the other fun bits were the fact that function pointers were not actually function pointers.. they were pointers into a giant lookup table that contained a struct that contained the actual address of the call. Also, unwinding the stack was so complicated that you could not reasonably do it manually like on nearly any other architecture -- you needed to link in an Intel library to do it. None of these things were giant issues -- it just showed how much of a departure from other prevalent systems it was.


Wow, that does sound like a painful learning curve. Curious, were the function pointer problems inherent to the architecture or the tool/lib you used? Might be worth documenting in case a reader stumbles upon this before going through what you did.


If you want all the nasty details, this post covers it all:

"This also has a side-effect on function pointers. Since function pointers are generally used at some distance from allocation, they might be used in a module with a different gp value. The compiler gets around this by not compiling a function pointer to a single pointer-sized value; it compiles to a pair of pointer-sized values, one representing the address of the first instruction (bundle on IA64) in the function, the other being the correct gp value to use."

http://mikedimmick.blogspot.com/2004/01/ia64s-global-pointer...
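In struct terms, the "function pointer" described above boils down to something like this (field names are illustrative, not taken from any actual ABI header):

    #include <stdint.h>

    /* An IA-64 function pointer really points at a descriptor like this,
       not at the code itself. */
    struct func_descriptor {
        uint64_t entry;  /* address of the function's first instruction bundle */
        uint64_t gp;     /* global pointer value the function expects on entry */
    };

    /* An indirect call must load both fields: set gp from the descriptor,
       then branch to entry.  Comparing "function pointers" compares
       descriptor addresses, not code addresses. */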


Wow. That's a huge mess of stuff to mentally track just for some optimization. To be honest, I'd probably just end up writing a macro-assembler and coding it directly against a C reference implementation just to avoid all the nonsense lol.


You might have had less visible problems on other architectures as well.

setjmp()/longjmp() are not guaranteed to work with local variables not marked volatile.
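A minimal illustration of that rule (per C, a non-volatile local modified between setjmp() and longjmp() has an indeterminate value after the jump):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf env;

    int main(void)
    {
        int plain = 0;          /* may live in a register; indeterminate after longjmp */
        volatile int safe = 0;  /* forced to memory; guaranteed to keep its value */

        if (setjmp(env) == 0) {
            plain = 1;
            safe = 1;
            longjmp(env, 1);
        }
        printf("plain=%d safe=%d\n", plain, safe);  /* only safe is guaranteed to be 1 */
        return 0;
    }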


dmr explaining what's going on:

http://yarchive.net/comp/longjmp.html


The size of the jump buffers on ia64 terrifies me. x64 on Windows is bad enough, but ia64 looks horrible. Was it a performance problem?


I don't think it amounted to any performance issue because SM was only using the setjmp() buffer in that way as a shortcut to get all of the register values in a buffer to scan linearly. Since there was the RSE backing store, there was an extra block of memory to scan regardless, but you're still talking about a reasonably bounded area only scanned on each GC run, which is not a frequent (to the processor) event.


HN readers interested in VLIW architectures might find the TRIPS project a good read:

http://www.cs.utexas.edu/~trips/overview.html

It tries to avoid some pitfalls of architectures such as Itanium. Seems fairly complex to me, though that might be inherent in EDGE and VLIWs.


And there's the Mill if you're willing to be somewhat more exotic. http://millcomputing.com/


It's a very interesting architecture that I've heard about on paper for a while now but never seen on silicon. Have they done anything with it on ASIC/FPGA or is it vaporware for now?

Talking exotic, look at No Instruction Set Computing (NISC), which was at least prototyped, with published synthesis tools and compilers:

https://web.archive.org/web/20080302041756/http://www.ics.uc...

Really interesting stuff. It reminded me of Tensilica's tools that create a custom processor for your application. Need to accelerate your Hadoop (or similar) application? Run most of it on an Intel CPU, with an onboard FPGA and NISC tools handling the critical path. Intel's Altera acquisition might make something like that achievable in the future.

Note: Used archive because their site is having a configuration error.


Last I heard, the Mill guys don't even have a compiler working, but do have a simulator. Not sure what they're simulating.


Their June talk was about their compiler and toolchain, which borrows heavily from LLVM. It's almost certainly not done, but they do have something more than just an assembler. They've also started working on implementing it for FPGA, but only as a proof of concept rather than something intended to be ready to make into an ASIC.


There are some embedded chips that use VLIW. RISC chips need to do instruction scheduling to really get their performance up, but for a lot of embedded tasks those extra transistors are a waste compared to a VLIW chip where the software can actually do the scheduling well (unlike what happened with the Itanium).


Yeah - the most common area is DSP chips and similar, where you require really predictable performance, and have tight arithmetical loops with a good source of parallelism (also a field where it isn't a problem if you have to write a new compiler for your chip).

There are some consumer VLIW chips bumming around; I think the Nexus 9 has a VLIW chip - but it is essentially a hardware JIT compiler, translating ARM into its internal instruction set for hot paths.


Yeah, that's Denver in the Nexus 9: https://en.wikipedia.org/wiki/Project_Denver


Wolfe wrote a nice article on VLIW use in embedded and issues to overcome:

http://www.embedded.com/design/prototyping-and-development/4...


Yeah. If you can do a good job predicting in advance how long a given memory access is going to take, then VLIWs work pretty well. Many applications have predictable memory access patterns and can perform very well with in-order processors. Decoding an MP3, say. Just not stuff like a web browser or word processor.


In fact, you're quite possibly using a VLIW chip right now: your GPU.


Actually many TI 6000 series DSP chips are also VLIW. http://users.ece.utexas.edu/~bevans/hp-dsp-seminar/01_Introd...


Not anymore. On the desktop, only AMD used a VLIW architecture in their GPUs for a while, and they switched to a scalar architecture years ago.


I thought most of them switched when the whole GPGPU movement happened?


AMD is a VLIW architecture, NVidia is not.

Correction: AMD GPUs were VLIW when I took a class on GPGPU in 2011. Apparently, AMD subsequently switched from VLIW: <https://en.wikipedia.org/wiki/Graphics_Core_Next>.


What separates Itanium from other processors?


Modern x86 processors put a lot of work into rearranging instructions, renaming registers, and the like, with the intent of executing instructions in parallel without having to reveal this fact to the end user.

Itanium code takes the form of fixed-width chunks (bundles), but they're a lot longer than a RISC instruction and contain multiple small instructions. All of these run in parallel. It's illegal to have any conflicts within a chunk (such as two instructions writing to the same register, or one writing to a register that another one is reading).

Itanium exposes this parallelism to compiler authors, so that a sufficiently smart code generator can take advantage of this. In practice, the sufficiently smart compilers never materialized.
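A rough way to picture what the compiler is looking for, in C terms (illustrative only; whether real codegen bundles things this way depends on the scheduler):

    /* Independent work: each iteration's load, multiply-add, and store don't
       depend on neighbouring iterations (restrict rules out aliasing), so an
       EPIC/VLIW compiler can pack several of them into the same bundle or
       software-pipeline the loop. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Dependent work: each step reads the previous result, so these can't
       share a bundle -- a write and a read of the same register can't be
       issued together. */
    long chain(long v)
    {
        v = v * 3;
        v = v + 7;
        v = v ^ (v >> 2);
        return v;
    }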


An interesting innovation, which apparently was only ever used in IA-64 simulators, was the neural branch predictor: https://en.wikipedia.org/wiki/Branch_predictor#Neural_branch...


VLIW architecture with predicated execution.


I don't think predicated execution is unheard of; ARM has the same concept (although it's limited to condition flags instead of what seem like one-bit predicate registers). The feature has since been dropped in AArch64, though.
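For a concrete picture: a compiler that if-converts can turn a branch like this into predicated instructions, where a compare sets a predicate and the guarded assignment only commits if the predicate is true (sketch only; actual codegen varies):

    int clamp_to_zero(int x)
    {
        /* On Itanium this can compile without a branch: cmp sets a pair of
           complementary predicate registers, and the assignment below is
           executed under one of them. */
        if (x < 0)
            x = 0;
        return x;
    }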




