Advanced SIMD - Extending the reach of contemporary SIMD architectures

Matthias Boettcher, Bashir M. Al-Hashimi, Mbou Eyole, Giacomo Gabrielli, Alastair Reid
ARM Ltd and University of Southampton
[pdf] [doi]

Design, Automation & Test in Europe Conference & Exhibition (DATE 2014)
Dresden, Germany
March 2014

Abstract

SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g., Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity.

This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed.

We developed an ARMv7 NEON based ISA extension (ARGON), augmented a cycle accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.

First page of paper

BibTeX

@inproceedings{DBLP:conf/date/BoettcherAEGR14 , abstract = { SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g., Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity.

This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed.

We developed an ARMv7 NEON based ISA extension (ARGON), augmented a cycle accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x. } , acceptance = {22} , affiliation = {ARM Ltd and University of Southampton} , ar_file = {DATE_14} , ar_shortname = {DATE 14} , author = {Matthias Boettcher and Bashir M. Al-Hashimi and Mbou Eyole and Giacomo Gabrielli and Alastair Reid} , booktitle = {Design, Automation \& Test in Europe Conference \& Exhibition (DATE 2014)} , day = {24-28} , doi = {10.7873/DATE.2014.037} , editor = {Gerhard Fettweis and Wolfgang Nebel} , file = {date2014_adv_simd.pdf} , location = {Dresden, Germany} , month = {March} , pages = {1-4} , png = {date2014_adv_simd.png} , publisher = {European Design and Automation Association} , title = {{A}dvanced {S}IM{D}: {E}xtending the reach of contemporary {S}IM{D} architectures} , year = {2014} }