Advanced SIMD: Extending the reach of contemporary SIMD architectures

Matthias Boettcher, Bashir M. Al-Hashimi, Mbou Eyole, Giacomo Gabrielli, Alastair Reid
ARM Ltd and University of Southampton

Design, Automation & Test in Europe Conference & Exhibition (DATE 2014)
Dresden, Germany
March 2014

Abstract

SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g., Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity.

This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed.

We developed an ARMv7 NEON-based ISA extension (ARGON), augmented a cycle-accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.

BibTeX

@inproceedings{DBLP:conf/date/BoettcherAEGR14,
  abstract     = {SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g., Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity.

This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. Their sensitivity to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed.

We developed an ARMv7 NEON based ISA extension (ARGON), augmented a cycle accurate simulation framework for it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.},
  acceptance   = {22},
  affiliation  = {ARM Ltd and University of Southampton},
  ar_file      = {DATE_14},
  ar_shortname = {DATE 14},
  author       = {Matthias Boettcher and Bashir M. Al{-}Hashimi and Mbou Eyole and Giacomo Gabrielli and Alastair Reid},
  booktitle    = {Design, Automation {\&} Test in Europe Conference {\&} Exhibition ({DATE} 2014)},
  day          = {24-28},
  doi          = {10.7873/DATE.2014.037},
  editor       = {Gerhard Fettweis and Wolfgang Nebel},
  file         = {date2014_adv_simd.pdf},
  location     = {Dresden, Germany},
  month        = {March},
  pages        = {1-4},
  publisher    = {European Design and Automation Association},
  title        = {{A}dvanced {S{I}MD:} {E}xtending the reach of contemporary {S{I}MD} architectures},
  year         = {2014}
}