-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmain.tex
2395 lines (2003 loc) · 116 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[oneside]{book}
% Needed so characters such as underscore are recognized in PDF
% viewers. Otherwise searching for say `hsa_open` produces no
% results
\usepackage[T1]{fontenc}
% Use Open Sans font
\usepackage[default]{opensans}
% Margins
\usepackage[top=2.5cm,bottom=2.5cm,left=2.5cm,right=2.5cm]{geometry}
\usepackage[toc,page]{appendix}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\definecolor{lightgray}{gray}{0.94}
\usepackage{makeidx}
\usepackage{pbox}
\usepackage{sidecap}
\usepackage{float}
\newcommand{\doctitle}{HSA Runtime Programmer's Reference Manual}
% overwrite global draft option so the source is shown in final mode
\usepackage[final]{listings}
% allow tables across page breaks
\usepackage{longtable}
% Danger sign
\usepackage{stackengine}
\usepackage{scalerel}
\newcommand\danger[1][1.75ex]{%
\renewcommand\stacktype{L}%
\scaleto{\stackon[1.3pt]{\color{black}$\triangle$}{\tiny !}}{#1}%
}
% Use Tikz for simple diagrams
\usepackage{tikz}
\usetikzlibrary{arrows,automata,positioning,chains,shapes.geometric}
\tikzset{
% define global arrow tip format
>=angle 45}
% allows width arithmetic, currently used in the LaTeX generated by xml2tex
\usepackage{calc}
% allow usage of underscore w/o resorting to underscore package
% the underscore package imposes more restrictions on where underscores still
% cannot be used (e.g.: labels)
\catcode`_=12 % we need this line, even when it is repeated afterwards
\AtBeginDocument{
\catcode`_=12
\begingroup\lccode`~=`_
\lowercase{\endgroup\let~}\sb
\mathcode`_="8000
}
\usepackage{ifthen}
\usepackage{textcomp}
% inline enumerations and itemizes in the paragraph instead of breaking
\usepackage{paralist}
% Customizable headers/footers
\usepackage{fancyhdr}
% add API index. Note that 'imakeidx' works within Latex (unlike 'makeidx', which
% requires running 'makeindex' from command line)
\usepackage{imakeidx}
\makeindex[name=api,title=Index - Core APIs,columns=1,intoc]
\makeindex[name=ext,title=Index - Extension APIs,columns=1,intoc]
% allows customization of description environment (margin, indentation, etc.)
\usepackage{enumitem}
\lstset{% in alphabetical order
backgroundcolor=\color{lightgray},
basicstyle=\footnotesize,
breakatwhitespace=true,
breaklines=true,
captionpos=b, % sets the caption-position to bottom
columns=fullflexible,
commentstyle=\color{ForestGreen},
emphstyle={\textbf},
frame=single,
framexbottommargin=3pt, % bottom frame padding
framextopmargin=3pt, % top frame padding
inputencoding=utf8,
keywordstyle=\color{black},
language=C,
rulecolor=\color{lightgray}, % invisible (we use it because it allows padding)
showstringspaces=false,
tabsize=4,
}
% prevent listings package from changing hyphens to minus signs
\makeatletter
\lst@CCPutMacro\lst@ProcessOther {"2D}{\lst@ttfamily{-{}}{-{}}}
\@empty\z@\@empty
\makeatother
% Use Example instead of Listing in captions
\renewcommand{\lstlistingname}{Example}
% List of Listings -> List of Examples
\renewcommand{\lstlistlistingname}{List of \lstlistingname s}
% include automatically-generated Listing commands
\input{api/altlatex/listings}
\input{example/altlatex/lstinputfunlisting}
% formatting of arguments, function names, types, etc.
% remember to add this commands to the safe command list of Latexdiff so
% it includes them in the diff algorithm
\newcommand{\reffun}[1]{\textbf{#1}}
\newcommand{\refarg}[1]{\textit{#1}}
\newcommand{\reffld}[1]{\textit{#1}}
\newcommand{\reftyp}[1]{#1}
\newcommand{\refenu}[1]{\reftyp{#1}}
\newcommand{\refhsl}[1]{\reffun{#1}}
% allows string comparison, which is used to match \hsaref arguments
\usepackage{pdftexcmds}
% Automatically generated file containing all the definitions in the header that
% can be referenced via \hsaref commands
\input{api/altlatex/commands}
% every section in a new page
\usepackage{etoolbox}
\pretocmd{\section}{%
\ifnum\value{section}=0 \else\clearpage\fi
}{}{}
% define marginparwidth so todonotes renders properly
\setlength{\marginparwidth}{2cm}
\usepackage[obeyDraft,obeyFinal,textsize=scriptsize]{todonotes}
\usepackage[
% link color is black
allcolors=Cerulean,
% Do not use colors or frames in links
% If colors are used, they overwrite those of the DIFF tool so
% many differences would not be shown
colorlinks=true,
% use links even if the document is draft
final,
%sections and subsections linked
linktoc=all,
]{hyperref}
% alternate rowcolors for all long-tables
\let\oldlongtable\longtable
\let\endoldlongtable\endlongtable
\newenvironment{mylongtable}{\rowcolors{0}{lightgray}{lightgray}\longtable} {
\endlongtable}
% Alter some LaTeX defaults for better treatment of figures:
% See p.105 of "TeX Unbound" for suggested values.
% See pp. 199-200 of Lamport's "LaTeX" book for details.
% General parameters, for ALL pages:
\renewcommand{\topfraction}{0.9} % max fraction of floats at top
\renewcommand{\bottomfraction}{0.8} % max fraction of floats at bottom
% Parameters for TEXT pages (not float pages):
\setcounter{topnumber}{2}
\setcounter{bottomnumber}{2}
\setcounter{totalnumber}{4} % 2 may work better
\setcounter{dbltopnumber}{2} % for 2-column pages
\renewcommand{\dbltopfraction}{0.9} % fit big float above 2-col. text
\renewcommand{\textfraction}{0.07} % allow minimal text w. figs
% Parameters for FLOAT pages (not text pages):
\renewcommand{\floatpagefraction}{0.7} % require fuller float pages
% N.B.: floatpagefraction MUST be less than topfraction !!
\renewcommand{\dblfloatpagefraction}{0.7} % require fuller float pages
% push footer a little further away
\setlength{\footskip}{35pt}
% Side notes. One command per author
\newcommand{\mariotodo}[1]{\todo[color=CarnationPink]{#1}}
% Increase paragraph separation
\setlength{\parskip}{2mm}
% no indentation
\setlength{\parindent}{0cm}
% number (sub)sections up to level 3
\setcounter{secnumdepth}{3}
\makeindex
\setcounter{tocdepth}{3}
\renewcommand{\footrulewidth}{0.4pt}
\renewcommand{\familydefault}{\sfdefault}
\RequirePackage[normalem]{ulem}
\newenvironment{DIFnomarkup}{}{}
\makeatletter
\newcommand*{\tidyhdr}[2]{
% if \rightmark is empty or equal to \leftmark, omit
\ifnum\pdf@strcmp{#1}{#2}=\z@ #1\else \ifnum\pdf@strcmp{}{#2}=\z@ #1\else #1: #2\fi\fi
}
\makeatother
% header and footer layout
\newcommand{\hfstyle}{
% clear all header and footer fields
\fancyhf{}
\rfoot{\scriptsize{\color{Gray} \thepage}}
\lfoot{\scriptsize{\color{Gray}\doctitle, Version 1.1}}
\lhead{\scriptsize{\color{Gray}\nouppercase{\tidyhdr{\leftmark}{\rightmark}}}}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{.5pt}
}
% use custom header and footer
\pagestyle{fancy}
\hfstyle{}
\renewcommand{\chaptermark}[1]{\markboth{#1}{}}
\renewcommand{\sectionmark}[1]{\markright{#1}{}}
\usepackage{datatool}
\newcommand{\sortitem}[2]{%
\DTLnewrow{list}%
\DTLnewdbentry{list}{company}{#1}%
\DTLnewdbentry{list}{description}{#2}%
}
\newcommand*{\processed}{}
\newenvironment{sorteddescription}{%
\DTLifdbexists{list}{\DTLcleardb{list}}{\DTLnewdb{list}}
}{
\DTLsort{company,description}{list}
\begin{description}[itemsep=0pt,leftmargin=0pt, labelindent=0cm]
\DTLforeach*{list}{\comp=company,\desc=description}{
\expandafter\DTLifinlist\expandafter{\comp}{\processed}
{\hspace{-1mm}\desc\\}
{\ifdefempty{\processed}{\let\processed\comp{\item[\comp]$\;$\\\desc\\}}
{\eappto\processed{,\comp}{\item[\comp]$\;$\\\desc\\}}% append to list
}
}
\end{description}
}
% plain style is identical to fancy style. This avoids different formatting in
% the first page of every chapter, for example.
\fancypagestyle{plain}{\hfstyle{}}
\begin{document}
% redefine DIFadd command to avoid \uwave, which causes troubles when the
% argument includes \hypertarget or \hyperlink
% also, changing default 'add color' to avoid collisions with lstlistings colors
\providecommand{\DIFadd}[1]{{\protect\color{Green}#1}}
\renewcommand{\DIFadd}[1]{{\protect\color{Green}#1}}
% Surrounding the text with \mbox is the only way \sout can wrap \hyperref or
% \hypertarget; otherwise, we get a compilation error.
% However, mbox'ed text does not respect the page margins, resulting in lines
% that disappear through the right hand-side of the page.
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{\mbox{#1}}}}
\renewcommand{\DIFdel}[1]{{\protect\color{red}\sout{\mbox{#1}}}}
\pagenumbering{roman}
\addcontentsline{toc}{chapter}{Cover} % add cover to TOC
\begin{titlepage}
\includegraphics[width=.3\textwidth]{fig/foundation.png}
\vspace*{7cm}
\begin{center}
{\Huge \color{Cerulean}{\doctitle}\\[7cm]}
{\small Revision: Version 1.1 $\bullet$ Issue Date: \today}\\ % remember to change footer version too!
\end{center}
\end{titlepage}
\thispagestyle{empty} {\textcopyright 2013-\the\year $\:$HSA Foundation. All rights
reserved.}
The contents of this document are provided in connection with the HSA Foundation
specifications. This specification is protected by copyright laws and contains
material proprietary to the HSA Foundation. It or any components may not be
reproduced, republished, distributed, transmitted, displayed, broadcast or
otherwise exploited in any manner without the express prior written permission
of HSA Foundation. You may use this specification for implementing the
functionality therein, without altering or removing any trademark, copyright or
other notice from the specification, but the receipt or possession of this
specification does not convey any rights to reproduce, disclose, or distribute
its contents, or to manufacture, use, or sell anything that it may describe, in
whole or in part.
HSA Foundation grants express permission to any current Founder, Promoter,
Supporter Contributor, Academic or Associate member of HSA Foundation to copy
and redistribute UNMODIFIED versions of this specification in any fashion,
provided that NO CHARGE is made for the specification and the latest available
update of the specification for any version of the API is used whenever
possible. Such distributed specification may be re-formatted AS LONG AS the
contents of the specification are not changed in any way. The specification may
be incorporated into a product that is sold as long as such product includes
significant independent work developed by the seller. A link to the current
version of this specification on the HSA Foundation web-site should be included
whenever possible with specification distributions.
HSA Foundation makes no, and expressly disclaims any, representations or
warranties, express or implied, regarding this specification, including, without
limitation, any implied warranties of merchantability or fitness for a
particular purpose or non-infringement of any intellectual property. HSA
Foundation makes no, and expressly disclaims any, warranties, express or
implied, regarding the correctness, accuracy, completeness, timeliness, and
reliability of the specification. Under no circumstances will the HSA
Foundation, or any of its Founders, Promoters, Supporters, Academic,
Contributors, and Associates members or their respective partners, officers,
directors, employees, agents or representatives be liable for any damages,
whether direct, indirect, special or consequential damages for lost revenues,
lost profits, or otherwise, arising from or in connection with these materials.
\clearpage
{\Huge \textbf{Acknowledgments}}\\[3mm]
This specification is the result of the contributions of many people. Here
is a partial list of the contributors, including the company that they
represented at the time of their contribution:
\begin{sorteddescription}
\sortitem{AMD}{M\'{e}ndez-Lojo, Mario (spec editor)}
\sortitem{AMD}{Tye, Tony}
\sortitem{AMD}{Zhuravlyov, Konstantin}
\sortitem{AMD}{Sander, Ben}
\sortitem{AMD}{Tipparaju, Vinod}
\sortitem{AMD}{Thangirala, Hari}
\sortitem{AMD}{Apte, Prasad}
\sortitem{AMD}{Cao, Fan}
\sortitem{AMD}{Cornwall, Jay}
\sortitem{AMD}{Ding, Wei}
\sortitem{AMD}{Edwards, Adrian}
\sortitem{AMD}{Errabolu, Ramesh}
\sortitem{AMD}{Keely, Sean}
\sortitem{AMD}{Ramalingam, Shreyas}
\sortitem{AMD}{Wicaksono, Besar}
\sortitem{AMD}{Xiao, Shucai}
\sortitem{AMD}{Yao, Ming}
\sortitem{AMD}{Blinzer, Paul}
\sortitem{AMD}{Herdeg, Mark}
\sortitem{AMD}{Hesik, Chris}
\sortitem{AMD}{McInally, Callan}
\sortitem{AMD}{Purnomo, Budirijanto}
\sortitem{AMD}{Mistry, Perhaad}
\sortitem{AMD}{Tebeka, Yaki}
\sortitem{AMD}{Grant, Al}
\sortitem{Qualcomm}{Howes, Lee}
\sortitem{Qualcomm}{Gaster, Ben}
\sortitem{Qualcomm}{Bourd, Alex}
\sortitem{Qualcomm}{Bellows, Greg}
\sortitem{Qualcomm}{Bin, Lihan}
\sortitem{Qualcomm}{Rychlik, Bob}
\sortitem{Qualcomm}{Simpson, Robert J.}
\sortitem{ARM}{Kovacevic, Djordje}
\sortitem{ARM}{Parker, Jason}
\sortitem{ARM}{Persson, H\r{a}kan}
\sortitem{Imagination}{Aldis, James}
\sortitem{Imagination}{Meredith, Jason}
\sortitem{Imagination}{Howson, John}
\sortitem{Imagination}{Glew, Andy}
\sortitem{Imagination}{McCarthy, James}
\sortitem{Imagination}{Rankilor, Mark}
\sortitem{Imagination}{Zaric, Zoran}
\sortitem{Mediatek}{Bagley, Richard}
\sortitem{Mediatek}{Ju, Roy}
\sortitem{Mediatek}{Lo, Trent}
\sortitem{Mediatek}{Lin, Jason}
\sortitem{Mediatek}{Hsu, Barz}
\sortitem{Mediatek}{Huang, Emerson}
\sortitem{Mediatek}{Agarwal, Rahul}
\sortitem{Via Alliance Technologies}{Hong, Mike}
\sortitem{Sandia National Laboratories}{Stark, Dylan}
\sortitem{Sandia National Laboratories}{Hammond, Simon D.}
\sortitem{Samsung Electronics}{Ryu, Soojung}
\sortitem{Samsung Electronics}{Shebanow, Michael}
\sortitem{Codeplay}{Brand, Simon (tools spec editor)}
\sortitem{Codeplay}{Potter, Ralph}
\sortitem{Codeplay}{Richards, Andrew (workgroup chair)}
\sortitem{General Processor}{D'Arcy, Paul}
\sortitem{General Processor}{Glossner, John}
\sortitem{Northeastern University}{Kaeli, Dave}
\sortitem{Rice University}{Ahdianto, Laksono}
\sortitem{National Taiwan University}{Yeh, Medicine}
\sortitem{SUSE LLC}{Jambor, Martin}
\sortitem{MultiCore Ware}{Jablin, Tom}
\end{sorteddescription}
\clearpage
\phantomsection
\addcontentsline{toc}{chapter}{Contents} % add to TOC
\tableofcontents
\clearpage
\pagenumbering{arabic}
\setcounter{page}{1}
\chapter{Introduction} \label{index}
\vspace{-7mm}
\section{Overview}\label{overview}
\vspace{-3mm}
Recent heterogeneous system designs have integrated CPU, GPU, and other
accelerator devices into a single platform with a shared high-bandwidth memory
system. Specialized accelerators now complement general purpose CPU chips and
are used to provide both power and performance benefits. These
heterogeneous designs are now widely used in many computing markets including
cellphones, tablets, personal computers, and game consoles. The Heterogeneous
System Architecture (HSA) builds on the close physical integration of
accelerators that is already occurring in the marketplace, and takes the next
step by defining standards for uniting the accelerators architecturally. The HSA
specifications include requirements for virtual memory, memory coherency,
architected dispatch mechanisms, and power-efficient signals. HSA refers to
these accelerators as kernel agents.
The HSA system architecture defines a consistent base for building portable
applications that access the power and performance benefits of the dedicated
kernel agents. Many of these kernel agents, including GPUs and DSPs, are capable
and flexible processors that have been extended with special hardware for
accelerating parallel code. Historically these devices have been difficult to
program due to a need for specialized or proprietary programming languages. HSA
aims to bring the benefits of these kernel agents to mainstream programming
languages using similar or identical syntax to that which is provided for
programming multi-core CPUs. For more information on the system architecture,
refer to the HSA Platform System Architecture Specification~\cite{sar}.
In addition to the system architecture, HSA defines a portable, low-level,
compiler intermediate language called HSAIL. A high-level compiler
generates the HSAIL for the parallel regions of code. A low-level compiler
called the finalizer translates the intermediate HSAIL to target machine
code. The finalizer can be run at compile-time, install-time, or run-time. Each
kernel agent provides its own implementation of the finalizer. For more
information on HSAIL, refer to the HSA Programmer's Reference Manual~\cite{prm}.
The final piece of the puzzle is the HSA runtime API. The runtime is a thin,
user-mode API that provides the interfaces necessary for the host to launch
compute kernels to the available kernel agents. This document describes the
architecture and APIs for the HSA runtime. Key sections of the runtime API
include:
\begin{itemize}[itemsep=0pt,topsep=0pt,partopsep=0pt]
\item Error handling
\item Runtime initialization and shutdown
\item System and agent information
\item Signals and synchronization
\item Architected dispatch
\item Memory management
\end{itemize}
The remainder of this document describes the HSA software architecture and
execution model, and includes functional descriptions for all of the HSA APIs
and associated data structures.
\begin{figure}[t]
\centering
\tikzstyle{lang}=[rectangle,draw,fill=black!30,align=center,minimum width=1.25cm,minimum height=.75cm]
\tikzstyle{hsa}=[rectangle,draw,fill=black!10,align=center,minimum height=.75cm]
\tikzstyle{comp}=[rectangle,draw,minimum height=.75cm]
\begin{tikzpicture}[thick,auto, node distance=1.5cm]
\scriptsize
\node[lang] (l0) {OpenCL\texttrademark \\ app};
\node[lang,below of=l0] (r0) {OpenCL\texttrademark \\ runtime};
\node[lang,right of=l0] (l1) {Java \\ app};
\node[lang,below of=l1] (r1) {JVM};
\node[inner sep=0,below right=.75cm of l1] (k) {...};
\node[lang,above right=.75cm of k] (l2) {OpenMP \\ app};
\node[lang,below of=l2] (r2) {OpenMP \\ runtime};
\node[lang,right of=l2] (l3) {DSL \\ app};
\node[lang,below of=l3] (r3) {DSL \\ runtime};
\node[hsa,minimum width=3.5cm,below=2cm of k] (h) {HSA runtime};
\node[hsa,minimum width=3cm,right=1.5cm of h] (hf) {HSA finalizer};
\tiny
\node[comp, below=.75cm of h.west,anchor=west] (c1) {kernel agent 1};
\node[below=.5cm of h.south,anchor=south] (cAny) {...};
\node[comp,below=.75cm of h.east,anchor=east] (cN) {kernel agent N};
\path[->]
(l0) edge (r0)
(l1) edge (r1)
(l2) edge (r2)
(l3) edge (r3)
(r0.south) edge (h)
(r1) edge (h)
(r2) edge (h)
(r3.south) edge (h)
(h) edge[dashed] (hf)
;
\end{tikzpicture}
\caption{HSA Software Architecture}
\label{fig:swarch}
\end{figure}
Figure~\ref{fig:swarch} shows how the HSA runtime fits into a typical software
architecture stack. At the top of the stack is a programming model such as
OpenCL\texttrademark, Java, OpenMP, or a domain-specific language (DSL). The
programming model must include some way to indicate a parallel region that can
be accelerated. For example, OpenCL has calls to \texttt{clEnqueueNDRangeKernel}
with associated kernels and grid ranges. Java defines stream and lambda APIs,
which provide support for both multi-core CPUs and kernel agents. OpenMP
contains OMP pragmas that mark loops for parallel computing and that control
other aspects of the parallel implementation. Other programming models can also
build on this same infrastructure.
The language compiler is responsible for generating HSAIL code for the parallel
regions of code. The code can be pre-compiled before runtime or compiled at
runtime. A high-level compiler can generate the HSAIL before runtime, in which
case, when the application loads the finalizer, converts the HSAIL to machine
code for the target machine. Another option is to run the finalizer when the
application is built, in which case the resulting binary includes the machine
code for the target architecture. The HSA finalizer is an optional element of
the HSA runtime, which can reduce the footprint of the HSA software on systems
where the finalization is done before runtime.
Each language also includes a "language runtime" that connects the language
implementation to the HSA runtime. When the language compiler generates code for
a parallel region, it will include calls to the HSA runtime to set up and
dispatch the parallel region to the kernel agent. The language runtime is also
responsible for initializing the HSA runtime, selecting target devices, creating
execution queues, managing memory. The language runtime may use other HSA
runtime features as well. A runtime implementation may provide optional
extensions. Applications can query the runtime to determine which extensions are
available. This document describes extensions such as Finalization, Images, and
Profiling.
The API for the HSA runtime is standard across all HSA vendors. This means that
languages that use the HSA runtime can execute on different vendors' platforms
that support the API. Each vendor is responsible for supplying their own HSA
runtime implementation that supports all of the kernel agents in the vendor's
platform. HSA does not provide a mechanism to combine runtimes from different
vendors. The implementation of the HSA runtime may include kernel-level
components (required for some hardware components) or may only include
user-space components (for example, simulators or CPU implementations).
Figure~\ref{fig:swarch} shows the ``AQL'' (Architected Queuing
Language) path that application runtimes use to send commands directly to
kernel agents. For more information on AQL, refer to Section~\ref{sec:aql}.
\section{Programming Model}\label{sec:executionmodel}
This section introduces the main concepts behind the HSA programming model by
outlining how they are exposed in the runtime API. In this introductory example
we show the basic steps that are needed to launch a kernel.
The rest of the sections in this specification provide a more formal and
detailed description of the different components of the HSA API, including many
not discussed here.
\subsection{Initialization and Agent Discovery}
The first step any HSA application must perform is to initialize the runtime
before invoking any other calls to the API:
\begin{lstlisting}
hsa_init();
\end{lstlisting}
The next step the application performs is to find a device where it can launch
the kernel. In HSA parlance, a regular device is called an \emph{agent}, and
if the agent can run kernels then it is also an \emph{kernel agent}. The
glossary at the end of this document contains more precise definitions of these
terms. The HSA API uses opaque handles of type \hsaref{hsa_agent_t} to represent
agents and kernel agents.
The HSA runtime API exposes the set of available agents via
\hsaref{hsa_iterate_agents}. This function receives a callback and a buffer from
the application; the callback is invoked once per agent unless it returns a
special 'break' value or an error. In this case, the callback queries an
agent attribute (\hsaref{HSA_AGENT_INFO_FEATURE}) in order to determine whether
the agent is also a kernel agent. If this is the case, the kernel agent
is stored in the buffer and the iteration ends:
\begin{lstlisting}
hsa_agent_t kernel_agent;
hsa_iterate_agents(get_kernel_agent, &kernel_agent);
\end{lstlisting}
where the application-provided callback \textit{get_kernel_agent} is:
\lstinputfunlisting[0]{get_kernel_agent}
Section~\ref{sec:agentinfo} lists the set of available agent and
system-wide attributes, and describes the functions to query them.
\subsection{Queues and AQL packets}
When an HSA application needs to launch a kernel in a kernel agent, it does so
by placing an \textit{AQL packet} in a \textit{queue} owned by the kernel
agent. A packet is a memory buffer encoding a single command. There are
different types of packets; the one used for dispatching a kernel is named
\emph{kernel dispatch} packet.
The binary structure of the different packet types is defined in the HSA
Architecture Specification~\cite{sar} standard. For example, all the packets
types occupy 64 bytes of storage and share a common header, and the kernel
dispatch packets should specify a handle to the executable code at offset
32. The packet structure is known to the application (kernel dispatch packets
correspond to the \hsaref{hsa_kernel_dispatch_packet_t} type in the HSA API),
but also to the hardware. This is a key HSA feature that enables applications to
launch a packet in a specific agent by simply placing it in one of its
\textit{queues}.
A queue is a runtime-allocated resource that contains a packet buffer and is
associated with a packet processor. The packet processor tracks which packets in
the buffer have already been processed. When it has been informed by the
application that a new packet has been enqueued, the packet processor is able to
process it because the packet format is standard and the packet contents are
self-contained -- they include all the necessary information to run a
command. The \textit{packet processor} is generally a hardware unit that is
aware of the different packet formats.
After introducing the basic concepts related to packets and queues, we can go
back to our example and create a queue in the kernel agent using
\hsaref{hsa_queue_create}. The queue creation can be configured in multiple
ways. In the snippet below the application indicates that the queue should
be able to hold 256 packets.
\begin{lstlisting}
hsa_queue_t *queue;
hsa_queue_create(kernel_agent, 256, HSA_QUEUE_TYPE_SINGLE, NULL, NULL, UINT32_MAX, UINT32_MAX, &queue);
\end{lstlisting}
The next step is to create a packet and push it into the newly created
queue. Packets are not created using an HSA runtime function. Instead, the
application can directly access the packet buffer of any queue and setup a
kernel dispatch by simply filling all the fields mandated by the kernel dispatch
packet format (type \hsaref{hsa_kernel_dispatch_packet_t}). The location of the
packet buffer is available in the \hsaref{hsa_queue_t.base_address} field of any
queue:
\begin{lstlisting}
hsa_kernel_dispatch_packet_t* packet = (hsa_kernel_dispatch_packet_t*) queue->base_address;
// Configure dispatch dimensions: use a total of 256 work-items
packet->grid_size_x = 256;
packet->grid_size_y = 1;
packet->grid_size_z = 1;
// Configuration of the rest of the kernel dispatch packet is omitted for simplicity
\end{lstlisting}
In a real-world scenario, the application needs to exercise more caution when
enqueuing a packet -- there could be another thread writing a packet to the same
memory location. The HSA API exposes several functions that allow the
application to determine which buffer index to use to write a packet, and when to
write it. For more information on queues, refer to Section~\ref{sec:queues}. For
more information on AQL packets, refer to Section~\ref{sec:aql}.
\subsection{Signals and Packet launch}
The kernel dispatch packet is not launched until the application informs the
packet processor that there is new work available. The notification is divided
in two parts:
\begin{enumerate}
\item The contents of the first 32 bits of the packet (which include the
\hsaref{hsa_kernel_dispatch_packet_t.header} and the
\hsaref{hsa_kernel_dispatch_packet_t.setup} fields) must be atomically set
using a release memory ordering. This ensures that previous modifications to
the rest of the packet are globally visible by the time the first 32 bits of
the packet are also visible. The most relevant information passed in the
header is the packet's type (in this case,
\hsaref{HSA_PACKET_TYPE_KERNEL_DISPATCH}). For simplicity we omit the details
on how to setup the header and setup fields (the reader can refer to
Section~\ref{dispatch-packet} for the source code of the helper functions used
in the snippet). One possible implementation of the atomic update in GCC
is:
\lstinputfunlisting[1]{set_header_and_setup}
\item The buffer index where the packet has been written (in the example, zero)
must be stored in the \textit{doorbell signal} of the queue.
\end{enumerate}
A \emph{signal} is a runtime-allocated, opaque object used for communication
between agents in an HSA system. Signals are similar to shared memory
locations containing an integer. Agents can atomically store a new integer
value in a signal, atomically read the current value of the signal, etc. using
HSA runtime functions. Signals are the preferred communication mechanism in an
HSA system because signal operations usually perform better (in terms of power
or speed) than their shared memory counterparts. For more information on
signals, refer to Section~\ref{sec:signals}.
When the runtime creates a queue, it also automatically creates a ``doorbell''
signal that must be used by the application to inform the packet processor of
the index of the packet ready to be consumed. The doorbell signal is contained
in the \hsaref{hsa_queue_t.doorbell_signal} field of the queue. The value of a
signal can be updated using \hsaref{hsa_signal_store_screlease}:
\begin{lstlisting}
hsa_signal_store_screlease(queue->doorbell_signal, 0);
\end{lstlisting}
After the packet processor has been notified, the execution of the kernel may
start asynchronously at any moment. The application could simultaneously write
more packets to launch other kernels in the same queue.
In this introductory example, we omitted some important steps in the dispatch
process. In particular, we did not show how to compile a kernel, indicate which
executable code to run in the kernel dispatch packet, nor how to pass arguments
to the kernel. However, some relevant differences with other runtime systems and
programming models are already evident. Other runtime systems provide software
APIs for setting arguments and launching kernels, while HSA architects these at
the hardware and specification level. An HSA application can use regular memory
operations and a very lightweight set of runtime APIs to launch a kernel or in
general submit a packet.
\chapter{HSA Core Programming Guide} \label{coreapi}
This chapter describes the HSA Core runtime APIs, organized by functional
area. For information on definitions that are not specific to any functionality,
refer to Section~\ref{sec:other}. The API follows the requirements listed in the
HSA Programmer's Reference Manual~\cite{prm} and the HSA Platform System
Architecture Specification~\cite{sar}.
Several operating systems allow functions to be executed when a DLL or a shared
library is loaded (for example, DLL main in Windows and GCC
\emph{constructor/destructor} attributes that allow functions to be executed
prior to main in several operating systems). Whether or not the HSA runtime
functions are allowed to be invoked in such fashion may be
implementation-specific and is outside the scope of this specification.
Any header files distributed by the HSA Foundation for this specification may
contain calling-convention specific prefixes such as __cdecl or __stdcall, which
are outside the scope of the API definition.
Unless otherwise stated, functions can be considered thread-safe.
\section{Initialization and Shut Down}\label{sec:init}
When an application initializes the runtime (\hsaref{hsa_init}) for the first
time in a given process, a runtime instance is created. The instance is
reference counted such that multiple HSA clients within the same
process do not interfere with each other. Invoking the initialization routine
$n$ times within a process does not create $n$ runtime instances, but a unique
runtime object with an associated reference counter of $n$. Shutting down the
runtime (\hsaref{hsa_shut_down}) is equivalent to decreasing its reference
counter. When the reference counter is less than one, the runtime object ceases
to exist, and any reference to it (or to any resources created while it was
active) results in undefined behavior.
After being initialized for the first time, the runtime is in the \textit{configuration
state}. Certain functions are only callable whilst the runtime is in the
configuration state. When a runtime function other than any of the following
functions is called, the runtime is no longer in the configuration state:
\begin{itemize}[itemsep=0pt,topsep=0pt,partopsep=0pt]
\item \hsaref{hsa_init}
\item \hsaref{hsa_system_get_info}
\item \hsaref{hsa_extension_get_name}
\item \hsaref{hsa_system_extension_supported}
\item \hsaref{hsa_system_get_extension_table}
\item \hsaref{hsa_agent_get_info}
\item \hsaref{hsa_iterate_agents}
\item \hsaref{hsa_agent_get_exception_policies}
\item \hsaref{hsa_cache_get_info}
\item \hsaref{hsa_agent_iterate_caches}
\item \hsaref{hsa_agent_extension_supported}
\item \hsaref{hsa_region_get_info}
\item \hsaref{hsa_agent_iterate_regions}
\item \hsaref{hsa_isa_from_name}
\item \hsaref{hsa_agent_iterate_isas}
\item \hsaref{hsa_isa_get_info}
\item \hsaref{hsa_isa_get_info_alt}
\item \hsaref{hsa_isa_get_exception_policies}
\item \hsaref{hsa_isa_get_round_method}
\item \hsaref{hsa_wavefront_get_info}
\item \hsaref{hsa_isa_iterate_wavefronts}
\item \hsaref{hsa_isa_compatible}
\end{itemize}
Extensions can specify functions which do not cause the runtime to exit the
configuration state.
\subsection{API}
\input{api/altlatex/group-initshutdown}
\section{Runtime Notifications}
\label{sec:error}
The runtime can report notifications (errors or events) synchronously or
asynchronously. The runtime uses the return value of functions in the HSA API to
pass synchronous notifications to the application. In this case, the
notification is a status code of type \hsaref{hsa_status_t} that indicates
success or error.
The documentation of each function defines what constitutes a successful
execution. When an HSA function does not execute successfully, the returned
status code might help determining the source of the error. While some
conditions can be generalized to a certain degree (e.g. failure in allocating
resources), others have implementation-specific explanations. For
example, certain operations on signals (explained in Section~\ref{sec:signals})
can fail if the runtime implementation validates the signal object passed by
the application. Because the representation of a signal is specific to the
implementation, the reported error would simply indicate that the signal is
invalid.
The \hsaref{hsa_status_t} enumeration captures the result of any API function
that has been executed, except for accessors and mutators. Success is
represented by \hsaref{HSA_STATUS_SUCCESS}, which has a value of zero. Error
statuses are assigned positive integers and their identifiers start with the
\refenu{HSA_STATUS_ERROR} prefix. The application may use
\hsaref{hsa_status_string} to obtain a string describing a status code.
The runtime passes \textit{asynchronous} notifications in a different fashion.
When the runtime detects an asynchronous event, it invokes an
application-defined callback. For example, queues (described in
Section~\ref{sec:queues}) are a common source of asynchronous events because the
tasks queued by an application are asynchronously consumed by the packet
processor. When the runtime detects an error in a queue, it invokes the callback
associated with that queue and passes it a status code (indicating what
happened) and a pointer to the erroneous queue. An application can associate a
callback with a queue at creation time.
The application must use caution when using blocking functions within their
callback implementation -- a callback that does not return can render the
runtime state to be undefined. The application cannot depend on thread local
storage within the callbacks implementation and may safely kill the thread that
registers the callback. The application is responsible for ensuring that
the callback function is thread-safe. The runtime does not implement any default
callbacks.
\subsection{API}
\input{api/altlatex/group-status}
\section{System and Agent Information}
\label{sec:agentinfo}
The HSA runtime API uses opaque handles of type \hsaref{hsa_agent_t} to
represent agents. The application can traverse the list of agents that are
available in the system using \hsaref{hsa_iterate_agents}, and use
\hsaref{hsa_agent_get_info} to query agent-specific attributes. Examples of
agent attributes include: name, type of backing device (CPU, GPU), and supported
queue types. Implementations of \hsaref{hsa_iterate_agents} are required to at
least report the host (CPU) agent.
If an agent supports kernel dispatch packets, then it is also kernel agent
(supports the AQL packet format and the HSAIL instruction set). The application
can inspect the \hsaref{HSA_AGENT_INFO_FEATURE} attribute in order to determine
if the agent is a kernel agent. kernel agents expose a rich set of
attributes related to kernel dispatches such as wavefront size or maximum number
of work-items in the grid.
The application can use \hsaref{hsa_system_get_info} to query system-wide
attributes. Note that the value of some attributes is not constant. For example,
the current timestamp (\hsaref{HSA_SYSTEM_INFO_TIMESTAMP}) value returned by the
runtime can increase as time progresses. For more information on timestamps,
please refer to~\cite{sar}, Section 2.5.
\subsection{API}
\input{api/altlatex/group-agentinfo}
\section{Signals}\label{sec:signals}
Agents can communicate with each other by using coherent shared (global)
memory or by using signals. Agents can perform operations on signals similar
to operations performed on shared memory locations. For example, an agent
can atomically store an integer value on them, atomically load their current
value, etc. However, signals can only be manipulated using the HSA runtime API
or HSAIL instructions. The advantage of signals over shared memory is that
signal operations usually perform better in terms of power or speed. For
example, a spin loop involving atomic memory operations that waits for a shared
memory location to satisfy a condition can be replaced with an HSA signal wait
operator such as \hsaref{hsa_signal_wait_scacquire}, which is implemented by the
runtime using efficient hardware features.
The runtime API uses opaque signal handlers of type \hsaref{hsa_signal_t} to
represent signals. A signal carries a signed integer value of type
\reftyp{hsa_signal_value_t} that can be accessed or conditionally waited upon
through an API call or HSAIL instruction. The value occupies four or eight bytes
depending on the machine model (small or large, respectively) being used. The
application creates a signal using the function \hsaref{hsa_signal_create}.
Modifying the value of a signal is equivalent to sending the signal. In addition
to the regular update (store) of a signal value, an application can perform
atomic operations such as add, subtract, or compare-and-swap. Each read or write
signal operation specifies which memory order to use. For example, store-release
(\hsaref{hsa_signal_store_screlease} function) is equivalent to storing a value on
the signal with release memory ordering. The combinations of actions and
memory orders available in the API match the corresponding HSAIL
instructions. For more information on memory orders and the HSA memory model,
please refer to the other HSA specifications~\cite{prm, sar}.
The application may wait on a signal, with a condition specifying the terms of
the wait. The wait can be done either in the kernel agent by using an HSAIL
\refhsl{wait} instruction or in the host CPU by using a runtime API
call. Waiting for a signal implies reading the current signal value (which is
returned to the application) using a SC acquire
(\hsaref{hsa_signal_wait_scacquire}) or a relaxed
(\hsaref{hsa_signal_wait_relaxed}) memory order. The signal value returned by
the wait operation is not guaranteed to satisfy the wait condition due to
multiple reasons:
\begin{itemize}[itemsep=1pt,topsep=3pt,partopsep=0pt]
\item A spurious wakeup interrupts the wait.
\item The wait time exceeded the user-specified timeout.
\item The wait time exceeded the system timeout
\hsaref{HSA_SYSTEM_INFO_SIGNAL_MAX_WAIT}.
\item The wait has been interrupted because the signal value satisfies the
specified condition, but the value is modified before the implementation of
the wait operation has the opportunity to read it.
\end{itemize}
\subsection{API}
% Macro-dependent typedefs are not showing up correctly in the XML generated
% by Doxygen, so we need to hardcode them :-(
\subsubsection{hsa_\-signal_\-value_\-t}
\vspace{-3.5mm}\begin{mylongtable}{@{}p{\textwidth}}
\rule{0pt}{3ex}\hypertarget{group__signals_1ga67ca2818879c9990e1b5f1b14ce7ed27}{}
\#ifdef HSA_LARGE_MODEL\\
\hspace{1.7em}typedef int64_\-t \textbf{hsa_\-signal_\-value_\-t}\\
\#else \\
\hspace{1.7em}typedef int32_\-t \textbf{hsa_\-signal_\-value_\-t}\\
\#endif\rule[-2ex]{0pt}{0pt}
\end{mylongtable}
\vspace{-3mm}Signal value. The value occupies 32 bits in small machine mode, and 64 bits in large machine mode.
\\
\input{api/altlatex/group-signals}
\section{Queues} \label{sec:queues}
HSA hardware supports command execution through user mode queues. A user mode
command queue is characterized~\cite{sar} as runtime-allocated, user-level,
accessible virtual memory of a certain size, containing packets (commands)
defined in the Architected Queueing Language (AQL is explained in more detail in
the next section). A queue is associated with a specific agent. An agent may
have several queues attached to it. We will refer to user mode queues as just
queues.
The application submits a packet to the queue of an agent by performing the
following steps:
\begin{enumerate}[itemsep=1pt,topsep=3pt,partopsep=0pt]
\item Create a queue on the agent, using \hsaref{hsa_queue_create}. The queue
should support the desired packet type. When the queue is created, the runtime
allocates memory for the \hsaref{hsa_queue_t} data structure that represents
the visible part of the queue, as well as the AQL packet buffer pointed by the
\hsaref{hsa_queue_t.base_address} field.
\item Reserve a packet ID by incrementing the write index of the queue, which is
a 64-bit unsigned integer that contains the number of packets allocated so
far. The runtime exposes several functions such as
\hsaref{hsa_queue_add_write_index_scacquire} to increment the value of the write
index.
\item Wait until the queue is not full (has space for the packet) before writing
the packet. If the queue is full, the packet ID obtained in the previous step
will be greater or equal than the sum of the current read index plus the queue
size. The read index of a queue is a 64-bit unsigned integer that contains the
number of packets that have been processed and released by the queue's packet
processor (i.e., the identifier of the next packet to be released). The
application can load the read index using
\hsaref{hsa_queue_load_read_index_scacquire} or
\hsaref{hsa_queue_load_read_index_relaxed}.
If the application observes that the read index matches the write index, the
queue can be considered empty. This does not mean that the kernels have
finished execution, just that all packets have been consumed.
\item Populate the packet. This step does not require using any HSA
API. Instead, the application directly writes the contents of the AQL packet
located at \hsaref{hsa_queue_t.base_address} + (AQL packet size) * ((packet
ID) \% \hsaref{hsa_queue_t.size}). Note that \hsaref{hsa_queue_t.base_address}
and \hsaref{hsa_queue_t.size} are fields in the queue structure, while the
size of any AQL packet is 64 bytes. The different packet types are discussed
in the next section.
\item Launch the packet by first setting the type of the packet field on its
header to the corresponding value, and then storing the packet ID in
\hsaref{hsa_queue_t.doorbell_signal} using \hsaref{hsa_signal_store_screlease}
(or any variant that uses a different memory order). The application is
required to ensure that the rest of the packet is globally visible before or
at the same time the type is written.
The doorbell signal of the queue is used to indicate the packet processor that
it has work to do. The value which the doorbell signal must be signaled with
corresponds to the identifier of the packet that is ready to be launched.
However, the packet might be consumed by the packet processor even before the
doorbell signal has been signaled. This is because the packet processor might
be already processing some other packet and observes that there is new work
available, so it processes the new packets. In any case, agents are
required to signal the doorbell for every batch of packets they write.
\item (Optional) Wait for the packet to be complete by waiting on its completion
signal, if any.
\item (Optional) Submit more packets by repeating steps 2-6
\item Destroy the queue using \hsaref{hsa_queue_destroy}.
\end{enumerate}
Queues are semi-opaque objects: there is a visible part, represented by the
\hsaref{hsa_queue_t} structure and the corresponding ring buffer (pointed to by
\hsaref{hsa_queue_t.base_address}), and an invisible part, which contains at
least the read and write indexes. The access rules for the different queue parts
are:
\begin{itemize}[itemsep=1pt,topsep=3pt,partopsep=0pt]
\item The {hsa_queue_t} structure is read-only. If the application overwrites
its contents, the behavior is undefined.
\item The ring buffer can be directly accessed by the application.
\item The read and write indexes of the queue can only be accessed using
dedicated runtime APIs. The available index functions differ on the index of
interest (read or write), action to be performed (addition, compare and swap,
etc.), and memory order applied(relaxed, release, etc.).
\end{itemize}
\subsection{Single vs. Multiple Producers}
An application may limit the job submission to a single agent. When this is the
case, the application can create a single producer queue (a queue of type
\hsaref{HSA_QUEUE_TYPE_SINGLE}), which may be more efficient than a regular,