Add MI300 details to docs #446

Draft
wants to merge 22 commits into
base: develop

@peterjunpark peterjunpark commented Oct 9, 2024

This PR updates the documentation with information about the MI300 series.

Performance model

L1

UTCL1

L2

VALU

- Add MI300 to the list of products with MFMA units (https://advanced-micro-devices-demo--446.com.readthedocs.build/projects/rocprofiler-compute/en/446/conceptual/pipeline-descriptions.html#vector-arithmetic-logic-unit-valu)

AGPRs

Scalar / Instruction cache

@peterjunpark peterjunpark added the documentation Improvements or additions to documentation label Oct 9, 2024
@peterjunpark peterjunpark force-pushed the docs/mi300 branch 2 times, most recently from d490ba3 to f027f4d Compare January 22, 2025 19:30
@peterjunpark peterjunpark changed the base branch from amd-staging to develop January 23, 2025 19:03

@skyreflectedinmirrors skyreflectedinmirrors left a comment

Very good start!

system supports a maximum of two instances. In contrast, the CDNA3-based
:ref:`MI300 <mixxx-note>` accelerator features 16 channels per XCD, each with a
capacity of 256KB and also utilizing 256B address interleaving, allowing for a
total of up to *eight* instances. Incoming requests are mapped to specific L2
Contributor

total of up to eight instances (one per XCD)

Contributor Author

Updated to this:
...
The L2 cache consists of several distinct channels. The CDNA3-based :ref:`MI300 <mixxx-note>`
accelerator consists of 16 channels, each with a capacity of 256KB and utilizing
256B address interleaving. These channels can operate largely independently and
the system supports up to 8 instances (one per XCD). In contrast, the
:ref:`MI200 <mixxx-note>` and earlier CDNA accelerators have 32 L2 cache
channels each using 256B address interleaving, but only support a maximum of 2
instances. ...

Contributor

I... think for MI200, at best it would be "up to two instances, one per GCD". And MI100 has 16 channels :P

I forget how we've generally discussed MI200's GCD stuff in these docs, but I think typically we just talk about them like they're entirely separate GPUs.

I would probably do:

The L2 cache consists of several distinct channels. The CDNA3-based :ref:`MI300 <mixxx-note>`
accelerator consists of 16 channels, each with a capacity of 256KB and utilizing
256B address interleaving. These channels can operate largely independently and
the system supports up to 8 total L2 cache instances (one per XCD). In contrast, the
:ref:`MI200 <mixxx-note>` CDNA accelerators have 32 L2 cache
channels each using 256B address interleaving, and MI100 CDNA accelerators / GCN GPUs have only 16 L2 cache channels. ...
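
For anyone checking the numbers in the wording above, here is a rough Python sketch of the arithmetic. The constants are the MI300 figures quoted in this thread; the function names are hypothetical, 256KB is read as 256 KiB, and the plain modulo channel mapping is an assumption, since nothing here says how the hardware actually selects a channel.

```python
# MI300 (CDNA3) figures quoted in the proposed wording above.
CHANNELS_PER_XCD = 16        # L2 channels in one XCD
CHANNEL_CAPACITY_KIB = 256   # capacity of one channel (256KB read as KiB)
INTERLEAVE_BYTES = 256       # address interleaving granularity
MAX_XCDS = 8                 # up to 8 L2 instances, one per XCD


def l2_capacity_kib(xcds: int = 1) -> int:
    """Total L2 capacity in KiB across `xcds` L2 instances."""
    return xcds * CHANNELS_PER_XCD * CHANNEL_CAPACITY_KIB


def l2_channel(address: int, channels: int = CHANNELS_PER_XCD) -> int:
    """Channel index for a byte address.

    ASSUMPTION: simple modulo interleaving on 256B blocks. Real hardware
    may hash address bits differently; this only illustrates the idea.
    """
    return (address // INTERLEAVE_BYTES) % channels
```

Under these assumptions, one XCD has 16 x 256 KiB = 4 MiB of L2, and eight XCDs have 32 MiB in total. Consecutive 256B blocks land on consecutive channels and wrap around after 16 blocks.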

cc: @feizheng10 any thoughts?

Contributor Author

Updated c7bc7bd

docs/conceptual/l2-cache.rst (outdated, resolved)
docs/conceptual/l2-cache.rst (outdated, resolved)
.. list-table::
   :header-rows: 1

   * - Feature
Contributor

Table seems weird with just one entry right now, but I'm sure we had ideas on how to fill it :P

Contributor

Oh, right, we probably want to (eventually) take a second pass and go through to find places where we distinguish values based on the architecture, like the waveslots discussion below (or AGPRs), and add them here.

That can probably wait till this is ~ finalized though

docs/conceptual/vector-l1-cache.rst (outdated, resolved)
docs/tutorial/includes/infinity-fabric-transactions.rst (outdated, resolved)
@vedithal-amd vedithal-amd force-pushed the develop branch 11 times, most recently from d1528cc to 95b600e Compare January 31, 2025 20:12