# caten/codegen: Hackable Kernel Generator

The `caten/codegen` is a system designed to JIT-compile AASM Graphs into other languages, such as C or Metal, to achieve faster computation.
This package has the following two objectives:

- **Device Independence**: It imposes no constraints on the execution device, allowing users to extend it with only 200–300 lines of code.
- **Performance**: It ensures computations run fairly fast on each target device.

Caten's code generation is divided into two stages based on the intended purpose:

- `caten/codegen`: Generates unoptimized code (e.g., Scheduler, Lowerer, op fusion, etc.).
- `caten/polyhedral` (Optional): Performs advanced kernel loop transformations and optimizations on the scheduled code, such as the Auto Scheduler.

This document focuses on the first stage, `caten/codegen`, and provides related information and some examples.
# How to enable JIT

To enable JIT, you need to set the `JIT` contextvar before running the function `caten`. The following code snippet demonstrates how to enable JIT on the REPL:

```lisp
;; Option 1: Set globally
(setf (ctx:getenv :JIT) 1)
;; Option 2: Set locally
(ctx:with-contextvar (:JIT 1)
  ;; ...
  )
```

With the contextvar set, calling `caten` JIT-compiles the AASM Graphs.
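For instance, here is a minimal sketch (reusing the same ops as the scheduler example later in this document) of compiling a small graph with JIT enabled:

```lisp
;; Compile a small graph with JIT enabled.
;; Instead of interpreting the AASM graph, the returned AVM
;; executes kernels generated in the target language (e.g., C).
(ctx:with-contextvar (:JIT 1)
  (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))
```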
# How to debug JIT

You can set the `JIT_DEBUG` contextvar to debug the JIT compilation process. The debugging level ranges from 0 to 4, as shown below:

| JIT_DEBUG | Effect                                        |
|-----------|-----------------------------------------------|
| 0         | None                                          |
| 1         | Nothing for now                               |
| 2         | Displays blueprint and scop                   |
| 3         | Displays schedule-graph                       |
| 4         | Displays memory-planner and the compiled code |
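For example, following the same contextvar pattern as `JIT` above, you can dump the schedule-graph during compilation (a sketch; per the table, level 3 displays the schedule-graph):

```lisp
;; Set the debug level globally, then compile with JIT enabled;
;; JIT_DEBUG=3 prints the schedule-graph during compilation.
(setf (ctx:getenv :JIT_DEBUG) 3)
(setf (ctx:getenv :JIT) 1)
(caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3)))))
```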
# How to learn JIT

You can start by reading `./caten/docs/getting-started.lisp` to understand the basic concepts of `caten/codegen` through a running example.

Additionally, the following sections provide documentation and explanations for each JIT mechanism, along with simple executable examples.
# Shape Inference

The JIT compilation starts by inferring the shape of each point in the graph. `caten/codegen/shape-inference` is responsible for this task.

### [function] run-type-infer

Runs the shape inference on the given AVM, returning a `Type-Reporter`.

### [struct] Inferred-Type

The structure `Inferred-Type` contains the shape/stride/view information of the node at each point. It also contains the iteration-space information that is used to render the kernel.

If the shape inference has completed successfully and been deployed to the target graph by the rewriting rules, the function `(read-type-relay node)` will return the `Inferred-Type` structure.

- `relay-reads` returns the list of buffers corresponding to `node-reads`. (If an element of `node-reads` is a number, `nil` is set instead.)
- `relay-writes` returns the list of buffers corresponding to `node-writes`.
- `relay-read-iters` returns the list of iteration spaces corresponding to `node-reads`.
- `relay-write-iters` returns the list of iteration spaces corresponding to `node-writes`.
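As a minimal sketch (reusing the AVM construction pattern from the scheduler example later in this document), shape inference can be run and inspected like this:

```lisp
;; Build an AASM graph with JIT=0, wrap it in an AVM, and run shape inference.
(let* ((graph (ctx:with-contextvar (:JIT 0)
                (avm-graph (caten (!relu (make-tensor `(3 3)))))))
       (vm (make-avm graph :example nil (graph-outputs graph) nil)))
  (run-type-infer vm) ;; Deploys an Inferred-Type to each node
  ;; Each node now answers read-type-relay; print the write buffers.
  (dolist (node (graph-nodes (avm-graph vm)))
    (print (relay-writes (read-type-relay node)))))
```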
### [struct] Iteration-Space

`Iteration-Space` is a structure that contains the shape/stride/view information of each buffer; it is used to render the kernel. Use `iteration-space-shape` to get the shape, `iteration-space-strides` to get the strides, `iteration-space-views` to get the offsets and increments, and `iteration-space-procedure` to get how the iteration space is collapsed or permuted. Each element of the shape/strides is an `Expr`.

Use `iteration-space-expr-aref` to render the aref.
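Building on `read-type-relay`, the accessors above can be combined to inspect the iteration space of each node's first read (a sketch; it assumes the rewriting rules have been applied so that iteration spaces are available):

```lisp
;; Run shape inference + rewriting rules, then walk the graph and
;; print the shape/strides (lists of Expr) of each first read.
(let* ((graph (ctx:with-contextvar (:JIT 0)
                (avm-graph (caten (!relu (make-tensor `(3 3)))))))
       (vm (make-avm graph :example nil (graph-outputs graph) nil)))
  (run-type-infer vm)
  (apply-rewriting-rules vm)
  (dolist (node (graph-nodes (avm-graph vm)))
    (let ((is (car (relay-read-iters (read-type-relay node)))))
      (when is ;; some nodes have no read iteration space
        (print (iteration-space-shape is))
        (print (iteration-space-strides is))))))
```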
# Rewriting Rules

TODO: apply-rewriting-rules doc

# Optimization for gemm

TODO: WMMA Rewriting Docs
# Scheduler

The `caten/codegen/scheduler` is responsible for assigning the execution order of the computation graph in the AASM. Occasionally it merges or rewrites views in order to schedule multiple nodes into the same schedule item where possible.

The function `graph-schedule` is the entry point: it takes a shape-inferred AASM graph as input, performs scheduling, and returns a schedule graph (called `Schedule-Graph`), each node of which is composed of a `Schedule-Item`. Since one `Schedule-Item` corresponds to one GPU kernel, `graph-schedule` must ensure that each item is merged only within the bounds that can be lowered into a single kernel.

## Schedule-Item
### :SCHEDULE-ITEM

Schedule-Item is an intermediate object representing one GPU kernel:

```
f = cache_name or name
write_ids = f(*[storage_id_dst], *[dynamic_shape], *[inputs])
            ^ can be modified by the memory-planner
```

It has a unique `name` and a `cache-name`. If `cache-name` is assigned, the compiler will skip compiling this schedule-item and reuse the kernel named `cache-name` instead.

In order to lower the computation graph into a foreign language, `items` must consist of JITAble operations (except for special IRs and `:allocate`). If it qualifies, `jitable` is set to `T`. Otherwise, the scheduled items are relocated to the compiled AVM directly. Specifically, if the item is an `:ALLOCATE`, `allocated-p` is set to `T`.

- `blueprint[list]` is a lowered schedule-item.
- `polyhedral[list]` is the Polyhedral IR obtained by lowering `blueprint`.
- `auto-schedule-p[list]` is set to `T` if it is worth running the auto-scheduler. (If there is a symbolic increment, the loop is not affine and the ISL scheduler cannot run.)
- `items[list]` are the scheduled items.
- `items-to-cache[list]` are copies of `items` but with unique reads/writes. It is used to determine the equivalence of two schedule-items.
- `rank[fixnum]` is the highest rank of the iteration space.
- `storage-id-src[list]` is the list of the storage-ids of the source buffers (optimized by running the memory-planner).
- `storage-id-dst[list]` is the list of the storage-ids of the destination buffers (optimized by running the memory-planner).
- `namespace[list]` is a list of symbols appearing in `items`. It is used to map a new blueprint from the cached blueprint.

When optimizing schedule-item in-place, the 0th read is consumed.
### [function] graph-schedule

Creates a Schedule-Graph (`FastGraph`) from the given `graph`.
## Example (Scheduler + Shape Inference + Rewriting Rules)

This code snippet demonstrates how to create a Schedule-Graph from an AASM Graph. The AASM Graph is obtained by running `caten` with `JIT=0`.

```lisp
CATEN-USER> (ctx:with-contextvar (:JIT 0) (pprint-graph (avm-graph (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))))
```

Result:

```
============================================================================================================================================
=== [P: 0] =================================================================================================================================
[P=0, ID=0]:
:PAUSE/BACKWARD {0}
└ :MAX {N1}
  └ :VIEW {N2}
  │ ├ load(0.0)
  │ └ Allocate[:float32] NIL
  ├ :VIEW {N3}
  ├ :MOVE {N4}
  ├ Allocate[:float32] (3 3 1)
  └ :VIEW {N5}
  ├ :ADD {N6}
  ├ :VIEW {N7}
  │ ├ load(0.0)
  │ └ Allocate[:float32] (3 3 1)
  └ :MUL {N8}
  └ :VIEW {N9}
  │ ├ Allocate[:float32] (3 3)
  ├ :MOVE {N10}
  ├ Allocate[:float32] (3 3 3)
  └ :VIEW {N11}
  ├ Allocate[:float32] (3 3)
============================================================================================================================================
```
```lisp
(let* ((graph (ctx:with-contextvar (:JIT 0) (avm-graph (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))))
       (vm (make-avm graph :example nil (graph-outputs graph) nil)))
  (run-type-infer vm)       ;; Running the shape inference
  (apply-rewriting-rules vm) ;; Running the rewriting rules
  ;; graph-schedule to finally run the scheduler
  (pprint-graph (graph-schedule (avm-graph vm))))
```

Result:

```
============================================================================================================================================
=== [P: 0] =================================================================================================================================
[P=0, ID=0]:
:PAUSE/BACKWARD {N1}
└ [KERNEL] FUSED_RELU_SUMNODE_MATMUL18987
  ├ Allocate[:float32] (3 3 3)
  ├ Allocate[:float32] (3 3)
  ├ Allocate[:float32] (3 3)
  ├ Allocate[:float32] (3 3 1)
  ├ Allocate[:float32] (3 3 1)
  └ Allocate[:float32] NIL
============================================================================================================================================
```
# Lowerer (Blueprint)

The package `caten/codegen/blueprint` is responsible for lowering a schedule-item into a blueprint. A blueprint is an IR that represents a computation graph with explicit loop bounds. The `lower-schedule-item` method infers the loop boundaries based on the `Schedule-Item` and lowers it into a format that includes `:FOR`/`:ENDFOR` nodes.

### [function] LOWER-SCHEDULE-ITEM

Lowers the Schedule-Item into a blueprint.

- `node` is the schedule-item to lower, which belongs to `scheduled-graph`.
- `base-graph` is the graph you passed to `graph-schedule`.
- `scheduled-graph` is the scheduled graph obtained by `graph-schedule`.
## Example (Scheduler + Lowerer)

This code snippet demonstrates how to lower the schedule-graph created in the previous section into a blueprint.

```lisp
(let* ((graph (ctx:with-contextvar (:JIT 0) (avm-graph (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))))
       (vm (make-avm graph :example nil (graph-outputs graph) nil)))
  (run-type-infer vm)       ;; Running the shape inference
  (apply-rewriting-rules vm) ;; Running the rewriting rules
  ;; graph-schedule to finally run the scheduler
  (let ((schedule-graph (graph-schedule (avm-graph vm))))
    (caten/codegen/expr-cache::with-expr-cache () ;; need this to forcibly keep the loop affine.
      (let ((bp (lower-schedule-item (nth 1 (graph-nodes schedule-graph)) graph schedule-graph)))
        (print-blueprint bp nil)))))
```
# Renderer

TODO (Docs for Renderer/render-expr/extensible polyhedral compiler/etc.)

Here is a list of nodes used to render the kernel.

## Render IRs

### :DEFINE-GLOBAL

The node `:DEFINE-GLOBAL` declares a global variable in the kernel. (It corresponds to an argument of the kernel.)

When optimizing define-global in-place, the 0th read is consumed.
### :AREF

`:AREF` represents an array access and has the following attributes:

- `storage-id[symbol]` An index to the reference pointer, optimized by the memory-planner.
- `buffer[Buffer]` The buffer to be accessed.
- `space[Iteration-Space]` The iteration space `:AREF` belongs to.

When optimizing aref in-place, the 0th read is consumed.
### :ENDIF

When optimizing endif in-place, the 0th read is consumed.

### :IF

When optimizing if in-place, the 0th read is consumed.

### :ENDFOR

When optimizing endfor in-place, the 0th read is consumed.

### :FOR

- `scope[keyword]` One of `:global`, `:local`. If `:global`, the loop is parallelizable.

When optimizing for in-place, the 0th read is consumed.
## EXPR

TODO
# Memory Planner

TODO: Docs (MIP Solver)

# Scop Analyzer

TODO: Docs for scop