Caten Documentation

caten/codegen

Hackable Kernel Generator

caten/codegen is a system that JIT-compiles AASM graphs into other languages, such as C or Metal, to achieve faster computation.

This package has the following two objectives:

  1. Device Independence: It imposes no constraints on the execution device, allowing users to extend it with only 200–300 lines of code.

  2. Performance: It ensures computations run fairly fast on each target device.

Caten's code generation is divided into two stages based on the intended purpose:

  1. caten/codegen generates unoptimized code (e.g., Scheduler, Lowerer, op fusion).

  2. caten/polyhedral (optional) performs advanced kernel loop transformations and optimizations, such as auto-scheduling, on the scheduled code.

This document focuses on the first stage, caten/codegen, and provides related information and some examples.

How to enable JIT

To enable JIT, set the :BACKEND contextvar to the desired backend. (The backend must be defined via caten/codegen/backend:define-backend with :jit set to 1.)

The following code snippet demonstrates how to enable JIT at the REPL:

;; Option 1: Set globally
(setf (ctx:getenv :BACKEND) "CLANG")
;; Option 2: Set locally
(ctx:with-contextvar (:BACKEND "CLANG")
  ;; ...
)
After setting the contextvar, you can call the function caten to JIT-compile an AASM graph, e.g.:
(caten (!rand `(3 3)))

How to debug JIT

You can set the JIT_DEBUG contextvar to debug the JIT compilation process. The debugging level ranges from 0 to 4, as shown below:

JIT_DEBUG | Effect
----------|-------
0         | None
1         | Nothing for now
2         | Displays blueprint and scop
3         | Displays schedule-graph
4         | Displays memory-planner and the compiled code
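
As a minimal sketch (assuming JIT_DEBUG is an ordinary contextvar, set the same way as :BACKEND above), the highest debug level can be enabled locally like this:

```lisp
;; Sketch only: JIT_DEBUG is assumed to be settable via the same
;; contextvar mechanism shown in "How to enable JIT".
(ctx:with-contextvar (:BACKEND "CLANG" :JIT_DEBUG 4)
  ;; Level 4 displays the memory-planner result and the compiled code.
  (caten (!rand `(3 3))))
```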

How to learn JIT

You can start by reading ./caten/docs/getting-started.lisp to understand the basic concepts of caten/codegen through a running example.

Additionally, the following sections provide documentation and explanations for each JIT mechanism, along with simple executable examples.

Shape Inference

JIT compilation starts by inferring the shape of each node in the graph. caten/codegen/shape-inference is responsible for this task.

[function] run-type-infer

(run-type-infer runtime)
Runs shape inference on the given GraphRuntime, returning a Type-Reporter.

[struct] Inferred-Type

The structure Inferred-Type contains the shape/stride/view information of the node at each point. It also contains the iteration-space information used to render the kernel.

If shape inference completes successfully and is properly deployed to the target graph by the rewriting rules, the function (read-type-relay node) returns the Inferred-Type structure.

  • relay-reads returns the list of buffers corresponding to node-reads. (If an element of node-reads is a number, nil is set instead.)
  • relay-writes returns the list of buffers corresponding to node-writes.
  • relay-read-iters returns the list of iteration spaces corresponding to node-reads.
  • relay-write-iters returns the list of iteration spaces corresponding to node-writes.
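
For illustration, here is a hypothetical sketch that walks a shape-inferred graph and prints the write buffers of each node via the accessors above. The names follow this document; details may differ in practice.

```lisp
;; Hypothetical sketch using the accessors listed above.
(let* ((graph (ctx:with-contextvar (:BACKEND "LISP")
                (runtime-graph (caten (!relu (make-tensor `(3 3)))))))
       (vm (make-runtime graph :fw-outputs (graph-outputs graph))))
  (run-type-infer vm)        ;; deploy Inferred-Type to each node
  (apply-rewriting-rules vm)
  (dolist (node (graph-nodes (runtime-graph vm)))
    (let ((type (read-type-relay node)))
      (when type
        ;; relay-writes returns the buffers corresponding to node-writes
        (print (relay-writes type))))))
```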

[struct] Iteration-Space

Iteration-Space is a structure that contains the shape/stride/view information of each buffer and is used to render the kernel. Use iteration-space-shape to get the shape, iteration-space-strides to get the strides, iteration-space-views to get the offsets and increments, and iteration-space-procedure to get how the iteration space is collapsed or permuted. Each element of the shape/strides is an Expr.

Use iteration-space-expr-aref to render an aref:

(iteration-space-expr-aref iteration-space buffer gids)
gids corresponds to the loop indices in the kernel.
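
A hypothetical helper combining this with the Inferred-Type accessors from the previous section. The signatures are assumed from this document only; treat this as a sketch, not the actual API.

```lisp
;; Hypothetical sketch: renders the aref EXPR for the first read of NODE.
;; GIDS is the list of loop-index symbols used in the kernel.
(defun render-first-read-aref (node gids)
  (let* ((type   (read-type-relay node))
         (buffer (first (relay-reads type)))
         (space  (first (relay-read-iters type))))
    ;; e.g. with gids = (list '_gid0 '_gid1), this would render an
    ;; expression like val_4[((3*_gid0)+_gid1)].
    (iteration-space-expr-aref space buffer gids)))
```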

Rewriting Rules

TODO: apply-rewriting-rules doc

Optimization for gemm

TODO: WMMA Rewriting Docs

Scheduler

The caten/codegen/scheduler is responsible for assigning the execution order of the computation graph in the AASM. Occasionally it merges or rewrites views in order to schedule multiple nodes into the same schedule item where possible.

The function graph-schedule is the entry point: it takes a shape-inferred AASM graph as input, performs scheduling, and returns a schedule graph (called Schedule-Graph), each node of which is a Schedule-Item.

One Schedule-Item corresponds to one GPU kernel, so graph-schedule must ensure that each item is merged only within the bounds that can be lowered into a single kernel.

:SCHEDULE-ITEM

Schedule-Item is an intermediate object representing one GPU kernel.

f = cache_name or name
write_ids = f(*[storage_id_dst], *[dynamic_shape], *[inputs])
                      ^ can be modified by the memory-planner

It has a unique name and a cache-name. If cache-name is assigned, the compiler will not compile this schedule-item; it reuses the kernel named cache-name instead.

To lower the computation graph into a foreign language, an item must consist of JITAble operations (except for special IRs and :allocate). If it qualifies, jitable is set to T.

Otherwise, the scheduled items are relocated directly to the compiled GraphRuntime. Specifically, if the item is :ALLOCATE, :allocated-p is set to T.

  • blueprint[list] is the lowered schedule-item
  • polyhedral[list] is the Polyhedral IR obtained by lowering blueprint
  • auto-schedule-p[list] is set to T if it is worth running the auto-scheduler. (If there is a symbolic increment, the loop is not affine and the isl scheduler cannot run.)
  • items[list] are the scheduled items
  • items-to-cache[list] are copies of items with unique reads/writes. It is used to determine the equivalence of two schedule-items.
  • rank[fixnum] is the highest rank of the iteration space.
  • storage-id-src[list] is the list of storage-ids of the source buffers (optimized by running the memory-planner)
  • storage-id-dst[list] is the list of storage-ids of the destination buffers (optimized by running the memory-planner)
  • namespace[list] is a list of symbols appearing in the items. It is used to map a new blueprint from the cached blueprint.

When optimizing schedule-item in-place, the 0th read is consumed.
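
As an illustration, here is a hypothetical summary pass over a schedule-graph using the slots listed above. The getattr-style slot access is an assumption; adapt it to the actual node API.

```lisp
;; Hypothetical sketch: prints, for each item in a schedule-graph,
;; whether it will become a kernel (jitable) and how many nodes it fused,
;; using the name/jitable/items slots described above.
(defun summarize-schedule (schedule-graph)
  (dolist (item (graph-nodes schedule-graph))
    (format t "~a: jitable=~a items=~a~%"
            (getattr item :name)
            (getattr item :jitable)
            (length (getattr item :items)))))
```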

[function] graph-schedule

(graph-schedule graph)
Creates a schedule-graph (FastGraph) from the given graph.

Example (Scheduler + Shape Inference + Rewriting Rules)

This code snippet demonstrates how to create a schedule-graph from an AASM graph. The AASM graph is obtained by running caten with JIT=0.

CATEN-USER> (ctx:with-contextvar (:BACKEND "LISP") (pprint-graph (runtime-graph (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))))
Result
============================================================================================================================================
=== [P: 0] =================================================================================================================================
[P=0, ID=0]:
   :PAUSE/BACKWARD {0}
   └ :MAX {N1}
     └ :VIEW {N2}
     │ ├ load(0.0)
     │   └ Allocate[:float32] NIL
     ├ :VIEW {N3}
       ├ :MOVE {N4}
         ├ Allocate[:float32] (3 3 1)
         └ :VIEW {N5}
           ├ :ADD {N6}
             ├ :VIEW {N7}
             │ ├ load(0.0)
             │   └ Allocate[:float32] (3 3 1)
             └ :MUL {N8}
               └ :VIEW {N9}
               │ ├ Allocate[:float32] (3 3)
               ├ :MOVE {N10}
                 ├ Allocate[:float32] (3 3 3)
                 └ :VIEW {N11}
                   ├ Allocate[:float32] (3 3)
============================================================================================================================================
Example
(let* ((graph (ctx:with-contextvar (:BACKEND "LISP") (runtime-graph (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))))
       (vm (make-runtime graph :fw-outputs (graph-outputs graph))))
  (run-type-infer vm) ;; Running the shape inference
  (apply-rewriting-rules vm) ;; Running the rewriting rules
  ;; graph-schedule to finally run the scheduler
  (pprint-graph (graph-schedule (runtime-graph vm))))
Result
============================================================================================================================================
=== [P: 0] =================================================================================================================================
[P=0, ID=0]:
   :PAUSE/BACKWARD {N1}
   └ [KERNEL] FUSED_RELU_SUMNODE_MATMUL18889
     ├ Allocate[:float32] (3 3 3)
     ├ Allocate[:float32] (3 3)
     ├ Allocate[:float32] (3 3)
     ├ Allocate[:float32] (3 3 1)
     ├ Allocate[:float32] (3 3 1)
     └ Allocate[:float32] NIL
============================================================================================================================================

Lowerer (Blueprint)

The package caten/codegen/blueprint is responsible for lowering a schedule-item into a blueprint. A blueprint is an IR that represents a computation graph with explicit loop bounds. The lower-schedule-item method infers loop boundaries based on the Schedule-Item and lowers it into a format that includes :FOR/:ENDFOR nodes.

[function] lower-schedule-item

(lower-schedule-item node base-graph scheduled-graph)

Lowers the Schedule-Item into a blueprint.

  • node is the schedule-item to lower, which belongs to scheduled-graph.

  • base-graph is the graph you passed to graph-schedule.

  • scheduled-graph is the scheduled graph obtained by graph-schedule.

Example (Scheduler + Lowerer)

This code snippet demonstrates how to lower the schedule-graph created in the previous section into a blueprint.

Example
(let* ((graph (ctx:with-contextvar (:BACKEND "LISP") (runtime-graph (caten (!relu (!matmul (make-tensor `(3 3)) (make-tensor `(3 3))))))))
       (vm (make-runtime graph :fw-outputs (graph-outputs graph))))
  (run-type-infer vm) ;; Running the shape inference
  (apply-rewriting-rules vm) ;; Running the rewriting rules
  ;; graph-schedule to finally run the scheduler
  (let ((schedule-graph (graph-schedule (runtime-graph vm))))
    (caten/codegen/expr-cache::with-expr-cache () ;; need this to forcibly keep the loop affine.
      (let ((bp (lower-schedule-item (nth 5 (graph-nodes schedule-graph)) graph schedule-graph)))
        (print-blueprint bp nil)))))
Result
{
  for (int _gid0=0;(_gid0<3);_gid0+=1) {
    for (int _gid1=0;(_gid1<3);_gid1+=1) {
      val_2 = 0.0;
      for (int _gid2=0;(_gid2<3);_gid2+=1) {
        val_2 = (val_2+(val_4[((3*_gid0)+_gid2)]*val_6[(_gid1+(3*_gid2))])); // :reduction=t
      } // _gid2
      val_12[((3*_gid0)+_gid1)] = max(val_2, 0.0);
    } // _gid1
  } // _gid0
}

Renderer

TODO (Docs for Renderer/render-expr/extensible polyhedral compiler/etc.)

Here is a list of nodes used to render the kernel.

Render IRs

:DEFINE-GLOBAL

The node :DEFINE-GLOBAL declares a global variable in the kernel (it corresponds to an argument of the kernel).

When optimizing define-global in-place, the 0th read is consumed.

:AREF

:AREF corresponds to the following code:

ID[*space]

  • storage-id[symbol] An index to the reference pointer optimized by the memory-planner.
  • buffer[AbstractBuffer] The buffer to be accessed.
  • space[Iteration-Space] The iteration space :AREF belongs to.

When optimizing aref in-place, the 0th read is consumed.

:ENDIF

} // endif

When optimizing endif in-place, the 0th read is consumed.

:IF

if(condition)
When optimizing if in-place, the 0th read is consumed.

:ENDFOR

} // idx
When optimizing endfor in-place, the 0th read is consumed.

:FOR

for(int idx=upfrom, below, by)
  • scope[keyword] One of :global or :local. If :global, the loop is parallelizable.

When optimizing for in-place, the 0th read is consumed.

EXPR

TODO

Memory Planner

TODO: Docs (MIP Solver)

Scop Analyzer

TODO: Docs for scop
