From 1be19a732880bf55074d7538b40b585272d589cb Mon Sep 17 00:00:00 2001 From: eric Date: Wed, 20 Aug 2025 06:04:04 +0000 Subject: [PATCH] deploy: ba596e75db9a7b65da50fb40c8e00f7859a7e39b --- 404.html | 2 +- about/index.html | 2 +- categories/index.html | 2 +- index.html | 2 +- index.xml | 4 ++-- .../index.html | 2 +- .../index.html | 2 +- .../index.html | 2 +- .../index.html | 2 +- posts/how-rvq-teaches-llms-to-see-and-hear/index.html | 2 +- posts/index.html | 8 ++++---- posts/index.xml | 4 ++-- .../index.html | 2 +- posts/page/2/index.html | 8 ++++++++ posts/quantization-in-llms/index.html | 10 ++++++++++ .../index.html | 2 +- posts/supabase-deep-dive/index.html | 2 +- .../index.html | 2 +- posts/useful/index.html | 2 +- sitemap.xml | 2 +- tags/index.html | 2 +- 21 files changed, 42 insertions(+), 24 deletions(-) create mode 100644 posts/page/2/index.html create mode 100644 posts/quantization-in-llms/index.html diff --git a/404.html b/404.html index 4d2723b..43db4d1 100644 --- a/404.html +++ b/404.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/about/index.html b/about/index.html index bbc9485..5d5e13a 100644 --- a/about/index.html +++ b/about/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/categories/index.html b/categories/index.html index 73a9297..39026ab 100644 --- a/categories/index.html +++ b/categories/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/index.html b/index.html index 632d4a3..b39061a 100644 --- a/index.html +++ b/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/index.xml b/index.xml index 265a19a..111d1bf 100644 --- a/index.xml +++ b/index.xml @@ -1,4 +1,4 @@ -Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenWed, 20 Aug 2025 04:48:53 +0000A Technical Deep Dive into the Transformer's Core Mechanics/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/Tue, 19 Aug 2025 00:00:00 +0000/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/<p>The Transformer architecture is the bedrock of modern Large Language Models (LLMs). While its high-level success is widely known, a deeper understanding requires dissecting its core components. This article provides a detailed, technical breakdown of the fundamental concepts within a Transformer block, from the notion of &ldquo;channels&rdquo; to the intricate workings of the attention mechanism and its relationship with other advanced architectures like Mixture of Experts.</p> +Eric X. Liu's Personal Page/Recent content on Eric X. Liu's Personal PageHugoenWed, 20 Aug 2025 06:02:35 +0000A Technical Deep Dive into the Transformer's Core Mechanics/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/Tue, 19 Aug 2025 00:00:00 +0000/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/<p>The Transformer architecture is the bedrock of modern Large Language Models (LLMs). While its high-level success is widely known, a deeper understanding requires dissecting its core components. 
This article provides a detailed, technical breakdown of the fundamental concepts within a Transformer block, from the notion of &ldquo;channels&rdquo; to the intricate workings of the attention mechanism and its relationship with other advanced architectures like Mixture of Experts.</p> <h3 id="1-the-channel-a-foundational-view-of-d_model"> 1. The &ldquo;Channel&rdquo;: A Foundational View of <code>d_model</code> <a class="heading-link" href="#1-the-channel-a-foundational-view-of-d_model"> @@ -6,7 +6,7 @@ <span class="sr-only">Link to heading</span> </a> </h3> -<p>In deep learning, a &ldquo;channel&rdquo; can be thought of as a feature dimension. While this term is common in Convolutional Neural Networks for images (e.g., Red, Green, Blue channels), in LLMs, the analogous concept is the model&rsquo;s primary embedding dimension, commonly referred to as <code>d_model</code>.</p>A Comprehensive Guide to Breville Barista Pro Maintenance/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/Sat, 16 Aug 2025 00:00:00 +0000/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/<p>Proper maintenance is critical for the longevity and performance of a Breville Barista Pro espresso machine. Consistent cleaning not only ensures the machine functions correctly but also directly impacts the quality of the espresso produced. This guide provides a detailed, technical breakdown of the essential maintenance routines, from automated cycles to daily upkeep.</p> +<p>In deep learning, a &ldquo;channel&rdquo; can be thought of as a feature dimension. While this term is common in Convolutional Neural Networks for images (e.g., Red, Green, Blue channels), in LLMs, the analogous concept is the model&rsquo;s primary embedding dimension, commonly referred to as <code>d_model</code>.</p>Quantization in LLMs/posts/quantization-in-llms/Tue, 19 Aug 2025 00:00:00 +0000/posts/quantization-in-llms/<p>The burgeoning scale of Large Language Models (LLMs) has necessitated a paradigm shift in their deployment, moving beyond full-precision floating-point arithmetic towards lower-precision representations. Quantization, the process of mapping a wide range of continuous values to a smaller, discrete set, has emerged as a critical technique to reduce model size, accelerate inference, and lower energy consumption. This article provides a technical overview of quantization theories, their application in modern LLMs, and highlights the ongoing innovations in this domain.</p>A Comprehensive Guide to Breville Barista Pro Maintenance/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/Sat, 16 Aug 2025 00:00:00 +0000/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/<p>Proper maintenance is critical for the longevity and performance of a Breville Barista Pro espresso machine. Consistent cleaning not only ensures the machine functions correctly but also directly impacts the quality of the espresso produced. 
This guide provides a detailed, technical breakdown of the essential maintenance routines, from automated cycles to daily upkeep.</p> <h4 id="understanding-the-two-primary-maintenance-cycles"> <strong>Understanding the Two Primary Maintenance Cycles</strong> <a class="heading-link" href="#understanding-the-two-primary-maintenance-cycles"> diff --git a/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/index.html b/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/index.html index 9e4cf25..39671da 100644 --- a/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/index.html +++ b/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/index.html @@ -25,4 +25,4 @@ Understanding the Two Primary Maintenance Cycles Link to heading The Breville Ba 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/posts/a-deep-dive-into-ppo-for-language-models/index.html b/posts/a-deep-dive-into-ppo-for-language-models/index.html index a30e9dc..e31a38e 100644 --- a/posts/a-deep-dive-into-ppo-for-language-models/index.html +++ b/posts/a-deep-dive-into-ppo-for-language-models/index.html @@ -23,4 +23,4 @@ where δ_t = r_t + γV(s_{t+1}) - V(s_t)

\ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/posts/index.xml b/posts/index.xml index 4a81095..19e4d70 100644 --- a/posts/index.xml +++ b/posts/index.xml @@ -1,4 +1,4 @@ -Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenWed, 20 Aug 2025 04:48:53 +0000A Technical Deep Dive into the Transformer's Core Mechanics/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/Tue, 19 Aug 2025 00:00:00 +0000/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/<p>The Transformer architecture is the bedrock of modern Large Language Models (LLMs). While its high-level success is widely known, a deeper understanding requires dissecting its core components. This article provides a detailed, technical breakdown of the fundamental concepts within a Transformer block, from the notion of &ldquo;channels&rdquo; to the intricate workings of the attention mechanism and its relationship with other advanced architectures like Mixture of Experts.</p> +Posts on Eric X. Liu's Personal Page/posts/Recent content in Posts on Eric X. Liu's Personal PageHugoenWed, 20 Aug 2025 06:02:35 +0000A Technical Deep Dive into the Transformer's Core Mechanics/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/Tue, 19 Aug 2025 00:00:00 +0000/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/<p>The Transformer architecture is the bedrock of modern Large Language Models (LLMs). While its high-level success is widely known, a deeper understanding requires dissecting its core components. This article provides a detailed, technical breakdown of the fundamental concepts within a Transformer block, from the notion of &ldquo;channels&rdquo; to the intricate workings of the attention mechanism and its relationship with other advanced architectures like Mixture of Experts.</p> <h3 id="1-the-channel-a-foundational-view-of-d_model"> 1. The &ldquo;Channel&rdquo;: A Foundational View of <code>d_model</code> <a class="heading-link" href="#1-the-channel-a-foundational-view-of-d_model"> @@ -6,7 +6,7 @@ <span class="sr-only">Link to heading</span> </a> </h3> -<p>In deep learning, a &ldquo;channel&rdquo; can be thought of as a feature dimension. While this term is common in Convolutional Neural Networks for images (e.g., Red, Green, Blue channels), in LLMs, the analogous concept is the model&rsquo;s primary embedding dimension, commonly referred to as <code>d_model</code>.</p>A Comprehensive Guide to Breville Barista Pro Maintenance/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/Sat, 16 Aug 2025 00:00:00 +0000/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/<p>Proper maintenance is critical for the longevity and performance of a Breville Barista Pro espresso machine. Consistent cleaning not only ensures the machine functions correctly but also directly impacts the quality of the espresso produced. This guide provides a detailed, technical breakdown of the essential maintenance routines, from automated cycles to daily upkeep.</p> +<p>In deep learning, a &ldquo;channel&rdquo; can be thought of as a feature dimension. 
While this term is common in Convolutional Neural Networks for images (e.g., Red, Green, Blue channels), in LLMs, the analogous concept is the model&rsquo;s primary embedding dimension, commonly referred to as <code>d_model</code>.</p>Quantization in LLMs/posts/quantization-in-llms/Tue, 19 Aug 2025 00:00:00 +0000/posts/quantization-in-llms/<p>The burgeoning scale of Large Language Models (LLMs) has necessitated a paradigm shift in their deployment, moving beyond full-precision floating-point arithmetic towards lower-precision representations. Quantization, the process of mapping a wide range of continuous values to a smaller, discrete set, has emerged as a critical technique to reduce model size, accelerate inference, and lower energy consumption. This article provides a technical overview of quantization theories, their application in modern LLMs, and highlights the ongoing innovations in this domain.</p>A Comprehensive Guide to Breville Barista Pro Maintenance/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/Sat, 16 Aug 2025 00:00:00 +0000/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/<p>Proper maintenance is critical for the longevity and performance of a Breville Barista Pro espresso machine. Consistent cleaning not only ensures the machine functions correctly but also directly impacts the quality of the espresso produced. This guide provides a detailed, technical breakdown of the essential maintenance routines, from automated cycles to daily upkeep.</p> <h4 id="understanding-the-two-primary-maintenance-cycles"> <strong>Understanding the Two Primary Maintenance Cycles</strong> <a class="heading-link" href="#understanding-the-two-primary-maintenance-cycles"> diff --git a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html index 34c8bb3..b4e5a7e 100644 --- a/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html +++ b/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/index.html @@ -44,4 +44,4 @@ The Top-K routing mechanism, as illustrated in the provided ima 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/posts/page/2/index.html b/posts/page/2/index.html new file mode 100644 index 0000000..73c9206 --- /dev/null +++ b/posts/page/2/index.html @@ -0,0 +1,8 @@ +Posts · Eric X. Liu's Personal Page

Posts

\ No newline at end of file diff --git a/posts/quantization-in-llms/index.html b/posts/quantization-in-llms/index.html new file mode 100644 index 0000000..6dd8bdb --- /dev/null +++ b/posts/quantization-in-llms/index.html @@ -0,0 +1,10 @@ +Quantization in LLMs · Eric X. Liu's Personal Page

Quantization in LLMs

The burgeoning scale of Large Language Models (LLMs) has necessitated a paradigm shift in their deployment, moving beyond full-precision floating-point arithmetic towards lower-precision representations. Quantization, the process of mapping a wide range of continuous values to a smaller, discrete set, has emerged as a critical technique to reduce model size, accelerate inference, and lower energy consumption. This article provides a technical overview of quantization theories, their application in modern LLMs, and highlights the ongoing innovations in this domain.

The Fundamentals of Quantization

At its core, quantization seeks to represent model weights and activations using fewer bits. Three primary approaches form the theoretical foundation:

  1. K-Means-based Quantization (Non-uniform): This method clusters floating-point weights into a predefined number of groups. Each weight is then replaced by the centroid of its assigned cluster. While effective for storage compression by storing a small “codebook” of centroids and integer indices, its direct computational benefits during inference are limited unless specialized hardware for lookup tables is employed.

  2. Linear (Affine) Quantization: The most prevalent form, linear quantization maps a floating-point range to a fixed integer range using a simple linear transformation: r = S * (q - Z). Here, r is the real value, q is the quantized integer, S is the scale factor, and Z is the zero-point (offset). This approach directly enables integer arithmetic, which is significantly faster and more energy-efficient on modern hardware. (A minimal code sketch of this mapping follows this list.)

  3. Binary and Ternary Quantization (Extreme Low-Bit): These push quantization to its limits by constraining weights and/or activations to only two (e.g., +1, -1) or three (e.g., +1, 0, -1) values. While offering maximal compression and enabling bitwise operations instead of multiplications, they often incur substantial accuracy degradation for complex LLMs. For instance, BinaryConnect enabled training deep neural networks with binary weights, showing near state-of-the-art results on image classification tasks. XNOR-Net further extended this by binarizing both weights and inputs, achieving significant speedups and memory savings. Ternary Weight Networks (TWNs) and Trained Ternary Quantization (TTQ) improve upon binary methods by introducing a zero value or learnable scaling factors, respectively, mitigating some accuracy loss.
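
To make the affine mapping in item 2 concrete, here is a minimal NumPy sketch (the function names and the unsigned 8-bit range are illustrative assumptions, not from the text) that derives a scale and zero-point from a tensor's observed min/max, quantizes, and dequantizes back via r = S * (q - Z):

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Asymmetric (affine) quantization: r is approximated by S * (q - Z)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    r_min, r_max = float(x.min()), float(x.max())
    scale = max(r_max - r_min, 1e-8) / (qmax - qmin)   # guard against a degenerate range
    zero_point = int(np.clip(round(qmin - r_min / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4, 8).astype(np.float32)
q, S, Z = affine_quantize(w)
w_hat = affine_dequantize(q, S, Z)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```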

Quantization Strategies: Bridging Accuracy and Efficiency

The practical application of quantization involves distinct strategies:

  1. Post-Training Quantization (PTQ): This approach applies quantization to an already trained, full-precision model without any further training or fine-tuning.

    • Quantization Granularity: The precision of quantization can vary across a model.
      • Per-Tensor Quantization applies a single scale and zero-point to an entire tensor.
      • Per-Channel Quantization assigns unique scale and zero-point parameters to each output channel of a layer, crucial for handling diverse value distributions.
      • Group Quantization provides an intermediate granularity, where scales and zero-points are applied to smaller groups of weights within a channel or layer. This balances fine-grained control with hardware efficiency.
    • Dynamic Range Clipping (Calibration): A critical aspect of PTQ is determining the optimal range (r_min, r_max) for quantization, especially for activations, which often exhibit outliers. Methods include:
      • Min-Max: Simply using the observed minimum and maximum values.
      • Exponential Moving Averages (EMA): Tracking ranges using a smoothed average during a calibration run.
      • Kullback-Leibler (KL) Divergence Minimization: Selecting clipping thresholds that minimize the information loss between the original and quantized distributions.
      • Mean Square Error (MSE) Minimization: Optimizing scale and zero-point parameters to minimize the reconstruction error. Adaptive rounding techniques, such as AdaRound, further refine this by optimizing rounding decisions for individual weights.
  2. Quantization-Aware Training (QAT): This method integrates the quantization process directly into the training or fine-tuning loop. By simulating the effects of low-precision arithmetic during training, the model learns to be robust to quantization noise. The Straight-Through Estimator (STE) is commonly used to approximate gradients for the non-differentiable quantization operations, enabling backpropagation. QAT generally yields higher accuracy than PTQ, particularly for aggressive low-bit quantization. (A short sketch of EMA-based range calibration and STE fake quantization follows this list.)
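
The calibration and QAT ideas above can be combined in one small sketch. The PyTorch-style module below is illustrative only (the 8-bit range and EMA momentum of 0.9 are assumed defaults, not values from the text): it tracks an EMA-smoothed min/max range during calibration and applies fake quantization with a straight-through estimator so that gradients pass through the rounding step.

```python
import torch

class FakeQuantEMA(torch.nn.Module):
    """Simulated (fake) quantization with EMA range tracking and an STE backward."""
    def __init__(self, num_bits=8, momentum=0.9):
        super().__init__()
        self.qmin, self.qmax = 0, 2 ** num_bits - 1
        self.momentum = momentum
        self.register_buffer("r_min", torch.tensor(0.0))
        self.register_buffer("r_max", torch.tensor(0.0))

    def forward(self, x):
        if self.training:  # calibration: smooth the observed range with an EMA
            self.r_min = self.momentum * self.r_min + (1 - self.momentum) * x.min().detach()
            self.r_max = self.momentum * self.r_max + (1 - self.momentum) * x.max().detach()
        scale = (self.r_max - self.r_min).clamp(min=1e-8) / (self.qmax - self.qmin)
        zero_point = torch.clamp((self.qmin - self.r_min / scale).round(), self.qmin, self.qmax)
        q = torch.clamp((x / scale).round() + zero_point, self.qmin, self.qmax)
        x_hat = (q - zero_point) * scale
        # Straight-through estimator: forward uses x_hat, backward treats it as identity in x.
        return x + (x_hat - x).detach()

fq = FakeQuantEMA()
fq.train()
x = torch.randn(16, requires_grad=True)
loss = fq(x).pow(2).sum()
loss.backward()            # gradients reach x despite the non-differentiable round()
print(x.grad is not None)  # True
```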

Emerging Techniques for Modern LLMs

The scale and complexity of LLMs necessitate advanced quantization strategies:

  1. One-Shot Post-Training Quantization (e.g., GPTQ, AWQ): These techniques aim to achieve near-QAT accuracy with PTQ’s convenience, requiring only a small, unlabelled calibration dataset and no full retraining. GPTQ quantizes weights layer-by-layer by minimizing output MSE, leveraging Hessian-aware information. AWQ identifies and scales “important” weights based on activation magnitudes before quantization. These methods have been instrumental in enabling 4-bit LLM inference on consumer-grade hardware.

  2. Sparsity-Quantization Hybrid (e.g., SpQR): These approaches combine model pruning (removing redundant connections) with quantization to achieve even greater compression. SpQR prunes weights and then quantizes the remaining non-zero weights, often with special handling for critical outlier weights.

  3. Quantization for Efficient Fine-tuning (e.g., QLoRA): QLoRA quantizes the base LLM weights (e.g., to 4-bit) and freezes them, then fine-tunes only small, low-rank adapter modules in full precision. This drastically reduces the memory requirements for fine-tuning large models on limited hardware. (A toy sketch of this adapter arrangement follows this list.)

  4. Hardware-Optimized Quantization Formats: Beyond bit-width, specialized floating-point formats and efficient kernels are being developed. MXFP4 (Microscaling FP4), NVIDIA’s FP8 (E4M3/E5M2), and GGUF’s K-quants are examples of block-wise floating-point formats and hierarchical quantization schemes optimized for high performance on modern accelerators like NVIDIA’s Blackwell GPUs. These formats offer superior dynamic range compared to fixed-point integers at very low bit-widths.
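
To illustrate the adapter idea in item 3, the toy sketch below pairs a frozen, coarsely quantized base matrix with two small trainable low-rank matrices. The rank, scaling factor, and the crude symmetric 4-bit quantizer are illustrative assumptions; the actual QLoRA recipe additionally uses the NF4 data type and double quantization.

```python
import torch

d_in, d_out, rank = 1024, 1024, 8

# Frozen base weight, stored coarsely (toy symmetric 4-bit quantizer, dequantized for the matmul).
W = torch.randn(d_in, d_out)
scale = W.abs().max() / 7
W_q = torch.clamp((W / scale).round(), -8, 7) * scale
W_q.requires_grad_(False)

# Trainable low-rank adapters: the effective weight is W_q + (alpha / rank) * A @ B.
A = torch.nn.Parameter(torch.randn(d_in, rank) * 0.01)
B = torch.nn.Parameter(torch.zeros(rank, d_out))
alpha = 16.0

def forward(x):
    return x @ W_q + (alpha / rank) * (x @ A @ B)

x = torch.randn(4, d_in)
loss = forward(x).pow(2).mean()
loss.backward()
print("trainable params:", A.numel() + B.numel(), "vs frozen:", W_q.numel())
print(A.grad is not None, B.grad is not None, W_q.grad is None)  # only the adapters get gradients
```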

Multi-Level Scaling in Group Quantization: A Deeper Dive

Modern group quantization approaches often employ multi-level scaling to achieve an optimal balance between precision and compression. Consider a generalized formula for reconstructing a real value r from a quantized value q:

r = (q - z) * s_l0 * s_l1 * ...

where z is the zero-point (often 0 for symmetric quantization), and s_l0, s_l1 are scale factors at different hierarchical levels. The “Effective Bit Width” reflects the average number of bits per weight after accounting for both the quantized value and its associated scales.
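
A minimal sketch of how such a two-level scheme reconstructs values (the group sizes, scales, and zero-point below are illustrative, loosely following a VSQ-style layout): each small group of quantized codes carries a compact level-0 scale, while a whole channel-sized vector shares a higher-precision level-1 scale.

```python
import numpy as np

def dequantize_two_level(q, s_l0, s_l1, l0_group=16, z=0):
    """r = (q - z) * s_l0 * s_l1, with s_l0 per group of `l0_group` codes
    and s_l1 shared by the whole (channel-sized) vector."""
    q = q.reshape(-1, l0_group).astype(np.float32)
    return ((q - z) * s_l0[:, None] * s_l1).reshape(-1)

# Toy channel of 64 weights -> 4 groups of 16.
q    = np.random.randint(0, 16, size=64)      # 4-bit unsigned codes
s_l0 = np.array([1.0, 2.0, 4.0, 8.0])         # per-group level-0 scales
s_l1 = 0.01                                   # per-channel level-1 scale
r = dequantize_two_level(q, s_l0, s_l1, z=8)  # symmetric around code 8
print(r.shape, r.min(), r.max())
```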

Let’s dissect a representative table of such schemes:

Quantization Approach | Data Type (q) | L0 Group Size | L0 Scale Data Type | L1 Group Size | L1 Scale Data Type | Effective Bit Width
--------------------- | ------------- | ------------- | ------------------ | ------------- | ------------------ | -------------------
Per-Channel Quant     | INT4          | Per Channel   | FP16               | -             | -                  | 4
VSQ                   | INT4          | 16            | UINT4              | Per Channel   | FP16               | 4 + 4/16 = 4.25
MX4                   | S1M2          | 2             | E1M0               | 16            | E8M0               | 3 + 1/2 + 8/16 = 4
MX6                   | S1M4          | 2             | E1M0               | 16            | E8M0               | 5 + 1/2 + 8/16 = 6
MX9                   | S1M7          | 2             | E1M0               | 16            | E8M0               | 8 + 1/2 + 8/16 = 9
  • Data Types Explanation:

    • INT4: Standard 4-bit integer.
    • UINT4: 4-bit unsigned integer.
    • FP16: 16-bit floating-point number.
    • S1M2: A custom 3-bit floating-point-like format (1 sign bit, 2 mantissa bits), with its exponent effectively derived from shared scales.
    • S1M4: A custom 5-bit format (1 sign bit, 4 mantissa bits).
    • S1M7: A custom 8-bit format (1 sign bit, 7 mantissa bits).
    • E1M0: A custom 1-bit exponent-only floating-point scale (1 exponent bit, 0 mantissa bits).
    • E8M0: A custom 8-bit exponent-only floating-point scale (8 exponent bits, 0 mantissa bits).
  • Row-by-Row Analysis:

    1. Per-Channel Quant: This represents a baseline. Each individual value (q) is stored as a 4-bit integer. A single 16-bit FP16 scale (s_l0) is applied per channel. Since a channel contains many weights, the overhead of the 16-bit scale is amortized, making the effective bit width approximately 4 bits per weight.
    2. VSQ (Per-Vector Scaled Quantization): This scheme introduces a two-level scaling hierarchy. The core quantized value (q) is a 4-bit integer. A finer-grained 4-bit unsigned integer scale (s_l0 in UINT4) is applied to groups of 16 quantized values. A coarser 16-bit FP16 scale (s_l1) is applied per channel. The effective bit width is calculated as: (4 bits for q) + (4 bits for s_l0 / 16 elements) = 4 + 0.25 = 4.25 bits/weight. The FP16 s_l1 scale overhead per channel is negligible, hence not included in the fraction.
    3. MX4 (Mixed-Precision with Microexponents, 4-bit effective): This is a key example of specialized floating-point quantization. The base quantized value (q) uses a compact 3-bit S1M2 format. A 1-bit E1M0 scale (s_l0) is applied to very small groups of 2 q values. A coarser 8-bit E8M0 scale (s_l1) is applied to groups of 16 q values. The effective bit width is: (3 bits for q) + (1 bit for s_l0 / 2 elements) + (8 bits for s_l1 / 16 elements) = 3 + 0.5 + 0.5 = 4 bits/weight. This allows for a wider dynamic range, typical of floating-point numbers, while maintaining a very low average bit-width.
    4. MX6: Similar to MX4, but uses a 5-bit S1M4 format for q. The effective bit width becomes: 5 + 0.5 + 0.5 = 6 bits/weight, offering higher precision at the cost of a slight increase in size.
    5. MX9: Uses an 8-bit S1M7 format for q. The effective bit width is: 8 + 0.5 + 0.5 = 9 bits/weight, providing near-INT8 precision while retaining the floating-point-like dynamic range benefits.
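
The "Effective Bit Width" column is simply an amortization calculation, which the short helper below reproduces (a sketch; the arguments mirror the group sizes and scale widths listed in the table):

```python
def effective_bits(q_bits, l0_scale_bits=0, l0_group=1, l1_scale_bits=0, l1_group=1):
    """Average storage per weight: the code itself plus amortized scale overhead."""
    return q_bits + l0_scale_bits / l0_group + l1_scale_bits / l1_group

print(effective_bits(4))                                # Per-Channel INT4 -> 4.0 (channel scale amortized away)
print(effective_bits(4, l0_scale_bits=4, l0_group=16))  # VSQ -> 4.25
print(effective_bits(3, 1, 2, 8, 16))                   # MX4 -> 4.0
print(effective_bits(5, 1, 2, 8, 16))                   # MX6 -> 6.0
print(effective_bits(8, 1, 2, 8, 16))                   # MX9 -> 9.0
```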

These multi-level, mixed-precision, floating-point quantization schemes represent a significant advancement, enabling LLMs to run efficiently on diverse hardware while maintaining high accuracy, especially for managing the ubiquitous outlier values in LLM activations and weights.

Current Trends and Future Outlook

The field of LLM quantization is characterized by rapid innovation.

  • Linear (Affine) Quantization remains the foundational principle, with most advancements focusing on refining its application.
  • Per-channel and especially Group/Block-wise Quantization are indispensable for LLMs due to their heterogeneous weight distributions.
  • Post-Training Quantization (PTQ), particularly advanced one-shot methods like GPTQ and AWQ, is highly relevant for efficient deployment of LLMs without the extensive resources required for QAT.
  • Quantization-Aware Training (QAT) is the benchmark for achieving peak accuracy at very low bit-widths, particularly when PTQ falls short.
  • Mixed-Precision Quantization is crucial for balancing accuracy and efficiency across the massive, varying layers of LLMs.
  • Hardware-optimized quantization formats (like MXFP4, FP8) represent a significant step towards co-designing models and silicon for maximum performance.

Conversely, methods like pure K-means quantization (where computation requires fetching float centroids) and general-purpose pure binary/ternary quantization are less commonly adopted as primary strategies for high-accuracy LLM inference, primarily due to the greater accuracy challenges and lack of widespread hardware acceleration for these specific paradigms compared to optimized integer or block-floating-point operations. The trajectory indicates a continuous push for lower effective bit-widths, driven by clever scaling strategies, specialized data formats, and a hardware-aware approach to model optimization.


References

Courbariaux, M., Bengio, Y., & David, J. P. (2015). BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. NeurIPS Proceedings.

Dai, S., Venkatesan, R., Ren, H., Zimmer, B., Dally, W. J., & Khailany, B. (2021). VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference. arXiv preprint arXiv:2102.04503.

Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. European Conference on Computer Vision (ECCV).

Zhu, C., Han, S., Mao, H., & Dally, W. J. (2017). Trained Ternary Quantization. International Conference on Learning Representations (ICLR).

Migacz, S. (2017). 8-bit Inference with TensorRT. NVIDIA GTC Presentation.

Krishnamoorthi, R. (2018). Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper. arXiv preprint arXiv:1806.08342.

Li, F., Liu, B., Wang, X., Zhang, B., & Yan, J. (2016). Ternary Weight Networks. arXiv preprint arXiv:1605.04711.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Nagel, M., van Baalen, T., Blankevoort, T., & Louizos, C. (2019). Data-Free Quantization Through Weight Equalization and Bias Correction. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

Han, S., Mao, H

\ No newline at end of file diff --git a/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html b/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html index c775455..22090fe 100644 --- a/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html +++ b/posts/secure-boot-dkms-and-mok-on-proxmox-debian/index.html @@ -59,4 +59,4 @@ nvidia-smi failed to communicate with the NVIDIA driver modprobe nvidia → “K 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/posts/supabase-deep-dive/index.html b/posts/supabase-deep-dive/index.html index 8b0b3f9..4d3ab8a 100644 --- a/posts/supabase-deep-dive/index.html +++ b/posts/supabase-deep-dive/index.html @@ -90,4 +90,4 @@ Supabase enters this space with a radically different philosophy: transparency. 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html index 389c5a2..f10df58 100644 --- a/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html +++ b/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/index.html @@ -30,4 +30,4 @@ But to truly understand the field, we must look at the pivotal models that explo 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/posts/useful/index.html b/posts/useful/index.html index b13824b..2d4da63 100644 --- a/posts/useful/index.html +++ b/posts/useful/index.html @@ -9,4 +9,4 @@ One-minute read
  • [ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 34072e7..af9ccf6 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/2025-08-20T04:48:53+00:00weekly0.5/2025-08-20T04:48:53+00:00weekly0.5/posts/2025-08-20T04:48:53+00:00weekly0.5/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/2025-08-20T04:48:53+00:00weekly0.5/posts/secure-boot-dkms-and-mok-on-proxmox-debian/2025-08-14T06:50:22+00:00weekly0.5/posts/how-rvq-teaches-llms-to-see-and-hear/2025-08-08T17:36:52+00:00weekly0.5/posts/supabase-deep-dive/2025-08-04T03:59:37+00:00weekly0.5/posts/a-deep-dive-into-ppo-for-language-models/2025-08-16T21:13:18+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T06:02:48+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/useful/2025-08-03T08:37:28-07:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file +/posts/a-technical-deep-dive-into-the-transformer-s-core-mechanics/2025-08-20T04:48:53+00:00weekly0.5/2025-08-20T06:02:35+00:00weekly0.5/posts/2025-08-20T06:02:35+00:00weekly0.5/posts/quantization-in-llms/2025-08-20T06:02:35+00:00weekly0.5/posts/a-comprehensive-guide-to-breville-barista-pro-maintenance/2025-08-20T04:48:53+00:00weekly0.5/posts/secure-boot-dkms-and-mok-on-proxmox-debian/2025-08-14T06:50:22+00:00weekly0.5/posts/how-rvq-teaches-llms-to-see-and-hear/2025-08-08T17:36:52+00:00weekly0.5/posts/supabase-deep-dive/2025-08-04T03:59:37+00:00weekly0.5/posts/a-deep-dive-into-ppo-for-language-models/2025-08-16T21:13:18+00:00weekly0.5/posts/mixture-of-experts-moe-models-challenges-solutions-in-practice/2025-08-03T06:02:48+00:00weekly0.5/posts/t5-the-transformer-that-zigged-when-others-zagged-an-architectural-deep-dive/2025-08-03T03:41:10+00:00weekly0.5/posts/espresso-theory-application-a-guide-for-the-breville-barista-pro/2025-08-03T04:20:20+00:00weekly0.5/posts/useful/2025-08-03T08:37:28-07:00weekly0.5/about/2020-06-16T23:30:17-07:00weekly0.5/categories/weekly0.5/tags/weekly0.5 \ No newline at end of file diff --git a/tags/index.html b/tags/index.html index 0ede996..10de90e 100644 --- a/tags/index.html +++ b/tags/index.html @@ -4,4 +4,4 @@ 2016 - 2025 Eric X. Liu -[ed94cec] \ No newline at end of file +[ba596e7] \ No newline at end of file