
Quantization Operators

Quantization is a model-optimization technique that reduces the size of a large model, improving storage and memory-bandwidth performance at the cost of a small loss in accuracy.

CUDA Operators

at::Tensor _float_to_bfloat16_gpu(const at::Tensor &input)

Converts a tensor of float values into a tensor of Brain Floating Point (bfloat16) values.

Parameters:

input – A tensor of float values

Returns:

A new tensor with values from the input tensor converted to bfloat16.

at::Tensor _bfloat16_to_float_gpu(const at::Tensor &input)

Converts a tensor of Brain Floating Point (bfloat16) values into a tensor of float values.

Parameters:

input – A tensor of bfloat16 values

Returns:

A new tensor with values from the input tensor converted to float.
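Both conversions amount to bit manipulation on the IEEE float32 representation: bfloat16 keeps the sign and 8 exponent bits and truncates the mantissa from 23 bits to 7. A pure-Python sketch of a single value (the round-to-nearest-even rounding below is an assumption for illustration; it demonstrates the format, not the CUDA kernel):

```python
import struct

def float_to_bfloat16_bits(x: float) -> int:
    # Reinterpret the float32 as a 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep the upper 16 bits (sign, 8 exponent bits, 7 mantissa bits),
    # rounding to nearest-even rather than truncating.
    lsb = (bits >> 16) & 1
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF

def bfloat16_bits_to_float(b: int) -> float:
    # Expansion is exact: place the 16 bits in the upper half and
    # zero-fill the discarded mantissa bits.
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x
```

Values expressible in 7 mantissa bits (such as 1.5) survive the round trip exactly; everything else is rounded to the nearest bfloat16.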

Tensor _float_to_FP8rowwise_gpu(const Tensor &input, const bool forward)

Converts a tensor of float values into a tensor of fp8 values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

Throws:

c10::Error – if input.dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to fp8.

at::Tensor _FP8rowwise_to_float_gpu(const at::Tensor &input, bool forward, const int64_t output_dtype)

Converts a tensor of fp8 values into a tensor of float values.

Parameters:
  • input – A tensor of fp8 values

  • forward

  • output_dtype – The target floating-point type, specified as the integer representation of the SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float (with dtype of either SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Tensor _float_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of float values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of float values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

Tensor _half_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of at::Half values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of at::Half values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

Tensor _single_or_half_precision_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of single-precision (float) or at::Half values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of single-precision (float) or at::Half values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.
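The fused 8-bit rowwise format quantizes each row against its own range: a per-row scale and bias are derived from the row minimum and maximum, values are mapped to uint8, and the scale and bias are stored alongside the quantized payload so each row is self-describing. A hedged pure-Python sketch of one row (the trailing float32 scale/bias layout is an assumption for illustration; the kernel's exact layout may differ):

```python
import struct

def quantize_row_8bit(row):
    # Per-row affine quantization: map [min, max] onto [0, 255].
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    qs = bytes(min(255, int(round((v - lo) / scale))) for v in row)
    # Append the float32 scale and bias after the uint8 payload.
    return qs + struct.pack("<ff", scale, lo)
```

For a row of N floats the output is N + 8 bytes, which is where the roughly 4x storage saving over float32 comes from.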

at::Tensor _fused8bitrowwise_to_float_gpu(const at::Tensor &input)

Converts a tensor of fused 8-bit rowwise values into a tensor of float values.

Parameters:

input – A tensor of fused 8-bit rowwise values

Returns:

A new tensor with values from the input tensor converted to float.

at::Tensor _fused8bitrowwise_to_half_gpu(const at::Tensor &input)

Converts a tensor of fused 8-bit rowwise values into a tensor of at::Half values.

Parameters:

input – A tensor of fused 8-bit rowwise values

Returns:

A new tensor with values from the input tensor converted to at::Half.

at::Tensor _fused8bitrowwise_to_single_or_half_precision_gpu(const at::Tensor &input, const int64_t output_dtype, const bool scale_bias_last, const bool quant_padding_float_type)

Converts a tensor of fused 8-bit rowwise values into a tensor of float, at::Half, or at::BFloat16 values.

Parameters:
  • input – A tensor of fused 8-bit rowwise values

  • output_dtype – The target floating-point type, specified as the integer representation of the SparseType enum

  • scale_bias_last

  • quant_padding_float_type

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float, at::Half, or at::BFloat16.
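Dequantization inverts the fused layout: strip the per-row scale and bias, then apply q * scale + bias to each stored byte. A sketch for a single row, assuming the scale and bias are two trailing float32 values (an illustrative layout, not necessarily the kernel's exact one):

```python
import struct

def dequantize_row_8bit(row_bytes):
    # The last 8 bytes hold the float32 scale and bias; the rest is
    # the uint8 payload.
    scale, bias = struct.unpack("<ff", row_bytes[-8:])
    return [q * scale + bias for q in row_bytes[:-8]]
```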

at::Tensor _fused8bitrowwise_to_float_mixed_dim_gpu(const at::Tensor &input, const at::Tensor &D_offsets, const int64_t output_dtype)

Converts a tensor of fused 8-bit rowwise values into a tensor of at::kFloat or at::kHalf values.

Parameters:
  • input – A tensor of fused 8-bit rowwise values

  • D_offsets

  • output_dtype – The target floating-point type, specified as the integer representation of the SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16)

Returns:

A new tensor with values from the input tensor converted to at::kFloat or at::kHalf.

Tensor _float_to_fusednbitrowwise_gpu(const Tensor &input, const int64_t bit_rate)

Converts a tensor of float values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of float values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.

at::Tensor _half_to_fusednbitrowwise_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of at::Half values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of at::Half values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.

Tensor _single_or_half_precision_to_fusednbitrowwise_gpu(const Tensor &input, const int64_t bit_rate)

Converts a tensor of float or at::Half values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of float or at::Half values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.
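The fused N-bit rowwise variants additionally pack several codes per byte: with bit_rate = 4 two values share a byte, with bit_rate = 2 four do. A pure-Python sketch of one row (scale and bias are appended as float32 here for simplicity; this is an assumption, as FBGEMM-style kernels typically store them in half precision):

```python
import struct

def quantize_row_nbit(row, bit_rate=4):
    # bit_rate must divide 8; 4 and 2 are the supported rates.
    levels = (1 << bit_rate) - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, int(round((v - lo) / scale))) for v in row]
    per_byte = 8 // bit_rate
    packed = bytearray()
    for i in range(0, len(codes), per_byte):
        b = 0
        for j, c in enumerate(codes[i:i + per_byte]):
            b |= c << (j * bit_rate)  # low bits hold the earlier element
        packed.append(b)
    return bytes(packed) + struct.pack("<ff", scale, lo)
```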

at::Tensor _fusednbitrowwise_to_float_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of fused N-bit rowwise values into a tensor of float values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to float.

at::Tensor _fusednbitrowwise_to_half_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of fused N-bit rowwise values into a tensor of at::Half values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to at::Half.

at::Tensor _fusednbitrowwise_to_single_or_half_precision_gpu(const at::Tensor &input, const int64_t bit_rate, const int64_t output_dtype)

Converts a tensor of fused N-bit rowwise values into a tensor of float, at::Half, or at::BFloat16 values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

  • output_dtype – The target floating-point type, specified as the integer representation of the SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32 or SparseType::FP16 or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float, at::Half, or at::BFloat16, depending on output_dtype.

at::Tensor _float_to_hfp8_gpu(const at::Tensor &input, const int64_t ebits, const int64_t exponent_bias, const double max_pos)

Converts a tensor of float values into a tensor of Hybrid 8-bit Floating Point (hfp8) values.

Parameters:
  • input – A tensor of float values

  • ebits

  • exponent_bias

  • max_pos

Throws:

c10::Error – if ebits <= 0 or exponent_bias <= 0.

Returns:

A new tensor with values from the input tensor converted to hfp8.

at::Tensor _hfp8_to_float_gpu(const at::Tensor &input, const int64_t ebits, const int64_t exponent_bias)

Converts a tensor of Hybrid 8-bit Floating Point (hfp8) values into a tensor of float values.

Parameters:
  • input – A tensor of hfp8 values

  • ebits

  • exponent_bias

Throws:

c10::Error – if ebits <= 0 or exponent_bias <= 0.

Returns:

A new tensor with values from the input tensor converted to float.
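An hfp8 value is an 8-bit minifloat: one sign bit, ebits exponent bits with the given exponent_bias, and 7 - ebits mantissa bits, with magnitudes saturated at max_pos. The following pure-Python encoder/decoder is an illustrative sketch only, not the FBGEMM kernel: it flushes underflow to zero (no denormals) and rounds to nearest, both of which are simplifying assumptions.

```python
import math

def float_to_minifloat(x, ebits=4, exponent_bias=7, max_pos=None):
    mbits = 7 - ebits
    if max_pos is None:
        # Largest representable magnitude for these parameters.
        max_pos = (2 - 2.0 ** -mbits) * 2.0 ** ((1 << ebits) - 1 - exponent_bias)
    sign = 0x80 if x < 0 else 0
    x = min(abs(x), max_pos)          # saturate at max_pos
    if x == 0:
        return sign
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= m < 1
    exp = e - 1 + exponent_bias
    frac = int(round((m * 2 - 1) * (1 << mbits)))
    if frac == 1 << mbits:            # mantissa rounded up to 2.0
        frac, exp = 0, exp + 1
    if exp <= 0:                      # flush underflow to zero (no denormals)
        return sign
    return sign | (exp << mbits) | frac

def minifloat_to_float(b, ebits=4, exponent_bias=7):
    mbits = 7 - ebits
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> mbits) & ((1 << ebits) - 1)
    frac = b & ((1 << mbits) - 1)
    if exp == 0:
        return sign * 0.0
    return sign * (1 + frac / (1 << mbits)) * 2.0 ** (exp - exponent_bias)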

at::Tensor _float_to_msfp_gpu(const at::Tensor &input, const int64_t bounding_box_size, const int64_t ebits, const int64_t mbits, const int64_t bias, const double min_pos, const double max_pos)

Converts a tensor of float values into a tensor of Microsoft Floating Point (msfp) values.

Parameters:
  • input – A tensor of float values

  • bounding_box_size

  • ebits

  • mbits

  • bias

  • min_pos

  • max_pos

Returns:

A new tensor with values from the input tensor converted to msfp.

at::Tensor _msfp_to_float_gpu(const at::Tensor &input, const int64_t ebits, const int64_t mbits, const int64_t bias)

Converts a tensor of Microsoft Floating Point (msfp) values into a tensor of float values.

Parameters:
  • input – A tensor of msfp values

  • ebits

  • mbits

  • bias

Returns:

A new tensor with values from the input tensor converted to float.

Tensor _float_to_paddedFP8rowwise_gpu(const Tensor &input, const bool forward, const int64_t row_dim)

Converts a tensor of float values into a tensor of padded fp8 rowwise values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

  • row_dim

Returns:

A new tensor with values from the input tensor converted to padded fp8 rowwise.

at::Tensor _paddedFP8rowwise_to_float_gpu(const at::Tensor &input, const bool forward, const int64_t row_dim, const int64_t output_last_dim, const int64_t output_dtype)

Converts a tensor of padded fp8 rowwise values into a tensor of float values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

  • row_dim

  • output_last_dim

  • output_dtype – The target floating-point type, specified as the integer representation of the SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float.

CPU Operators

Tensor &_fused8bitrowwise_to_float_cpu_out(Tensor &output, const Tensor &input)
Tensor &_float_to_fused8bitrowwise_cpu_out(Tensor &output, const Tensor &input)
Tensor float_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor half_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor float_or_half_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_float_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_half_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_float_or_half_cpu(const Tensor &input, const int64_t output_dtype, const bool scale_bias_last, const bool quant_padding_float_type)
Tensor float_to_FP8rowwise_cpu(const Tensor &input, bool forward)
Tensor FP8rowwise_to_float_cpu(const Tensor &input, bool forward, const int64_t output_dtype)
Tensor fusednbitrowwise_to_float_cpu(const Tensor &input, const int64_t bit_rate)
Tensor fusednbitrowwise_sbfront_to_float_cpu(const Tensor &input, const int64_t bit_rate)

Dequantize int4/int2 rows, with the scale and bias stored in the front, into float32 values.

The input tensor should have torch.quint4x2 or torch.quint2x4 dtype and the QuantizedCPU backend. This operator is recommended only for testing purposes because its kernel is a reference implementation and is not optimized.

Parameters:
  • input – Tensor of int4/int2 rows with scale and bias stored in the front.

  • bit_rate – Bit rate of each element. Should be 4 or 2.

Returns:

Tensor of float32, holding dequantized numbers.
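With the scale and bias stored in the front, each row begins with its scale/bias header, followed by the packed int4/int2 codes (low bits first). A sketch for one row, assuming a two-float32 header for illustration (an assumption; FBGEMM-style kernels typically store the header in half precision):

```python
import struct

def dequantize_row_sbfront(row_bytes, bit_rate=4):
    # Scale and bias lead the row; packed codes follow, low bits first.
    scale, bias = struct.unpack("<ff", row_bytes[:8])
    per_byte = 8 // bit_rate
    mask = (1 << bit_rate) - 1
    out = []
    for byte in row_bytes[8:]:
        for j in range(per_byte):
            out.append(((byte >> (j * bit_rate)) & mask) * scale + bias)
    return out
```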

Tensor fusednbitrowwise_to_half_cpu(const Tensor &input, const int64_t bit_rate)
Tensor fusednbitrowwise_to_float_or_half_cpu(const Tensor &input, const int64_t bit_rate, const int64_t output_dtype)
void FloatToFP8Quantized_ref(const float *const input, const size_t nrows, const size_t ncols, uint8_t *const output, const int ebits, const int exponent_bias, const double max_pos)
void FP8QuantizedToFloat_ref(const uint8_t *const input, const size_t nrows, const size_t ncols, float *const output, const int ebits, const int exponent_bias)
