module ActiveSupport::Multibyte::Unicode

Constants

NORMALIZATION_FORMS

A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization.

UNICODE_VERSION

The Unicode version that is supported by the implementation.

Attributes

default_normalization_form[RW]

The default normalization used for operations that require normalization. It can be set to any of the normalizations in NORMALIZATION_FORMS.

ActiveSupport::Multibyte::Unicode.default_normalization_form = :c

Public Instance Methods

compose(codepoints)
# File activesupport/lib/active_support/multibyte/unicode.rb, line 67
def compose(codepoints)
  codepoints.pack("U*").unicode_normalize(:nfc).codepoints
end

Compose decomposed characters to the composed form.
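
As a rough illustration of the composition step, the sketch below repeats the same pack / unicode_normalize(:nfc) / codepoints chain in plain Ruby, without loading ActiveSupport. The method name compose_codepoints is hypothetical and exists only for this example.

```ruby
# Hypothetical stand-in for Unicode.compose, using only core Ruby.
def compose_codepoints(codepoints)
  # Build a UTF-8 string from the codepoints, normalize to NFC,
  # then turn the result back into an array of codepoints.
  codepoints.pack("U*").unicode_normalize(:nfc).codepoints
end

# 101 is "e", 769 is U+0301 COMBINING ACUTE ACCENT.
puts compose_codepoints([101, 769]).inspect # => [233] (U+00E9, precomposed e-acute)
```

Codepoints that have no precomposed equivalent are returned unchanged.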

decompose(type, codepoints)
# File activesupport/lib/active_support/multibyte/unicode.rb, line 58
def decompose(type, codepoints)
  if type == :compatibility
    codepoints.pack("U*").unicode_normalize(:nfkd).codepoints
  else
    codepoints.pack("U*").unicode_normalize(:nfd).codepoints
  end
end

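Decompose composed characters to the decomposed form. Passing :compatibility as type applies compatibility decomposition (NFKD); any other value applies canonical decomposition (NFD).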
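
The difference between the two decomposition types can be seen with a ligature. The following is a minimal sketch in plain Ruby; decompose_codepoints is a hypothetical name, and :canonical is just an arbitrary value that is not :compatibility.

```ruby
# Hypothetical stand-in for Unicode.decompose, using only core Ruby.
def decompose_codepoints(type, codepoints)
  form = type == :compatibility ? :nfkd : :nfd
  codepoints.pack("U*").unicode_normalize(form).codepoints
end

# U+00E9 (precomposed e-acute) splits into "e" plus a combining accent.
puts decompose_codepoints(:canonical, [233]).inspect        # => [101, 769]

# U+FB01 (LATIN SMALL LIGATURE FI) only has a compatibility decomposition.
puts decompose_codepoints(:canonical, [0xFB01]).inspect     # => [64257]
puts decompose_codepoints(:compatibility, [0xFB01]).inspect # => [102, 105]
```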

normalize(string, form = nil)
# File activesupport/lib/active_support/multibyte/unicode.rb, line 118
def normalize(string, form = nil)
  form ||= @default_normalization_form

  # See https://www.unicode.org/reports/tr15, Table 1
  if alias_form = NORMALIZATION_FORM_ALIASES[form]
    ActiveSupport::Deprecation.warn(<<-MSG.squish)
      ActiveSupport::Multibyte::Unicode#normalize is deprecated and will be
      removed from Rails 6.1. Use String#unicode_normalize(:#{alias_form}) instead.
    MSG

    string.unicode_normalize(alias_form)
  else
    ActiveSupport::Deprecation.warn(<<-MSG.squish)
      ActiveSupport::Multibyte::Unicode#normalize is deprecated and will be
      removed from Rails 6.1. Use String#unicode_normalize instead.
    MSG

    raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end
end
```

Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.

  • string - The string to perform normalization on.

  • form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is #default_normalization_form.
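
Since the deprecation message points to String#unicode_normalize, a migration might look like the sketch below. The FORM_ALIASES hash and the normalize_with_alias method are hypothetical names for this example; the mapping from :c, :d, :kc and :kd to Ruby's :nfc, :nfd, :nfkc and :nfkd is an assumption based on the form names listed above, not a copy of the Rails constant.

```ruby
# Assumed mapping from the short form names to String#unicode_normalize forms.
FORM_ALIASES = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

# Hypothetical replacement helper; defaults to :kc as described above.
def normalize_with_alias(string, form = :kc)
  target = FORM_ALIASES.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  string.unicode_normalize(target)
end

puts normalize_with_alias("\uFB01")        # NFKC expands the "fi" ligature to "fi"
puts normalize_with_alias("e\u0301", :c)   # NFC composes to a single U+00E9
```

In new code it is usually simpler to call string.unicode_normalize(:nfkc) directly and drop the short aliases.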

pack_graphemes(unpacked)
# File activesupport/lib/active_support/multibyte/unicode.rb, line 48
def pack_graphemes(unpacked)
  ActiveSupport::Deprecation.warn(<<-MSG.squish)
    ActiveSupport::Multibyte::Unicode#pack_graphemes is deprecated and will be
    removed from Rails 6.1. Use array.flatten.pack("U*") instead.
  MSG

  unpacked.flatten.pack("U*")
end

Reverse operation of unpack_graphemes.

Unicode.pack_graphemes(Unicode.unpack_graphemes('क्षि')) # => 'क्षि'
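
The replacement suggested by the deprecation message can be tried in plain Ruby. This is a small sketch; pack_clusters is a hypothetical name used only here.

```ruby
# Hypothetical stand-in for Unicode.pack_graphemes, using only core Ruby.
def pack_clusters(unpacked)
  # Flatten the per-grapheme arrays into one codepoint list,
  # then pack it into a UTF-8 string.
  unpacked.flatten.pack("U*")
end

# The last cluster is "e" plus U+0301 COMBINING ACUTE ACCENT.
packed = pack_clusters([[67], [97], [102], [101, 769]])
puts packed == "Cafe\u0301" # => true
```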

tidy_bytes(string, force = false)
# File activesupport/lib/active_support/multibyte/unicode.rb, line 78
def tidy_bytes(string, force = false)
  return string if string.empty?
  return recode_windows1252_chars(string) if force
  string.scrub { |bad| recode_windows1252_chars(bad) }
end

Replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.

Passing true as the force argument tidies every byte, on the assumption that the string is encoded entirely in CP1252 or ISO-8859-1.
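
The two modes can be approximated in plain Ruby as follows. This is a sketch under assumptions: recode_cp1252 is a hypothetical helper standing in for the private recode_windows1252_chars method, whose exact implementation is not shown in this page, and tidy_sketch is a hypothetical name for the example.

```ruby
# Assumed behavior of the recoding helper: reinterpret the bytes as
# Windows-1252 and transcode them to UTF-8, replacing unmapped bytes.
def recode_cp1252(bytes)
  bytes.dup.force_encoding(Encoding::Windows_1252)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Hypothetical stand-in for Unicode.tidy_bytes.
def tidy_sketch(string, force = false)
  return string if string.empty?
  return recode_cp1252(string) if force
  # Only byte sequences that are invalid in UTF-8 are recoded.
  string.scrub { |bad| recode_cp1252(bad) }
end

# "caf" followed by a lone 0xE9 byte, which is invalid UTF-8.
broken = "caf\xE9".b.force_encoding(Encoding::UTF_8)
puts tidy_sketch(broken) == "caf\u00E9"                 # => true

# With force, valid UTF-8 is reinterpreted byte by byte as CP1252.
puts tidy_sketch("\u00E9", true) == "\u00C3\u00A9"      # => true
```

The second call shows why force should only be used when the input really is CP1252 or ISO-8859-1: text that is already valid UTF-8 is recoded a second time and becomes garbled.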

unpack_graphemes(string)
# File activesupport/lib/active_support/multibyte/unicode.rb, line 36
def unpack_graphemes(string)
  ActiveSupport::Deprecation.warn(<<-MSG.squish)
    ActiveSupport::Multibyte::Unicode#unpack_graphemes is deprecated and will be
    removed from Rails 6.1. Use string.scan(/\X/).map(&:codepoints) instead.
  MSG

  string.scan(/\X/).map(&:codepoints)
end

Unpack the string at grapheme boundaries. Returns an array of codepoint arrays, one per grapheme cluster.

Unicode.unpack_graphemes('क्षि') # => [[2325, 2381], [2359], [2367]]
Unicode.unpack_graphemes('Café') # => [[67], [97], [102], [233]]
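
The suggested replacement, string.scan(/\X/).map(&:codepoints), can be checked in plain Ruby. In this sketch, unpack_clusters is a hypothetical name; the example uses Latin text with a combining accent to show that a base letter and its accent stay in one cluster.

```ruby
# Hypothetical stand-in for Unicode.unpack_graphemes, using only core Ruby.
def unpack_clusters(string)
  # \X matches one extended grapheme cluster at a time.
  string.scan(/\X/).map(&:codepoints)
end

# Precomposed e-acute: one codepoint per cluster.
puts unpack_clusters("Caf\u00E9").inspect  # => [[67], [97], [102], [233]]

# "e" plus combining accent: two codepoints, still one cluster.
puts unpack_clusters("Cafe\u0301").inspect # => [[67], [97], [102], [101, 769]]
```

Flattening and packing the result with flatten.pack("U*") gives back the original string.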

© 2004–2019 David Heinemeier Hansson
Licensed under the MIT License.