Always read data/list.txt as UTF-8 to avoid "ArgumentError: invalid byte sequence in US-ASCII" when parsing it #118

@dentarg

If your environment does not specify UTF-8, Ruby's default external encoding falls back to US-ASCII, and when public_suffix tries to parse the list data, it fails:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `strip!'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `block (2 levels) in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `each_line'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `block in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:128:in `initialize'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `new'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `parse'
    from (irb):1
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):002:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):003:0> RUBY_VERSION
=> "2.2.5"
irb(main):004:0>

Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
=> nil
irb(main):002:0> RUBY_VERSION
=> "2.2.5"
irb(main):003:0> Encoding.default_external
=> #<Encoding:US-ASCII>
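
The difference can be reproduced without the gem. The following is only a minimal sketch: `read_list_utf8` is a hypothetical helper (not part of public_suffix), the temp file stands in for data/list.txt, and the US-ASCII read is requested explicitly to imitate what happens when `Encoding.default_external` is US-ASCII.

```ruby
require "tempfile"

# Hypothetical helper, not part of public_suffix: always decode the file
# as UTF-8, whatever Encoding.default_external happens to be.
def read_list_utf8(path)
  File.read(path, encoding: Encoding::UTF_8)
end

# Stand-in for data/list.txt: a comment line plus one non-ASCII rule.
sample = Tempfile.new(["list", ".txt"])
File.binwrite(sample.path, "// sample\n\u03B5\u03BB\n")

# Imitates a read under a US-ASCII default external encoding.
ascii_data = File.read(sample.path, encoding: Encoding::US_ASCII)
utf8_data  = read_list_utf8(sample.path)

puts ascii_data.valid_encoding? # false: the bytes are not valid US-ASCII
puts utf8_data.valid_encoding?  # true

# Line-by-line stripping, similar to what the parser does, works on the UTF-8 string.
utf8_data.each_line { |line| line.strip! }
```

Passing the encoding at the read site keeps the result independent of LANG, LC_ALL and similar environment variables.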

Related to #94 (maybe the list data has changed since?)
