diff --git a/encoding.bs b/encoding.bs
index 6e591d7..6bfcb55 100644
--- a/encoding.bs
+++ b/encoding.bs
@@ -46,8 +46,8 @@ specification does not provide a mechanism for extending any aspect of encodings
 encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was
 reported in 2011 where a Shift_JIS lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON
 resource of which an attacker could control some field. The producer did not see the problem
-even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and
-therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of
+even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD (�) and
+therefore changed the overall interpretation as U+0022 (") is an important delimiter. Decoders of
 encodings that use multiple bytes for scalar values now require that in case of an illegal byte
 combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the
 aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the
@@ -179,7 +179,7 @@ from an I/O queue ioQueue, run these steps:
If the last item in ioQueue is end-of-queue, then: +
If the last item in ioQueue is end-of-queue:
If item is end-of-queue, do nothing. @@ -786,9 +786,9 @@ duplicated. decoder and one for its encoder.
To find the pointers and their corresponding code points in an index, -let lines be the result of splitting the resource's contents on U+000A. -Then remove each item in lines that is the empty string or starts with U+0023. -Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009. +let lines be the result of splitting the resource's contents on U+000A LF. +Then remove each item in lines that is the empty string or starts with U+0023 (#). +Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009 TAB. The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). Other subitems are not relevant. @@ -801,10 +801,10 @@ changed, so has the index. pointer in index, or null if pointer is not in index. -
The index pointer for code point in +
The index pointer for codePoint in index is the first pointer corresponding to -code point in index, or null if -code point is not in index. +codePoint in index, or null if +codePoint is not in index.
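For reviewers who want to sanity-check the index wording touched in this hunk, here is a minimal Python sketch of the parsing steps and the two lookups. The function names and the sample text are hypothetical and not part of the specification; the sample is not a real index.

```python
def parse_index(text):
    """Return (pointer, code point) pairs from the text of an index resource."""
    # Split the resource's contents on U+000A LF.
    lines = text.split("\n")
    # Remove items that are the empty string or start with U+0023 (#).
    lines = [line for line in lines if line != "" and not line.startswith("#")]
    entries = []
    for line in lines:
        # Split on U+0009 TAB; subitems after the second are not relevant.
        subitems = line.split("\t")
        pointer = int(subitems[0], 10)      # decimal number
        code_point = int(subitems[1], 16)   # hexadecimal number ("0x" prefix accepted)
        entries.append((pointer, code_point))
    return entries


def index_code_point(pointer, index):
    """Code point corresponding to pointer in index, or None if absent."""
    for p, cp in index:
        if p == pointer:
            return cp
    return None


def index_pointer(code_point, index):
    """First pointer corresponding to code_point in index, or None if absent."""
    for p, cp in index:
        if cp == code_point:
            return p
    return None


# Hypothetical sample text, made up for illustration only.
SAMPLE = "# hypothetical sample\n\n0\t0x20AC\tfirst\n1\t0x0081\n2\t0x20AC\tduplicate\n"
INDEX = parse_index(SAMPLE)
```

With this sample, `index_pointer(0x20AC, INDEX)` picks pointer 0 rather than 2, which is the "first pointer" behavior the renamed definition describes.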
There is a non-normative visualization for each index other than @@ -918,10 +918,10 @@ specification, excluding index single-byte, which have their own table: the return value of these steps:
If pointer is greater than 39419 and less than - 189000, or pointer is greater than 1237575, return null. +
If pointer is greater than 39419 and less than 189000, or pointer is + greater than 1237575, then return null. -
If pointer is 7457, return code point U+E7C7. +
If pointer is 7457, then return code point U+E7C7.
Let offset be the last pointer in index gb18030 ranges that is less than @@ -934,23 +934,23 @@ the return value of these steps:
The index gb18030 ranges pointer for code point is +
The index gb18030 ranges pointer for codePoint is the return value of these steps:
If code point is U+E7C7, return pointer 7457. +
If codePoint is U+E7C7, then return pointer 7457.
Let offset be the last code point in index gb18030 ranges that is less - than or equal to code point and let pointer offset be its corresponding + than or equal to codePoint and let pointer offset be its corresponding pointer.
Return a pointer whose value is - pointer offset + code point − offset. + pointer offset + codePoint − offset.
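The range lookups above can be sketched in Python as follows. This is an illustration under stated assumptions: `TOY_RANGES` is a hypothetical two-entry stand-in for index gb18030 ranges, so results inside the Basic Multilingual Plane will not match the real index, and the final two steps of the code point direction are assumed to mirror the pointer direction because the diff truncates them.

```python
# Hypothetical stand-in for index gb18030 ranges: (pointer, code point) pairs.
TOY_RANGES = [(0, 0x0080), (189000, 0x10000)]


def ranges_pointer(code_point, ranges=TOY_RANGES):
    """Index gb18030 ranges pointer for a non-ASCII code point (sketch)."""
    if code_point == 0xE7C7:
        return 7457
    # Last entry whose code point is less than or equal to code_point.
    pointer_offset, offset = max(
        (entry for entry in ranges if entry[1] <= code_point),
        key=lambda entry: entry[1],
    )
    return pointer_offset + code_point - offset


def ranges_code_point(pointer, ranges=TOY_RANGES):
    """Index gb18030 ranges code point for pointer (sketch; tail steps assumed)."""
    if 39419 < pointer < 189000 or pointer > 1237575:
        return None
    if pointer == 7457:
        return 0xE7C7
    # Assumed symmetric step: last entry whose pointer is <= pointer.
    offset, code_point_offset = max(
        (entry for entry in ranges if entry[0] <= pointer),
        key=lambda entry: entry[0],
    )
    return code_point_offset + pointer - offset
```

Under these assumptions, `ranges_pointer(0x10FFFF)` evaluates to 1237575, which lines up with the upper bound checked in the code point direction.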
The index Shift_JIS pointer for code point is the return value of these +
The index Shift_JIS pointer for codePoint is the return value of these steps:
The index jis0208 contains duplicate code points so the exclusion of these entries causes later code points to be used. -
Return the index pointer for code point in - index. +
Return the index pointer for codePoint in index.
The index Big5 pointer for code point is the return value of +
The index Big5 pointer for codePoint is the return value of these steps:
Avoid returning Hong Kong Supplementary Character Set extensions literally.
If code point is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345, - return the last pointer corresponding to code point in +
If codePoint is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345, + return the last pointer corresponding to codePoint in index.
There are other duplicate code points, but for those the first pointer is to be used. -
Return the index pointer for code point in - index. +
Return the index pointer for codePoint in index.
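A small Python sketch of the Big5 pointer selection may help when checking the rename. It assumes the index passed in has already had the Hong Kong Supplementary Character Set entries excluded (that step is not shown in this diff), and `TOY_INDEX` uses made-up pointers purely for illustration.

```python
# Code points for which the last pointer is used, as listed in the steps above.
LAST_POINTER_CODE_POINTS = {0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345}


def big5_pointer(code_point, index):
    """Pick the Big5 pointer for code_point from (pointer, code point) pairs."""
    pointers = [p for p, cp in index if cp == code_point]
    if not pointers:
        return None
    if code_point in LAST_POINTER_CODE_POINTS:
        return pointers[-1]   # last pointer corresponding to code_point
    return pointers[0]        # otherwise the first pointer is used


# Hypothetical index excerpt in ascending pointer order; pointers are invented.
TOY_INDEX = [
    (5100, 0x2550),
    (5200, 0x4E00),
    (5300, 0x2550),
    (5400, 0x4E00),
]
```

Here `big5_pointer(0x2550, TOY_INDEX)` selects 5300 while `big5_pointer(0x4E00, TOY_INDEX)` selects 5200, showing the two branches side by side.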
If decoder's encoding is UTF-8 or UTF-16BE/LE, and decoder's ignore BOM and - BOM seen are false, then: + BOM seen are false:
Append item to output. @@ -1380,7 +1378,7 @@ getter steps are to return this's encoding's
The fatal getter
steps are to return true if this's error mode is
-"fatal", otherwise false.
+"fatal"; otherwise false.
decoder . fatal
- Returns true if error mode is "fatal", otherwise
+
Returns true if error mode is "fatal"; otherwise
false.
decoder . ignoreBOM
@@ -1674,7 +1672,7 @@ method steps are:
If destination's byte length − - written is greater than or equal to the number of bytes in result, then: + written is greater than or equal to the number of bytes in result:
If item is greater than U+FFFF, then increment read by 2. @@ -1856,7 +1854,7 @@ constructor steps are: I/O queue.
If item is end-of-queue, then: +
If item is end-of-queue:
Let outputChunk be the result of running serialize I/O queue with @@ -1900,7 +1898,7 @@ steps: error mode.
If result is finished, then: +
If result is finished:
Let outputChunk be the result of running serialize I/O queue with @@ -2014,7 +2012,7 @@ constructor steps are:
{{DOMString}}, as well as an I/O queue of code units rather than scalar values, are used here so that a surrogate pair that is split between chunks can be reassembled into the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular, - lone surrogates will be replaced with U+FFFD. + lone surrogates will be replaced with U+FFFD (�).
Let output be the I/O queue of bytes « end-of-queue ». @@ -2025,13 +2023,13 @@ constructor steps are:
Let item be the result of reading from input.
If item is end-of-queue, then: +
If item is end-of-queue:
Convert output into a byte sequence.
If output is non-empty, then: +
If output is non-empty:
Let chunk be a {{Uint8Array}} object wrapping an {{ArrayBuffer}} containing @@ -2061,7 +2059,7 @@ constructor steps are:
If encoder's leading surrogate is non-null, then: +
If encoder's leading surrogate is non-null:
Let leadingSurrogate be encoder's @@ -2074,13 +2072,13 @@ constructor steps are:
Restore item to input. -
Return U+FFFD. +
Return U+FFFD (�).
If item is a leading surrogate, then set encoder's leading surrogate to item and return continue. -
If item is a trailing surrogate, then return U+FFFD. +
If item is a trailing surrogate, then return U+FFFD (�).
Return item.
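The code-unit-to-scalar-value steps above can be sketched in Python to show how a surrogate pair split between chunks is reassembled. The class and method names are hypothetical; code units are modeled as plain integers, and the `flush` behavior (emitting U+FFFD for a dangling leading surrogate) is an assumption, since the diff only shows the condition of that step.

```python
REPLACEMENT = 0xFFFD


def is_leading(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trailing(unit):
    return 0xDC00 <= unit <= 0xDFFF


class CodeUnitConverter:
    """Converts chunks of UTF-16 code units into scalar values (sketch)."""

    def __init__(self):
        self.leading_surrogate = None

    def feed(self, chunk):
        out = []
        units = list(chunk)
        i = 0
        while i < len(units):
            item = units[i]
            i += 1
            if self.leading_surrogate is not None:
                lead = self.leading_surrogate
                self.leading_surrogate = None
                if is_trailing(item):
                    out.append(0x10000 + ((lead - 0xD800) << 10) + (item - 0xDC00))
                    continue
                i -= 1  # restore item to input so it is processed again
                out.append(REPLACEMENT)
                continue
            if is_leading(item):
                self.leading_surrogate = item  # corresponds to "return continue"
                continue
            if is_trailing(item):
                out.append(REPLACEMENT)
                continue
            out.append(item)
        return out

    def flush(self):
        # Assumed end-of-stream handling for a dangling leading surrogate.
        if self.leading_surrogate is not None:
            self.leading_surrogate = None
            return [REPLACEMENT]
        return []
```

Feeding `[0x48, 0xD83D]` and then `[0xDE00, 0x21]` to one converter yields `[0x48]` followed by `[0x1F600, 0x21]`, which is the split-pair case the note describes.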
If encoder's leading surrogate is non-null, then: +
If encoder's leading surrogate is non-null:
If byte is end-of-queue, return - finished. +
If byte is end-of-queue, then return finished.
If UTF-8 bytes needed is 0, based on byte: @@ -2199,7 +2196,7 @@ in deployed content. Therefore it is not part of the UTF-8 decoder algori
If byte is not in the range UTF-8 lower boundary to - UTF-8 upper boundary, inclusive, then: + UTF-8 upper boundary, inclusive:
Set UTF-8 code point, @@ -2225,15 +2222,15 @@ in deployed content. Therefore it is not part of the UTF-8 decoder algori
Increase UTF-8 bytes seen by one. -
If UTF-8 bytes seen is not equal to - UTF-8 bytes needed, return continue. +
If UTF-8 bytes seen is not equal to UTF-8 bytes needed, then return + continue. -
Let code point be UTF-8 code point. +
Let codePoint be UTF-8 code point.
Set UTF-8 code point, UTF-8 bytes needed, and UTF-8 bytes seen to 0. -
Return a code point whose value is code point. +
Return a code point whose value is codePoint.
The constraints in the UTF-8 decoder above match @@ -2246,18 +2243,17 @@ achieve the same result are fine, even encouraged).
UTF-8's encoder's handler, given unused and -code point, runs these steps: +codePoint, runs these steps:
If code point is end-of-queue, return - finished. +
If codePoint is end-of-queue, then return finished. -
If code point is an ASCII code point, return - a byte whose value is code point. +
If codePoint is an ASCII code point, then return a byte whose value is + codePoint.
Set count and offset based on the - range code point is in: + range codePoint is in:
Let bytes be a byte sequence whose first byte is - (code point >> (6 × count)) + offset. + (codePoint >> (6 × count)) + offset.
While count is greater than 0:
Set temp to - code point >> (6 × (count − 1)). + codePoint >> (6 × (count − 1)).
Append to bytes 0x80 | (temp & 0x3F). @@ -2347,37 +2343,34 @@ historically this might have been the case for ISO-8859-6 and unused and byte, runs these steps:
If byte is end-of-queue, return - finished. +
If byte is end-of-queue, then return finished. -
If byte is an ASCII byte, return a code point whose value - is byte. +
If byte is an ASCII byte, then return a code point whose value is + byte. -
Let code point be the index code point +
Let codePoint be the index code point for byte − 0x80 in index single-byte. -
If code point is null, return error. +
If codePoint is null, then return error. -
Return a code point whose value is code point. +
Return a code point whose value is codePoint.
Single-byte encodings's encoder's handler, given -unused and code point, runs these steps: +unused and codePoint, runs these steps:
If code point is end-of-queue, return - finished. +
If codePoint is end-of-queue, then return finished. -
If code point is an ASCII code point, return - a byte whose value is code point. +
If codePoint is an ASCII code point, then return a byte whose value is + codePoint. -
Let pointer be the index pointer for - code point in index single-byte. +
Let pointer be the index pointer for codePoint in + index single-byte. -
If pointer is null, return error with - code point. +
If pointer is null, then return error with codePoint.
Return a byte whose value is pointer + 0x80.
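As a cross-check of the two single-byte handlers, here is a minimal Python sketch. The sentinel values, function names, and `TOY_INDEX` are hypothetical; the index is a made-up excerpt rather than any real single-byte index, and end-of-queue is modeled as `None`.

```python
FINISHED = "finished"
ERROR = "error"
END_OF_QUEUE = None

# Hypothetical index single-byte excerpt: pointer -> code point.
TOY_INDEX = {0: 0x20AC, 2: 0x201A}


def decoder_handler(byte, index=TOY_INDEX):
    if byte is END_OF_QUEUE:
        return FINISHED
    if byte <= 0x7F:                      # ASCII byte
        return byte
    code_point = index.get(byte - 0x80)   # index code point for byte - 0x80
    if code_point is None:
        return ERROR
    return code_point


def encoder_handler(code_point, index=TOY_INDEX):
    if code_point is END_OF_QUEUE:
        return FINISHED
    if code_point <= 0x7F:                # ASCII code point
        return code_point
    # Index pointer: the first pointer whose code point matches.
    pointer = next((p for p in sorted(index) if index[p] == code_point), None)
    if pointer is None:
        return (ERROR, code_point)        # error with codePoint
    return pointer + 0x80
```

With the toy index, `decoder_handler(0x80)` gives 0x20AC and `encoder_handler(0x201A)` gives 0x82, so the two handlers round-trip for mapped entries.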
If byte is end-of-queue and - gb18030 first, gb18030 second, and gb18030 third - are 0x00, return finished. +
If byte is end-of-queue and gb18030 first, gb18030 second, + and gb18030 third are 0x00, then return finished. -
If byte is end-of-queue, and - gb18030 first, gb18030 second, or gb18030 third - is not 0x00, set gb18030 first, gb18030 second, and +
If byte is end-of-queue, and gb18030 first, gb18030 second, + or gb18030 third is not 0x00, then set gb18030 first, gb18030 second, and gb18030 third to 0x00, and return error.
If gb18030 third is not 0x00, then: +
If gb18030 third is not 0x00:
If byte is not in the range 0x30 to 0x39, inclusive, then: +
If byte is not in the range 0x30 to 0x39, inclusive:
Restore « gb18030 second, gb18030 third, byte » to @@ -2439,20 +2430,20 @@ consumers of content generated with GBK's encoder.
Return error.
Let code point be the index gb18030 ranges code point for +
Let codePoint be the index gb18030 ranges code point for ((gb18030 first − 0x81) × (10 × 126 × 10)) + ((gb18030 second − 0x30) × (10 × 126)) + ((gb18030 third − 0x81) × 10) + byte − 0x30.
Set gb18030 first, gb18030 second, and gb18030 third to 0x00. -
If code point is null, return error. +
If codePoint is null, then return error. -
Return a code point whose value is code point. +
Return a code point whose value is codePoint.
If gb18030 second is not 0x00, then: +
If gb18030 second is not 0x00:
If gb18030 first is not 0x00, then: +
If gb18030 first is not 0x00:
If byte is in the range 0x30 to 0x39, inclusive, set @@ -2472,31 +2463,31 @@ consumers of content generated with GBK's encoder.
Let lead be gb18030 first, let pointer be null, and set gb18030 first to 0x00. -
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x41. +
Let offset be 0x40 if byte is less than 0x7F; otherwise 0x41.
If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFE, inclusive, set pointer to (lead − 0x81) × 190 + (byte − offset). -
Let code point be null if pointer is null, otherwise the +
Let codePoint be null if pointer is null; otherwise the index code point for pointer in index gb18030. -
If code point is non-null, return a code point whose value is - code point. +
If codePoint is non-null, then return a code point whose value is + codePoint. -
If byte is an ASCII byte, restore byte to +
If byte is an ASCII byte, then restore byte to ioQueue.
Return error.
If byte is an ASCII byte, return - a code point whose value is byte. +
If byte is an ASCII byte, then return a code point whose value is + byte. -
If byte is 0x80, return code point U+20AC. +
If byte is 0x80, then return code point U+20AC (€). -
If byte is in the range 0x81 to 0xFE, inclusive, set - gb18030 first to byte and return continue. +
If byte is in the range 0x81 to 0xFE, inclusive, then set gb18030 first to + byte and return continue.
Return error.
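The pointer arithmetic in the decoder steps above is easy to get wrong by one, so a short Python sketch of just the two formulas may be useful during review. The function names are hypothetical; only the arithmetic quoted in the steps is reproduced, not the full decoder state machine.

```python
def two_byte_pointer(lead, byte):
    """Pointer for a two-byte sequence, or None if byte is out of range."""
    offset = 0x40 if byte < 0x7F else 0x41
    if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFE:
        return (lead - 0x81) * 190 + (byte - offset)
    return None


def four_byte_pointer(first, second, third, byte):
    """Pointer passed to the index gb18030 ranges code point lookup."""
    return (
        (first - 0x81) * (10 * 126 * 10)
        + (second - 0x30) * (10 * 126)
        + (third - 0x81) * 10
        + byte - 0x30
    )
```

The two offsets make trail bytes 0x7E and 0x80 land on adjacent pointers (62 and 63 for lead 0x81), skipping 0x7F. For the four-byte form, the bytes 0x90 0x30 0x81 0x30 evaluate to 189000 and 0xE3 0x32 0x9A 0x35 to 1237575, the same bounds that appear in the ranges steps.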
gb18030's encoder's handler, given unused and -code point, runs these steps: +codePoint, runs these steps:
If code point is end-of-queue, return - finished. +
If codePoint is end-of-queue, then return finished. -
If code point is an ASCII code point, return - a byte whose value is code point. +
If codePoint is an ASCII code point, then return a byte whose value is + codePoint.
If code point is U+E5E5, return error with code point. +
If codePoint is U+E5E5, then return error with codePoint.
Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for compatibility with deployed content. Therefore it cannot roundtrip. -
If is GBK is true and code point is - U+20AC, return byte 0x80. +
If is GBK is true and codePoint is U+20AC (€), then return byte 0x80.
If there is a row in the table below whose first column is code point, then return +
If there is a row in the table below whose first column is codePoint, then return the two bytes on the same row listed in the second column:
| Pointer | Code points | Notes
@@ -2691,11 +2682,11 @@ and byte, runs these steps:
 Since indexes are limited to single code points this table is used for these pointers.

-Let code point be null if pointer is null, otherwise the
+Let codePoint be null if pointer is null; otherwise the
 index code point for pointer in index Big5.

-If code point is non-null, return a code point whose value is
-code point.
+If codePoint is non-null, then return a code point whose value is
+codePoint.

 If byte is an ASCII byte, restore byte to ioQueue.
@@ -2703,10 +2694,10 @@ and byte, runs these steps:
 Return error.

-If byte is an ASCII byte, return
-a code point whose value is byte.
+If byte is an ASCII byte, then return a code point whose value is
+byte.

-If byte is in the range 0x81 to 0xFE, inclusive, set
+If byte is in the range 0x81 to 0xFE, inclusive, then set
 Big5 lead to byte and return continue.

 Return error.
@@ -2716,20 +2707,17 @@ and byte, runs these steps:
 Big5 encoder
 Big5's encoder's handler, given unused and
-code point, runs these steps:
+codePoint, runs these steps:
If state is non-null, then: + If state is non-null:
x-user-defined encoderx-user-defined's encoder's handler, given unused and -code point, runs these steps: +codePoint, runs these steps: