https://bugzilla.redhat.com/show_bug.cgi?id=1301928
Cedric Buissart cbuissar@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |cbuissar@redhat.com
--- Comment #3 from Cedric Buissart cbuissar@redhat.com --- Below is my current understanding of this issue (which, I believe, is identical to 1304636) :
The issue is when a word starts with normal ASCII chars and jumps to UTF multibytes chars.
The issue is in htmlParseNameComplex. More precisely, in the while{} loop. The following happens :
vars: len: the size of the word, in bytes. this is used to be able to get back to the begining of the word (i.e.: 'ctxt->input->cur - len') c : the current character (can be multibytes) l : the size in bytes of character c
The while loop will find the end of the word. The expectation is the following : 'ctxt->input->cur' points to the end of the word, and len contains the word's length in byte, thus ctxt->input->cur - len points to the beginning of the word.
htmlCurrentChar() is called during the process (via macro CUR_CHAR), and returns the next character (possibly multibytes) along with its size (l is updated). While in htmlCurrentChar(), if the character is multibytes/non-ASCII, this will lead to a change of encoding via the function xmlSwitchToEncodingInt(). During this switch, xmlBufShrink() is called. The purpose of this function is to remove the beginning of an XML buffer, via memmove. But this is done from the current character, not from the begining of the word. Thus, in the process, ctxt->input->cur will point to the begining of the string, and thus 'ctxt->input->cur - len' will point before the beginning of the string.
The number of bytes to be removed is based on the following calculation : [parserInternals.c:1202] processed = input->cur - input->base; In order to keep the current word, htmlParseNameComplex's 'len' value should be removed here too so that the shrinking stops at the begining of the word instead of the current character.
i.e.: my understanding is that xmlBufShrink() should not shrink beyond the beginning of the current word
==============
Breakpoint 2, xmlBufShrink__internal_alias (buf=0x602550, len=len@entry=14) at buf.c:386 386 xmlBufShrink(xmlBufPtr buf, size_t len) { (gdb) bt #0 xmlBufShrink__internal_alias (buf=0x602550, len=len@entry=14) at buf.c:386 #1 0x00007ffff7aa8764 in xmlSwitchInputEncodingInt (ctxt=0x6045b0, input=0x605b30, handler=0x6023e0, len=45) at parserInternals.c:1194 #2 0x00007ffff7aa9b59 in xmlSwitchToEncodingInt (len=<optimized out>, handler=<optimized out>, ctxt=0x6045b0) at parserInternals.c:1272 #3 xmlSwitchEncoding__internal_alias (ctxt=ctxt@entry=0x6045b0, enc=enc@entry=XML_CHAR_ENCODING_8859_1) at parserInternals.c:1100 #4 0x00007ffff7ae7425 in htmlCurrentChar (ctxt=0x6045b0, len=0x7fffffffdc54) at HTMLparser.c:518 #5 0x00007ffff7ae77d5 in htmlParseNameComplex (ctxt=0x6045b0) at HTMLparser.c:2515 #6 htmlParseName (ctxt=ctxt@entry=0x6045b0) at HTMLparser.c:2483 #7 0x00007ffff7aa0a73 in htmlParseDocTypeDecl (ctxt=ctxt@entry=0x6045b0) at HTMLparser.c:3398 #8 0x00007ffff7aed52d in htmlParseTryOrFinish (terminate=<optimized out>, ctxt=<optimized out>) at HTMLparser.c:5440 #9 htmlParseChunk__internal_alias (ctxt=0x6045b0, chunk=<optimized out>, size=<optimized out>, terminate=0) at HTMLparser.c:6070 #10 0x00000000004007f4 in main (argc=1, arg=0x7fffffffdec8) at foo.c:25
Before the shrink : (gdb) p *ctxt->input $9 = {buf = 0x602500, [...], base = 0x6025a0 "<!DOCTYPE html\342\t</</body></html>", cur = 0x6025ae "\342\t</</body></html>", end = 0x6025c1 "", length = 0, line = 1, col = 15, consumed = 0, [...]}
After the shrink: (gdb) p *ctxt->input $15 = {buf = 0x602500, [...], base = 0x6025a0 "\342\t</</body></html>", cur = 0x6025ae "tml>", end = 0x6025c1 "", length = 0, line = 1, col = 15, consumed = 0, [...]}
The final input, after readjustement : (gdb) p *ctxt->input $23 = {buf = 0x602500, [...], base = 0x605d50 "â\t</</body></html>", cur = 0x605d50 "â\t</</body></html>", end = 0x605d64 "", length = 0, line = 1, col = 15, consumed = 0, [...]}