<!DOCTYPE qhelp PUBLIC
"-//Semmle//qhelp//EN"
"qhelp.dtd">

<qhelp>

	<include src="ReDoSIntroduction.inc.qhelp" />

	<example>
		<p>

			Consider this use of a regular expression, which removes
			all leading and trailing whitespace in a string:

		</p>

		<sample language="javascript">
			text.replace(/^\s+|\s+$/g, ''); // BAD
		</sample>

		<p>

			The sub-expression <code>\s+$</code> will match the
			whitespace characters in <code>text</code> from left to right, but it
			can start matching anywhere within a whitespace sequence. This is
			problematic for strings that do <strong>not</strong> end with a whitespace
			character. Such a string will force the regular expression engine to
			process each whitespace sequence once per whitespace character in the
			sequence.

		</p>

		<p>

			This ultimately means that the time cost of trimming a
			string is quadratic in the length of the string. So a string like
			<code>"a b"</code> will take milliseconds to process, but a similar
			string with a million spaces instead of just one will take several
			minutes.

		</p>
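
		<p>

			The quadratic growth can be illustrated with a simplified
			model. The sketch below is an assumption about how a backtracking
			matcher behaves, not a measurement of any real engine, and
			<code>countWhitespaceReads</code> is a hypothetical helper. It counts how
			many whitespace characters are read while trying <code>\s+$</code>
			at every start position of the input:

		</p>

```javascript
// Simplified model (an assumption, not a real regex engine): count the
// whitespace characters read while attempting /\s+$/ at each start position.
function countWhitespaceReads(text) {
  let reads = 0;
  for (let start = 0; start !== text.length; start++) {
    let pos = start;
    // Greedily read the whitespace run that begins at `start`.
    while (pos !== text.length) {
      if (!/\s/.test(text[pos])) break;
      reads++;
      pos++;
    }
    // Reaching the end of the string means `$` matches, so the search stops.
    // Otherwise this attempt fails and the next start position is tried.
    if (pos === text.length) break;
  }
  return reads;
}

// One hundred spaces that are NOT at the end of the string.
const readsInMiddle = countWhitespaceReads('a' + ' '.repeat(100) + 'b');
// The same run at the end of the string.
const readsAtEnd = countWhitespaceReads('a' + ' '.repeat(100));
console.log(readsInMiddle, readsAtEnd);
```

		<p>

			Under this model, a run of <code>n</code> whitespace
			characters that is followed by a non-whitespace character costs
			<code>n(n+1)/2</code> reads (5050 for <code>n = 100</code>), while the
			same run at the very end of the string costs only <code>n</code> reads.

		</p>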

		<p>

			Avoid this problem by rewriting the regular expression so
			that it is no longer ambiguous about where a match of a whitespace
			sequence can start, for instance by using a negative look-behind
			(<code>/^\s+|(?&lt;!\s)\s+$/g</code>), or simply use the built-in trim
			method (<code>text.trim()</code>).

		</p>
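
		<p>

			As a quick sanity check before swapping implementations,
			the sketch below compares the original regular expression with the
			built-in method on a few short inputs. The helper names and the sample
			strings are illustrative assumptions, not part of the query:

		</p>

```javascript
// Hypothetical helpers for comparison; the names are illustrative only.
function trimWithRegex(text) {
  // The original (BAD) pattern: correct results, but quadratic worst case.
  return text.replace(/^\s+|\s+$/g, '');
}

function trimBuiltIn(text) {
  // The built-in method removes leading and trailing whitespace directly.
  return text.trim();
}

// Short inputs only, so the slow pattern is harmless here.
const samples = ['  a b  ', 'a   b', '\t x \n', '', '   '];
const allAgree = samples.every((s) => trimWithRegex(s) === trimBuiltIn(s));
console.log(allAgree);
```

		<p>

			On these samples both functions return the same strings,
			so <code>text.trim()</code> can be expected to act as a drop-in
			replacement in code like the example above.

		</p>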

		<p>

			Note that the sub-expression <code>^\s+</code> is
			<strong>not</strong> problematic, because the <code>^</code> anchor
			restricts where that sub-expression can start matching and the regular
			expression engine matches from left to right.

		</p>

	</example>

	<example>

		<p>

			As a similar but slightly subtler problem, consider this
			regular expression, which matches numbers that are possibly written
			in scientific notation:
		</p>

		<sample language="javascript">
			/^0\.\d+E?\d+$/ // BAD
		</sample>

		<p>

			The problem with this regular expression is in the
			sub-expression <code>\d+E?\d+</code>: if there is no <code>E</code> in
			the input string, the second <code>\d+</code> can start matching
			digits anywhere after the first digit matched by the first
			<code>\d+</code>.

		</p>

		<p>

			This is problematic for strings that do <strong>not</strong>
			end with a digit. Such a string will force the regular expression
			engine to process each digit sequence once per digit in the sequence,
			again leading to a quadratic time complexity.

		</p>
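
		<p>

			The number of attempts can be sketched with a small
			counting model. It assumes a backtracking engine that tries every way
			of dividing a run of digits between the two <code>\d+</code>
			sub-expressions before giving up, and <code>countDigitSplits</code> is a
			hypothetical helper, not an engine API:

		</p>

```javascript
// Simplified model (an assumption): for a run of `digitCount` digits that is
// followed by a non-digit, count every (first, second) pair of lengths that
// the two \d+ sub-expressions try before the overall match fails.
function countDigitSplits(digitCount) {
  let attempts = 0;
  // The first \d+ must leave at least one digit for the second \d+.
  for (let first = 1; digitCount > first; first++) {
    // The second \d+ starts greedy and backtracks one digit at a time.
    for (let second = digitCount - first; second > 0; second--) {
      attempts++;
    }
  }
  return attempts;
}

console.log(countDigitSplits(4), countDigitSplits(100));
```

		<p>

			In this model a run of <code>n</code> digits costs
			<code>n(n-1)/2</code> attempts: 6 for four digits and 4950 for one
			hundred digits.

		</p>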

		<p>

			To make the processing faster, the regular expression
			should be rewritten such that the two <code>\d+</code> sub-expressions
			do not have overlapping matches: <code>^0\.\d+(E\d+)?$</code>.

		</p>
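
		<p>

			The sketch below runs both patterns on a few inputs chosen
			for illustration. It suggests that the rewrite keeps the intended
			behaviour for the common cases, with one difference worth knowing
			about: the rewritten pattern also accepts a single digit after the
			decimal point, such as <code>0.5</code>, which the original pattern
			rejects because it requires at least two digits when there is no
			<code>E</code>.

		</p>

```javascript
// The two patterns from the text; the sample inputs are illustrative.
const original = /^0\.\d+E?\d+$/; // BAD: overlapping \d+ sub-expressions
const rewritten = /^0\.\d+(E\d+)?$/; // GOOD: the exponent is one optional group

const inputs = ['0.12', '0.12E5', '0.5', '0.12E', '1.5', '0.'];

// Each entry is [input, matched by original, matched by rewritten].
const verdicts = inputs.map((s) => [s, original.test(s), rewritten.test(s)]);
console.log(verdicts);
```

		<p>

			Only <code>0.5</code> is classified differently by the two
			patterns here, so it is worth checking whether single-digit fractions
			should be accepted before adopting the rewrite.

		</p>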

	</example>

	<include src="ReDoSReferences.inc.qhelp"/>

</qhelp>