The OWASP community hosts a neat little project called AntiSamy[1], which takes its name from the well-known MySpace worm[2] and comes in handy when trying to mitigate Cross-site Scripting (XSS)[3] attacks. XSS is sometimes hard to mitigate when the business asks for HTML formatting in user-supplied input. In that situation AntiSamy can help, since it strips user-supplied input down to a predefined set of allowed formatting (HTML tags and attributes).
The basic steps when working with AntiSamy are:
- Define a policy file (XML)
- Sanitize user input according to policy
The Java API code is pretty straightforward. Note that AntiSamy is, to some extent, also available for .NET.
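import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
AntiSamy a = new AntiSamy();
CleanResults r = a.scan(userInput, policyPath); // r.getCleanHTML() returns the sanitized markup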
Thus, it all boils down to configuring a strict policy. Samples ship with the AntiSamy framework. The file I copied the snippets from is named antisamy-slashdot.xml[4]. AntiSamy policy files consist of the following major sections:
A) Directives
Directives describe the fundamental behavior of the framework and may also help to prevent XML External Entity (XXE) attacks[5] in XML message-based services.
<directive name="omitXmlDeclaration" value="true"/>
<directive name="omitDoctypeDeclaration" value="true"/>
<directive name="maxInputSize" value="5000"/>
<directive name="useXHTML" value="true"/>
<directive name="formatOutput" value="true"/>
<directive name="embedStyleSheets" value="false"/>
Hint: AntiSamy would prevent XXE when omitDoctypeDeclaration is set to 'true'. However, I do not consider AntiSamy an appropriate means of filtering doctype declarations in large-scale XML service environments. An application-level firewall would probably better fit enterprise-grade infrastructure needs. Note that the full list of directives is documented in the AntiSamy developer guide[6] and in the source code.
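For readers who consume XML directly in Java, the doctype concern from the hint can also be handled in the parser itself. The following is only a minimal sketch, independent of AntiSamy, and it assumes the JDK's built-in JAXP parser, which supports the Xerces feature disallow-doctype-decl. The class name DoctypeRejectingParser is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses XML with DOCTYPE declarations forbidden.
final class DoctypeRejectingParser {

    // Returns true if the document parses, false if the parser rejects it
    // (for example because it contains a DOCTYPE declaration).
    static boolean accepts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the underlying parser is Xerces-based (the JDK default).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed input or forbidden DOCTYPE
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<a>hi</a>";
        String withEntity =
            "<!DOCTYPE a [<!ENTITY x SYSTEM \"file:///etc/passwd\">]><a>&x;</a>";
        System.out.println(accepts(plain));
        System.out.println(accepts(withEntity));
    }
}
```

Because the parser stops at the DOCTYPE declaration, the external entity in the second document is never resolved.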
B) Common Regular Expressions
This section lists named expressions that describe the permitted contents of tags and attributes. It basically serves as a set of variable declarations that later sections can reference.
<regexp name="htmlTitle" value="[\p{L}\p{N}\s-',:[]!./\()&]*"/>
<regexp name="onsiteURL" value="([\p{L}\p{N}\/.\?=#&;-~]+|#(\w)+)"/>
<regexp name="offsiteURL" value="(\s)((ht|f)tp(s?)://|mailto:)[\p{L}\p{N}]+[~\p{L}\p{N}\p{Zs}-_.@#\$%&;:,\?=/+!()](\s)*"/>
Confused? It is indeed pretty difficult to write properly matching expressions. Take care not to weaken your policy in a way that would allow an adversary to pass malicious inputs. You have been warned.
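To illustrate how easily an expression gets weakened, here is a small sketch using java.util.regex. Both patterns are hypothetical, simplified URL-like expressions and are not copied from any shipped policy. The sketch also assumes that the whole value has to match the expression. The only difference between the two is whether the hyphen inside the character class is escaped.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are illustrative, not taken from a policy file.
final class PolicyRegexPitfall {

    // LOOSE: the unescaped '-' between ';' and '~' forms a range (U+003B..U+007E),
    // which silently includes '<' and '>'.
    static final Pattern LOOSE = Pattern.compile("[\\p{L}\\p{N}/.?=#&;-~]+");

    // STRICT: the escaped '\-' is a literal hyphen, so only the listed characters match.
    static final Pattern STRICT = Pattern.compile("[\\p{L}\\p{N}/.?=#&;\\-~]+");

    // Assumption: a value is allowed only if the entire value matches.
    static boolean allowed(Pattern pattern, String value) {
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        String benign = "/index.php?id=3#top";
        String hostile = "<script>";
        System.out.println("strict/benign:  " + allowed(STRICT, benign));  // true
        System.out.println("strict/hostile: " + allowed(STRICT, hostile)); // false
        System.out.println("loose/hostile:  " + allowed(LOOSE, hostile));  // true
    }
}
```

A single missing backslash turns a literal hyphen into a range that lets angle brackets through, so it pays to test each expression against hostile samples before trusting it.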
D) Attribute definitions
These definitions declare potentially allowed HTML attributes and define which values an attribute may take. The value can also be any of the named regular expressions above. Note that listing an attribute in this section does not automatically allow that attribute in user input. See the tag rules and global attributes sections for that.
<attribute name="align" description="...">
<literal-list>
<literal value="center"/>
<literal value="left"/>
<literal value="right"/>
<literal value="justify"/>
<literal value="char"/>
</literal-list>
</attribute>
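Conceptually, such a definition boils down to a membership test. The sketch below is not AntiSamy's implementation. The class AttributeRuleSketch is hypothetical, and the case-insensitive comparison of literals is an assumption of mine. It merely shows how a value is accepted when it equals one of the literals or fully matches an optional regular expression.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical model of an attribute definition: allowed literals plus an optional regexp.
final class AttributeRuleSketch {

    // Mirrors the literal list of the "align" attribute shown above.
    static final AttributeRuleSketch ALIGN = new AttributeRuleSketch(
        new HashSet<>(Arrays.asList("center", "left", "right", "justify", "char")), null);

    private final Set<String> literals;
    private final Pattern regexp; // may be null when only literals are defined

    AttributeRuleSketch(Set<String> literals, Pattern regexp) {
        this.literals = literals;
        this.regexp = regexp;
    }

    boolean accepts(String value) {
        // Assumption: literals are compared case-insensitively.
        if (literals.contains(value.toLowerCase(Locale.ROOT))) {
            return true;
        }
        // Otherwise the whole value must match the regexp, if one is defined.
        return regexp != null && regexp.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(ALIGN.accepts("center"));               // true
        System.out.println(ALIGN.accepts("expression(alert(1))")); // false
    }
}
```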
E) Tag rules
This section specifies HTML tags and the explicit action the framework takes when it encounters each tag. A tag definition may also reference attributes declared in the attributes section. Tags that should be allowed in user input must be flagged with action="validate". Unspecified tags are filtered: the tag itself is removed while the content between the opening and closing tag remains. This action can also be specified explicitly as 'filter'. The remove action deletes the tag together with its content. The truncate action keeps the tag but removes all of its attributes.
<tag name="script" action="remove"/>
<tag name="iframe" action="remove"/>
<tag name="style" action="remove"/>
...
<tag name="p" action="validate">
<attribute name="align"/>
</tag>
...
<tag name="br" action="truncate"/>
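The effect of the four actions can be summarized in a toy model. This is a deliberately simplified sketch and not how AntiSamy is implemented. The class TagActionSketch and its apply method are hypothetical, the element is passed in as plain strings, and for validate the attributes are assumed to have passed their checks already.

```java
// Hypothetical toy model of the tag actions; operates on strings, not on a DOM.
final class TagActionSketch {

    enum Action { VALIDATE, FILTER, TRUNCATE, REMOVE }

    // tag: element name, attrs: raw attribute text (may be empty), content: inner markup
    static String apply(Action action, String tag, String attrs, String content) {
        switch (action) {
            case VALIDATE:
                // Keep tag and attributes (assumed to be validated beforehand).
                String open = attrs.isEmpty() ? "<" + tag + ">" : "<" + tag + " " + attrs + ">";
                return open + content + "</" + tag + ">";
            case FILTER:
                // Drop the tag itself, keep what is between opening and closing tag.
                return content;
            case TRUNCATE:
                // Keep the tag, drop every attribute.
                return "<" + tag + ">" + content + "</" + tag + ">";
            case REMOVE:
                // Drop the tag together with its content.
                return "";
            default:
                throw new IllegalArgumentException("unknown action: " + action);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(Action.FILTER, "blink", "", "hello"));
        System.out.println(apply(Action.TRUNCATE, "b", "onclick=\"x()\"", "hello"));
        System.out.println(apply(Action.REMOVE, "script", "", "alert(1)"));
        System.out.println(apply(Action.VALIDATE, "p", "align=\"center\"", "hello"));
    }
}
```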
F) Tags to encode
This section lists tags that are not removed. Instead, the tag is HTML-encoded so that it shows up as plain text in the output.
<tags-to-encode>
<tag>g</tag>
<tag>grin</tag>
</tags-to-encode>
G) Global attributes
This section lists attributes that are valid for all tags without an explicit declaration in the tag rules section.
<global-tag-attributes>
<attribute name="title"/>
<attribute name="lang"/>
</global-tag-attributes>
Conclusion
Arriving at a strict policy is not an easy task. However, the developer guide[6] and the project's sample files offer a quick start with the framework, and they provide advice and examples of how large platforms approach HTML formatting of user input.
References
[1] OWASP AntiSamy https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project
[2] Samy is my hero http://en.wikipedia.org/wiki/Samy_(computer_worm)
[3] Cross-site Scripting (and XSS Shell) http://www.csnc.ch/misc/files/publications/compass_event08_xssshell_krm_v1.0.pdf
[4] antisamy-slashdot.xml example http://owaspantisamy.googlecode.com/files/antisamy-slashdot-1.4.4.xml
[5] XML External Entity Attacks http://www.csnc.ch/misc/files/publications/2010_w-jax_xml_theory_and_attacks_XXE.pdf
[6] AntiSamy Developer Guide http://owaspantisamy.googlecode.com/files/Developer%20Guide.pdf