The community hosts a neat little project called AntiSamy[1] which lends its name from the well known MySpace worm[2] and which comes in handy when trying to mitigate Cross-site Scripting[3] attacks. Whereby XSS is sometimes hard to mitigate when business is asking for HTML formatting in user supplied inputs. At that point, AntiSamy might become handy since it focuses to strip down user supplied input to a predefined set of allowed formatting (HTML tags and attributes).

The basic steps when working with AntiSamy are

  • Define a policy file (XML)
  • Sanitize user input according to policy 

The Java API code is pretty straight forward. Note, AntiSamy is to some extent also available for .NET

    AntiSamy a = new AntiSamy();
    CleanResults r = a.scan(userInput, policyPath);
    

    Thus, it all boils down to configure a strict policy. Samples are shipped with the AntiSamy framework. The file I copied snippets from is named antisamy-slashdot.xml[4] . AntiSamy policy files consist of the following major sections:

     

    A) Directives

    Directives describe the fundamental behavior of the framework and may also help to prevent XML External Entity Attacks XXE[5] with XML message based services. 

    <directive name="omitXmlDeclaration" value="true"/>
    <directive name="omitDoctypeDeclaration" value="true"/>
    <directive name="maxInputSize" value="5000"/>
    <directive name="useXHTML" value="true"/>
    <directive name="formatOutput" value="true"/>
    <directive name="embedStyleSheets" value="false"/>
    

    Hint: AntiSamy would prevent XXE when configuring omitDoctypeDeclaration ‘true’. However, I do not consider AntiSamy an appropriate variant to filter doctype declarations in a large-scale XML service environments. An application level firewall would probably better fit enterprise grade infrastructure needs. Note, the full list of directives is documented in the AntiSamy developer guide[6] and the source code.

     

    B) Common Regular Expressions

    This section lists expressions that describe contents of tags and attributes. It basically serves as a variable declaration.

    <regexp name="htmlTitle" value="[\p{L}\p{N}\s-',:[]!./\()&amp;]*"/>
    <regexp name="onsiteURL" value="([\p{L}\p{N}\/.\?=#&amp;;-~]+|#(\w)+)"/>
    <regexp name="offsiteURL" value="(\s)((ht|f)tp(s?)://|mailto:)[\p{L}\p{N}]+[~\p{L}\p{N}\p{Zs}-_.@#\$%&amp;;:,\?=/+!()](\s)*"/>
    

    Confused? It is indeed pretty difficult to write properly matching expressions. Take care not to weaken your policy in a way that would allow an adversary to pass malicious inputs. You have been warned.

     

    D) Attribute definitions

    These definitions declare potentially allowed HTML attributes and also define what values an attribute might take. Note, the value could also be any of the named regular expressions above. Note, by listing an attribute within this section does not automatically allow that attribute to be used in user input. See tags and global attributes section instead.

    <attribute name="align" description="...">
    	<literal-list>
    		<literal value="center"/>
    		<literal value="left"/>
    		<literal value="right"/>
    		<literal value="justify"/>
    		<literal value="char"/>
    	</literal-list>
    </attribute>
    

     

    E) Tag rules

    The section specifies HTML tags and explicit actions to be taken by the framework when approaching a tag. A tag definition may also reference attributes declared in the attributes section. Tags that should be allowed in user input must be flagged with action=”validate”. Unspecified tags will be deleted whereby the tag itself is removed and the content between the opening and closing tag will remain. This action can be explicitly specified as ‘filter’. The truncate action will keep the tag but remove all attributes from the tag.

    <tag name="script" action="remove"/>
    <tag name="iframe" action="remove"/>
    <tag name="style" action="remove"/>
    ...
    <tag name="p" action="validate">
    	<attribute name="align"/>
    </tag>
    ...
    <tag name="br" action="truncate"/>

     

    F) Tags to encode

    The section lists tags that will not be removed by default but its contents are being HTML encoded.

    <tags-to-encode>
    	<tag>g</tag>
    	<tag>grin</tag>
    </tags-to-encode> 

     

    G) Global attributes

    Lists attributes that are globally valid for all tags without explicit declaration within the tags section.

    <global-tag-attributes>
    	<attribute name="title"/>
    	<attribute name="lang"/>
    </global-tag-attributes>

     

    Conclusion

    Getting a strict policy is not an easy task. However, the developers guide[6] and the project sample files give a quick start at the framework and also give advice and provide examples of how large platforms approach HTML formatting of user input.

    Got more appetite on application security? Join us for the upcoming web application security trainings (held in Jona in German language).

     

    References

    [1] OWASP AntiSamy https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project
    [2] Samy is my hero http://en.wikipedia.org/wiki/Samy_(computer_worm)
    [3] Cross-site Scripting (and XSS Shell) http://www.csnc.ch/misc/files/publications/compass_event08_xssshell_krm_v1.0.pdf
    [4] antisamy-slashdot.xml example http://owaspantisamy.googlecode.com/files/antisamy-slashdot-1.4.4.xml
    [5] XML External Entity Attacks http://www.csnc.ch/misc/files/publications/2010_w-jax_xml_theory_and_attacks_XXE.pdf
    [6] AntiSamy Developer Guide http://owaspantisamy.googlecode.com/files/Developer%20Guide.pdf