Class HTMLScanner

  • All Implemented Interfaces:
    HTMLComponent, XMLComponent, XMLDocumentSource, XMLLocator, org.xml.sax.ext.Locator2, org.xml.sax.Locator

    public class HTMLScanner
    extends java.lang.Object
    implements XMLDocumentSource, XMLLocator, HTMLComponent
    A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document — it just scans what it can and generates XNI document "events", ignoring errors of all kinds.

    This component recognizes the following features:

    • http://cyberneko.org/html/features/augmentations
    • http://cyberneko.org/html/features/report-errors
    • http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/script/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/style/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/ignore-specified-charset
    • http://cyberneko.org/html/features/scanner/cdata-sections
    • http://cyberneko.org/html/features/scanner/cdata-early-closing
    • http://cyberneko.org/html/features/override-doctype
    • http://cyberneko.org/html/features/insert-doctype
    • http://cyberneko.org/html/features/parse-noscript-content
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
    • http://cyberneko.org/html/features/scanner/normalize-attrs
    • http://cyberneko.org/html/features/scanner/plain-attr-values

    This component recognizes the following properties:

    • http://cyberneko.org/html/properties/names/elems
    • http://cyberneko.org/html/properties/names/attrs
    • http://cyberneko.org/html/properties/default-encoding
    • http://cyberneko.org/html/properties/error-reporter
    • http://cyberneko.org/html/properties/encoding-translator
    • http://cyberneko.org/html/properties/doctype/pubid
    • http://cyberneko.org/html/properties/doctype/sysid
    See Also:
    HTMLElements
    • Field Detail

      • HTML_4_01_STRICT_PUBID

        public static final java.lang.String HTML_4_01_STRICT_PUBID
        HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_STRICT_SYSID

        public static final java.lang.String HTML_4_01_STRICT_SYSID
        HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_PUBID

        public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
        HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_SYSID

        public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
        HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_PUBID

        public static final java.lang.String HTML_4_01_FRAMESET_PUBID
        HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_SYSID

        public static final java.lang.String HTML_4_01_FRAMESET_SYSID
        HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
        See Also:
        Constant Field Values
      • AUGMENTATIONS

        public static final java.lang.String AUGMENTATIONS
        Include infoset augmentations.
        See Also:
        Constant Field Values
      • REPORT_ERRORS

        public static final java.lang.String REPORT_ERRORS
        Report errors.
        See Also:
        Constant Field Values
      • SCRIPT_STRIP_COMMENT_DELIMS

        public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
        See Also:
        Constant Field Values
      • SCRIPT_STRIP_CDATA_DELIMS

        public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_COMMENT_DELIMS

        public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_CDATA_DELIMS

        public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
        See Also:
        Constant Field Values
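        To illustrate what the four strip-delimiter features describe, here is a minimal sketch that removes one leading and one trailing delimiter from element content. The class and method names are hypothetical, and this is not NekoHTML's implementation (the scanner works on a character stream, not on complete strings).

```java
final class DelimiterStripSketch {

    /** Removes one leading open delimiter and one trailing close delimiter, if present. */
    static String strip(String content, String open, String close) {
        String s = content.trim();
        if (s.startsWith(open)) {
            s = s.substring(open.length());
        }
        if (s.endsWith(close)) {
            s = s.substring(0, s.length() - close.length());
        }
        return s;
    }

    /** HTML comment delimiters, as named by the *_STRIP_COMMENT_DELIMS features. */
    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    /** XHTML CDATA delimiters, as named by the *_STRIP_CDATA_DELIMS features. */
    static String stripCdataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    public static void main(String[] args) {
        System.out.println(stripCommentDelims("<!--var x = 1;-->"));
        System.out.println(stripCdataDelims("<![CDATA[p { color: red }]]>"));
    }
}
```

        Content without delimiters passes through unchanged apart from the surrounding whitespace being trimmed, which is a simplification of this sketch.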
      • IGNORE_SPECIFIED_CHARSET

        public static final java.lang.String IGNORE_SPECIFIED_CHARSET
        Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
        See Also:
        Constant Field Values
      • CDATA_SECTIONS

        public static final java.lang.String CDATA_SECTIONS
        Whether CDATA sections (<![CDATA[...]]>) are treated as proper XML CDATA sections or as HTML comments.
        See Also:
        Constant Field Values
      • CDATA_EARLY_CLOSING

        public static final java.lang.String CDATA_EARLY_CLOSING
        Whether a '>' closes the CDATA section early, before "]]>" (see the HTML specification).
        See Also:
        Constant Field Values
      • OVERRIDE_DOCTYPE

        public static final java.lang.String OVERRIDE_DOCTYPE
        Override doctype declaration public and system identifiers.
        See Also:
        Constant Field Values
      • INSERT_DOCTYPE

        public static final java.lang.String INSERT_DOCTYPE
        Insert document type declaration.
        See Also:
        Constant Field Values
      • PARSE_NOSCRIPT_CONTENT

        public static final java.lang.String PARSE_NOSCRIPT_CONTENT
        Parse <noscript>...</noscript> content.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_IFRAME

        public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
        Allow self-closing <iframe/> tags.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_SCRIPT

        public static final java.lang.String ALLOW_SELFCLOSING_SCRIPT
        Allow self-closing <script/> tags.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_TAGS

        public static final java.lang.String ALLOW_SELFCLOSING_TAGS
        Allow self-closing tags, e.g. <div/> (XHTML).
        See Also:
        Constant Field Values
      • NORMALIZE_ATTRIBUTES

        public static final java.lang.String NORMALIZE_ATTRIBUTES
        Normalize attribute values.
        See Also:
        Constant Field Values
      • PLAIN_ATTRIBUTE_VALUES

        public static final java.lang.String PLAIN_ATTRIBUTE_VALUES
        Store the plain attribute values also.
        See Also:
        Constant Field Values
      • RECOGNIZED_FEATURES

        private static final java.lang.String[] RECOGNIZED_FEATURES
        Recognized features.
      • RECOGNIZED_FEATURES_DEFAULTS

        private static final java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS
        Recognized features defaults.
      • NAMES_ELEMS

        public static final java.lang.String NAMES_ELEMS
        Modify HTML element names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • NAMES_ATTRS

        public static final java.lang.String NAMES_ATTRS
        Modify HTML attribute names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
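        The NAMES_ELEMS and NAMES_ATTRS properties accept "upper", "lower" or "default". A minimal sketch of how such a value could be mapped to a mode constant and applied to a name is shown here. The numeric constant values and the helper names are assumptions made for illustration; the real values are listed under "Constant Field Values".

```java
import java.util.Locale;

final class NamesModeSketch {

    // Assumed values, for illustration only.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps a property value to a mode; anything unrecognized means "leave names alone". */
    static short namesValue(String value) {
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        return NAMES_NO_CHANGE;
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", namesValue("upper")));
        System.out.println(modifyName("onClick", namesValue("lower")));
    }
}
```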
      • DEFAULT_ENCODING

        public static final java.lang.String DEFAULT_ENCODING
        Default encoding.
        See Also:
        Constant Field Values
      • ERROR_REPORTER

        public static final java.lang.String ERROR_REPORTER
        Error reporter.
        See Also:
        Constant Field Values
      • ENCODING_TRANSLATOR

        public static final java.lang.String ENCODING_TRANSLATOR
        Encoding translator.
        See Also:
        Constant Field Values
      • DOCTYPE_PUBID

        public static final java.lang.String DOCTYPE_PUBID
        Doctype declaration public identifier.
        See Also:
        Constant Field Values
      • DOCTYPE_SYSID

        public static final java.lang.String DOCTYPE_SYSID
        Doctype declaration system identifier.
        See Also:
        Constant Field Values
      • READER_BUFFER_SIZE

        public static final java.lang.String READER_BUFFER_SIZE
        Reader buffer size.
        See Also:
        Constant Field Values
      • RECOGNIZED_PROPERTIES

        private static final java.lang.String[] RECOGNIZED_PROPERTIES
        Recognized properties.
      • RECOGNIZED_PROPERTIES_DEFAULTS

        private static final java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS
        Recognized properties defaults.
      • STATE_CONTENT

        protected static final short STATE_CONTENT
        State: content.
        See Also:
        Constant Field Values
      • STATE_MARKUP_BRACKET

        protected static final short STATE_MARKUP_BRACKET
        State: markup bracket.
        See Also:
        Constant Field Values
      • STATE_START_DOCUMENT

        protected static final short STATE_START_DOCUMENT
        State: start document.
        See Also:
        Constant Field Values
      • STATE_END_DOCUMENT

        protected static final short STATE_END_DOCUMENT
        State: end document.
        See Also:
        Constant Field Values
      • NAMES_NO_CHANGE

        protected static final short NAMES_NO_CHANGE
        Don't modify HTML names.
        See Also:
        Constant Field Values
      • NAMES_UPPERCASE

        protected static final short NAMES_UPPERCASE
        Uppercase HTML names.
        See Also:
        Constant Field Values
      • NAMES_LOWERCASE

        protected static final short NAMES_LOWERCASE
        Lowercase HTML names.
        See Also:
        Constant Field Values
      • SCAN_TRUE

        private static final int SCAN_TRUE
        Scan return code: operation completed normally.
        See Also:
        Constant Field Values
      • SCAN_EOF

        private static final int SCAN_EOF
        Scan return code: end of entity reached (EOF).
        See Also:
        Constant Field Values
      • SCAN_FALSE

        private static final int SCAN_FALSE
        Scan return code: continue scanning (state transition).
        See Also:
        Constant Field Values
      • DEBUG_SCANNER

        private static final boolean DEBUG_SCANNER
        Set to true to debug changes in the scanner.
        See Also:
        Constant Field Values
      • DEBUG_SCANNER_STATE

        private static final boolean DEBUG_SCANNER_STATE
        Set to true to debug changes in the scanner state.
        See Also:
        Constant Field Values
      • DEBUG_BUFFER

        private static final boolean DEBUG_BUFFER
        Set to true to debug the buffer.
        See Also:
        Constant Field Values
      • DEBUG_CHARSET

        private static final boolean DEBUG_CHARSET
        Set to true to debug character encoding handling.
        See Also:
        Constant Field Values
      • DEBUG_CALLBACKS

        protected static final boolean DEBUG_CALLBACKS
        Set to true to debug callbacks.
        See Also:
        Constant Field Values
      • fAugmentations_

        private boolean fAugmentations_
        Augmentations.
      • fReportErrors_

        boolean fReportErrors_
        Report errors.
      • fScriptStripCDATADelims_

        boolean fScriptStripCDATADelims_
        Strip CDATA delimiters from SCRIPT tags.
      • fScriptStripCommentDelims_

        boolean fScriptStripCommentDelims_
        Strip comment delimiters from SCRIPT tags.
      • fStyleStripCDATADelims_

        boolean fStyleStripCDATADelims_
        Strip CDATA delimiters from STYLE tags.
      • fStyleStripCommentDelims_

        boolean fStyleStripCommentDelims_
        Strip comment delimiters from STYLE tags.
      • fIgnoreSpecifiedCharset_

        boolean fIgnoreSpecifiedCharset_
        Ignore specified character set.
      • fCDATASections_

        boolean fCDATASections_
        CDATA sections.
      • fCDATAEarlyClosing_

        boolean fCDATAEarlyClosing_
        CDATA early closing.
      • fOverrideDoctype_

        private boolean fOverrideDoctype_
        Override doctype declaration public and system identifiers.
      • fInsertDoctype_

        boolean fInsertDoctype_
        Insert document type declaration.
      • fNormalizeAttributes_

        boolean fNormalizeAttributes_
        Normalize attribute values.
      • fPlainAttributeValues_

        boolean fPlainAttributeValues_
        Store the plain attribute values also.
      • fParseNoScriptContent_

        boolean fParseNoScriptContent_
        Parse noscript content.
      • fAllowSelfclosingIframe_

        boolean fAllowSelfclosingIframe_
        Allow self-closing iframe tags.
      • fAllowSelfclosingScript_

        boolean fAllowSelfclosingScript_
        Allow self-closing script tags.
      • fAllowSelfclosingTags_

        boolean fAllowSelfclosingTags_
        Allow self-closing tags.
      • fNamesElems

        protected short fNamesElems
        Modify HTML element names.
      • fNamesAttrs

        protected short fNamesAttrs
        Modify HTML attribute names.
      • fDefaultIANAEncoding

        protected java.lang.String fDefaultIANAEncoding
        Default encoding.
      • fDoctypePubid

        protected java.lang.String fDoctypePubid
        Doctype declaration public identifier.
      • fDoctypeSysid

        protected java.lang.String fDoctypeSysid
        Doctype declaration system identifier.
      • fReaderBufferSize

        private int fReaderBufferSize
      • fBeginLineNumber

        protected int fBeginLineNumber
        Beginning line number.
      • fBeginColumnNumber

        protected int fBeginColumnNumber
        Beginning column number.
      • fBeginCharacterOffset

        protected int fBeginCharacterOffset
        Beginning character offset in the file.
      • fScannerState

        protected short fScannerState
        The current scanner state.
      • fIANAEncoding

        protected java.lang.String fIANAEncoding
        Auto-detected IANA encoding.
      • fJavaEncoding

        protected java.lang.String fJavaEncoding
        Auto-detected Java encoding.
      • fElementCount

        protected int fElementCount
        Element count.
      • fElementDepth

        protected int fElementDepth
        Element depth.
      • fFragmentSpecialScannerTag_

        private java.lang.String fFragmentSpecialScannerTag_
        If not empty, this is the name of the tag that requires the special scanner.
      • fSpecialScanner

        protected final HTMLScanner.SpecialScanner fSpecialScanner
        Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
      • fStringBuffer

        protected final XMLString fStringBuffer
        String buffer.
      • fStringBufferEntityRef

        final XMLString fStringBufferEntityRef
        String buffer used when resolving entity refs.
      • fStringBufferPlainAttribValue

        final XMLString fStringBufferPlainAttribValue
      • fScanUntilEndTag

        final XMLString fScanUntilEndTag
      • fScanLiteral

        private final XMLString fScanLiteral
      • fSingleBoolean

        final boolean[] fSingleBoolean
        Reusable single-element boolean array used as an out-parameter.

        Performance optimization: This array is reused across method calls to avoid allocating a new single-element array on every invocation of methods like scanStartElement and scanAttribute. The array is reset before each use to ensure correctness.

        Thread safety: Safe because scanner instances are single-threaded.
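
        The reusable out-parameter pattern described above can be sketched as follows. The scanTag helper and its parsing rules are hypothetical and far simpler than scanStartElement or scanAttribute; the point is only that one array is allocated per instance and reset before every call.

```java
final class OutParamSketch {

    // Allocated once per instance and reused; not safe for concurrent use.
    private final boolean[] singleBoolean = { false };

    /** Hypothetical scan step: returns the tag name and reports through out[0] whether the tag was self-closing. */
    private static String scanTag(String tag, boolean[] out) {
        boolean empty = tag.endsWith("/>");
        out[0] = empty;
        int end = tag.length() - (empty ? 2 : 1);
        return tag.substring(1, end).trim();
    }

    String describe(String tag) {
        singleBoolean[0] = false; // reset before each use
        String name = scanTag(tag, singleBoolean);
        return name + (singleBoolean[0] ? " (empty)" : " (open)");
    }

    public static void main(String[] args) {
        OutParamSketch sketch = new OutParamSketch();
        System.out.println(sketch.describe("<br/>"));
        System.out.println(sketch.describe("<div>"));
    }
}
```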

      • fLocationItem

        private final LocationItem fLocationItem
        Reusable location item. The Augmentations contract allows the same instance to be reused, which saves memory.
    • Constructor Detail

      • HTMLScanner

        HTMLScanner​(HTMLConfiguration htmlConfiguration)
        Creates a new HTMLScanner with the given configuration.
        Parameters:
        htmlConfiguration - the configuration to use
    • Method Detail

      • pushInputSource

        public void pushInputSource​(XMLInputSource inputSource)
        Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns where it left off at the time this entity source was pushed.

        Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.

        Parameters:
        inputSource - The new input source to start scanning.
        See Also:
        evaluateInputSource(XMLInputSource)
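        The entity-stack behavior described for pushInputSource can be pictured with the following sketch, which uses plain java.io.Reader objects instead of XMLInputSource. All names are hypothetical and the scanner's actual bookkeeping (buffers, line and column tracking) is omitted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

final class EntityStackSketch {

    private final Deque<Reader> suspended = new ArrayDeque<>();
    private Reader current;

    EntityStackSketch(Reader initial) {
        this.current = initial;
    }

    /** Suspends the current source and starts reading from the pushed one. */
    void push(Reader source) {
        suspended.push(current);
        current = source;
    }

    /** Reads one character; at the end of a pushed source, resumes the suspended one. */
    int read() throws IOException {
        int c = current.read();
        while (c == -1 && !suspended.isEmpty()) {
            current.close();
            current = suspended.pop();
            c = current.read();
        }
        return c;
    }

    static String demo() {
        try {
            EntityStackSketch in = new EntityStackSketch(new StringReader("ab"));
            StringBuilder out = new StringBuilder();
            out.append((char) in.read());       // reads 'a' from the original source
            in.push(new StringReader("XY"));    // e.g. output written by an embedded script
            int c;
            while ((c = in.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // aXYb
    }
}
```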
      • getReader

        private java.io.Reader getReader​(XMLInputSource inputSource)
      • evaluateInputSource

        public void evaluateInputSource​(XMLInputSource inputSource)
        Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
        Parameters:
        inputSource - The new input source to start evaluating.
        See Also:
        pushInputSource(XMLInputSource)
      • cleanup

        public void cleanup​(boolean closeall)
        Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
        Parameters:
        closeall - Whether to close all streams, including the original. Pass false when the application opened the original document stream and is responsible for closing it.
      • getEncoding

        public java.lang.String getEncoding()
        Returns the encoding.
        Specified by:
        getEncoding in interface org.xml.sax.ext.Locator2
      • getPublicId

        public java.lang.String getPublicId()
        Returns the public identifier.
        Specified by:
        getPublicId in interface org.xml.sax.Locator
      • getBaseSystemId

        public java.lang.String getBaseSystemId()
        Returns the base system identifier.
        Specified by:
        getBaseSystemId in interface XMLLocator
        Returns:
        the base system identifier.
      • getLiteralSystemId

        public java.lang.String getLiteralSystemId()
        Returns the literal system identifier.
        Specified by:
        getLiteralSystemId in interface XMLLocator
        Returns:
        the literal system identifier.
      • getSystemId

        public java.lang.String getSystemId()
        Returns the expanded system identifier.
        Specified by:
        getSystemId in interface org.xml.sax.Locator
      • getLineNumber

        public int getLineNumber()
        Returns the current line number.
        Specified by:
        getLineNumber in interface org.xml.sax.Locator
      • getColumnNumber

        public int getColumnNumber()
        Returns the current column number.
        Specified by:
        getColumnNumber in interface org.xml.sax.Locator
      • getXMLVersion

        public java.lang.String getXMLVersion()
        Returns the XML version.
        Specified by:
        getXMLVersion in interface org.xml.sax.ext.Locator2
      • getCharacterOffset

        public int getCharacterOffset()
        Returns the character offset.
        Specified by:
        getCharacterOffset in interface XMLLocator
        Returns:
        the character offset, or -1 if no character offset is available.
      • getFeatureDefault

        public java.lang.Boolean getFeatureDefault​(java.lang.String featureId)
        Returns the default state for a feature.
        Specified by:
        getFeatureDefault in interface HTMLComponent
        Specified by:
        getFeatureDefault in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        Returns:
        the default state for a feature, or null if this component does not want to report a default value for this feature.
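        One way to picture the contract above is a lookup over two parallel arrays, in the spirit of RECOGNIZED_FEATURES and RECOGNIZED_FEATURES_DEFAULTS. This is a sketch under assumptions: only two features are listed, and the default values shown are placeholders, not the documented defaults.

```java
final class FeatureDefaultsSketch {

    // Two illustrative entries only.
    private static final String[] RECOGNIZED_FEATURES = {
        "http://cyberneko.org/html/features/augmentations",
        "http://cyberneko.org/html/features/report-errors",
    };

    // Placeholder defaults; null means "no default reported".
    private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS = {
        null,
        Boolean.FALSE,
    };

    /** Returns the default for a recognized feature, or null if none is reported or the id is unknown. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return RECOGNIZED_FEATURES_DEFAULTS[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getFeatureDefault("http://cyberneko.org/html/features/report-errors"));
        System.out.println(getFeatureDefault("http://example.org/unknown"));
    }
}
```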
      • getPropertyDefault

        public java.lang.Object getPropertyDefault​(java.lang.String propertyId)
        Returns the default state for a property.
        Specified by:
        getPropertyDefault in interface HTMLComponent
        Specified by:
        getPropertyDefault in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        Returns:
        the default state for a property, or null if this component does not want to report a default value for this property.
      • getRecognizedFeatures

        public java.lang.String[] getRecognizedFeatures()
        Returns recognized features.
        Specified by:
        getRecognizedFeatures in interface XMLComponent
        Returns:
        an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
      • getRecognizedProperties

        public java.lang.String[] getRecognizedProperties()
        Returns recognized properties.
        Specified by:
        getRecognizedProperties in interface XMLComponent
        Returns:
        an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
      • setFeature

        public void setFeature​(java.lang.String featureId,
                               boolean state)
        Sets a feature.
        Specified by:
        setFeature in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        state - The state of the feature.
      • setProperty

        public void setProperty​(java.lang.String propertyId,
                                java.lang.Object value)
                         throws XMLConfigurationException
        Sets a property.
        Specified by:
        setProperty in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        value - The value of the property.
        Throws:
        XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
      • setInputSource

        public void setInputSource​(XMLInputSource source)
                            throws java.io.IOException
        Sets the input source.
        Parameters:
        source - The input source.
        Throws:
        java.io.IOException - Thrown on i/o error.
      • scanDocument

        public boolean scanDocument​(boolean complete)
                             throws XNIException,
                                    java.io.IOException
        Scans a document.
        Parameters:
        complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
        Returns:
        True if there is more to scan, false otherwise.
        Throws:
        java.io.IOException - Thrown on i/o error.
        XNIException - on error.
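        The difference between complete and incremental ("pull") scanning can be sketched with a toy scanner. The PullScanner interface and TokenScanner class are hypothetical stand-ins that only mimic the boolean contract described above; they do not use XNI and throw no exceptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class PullScanSketch {

    /** Hypothetical stand-in for the scanDocument(boolean) contract. */
    interface PullScanner {
        boolean scanDocument(boolean complete);
    }

    /** Toy scanner that emits one token per incremental call. */
    static final class TokenScanner implements PullScanner {
        private final Iterator<String> tokens;
        final List<String> events = new ArrayList<>();

        TokenScanner(List<String> tokens) {
            this.tokens = tokens.iterator();
        }

        @Override
        public boolean scanDocument(boolean complete) {
            do {
                if (!tokens.hasNext()) {
                    return false;
                }
                events.add(tokens.next());
            } while (complete);
            return tokens.hasNext(); // true if there is more to scan
        }
    }

    /** Drives a scanner in pull mode and counts how many calls were needed. */
    static int countIncrementalCalls(PullScanner scanner) {
        int calls = 1;
        while (scanner.scanDocument(false)) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        TokenScanner scanner = new TokenScanner(List.of("<p>", "text", "</p>"));
        System.out.println(countIncrementalCalls(scanner) + " calls, events: " + scanner.events);
    }
}
```

        In pull mode the caller can do other work between calls; with complete set to true, a single call delivers every event.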
      • getValue

        protected static java.lang.String getValue​(XMLAttributes attrs,
                                                   java.lang.String aname)
      • systemId

        public static java.lang.String systemId​(java.lang.String systemId,
                                                java.lang.String baseSystemId)
        Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
        Parameters:
        systemId - The systemId to be expanded.
        baseSystemId - The base system identifier against which systemId is expanded.
        Returns:
        Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
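        The documented contract (a URI string when the id can be expanded, null when it is already expanded, an exception on failure) can be approximated with java.net.URI. This is a sketch of the contract as stated here, not the method's actual implementation, and it assumes baseSystemId is a non-null absolute URI.

```java
import java.net.URI;

final class SystemIdSketch {

    /**
     * Resolves a relative system id against a base.
     * Returns null when the id is already absolute; URI.create throws
     * IllegalArgumentException when an id cannot be parsed.
     */
    static String expand(String systemId, String baseSystemId) {
        URI uri = URI.create(systemId);
        if (uri.isAbsolute()) {
            return null; // already expanded
        }
        return URI.create(baseSystemId).resolve(uri).toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("b.html", "http://example.com/a/index.html"));
        System.out.println(expand("http://example.com/x", "http://example.com/"));
    }
}
```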
      • fixURI

        protected static java.lang.String fixURI​(java.lang.String str)
        Fixes a platform dependent filename to standard URI form.
        Parameters:
        str - The string to fix.
        Returns:
        Returns the fixed URI string.
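        A minimal sketch of this kind of normalization is shown here: separators become forward slashes and a Windows drive letter gets a leading slash. The rules and the extra separator parameter are assumptions chosen to keep the example deterministic on any platform; the actual method may handle more cases.

```java
import java.io.File;

final class FixUriSketch {

    /** Converts a platform-dependent path to URI-style form using the given separator. */
    static String fixUri(String str, char separator) {
        String s = str.replace(separator, '/');
        // A drive letter such as "C:/dir" becomes "/C:/dir".
        if (s.length() >= 2 && s.charAt(1) == ':' && Character.isLetter(s.charAt(0))) {
            s = "/" + s;
        }
        return s;
    }

    /** Convenience overload that uses the separator of the current platform. */
    static String fixUri(String str) {
        return fixUri(str, File.separatorChar);
    }

    public static void main(String[] args) {
        System.out.println(fixUri("C:\\docs\\a.html", '\\'));
        System.out.println(fixUri("/tmp/a.html", '/'));
    }
}
```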
      • getNamesValue

        protected static short getNamesValue​(java.lang.String value)
      • setScannerState

        protected void setScannerState​(short state)
      • scanDoctype

        protected int scanDoctype()
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • scanLiteral

        protected int scanLiteral()
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • scanName

        protected java.lang.String scanName​(boolean strict,
                                            short mode)
                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • scanTagName

        protected java.lang.String scanTagName()
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • scanEntityRef

        protected int scanEntityRef​(XMLString str,
                                    XMLString plainValue,
                                    boolean content)
                             throws java.io.IOException
        Throws:
        java.io.IOException
      • returnEntityRefString

        private int returnEntityRefString​(XMLString str,
                                          boolean content)
      • synthesizedAugs

        protected final Augmentations synthesizedAugs()
      • isEncodingCompatible

        static boolean isEncodingCompatible​(java.lang.String encoding1,
                                            java.lang.String encoding2)
        Detects whether two encodings are compatible. Both must be able to read the meta tag specifying the new encoding, which means that the byte representation of some minimal HTML markup must be the same in both encodings.
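        The byte-comparison idea can be sketched with java.nio.charset. The probe string below is an assumption standing in for the "minimal HTML markup", and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class EncodingCompatSketch {

    // Assumed probe: the kind of markup needed to read a charset-declaring meta tag.
    private static final String PROBE =
            "<meta http-equiv='Content-Type' content='text/html;charset=x'>";

    /** Two encodings count as compatible here if they encode the probe to identical bytes. */
    static boolean isCompatible(String encoding1, String encoding2) {
        byte[] bytes1 = PROBE.getBytes(Charset.forName(encoding1));
        byte[] bytes2 = PROBE.getBytes(Charset.forName(encoding2));
        return Arrays.equals(bytes1, bytes2);
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("UTF-8", "ISO-8859-1")); // true: both are ASCII-compatible
        System.out.println(isCompatible("UTF-8", "UTF-16"));     // false: two bytes per character
    }
}
```

        Charset.forName throws an unchecked exception for unknown or illegal names, so a caller of this sketch would need to decide how to treat encodings the runtime does not support.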
      • canRoundtrip

        private static boolean canRoundtrip​(java.lang.String encodeCharset,
                                            java.lang.String decodeCharset)
                                     throws java.io.UnsupportedEncodingException
        Throws:
        java.io.UnsupportedEncodingException