Package org.htmlunit.cyberneko
Class HTMLScanner
- java.lang.Object
-
- org.htmlunit.cyberneko.HTMLScanner
-
- All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentSource, XMLLocator, org.xml.sax.ext.Locator2, org.xml.sax.Locator
public class HTMLScanner extends java.lang.Object implements XMLDocumentSource, XMLLocator, HTMLComponent
A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/script/strip-comment-delims
- http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/style/strip-comment-delims
- http://cyberneko.org/html/features/scanner/ignore-specified-charset
- http://cyberneko.org/html/features/scanner/cdata-sections
- http://cyberneko.org/html/features/scanner/cdata-early-closing
- http://cyberneko.org/html/features/override-doctype
- http://cyberneko.org/html/features/insert-doctype
- http://cyberneko.org/html/features/parse-noscript-content
- http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
- http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
- http://cyberneko.org/html/features/scanner/normalize-attrs
- http://cyberneko.org/html/features/scanner/plain-attr-values
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/default-encoding
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/encoding-translator
- http://cyberneko.org/html/properties/doctype/pubid
- http://cyberneko.org/html/properties/doctype/sysid
- See Also:
HTMLElements
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- class HTMLScanner.ContentScanner: The primary HTML document scanner.
- (package private) static class HTMLScanner.CurrentEntity: Current entity.
- class HTMLScanner.PlainTextScanner: Special scanner used for PLAINTEXT.
- static interface HTMLScanner.Scanner: Basic scanner interface.
- private static class HTMLScanner.ScanScriptState
- class HTMLScanner.ScriptScanner: Special scanner used for scripts.
- class HTMLScanner.SpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static java.lang.String ALLOW_SELFCLOSING_IFRAME: Allows self closing <iframe/> tag.
- static java.lang.String ALLOW_SELFCLOSING_SCRIPT: Allows self closing <script/> tag.
- static java.lang.String ALLOW_SELFCLOSING_TAGS: Allows self closing tags e.g. <div/> (XHTML).
- static java.lang.String AUGMENTATIONS: Include infoset augmentations.
- static java.lang.String CDATA_EARLY_CLOSING: '>' closes the cdata section (see html spec).
- static java.lang.String CDATA_SECTIONS: Whether CDATA sections (<![CDATA[...]]>) are treated as proper XML CDATA sections or as HTML comments.
- private static boolean DEBUG_BUFFER: Set to true to debug the buffer.
- protected static boolean DEBUG_CALLBACKS: Set to true to debug callbacks.
- private static boolean DEBUG_CHARSET: Set to true to debug character encoding handling.
- private static boolean DEBUG_SCANNER: Set to true to debug changes in the scanner.
- private static boolean DEBUG_SCANNER_STATE: Set to true to debug changes in the scanner state.
- static java.lang.String DEFAULT_ENCODING: Default encoding.
- static java.lang.String DOCTYPE_PUBID: Doctype declaration public identifier.
- static java.lang.String DOCTYPE_SYSID: Doctype declaration system identifier.
- static java.lang.String ENCODING_TRANSLATOR: Encoding translator.
- static java.lang.String ERROR_REPORTER: Error reporter.
- (package private) boolean fAllowSelfclosingIframe_: Allows self closing iframe tags.
- (package private) boolean fAllowSelfclosingScript_: Allows self closing script tags.
- (package private) boolean fAllowSelfclosingTags_: Allows self closing tags.
- private boolean fAugmentations_: Augmentations.
- protected int fBeginCharacterOffset: Beginning character offset in the file.
- protected int fBeginColumnNumber: Beginning column number.
- protected int fBeginLineNumber: Beginning line number.
- protected PlaybackInputStream fByteStream: The playback byte stream.
- (package private) boolean fCDATAEarlyClosing_: CDATA early closing.
- (package private) boolean fCDATASections_: CDATA sections.
- protected HTMLScanner.Scanner fContentScanner: Content scanner.
- protected HTMLScanner.CurrentEntity fCurrentEntity: Current entity.
- protected MiniStack<HTMLScanner.CurrentEntity> fCurrentEntityStack: The current entity stack.
- protected java.lang.String fDefaultIANAEncoding: Default encoding.
- protected java.lang.String fDoctypePubid: Doctype declaration public identifier.
- protected java.lang.String fDoctypeSysid: Doctype declaration system identifier.
- protected XMLDocumentHandler fDocumentHandler: The document handler.
- protected int fElementCount: Element count.
- protected int fElementDepth: Element depth.
- protected EncodingTranslator fEncodingTranslator: Encoding translator.
- protected HTMLErrorReporter fErrorReporter: Error reporter.
- private java.lang.String fFragmentSpecialScannerTag_: If not empty, this is the name of the tag that requires the special scanner.
- protected java.lang.String fIANAEncoding: Auto-detected IANA encoding.
- (package private) boolean fIgnoreSpecifiedCharset_: Ignore specified character set.
- (package private) boolean fInsertDoctype_: Insert document type declaration.
- protected java.lang.String fJavaEncoding: Auto-detected Java encoding.
- private LocationItem fLocationItem: Our location item, to be reused because Augmentations says so; this saves memory.
- protected short fNamesAttrs: Modify HTML attribute names.
- protected short fNamesElems: Modify HTML element names.
- (package private) boolean fNormalizeAttributes_: Normalize attribute values.
- private boolean fOverrideDoctype_: Override doctype declaration public and system identifiers.
- (package private) boolean fParseNoScriptContent_: Parse noscript content.
- (package private) boolean fPlainAttributeValues_: Store the plain attribute values also.
- private int fReaderBufferSize
- (package private) boolean fReportErrors_: Report errors.
- (package private) XMLString fScanComment
- private XMLString fScanLiteral
- protected HTMLScanner.Scanner fScanner: The current scanner.
- protected short fScannerState: The current scanner state.
- (package private) XMLString fScanUntilEndTag
- protected HTMLScanner.ScriptScanner fScriptScanner: Special scanner used for script tags.
- (package private) boolean fScriptStripCDATADelims_: Strip CDATA delimiters from SCRIPT tags.
- (package private) boolean fScriptStripCommentDelims_: Strip comment delimiters from SCRIPT tags.
- (package private) boolean[] fSingleBoolean: Reusable single-element boolean array used as an out-parameter.
- protected HTMLScanner.SpecialScanner fSpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
- protected XMLString fStringBuffer: String buffer.
- (package private) XMLString fStringBufferEntityRef: String buffer used when resolving entity refs.
- (package private) XMLString fStringBufferPlainAttribValue
- (package private) boolean fStyleStripCDATADelims_: Strip CDATA delimiters from STYLE tags.
- (package private) boolean fStyleStripCommentDelims_: Strip comment delimiters from STYLE tags.
- static java.lang.String HTML_4_01_FRAMESET_PUBID: HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- static java.lang.String HTML_4_01_FRAMESET_SYSID: HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- static java.lang.String HTML_4_01_STRICT_PUBID: HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- static java.lang.String HTML_4_01_STRICT_SYSID: HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- static java.lang.String HTML_4_01_TRANSITIONAL_PUBID: HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- static java.lang.String HTML_4_01_TRANSITIONAL_SYSID: HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- (package private) HTMLConfiguration htmlConfiguration_
- static java.lang.String IGNORE_SPECIFIED_CHARSET: Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- static java.lang.String INSERT_DOCTYPE: Insert document type declaration.
- static java.lang.String NAMES_ATTRS: Modify HTML attribute names: { "upper", "lower", "default" }.
- static java.lang.String NAMES_ELEMS: Modify HTML element names: { "upper", "lower", "default" }.
- protected static short NAMES_LOWERCASE: Lowercase HTML names.
- protected static short NAMES_NO_CHANGE: Don't modify HTML names.
- protected static short NAMES_UPPERCASE: Uppercase HTML names.
- static java.lang.String NORMALIZE_ATTRIBUTES: Normalize attribute values.
- static java.lang.String OVERRIDE_DOCTYPE: Override doctype declaration public and system identifiers.
- static java.lang.String PARSE_NOSCRIPT_CONTENT: Parse <noscript>...</noscript> content.
- static java.lang.String PLAIN_ATTRIBUTE_VALUES: Store the plain attribute values also.
- static java.lang.String READER_BUFFER_SIZE: Reader buffer size.
- private static java.lang.String[] RECOGNIZED_FEATURES: Recognized features.
- private static java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS: Recognized features defaults.
- private static java.lang.String[] RECOGNIZED_PROPERTIES: Recognized properties.
- private static java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS: Recognized properties defaults.
- static java.lang.String REPORT_ERRORS: Report errors.
- private static int SCAN_EOF: Scan return code: end of entity reached (EOF).
- private static int SCAN_FALSE: Scan return code: continue scanning (state transition).
- private static int SCAN_TRUE: Scan return code: operation completed normally.
- static java.lang.String SCRIPT_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- static java.lang.String SCRIPT_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- protected static short STATE_CONTENT: State: content.
- protected static short STATE_END_DOCUMENT: State: end document.
- protected static short STATE_MARKUP_BRACKET: State: markup bracket.
- protected static short STATE_START_DOCUMENT: State: start document.
- static java.lang.String STYLE_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- static java.lang.String STYLE_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
-
Constructor Summary
Constructors (Constructor, Description):
- HTMLScanner(HTMLConfiguration htmlConfiguration): Creates a new HTMLScanner with the given configuration.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- private static boolean canRoundtrip(java.lang.String encodeCharset, java.lang.String decodeCharset)
- void cleanup(boolean closeall): Cleans up used resources.
- void evaluateInputSource(XMLInputSource inputSource): Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- protected static java.lang.String fixURI(java.lang.String str): Fixes a platform dependent filename to standard URI form.
- java.lang.String getBaseSystemId(): Returns the base system identifier.
- int getCharacterOffset(): Returns the character offset.
- int getColumnNumber(): Returns the current column number.
- XMLDocumentHandler getDocumentHandler()
- java.lang.String getEncoding(): Returns the encoding.
- java.lang.Boolean getFeatureDefault(java.lang.String featureId): Returns the default state for a feature.
- int getLineNumber(): Returns the current line number.
- java.lang.String getLiteralSystemId(): Returns the literal system identifier.
- protected static short getNamesValue(java.lang.String value)
- java.lang.Object getPropertyDefault(java.lang.String propertyId): Returns the default state for a property.
- java.lang.String getPublicId(): Returns the public identifier.
- private java.io.Reader getReader(XMLInputSource inputSource)
- java.lang.String[] getRecognizedFeatures(): Returns recognized features.
- java.lang.String[] getRecognizedProperties(): Returns recognized properties.
- java.lang.String getSystemId(): Returns the expanded system identifier.
- protected static java.lang.String getValue(XMLAttributes attrs, java.lang.String aname)
- java.lang.String getXMLVersion(): Returns the XML version.
- (package private) static boolean isEncodingCompatible(java.lang.String encoding1, java.lang.String encoding2): To detect if two encodings are compatible, both must be able to read the meta tag specifying the new encoding.
- protected Augmentations locationAugs(HTMLScanner.CurrentEntity currentEntity)
- void pushInputSource(XMLInputSource inputSource): Pushes an input source onto the current entity stack.
- void reset(XMLComponentManager manager): Resets the component.
- private int returnEntityRefString(XMLString str, boolean content)
- protected int scanDoctype()
- boolean scanDocument(boolean complete): Scans a document.
- protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content)
- protected int scanLiteral()
- protected java.lang.String scanName(boolean strict, short mode)
- protected java.lang.String scanTagName()
- void setDocumentHandler(XMLDocumentHandler handler): Sets the document handler.
- void setFeature(java.lang.String featureId, boolean state): Sets a feature.
- void setInputSource(XMLInputSource source): Sets the input source.
- void setProperty(java.lang.String propertyId, java.lang.Object value): Sets a property.
- protected void setScanner(HTMLScanner.Scanner scanner)
- protected void setScannerState(short state)
- protected Augmentations synthesizedAugs()
- static java.lang.String systemId(java.lang.String systemId, java.lang.String baseSystemId): Expands a system id and returns the system id as a URI, if it can be expanded.
-
-
-
Field Detail
-
HTML_4_01_STRICT_PUBID
public static final java.lang.String HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").- See Also:
- Constant Field Values
-
HTML_4_01_STRICT_SYSID
public static final java.lang.String HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_PUBID
public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_SYSID
public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_PUBID
public static final java.lang.String HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_SYSID
public static final java.lang.String HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").- See Also:
- Constant Field Values
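The public and system identifier constants above are the two quoted strings of an HTML 4.01 DOCTYPE declaration. The following is a minimal sketch of how such a pair combines into a declaration. The class DoctypeSketch and its doctype helper are hypothetical and not part of HTMLScanner; only the two identifier strings are taken from this page.

```java
/** Sketch: formats a DOCTYPE declaration from public and system identifiers. */
final class DoctypeSketch {
    // Values quoted from the HTML 4.01 strict constants documented above.
    static final String STRICT_PUBID = "-//W3C//DTD HTML 4.01//EN";
    static final String STRICT_SYSID = "http://www.w3.org/TR/html4/strict.dtd";

    /** Hypothetical helper, not part of HTMLScanner. */
    static String doctype(String root, String pubid, String sysid) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(root);
        if (pubid != null) {
            // A public identifier may be followed by an optional system identifier.
            sb.append(" PUBLIC \"").append(pubid).append('"');
            if (sysid != null) {
                sb.append(" \"").append(sysid).append('"');
            }
        } else if (sysid != null) {
            sb.append(" SYSTEM \"").append(sysid).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(doctype("HTML", STRICT_PUBID, STRICT_SYSID));
    }
}
```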
-
AUGMENTATIONS
public static final java.lang.String AUGMENTATIONS
Include infoset augmentations.- See Also:
- Constant Field Values
-
REPORT_ERRORS
public static final java.lang.String REPORT_ERRORS
Report errors.- See Also:
- Constant Field Values
-
SCRIPT_STRIP_COMMENT_DELIMS
public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- See Also:
- Constant Field Values
-
SCRIPT_STRIP_CDATA_DELIMS
public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.- See Also:
- Constant Field Values
-
STYLE_STRIP_COMMENT_DELIMS
public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_CDATA_DELIMS
public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.- See Also:
- Constant Field Values
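To illustrate what the four strip-delimiter features describe, here is a rough sketch that removes one surrounding delimiter pair from script or style text. It is an assumption-laden illustration: StripDelimsSketch and its methods are hypothetical, and the scanner itself may handle whitespace and partial delimiters differently.

```java
/** Sketch: strips a surrounding comment or CDATA delimiter pair from element content. */
final class StripDelimsSketch {
    /** Hypothetical helper: removes the pair only when both delimiters are present. */
    static String strip(String content, String open, String close) {
        String trimmed = content.trim();
        if (trimmed.length() >= open.length() + close.length()
                && trimmed.startsWith(open) && trimmed.endsWith(close)) {
            return trimmed.substring(open.length(), trimmed.length() - close.length());
        }
        return content; // leave the content untouched when the pair is not present
    }

    static String stripComment(String content) {
        return strip(content, "<!--", "-->");
    }

    static String stripCdata(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    public static void main(String[] args) {
        System.out.println(stripComment("<!-- alert(1); -->"));
        System.out.println(stripCdata("<![CDATA[x<y]]>"));
    }
}
```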
-
IGNORE_SPECIFIED_CHARSET
public static final java.lang.String IGNORE_SPECIFIED_CHARSET
Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- See Also:
- Constant Field Values
-
CDATA_SECTIONS
public static final java.lang.String CDATA_SECTIONS
Whether CDATA sections (<![CDATA[...]]>) are treated as proper XML CDATA sections or as HTML comments.- See Also:
- Constant Field Values
-
CDATA_EARLY_CLOSING
public static final java.lang.String CDATA_EARLY_CLOSING
'>' closes the CDATA section (see the HTML spec).
- See Also:
- Constant Field Values
-
OVERRIDE_DOCTYPE
public static final java.lang.String OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.- See Also:
- Constant Field Values
-
INSERT_DOCTYPE
public static final java.lang.String INSERT_DOCTYPE
Insert document type declaration.- See Also:
- Constant Field Values
-
PARSE_NOSCRIPT_CONTENT
public static final java.lang.String PARSE_NOSCRIPT_CONTENT
Parse <noscript>...</noscript> content.- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_IFRAME
public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
Allows self closing <iframe/> tag.- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_SCRIPT
public static final java.lang.String ALLOW_SELFCLOSING_SCRIPT
Allows self closing <script/> tag.- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_TAGS
public static final java.lang.String ALLOW_SELFCLOSING_TAGS
Allows self closing tags e.g. <div/> (XHTML).
- See Also:
- Constant Field Values
-
NORMALIZE_ATTRIBUTES
public static final java.lang.String NORMALIZE_ATTRIBUTES
Normalize attribute values.- See Also:
- Constant Field Values
-
PLAIN_ATTRIBUTE_VALUES
public static final java.lang.String PLAIN_ATTRIBUTE_VALUES
Store the plain attribute values also.- See Also:
- Constant Field Values
-
RECOGNIZED_FEATURES
private static final java.lang.String[] RECOGNIZED_FEATURES
Recognized features.
-
RECOGNIZED_FEATURES_DEFAULTS
private static final java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS
Recognized features defaults.
-
NAMES_ELEMS
public static final java.lang.String NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.- See Also:
- Constant Field Values
-
NAMES_ATTRS
public static final java.lang.String NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.- See Also:
- Constant Field Values
-
DEFAULT_ENCODING
public static final java.lang.String DEFAULT_ENCODING
Default encoding.- See Also:
- Constant Field Values
-
ERROR_REPORTER
public static final java.lang.String ERROR_REPORTER
Error reporter.- See Also:
- Constant Field Values
-
ENCODING_TRANSLATOR
public static final java.lang.String ENCODING_TRANSLATOR
Encoding translator.- See Also:
- Constant Field Values
-
DOCTYPE_PUBID
public static final java.lang.String DOCTYPE_PUBID
Doctype declaration public identifier.- See Also:
- Constant Field Values
-
DOCTYPE_SYSID
public static final java.lang.String DOCTYPE_SYSID
Doctype declaration system identifier.- See Also:
- Constant Field Values
-
READER_BUFFER_SIZE
public static final java.lang.String READER_BUFFER_SIZE
Reader buffer size.- See Also:
- Constant Field Values
-
RECOGNIZED_PROPERTIES
private static final java.lang.String[] RECOGNIZED_PROPERTIES
Recognized properties.
-
RECOGNIZED_PROPERTIES_DEFAULTS
private static final java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS
Recognized properties defaults.
-
STATE_CONTENT
protected static final short STATE_CONTENT
State: content.- See Also:
- Constant Field Values
-
STATE_MARKUP_BRACKET
protected static final short STATE_MARKUP_BRACKET
State: markup bracket.- See Also:
- Constant Field Values
-
STATE_START_DOCUMENT
protected static final short STATE_START_DOCUMENT
State: start document.- See Also:
- Constant Field Values
-
STATE_END_DOCUMENT
protected static final short STATE_END_DOCUMENT
State: end document.- See Also:
- Constant Field Values
-
NAMES_NO_CHANGE
protected static final short NAMES_NO_CHANGE
Don't modify HTML names.- See Also:
- Constant Field Values
-
NAMES_UPPERCASE
protected static final short NAMES_UPPERCASE
Uppercase HTML names.- See Also:
- Constant Field Values
-
NAMES_LOWERCASE
protected static final short NAMES_LOWERCASE
Lowercase HTML names.- See Also:
- Constant Field Values
-
SCAN_TRUE
private static final int SCAN_TRUE
Scan return code: operation completed normally.- See Also:
- Constant Field Values
-
SCAN_EOF
private static final int SCAN_EOF
Scan return code: end of entity reached (EOF).- See Also:
- Constant Field Values
-
SCAN_FALSE
private static final int SCAN_FALSE
Scan return code: continue scanning (state transition).- See Also:
- Constant Field Values
-
DEBUG_SCANNER
private static final boolean DEBUG_SCANNER
Set to true to debug changes in the scanner.- See Also:
- Constant Field Values
-
DEBUG_SCANNER_STATE
private static final boolean DEBUG_SCANNER_STATE
Set to true to debug changes in the scanner state.- See Also:
- Constant Field Values
-
DEBUG_BUFFER
private static final boolean DEBUG_BUFFER
Set to true to debug the buffer.- See Also:
- Constant Field Values
-
DEBUG_CHARSET
private static final boolean DEBUG_CHARSET
Set to true to debug character encoding handling.- See Also:
- Constant Field Values
-
DEBUG_CALLBACKS
protected static final boolean DEBUG_CALLBACKS
Set to true to debug callbacks.- See Also:
- Constant Field Values
-
fAugmentations_
private boolean fAugmentations_
Augmentations.
-
fReportErrors_
boolean fReportErrors_
Report errors.
-
fScriptStripCDATADelims_
boolean fScriptStripCDATADelims_
Strip CDATA delimiters from SCRIPT tags.
-
fScriptStripCommentDelims_
boolean fScriptStripCommentDelims_
Strip comment delimiters from SCRIPT tags.
-
fStyleStripCDATADelims_
boolean fStyleStripCDATADelims_
Strip CDATA delimiters from STYLE tags.
-
fStyleStripCommentDelims_
boolean fStyleStripCommentDelims_
Strip comment delimiters from STYLE tags.
-
fIgnoreSpecifiedCharset_
boolean fIgnoreSpecifiedCharset_
Ignore specified character set.
-
fCDATASections_
boolean fCDATASections_
CDATA sections.
-
fCDATAEarlyClosing_
boolean fCDATAEarlyClosing_
CDATA early closing.
-
fOverrideDoctype_
private boolean fOverrideDoctype_
Override doctype declaration public and system identifiers.
-
fInsertDoctype_
boolean fInsertDoctype_
Insert document type declaration.
-
fNormalizeAttributes_
boolean fNormalizeAttributes_
Normalize attribute values.
-
fPlainAttributeValues_
boolean fPlainAttributeValues_
Store the plain attribute values also.
-
fParseNoScriptContent_
boolean fParseNoScriptContent_
Parse noscript content.
-
fAllowSelfclosingIframe_
boolean fAllowSelfclosingIframe_
Allows self closing iframe tags.
-
fAllowSelfclosingScript_
boolean fAllowSelfclosingScript_
Allows self closing script tags.
-
fAllowSelfclosingTags_
boolean fAllowSelfclosingTags_
Allows self closing tags.
-
fNamesElems
protected short fNamesElems
Modify HTML element names.
-
fNamesAttrs
protected short fNamesAttrs
Modify HTML attribute names.
-
fDefaultIANAEncoding
protected java.lang.String fDefaultIANAEncoding
Default encoding.
-
fErrorReporter
protected HTMLErrorReporter fErrorReporter
Error reporter.
-
fEncodingTranslator
protected EncodingTranslator fEncodingTranslator
Encoding translator.
-
fDoctypePubid
protected java.lang.String fDoctypePubid
Doctype declaration public identifier.
-
fDoctypeSysid
protected java.lang.String fDoctypeSysid
Doctype declaration system identifier.
-
fReaderBufferSize
private int fReaderBufferSize
-
fBeginLineNumber
protected int fBeginLineNumber
Beginning line number.
-
fBeginColumnNumber
protected int fBeginColumnNumber
Beginning column number.
-
fBeginCharacterOffset
protected int fBeginCharacterOffset
Beginning character offset in the file.
-
fByteStream
protected PlaybackInputStream fByteStream
The playback byte stream.
-
fCurrentEntity
protected HTMLScanner.CurrentEntity fCurrentEntity
Current entity.
-
fCurrentEntityStack
protected final MiniStack<HTMLScanner.CurrentEntity> fCurrentEntityStack
The current entity stack.
-
fScanner
protected HTMLScanner.Scanner fScanner
The current scanner.
-
fScannerState
protected short fScannerState
The current scanner state.
-
fDocumentHandler
protected XMLDocumentHandler fDocumentHandler
The document handler.
-
fIANAEncoding
protected java.lang.String fIANAEncoding
Auto-detected IANA encoding.
-
fJavaEncoding
protected java.lang.String fJavaEncoding
Auto-detected Java encoding.
-
fElementCount
protected int fElementCount
Element count.
-
fElementDepth
protected int fElementDepth
Element depth.
-
fFragmentSpecialScannerTag_
private java.lang.String fFragmentSpecialScannerTag_
If not empty, this is the name of the tag that requires the special scanner.
-
fContentScanner
protected HTMLScanner.Scanner fContentScanner
Content scanner.
-
fSpecialScanner
protected final HTMLScanner.SpecialScanner fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
-
fScriptScanner
protected final HTMLScanner.ScriptScanner fScriptScanner
Special scanner used for script tags.
-
fStringBuffer
protected final XMLString fStringBuffer
String buffer.
-
fStringBufferEntityRef
final XMLString fStringBufferEntityRef
String buffer used when resolving entity refs.
-
fStringBufferPlainAttribValue
final XMLString fStringBufferPlainAttribValue
-
fScanUntilEndTag
final XMLString fScanUntilEndTag
-
fScanComment
final XMLString fScanComment
-
fScanLiteral
private final XMLString fScanLiteral
-
fSingleBoolean
final boolean[] fSingleBoolean
Reusable single-element boolean array used as an out-parameter.
Performance optimization: This array is reused across method calls to avoid allocating a new single-element array on every invocation of methods like scanStartElement and scanAttribute. The array is reset before each use to ensure correctness.
Thread safety: Safe because scanner instances are single-threaded.
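The out-parameter pattern described for fSingleBoolean can be sketched as follows. This is an illustration only: OutParamSketch, scanTag, and describe are hypothetical names, and the toy tag parsing stands in for the real scanStartElement logic, which is not shown on this page.

```java
/** Sketch: a reusable single-element boolean array used as an out-parameter. */
final class OutParamSketch {
    // Allocated once and reused for every call, mirroring the idea behind fSingleBoolean.
    private final boolean[] singleBoolean = {false};

    /** Hypothetical scan step: returns the tag name and reports "self-closing" via out[0]. */
    static String scanTag(String tag, boolean[] out) {
        boolean selfClosing = tag.endsWith("/>");
        out[0] = selfClosing;
        int end = tag.length() - (selfClosing ? 2 : 1);
        return tag.substring(1, end).trim();
    }

    String describe(String tag) {
        singleBoolean[0] = false; // reset before each use
        String name = scanTag(tag, singleBoolean);
        return name + (singleBoolean[0] ? " (empty)" : "");
    }

    public static void main(String[] args) {
        OutParamSketch sketch = new OutParamSketch();
        System.out.println(sketch.describe("<br/>"));
        System.out.println(sketch.describe("<div>"));
    }
}
```

Because the array is shared instance state, this only works when a single thread uses the instance at a time, which matches the thread-safety note above.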
-
htmlConfiguration_
final HTMLConfiguration htmlConfiguration_
-
fLocationItem
private final LocationItem fLocationItem
Our location item, to be reused because Augmentations says so; this saves memory.
-
-
Constructor Detail
-
HTMLScanner
HTMLScanner(HTMLConfiguration htmlConfiguration)
Creates a new HTMLScanner with the given configuration.
- Parameters:
htmlConfiguration - the configuration to use
-
-
Method Detail
-
pushInputSource
public void pushInputSource(XMLInputSource inputSource)
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns to where it left off at the time this entity source was pushed.
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
- Parameters:
inputSource - The new input source to start scanning.
- See Also:
evaluateInputSource(XMLInputSource)
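The stack behaviour described for pushInputSource (scan the pushed content, then resume the outer entity where it left off) can be sketched with a plain stack of strings. EntityStackSketch and its nested Entity class are hypothetical stand-ins; the real scanner works on CurrentEntity objects backed by readers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: reading characters from a stack of input entities. */
final class EntityStackSketch {
    /** Hypothetical stand-in for a current entity: a string plus a read offset. */
    static final class Entity {
        final String text;
        int offset;
        Entity(String text) { this.text = text; }
    }

    private final Deque<Entity> stack = new ArrayDeque<>();

    void push(String text) {
        stack.push(new Entity(text));
    }

    /** Returns the next character, or -1 when every entity is exhausted. */
    int read() {
        while (!stack.isEmpty()) {
            Entity top = stack.peek();
            if (top.offset < top.text.length()) {
                return top.text.charAt(top.offset++);
            }
            stack.pop(); // end of this entity: resume the one beneath it
        }
        return -1;
    }

    public static void main(String[] args) {
        EntityStackSketch scanner = new EntityStackSketch();
        scanner.push("ab");
        StringBuilder seen = new StringBuilder().append((char) scanner.read());
        scanner.push("XY"); // e.g. output written by an embedded script
        for (int c = scanner.read(); c != -1; c = scanner.read()) {
            seen.append((char) c);
        }
        System.out.println(seen);
    }
}
```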
-
getReader
private java.io.Reader getReader(XMLInputSource inputSource)
-
evaluateInputSource
public void evaluateInputSource(XMLInputSource inputSource)
Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- Parameters:
inputSource - The new input source to start evaluating.
- See Also:
pushInputSource(XMLInputSource)
-
cleanup
public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
- Parameters:
closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
-
getEncoding
public java.lang.String getEncoding()
Returns the encoding.
- Specified by:
getEncoding in interface org.xml.sax.ext.Locator2
-
getPublicId
public java.lang.String getPublicId()
Returns the public identifier.
- Specified by:
getPublicId in interface org.xml.sax.Locator
-
getBaseSystemId
public java.lang.String getBaseSystemId()
Returns the base system identifier.
- Specified by:
getBaseSystemId in interface XMLLocator
- Returns:
- the base system identifier.
-
getLiteralSystemId
public java.lang.String getLiteralSystemId()
Returns the literal system identifier.
- Specified by:
getLiteralSystemId in interface XMLLocator
- Returns:
- the literal system identifier.
-
getSystemId
public java.lang.String getSystemId()
Returns the expanded system identifier.
- Specified by:
getSystemId in interface org.xml.sax.Locator
-
getLineNumber
public int getLineNumber()
Returns the current line number.
- Specified by:
getLineNumber in interface org.xml.sax.Locator
-
getColumnNumber
public int getColumnNumber()
Returns the current column number.
- Specified by:
getColumnNumber in interface org.xml.sax.Locator
-
getXMLVersion
public java.lang.String getXMLVersion()
Returns the XML version.
- Specified by:
getXMLVersion in interface org.xml.sax.ext.Locator2
-
getCharacterOffset
public int getCharacterOffset()
Returns the character offset.
- Specified by:
getCharacterOffset in interface XMLLocator
- Returns:
- the character offset, or -1 if no character offset is available.
-
getFeatureDefault
public java.lang.Boolean getFeatureDefault(java.lang.String featureId)
Returns the default state for a feature.
- Specified by:
getFeatureDefault in interface HTMLComponent
- Specified by:
getFeatureDefault in interface XMLComponent
- Parameters:
featureId - The feature identifier.
- Returns:
- the default state for a feature, or null if this component does not want to report a default value for this feature.
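One plausible way to back a lookup like getFeatureDefault is a pair of parallel arrays, in the spirit of the RECOGNIZED_FEATURES and RECOGNIZED_FEATURES_DEFAULTS fields. The sketch below assumes those arrays are index-aligned, which this page does not confirm. The two feature identifiers come from the list at the top of this page; the default values shown are placeholders, not the library's real defaults.

```java
/** Sketch: looking up a feature default through two index-aligned arrays. */
final class FeatureDefaultsSketch {
    static final String AUGMENTATIONS = "http://cyberneko.org/html/features/augmentations";
    static final String REPORT_ERRORS = "http://cyberneko.org/html/features/report-errors";

    private static final String[] FEATURES = { AUGMENTATIONS, REPORT_ERRORS };

    // Placeholder values only; null models "no default reported" for a recognized feature.
    private static final Boolean[] DEFAULTS = { Boolean.FALSE, null };

    /** Returns the default for a recognized feature, or null if none is reported. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < FEATURES.length; i++) {
            if (FEATURES[i].equals(featureId)) {
                return DEFAULTS[i];
            }
        }
        return null; // unrecognized feature
    }

    public static void main(String[] args) {
        System.out.println(getFeatureDefault(AUGMENTATIONS));
        System.out.println(getFeatureDefault("http://example.org/unknown"));
    }
}
```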
-
getPropertyDefault
public java.lang.Object getPropertyDefault(java.lang.String propertyId)
Returns the default state for a property.
- Specified by:
getPropertyDefault in interface HTMLComponent
- Specified by:
getPropertyDefault in interface XMLComponent
- Parameters:
propertyId - The property identifier.
- Returns:
- the default state for a property, or null if this component does not want to report a default value for this property.
-
getRecognizedFeatures
public java.lang.String[] getRecognizedFeatures()
Returns recognized features.
- Specified by:
getRecognizedFeatures in interface XMLComponent
- Returns:
- an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
-
getRecognizedProperties
public java.lang.String[] getRecognizedProperties()
Returns recognized properties.
- Specified by:
getRecognizedProperties in interface XMLComponent
- Returns:
- an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
-
reset
public void reset(XMLComponentManager manager) throws XMLConfigurationException
Resets the component.
- Specified by:
reset in interface XMLComponent
- Parameters:
manager - The component manager.
- Throws:
XMLConfigurationException
-
setFeature
public void setFeature(java.lang.String featureId, boolean state)
Sets a feature.
- Specified by:
setFeature in interface XMLComponent
- Parameters:
featureId - The feature identifier.
state - The state of the feature.
-
setProperty
public void setProperty(java.lang.String propertyId, java.lang.Object value) throws XMLConfigurationException
Sets a property.
- Specified by:
setProperty in interface XMLComponent
- Parameters:
propertyId - The property identifier.
value - The value of the property.
- Throws:
XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
-
setInputSource
public void setInputSource(XMLInputSource source) throws java.io.IOException
Sets the input source.
- Parameters:
source - The input source.
- Throws:
java.io.IOException - Thrown on I/O error.
-
scanDocument
public boolean scanDocument(boolean complete) throws XNIException, java.io.IOException
Scans a document.
- Parameters:
complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
- Returns:
- True if there is more to scan, false otherwise.
- Throws:
java.io.IOException - Thrown on I/O error.
XNIException - on error.
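The difference between complete and "pull" scanning can be sketched with a toy scanner whose portions are tokens from a fixed list. PullScanSketch and ToyScanner are hypothetical; only the boolean contract (true while there is more to scan) follows the description of scanDocument above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch: the pull-scanning contract of scanDocument(boolean). */
final class PullScanSketch {
    /** Hypothetical toy scanner: each "portion" is one token from a fixed list. */
    static final class ToyScanner {
        private final Iterator<String> tokens;
        final List<String> events = new ArrayList<>();

        ToyScanner(List<String> tokens) {
            this.tokens = tokens.iterator();
        }

        /** Mirrors the documented contract: returns true if there is more to scan. */
        boolean scanDocument(boolean complete) {
            do {
                if (!tokens.hasNext()) {
                    return false;
                }
                events.add(tokens.next()); // stands in for a document handler callback
            } while (complete);
            return tokens.hasNext();
        }
    }

    public static void main(String[] args) {
        ToyScanner scanner = new ToyScanner(Arrays.asList("<p>", "text", "</p>"));
        // Pull model: the caller regains control after each portion.
        while (scanner.scanDocument(false)) {
            System.out.println("scanned so far: " + scanner.events);
        }
        System.out.println("done: " + scanner.events);
    }
}
```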
-
setDocumentHandler
public void setDocumentHandler(XMLDocumentHandler handler)
Sets the document handler.
- Specified by:
setDocumentHandler in interface XMLDocumentSource
- Parameters:
handler - the new handler
-
getDocumentHandler
public XMLDocumentHandler getDocumentHandler()
- Specified by:
getDocumentHandler in interface XMLDocumentSource
- Returns:
- the document handler
-
getValue
protected static java.lang.String getValue(XMLAttributes attrs, java.lang.String aname)
-
systemId
public static java.lang.String systemId(java.lang.String systemId, java.lang.String baseSystemId)
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
- Parameters:
systemId - The systemId to be expanded.
baseSystemId - baseSystemId
- Returns:
- Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
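Expanding a relative system identifier against a base can be approximated with java.net.URI resolution. This is a different technique from whatever the library does internally, shown only to make the idea concrete. SystemIdSketch and expand are hypothetical, and unlike the documented method this sketch returns an already absolute identifier unchanged rather than null.

```java
import java.net.URI;

/** Sketch: resolving a system identifier against a base with java.net.URI. */
final class SystemIdSketch {
    /** Hypothetical helper; throws IllegalArgumentException on malformed input. */
    static String expand(String systemId, String baseSystemId) {
        URI id = URI.create(systemId);
        if (id.isAbsolute() || baseSystemId == null) {
            return id.toString(); // nothing to resolve against
        }
        return URI.create(baseSystemId).resolve(id).toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("c.css", "http://example.com/a/b.html"));
    }
}
```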
-
fixURI
protected static java.lang.String fixURI(java.lang.String str)
Fixes a platform dependent filename to standard URI form.
- Parameters:
str - The string to fix.
- Returns:
- Returns the fixed URI string.
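As a rough illustration of what "fixing" a platform dependent filename can involve, the sketch below replaces the platform separator with a forward slash and prefixes drive-letter paths with a slash. These two rules are assumptions, not a description of the library's code. FixUriSketch is hypothetical, and the extra separator parameter (the documented method takes a single string) exists only to keep the example deterministic on every platform.

```java
/** Sketch: converting a platform dependent filename to URI-style form. */
final class FixUriSketch {
    /** Hypothetical helper; 'separator' stands in for java.io.File.separatorChar. */
    static String fixURI(String str, char separator) {
        String s = str.replace(separator, '/');
        // A drive-letter path such as "C:/dir" gets a leading slash.
        if (s.length() >= 2 && s.charAt(1) == ':' && Character.isLetter(s.charAt(0))) {
            s = "/" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fixURI("C:\\dir\\file.html", '\\'));
        System.out.println(fixURI("/usr/share/a.html", '/'));
    }
}
```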
-
getNamesValue
protected static short getNamesValue(java.lang.String value)
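This method has no description here. Judging from the NAMES_ELEMS and NAMES_ATTRS values ("upper", "lower", "default") and the NAMES_* constants, it presumably maps a property string to a mode constant. The sketch below shows that idea under that assumption. NamesModeSketch, its numeric constant values, and modifyName are hypothetical.

```java
import java.util.Locale;

/** Sketch: mapping a names property value to a mode and applying it. */
final class NamesModeSketch {
    // Numeric values are placeholders; this documentation does not list the real ones.
    static final short NO_CHANGE = 0;
    static final short UPPERCASE = 1;
    static final short LOWERCASE = 2;

    static short namesValue(String value) {
        if ("upper".equals(value)) {
            return UPPERCASE;
        }
        if ("lower".equals(value)) {
            return LOWERCASE;
        }
        return NO_CHANGE; // "default" and anything unrecognized
    }

    static String modifyName(String name, short mode) {
        switch (mode) {
            case UPPERCASE: return name.toUpperCase(Locale.ROOT);
            case LOWERCASE: return name.toLowerCase(Locale.ROOT);
            default: return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", namesValue("upper")));
        System.out.println(modifyName("Div", namesValue("lower")));
    }
}
```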
-
setScanner
protected void setScanner(HTMLScanner.Scanner scanner)
-
setScannerState
protected void setScannerState(short state)
-
scanDoctype
protected int scanDoctype() throws java.io.IOException
- Throws:
java.io.IOException
-
scanLiteral
protected int scanLiteral() throws java.io.IOException
- Throws:
java.io.IOException
-
scanName
protected java.lang.String scanName(boolean strict, short mode) throws java.io.IOException
- Throws:
java.io.IOException
-
scanTagName
protected java.lang.String scanTagName() throws java.io.IOException
- Throws:
java.io.IOException
-
scanEntityRef
protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content) throws java.io.IOException
- Throws:
java.io.IOException
-
returnEntityRefString
private int returnEntityRefString(XMLString str, boolean content)
-
locationAugs
protected final Augmentations locationAugs(HTMLScanner.CurrentEntity currentEntity)
-
synthesizedAugs
protected final Augmentations synthesizedAugs()
-
isEncodingCompatible
static boolean isEncodingCompatible(java.lang.String encoding1, java.lang.String encoding2)
Detects whether two encodings are compatible: both must be able to read the meta tag specifying the new encoding. This means that the byte representation of some minimal HTML markup must be the same in both encodings.
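The compatibility criterion stated here (identical bytes for some minimal markup) can be sketched by encoding a probe string in both charsets and comparing the results. EncodingCompatSketch is hypothetical, and the probe text is an assumption: this page does not say which markup the library actually compares.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

/** Sketch: two encodings are treated as compatible if a probe encodes identically. */
final class EncodingCompatSketch {
    // Hypothetical probe; the markup the library really uses is not stated on this page.
    private static final String PROBE =
            "<meta http-equiv='Content-Type' content='text/html;charset=x'>";

    static boolean isCompatible(String encoding1, String encoding2) {
        try {
            byte[] first = PROBE.getBytes(Charset.forName(encoding1));
            byte[] second = PROBE.getBytes(Charset.forName(encoding2));
            return Arrays.equals(first, second);
        } catch (IllegalArgumentException e) {
            return false; // unknown or illegal charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("UTF-8", "ISO-8859-1"));
        System.out.println(isCompatible("UTF-8", "UTF-16"));
    }
}
```

ASCII-compatible charsets such as UTF-8 and ISO-8859-1 agree on the probe's bytes, while UTF-16 does not, so a parser that started in one could not have read a meta tag written in the other.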
-
canRoundtrip
private static boolean canRoundtrip(java.lang.String encodeCharset, java.lang.String decodeCharset) throws java.io.UnsupportedEncodingException
- Throws:
java.io.UnsupportedEncodingException
-
-