Class JSoupParser

  • All Implemented Interfaces:
    Serializable, org.apache.tika.parser.Parser

    public class JSoupParser
    extends org.apache.tika.parser.AbstractEncodingDetectorParser
    HTML parser. Uses JSoup to turn the input document into HTML SAX events, then post-processes those events to produce the XHTML and metadata expected by Tika clients.
    See Also:
    Serialized Form
    • Field Detail

      • DEFAULT_CHARSET

        public static final Charset DEFAULT_CHARSET
    • Constructor Detail

      • JSoupParser

        public JSoupParser()
      • JSoupParser

        public JSoupParser(org.apache.tika.detect.EncodingDetector encodingDetector)
    • Method Detail

      • getSupportedTypes

        public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
      • isExtractScripts

        public boolean isExtractScripts()
      • setExtractScripts

        @Field
        public void setExtractScripts(boolean extractScripts)
        Whether to extract the contents of script elements. The default is false.
        Parameters:
        extractScripts - true to extract script contents, false to skip them
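        The following is a minimal usage sketch, not taken from the Tika documentation. It assumes the class lives in the package org.apache.tika.parser.html (not stated on this page), that BodyContentHandler from org.apache.tika.sax is on the classpath, and that parse(InputStream, ContentHandler, Metadata, ParseContext) is available through the org.apache.tika.parser.Parser interface listed above. The class name JSoupParserUsage and the helper extractText are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.JSoupParser; // assumed package; not stated on this page
import org.apache.tika.sax.BodyContentHandler;

public class JSoupParserUsage {

    /** Hypothetical helper: parses an HTML string and returns the text the handler collected. */
    static String extractText(String html, boolean extractScripts) throws Exception {
        JSoupParser parser = new JSoupParser();
        // Script extraction is off by default; turn it on only when requested.
        parser.setExtractScripts(extractScripts);

        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        try (InputStream stream =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            // parse(...) is assumed to come from the Parser interface.
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello</p><script>var x = 1;</script></body></html>";
        System.out.println(extractText(html, false));
    }
}
```

        How extracted script content is delivered when the flag is true is not described on this page, so the sketch only shows where the setter fits in the call sequence.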
      • getEncodingDetector

        protected org.apache.tika.detect.EncodingDetector getEncodingDetector(org.apache.tika.parser.ParseContext parseContext)
        Looks for an EncodingDetector in the ParseContext. If none has been passed in, uses the original EncodingDetector from initialization.
        Overrides:
        getEncodingDetector in class org.apache.tika.parser.AbstractEncodingDetectorParser
        Parameters:
        parseContext - the ParseContext to search for an EncodingDetector
        Returns:
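        The lookup described for getEncodingDetector can be modeled with a small, self-contained sketch. This is not Tika's implementation: Detector, Context, and SketchParser are hypothetical stand-ins for EncodingDetector, ParseContext, and the parser, used only to illustrate the "prefer the per-parse value, otherwise fall back to the one from initialization" behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextFallbackSketch {

    /** Hypothetical stand-in for an encoding detector; only its name matters here. */
    static final class Detector {
        final String name;
        Detector(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a per-parse context that stores objects by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        // Returns null when nothing was stored under the key.
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    /** Hypothetical parser that keeps the detector it was constructed with. */
    static final class SketchParser {
        private final Detector initial;
        SketchParser(Detector initial) { this.initial = initial; }

        Detector getEncodingDetector(Context context) {
            // Prefer a detector supplied for this parse; otherwise use the initial one.
            Detector fromContext = context.get(Detector.class);
            return fromContext != null ? fromContext : initial;
        }
    }

    public static void main(String[] args) {
        SketchParser parser = new SketchParser(new Detector("initial"));

        Context empty = new Context();
        Context overridden = new Context();
        overridden.set(Detector.class, new Detector("per-parse"));

        System.out.println(parser.getEncodingDetector(empty).name);      // prints "initial"
        System.out.println(parser.getEncodingDetector(overridden).name); // prints "per-parse"
    }
}
```

        Under this pattern, a caller could override the detector for a single parse without reconfiguring the parser instance.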