Class TikaDocumentReader

java.lang.Object
org.springframework.ai.reader.tika.TikaDocumentReader
All Implemented Interfaces:
Supplier<List<org.springframework.ai.document.Document>>, org.springframework.ai.document.DocumentReader

public class TikaDocumentReader extends Object implements org.springframework.ai.document.DocumentReader
A document reader that uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, see https://tika.apache.org/3.1.0/formats.html. The reader returns the extracted text without any additional formatting, encapsulated in a Document instance. For more specialized PDF handling, consider PagePdfDocumentReader or ParagraphPdfDocumentReader.
Author:
Christian Tzolov
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final String
    METADATA_SOURCE
    Metadata key representing the source of the document.
  • Constructor Summary

    Constructors
    Constructor
    Description
    TikaDocumentReader(String resourceUrl)
    Constructor initializing the reader with a given resource URL.
    TikaDocumentReader(String resourceUrl, org.springframework.ai.reader.ExtractedTextFormatter textFormatter)
    Constructor initializing the reader with a given resource URL and a text formatter.
    TikaDocumentReader(org.springframework.core.io.Resource resource)
    Constructor initializing the reader with a resource.
    TikaDocumentReader(org.springframework.core.io.Resource resource, org.springframework.ai.reader.ExtractedTextFormatter textFormatter)
    Constructor initializing the reader with a resource and a text formatter.
    TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, org.springframework.ai.reader.ExtractedTextFormatter textFormatter)
    Constructor initializing the reader with a resource, content handler, and a text formatter.
  • Method Summary

    Modifier and Type
    Method
    Description
    List<org.springframework.ai.document.Document>
    get()
    Extracts and returns the list of documents from the resource.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface org.springframework.ai.document.DocumentReader

    read
  • Field Details

    • METADATA_SOURCE

      public static final String METADATA_SOURCE
      Metadata key representing the source of the document.
  • Constructor Details

    • TikaDocumentReader

      public TikaDocumentReader(String resourceUrl)
      Constructor initializing the reader with a given resource URL.
      Parameters:
      resourceUrl - URL to the resource
    • TikaDocumentReader

      public TikaDocumentReader(String resourceUrl, org.springframework.ai.reader.ExtractedTextFormatter textFormatter)
      Constructor initializing the reader with a given resource URL and a text formatter.
      Parameters:
      resourceUrl - URL to the resource
      textFormatter - Formatter for the extracted text
    • TikaDocumentReader

      public TikaDocumentReader(org.springframework.core.io.Resource resource)
      Constructor initializing the reader with a resource.
      Parameters:
      resource - Resource pointing to the document
    • TikaDocumentReader

      public TikaDocumentReader(org.springframework.core.io.Resource resource, org.springframework.ai.reader.ExtractedTextFormatter textFormatter)
      Constructor initializing the reader with a resource and a text formatter. This constructor creates a BodyContentHandler that allows large PDFs to be read (constrained only by available memory).
      Parameters:
      resource - Resource pointing to the document
      textFormatter - Formatter for the extracted text
    • TikaDocumentReader

      public TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, org.springframework.ai.reader.ExtractedTextFormatter textFormatter)
      Constructor initializing the reader with a resource, content handler, and a text formatter.
      Parameters:
      resource - Resource pointing to the document
      contentHandler - Handler to manage content extraction
      textFormatter - Formatter for the extracted text
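
The constructors above can be exercised as in the following minimal sketch. It assumes the Spring AI Tika reader module and its Apache Tika dependency are on the classpath, and that Document exposes getText() as in recent Spring AI releases. The class name TikaReaderExample, the helper method names, and the in-memory sample text are hypothetical and serve only as illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

public class TikaReaderExample {

    // Single-argument constructor: resource only.
    static List<Document> readPlain(Resource resource) {
        return new TikaDocumentReader(resource).get();
    }

    // Two-argument constructor: resource plus a caller-supplied text formatter.
    static List<Document> readFormatted(Resource resource, ExtractedTextFormatter formatter) {
        return new TikaDocumentReader(resource, formatter).get();
    }

    // Three-argument constructor: the caller also supplies the content handler.
    // A BodyContentHandler write limit of -1 disables Tika's character limit.
    static List<Document> readUnlimited(Resource resource, ExtractedTextFormatter formatter) {
        return new TikaDocumentReader(resource, new BodyContentHandler(-1), formatter).get();
    }

    public static void main(String[] args) {
        // Hypothetical input: a small plain-text document held in memory.
        Resource resource = new ByteArrayResource(
                "Hello from Tika".getBytes(StandardCharsets.UTF_8));

        ExtractedTextFormatter formatter = ExtractedTextFormatter.builder()
                .withNumberOfTopTextLinesToDelete(0)
                .build();

        for (Document doc : readFormatted(resource, formatter)) {
            System.out.println(doc.getText());
        }
    }
}
```

In an application the resource would more typically be a file, classpath, or URL resource. The in-memory resource is used here only to keep the sketch self-contained.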
  • Method Details

    • get

      public List<org.springframework.ai.document.Document> get()
      Extracts and returns the list of documents from the resource.
      Specified by:
      get in interface Supplier<List<org.springframework.ai.document.Document>>
      Returns:
      List of extracted Document instances
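
Because get() is specified by Supplier, a reader can be handed to any code that accepts a supplier of lists. The sketch below illustrates that contract using only the JDK. The loadAll helper and the string-based stand-in readers are hypothetical and are not part of Spring AI. In real code, TikaDocumentReader instances would take the place of the stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class SupplierContractSketch {

    // Hypothetical helper: drains several readers into one combined list.
    // Any Supplier<List<T>> qualifies, which includes a document reader.
    static <T> List<T> loadAll(List<Supplier<List<T>>> readers) {
        List<T> all = new ArrayList<>();
        for (Supplier<List<T>> reader : readers) {
            all.addAll(reader.get());
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-ins for readers: each returns a fixed list instead of parsing a file.
        Supplier<List<String>> first = () -> List.of("text of the first document");
        Supplier<List<String>> second = () -> List.of("text of the second document");

        List<String> all = loadAll(List.of(first, second));
        System.out.println(all.size() + " documents loaded");
    }
}
```

Keeping the helper generic over the element type means it does not need to know how each reader obtains its content, only that calling get() yields a list.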