EPub
EPUB is an e-book file format that uses the ".epub" file extension. The term is short for electronic publication and is sometimes styled ePub.
EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
This page covers how to load .epub documents into the Document format that
we can use downstream. You'll need to install the pandoc package for this
loader to work.
#!pip install pandoc
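Before loading a file, it can help to confirm that pandoc is reachable. The following is a minimal sketch, assuming the loader ultimately relies on a pandoc executable on the PATH (an assumption, not something this page states); the helper name pandoc_available is hypothetical.

```python
import shutil


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable is found on PATH.

    Assumption: the EPUB loader needs the pandoc binary at runtime.
    """
    return shutil.which("pandoc") is not None


if not pandoc_available():
    print("pandoc was not found on PATH; install it before loading .epub files.")
```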
from langchain.document_loaders import UnstructuredEPubLoader
loader = UnstructuredEPubLoader("winter-sports.epub")
data = loader.load()
Retain Elements
Under the hood, Unstructured creates different "elements" for different
chunks of text. By default, the loader combines those elements, but you
can keep them separate by specifying mode="elements".
loader = UnstructuredEPubLoader("winter-sports.epub", mode="elements")
data = loader.load()
data[0]
Document(page_content='The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson', lookup_str='', metadata={'source': 'winter-sports.epub', 'page_number': 1, 'category': 'Title'}, lookup_index=0)
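With the elements kept separate, the category value in each element's metadata can be used to group or filter the text. The following is a minimal sketch, assuming every element exposes page_content and a metadata dictionary with a "category" key, as in the output above. ElementDoc and group_by_category are hypothetical names, and the second sample element is invented for illustration so that the snippet runs without the EPUB file.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ElementDoc:
    """Hypothetical stand-in for a loaded Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_category(docs):
    """Group element texts by metadata["category"] (assumed key)."""
    groups = defaultdict(list)
    for doc in docs:
        # Elements without a category go into a fallback bucket.
        category = doc.metadata.get("category", "Uncategorized")
        groups[category].append(doc.page_content)
    return dict(groups)


# Sample data: the first entry mirrors the output shown above,
# the second is made up purely for illustration.
sample = [
    ElementDoc(
        "The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson",
        {"source": "winter-sports.epub", "page_number": 1, "category": "Title"},
    ),
    ElementDoc(
        "Illustrative body text.",
        {"source": "winter-sports.epub", "category": "NarrativeText"},
    ),
]

grouped = group_by_category(sample)
titles = grouped.get("Title", [])
```

In practice, the list returned by loader.load() in "elements" mode could be passed to group_by_category in place of sample, provided its items carry the same attributes.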