Unstructured: The Best Tool for Text Preprocessing

March 6, 2025

Introduction

Large Language Models (LLMs) are only as useful as the text you feed them, and that text usually starts out locked in PDFs, Word files, and web pages. Unstructured is an open-source Python library that extracts, cleans, and structures text from these formats so it can be passed to an LLM or indexed in a vector store. This post walks through partitioning a PDF, a plain-text file, and a web page, then shows how Unstructured works with LangChain and ChromaDB to ingest articles into a vector database and summarize the best match.

Why Use Unstructured?

Key Features:

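  • Multi-format Support: Works with PDFs, Word documents, HTML, plain text, and more. 📄
  • Text Extraction: Returns text as a list of typed elements (titles, paragraphs, list items), so document structure is kept. 📝
  • Data Cleaning: Includes cleaning helpers, such as removing extra whitespace and bullet characters, so noisy text is tidied before it reaches an LLM. 🧹
  • Element Chunking: Groups extracted elements into chunks sized for embedding, for example by section title. 🧩
  • Seamless Integration: Works with LangChain and other LLM tools. 🤝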
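
To make the element and chunking ideas concrete, here is a minimal sketch in plain Python. It does not use Unstructured itself: the `Element` tuple and the `chunk_by_heading` function are hypothetical stand-ins written for this illustration, and the `max_chars` limit is an arbitrary assumption. The library ships its own chunking (for example `chunk_by_title` in `unstructured.chunking.title`), which handles more cases than this.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical stand-in for an extracted element: a category plus its text."""
    category: str
    text: str


def chunk_by_heading(elements, max_chars=500):
    """Group elements into chunks, starting a new chunk at each Title
    or when adding an element would push the chunk past max_chars."""
    chunks, current = [], []
    for el in elements:
        text = " ".join(el.text.split())  # basic cleaning: collapse whitespace
        if not text:
            continue
        # Approximate current chunk length, counting one separator per piece
        size = sum(len(t) for t in current) + len(current)
        starts_section = el.category == "Title"
        if current and (starts_section or size + len(text) > max_chars):
            chunks.append("\n".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Made-up sample elements, for illustration only
elements = [
    Element("Title", "Installation"),
    Element("NarrativeText", "Install   the package\nwith pip."),
    Element("Title", "Usage"),
    Element("NarrativeText", "Call partition on a file."),
]
chunks = chunk_by_heading(elements)
for chunk in chunks:
    print(repr(chunk))
```

Each heading stays attached to the text beneath it, which tends to give an embedding model more coherent chunks than fixed-size splitting.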

Installation

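To begin using Unstructured, install the required dependencies:

pip install "unstructured[pdf]" langchain langchain_community openai chromadb tiktoken

This installs Unstructured with its PDF extras, along with LangChain, the OpenAI client, ChromaDB, and tiktoken, which the vector database and summarization steps rely on. The quotes keep shells such as zsh from interpreting the square brackets.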

Setting Up API Keys

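The embedding and summarization steps later in this post call OpenAI models, which read the API key from the OPENAI_API_KEY environment variable. In Google Colab, you can load it from the notebook's secrets: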

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
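
Outside Colab, `google.colab` is not available, so the key has to come from somewhere else. The sketch below is one possible fallback, not part of the original notebook: `ensure_api_key` is a hypothetical helper that reuses an existing environment variable and only prompts when it is missing.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting for it only if it is missing."""
    key = os.environ.get(name)
    if not key:
        # getpass hides the typed value so the key does not end up in notebook output
        key = prompt(f"Enter {name}: ")
        os.environ[name] = key
    return key
```

Call `ensure_api_key()` once before creating embeddings or a chat model. The `prompt` parameter exists only so the input source can be swapped out, for example in tests.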

Extracting Text from a PDF

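Extracting text from PDFs is a common need for research papers, reports, and scanned documents. Unstructured handles this with a single function call. Note that scanned PDFs without a text layer need OCR, which relies on system tools such as Tesseract being installed.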

Code:

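from unstructured.partition.auto import partition
import requests

pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example PDF URL
response = requests.get(pdf_url, timeout=30)
response.raise_for_status()  # Stop early if the download failed

# Save the PDF locally so partition can read it from disk
with open("example.pdf", "wb") as f:
    f.write(response.content)

# Partition the PDF; the file type is detected automatically
elements = partition(filename="example.pdf")

# Print the text of each extracted element
for element in elements:
    print(element.text)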

Explanation:

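  • Downloads the PDF from a URL and saves it locally as example.pdf.
  • Calls partition, which detects the file type and returns a list of elements such as titles, paragraphs, and list items.
  • Iterates over the elements and prints each one's text.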

Expected Output:

The text of each extracted element (title, paragraph, list item, and so on), printed one element at a time in document order.

Real-World Application:

Use this method for processing research papers, business reports, and scanned contracts for LLM-based summarization or analysis.
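
Each element returned by partition also carries a category label (such as Title or NarrativeText) alongside its text. A quick way to see what a document contains is to count elements per category. The sketch below assumes only that each element exposes a `category` attribute. `summarize_categories` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with made-up text so the example runs without downloading a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_categories(elements):
    """Count how many elements fall into each category."""
    return Counter(getattr(el, "category", "Unknown") for el in elements)


# Stand-in objects that mimic extracted elements (hypothetical data)
sample = [
    SimpleNamespace(category="Title", text="Example title"),
    SimpleNamespace(category="NarrativeText", text="First paragraph."),
    SimpleNamespace(category="NarrativeText", text="Second paragraph."),
]
summary = summarize_categories(sample)
for category, count in summary.most_common():
    print(f"{category}: {count}")
```

With real data, pass the `elements` list from the partition call instead of `sample`.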

Extracting Text from a Local .txt File

For plain text files, partition_text reads the file and splits it into text elements, with no file-type detection needed.

Code:

from unstructured.partition.text import partition_text

# Create a sample text file
with open("dummy_text.txt", "w") as f:
    f.write("This is a sample text file.\n")
    f.write("It contains multiple lines of text.\n")
    f.write("Unstructured can process this easily.")

# Extract text
elements = partition_text(filename="dummy_text.txt")

for element in elements:
    print(element.text)

Expected Output:

This is a sample text file.
It contains multiple lines of text.
Unstructured can process this easily.

Application:

This method is useful for preprocessing logs, articles, or any text file before feeding it into an LLM.

Extracting Text from a Website

Extracting content from web pages can be crucial for news aggregation, data collection, or competitive analysis.

Code:

from unstructured.partition.html import partition_html
import requests

url = "https://www.unstructured.io/"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the request failed
html_content = response.text

# Partition the raw HTML string into elements
elements = partition_html(text=html_content)

for element in elements:
    print(element.text)

Expected Output:

The text of the web page, printed one element at a time, including headings, paragraphs, and list items.

Use Case:

Use this approach to scrape articles, blog posts, or documentation for LLM-powered summarization or analysis.
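
Web pages often mix article text with menus, buttons, and footers, so it can help to keep only some element categories before sending text to an LLM. The following is a minimal sketch under that assumption. `body_text` is a hypothetical helper, the set of categories to keep is a guess that should be tuned per site, and the `SimpleNamespace` objects stand in for real elements.

```python
from types import SimpleNamespace


def body_text(elements, keep=("Title", "NarrativeText", "ListItem")):
    """Join the text of elements whose category is in keep, skipping empty ones."""
    parts = [
        el.text.strip()
        for el in elements
        if getattr(el, "category", None) in keep and el.text and el.text.strip()
    ]
    return "\n\n".join(parts)


# Stand-in objects that mimic the two attributes used above (hypothetical data)
page = [
    SimpleNamespace(category="UncategorizedText", text="Log in"),
    SimpleNamespace(category="Title", text="Release notes"),
    SimpleNamespace(category="NarrativeText", text="  Version 2 adds PDF support.  "),
    SimpleNamespace(category="NarrativeText", text=""),
]
cleaned = body_text(page)
print(cleaned)
```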

Vector Database Ingestion with ChromaDB

Unstructured also helps in creating vector-based document retrieval systems. Here’s how to use it with ChromaDB and LangChain.

Gathering Links from CNN Lite

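from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    # link_urls holds the hrefs found in this element, if any
    if element.metadata.link_urls:
        # Take the first href and drop its leading "/" to get a relative path
        relative_link = element.metadata.link_urls[0][1:]
        # On CNN Lite, article paths start with the year; skip everything else
        if relative_link.startswith("2025"):
            links.append(f"{cnn_lite_url}{relative_link}")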
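
The link-building step above relies on string slicing and on the base URL ending in a slash. A slightly more defensive variant, sketched here with the standard library only, resolves each href with `urljoin` and drops duplicates. `article_links` is a hypothetical helper and the example paths are made up for illustration.

```python
from urllib.parse import urljoin


def article_links(base_url, hrefs, year_prefix="2025"):
    """Resolve relative hrefs against base_url, keeping only paths that start with the year."""
    links = []
    for href in hrefs:
        path = href.lstrip("/")
        if path.startswith(year_prefix):
            full = urljoin(base_url, path)
            if full not in links:  # skip duplicates while preserving order
                links.append(full)
    return links


# Made-up hrefs, similar in shape to what a news index page might contain
hrefs = ["/2025/03/06/example-story", "/", "/terms", "/2025/03/06/example-story"]
resolved = article_links("https://lite.cnn.com/", hrefs)
print(resolved)
```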

Ingesting Articles

from langchain_community.document_loaders import UnstructuredURLLoader

# Download each article and partition it with Unstructured
loader = UnstructuredURLLoader(urls=links, show_progress_bar=True)
docs = loader.load()

Storing Documents in ChromaDB

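from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

# Embed each article with OpenAI and index the vectors in an in-memory Chroma collection
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

# Retrieve the single closest article; swap in a topic from the current headlines
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)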
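
Conceptually, a similarity search embeds the query, compares that vector with every stored document vector, and returns the k closest matches. The toy sketch below illustrates the idea with cosine similarity and hand-written three-dimensional vectors. It is not how ChromaDB is implemented, the store may use a different distance metric, and real embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hand-written toy vectors (hypothetical, not real embeddings)
doc_vecs = {
    "politics": [0.9, 0.1, 0.0],
    "sports": [0.0, 0.2, 0.9],
    "weather": [0.1, 0.9, 0.1],
}
best = top_k([1.0, 0.0, 0.1], doc_vecs, k=1)
print(best)
```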

Summarizing Retrieved Documents

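from langchain_community.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
# "stuff" places all retrieved documents into a single prompt
chain = load_summarize_chain(llm, chain_type="stuff")
result = chain.invoke({"input_documents": query_docs})
print(result["output_text"])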
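
The "stuff" chain type takes its name from what it does: it stuffs the text of every retrieved document into one prompt and makes a single model call. The sketch below shows the idea in plain Python. `build_stuff_prompt` is a hypothetical helper and the prompt wording is an assumption for illustration, not necessarily LangChain's actual template.

```python
def build_stuff_prompt(texts, instruction="Write a concise summary of the following:"):
    """Concatenate all document texts into a single summarization prompt."""
    joined = "\n\n".join(t.strip() for t in texts if t.strip())
    return f"{instruction}\n\n{joined}\n\nSummary:"


# Made-up document texts, for illustration only
prompt = build_stuff_prompt(["First article text. ", "", "Second article text."])
print(prompt)
```

This approach is simple and works well with k=1, but everything has to fit in the model's context window, so larger result sets may call for a different chain type.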

Expected Output:

A concise summary of the most relevant article matching the query.

Real-World Application:

  • Use case: Automating news summarization.
  • Benefit: Reduces manual effort in tracking trending topics.

Conclusion

Unstructured turns PDFs, text files, and web pages into a uniform list of text elements, which removes much of the format-specific glue code from LLM pipelines. Combined with LangChain and ChromaDB, the same elements can be loaded, embedded, searched, and summarized in a few lines of code.

Next Steps

  • Try Unstructured with your own dataset.
  • Explore LangChain and ChromaDB for more advanced NLP applications.
  • Check out Unstructured’s official documentation for further customization.

Resources

  • Unstructured GitHub
  • LangChain Documentation
  • ChromaDB GitHub
  • Unstructured Experiment Notebook
