When importing rich text data into a Data Warehouse (DWH), it’s often necessary to remove unwanted HTML tags while preserving essential formatting. This guide outlines a method using Python’s BeautifulSoup library to clean and structure the data efficiently.
## Why Strip HTML Tags?
Rich text from sources like web applications, CMS platforms, and APIs often includes extraneous formatting, embedded media, and metadata that are not needed for structured storage. Stripping tags ensures:
- Improved readability of stored text.
- Consistent formatting across datasets.
- Reduced storage overhead by eliminating unnecessary HTML elements.
- Better query performance when searching or processing text data.
## Defining a Whitelist of Allowed Tags
In many cases, some formatting should be preserved to maintain meaning. The following HTML tags are commonly retained:
| Tag | Purpose |
|---|---|
| `<b>`, `<strong>` | Bold text for emphasis |
| `<i>`, `<em>` | Italicised text for emphasis |
| `<u>` | Underlined text |
| `<p>` | Paragraphs for readability |
| `<br>` | Line breaks |
| `<ul>`, `<ol>`, `<li>` | Lists for structuring content |
| `<a>` (with `href`) | Hyperlinks |
| `<h1>`, `<h2>`, `<h3>` | Headings for structured content |
Any other tags, such as `<div>`, `<span>`, `<iframe>`, and `<script>`, should be removed or unwrapped.
## Using Python to Strip Unwanted Tags
The following Python script processes HTML content, removing all tags except those on the whitelist:
```python
from bs4 import BeautifulSoup

def clean_rich_text(html_content):
    """Strips unwanted HTML tags while preserving a whitelist."""
    allowed_tags = {"b", "strong", "i", "em", "u", "br", "p",
                    "ul", "ol", "li", "a", "h1", "h2", "h3"}
    soup = BeautifulSoup(html_content, "html.parser")
    # Delete tags whose contents should not survive as plain text
    # (unwrapping a <script> would leave its JavaScript behind as text)
    for tag in soup.find_all(["script", "style", "iframe"]):
        tag.decompose()  # Removes the tag and everything inside it
    # Unwrap any remaining disallowed tags
    for tag in soup.find_all():
        if tag.name not in allowed_tags:
            tag.unwrap()  # Removes the tag but keeps the text inside it
    return str(soup)
```
```python
# Example usage
html_input = """
<div><p><b>Important:</b> This is an <i>example</i> of a <a href='https://example.com'>rich text</a> import.</p>
<figure><iframe src="https://youtube.com/embed/example"></iframe></figure>
</div>
"""

cleaned_output = clean_rich_text(html_input)
print(cleaned_output)
```
## How It Works
- Parses HTML using BeautifulSoup.
- Checks each tag against the allowed whitelist.
- Removes disallowed tags while preserving their content.
- Returns a cleaned HTML string that maintains essential formatting.
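Note that unwrapping only removes disallowed tags; attributes such as `class`, `style`, or `onclick` remain on the tags that are kept. If you also want to normalise attributes, keeping only the `href` on links as the whitelist table suggests, a small follow-up pass can handle it. This is a minimal sketch, and `strip_attributes` is a hypothetical helper name rather than part of the original script:

```python
from bs4 import BeautifulSoup

def strip_attributes(html_content):
    """Drops all attributes, except href on <a> tags (hypothetical helper)."""
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in soup.find_all():
        if tag.name == "a" and tag.has_attr("href"):
            tag.attrs = {"href": tag["href"]}  # keep only the link target
        else:
            tag.attrs = {}  # discard every attribute on other tags
    return str(soup)

print(strip_attributes(
    '<p class="intro"><a href="https://example.com" target="_blank">link</a></p>'
))
# <p><a href="https://example.com">link</a></p>
```

Running this pass after tag stripping keeps the stored HTML predictable, which matters when downstream queries match on exact strings.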
## Integrating This in Data Pipelines
For large-scale processing before ingestion into a DWH:
- Pre-process HTML data before storing it in SQL.
- Apply transformations in an ETL pipeline using Python.
- Ensure text fields are stored in a query-friendly type, such as `NVARCHAR(MAX)` or `TEXT`.
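As a sketch of the transform step, the cleaner can be applied column-wise before the load stage. The example below assumes a pandas-based pipeline, and the `description` column name is purely illustrative:

```python
import pandas as pd
from bs4 import BeautifulSoup

def clean_rich_text(html_content):
    """Unwraps any tag not on the whitelist described earlier."""
    allowed_tags = {"b", "strong", "i", "em", "u", "br", "p",
                    "ul", "ol", "li", "a", "h1", "h2", "h3"}
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in soup.find_all():
        if tag.name not in allowed_tags:
            tag.unwrap()
    return str(soup)

# Hypothetical staging data; "description" is an assumed column name
df = pd.DataFrame({"description": ["<div><p><b>Note:</b> hello</p></div>"]})

# Transform step: clean every row before loading into the DWH
df["description"] = df["description"].map(clean_rich_text)
print(df.loc[0, "description"])  # <p><b>Note:</b> hello</p>
```

The same `map` call drops into most batch ETL jobs; for very large tables, applying the transformation in chunks keeps memory usage bounded.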
## Benefits of This Approach
- Preserves meaningful formatting (bold, italics, lists, links, headings).
- Prevents security risks by stripping scripts and embedded media.
- Ensures consistency across datasets by standardising text structure.
- Optimises searchability by removing excess metadata.

By implementing this method, you can maintain clean, structured text data in your Data Warehouse without losing valuable formatting elements.