Module: Lich::Util::TextStripper

Defined in:
documented/util/textstripper.rb

Overview

Strips formatting from text. This module provides methods to remove HTML, XML, and Markdown formatting.

Examples:

Stripping HTML from text

plain_text = Lich::Util::TextStripper.strip_html(html_text)

Defined Under Namespace

Modules: Mode

Constant Summary

MODE_TO_INPUT_FORMAT =
{
  Mode::HTML     => 'html',
  Mode::MARKUP   => 'GFM',
  Mode::MARKDOWN => 'GFM'
}.freeze

Class Method Summary

Class Method Details

.entity_to_char(entity) ⇒ String

Converts an HTML entity to its corresponding character

Examples:

Converting an entity

char = Lich::Util::TextStripper.entity_to_char(:nbsp) # => " "

Parameters:

  • entity (Symbol, #char)

    The HTML entity to convert: either a symbol such as :nbsp or an object that responds to char

Returns:

  • (String)

    The corresponding character



# File 'documented/util/textstripper.rb', line 268

def self.entity_to_char(entity)
  if entity.respond_to?(:char)
    entity.char
  else
    # Fallback for symbol entities
    case entity
    when :nbsp then ' '
    when :lt then '<'
    when :gt then '>'
    when :amp then '&'
    when :quot then '"'
    else entity.to_s
    end
  end
end
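
Both branches can be exercised without Kramdown. The sketch below is a standalone copy of the logic, for illustration only; FakeEntity is a hypothetical stand-in for whatever parser-supplied object responds to char.

```ruby
# Hypothetical stand-in for a parser-supplied entity object exposing #char.
FakeEntity = Struct.new(:char)

# Standalone copy of the fallback logic, for illustration only.
def entity_to_char(entity)
  return entity.char if entity.respond_to?(:char)

  case entity
  when :nbsp then ' '
  when :lt   then '<'
  when :gt   then '>'
  when :amp  then '&'
  when :quot then '"'
  else entity.to_s
  end
end

entity_to_char(FakeEntity.new('<')) # object branch: delegates to #char
entity_to_char(:amp)                # symbol branch: "&"
entity_to_char(:copy)               # unknown symbol falls through to "copy"
```

Note that an unrecognized symbol is returned as its name rather than dropped, so unknown entities show up in the output as plain words.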

.extract_text(element) ⇒ String

Extracts plain text from a Kramdown element

Examples:

Extracting text from Kramdown

text = Lich::Util::TextStripper.extract_text(kramdown_element)

Parameters:

  • element (Kramdown::Element)

    The Kramdown element to extract text from

Returns:

  • (String)

    The extracted plain text



# File 'documented/util/textstripper.rb', line 232

def self.extract_text(element)
  return '' if element.nil?

  case element.type
  when :text
    element.value
  when :entity
    # Convert HTML entities (e.g., &nbsp; -> space)
    entity_to_char(element.value)
  when :smart_quote
    # Convert smart quotes to regular quotes
    smart_quote_to_char(element.value)
  when :codeblock, :codespan
    # Return code content as plain text
    element.value
  when :br
    # Convert line breaks to newlines
    "\n"
  when :blank
    # Blank lines become newlines
    "\n"
  else
    # For all other elements (p, div, span, etc.), recursively process children
    if element.children
      element.children.map { |child| extract_text(child) }.join
    else
      ''
    end
  end
end
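
The traversal is a depth-first walk: leaf elements contribute their value, break and blank elements contribute a newline, and every other element contributes the concatenation of its children. A simplified sketch follows, assuming a hypothetical Node struct in place of Kramdown::Element and omitting the entity and smart-quote branches.

```ruby
# Hypothetical stand-in for a parsed element: a type, an optional value, and children.
Node = Struct.new(:type, :value, :children)

# Simplified traversal: leaves yield text, containers recurse into children.
def extract_text(element)
  return '' if element.nil?

  case element.type
  when :text, :codespan, :codeblock then element.value
  when :br, :blank                  then "\n"
  else (element.children || []).map { |child| extract_text(child) }.join
  end
end

tree = Node.new(:root, nil, [
  Node.new(:p, nil, [
    Node.new(:text, 'Hello ', nil),
    Node.new(:strong, nil, [Node.new(:text, 'world', nil)]),
    Node.new(:br, nil, nil),
    Node.new(:codespan, 'x = 1', nil)
  ])
])

puts extract_text(tree).inspect # => "Hello world\nx = 1"
```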

.extract_xml_text(element) ⇒ String

Extracts text content from an XML element

Examples:

Extracting XML text

text = Lich::Util::TextStripper.extract_xml_text(xml_element)

Parameters:

  • element (REXML::Element)

    The XML element to extract text from

Returns:

  • (String)

    The extracted text



# File 'documented/util/textstripper.rb', line 203

def self.extract_xml_text(element)
  return '' if element.nil?

  text_parts = []

  # Iterate through all child nodes
  element.each do |node|
    case node
    when REXML::Text
      # Regular text node
      text_parts << node.value
    when REXML::CData
      # CDATA section - extract the content
      text_parts << node.value
    when REXML::Element
      # Nested element - recursively extract text
      text_parts << extract_xml_text(node)
    end
    # Ignore other node types (comments, processing instructions, etc.)
  end

  text_parts.join
end
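
A small standalone run makes the node handling concrete. This sketch copies the traversal into a top-level method and assumes the rexml gem is installed; the sample XML is made up for illustration.

```ruby
require 'rexml/document'

# Standalone copy of the traversal, for illustration only.
def extract_xml_text(element)
  return '' if element.nil?

  parts = []
  element.each do |node|
    case node
    when REXML::Text    then parts << node.value # also matches REXML::CData, a Text subclass
    when REXML::Element then parts << extract_xml_text(node)
    end
    # Comments and processing instructions match neither branch and are skipped.
  end
  parts.join
end

xml = '<root>Hello <b>bold</b><!-- skipped --><![CDATA[ <raw> ]]></root>'
doc = REXML::Document.new(xml)
puts extract_xml_text(doc.root).inspect
```

The CDATA content comes through as literal text, angle brackets included, while the comment contributes nothing.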

.log_error(message, exception) ⇒ Object

Logs an error message along with the exception details

Examples:

Logging an error

Lich::Util::TextStripper.log_error("An error occurred", e)

Parameters:

  • message (String)

    The error message to log

  • exception (StandardError)

    The exception that occurred



# File 'documented/util/textstripper.rb', line 154

def self.log_error(message, exception)
  full_message = "TextStripper: #{message} (#{exception.class}: #{exception.message}). Returning original."
  respond(full_message)
  Lich.log(full_message)
end
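
The message format can be checked outside Lich. In this sketch, format_error is a hypothetical helper that builds the same string; the respond and Lich.log calls are left out because they exist only inside a running Lich session.

```ruby
# Hypothetical helper: builds the same string that log_error emits.
def format_error(message, exception)
  "TextStripper: #{message} (#{exception.class}: #{exception.message}). Returning original."
end

puts format_error('Failed to parse html', RuntimeError.new('boom'))
# prints: TextStripper: Failed to parse html (RuntimeError: boom). Returning original.
```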

.requires_kramdown?(mode) ⇒ Boolean

Checks if Kramdown is required for the given mode

Examples:

Checking Kramdown requirement

Lich::Util::TextStripper.requires_kramdown?(:markup) # => true

Parameters:

  • mode (Symbol)

    The mode to check

Returns:

  • (Boolean)

    True if Kramdown is required, false otherwise



# File 'documented/util/textstripper.rb', line 79

def self.requires_kramdown?(mode)
  MODE_TO_INPUT_FORMAT.key?(mode)
end
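
Because the check is a plain hash-key lookup, any mode without an entry in MODE_TO_INPUT_FORMAT (such as XML) is treated as not needing Kramdown. The sketch below assumes the Mode constants are plain symbols; the actual values are not shown on this page.

```ruby
# Assumed values: the Mode constants are taken to be plain symbols here.
module Mode
  HTML     = :html
  XML      = :xml
  MARKUP   = :markup
  MARKDOWN = :markdown
end

MODE_TO_INPUT_FORMAT = {
  Mode::HTML     => 'html',
  Mode::MARKUP   => 'GFM',
  Mode::MARKDOWN => 'GFM'
}.freeze

def requires_kramdown?(mode)
  MODE_TO_INPUT_FORMAT.key?(mode)
end

requires_kramdown?(Mode::MARKDOWN) # true: the mode maps to a Kramdown input format
requires_kramdown?(Mode::XML)      # false: no entry, XML is handled by REXML
requires_kramdown?('html')         # false: a String is not a key until normalized to a Symbol
```

The last call shows why strip validates and normalizes the mode before asking this question.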

.smart_quote_to_char(quote_type) ⇒ String

Converts a smart quote type to its corresponding character

Examples:

Converting a smart quote

char = Lich::Util::TextStripper.smart_quote_to_char(:ldquo) # => "\""

Parameters:

  • quote_type (Symbol)

    The type of smart quote

Returns:

  • (String)

    The corresponding character



# File 'documented/util/textstripper.rb', line 289

def self.smart_quote_to_char(quote_type)
  case quote_type
  when :lsquo, :rsquo then "'"
  when :ldquo, :rdquo then '"'
  else quote_type.to_s
  end
end

.strip(text, mode) ⇒ String

Strips formatting from the given text based on the specified mode

Examples:

Stripping text

plain_text = Lich::Util::TextStripper.strip(html_text, Lich::Util::TextStripper::Mode::HTML)

Parameters:

  • text (String)

    The text to be stripped

  • mode (Symbol, String)

    The mode to use for stripping; validated and normalized to a symbol by validate_mode

Returns:

  • (String)

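    The stripped text; an empty string if the input is nil or empty; the original text if Kramdown is unavailable or parsing fails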

Raises:

  • (ArgumentError)

    If the mode is invalid



# File 'documented/util/textstripper.rb', line 90

def self.strip(text, mode)
  return "" if text.nil? || text.empty?

  # Validate mode outside the begin/rescue block below.
  # A rescue attached to the def itself would also catch this ArgumentError,
  # so the parsing step gets its own block and the error reaches the caller.
  validated_mode = validate_mode(mode)

  # Check if kramdown is required and available
  if requires_kramdown?(validated_mode) && !KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  begin
    # Route to appropriate parsing method based on mode
    case validated_mode
    when Mode::XML
      strip_xml_with_rexml(text)
    else
      strip_with_kramdown(text, validated_mode)
    end
  rescue Kramdown::Error => e
    # Handle Kramdown parsing errors (HTML/MARKUP/MARKDOWN modes)
    log_error("Failed to parse #{validated_mode}", e)
    text
  rescue REXML::ParseException => e
    # Handle REXML parsing errors (XML mode)
    log_error("Failed to parse #{validated_mode}", e)
    text
  rescue StandardError => e
    # Catch any other unexpected errors during parsing
    log_error("Unexpected error during #{validated_mode} parsing", e)
    text
  end
end
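
The documented contract (ArgumentError propagates, parse failures return the original text) depends on where the rescue sits. In Ruby, a rescue attached directly to a def covers the whole method body, validation included, whereas an explicit begin ... end limits it to the parsing step. A minimal sketch of the difference, using hypothetical method names and upcase as a stand-in for parsing:

```ruby
# Hypothetical validator: raises ArgumentError for anything but :html or :xml.
def check_mode(mode)
  raise ArgumentError, "Invalid mode: #{mode}" unless %i[html xml].include?(mode)

  mode
end

# Method-level rescue: covers the whole body, so the ArgumentError is swallowed.
def strip_method_rescue(text, mode)
  check_mode(mode)
  text.upcase
rescue StandardError
  text
end

# Explicit begin/end: only the parsing step is protected; validation errors propagate.
def strip_scoped_rescue(text, mode)
  check_mode(mode)
  begin
    text.upcase
  rescue StandardError
    text
  end
end

strip_method_rescue('abc', :bogus) # returns "abc"; no exception reaches the caller

begin
  strip_scoped_rescue('abc', :bogus)
rescue ArgumentError => e
  puts "propagated: #{e.message}" # prints "propagated: Invalid mode: bogus"
end
```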

.strip_html(text) ⇒ String

Strips HTML tags from the given text

Examples:

Stripping HTML

plain_text = Lich::Util::TextStripper.strip_html(html_text)

Parameters:

  • text (String)

    The HTML text to be stripped

Returns:

  • (String)

    The stripped text, or the original text unchanged if Kramdown is not loaded (the method prints a restart notice rather than raising)



# File 'documented/util/textstripper.rb', line 303

def self.strip_html(text)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  strip_with_kramdown(text, Mode::HTML)
end

.strip_markdown(text) ⇒ String

Strips Markdown formatting from the given text

Examples:

Stripping Markdown

plain_text = Lich::Util::TextStripper.strip_markdown(markdown_text)

Parameters:

  • text (String)

    The Markdown text to be stripped

Returns:

  • (String)

    The stripped text, or the original text unchanged if Kramdown is not loaded (the method prints a restart notice rather than raising)



# File 'documented/util/textstripper.rb', line 342

def self.strip_markdown(text)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  strip_with_kramdown(text, Mode::MARKDOWN)
end

.strip_markup(text) ⇒ String

Strips Markdown/markup formatting from the given text

Examples:

Stripping markup

plain_text = Lich::Util::TextStripper.strip_markup(markup_text)

Parameters:

  • text (String)

    The text to be stripped

Returns:

  • (String)

    The stripped text, or the original text unchanged if Kramdown is not loaded (the method prints a restart notice rather than raising)



# File 'documented/util/textstripper.rb', line 327

def self.strip_markup(text)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  strip_with_kramdown(text, Mode::MARKUP)
end

.strip_with_kramdown(text, mode) ⇒ String

Strips formatting from text using Kramdown

Examples:

Stripping with Kramdown

plain_text = Lich::Util::TextStripper.strip_with_kramdown(markdown_text, Lich::Util::TextStripper::Mode::MARKDOWN)

Parameters:

  • text (String)

    The text to be stripped

  • mode (Symbol)

    The mode to use for stripping

Returns:

  • (String)

    The stripped text, or the original text unchanged if Kramdown is not loaded (the method prints a restart notice rather than raising)



# File 'documented/util/textstripper.rb', line 167

def self.strip_with_kramdown(text, mode)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  input_format = MODE_TO_INPUT_FORMAT[mode]
  doc = Kramdown::Document.new(text, input: input_format)

  # Extract plain text from the parsed document by traversing the element tree
  extract_text(doc.root).strip
end

.strip_xml(text) ⇒ String

Strips XML tags from the given text

Examples:

Stripping XML

plain_text = Lich::Util::TextStripper.strip_xml(xml_text)

Parameters:

  • text (String)

    The XML text to be stripped

Returns:

  • (String)

    The stripped text



# File 'documented/util/textstripper.rb', line 317

def self.strip_xml(text)
  strip_xml_with_rexml(text)
end

.strip_xml_with_rexml(text) ⇒ String

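Strips XML tags from the given text using REXML. The text is wrapped in a synthetic root element before parsing; if that fails, it is parsed again as a single CDATA section, so the input comes back as literal text.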

Examples:

Stripping XML

plain_text = Lich::Util::TextStripper.strip_xml_with_rexml(xml_text)

Parameters:

  • text (String)

    The XML text to be stripped

Returns:

  • (String)

    The stripped text



# File 'documented/util/textstripper.rb', line 185

def self.strip_xml_with_rexml(text)
  # Try to parse as-is first (in case it's already well-formed XML)
  begin
    doc = REXML::Document.new("<root>#{text}</root>")
  rescue REXML::ParseException
    # If parsing fails due to unescaped characters, wrap in CDATA
    doc = REXML::Document.new("<root><![CDATA[#{text}]]></root>")
  end

  # Extract all text content from the document
  extract_xml_text(doc.root).strip
end
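
The two parsing paths can be seen side by side in a standalone run. This sketch assumes the rexml gem is installed; parse_fragment is a hypothetical helper that mirrors the begin/rescue above, and the sample strings are made up.

```ruby
require 'rexml/document'

# Hypothetical helper: parse a fragment under a synthetic root element,
# falling back to a CDATA wrapper when the fragment is not well-formed.
def parse_fragment(text)
  REXML::Document.new("<root>#{text}</root>")
rescue REXML::ParseException
  REXML::Document.new("<root><![CDATA[#{text}]]></root>")
end

well_formed = parse_fragment('You see <b>a troll</b>.')
malformed   = parse_fragment('a <b>unclosed tag')

puts well_formed.root.elements['b'].text # nested element parsed normally
puts malformed.root.text                 # whole input kept as literal text
```

In the fallback case the tags are not removed at all: the malformed input is returned verbatim, which is a deliberate trade of stripping for not losing text.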

.validate_mode(mode) ⇒ Symbol

Validates the given mode and normalizes it to a symbol

Examples:

Validating a mode

Lich::Util::TextStripper.validate_mode(:html) # => :html

Parameters:

  • mode (Symbol, String)

    The mode to validate

Returns:

  • (Symbol)

    The normalized mode

Raises:

  • (ArgumentError)

    If the mode is invalid



# File 'documented/util/textstripper.rb', line 130

def self.validate_mode(mode)
  # Ensure mode is a Symbol or String
  unless mode.is_a?(Symbol) || mode.is_a?(String)
    raise ArgumentError,
          "Mode must be a Symbol or String, got #{mode.class}"
  end

  # Normalize to symbol
  normalized_mode = mode.to_sym

  # Validate against allowed modes
  unless Mode.valid?(normalized_mode)
    raise ArgumentError,
          "Invalid mode: #{mode}. Use one of: #{Mode.list}"
  end

  normalized_mode
end
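
The two failure paths produce different messages: one for a wrong argument type, one for an unknown mode name. The sketch below assumes a hypothetical Mode module, since the real Mode constants and its valid? and list implementations are not shown on this page.

```ruby
# Hypothetical stand-in for the Mode module (assumed modes and list format).
module Mode
  ALL = %i[html xml markup markdown].freeze

  def self.valid?(mode)
    ALL.include?(mode)
  end

  def self.list
    ALL.map(&:inspect).join(', ')
  end
end

# Standalone copy of the validation logic, for illustration only.
def validate_mode(mode)
  unless mode.is_a?(Symbol) || mode.is_a?(String)
    raise ArgumentError, "Mode must be a Symbol or String, got #{mode.class}"
  end

  normalized = mode.to_sym
  raise ArgumentError, "Invalid mode: #{mode}. Use one of: #{Mode.list}" unless Mode.valid?(normalized)

  normalized
end

validate_mode('html') # String input is normalized to :html

[42, :pdf].each do |bad|
  begin
    validate_mode(bad)
  rescue ArgumentError => e
    puts e.message # type error for 42, unknown-mode error for :pdf
  end
end
```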