Module: Lich::Util::TextStripper

Defined in:
documented/util/textstripper.rb

Overview

Provides methods to strip formatting from text.

This module supports four stripping modes: HTML, XML, Markdown, and markup. HTML, Markdown, and markup text is parsed with Kramdown; XML text is parsed with REXML.

Defined Under Namespace

Modules: Mode

Constant Summary

MODE_TO_INPUT_FORMAT =
{
  Mode::HTML     => 'html',
  Mode::MARKUP   => 'GFM',
  Mode::MARKDOWN => 'GFM'
}.freeze

Class Method Details

.entity_to_char(entity) ⇒ String

Converts an HTML entity to its corresponding character.

Parameters:

  • entity (Symbol, #char)

    the entity to convert: either an object that responds to char, or a fallback Symbol such as :nbsp

Returns:

  • (String)

    the corresponding character



# File 'documented/util/textstripper.rb', line 242

def self.entity_to_char(entity)
  if entity.respond_to?(:char)
    entity.char
  else
    # Fallback for symbol entities
    case entity
    when :nbsp then ' '
    when :lt then '<'
    when :gt then '>'
    when :amp then '&'
    when :quot then '"'
    else entity.to_s
    end
  end
end
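
Both branches can be exercised without Kramdown. The following is a minimal sketch, not the module's own test code: `FakeEntity` is a hypothetical stand-in for any object that responds to `char`, and the method body is a local copy of the lookup shown above so the example runs on its own.

```ruby
# Hypothetical stand-in for an entity object that exposes #char.
FakeEntity = Struct.new(:name, :char)

# Local copy of the lookup logic shown above.
def entity_to_char(entity)
  return entity.char if entity.respond_to?(:char)

  # Fallback for plain Symbol entities
  case entity
  when :nbsp then ' '
  when :lt then '<'
  when :gt then '>'
  when :amp then '&'
  when :quot then '"'
  else entity.to_s
  end
end

results = [
  entity_to_char(FakeEntity.new('copy', "\u00A9")), # object branch
  entity_to_char(:amp),                             # known symbol
  entity_to_char(:unknown)                          # unrecognized symbol
]
puts results.inspect
```

An unrecognized symbol is not dropped; it is returned as its own name, so unexpected entities remain visible in the output.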

.extract_text(element) ⇒ String

Extracts plain text from a Kramdown element.

Parameters:

  • element (Kramdown::Element)

    the element to extract text from

Returns:

  • (String)

    the extracted text



# File 'documented/util/textstripper.rb', line 208

def self.extract_text(element)
  return '' if element.nil?

  case element.type
  when :text
    element.value
  when :entity
    # Convert HTML entities (e.g., &nbsp; -> space)
    entity_to_char(element.value)
  when :smart_quote
    # Convert smart quotes to regular quotes
    smart_quote_to_char(element.value)
  when :codeblock, :codespan
    # Return code content as plain text
    element.value
  when :br
    # Convert line breaks to newlines
    "\n"
  when :blank
    # Blank lines become newlines
    "\n"
  else
    # For all other elements (p, div, span, etc.), recursively process children
    if element.children
      element.children.map { |child| extract_text(child) }.join
    else
      ''
    end
  end
end
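
The recursive traversal can be illustrated without Kramdown. This is a simplified sketch under stated assumptions: `Node` is a hypothetical Struct that mimics only the `type`, `value`, and `children` readers used above, and the `:entity` and `:smart_quote` branches are omitted for brevity.

```ruby
# Hypothetical stand-in for a parsed element (type, value, children).
Node = Struct.new(:type, :value, :children)

# Simplified version of the traversal shown above.
def extract_plain(node)
  return '' if node.nil?

  case node.type
  when :text, :codespan, :codeblock then node.value
  when :br, :blank then "\n"
  else
    # Container elements contribute only the text of their children.
    (node.children || []).map { |child| extract_plain(child) }.join
  end
end

tree = Node.new(:root, nil, [
  Node.new(:p, nil, [
    Node.new(:text, 'Hello ', nil),
    Node.new(:em, nil, [Node.new(:text, 'world', nil)]),
    Node.new(:br, nil, nil),
    Node.new(:codespan, 'x = 1', nil)
  ])
])

plain = extract_plain(tree)
puts plain.inspect # "Hello world\nx = 1"
```

Leaf nodes return their own value, while container nodes such as `:p` and `:em` disappear and leave only their children's text behind.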

.extract_xml_text(element) ⇒ String

Extracts text content from an XML element.

Parameters:

  • element (REXML::Element)

    the XML element to extract text from

Returns:

  • (String)

    the extracted text



# File 'documented/util/textstripper.rb', line 181

def self.extract_xml_text(element)
  return '' if element.nil?

  text_parts = []

  # Iterate through all child nodes
  element.each do |node|
    case node
    when REXML::Text
      # Regular text node
      text_parts << node.value
    when REXML::CData
      # CDATA section - extract the content
      text_parts << node.value
    when REXML::Element
      # Nested element - recursively extract text
      text_parts << extract_xml_text(node)
    end
    # Ignore other node types (comments, processing instructions, etc.)
  end

  text_parts.join
end
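
The same walk can be tried directly against REXML. The sketch below is illustrative only; `collect_text` is a hypothetical name, and the sample document is invented for the example.

```ruby
require 'rexml/document'

# Collect text from text nodes and nested elements; skip everything else.
def collect_text(element)
  parts = []
  element.each do |node|
    case node
    when REXML::Text    then parts << node.value # also matches CDATA nodes
    when REXML::Element then parts << collect_text(node)
    end
  end
  parts.join
end

xml = '<root>Hi <b>there</b><!-- note --> <![CDATA[a < b]]></root>'
doc = REXML::Document.new(xml)
text = collect_text(doc.root)
puts text.inspect
```

`REXML::CData` is a subclass of `REXML::Text`, so a single `when REXML::Text` branch is enough to cover both node types. The comment node is ignored because it matches neither branch.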

.log_error(message, exception) ⇒ void

This method returns an undefined value.

Logs an error message along with the exception details.

Parameters:

  • message (String)

    the error message to log

  • exception (StandardError)

    the exception that occurred



# File 'documented/util/textstripper.rb', line 138

def self.log_error(message, exception)
  full_message = "TextStripper: #{message} (#{exception.class}: #{exception.message}). Returning original."
  respond(full_message)
  Lich.log(full_message)
end

.requires_kramdown?(mode) ⇒ Boolean

Determines if Kramdown is required for the given mode.

Parameters:

  • mode (Symbol)

    the mode to check

Returns:

  • (Boolean)

    true if Kramdown is required, false otherwise



# File 'documented/util/textstripper.rb', line 67

def self.requires_kramdown?(mode)
  MODE_TO_INPUT_FORMAT.key?(mode)
end
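
Because the check is a plain hash lookup, any mode without an entry in MODE_TO_INPUT_FORMAT (XML in this module) does not need Kramdown. A minimal sketch, assuming the mode constants are lowercase symbols; the `SketchMode` values are hypothetical, since the real `Mode` constants are not shown on this page.

```ruby
# Hypothetical mode values standing in for the real Mode constants.
module SketchMode
  HTML = :html
  XML = :xml
  MARKUP = :markup
  MARKDOWN = :markdown
end

# Mirrors the shape of MODE_TO_INPUT_FORMAT above.
SKETCH_FORMATS = {
  SketchMode::HTML => 'html',
  SketchMode::MARKUP => 'GFM',
  SketchMode::MARKDOWN => 'GFM'
}.freeze

def sketch_requires_kramdown?(mode)
  SKETCH_FORMATS.key?(mode)
end

all_modes = [SketchMode::HTML, SketchMode::XML, SketchMode::MARKUP, SketchMode::MARKDOWN]
kramdown_modes = all_modes.select { |mode| sketch_requires_kramdown?(mode) }
puts kramdown_modes.inspect # [:html, :markup, :markdown]
```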

.smart_quote_to_char(quote_type) ⇒ String

Converts a smart quote type to its corresponding character.

Parameters:

  • quote_type (Symbol)

    the type of smart quote

Returns:

  • (String)

    the corresponding character



# File 'documented/util/textstripper.rb', line 261

def self.smart_quote_to_char(quote_type)
  case quote_type
  when :lsquo, :rsquo then "'"
  when :ldquo, :rdquo then '"'
  else quote_type.to_s
  end
end

.strip(text, mode) ⇒ String

Strips formatting from the given text based on the specified mode.

Examples:

Strip HTML tags

TextStripper.strip("<p>Hello</p>", TextStripper::Mode::HTML)

Parameters:

  • text (String)

    the text to be stripped

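  • mode (Symbol, String)

    the mode to use for stripping; it is validated and normalized with validate_mode

Returns:

  • (String)

    the stripped text, or the original text if parsing fails or Kramdown is required but not loaded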



# File 'documented/util/textstripper.rb', line 77

def self.strip(text, mode)
  return "" if text.nil? || text.empty?

  # Validate mode OUTSIDE the begin/rescue block below.
  # A method-level rescue would also catch the ArgumentError raised here
  # (ArgumentError is a StandardError), so parsing gets its own block.
  validated_mode = validate_mode(mode)

  # Check if kramdown is required and available
  if requires_kramdown?(validated_mode) && !KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  begin
    # Route to appropriate parsing method based on mode
    case validated_mode
    when Mode::XML
      strip_xml_with_rexml(text)
    else
      strip_with_kramdown(text, validated_mode)
    end
  rescue Kramdown::Error => e
    # Handle Kramdown parsing errors (HTML/MARKUP/MARKDOWN modes)
    log_error("Failed to parse #{validated_mode}", e)
    text
  rescue REXML::ParseException => e
    # Handle REXML parsing errors (XML mode)
    log_error("Failed to parse #{validated_mode}", e)
    text
  rescue StandardError => e
    # Catch any other unexpected errors during parsing
    log_error("Unexpected error during #{validated_mode} parsing", e)
    text
  end
end
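
The intended control flow is that an invalid mode raises to the caller, while a parse failure falls back to the original text. That only works if validation sits outside the rescuing block, because `ArgumentError` is a `StandardError`. The sketch below isolates that pattern; `ParseFailure`, `safe_strip`, and the regex-based tag removal are hypothetical stand-ins, not the module's parsers.

```ruby
# Hypothetical parser error used only in this sketch.
class ParseFailure < StandardError; end

def safe_strip(text, mode)
  return '' if text.nil? || text.empty?

  # Validation happens outside begin/rescue, so ArgumentError reaches the caller.
  raise ArgumentError, "Invalid mode: #{mode}" unless %i[html xml].include?(mode)

  begin
    # Stand-in "parser": reject unbalanced angle brackets, then drop tags.
    raise ParseFailure, 'unbalanced tag' if text.count('<') != text.count('>')

    text.gsub(/<[^>]*>/, '')
  rescue ParseFailure
    # Parse errors fall back to the original text.
    text
  end
end

puts safe_strip('<p>Hi</p>', :html) # Hi
puts safe_strip('<p Hi', :html)     # <p Hi
```

Calling `safe_strip('x', :bogus)` raises `ArgumentError` instead of returning `'x'`, which is the behavior that separating validation from parsing is meant to preserve.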

.strip_html(text) ⇒ String

Note:

Kramdown must be loaded for this method to work; otherwise the input text is returned unchanged.

Strips HTML tags from the given text.

Parameters:

  • text (String)

    the HTML text to be stripped

Returns:

  • (String)

    the stripped text



# File 'documented/util/textstripper.rb', line 273

def self.strip_html(text)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  strip_with_kramdown(text, Mode::HTML)
end

.strip_markdown(text) ⇒ String

Note:

Kramdown must be loaded for this method to work; otherwise the input text is returned unchanged.

Strips Markdown formatting from the given text.

Parameters:

  • text (String)

    the Markdown text to be stripped

Returns:

  • (String)

    the stripped text



# File 'documented/util/textstripper.rb', line 306

def self.strip_markdown(text)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  strip_with_kramdown(text, Mode::MARKDOWN)
end

.strip_markup(text) ⇒ String

Note:

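Kramdown must be loaded for this method to work; otherwise the input text is returned unchanged.

Strips markup formatting from the given text. Markup mode is parsed with the same Kramdown input format ('GFM') as Markdown mode.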

Parameters:

  • text (String)

    the markup text to be stripped

Returns:

  • (String)

    the stripped text



# File 'documented/util/textstripper.rb', line 293

def self.strip_markup(text)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  strip_with_kramdown(text, Mode::MARKUP)
end

.strip_with_kramdown(text, mode) ⇒ String

Note:

Kramdown must be loaded for this method to work; otherwise the input text is returned unchanged.

Strips formatting from text using Kramdown.

Parameters:

  • text (String)

    the text to be stripped

  • mode (Symbol)

    the mode to use for stripping

Returns:

  • (String)

    the stripped text



# File 'documented/util/textstripper.rb', line 149

def self.strip_with_kramdown(text, mode)
  unless KRAMDOWN_LOADED
    respond("Need to restart Lich5 in order to use this method.")
    return text
  end

  input_format = MODE_TO_INPUT_FORMAT[mode]
  doc = Kramdown::Document.new(text, input: input_format)

  # Extract plain text from the parsed document by traversing the element tree
  extract_text(doc.root).strip
end

.strip_xml(text) ⇒ String

Strips XML tags from the given text.

Parameters:

  • text (String)

    the XML text to be stripped

Returns:

  • (String)

    the stripped text



# File 'documented/util/textstripper.rb', line 285

def self.strip_xml(text)
  strip_xml_with_rexml(text)
end

.strip_xml_with_rexml(text) ⇒ String

Strips XML tags from the given text using REXML.

Parameters:

  • text (String)

    the XML text to be stripped

Returns:

  • (String)

    the stripped text



# File 'documented/util/textstripper.rb', line 165

def self.strip_xml_with_rexml(text)
  # Try to parse as-is first (in case it's already well-formed XML)
  begin
    doc = REXML::Document.new("<root>#{text}</root>")
  rescue REXML::ParseException
    # If parsing fails due to unescaped characters, wrap in CDATA
    doc = REXML::Document.new("<root><![CDATA[#{text}]]></root>")
  end

  # Extract all text content from the document
  extract_xml_text(doc.root).strip
end
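
The parse-then-fallback step can be observed in isolation. In this sketch, `parse_fragment` is a hypothetical helper that reproduces only the two `REXML::Document.new` attempts shown above, and the sample inputs are invented for illustration.

```ruby
require 'rexml/document'

# Try the text as an XML fragment; on a parse error, treat it as literal text.
def parse_fragment(text)
  REXML::Document.new("<root>#{text}</root>")
rescue REXML::ParseException
  REXML::Document.new("<root><![CDATA[#{text}]]></root>")
end

well_formed = parse_fragment('<b>bold</b> text')
malformed   = parse_fragment('<b>unclosed')

# The well-formed input is parsed into a real <b> element.
puts well_formed.root.elements.size

# The malformed input survives verbatim inside a single CDATA node.
first_child = malformed.root.children.first
puts first_child.class
puts first_child.value
```

The fallback assumes the text does not itself contain the sequence `]]>`, which would end the CDATA section early; such input is not covered by this sketch.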

.validate_mode(mode) ⇒ Symbol

Validates the given mode and normalizes it to a symbol.

Parameters:

  • mode (Symbol, String)

    the mode to validate

Returns:

  • (Symbol)

    the normalized mode

Raises:

  • ArgumentError if the mode is invalid



# File 'documented/util/textstripper.rb', line 115

def self.validate_mode(mode)
  # Ensure mode is a Symbol or String
  unless mode.is_a?(Symbol) || mode.is_a?(String)
    raise ArgumentError,
          "Mode must be a Symbol or String, got #{mode.class}"
  end

  # Normalize to symbol
  normalized_mode = mode.to_sym

  # Validate against allowed modes
  unless Mode.valid?(normalized_mode)
    raise ArgumentError,
          "Invalid mode: #{mode}. Use one of: #{Mode.list}"
  end

  normalized_mode
end
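
The two validation stages (type check, then membership check) can be sketched as follows. `DemoMode` is a hypothetical stand-in for the `Mode` module, whose `valid?` and `list` implementations and constant values are not shown on this page; lowercase symbols are an assumption.

```ruby
# Hypothetical stand-in for Mode with assumed lowercase symbol values.
module DemoMode
  ALL = %i[html xml markup markdown].freeze

  def self.valid?(mode)
    ALL.include?(mode)
  end

  def self.list
    ALL.join(', ')
  end
end

def normalize_mode(mode)
  # Stage 1: only Symbols and Strings are accepted.
  unless mode.is_a?(Symbol) || mode.is_a?(String)
    raise ArgumentError, "Mode must be a Symbol or String, got #{mode.class}"
  end

  # Stage 2: normalize, then check against the allowed modes.
  normalized = mode.to_sym
  unless DemoMode.valid?(normalized)
    raise ArgumentError, "Invalid mode: #{mode}. Use one of: #{DemoMode.list}"
  end

  normalized
end

puts normalize_mode('html').inspect # :html
puts normalize_mode(:xml).inspect   # :xml
```

Normalization uses `to_sym` without changing case, so under the assumption above `'HTML'` becomes `:HTML` and would be rejected as invalid.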