NAME
    HTML::ParagraphSplit - Change text containing HTML into a formatted HTML
    fragment

SYNOPSIS
      use HTML::ParagraphSplit qw( split_paragraphs_to_text split_paragraphs );
      use HTML::TreeBuilder;

      # Read in from a file handle, output text
      print split_paragraphs_to_text(\*ARGV);

      # Convert text to nicely split text
      print split_paragraphs_to_text(<<END_OF_MARKUP);
      This is one paragraph.

      This is another paragraph.
      END_OF_MARKUP

      # Convert to an HTML::Element object instead
      my $tree = split_paragraphs($html_input);
      print $tree->as_HTML;

      # Create your own HTML::Element object and split it
      my $tree = HTML::TreeBuilder->new;
      # Preserve whitespace so double line breaks survive parsing
      $tree->no_space_compacting(1);
      $tree->parse($text);
      $tree->eof;
      split_paragraphs($tree);
      my $html_fragment = $tree->guts->as_HTML;
      $tree->delete;

DESCRIPTION
    The purpose of this library is to provide methods for converting double
    line-breaks in text to HTML paragraphs (i.e., wrap in "<P></P>" tags).
    It can also convert single line breaks into "<BR>" tags. In addition,
    markup can be mixed in as well and this library will
    DoTheRightThing(tm). There are a number of additional options that can
    modify how the paragraph splits are performed.

    For example, given this input (the initial text was generated by
    DadaDodo <http://www.jwz.org/dadadodo/dadadodo.cgi>, btw):

      I see over the <strong>noise</strong> but I don't understand sometimes.

      Fortunately, we've traded the club you can't skimp on the do because
      This week! Presented by code Lounge: except, for controlling Knox video
      cameras Linux well that the reason, the runlevel to run some reason
      number of coming back next server; sees you Control
      <a href="blah.html">display</a> a steep and I tagged with
      specifications of six feet, moving to Code, flyer main room motel
      balcony, <p>and airflow in which define the ability to run a common. We
      need to current in a manner
      <pre>than six months and that already gotten a webcast</pre> is roughly
      long and bulk: and up the src page: and updates on a: user will
      probably does this.

    This would be converted into the following:

      <p>I see over the <strong>noise</strong> but I don't understand sometimes.</p>

      <ol><li>One</li><li>Two</li><li>Three</li></ol>

      <p>Fortunately, we've traded the club you can't skimp on the do because
      This week! Presented by code Lounge: except, for controlling Knox video
      cameras Linux well that the reason, the runlevel to run some reason
      number of coming back next server; sees you Control
      <a href="blah.html">display</a> a steep and I tagged with
      specifications of six feet, moving to Code, flyer main room motel
      balcony,</p>

      <p>and airflow in which define the ability to run a common. We need to
      current in a manner</p>

      <pre>than six months and that already gotten a webcast</pre>

      <p>is roughly long and bulk: and up the src page: and updates on a: user
      will probably does this.</p>

    Authors who want to use some HTML markup, but don't really want to cope
    with getting their paragraph tags right, can use this filter to format
    their work the right way.

    This library depends upon HTML::TreeBuilder and HTML::Tagset. You may
    wish to see the documentation for those libraries for additional
    details.

METHODS
    The primary method of this library is "split_paragraphs()". An
    additional method, "split_paragraphs_to_text()", is provided to simplify
    the task of generating output without having to fuss with
    HTML::TreeBuilder.

    $element = split_paragraphs($handle, \%options)
    $element = split_paragraphs($text, \%options)
    $element = split_paragraphs($element, \%options)
        This method has three forms, which vary only in the input they
        receive. If the first argument is a file handle, $handle, then that
        handle will be read, parsed, and split. If the first argument is a
        scalar, $text, then that text will be parsed and split. If the first
        argument is a subclass of HTML::Element, $element, then the tree
        represented by the node will be traversed and split.

        If you use the third form, your tree will be modified in place and
        the same tree will be returned.
        You will want to clone the tree ahead of time if you need to
        preserve the old tree.

        All forms take an optional second parameter, "\%options", which is a
        reference to a hash of options which modify the default behavior.
        See below for details.

        The first two forms perform an extra step, but are handled
        essentially the same after the input is parsed into an HTML::Element
        using HTML::TreeBuilder. This is done using the defaults, except
        that "no_space_compacting()" is set to a true value (otherwise, we
        lose any double returns that were in the original text). If you
        parse your own trees, you'll probably want to do the same.

        This method will search down the element tree and find the first
        node with non-implicit child nodes and use that as the root of
        operations. The "split_paragraphs()" method then walks the tree and
        wraps any undecorated text node in a paragraph. Any double line
        break discovered will result in multiple paragraphs. Any paragraph
        content elements (as defined by %is_Possible_Strict_P_Content of
        HTML::Tagset) will be inserted into the paragraph elements as if
        they were text. Any block level tags (i.e., not in
        %is_Possible_Strict_P_Content) cause a paragraph break immediately
        before and after such elements.

        Any text found within a block-level node may also be paragraphified.
        Those blocks of text will not be wrapped in paragraphs unless they
        contain a double-line break (that way we're not inserting "P"-tags
        without an explicit need for them).

        Note also that this will insert "P"-tags conservatively. If more
        than two line-breaks are present, even if they are mixed with other
        white space, all of that white space will be treated as the same
        paragraph break. No empty "P"-tags or "P"-tags containing only
        white space will be inserted (mostly). The only exception is when
        the white space is created by white space entities, such as
        "&nbsp;".

        All of that is the default behavior.
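
        The conservative break handling described above can be illustrated
        with a minimal plain-text sketch. This is not the module's
        implementation: "naive_paragraphs" is a hypothetical helper that
        uses only core Perl, ignores markup and entities entirely, and
        assumes that a paragraph break is any run of white space containing
        at least two line breaks.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::ParagraphSplit. It approximates
# the default splitting rule for plain text only (no markup awareness).
sub naive_paragraphs {
    my ($text) = @_;

    my @paragraphs;

    # Any run of white space holding two or more line breaks is one break,
    # no matter how many extra line breaks or spaces are mixed in.
    for my $chunk (split /\s*\n\s*\n\s*/, $text) {
        $chunk =~ s/^\s+|\s+$//g;     # trim white space at the edges
        next unless length $chunk;    # never emit an empty paragraph
        push @paragraphs, "<p>$chunk</p>";
    }

    return @paragraphs;
}

# Four line breaks mixed with a space and a tab still give one split;
# the single line break inside the second paragraph is left alone.
print "$_\n" for naive_paragraphs("One.\n\n \n\t\nTwo,\nstill two.\n\n");
```

        The real "split_paragraphs()" works on an HTML::Element tree rather
        than on a flat string, so treat this only as a picture of the
        white space rule.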
        That behavior may be modified by the second parameter, which is
        used to specify options. Here's the list of options and what they
        do:

        p_on_breaks_only => 1
            If this option is used, then paragraphs will not be added to
            your text unless there is at least one double-line break. This
            option is used internally to make sure nested elements do not
            have extra "P"-tags unnecessarily.

        single_line_breaks_to_br => 1
            If this option is given, then single line breaks will also be
            converted to "BR"-tags.

        br_only_if_can_tighten => 1
            This option modifies the "single_line_breaks_to_br" option by
            specifying that "BR"-tags are not added within blocks that
            cannot be tightened (i.e., aren't set in %canTighten of
            HTML::Tagset). This can be useful for preventing double-line
            breaks from appearing inside "PRE"-tags or "TEXTAREA"-tags
            because of added "BR"-tags.

        use_br_instead_of_p => 1
            As an alternative to using "P"-tags at all, this can also just
            place "BR"-tags everywhere instead. Instead of inserting
            "P"-tags whenever a double line-break is encountered, two
            "BR"-tags will be inserted instead. This option is independent
            of "single_line_breaks_to_br" as single line-breaks are not
            dealt with unless that option is turned on.

            Also note that, like "P"-tag insertion, it inserts "BR"-tags
            conservatively. Multiple consecutive line-breaks (even mixed
            with white space) will be treated just as if they were only
            two. Thus, given the default stylesheet of your typical browser,
            the rendered output will appear pretty much the same in most
            circumstances.

        add_attrs_to_p => \%attrs
            This can be used to add a static set of attributes to each
            inserted "P"-element. For example:

              # Give each newly added paragraph the "generated" class.
              split_paragraphs($tree, {
                  add_attrs_to_p => { class => 'generated' },
              });

        add_attrs_to_br => \%attrs
            Same as above, but for the inserted "BR"-tags.

        filter_added_nodes => \&sub
            This can be used to run a small subroutine on each added
            paragraph or line-break tag as it is added. For example:

              # Give each newly added paragraph a unique ID
              split_paragraphs($tree, {
                  filter_added_nodes => sub {
                      my ($element) = @_;
                      $element->idf();
                  },
              });

            Many, if not all, of the other options can be simulated using
            this method, by the way.

        use_instead_of_p => $tag
            Rather than using "P"-tags to break everything, use a different
            tag. This example uses "DIV"-tags instead of "P"-tags:

              split_paragraphs($tree, {
                  use_instead_of_p => 'div',
              });

    $html_text = split_paragraphs_to_text($handle, \%options)
    $html_text = split_paragraphs_to_text($text, \%options)
    $html_text = split_paragraphs_to_text($element, \%options)
        This method performs the exact same operation as the
        "split_paragraphs()" method, but returns the text as a scalar value.
        This is helpful if you just want a quick method that takes in text
        and outputs text, you don't really need the HTML formatted in any
        particular way, and you don't need to modify the tree at all.

        I created this method primarily as a way of outputting the tree to
        make testing easier. If the output isn't what you like, use
        "split_paragraphs()" instead and use the output methods available in
        HTML::Element directly to get the desired output.

SEE ALSO
    HTML::TreeBuilder, HTML::Tagset

BUGS AND TODO
    I don't really have any explicit plans for this module, but if you find
    a bug or would like an additional feature or have another contribution,
    send me email at <hanenkamp@cpan.org>.

NOTES
    I tried to name this library HTML::Paragraphify first. After typing that
    a dozen times and looking at it for a few hours, my eyes felt like they
    were starting to bleed so I changed it to HTML::ParagraphSplit. I've
    left a few token references to that in the documentation name for kicks.

AUTHOR
    Andrew Sterling Hanenkamp, <hanenkamp@cpan.org>

LICENSE AND COPYRIGHT
    Copyright 2006 Andrew Sterling Hanenkamp <hanenkamp@cpan.org>. All
    Rights Reserved.

    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself. See perlartistic.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.