From: gentuxx <gentuxx@gmail.com>
To: gentoo-user@lists.gentoo.org
Subject: [gentoo-user] Any 'sed' geniuses out there?
Date: Mon, 26 Sep 2005 20:45:40 -0700
Message-ID: <4338C064.3090207@gmail.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm writing a sed script to parse the *broken* output of man2html. I
say broken because the output isn't W3C compliant (as HTML or as
XHTML). I'd like to modify it so that the final result is XHTML 1.0
compliant. The problem is that the output doesn't close the <p>, <dt>,
or <dd> tags, and XHTML requires every element to be closed. So I need
to take note of the starting tag, grab the paragraph that follows,
and then insert the closing tag. What I've got /sort of/ works, but
not quite.
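
For reference, here is a minimal sketch of one way to do the "note the
starting tag, then insert the closing tag" step. It is not the script
from this message. It assumes every <p> sits alone on its own line (as
in the man2html sample), and it uses the hold space only as a one-word
flag ("open") rather than as a paragraph buffer. The function name
close_paragraphs and the flag word are made up for illustration, and
<dt>/<dd> are not handled.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin, writes it to stdout with a
# closing tag emitted before each later <p> line and after the last line.
close_paragraphs() {
    sed '
/^<p>$/{
    # Swap: pattern space gets the flag, hold space gets "<p>".
    x
    # If a paragraph was already open, turn the flag into "</p>".
    s/^open$/<\/p>/
    # Append the saved "<p>" on a new line.
    G
    # Nothing was open: drop the empty first line that G left behind.
    s/^\n//
    # Park the text to print, set the flag, then swap the text back.
    x
    s/.*/open/
    x
}
${
    # Last input line: look at the flag.
    x
    /^open$/{
        # A paragraph is still open: append "</p>" after the last line.
        s/.*/<\/p>/
        x
        G
        b
    }
    # No paragraph was open: restore the last line unchanged.
    x
}
'
}

# Usage on a two-paragraph input.
printf '<p>\none\n<p>\ntwo\n' | close_paragraphs
```

Because nothing is buffered, ordinary lines pass straight through and
only the <p> lines and the last line are touched. The trade-off is
that the closing tag lands on its own line instead of directly after
the paragraph text.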

Here's a sample of the output after parsing, but without the part of
the script that modifies the <p> elements:

<p>

Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.  See <a
href="http://www.pcre.org/">http://www.pcre.org/</a> .

<p>

Nmap can optionally link to the OpenSSL cryptography toolkit, which is
available from <a
href="http://www.openssl.org/">http://www.openssl.org/</a> .


Here's the entire sedscr (sans comments):

/^$/{
        N
        /^\n$/d
}
/^Content-type: text\/html/c\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
s%<\(HTML\|P\|HEAD\|TITLE\|BODY\|STRONG\|EM\|H[123456]\|D[DLT]\|T[TDRH]\)>%\L<\1>%g
s%<\/\(HTML\|P\|A\|HEAD\|TITLE\|BODY\|STRONG\|EM\|H[123456]\|D[DLT]\|T[TDRH]\)>%\L</\1>%g
s%<BR>%<br />%g
s%<HR>%<hr />%g
s%<[Dd][Ll] [Cc][Oo][Mm][Pp][Aa][Cc][Tt]>%<dl compact="compact">%
s%<A HREF\(.*\)>%<a href\1>%g
s%<A NAME\(.*\)>%<a name\1>%g
/^<[IB]>.*$/{
        N
        s%\(<[IB]>\)\(.*\)\(<\/[IB]>\)\n%\L\1\2\L\3%
}
/^<[ib]>.*$/{
        N
        s%\n%%
}
s%<[IB]>%\L&%
s%<\/[IB]>%\L&%
/<body>/,/<\/body>/{
        /<p>/!{
                H
                d
        }
        /<p>/{
                x
                s/$/<\/p>/
                G
        }
}
/^<p>$/,/<\/p>$/{
        N
        /^\n<p>$/d
}


Here's the funkiness after parsing with the last part
(/<body>/,/<\/body>/{) enabled:

<p>

<p>

Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.  See <a
href="http://www.pcre.org/">http://www.pcre.org/</a> .</p>

<p>

<p>

Nmap can optionally link to the OpenSSL cryptography toolkit, which is
available from <a
href="http://www.openssl.org/">http://www.openssl.org/</a> .</p>
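
A reduced reproduction may help narrow this down. The sketch below
runs only the hold-space commands from the /<body>/,/<\/body>/ block
(without the range address and without the rest of the script) on a
tiny input. It assumes the doubling comes from that block alone, which
may not be the whole story for the full script. The function name
demo_stale_hold is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: the H/d and x/s/G commands from the block above,
# applied to every input line.
demo_stale_hold() {
    sed '
/<p>/!{
    # Not a <p> line: append it to the hold space and print nothing.
    H
    d
}
/<p>/{
    # Swap in the buffered text; the hold space now contains "<p>".
    x
    # Close the buffered paragraph.
    s/$/<\/p>/
    # Append the "<p>" from the hold space, which is never cleared.
    G
}
'
}

# Three <p> lines go in; count how many come out.
printf '<p>\none\n<p>\ntwo\n<p>\n' | demo_stale_hold | grep -c '^<p>$'
```

On this input the count is 5 rather than 3. After the x, the hold
space still contains "<p>", so the next H appends the paragraph text
to it and that stale <p> is printed again in front of the paragraph.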



(Just in case you were wondering, this IS from the nmap man page. ;-)
Thanks.

- --
gentux
echo "hfouvyAdpy/ofu" | perl -pe 's/(.)/chr(ord($1)-1)/ge'

gentux's gpg fingerprint ==> 34CE 2E97 40C7 EF6E EC40  9795 2D81 924A
6996 0993
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFDOMBkLYGSSmmWCZMRAnnrAJwKNqr+/OgBdDD8X8PXX6rpKUfaxQCfU9PW
Bs2oA/76RYFbbc7DWEpfTM8=
=gcc/
-----END PGP SIGNATURE-----
