* [gentoo-user] Any 'sed' geniuses out there?
From: gentuxx @ 2005-09-27 3:45 UTC
To: gentoo-user
I'm writing a sed script to clean up the *broken* output of
man2html. I say broken because the output isn't W3C compliant (as
HTML or as XHTML). I'd like to modify it so that the final result
is XHTML 1.0 compliant. The problem I'm running into is that the
output doesn't close the <p>, <dt>, or <dd> tags, and XHTML requires
every element to be closed. So I need to take note of the opening
tag, grab the paragraph that follows, then insert the closing tag.
What I've got /sort of/ works, but not quite.
Here's a sample that has been run through the script with the part
that closes <p> disabled:
<p>
Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England. See <a
href="http://www.pcre.org/">http://www.pcre.org/</a> .
<p>
Nmap can optionally link to the OpenSSL cryptography toolkit, which is
available from <a
href="http://www.openssl.org/">http://www.openssl.org/</a> .
Here's the entire sedscr (sans comments):
/^$/{
N
/^\n$/d
}
/^Content-type: text\/html/c\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
s%<\(HTML\|P\|HEAD\|TITLE\|BODY\|STRONG\|EM\|H[123456]\|D[DLT]\|T[TDRH]\)>%\L<\1>%g
s%<\/\(HTML\|P\|A\|HEAD\|TITLE\|BODY\|STRONG\|EM\|H[123456]\|D[DLT]\|T[TDRH]\)>%\L</\1>%g
s%<BR>%<br />%g
s%<HR>%<hr />%g
s%<[Dd][Ll] [Cc][Oo][Mm][Pp][Aa][Cc][Tt]>%<dl compact="compact">%
s%<A HREF\(.*\)>%<a href\1>%g
s%<A NAME\(.*\)>%<a name\1>%g
/^<[IB]>.*$/{
N
s%\(<[IB]>\)\(.*\)\(<\/[IB]>\)\n%\L\1\2\L\3%
}
/^<[ib]>.*$/{
N
s%\n%%
}
s%<[IB]>%\L&%
s%<\/[IB]>%\L&%
/<body>/,/<\/body>/{
/<p>/!{
H
d
}
/<p>/{
x
s/$/<\/p>/
G
}
}
/^<p>$/,/<\p>$/{
N
/^\n<p>$/d
}
Here's the funkiness after parsing with the last part
(/<body>/,/<\/body>/{) enabled:
<p>
<p>
Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England. See <a
href="http://www.pcre.org/">http://www.pcre.org/</a> .</p>
<p>
<p>
Nmap can optionally link to the OpenSSL cryptography toolkit, which is
available from <a
href="http://www.openssl.org/">http://www.openssl.org/</a> .</p>
(Just in case you were wondering, this IS from the nmap man page. ;-)
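For reference, the doubled <p> seems to come from the G command in the last block: after x, the new <p> line sits in the hold space, G appends it to the paragraph that was just closed, and the same <p> also stays in the hold space as the start of the next paragraph. A minimal sketch of one alternative follows. It uses the hold space only as an "open paragraph" flag and puts </p> on a line of its own. It assumes the tags are already lowercased, that each paragraph starts with a line containing only <p>, and that it ends at the next such line or at a line containing only </body>. The name close_paragraphs is hypothetical, not part of the original script.

```shell
#!/bin/sh
# Sketch only: close each <p> that is still open when the next <p>
# or the closing </body> line arrives.
# close_paragraphs is a hypothetical helper name; it reads stdin.
close_paragraphs() {
  sed '
    /^<p>$/{
      # Fetch the flag from the hold space; the <p> line waits there.
      x
      # A paragraph is already open, so print its closing tag first.
      /^open$/{
        s/.*/<\/p>/
        p
      }
      # Mark a paragraph as open and swap the <p> line back in.
      s/.*/open/
      x
    }
    /^<\/body>$/{
      x
      /^open$/{
        s/.*/<\/p>/
        p
      }
      # Clear the flag so nothing gets closed twice.
      s/.*//
      x
    }
  '
}

# Small demonstration on a two-paragraph body.
printf '%s\n' '<body>' '<p>' 'first paragraph' '<p>' 'second paragraph' '</body>' |
  close_paragraphs
```

The same flag idea could presumably be extended to <dt> and <dd>, but those nest inside <dl> and would need their own end conditions.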
Thanks.
--
gentux
echo "hfouvyAdpy/ofu" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
gentux's gpg fingerprint ==> 34CE 2E97 40C7 EF6E EC40 9795 2D81 924A 6996 0993
--
gentoo-user@gentoo.org mailing list
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: Dave Nebinger @ 2005-09-27 3:51 UTC
To: gentoo-user
> I'm writing a sed script that will parse the *broken* output of
> man2html. I say broken, because the output isn't W3C compliant (html
> OR xhtml). I'd like to be able to modify it so that the final outcome
> is XHTML 1.0 compliant. I'm running into a problem where the output
> doesn't close the <p>, <dt>, or <dd> tags.
Won't html tidy do this kind of thing for you?
It would seem to be easier to reuse an existing tested tool rather than
trying to roll your own...
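As a rough sketch of what that could look like: HTML Tidy's -asxhtml switch asks for well-formed XHTML output and -q suppresses the summary. The helper name to_xhtml is hypothetical, and whether the result validates for a given man page is an assumption that would need checking against a validator.

```shell
#!/bin/sh
# Sketch only: convert HTML on stdin to XHTML on stdout with HTML Tidy.
# to_xhtml is a hypothetical helper name.
to_xhtml() {
  if ! command -v tidy >/dev/null 2>&1; then
    echo "tidy is not installed" >&2
    return 127
  fi
  # -q suppresses the summary, -asxhtml emits well-formed XHTML.
  # Warnings go to stderr and are discarded here.
  tidy -q -asxhtml 2>/dev/null
  status=$?
  # tidy exits with 1 when it only issued warnings; treat that as success.
  [ "$status" -le 1 ]
}

# Usage, assuming man2html and an nmap.1 page are at hand:
#   man2html nmap.1 | to_xhtml > nmap.html
```

If the man2html output starts with a "Content-type: text/html" line, as the sed script above suggests, that line would probably have to be stripped before the text reaches tidy.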
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: gentuxx @ 2005-09-27 3:57 UTC
To: gentoo-user
Dave Nebinger wrote:
>> I'm writing a sed script that will parse the *broken* output of
>> man2html. I say broken, because the output isn't W3C compliant (html
>> OR xhtml). I'd like to be able to modify it so that the final outcome
>> is XHTML 1.0 compliant. I'm running into a problem where the output
>> doesn't close the <p>, <dt>, or <dd> tags.
>
>
> Won't html tidy do this kind of thing for you?
>
> It would seem to be easier to reuse an existing tested tool rather
> than trying to roll your own...
>
>
Possibly. Didn't know about that. I'll look into it.
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: gentuxx @ 2005-09-27 4:22 UTC
To: gentoo-user
Dave Nebinger wrote:
>> I'm writing a sed script that will parse the *broken* output of
>> man2html. I say broken, because the output isn't W3C compliant (html
>> OR xhtml). I'd like to be able to modify it so that the final outcome
>> is XHTML 1.0 compliant. I'm running into a problem where the output
>> doesn't close the <p>, <dt>, or <dd> tags.
>
>
> Won't html tidy do this kind of thing for you?
>
> It would seem to be easier to reuse an existing tested tool rather
> than trying to roll your own...
>
>
Well, while I enjoy a good challenge (especially sed, awk, or perl),
htmltidy does the trick quite nicely. It doesn't indent the way that
I do, but my first priority was making the output W3C compliant, and
htmltidy's output is that.
Thanks Dave!
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: Dave Nebinger @ 2005-09-27 4:27 UTC
To: gentoo-user
> Well, while I enjoy a good challenge (especially sed, awk, or perl),
> htmltidy does the trick quite nicely. It doesn't indent the way that
> I do, but my first priority was making the output W3C compliant, and
> htmltidy's output is that.
I haven't used it in a while, but there may be some command line options to
assist with the indentation...
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: gentuxx @ 2005-09-27 4:40 UTC
To: gentoo-user
Dave Nebinger wrote:
>> Well, while I enjoy a good challenge (especially sed, awk, or perl),
>> htmltidy does the trick quite nicely. It doesn't indent the way that
>> I do, but my first priority was making the output W3C compliant, and
>> htmltidy's output is that.
>
>
> I haven't used it in awhile, but there may be some command line
> options to assist with the indentation...
>
There is the '-i' option, but the indentation is minimal. I tend to
be pretty anal about indentation. But like I said, I'm not really
concerned about that. My main concern was validation, and htmltidy
gives me that.
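For anyone who wants more than '-i' gives: tidy also takes its configuration options on the command line. The sketch below assumes that --indent yes, --indent-spaces, and --wrap behave as documented for the installed tidy version. The helper name indent_xhtml is hypothetical.

```shell
#!/bin/sh
# Sketch only: ask tidy for deeper indentation than plain -i gives.
# indent_xhtml is a hypothetical helper name; it reads HTML on stdin.
indent_xhtml() {
  if ! command -v tidy >/dev/null 2>&1; then
    echo "tidy is not installed" >&2
    return 127
  fi
  # --indent yes       indent all block content (-i is shorthand for "auto")
  # --indent-spaces 4  four spaces per level instead of the default two
  # --wrap 0           disable line wrapping
  tidy -q -asxhtml --indent yes --indent-spaces 4 --wrap 0 2>/dev/null
  status=$?
  # Exit status 1 means warnings only; treat that as success.
  [ "$status" -le 1 ]
}

# Usage, assuming page.html exists:
#   indent_xhtml < page.html > page.xhtml
```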
Thanks again.
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: Mariusz Pękala @ 2005-09-27 8:24 UTC
To: gentoo-user
On 2005-09-26 21:40:43 -0700 (Mon, Sep), gentuxx wrote:
> There is the '-i' option, but the indentation is minimal. I tend to
> be pretty anal about indentation. But like I said, I'm not really
> concerned about that. My main concern was validation, and htmltidy
> gives me that.
You may use sed for indentation tweaking. :-)
--
No virus found in this outgoing message.
Checked by 'grep -i virus $MESSAGE'
Trust me.
* Re: [gentoo-user] Any 'sed' geniuses out there?
From: Billy Holmes @ 2005-09-30 19:44 UTC
To: gentoo-user
gentuxx wrote:
> concerned about that. My main concern was validation, and htmltidy
> gives me that.
open it in vim, and do the autoindent function (select all, hit "=")