Source for file rtfclass.php
* Filename: includes/rtfclass.php
* Function: RTF parsing class
* Last Modified: $Date: 2007-11-27 14:02:54 +0100 (Tue, 27 Nov 2007) $
Rich Text Format - Parsing Class
================================
<mfischer@josefine.ben.tuwien.ac.at>
http://josefine.ben.tuwien.ac.at/~mfischer/
Latest versions of this class can always be found at
http://josefine.ben.tuwien.ac.at/~mfischer/developing/php/rtf/rtfclass.phps
Testing suite is available at
http://josefine.ben.tuwien.ac.at/~mfischer/developing/php/rtf/
http://msdn.microsoft.com/library/default.asp?URL=/library/specs/rtfspec.htm
Unknown or unsupported control symbols are silently ignored
Group stacking is still not supported :(
group stack logic implemented; however not really used yet
Example on how to use this class:
=================================
$r = new rtf( stripslashes( $rtf));
if( count( $r->err) == 0) // no errors detected
Sat Nov 25 09:52:12 CET 2000 mfischer
First version which has usable but only well-formed XML output; the RTF
data structure is only logically rebuilt, no real parsing yet
Mon Nov 27 16:17:18 CET 2000 mfischer
Wrote handler for \plain control word (thanks to Peter Kursawe for this
Tue Nov 28 02:22:16 CET 2000 mfischer
Implemented alignment (left, center, right) with HTML <DIV .. tags
Also implemented translation for < and > character when outputting html or xml
Mon Oct 25 14:15:03 CET 2004 smanciles
Implemented parsing of special characters for Spanish and Catalan (ú...)
This class and all work done here is dedicated to Tatjana.
/* was just a brainlag suggestion of my inner link; don't know if I'll use it */
var $rtf; // rtf core stream
var $len; // length of the stream in characters (improves performance by avoiding a strlen() call every time)
var $err = array(); // array of error messages; empty when there are no errors
// the only variable which should be accessed from the outside
var $out; // output data stream (depends on which $wantXXXXX is set to true)
var $outstyles; // htmlified styles (generated after parsing if wantHTML)
var $styles; // if wantHTML, stylesheet definitions are put in here
// internal parser variables --------------------------------
// control word variables
var $cword; // holds the current (or last) control word, depending on $cw
var $cw; // are we currently parsing a control word ?
var $cfirst; // could this be the first character ? so watch out for control symbols
var $flags = array(); // parser flags
var $queue; // every character that is neither a special character nor part of a control word/symbol is generally considered 'plain' and collected here
var $stack = array(); // group stack
/* keywords which don't follow the specification (used by Word '97 - 2000) */
"tdfrmtxtBottom(-?[0-9]+)?",
"tdfrmtxtLeft(-?[0-9]+)?",
"tdfrmtxtRight(-?[0-9]+)?",
"tdfrmtxtTop(-?[0-9]+)?",
"trftsWidthA(-?[0-9]+)?",
"trftsWidthB(-?[0-9]+)?",
"spectspecifygen(-?[0-9]+)?"
"179" =>
"Arabic Traditional",
"238" =>
"Eastern European",
/* note: the only conversion table used */
Takes as argument the raw RTF stream
(Note: under certain circumstances the stream has to be stripslash'ed before handing it over)
Initialises some class-global variables
echo "<hr>\n<b>RTF</b><br>\n<code>\n";
echo "--->" . $this->rtf . "<---<br>\n";
echo "</code>\n<br>\n<hr>\n";
Default values according to the specs
$this->flags =
array("fontsize" =>
24,
// font table definition start
$this->flags["fonttbl"] =
true; // signal fonttable control records they are allowed to behave as expected
if ($this->flags["fonttbl"]) { // if it's set, the fonttable definition is written to; else it's read from
$this->flags["fonttbl_current_write"] =
$parameter;
$this->flags["fonttbl_current_read"] =
$parameter;
// this prepares flushQueue; it then moves the queue to $this->fonttable .. instead of to the formatted output
$this->flags["fonttbl_want_fcharset"] =
$parameter;
// sets the current fontsize; used by the stylesheets (which are therefore generated on the fly)
$this->flags["fontsize"] =
$parameter;
$this->flags["alignment"] =
"center";
$this->flags["alignment"] =
"right";
// reset paragraph settings ( only alignment)
$this->flags["alignment"] =
"";
// define new paragraph (for now, that's a simple break in html)
$this->flags["beginparagraph"] =
true;
// haven't yet figured out WHY I need a (string)-cast here ... hm
if ((string)
$parameter ==
"0") {
$this->flags["bold"] =
false;
$this->flags["bold"] =
true;
if ((string)
$parameter ==
"0") {
$this->flags["underlined"] =
false;
$this->flags["underlined"] =
true;
if ((string)
$parameter ==
"0") {
$this->flags["italic"] =
false;
$this->flags["italic"] =
true;
if ((string)
$parameter ==
"0") {
$this->flags["strikethru"] =
false;
$this->flags["strikethru"] =
true;
// reset all font modifiers and fontsize to 12
$this->flags["bold"] =
false;
$this->flags["italic"] =
false;
$this->flags["underlined"] =
false;
$this->flags["strikethru"] =
false;
$this->flags["fontsize"] =
12;
$this->flags["subscription"] =
false;
$this->flags["superscription"] =
false;
// sub and superscription
if ((string)
$parameter ==
"0") {
$this->flags["subscription"] =
false;
$this->flags["subscription"] =
true;
if ((string)
$parameter ==
"0") {
$this->flags["superscription"] =
false;
$this->flags["superscription"] =
true;
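The toggle handlers above all follow one rule: a parameter of "0" switches the property off, anything else switches it on. A minimal sketch of that rule as a single table-driven helper follows. The function name `rtf_apply_toggle` and the mapping from RTF control words (`\b`, `\i`, `\ul`, `\strike`, `\sub`, `\super`) to the flag names are assumptions for illustration and are not part of this class.

```php
<?php
// Hypothetical helper, not part of rtfclass.php: applies an RTF toggle
// control word to a flags array and returns the updated array.
function rtf_apply_toggle(array $flags, string $word, string $parameter): array
{
    // Assumed mapping from control word to the flag names used in the listing.
    $map = [
        'b'      => 'bold',
        'i'      => 'italic',
        'ul'     => 'underlined',
        'strike' => 'strikethru',
        'sub'    => 'subscription',
        'super'  => 'superscription',
    ];
    if (isset($map[$word])) {
        // "\b0" switches the property off; "\b" or "\b1" switches it on.
        $flags[$map[$word]] = ($parameter !== '0');
    }
    return $flags;
}

$flags = rtf_apply_toggle([], 'b', '');      // \b  : bold on
$flags = rtf_apply_toggle($flags, 'i', '0'); // \i0 : italic off
```

Taking the parameter as a string and comparing it strictly against '0' avoids the loose comparison that the (string) cast in the handlers works around.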
Dispatch the control word to the output stream
if (preg_match('/^([A-Za-z]+)(-?[0-9]*) ?$/D', $this->cword, $match)) { // ereg() was removed in PHP 7
$this->out .=
"<control word=\"" .
$match[1] .
"\"";
$this->out .=
" param=\"" .
$match[2] .
"\"";
If output stream supports comments, dispatch it
$this->out .=
"<!-- " .
$comment .
" -->";
Dispatch start/end of logical rtf groups
(not every output type needs it; merely debugging purpose)
/* push onto the stack */
$this->last_flags =
$this->flags;
$this->flags["fonttbl_current_write"] =
""; // on group close, no more font definition will be written to this id
// this is not really the right way to do it!
// of course a '}' does not necessarily denote the end of a fonttable; a fonttable
// group at least *can* contain sub-groups,
// so a stack-based approach is badly needed
$this->flags["fonttbl"] =
false; // no matter what you do, if a group closes, its fonttbl definition is closed too
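The comments here call for a stack-based approach to groups. A minimal sketch of that idea, assuming the flags are saved on every '{' and restored on the matching '}': the class name `RtfFlagStack` and its methods `open()` and `close()` are hypothetical and do not exist in this class.

```php
<?php
// Hypothetical sketch: '{' saves the current flags, '}' restores them,
// so formatting set inside a group does not leak out of it.
class RtfFlagStack
{
    private $stack = [];
    public $flags = ['bold' => false, 'fonttbl' => false];

    // Called on '{': remember the flags of the enclosing group.
    public function open(): void
    {
        array_push($this->stack, $this->flags);
    }

    // Called on '}': restore the flags of the enclosing group, if any.
    public function close(): void
    {
        if (count($this->stack) > 0) {
            $this->flags = array_pop($this->stack);
        }
    }
}

$s = new RtfFlagStack();
$s->open();                  // '{'
$s->flags['fonttbl'] = true; // \fonttbl seen inside the group
$s->open();                  // nested '{' inside the font table
$s->close();                 // nested '}' : still inside the font table
$inside = $s->flags['fonttbl'];
$s->close();                 // outer '}' : font table is closed
```

With this shape a nested group inside the font table no longer ends the font table early, which is the problem the comment describes.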
$this->out .=
"</group>";
if ($this->flags[$rtf] ==
true) {
if ($command ==
"start") {
$this->out .=
"<" .
$html .
">";
$this->out .=
"</" .
$html .
">";
if (preg_match('/^[0-9]+$/D', $this->flags["fonttbl_want_fcharset"])) {
$this->fonttable[$this->flags["fonttbl_want_fcharset"]]["charset"] =
$this->queue;
$this->flags["fonttbl_want_fcharset"] =
"";
Everything that gets past this point is (or at least *should* be) plain text output only.
That's why we can safely add the CSS stylesheet when using wantHTML
$this->out .=
"<plain>" .
$this->queue .
"</plain>";
// only output html if a valid (for now, just numeric;) fonttable is given
if (preg_match('/^[0-9]+$/D', $this->flags["fonttbl_current_read"])) {
if ($this->flags["beginparagraph"] ==
true) {
$this->flags["beginparagraph"] =
false;
$this->out .=
"<div align=\"";
switch ($this->flags["alignment"]) {
/* define new style for that span */
$this->styles["f" .
$this->flags["fonttbl_current_read"] .
"s" .
$this->flags["fontsize"]] =
"font-family:" .
$this->fonttable[$this->flags["fonttbl_current_read"]]["charset"] .
" font-size:" .
$this->flags["fontsize"] .
";";
$this->out .=
"<span class=\"f" .
$this->flags["fonttbl_current_read"] .
"s" .
$this->flags["fontsize"] .
"\">";
/* check if the span content has a modifier */
handle special characters like \'ef
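The case table in this handler maps individual \'hh codes to characters one by one. A more general sketch, assuming the document uses the Windows-1252 code page (the listed codes, such as 80 for the euro sign, are consistent with it) and that the iconv extension is available: the function name `rtf_decode_hex_escape` is hypothetical.

```php
<?php
// Hypothetical helper: converts the two hex digits of an RTF \'hh escape
// into a UTF-8 string.
// Assumption: the source bytes are Windows-1252 (\ansicpg1252).
function rtf_decode_hex_escape(string $hex): string
{
    if (!preg_match('/^[0-9a-fA-F]{2}$/D', $hex)) {
        return ''; // not a valid two-digit hex code
    }
    $byte = chr(hexdec($hex));
    $utf8 = iconv('CP1252', 'UTF-8', $byte);
    // iconv() returns false for bytes that are undefined in the code page.
    return $utf8 === false ? '' : $utf8;
}

$a = rtf_decode_hex_escape('e1'); // a with acute accent
$e = rtf_decode_hex_escape('80'); // euro sign
```

A document that declares another code page would need a different source encoding, so the fixed 'CP1252' here is only a starting point.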
$this->out .=
"<special value=\"" .
$special .
"\"/>";
$this->out .=
"<special value=\"" .
$special .
"\"/>";
case "c1":
$this->out .=
"Á";
case "e1":
$this->out .=
"á";
case "c0":
$this->out .=
"À";
case "e0":
$this->out .=
"à";
case "c9":
$this->out .=
"É";
case "e9":
$this->out .=
"é";
case "c8":
$this->out .=
"È";
case "e8":
$this->out .=
"è";
case "cd":
$this->out .=
"Í";
case "ed":
$this->out .=
"í";
case "cc":
$this->out .=
"Ì";
case "ec":
$this->out .=
"ì";
case "d3":
$this->out .=
"Ó";
case "f3":
$this->out .=
"ó";
case "d2":
$this->out .=
"Ò";
case "f2":
$this->out .=
"ò";
case "da":
$this->out .=
"Ú";
case "fa":
$this->out .=
"ú";
case "d9":
$this->out .=
"Ù";
case "f9":
$this->out .=
"ù";
case "80":
$this->out .=
"€";
case "d1":
$this->out .=
"Ñ";
case "f1":
$this->out .=
"ñ";
case "c7":
$this->out .=
"Ç";
case "e7":
$this->out .=
"ç";
case "dc":
$this->out .=
"Ü";
case "fc":
$this->out .=
"ü";
case "bf":
$this->out .=
"¿";
case "a1":
$this->out .=
"¡";
case "b7":
$this->out .=
"·";
case "a9":
$this->out .=
"©";
case "ae":
$this->out .=
"®";
case "ba":
$this->out .=
"º";
case "aa":
$this->out .=
"ª";
case "b2":
$this->out .=
"²";
case "b3":
$this->out .=
"³";
$this->out .=
"<errors>";
foreach ($this->err as $num => $value) { // each() was removed in PHP 8
$this->out .=
"<message>" .
$value .
"</message>";
$this->out .=
"</errors>";
$this->outstyles =
"<style type=\"text/css\"><!--\n";
foreach ($this->styles as $stylename => $styleattrib) { // each() was removed in PHP 8
$this->outstyles .=
"." .
$stylename .
" { " .
$styleattrib .
" }\n";
How this parser (is supposed) to work:
======================================
This parser simply starts at the beginning of the RTF core stream,
catches every controlling character ({, } and \), automatically builds
control words and control symbols during its lifetime, and pushes
every other character into the plain text queue.
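The scanning strategy described above can be sketched as a small standalone tokenizer. This is a simplified illustration and not the class's actual parse() method: the function name `rtf_tokenize` and the token shapes are assumptions, and the hex digits after a \' symbol are simply left in the text stream.

```php
<?php
// Hypothetical, simplified tokenizer illustrating the scanning strategy.
// Returns a list of [type, value] pairs.
function rtf_tokenize(string $rtf): array
{
    $tokens = [];
    $text = ''; // plain text queue
    $len = strlen($rtf);

    // Moves any queued plain text into the token list.
    $flush = function () use (&$tokens, &$text) {
        if ($text !== '') {
            $tokens[] = ['text', $text];
            $text = '';
        }
    };

    for ($i = 0; $i < $len; $i++) {
        $c = $rtf[$i];
        if ($c === '{' || $c === '}') {
            $flush();
            $tokens[] = [$c === '{' ? 'open' : 'close', $c];
        } elseif ($c === '\\') {
            $flush();
            // Control word: letters, optional signed number, optional single space.
            if (preg_match('/\G\\\\([A-Za-z]+)(-?[0-9]+)? ?/', $rtf, $m, 0, $i)) {
                $tokens[] = ['word', $m[1] . ($m[2] ?? '')];
                $i += strlen($m[0]) - 1; // skip the rest of the control word
            } elseif ($i + 1 < $len) {
                // Control symbol such as \~, \*, \\ or \'
                $tokens[] = ['symbol', $rtf[$i + 1]];
                $i++;
            }
        } elseif ($c !== "\n" && $c !== "\r") {
            $text .= $c; // everything else is plain text; line breaks are eaten
        }
    }
    $flush();
    return $tokens;
}

$tokens = rtf_tokenize('{\b Hi}');
```

Matching the whole control word with one anchored regular expression avoids the "hop one step back" bookkeeping that a character-by-character state machine needs when it meets a delimiter.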
$this->cw =
false; // flag if control word is currently parsed
$this->cfirst =
false; // first control character ?
$this->cword =
""; // last or current control word (depends on $this->cw)
$this->queue =
""; // plain text data found during parsing
while ($i <
$this->len) {
switch ($this->rtf[$i]) {
if ($this->cfirst) { // catches '\\'
if ((ord($this->rtf[$i]) ==
10) ||
(ord($this->rtf[$i]) ==
13)) break; // eat line breaks
if ($this->cw) { // active control word ?
Watch the RE: there's an optional space at the end which IS part of
the control word (but is actually ignored by flushControl)
if (preg_match('/^[a-zA-Z0-9-]?$/D', $this->rtf[$i])) { // continue parsing
Control word could be a 'control symbol', like \~ or \* etc.
if ($this->rtf[$i] ==
'\'') { // expect to get some special chars
if (preg_match('/^[{}*]$/D', $this->rtf[$i])) {
if ($this->rtf[$i] ==
' ') { // a space delimits control words, so just discard it and flush the control word
The current character is a delimiter, but is NOT
part of the control word, so we hop one step back
in the stream and process it again
// < and > need translation before putting into queue when XML or HTML is wanted
switch ($this->rtf[$i]) {
echo "<hr>\n<b>RTF Out</b><br>\n<code>\n";
echo "--->" . $this->out . "<---<br>\n";
echo "</code>\n<br>\n<hr>\n";
Documentation generated on Thu, 22 Jan 2009 09:17:48 +0100 by phpDocumentor 1.4.2