Ticket #2967: TracWiki2MediaWiki.pl

File TracWiki2MediaWiki.pl, 36.6 KB (added by Antoine Martin, 8 months ago)

perl script found here: https://www.mediawiki.org/w/index.php?oldid=2103135#Code
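
The listing below relies heavily on what its comments call deferred substitution: once a fragment has been converted, it is swapped for a `{{{{id}}}}` placeholder so that later line conversions cannot alter it, and the placeholders are expanded at the end. The following is a minimal standalone sketch of that idea, not an excerpt from the script. The sample input line and the simplified link and underline patterns are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
# Standalone sketch of the placeholder technique used by defer_subst() and
# apply_deferred_substs() in the listing. The sample line and the simplified
# patterns below are illustrative assumptions, not the script's own code.
use strict;
use warnings;

my %deferred_substs;   # placeholder ID => finished replacement text
my $next_id = 0;

# Store the finished text and return the ID that stands in for it.
sub defer_subst {
  my ($replacement_text) = @_;
  $deferred_substs{$next_id} = $replacement_text;
  return $next_id++;
}

# Swap every {{{{id}}}} placeholder back for its stored text.
sub apply_deferred_substs {
  my ($content) = @_;
  $content =~ s/\{\{\{\{(\d+)\}\}\}\}/$deferred_substs{$1}/g;
  return $content;
}

my $line = '[wiki:my__page__name the docs] are __important__';

# Step 1: convert the Trac link and hide the result behind a placeholder.
if ($line =~ /\[wiki:([^\] ]+)\s*([^\]]*)\]/) {
  my $id = defer_subst("[[$1| $2]]");
  $line =~ s/\[wiki:[^\]]+\]/{{{{$id}}}}/;
}

# Step 2: a later conversion (underline) runs over the line. The page name
# contains "__", but it is out of reach inside the placeholder.
$line =~ s/__(.+?)__/<u>$1<\/u>/g;

# Step 3: expand the placeholders once all other conversions are done.
$line = apply_deferred_substs($line);

print "$line\n";
# prints: [[my__page__name| the docs]] are <u>important</u>
```

Without step 1, the underline pattern would also match inside the link target and corrupt the page name. That is the kind of interference the placeholders are meant to prevent.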

1#!/usr/bin/perl -w
2#
3# tracwiki2mediawiki.pl - converts TracWiki format to MediaWiki format.
4# Version 1.0, 14 November 2007
5#
6# Copyright (C) 2007 Uzi Cohen
7#
8#This tool is merely a DRAFT version, a mini utility that can be used to convert Trac Wiki pages to MediaWiki format.
9#The basis for this tool is php2mediawiki by Isaac Wilcox
10#Copyright (C) 2005 Isaac Wilcox
11#http://www.iwilcox.me.uk/2005/07/php2mediawiki/
12#php2mediawiki provided a convenient basis for this converter, and the modifications added to it
13#were introduced to support the conversion of the TracWiki format.
14#
15#
16# This program is free software; you can redistribute it and/or modify
17# it under the terms of the GNU General Public License as published by
18# the Free Software Foundation; either version 2 of the License, or
19# (at your option) any later version.
20#
21# This program is distributed in the hope that it will be useful,
22# but WITHOUT ANY WARRANTY; without even the implied warranty of
23# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
24# GNU General Public License for more details.
25#
26# You should have received a copy of the GNU General Public License
27# along with this program; if not, write to the Free Software
28# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
29#
30#
31use strict;
32use warnings;
33use English;
34use Carp;
35use Data::Dumper;
36use URI::Escape;
37use POSIX qw(strftime);
38
39# Make warnings into fatal, backtracing errors.
40BEGIN { $SIG{__WARN__} = \&confess; }
41
42sub usage() {
43  my $me = $0;
44  $me =~ s|.*/||;
45  print STDERR <<EOF
46$me: converts a file from Trac syntax to Mediawiki Syntax
47
48Usage:
49  perl TracWiki2Mediawiki.pl <tracwiki_file> [mediawiki_file]
50
51Where tracwiki_file is a text file containing the TracWiki page.
52If the output file name is not provided, it defaults to the input file
53name with the suffix ".after".
54EOF
55}
56
57#Some of the nice features originally provided by php2mediawiki, are not supported by this version,
58#and it can now be used to convert single pages only.
59# How to use the tool?
60#-------------------------
61# syntax:
62# perl TracWiki2Mediawiki.pl <tracwiki_file.txt> [mediawiki_file]
63#
64# Where the tracwiki_file.txt is a text file containing the tracwiki page.
65# If the output file name is not provided, it defaults to the input file
66# name with the suffix ".after".
67#
68# In order to get the TracWiki page content, go to the TracWiki page and click the edit button, copy and paste its content into the tracwiki_file.txt file, and run the conversion utility.
69#
70#Known issues
71# - In TracWiki syntax, {{{ text }}} excludes the enclosed text from wiki parsing. The current version of the conversion tool does not support this exclusion when it spans multiple lines.
72# - Tables' first line must start at the beginning of the line.
73# - A link to a file in the wiki cannot be distinguished from a page link, so all links are treated as page links (rather than [[media:<filename>]]).
74# - CamelCase is not recognized as a link when converted to MediaWiki.
75#
76
77#The following are the comments as they appeared in the original php2mediawiki
78# Despite the long list of known issues below, the script does get most
79# things right most of the time, so try it out before letting this list
80# persuade you otherwise.  I've tried to list all the bugs regardless of
81# size, for completeness.
82#
83# Known issues:
84#  - Old-style PHPWiki syntax is unsupported unless it happens to match the
85#    new-style markup.
86#
87#  - Only the OldStyleTable plugin is supported.  No other plugins work.
88#    (I've patched my MediaWiki to include workalikes for 'backlinks' and
89#    'unfoldsubpages', because my wiki uses them heavily; contact me if
90#    you're interested.)
91#
92#  - PHPWiki page titles in BumpyCaps don't get spaces inserted (PHPWiki
93#    would normally do this itself when rendering the page as HTML).  Not
94#    sure I want to do this, either - some pages are meant to have spaces
95#    inserted, and some aren't, and there's no way to tell.  Plus, without
96#    extra work, all the internal Wiki links will break, and I can't be
97#    bothered to do the extra work just yet.
98#
99#  - Characters in titles that PHPWiki allows but MediaWiki prohibits simply
100#    cause the script to die, because my PHPWiki has "C++" at worst, which
101#    is special-cased here.
102#
103#  - Clever links (InterWikiMaps?) like "Google:foo" aren't translated.
104#    You can manually create templates that do the same thing - e.g., to
105#    simulate an InterWikiMap of:
106#      Google:foo => http://www.google.com/search?q=foo
107#    you could create Template:Google, containing:
108#      [http://www.google.com/search?q={{{1}}} Google:{{{1}}}]
109#    and turn all Google:Foo into {{Google|Foo}} in this script.
110#    (You might also want to fiddle with MAX_INCLUDE_REPEAT in Parser.php,
111#    but I leave that to you.)
112#    This is the solution I use, but currently you have to create the
113#    templates (e.g. Template:Google) manually.  Templates could be created
114#    automatically by this script if someone took the time to write the
115#    code.
116#    There's code here to change "Google:foo" into "{{Google|Foo}}", but it
117#    depends on $is_zaks_own_wiki, and the list of supported mappings is
118#    tiny (though it should be easy to add more).
119#
120#  - '_' in the middle of a word incorrectly gets italicized - effectively:
121#    FOO_BAR_BAZ => FOO<I>BAR</I>BAZ.  This bites me in a surprising number
122#    of places, but the regex fix is a little thorny.
123#
124#  - Tables:
125#    - table cells starting with a lowercase letter 'o' are interfered with by
126#      the bulleted list code, and break the table.
127#    - tables whose first lines are indented don't get converted at all
128#    - tables with an empty first cell go wrong
129#    - old style tables don't honour < > ^ (looks easy to fix though)
130#
131#  - BumpyCaps with suppressing tildes (e.g. ~DontLinkThis) retain the
132#    tildes; same for homedirs in URLs (e.g. http://host/~~user/).
133#
134#  - Anchors and links to anchors are broken.  MediaWiki mini-TOCs pretty
135#    much make this irrelevant if you only use anchors for simulating TOCs
136#    in your PHPWiki (as I did).
137#
138#  - The "=...=" (fixed-width font) PHPWiki markup conversion sometimes gets
139#    overexcited, and turns "foo=bar,baz=quux" into
140#    "foo<tt>bar,baz</tt>quux".
141#
142#  - Probably a fair few other things.  My PHPWiki really doesn't tax the
143#    markup.
144#
145#
146# On with the show.  Prerequisites:
147#   1. Hopefully this goes without saying, but...you must have a working,
148#      online PHPWiki and a working, online MediaWiki.
149#
150#   2. You must have accounts with the MySQL (or MySQL*s*) that host the PHP
151#      and Media Wikis, and have relevant privileges.
152#
153#   3. You must have *NIX knowledge, and not be a fool.
154#      I assume you're using *NIX, because I know nothing of Perl on Win32.
155#      Patches for Win32 portability issues are of course welcome.
156#
157#   4. You must not expect the script to do /all/ the work for you.
158#
159
160# Turns on a few extra conversions that probably only Zak's wiki uses.
161my $is_zaks_own_wiki = 0;
162
163  my $infile_name = "";
164  my $outfile_name = "";
165  if ($#ARGV < 0) {
166    print "$0: Error: please specify a file name to convert\n";
167    usage;
168    exit 1;
169  } elsif ($#ARGV > 1) {
170    print "$0: Error: too many arguments.\n";
171    usage;
172    exit 1;
173  } elsif ($#ARGV == 1) {
174    $infile_name = $ARGV[0];
175    $outfile_name = $ARGV[1];
176  } else { # $#ARGV == 0
177    $infile_name = $ARGV[0];
178    $outfile_name = "$infile_name.after";
179  }
180   
181  my $page = get_page($infile_name);
182 
183  print "\n\n\n -> Converting page: $page->{title}\n";
184  convert_markup($page);
185
186  open(NEW_PAGE, ">$outfile_name") || die "couldn't open output file! ($outfile_name)";
187  print NEW_PAGE "$page->{title}\n";
188  print NEW_PAGE $page->{content};
189  print NEW_PAGE "\n\n\n\n{{TracNotice|{{PAGENAME}}}}";
190
191  close(NEW_PAGE);
192 
193exit(0);
194
195# ID => "replacement text" mapping
196my %deferred_substs;
197
198# $pagehash get_page();
199#
200# Retrieve and return from a file the page's title and latest content in a
201# hash. 
202{
203sub get_page {
204  my ($file_name) = @_;
205  my $pgcontent = "";
206  my $pgtitle   = "";
207  my $record    = "";
208  my $page;
209 
210  open (ORIG_PAGE, "<".$file_name) || die "couldn't open input file! ($file_name)";
211 
212  $pgtitle = '';
213  #if ( defined($pgtitle = <ORIG_PAGE>) ) {
214    while ( defined($record = <ORIG_PAGE>) ) {
215      $pgcontent .= $record;
216    }
217    $page = { title => $pgtitle, content => $pgcontent };
218  #}
219
220  close(ORIG_PAGE);
221 
222  return $page;
223}
224}
225
226# void convert_markup($page);
227#
228# Convert $page (content, and maybe title) from PHPWiki to MediaWiki format as
229# best we can.  Modifies $page in place.
230sub convert_markup {
231  my ($page) = @_;
232  my $old_title = $page->{title};
233 
234  # http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)
235  # says that # + < > [ ] | { } are forbidden in MediaWiki page titles.  Only
236  # pages I have here are variants on "C++", luckily for me, so I special-case
237  # those and allow them.  Not so lucky for you...everything else causes script death.
238  if ($page->{title} =~ /\+\+$/o) {
239    $page->{title} =~ s/\+\+/_Plus_Plus/o;
240  } elsif ($page->{title} =~ /\+\+/) {
241    $page->{title} =~ s/\+\+/_Plus_Plus_/o;
242  } elsif ($page->{title} =~ /^\//o
243           || $page->{title} =~ /[]\[\{\}\<\>\#\+\|]/o) {
244    # Starting with '/' is also unsupported.
245    die("Can't yet handle page titles with funky chars in");
246  }
247  if ($page->{title} =~ / /o) {
248    # MediaWiki doesn't expect to see embedded spaces in page titles in the
249    # DB; the PHP layer will always turn a request for e.g. "Bumpy Caps"
250    # into "Bumpy_Caps" before asking the DB.  So retitle such pages here.
251    $page->{title} =~ s/ /_/go;
252  }
253  if ($page->{title} =~ /^[a-z]/o) {
254    # Pages starting with small letters seem to break without this.
255    # Again I think the PHP looks for IBook, not iBook.
256    $page->{title} =~ s/^([a-z])/uc($1)/eo;
257  }
258
259  # Break the page up into lines.  This avoids having to special-case for \n in
260  # the middle of things, and is apparently how PHPWiki looks at things.
261  # Few PHPWiki constructs span >1 line.
262  my @lines = split(/\n/, $page->{content});
263
264  # Sort out verbatim sections early so that following substitutions just
265  # see a placeholder and leave it alone.
266  block_cvt_verbatim(\@lines);
267
268  foreach my $line (@lines) {
269    $line = cvt_linebreaks($line);
270    $line = cvt_remove_exclamation_mark($line);
271    $line = cvt_fixedwidth1($line);
272    $line = cvt_para_open($line);
273    $line = cvt_para_close($line);
274    if ($is_zaks_own_wiki) {
275      $line = cvt_unfolds($line);
276      $line = cvt_backlinks($line);
277    }
278    # $line = cvt_plugins($line);
279
280    # Apply in most specific --> least specific order to avoid applying an
281    # overly generic conversion prematurely.  So, named external links
282    # first, because they're more specific (http always in them).  Etc.
283   
284    #$line = cvt_wikiwords($line);
285   
286    $line = cvt_explicit_internal_links($line);
287    $line = cvt_explicit_external_links($line);
288
289    # InterWikiMaps (see note at top of file).  Has to precede WikiWords,
290    # otherwise LocalFile:foo and Map:BumpyTarget get [[]]ified first.
291    $line = cvt_interwikimaps($line);
292
293
294
295    # It helps if this precedes cvt_lists, because '-' is a valid bulleted list
296    # item marker.
297    $line = cvt_horizontal_rules($line);
298
299    # It helps if cvt_lists precedes bold, because '*' is a bit overloaded.
300    $line = cvt_lists($line);   
301    $line = cvt_underline($line);
302    $line = cvt_strikethrough($line);
303    $line = cvt_superscript($line);
304    $line = cvt_subscript($line);
305    $line = cvt_toc($line);
306   
307   #$line = cvt_bold_italics($line);
308    #$line = cvt_bold($line);
309    # old_bold should precede italics
310    #$line = cvt_italics($line);  # same format - leave as is
311
312    $line = cvt_fixedwidth2($line);
313    # Headings must follow fixedwidth, because it produces '='
314    $line = cvt_headings_trac($line); 
315  }
316
317  for (my $i = 0; $i < $#lines; $i++) {
318    foreach my $block_cvt_sub (\( &block_cvt_new_style_tables, )) {
319      my ($num_lines_to_remove, @new_content) = $block_cvt_sub->(\@lines, $i);
320      if (defined($num_lines_to_remove)) {
321        splice(@lines, $i, $num_lines_to_remove, @new_content);
322        # Skip inserted lines, and subtract one because the for loop is about
323        # to $i++ again.
324        $i += $#new_content;
325        # Arbitrary decision to only run one matching block converter on any
326        # given section.
327        last;
328      }
329    }
330  }
331
332  $page->{content} = join("\n", @lines);
333  $page->{content} = apply_deferred_substs($page->{content});
334  %deferred_substs = ();
335
336  if ($is_zaks_own_wiki) {
337    # HACK: sort links to pages called CategoryFoo, and sort any InterWikiMap
338    # "Category:Foo" links...turn them all to [[Category:Foo]] and add a :
339    # version so that the text still looks the same as it did.
340    # By this point, Category links might look like any of:
341    #  [[CategoryTools]]                    => [[Category:Tools]][[:Category:Tools]]
342    #  [[Category:Tools]]                   => [[Category:Tools]][[:Category:Tools]]
343    #  [[Category:WikiWord]] => same again
344    #  Category:Tools                       => [[Category:Tools]]
345    #  Category:[[WikiWord]] => [[Category:WikiWord]]
346    # FIXME: use apply_deferred_substs again to avoid all this assertion mess
347    foreach my $pat (qr/\[\[\s*Category([^]:]+)\]\]/,
348                     qr/\[\[\s*Category:([^]]+)\]\]/,
349                   qr/(?:Category:(\w+))/,
350                   qr/(?:Category:\[\[(\w+)\]\])/) {
351      while ($page->{content} =~ $pat) {
352        my $id = defer_subst("[[Category:$1]][[:Category:$1]]");
353        $page->{content} =~ s/$pat/{{{{$id}}}}/;
354      }
355    }
356    $page->{content} = apply_deferred_substs($page->{content});
357
358    # Another Zak-ism (might be useful to others too, but I'm sure there's a
359    # neater way anyway).
360    # Insert a template at content start to let users know that conversion is
361    # temporary output.  Need original title for this.
362    #
363    # Need to turn spaces into %20 etc, otherwise e.g.:
364    #   "http://foo/wiki/index.php/Old Page With Spaces"
365    # won't link properly in MediaWiki.
366    $old_title = uri_escape($old_title, " ");
367    $page->{content} = "{{Conversion|$old_title}}<BR>\n" . $page->{content};
368  }
369}
370
371sub block_cvt_new_style_tables {
372  my ($lines, $i) = @_;
373
374  # FIXME We don't recognise tables if the first line is indented, because the
375  # converter breaks on them at the moment.
376  #(see the bottom for rules for recognising tables)
377  if ($lines->[$i] =~ /\s*\|\|.*/) {
378    return convert_table($lines, $i);
379  } else {
380    return undef;
381  }
382}
383
384# Turn <verbatim> into a MediaWiki indented-by-one section.  This block
385# conversion is a little different to the rest in that it protects the
386# wrapped content from any other conversions, line or block.  Thus, the
387# arguments and return value are not like the other block_cvt_*(); it
388# operates on @lines in place, and doesn't return anything.
389#
390# PHPWiki seems to want <verbatim> to start a line, and include trailing
391# content on the same line.  It seems to want </verbatim> on a line on its
392# own.
393#
394sub block_cvt_verbatim {
395  my ($lines) = @_;
396
397  for (my $i = 0; $i < $#{$lines}; $i++) {
398    if ($lines->[$i] !~ /^<verbatim>(.*)(\s*)$/) {
399      next;
400    }
401    my $cur_line = $i;
402    my @new_content;
403    # Gather content trailing after the opening tag, if any.
404    if (defined($1)) {
405      push(@new_content, " $1");
406    }
407    $cur_line++; # skip opening tag line
408    while (defined($lines->[$cur_line]) && $lines->[$cur_line] !~ m|^</verbatim>$|) {
409      push(@new_content, " " . $lines->[$cur_line]);
410      $cur_line++;
411    }
412    # If we found an opening tag but there are no more closing tags, we're done.
413    if (!defined($lines->[$cur_line])) {
414      last;
415    }
416    my $id = defer_subst(join("\n", @new_content) . "\n");
417    splice(@$lines, $i, $cur_line - $i + 1, ("{{{{$id}}}}"));
418  }
419}
420
421# $id defer_subst($replacement_text);
422#
423# Add a deferred substitution to the list.  When you want to make sure that a
424# substitution will not be susceptible to further changes, get an ID from this
425# function and insert '{{{{id}}}}' instead of the replacement text.  Later,
426# converter will find all '{{{{id}}}}'s and replace them with whatever you
427# saved.  Useful to prevent e.g. BumpyCaps inside a URL being marked up.
428{
429my $next_deferred_subst_id;
430sub defer_subst {
431  my ($replacement_text) = @_;
432  if (!defined($next_deferred_subst_id)) {
433    $next_deferred_subst_id = 0;
434  }
435 
436  $deferred_substs{$next_deferred_subst_id} = $replacement_text;
437  return $next_deferred_subst_id++;
438}
439}
440
441# $content apply_deferred_substs($content);
442#
443# Apply all deferred substitutions to the given content and return the result.
444sub apply_deferred_substs {
445  my ($content) = @_;
446
447  $content =~ s/\{\{\{\{(\d+)\}\}\}\}/$deferred_substs{$1}/gx;
448  return $content;
449}
450
451
452###############################################################################
453# Markup conversion
454###############################################################################
455
456# Horizontal rules.
457#
458# This just protects HRs from further messing.
459# Regex stolen from Block_hr (possibly changing leading whitespace semantics).
460sub cvt_horizontal_rules {
461  my ($line) = @_;
462
463  if ($line =~ /^-{4,}\s*$/) {
464    my $id = defer_subst($line);
465    $line = "{{{{$id}}}}";
466  }
467  return $line;
468}
469
470# Headings
471#  !!!text => ==text==     (section)
472#  !!text  => ===text===   (subsection)
473#  !text   => ====text==== (subsubsection)
474#  See transform.php:wtm_headings().  Regex stolen from there.
475sub cvt_headings_jsp {
476  my ($line) = @_;
477
478  if ($line =~ /^(!{1,3})[^!]/) {
479    my $markup = '=' x (5 - length($1));
480    $line =~ s/
481      # remove the !s from start of line
482      ^!{1,3}
483      # and any leading whitespace on heading
484      \s*
485      # and capture the heading itself
486      (.*)
487      #Change it with
488      /${markup}$1${markup}/x;
489  }
490  return $line;
491}
492
493# Headings
494#  =text= => ==text==     (section)
495#  ==text==  => ===text===   (subsection)
496#  ===text===   => ====text==== (subsubsection)
497#  See transform.php:wtm_headings().  Regex stolen from there.
498sub cvt_headings_trac {
499  my ($line) = @_;
500  if ($line =~ /^(={1,3})[^=]+\1/) {
501    my $markup = $1."=";
502    $line =~ s/
503      # remove the =s from start of line
504      ^(={1,3})
505      # and any leading whitespace on heading
506      #\s*
507      # and capture the heading itself
508      ([^=]+)\1(.*)
509      #Change it with
510      /${markup}$2${markup}$3/x;
511  }
512  return $line;
513}
514
515# Unsupported single-line plugin calls.
516#
517# See transform.php:wtm_plugin().
518#
519# FIXME: handle multi-line calls (see Block_plugin).
520# FIXME: do something with the plugins we can emulate.
521sub cvt_plugins {
522  my ($line) = @_;
523
524  if ($line =~ /^<\?plugin\s.*\?>\s*$/) {
525    # Hmm.  For now let's just save this chunk of stuff so it doesn't get
526    # fiddled by other conversions.
527    my $id = defer_subst($line);
528    $line = "{{{{$id}}}}";
529  }
530  return $line;
531}
532
533# This one will only work if you've hacked your MediaWiki like Zak did.
534# Contact me for the hack if you're interested.
535sub cvt_backlinks {
536  my ($line) = @_;
537 
538  # Only pay attention if there's a page=Category...
539  if ($line =~ /^<\?plugin\s+BackLinks\s.*(?:page\s*=\s*Category([^\s]+)).*\?>\s*$/) {
540    my $id = defer_subst("{{CATEGORYCONTENTS|$1}}");
541    $line = "{{{{$id}}}}";
542  }
543  return $line;
544}
545
546# This one will only work if you've hacked your MediaWiki like Zak did.
547# Contact me for the hack if you're interested.
548sub cvt_unfolds {
549  my ($line) = @_;
550 
551  # Only pay attention if there's a section=...
552  if ($line =~ /^<\?plugin\s+UnfoldSubpages\s.*section\s*=\s*"?([^\s]+).*\?>\s*$/) {
553    my $id = defer_subst("{{UNFOLD|$1}}");
554    $line = "{{{{$id}}}}";
555  }
556  return $line;
557}
558
559# InterWikiMaps
560#  i.e. [[Star(foo)]] --> {{Star|foo}}
561# Very limited support - see notes at top of file.
562sub cvt_interwikimaps {
563  my ($line) = @_;
564  my $supported_maps = join('|', qw/
565    Google Star P360Doc
566  /);
567  my $pat = qr/
568    # starts with [[
569    \[\[
570    # a recognised InterWikiMap mapping
571    # $1 = mapping name
572    ($supported_maps)
573    # a '('
574    \(
575    # a parameter; I assume parameters can't contain parentheses,
576    # but maybe I'm wrong
577    # $2 = parameter
578    ([^)]+)
579    # a ')'
580    \)   
581   # ends with ]]
582    \]\]
583  /x;
584  my $template_name = '';
585  my $param_list = '';
586  while ($line =~ m/$pat/ox) {
587    # Don't wanna let template invocations get converted any more.
588    $template_name = $1;
589    $param_list = $2;
590    $param_list =~ s/,/\|/;
591    my $id = defer_subst("{{$template_name|$param_list}}");
592    $line =~ s/$pat/{{{{$id}}}}/;
593  }
594  return $line;
595}
596
597# WikiWords
598#
599# See $WikiNameRegexp in PHPWiki (occurs several times, not sure which if any
600# is "the" regex).  Regex is hacked a /little/ here to document and capture.
601sub cvt_wikiwords {
602  my ($line) = @_;
603  my $WikiWordRegex = qr/
604    # must follow a non-alphanumeric char, or be first thing on the line
605    # Also don't match if there's a preceding '~', because PHPWiki
606    # suppresses markup in that case.
607    (?<![~[:alnum:]])
608    # match at least "FiFi"
609    ((?:[[:upper:]][[:lower:]]+){2,})
610    # and not followed by an alnum (hmm, think this is dirty...prev bit is
611    # greedy, so think they meant [0-9A-Z] there)
612    (?![[:alnum:]])/x;
613   
614  while ($line =~ m/$WikiWordRegex/ox) {
615    # Don't wanna let WikiWords get converted any more...I don't think...
616    # FIXME this defer might be unnecessary.
617    my $id = defer_subst("[[$1]]");
618    $line =~ s/$WikiWordRegex/{{{{$id}}}}/;
619  }
620  return $line;
621}
622
623# Explicit named  or  not named external links
624#  [http://foo/ coolsite] => [http://foo/ coolsite]
625#  [http://foo/] => http://foo/
626sub cvt_explicit_external_links {
627  my ($line) = @_;
628  my $pat = qr/
629    # Starts with a [
630    \[   
631    # then an http link (FIXME could be ftp, etc) containing non-link-ending
632    # chars ($1 = http...)
633    ((?:http|ftp|mailto)[^] ]+)
634    # maybe some whitespace
635    \s*
636    # "coolsite" consists of non-link-ending, non-renaming chars
637    # ($2 = coolsite)
638    ([^]]*)
639    # terminated by a ]
640    \]
641  /x;
642
643  while ($line =~ m/$pat/o) {
644    my $id = 0;
645    if ($2 eq "") {
646      # Explicit anonymous external links
647      #  [http://foo/] => http://foo/
648      # These are modified to make images appear inline ([] in MW doesn't do this),
649      # to make MW render them as the link address (these links appear as "[1]->"
650      # otherwise) and to keep them safe from further messing by other conversions.
651      # May need to treat images specially instead?
652      #$id = defer_subst("[$1]");
653
654      $id = defer_subst("$1");
655    }
656    else {
657      $id = defer_subst("[$1 $2]");
658    }
659    $line =~ s/$pat/{{{{$id}}}}/x;
660  }
661  return $line;
662}
663
664# Explicit named  or  not named internal links
665#  [wiki:foo coolpage] => [[foo| coolpage]]
666#  [wiki:foo] => [[foo]]
667sub cvt_explicit_internal_links {
668  my ($line) = @_;
669  my $pat = qr/
670    # Starts with a [
671    \[wiki:   
672    # then an internal link containing non-link-ending
673    # chars ($1 = internal link)
674    ([^] ]+)
675    # maybe some whitespace
676    \s*
677    # "coolpage" consists of non-link-ending, non-renaming chars
678    # ($2 = coolpage)
679    ([^]]*)
680    # terminated by a ]
681    \]
682  /x;
683
684  while ($line =~ m/$pat/o) {
685    my $id = 0;
686    if ($2 eq "") {
687      $id = defer_subst("[[$1]]");
688    }
689    else {
690      $id = defer_subst("[[$1| $2]]");
691    }
692    $line =~ s/$pat/{{{{$id}}}}/x;
693  }
694  return $line;
695}
696
697# Bulleted lists
698#
699# A little more relaxed than Block_list, just because I hate that regex ;)
700sub cvt_lists {
701  my ($line) = @_;
702  # Get a cup of tea.
703  my $pat = qr/
704    # Bullets must be first non-whitespace on line.
705    # Capture the indentation so we can determine level later.
706     ^(\ *)
707    # About to encounter some list item marker, so start capturing so we can
708    # tell later whether it was <OL> or <UL>.
709     (
710    # Any one of our 5 choices of bulleted list marker.
711    # Now, the first 4 we'll ignore the other uses of, and always just see as
712    # bullets.
713      (1\.)|(a\.)|(i\.)
714    # But '*' is annoying if you have a standalone line of bold text, which I
715    # do, so the hairy negative lookahead from Block_list is copied here.
716    # This basically considers * a bullet unless it looks like these:
717    #   *foo, bar baz*
718    #   *foo bar*
719    #   *foobar*
720    # cos that's probably meant to be bold.  But consider e.g. these to be
721    # bullets:
722    #   *foo bar*baz
723    #   * foo bar*
724    #   *foo bar *baz
725      | \*
726    # Not followed by:
727       (?!
728    ### One char of something (i.e. not space)
729       \S
730    ### Zero or more chars that aren't *s
731       [^*]*
732    ### The last of which must be a something (not space)
733       (?<=\S)
734    ### Then another *
735       \*
736    ### Followed by either space, or end of the line, i.e. not immediately
737    ### followed by text.
738       (?!\S)
739    # End of "Not followed by"
740      )
741    # End of choice of bullets (captured).
742    )
743    # Possibly some spaces
744    \ *
745    # And followed by some content.
746    (?=\S)
747  /x;
748
749  if ($line =~ /$pat/o) {
750    # Work out nesting level. Two spaces make a new level.
751    my $nest_level = (length($1) / 2) + 1;
752    # If bullet used was UL, use '*', else use '#'.
753    my $bullet;
754    if ($2 eq '1.') {
755      $bullet = '#';
756    } 
757    elsif ($2 eq 'a.') {
758      $bullet = '#';
759    }
760    elsif ($2 eq 'i.') {
761      $bullet = '#';
762    }
763    else {
764      $bullet = '*';
765    }
766    my $wm_list_markup = $bullet x $nest_level;
767    my $id = defer_subst("$wm_list_markup");
768    $line =~ s/$pat/{{{{$id}}}}/;
769  }
770  return $line;
771}
772
773# Bold Italics
774#  _*foo*_ => '''''foo'''''
775sub cvt_bold_italics {
776  my ($line) = @_;
777  my $pat = qr/
778    _\* ([^*]+) \*_
779  /x;
780
781  while ($line =~ /$pat/o) {
782    $line =~ s/$pat/'''''$1'''''/g;
783  }
784  return $line;
785}
786
787# underline.
788#  __foo__ => <u>foo</u>
789sub cvt_underline {
790  my ($line) = @_;
791  my $pat = qr/
792    __ (.+?) __
793  /x;
794
795  while ($line =~ /$pat/o) {
796    $line =~ s/$pat/<u>$1<\/u>/g;
797  }
798  return $line;
799}
800
801
802# strikethrough.
803#  ~~foo~~ => <s>foo</s>
804sub cvt_strikethrough {
805  my ($line) = @_;
806  my $pat = qr/
807    ~~ (.+?) ~~
808  /x;
809
810  while ($line =~ /$pat/o) {
811    $line =~ s/$pat/<s>$1<\/s>/g;
812  }
813  return $line;
814}
815
816# subscript.
817#  ,,foo,, => <sub>foo</sub>
818sub cvt_subscript {
819  my ($line) = @_;
820  my $pat = qr/
821    ,, (.+?) ,,
822  /x;
823
824  while ($line =~ /$pat/o) {
825    $line =~ s/$pat/<sub>$1<\/sub>/g;
826  }
827  return $line;
828}
829
830# toc.
831#  [[TOC(8_1/TOF/GMQA3R5)]] => __TOC__
832sub cvt_toc {
833  my ($line) = @_;
834  my $pat = qr/
835    \[\[TOC\( ([^)]+) \)\]\]
836  /x;
837
838  while ($line =~ /$pat/o) {
839    $line =~ s/$pat/__TOC__/g;
840  }
841  return $line;
842}
843
844# superscript.
845#  ^foo^ => <sup>foo</sup>
846sub cvt_superscript {
847  my ($line) = @_;
848  my $pat = qr/
849    \^ ([^^]+) \^
850  /x;
851
852  while ($line =~ /$pat/o) {
853    $line =~ s/$pat/<sup>$1<\/sup>/g;
854  }
855  return $line;
856}
857# Bold
858#  *foo* => '''foo'''
859sub cvt_bold {
860  my ($line) = @_;
861  my $pat = qr/
862    \* ([^*]+) \*
863  /x;
864
865  while ($line =~ /$pat/o) {
866    $line =~ s/$pat/'''$1'''/g;
867  }
868  return $line;
869}
870
871# Italics
872#  _foo_ => ''foo''
873sub cvt_italics {
874  my ($line) = @_;
875  my $pat = qr/
876    _ ([^_]+) _
877  /x;
878
879  while ($line =~ /$pat/o) {
880    $line =~ s/$pat/''$1''/g;
881  }
882  return $line;
883}
884
885# Remove the leading '!' that Trac uses to stop a word from being linked
886#  !foo  => foo
887sub cvt_remove_exclamation_mark {
888  my ($line) = @_;
889  my $pat = qr/
890     \!( \w+ )
891  /x;
892
893  while ($line =~ /$pat/o) {
894    $line =~ s/$pat/$1/g;
895  }
896  return $line;
897}
898
899# Fixed width
900#  {{{foo}}} => <tt>foo</tt>
901sub cvt_fixedwidth1 {
902  my ($line) = @_;
903  my $pre_str = '\{\{\{';
904  my $post_str = '\}\}\}';
905  my $pat = qr/
906#  used to be:  ${pre_str}([^${post_str}]+)${post_str}   , changing it for:
907     ${pre_str}(.+?)${post_str}
908  /x;
909
910  while ($line =~ /$pat/o) {
911    $line =~ s/$pat/<tt>$1<\/tt>/g;
912  }
913  return $line;
914}
915
916# Fixed width
917#  `foo` => <tt>foo</tt>
918sub cvt_fixedwidth2 {
919  my ($line) = @_;
920  my $pat = qr/
921    \`([^`]+)\`
922  /x;
923
924  while ($line =~ /$pat/o) {
925    $line =~ s/$pat/<tt>$1<\/tt>/g;
926  }
927  return $line;
928}
929# Line breaks
930#  [[BR]] => <br>
931sub cvt_linebreaks {
932  my ($line) = @_;
933  if ($line =~ /\[\[BR\]\]/i) {
934    $line =~ s/\[\[BR\]\]/<br>/iog;
935  }
936  return $line;
937}
938
939# Preformatted block (open)
940#  {{{ => <pre>
941sub cvt_para_open {
942  my ($line) = @_;
943
944  if ($line =~ /\{\{\{/) {
945    $line =~ s/\{\{\{/<pre>/og;
946  }
947  return $line;
948}
949
950# Preformatted block (close)
951#  }}} => </pre>
952sub cvt_para_close {
953  my ($line) = @_;
954
955  if ($line =~ /}}}/) {
956    $line =~ s/}}}/<\/pre>/og;
957  }
958  return $line;
959}
960
961# HTML markup that's unsupported by MediaWiki:
962# abbr acronym dfn kbd samp
963#  FIXME
964
965
966################## TABLES #####################################################
967# PHPWiki syntax is a little hairy.  This parser requires 3 passes over the
968# lines (!) to convert the syntax.  We could probably refactor this into less
969# passes with a little effort, but I CBA.

# Top level; take in a reference to an array of lines and the index of a line
# on which a table starts, and return the number of lines used in the
# conversion followed by the replacement content.
sub convert_table {
  my ($lines, $first_table_line) = @_;

  # Pass 1.  Get all the lines that make up the table.
  my @table_lines = collect_table($lines, $first_table_line);
  my $num_table_lines = @table_lines;

  # Pass 2.  Parse table into an IR.  This helps us handle rowspans and
  # colspans.
  # Warnings are fatal in this script (see the __WARN__ handler at the top),
  # so the parse might die; trap any error here.
  my $IR;
  eval {
    $IR = parse_table(@table_lines);
  };
  # If so, return the lines as they came, with markers.
  if ($@) {
    return ($num_table_lines,
            "''tracwiki2mediawiki failed to parse this table''",
            "''The table is included here unconverted:''",
            (map { " $_" } @table_lines),
            "''(End of failed conversion content)''");
  }

  # Pass 3.  Render the IR as MediaWiki table.
  return ($num_table_lines, render_table($IR));
}

# Pass 1:
# Collect up all the lines that make up a table, ready to parse the table.
# Return a list of lines.
sub collect_table {
  my ($lines, $first_table_line) = @_;
  my @table_lines;

  # FIXME see notes at bottom
  # Leading whitespace is stripped from the first line only; continuation
  # lines must start with || in the first column.
  #
  # We can be sure that the first line is a table line, so grab at least
  # that one.
  $lines->[$first_table_line] =~ s/\s+$//;
  $lines->[$first_table_line] =~ s/^\s+//;
  push(@table_lines, $lines->[$first_table_line]);
  my $cur_line = $first_table_line + 1;
  while (exists($lines->[$cur_line])
         && $lines->[$cur_line] =~ /^\|\|.*/) {
    # Trim any trailing whitespace while we have the line in our clutches.
    $lines->[$cur_line] =~ s/\s+$//;
    push(@table_lines, $lines->[$cur_line]);
    $cur_line++;
  }
  # Trim trailing blank lines, as they're redundant and not really part of the
  # table.  Stop once the list is empty so we never test an undefined element.
  while (@table_lines && $table_lines[-1] =~ /^\s*$/) {
    pop(@table_lines);
  }
  return @table_lines;
}

# Pass 2:
# Parse a table into an IR.
# IR format is that each cell is either a cell hash ref like:
#  {
#    content => "foo",
#    rowspan => 5
#  }
# or a reference to a cell, which means the referring cell is actually spanned
# by the cell it refers to.
# The very last element of the returned IR will be the length of the longest
# row in the table, so that a renderer can sort out colspan.
sub parse_table {
  my (@table_lines) = @_;
  my $IR;
  my $max_row_length = 0;

  # Keep parsing rows while there are more input lines.
  while (@table_lines) {
    # Create an empty row in the IR.
    push(@$IR, []);

    # Get the current line; each input line is one table row.
    my $cur_line = shift(@table_lines);

    my $content;
    my $pat = qr/
              # look for a string that starts with ||
              \|\|
              # capture the content until the next || or the end of the line (non-greedy)
              (.*?)
              # use look ahead to check if we have either || or end-of-line
              (?=(?:\|\| | $))
              /x;
    # Keep parsing cells until the line is used up.
    while ($cur_line =~ /$pat/) {
      $content = $1;
      # A bare trailing || only closes the row; it is not an empty cell.
      if ($cur_line !~ /^\|\|\s*$/) {
        # Change every | char to a / char (a single | separates attributes
        # from content in a MediaWiki cell).
        $content =~ tr{|}{/};

        push(@{$IR->[-1]}, { content => $content, rowspan => 1 });
      }
      $cur_line =~ s/$pat//;
    }
    # Done with row.
    $max_row_length = (@{$IR->[-1]} > $max_row_length)
                      ? scalar(@{$IR->[-1]}) : $max_row_length;
  }
  push(@$IR, $max_row_length);
  return $IR;
}

# Hide all parsing of leading spaces; turn them all into indent levels, making
# sure that any indent level used is either the biggest yet or has been used
# before.
#
# This uses a Perl idiom to maintain a data structure which is private to a
# function and which maintains state across calls.
# See the Camel, page 223.
#
# @cell_x is basically a hash.  For each defined element, the index corresponds
# to an indentation level and the value corresponds to the X coordinate that
# cells with that indent should be assumed to be in.  This requires that
# indentation levels are used consistently throughout the table, which is not a
# constraint in PHPWiki (even if such tables do mess with your head).  It's not
# implemented as a hash because you'd have to sort() and max() a hash to find
# the largest indent so far :)
{
my @cell_x;

sub get_cell_x {
  my ($line) = @_;

  my $indent = 0;
  if ($line =~ /^(\s+)/) {
    $indent = length($1);
  }
  if (defined($cell_x[$indent])) {
    return $cell_x[$indent];
  } elsif ($indent > $#cell_x) {
    $cell_x[$indent] = (defined($cell_x[-1]) ? $cell_x[-1] + 1 : 0);
    return $cell_x[$indent];
  } else {
    die("PHPWiki table uses uneven indents");
  }
}

sub reset_cell_x {
  @cell_x = ();
}
}

sub update_rowspan {
  # $num_spanned_cols is the number of columns we need to update the rowspan
  # for in some cell.  As rowspans start from the left and must always follow
  # other rowspans, we know that the $num_spanned_cols cells are contiguous
  # starting from the LHS.
  my ($IR, $num_spanned_cols) = @_;

  foreach my $i (0 .. $num_spanned_cols-1) {
    if (ref($IR->[-2][$i]) eq 'HASH') {
      # If this is the first spanned cell, put a reference in to the spanning cell
      # so that any following spanned cells can easily find and update the
      # rowspan count.
      push(@{$IR->[-1]}, \$IR->[-2][$i]);
    } elsif (ref($IR->[-2][$i]) eq 'REF') {
      # If this is the second or subsequent spanned cell, then just
      # propagate the reference into this cell.
      push(@{$IR->[-1]}, $IR->[-2][$i]);
    }
    # And either way, update the rowspan by using the reference.  Which is
    # now a reference to a reference to a hash.  Had enough yet? :)
    ${$IR->[-1][$i]}->{rowspan}++;
  }
}

sub render_table {
  my ($IR) = @_;
  my $mw_table = "";

  # Grab the max row length early to not confuse loops later.
  my $max_row_length = pop(@$IR);

  $mw_table .= "{| border=1 class=\"simple\"\n";
  my $cell_start = "!";

  foreach my $y (0.. $#$IR) {
    foreach my $x (0.. $#{$IR->[$y]}) {
      my $cell = $IR->[$y][$x];
      next if ref($cell) ne 'HASH';

      # Collect attributes in a list so that a cell needing both a colspan
      # and a rowspan keeps both.
      my @attrs;

      # If we're the last cell on the row, and we're not at column
      # $max_row_length, then this cell spans the remaining columns.
      if ($x == $#{$IR->[$y]} && $x != $max_row_length-1) {
        push(@attrs, "colspan=" . ($max_row_length - $x));
      }

      if ($cell->{rowspan} > 1) {
        push(@attrs, "rowspan=$cell->{rowspan}");
      }
      my $cell_options = @attrs ? join(" ", @attrs) . "|" : "";
      $mw_table .= $cell_start . $cell_options . $cell->{content} . "\n";
    }
    if ($y < $#$IR) {
      $mw_table .= "|- \n";
      $cell_start = "| ";
    }
  }
  $mw_table .= "|}\n";

  return $mw_table;
}

__END__

PHPWiki table markup parsing gotchas/notes/observations/assumptions.

A table starts with a --non-indented-- line ending in a pipe, followed by a
++more++-indented line.  Of course, the indented line can be as simple as this:

a |
 |  <-- interpreted as a cell containing a pipe
end

A table continues until:
  +++ - a less-indented line +++

  - A non-indented line that doesn't end with a pipe.  The unindented line
    "end" is used here to show when an example ends, and PHPWiki considers
    these to end the table.  Blank lines do not end tables:

a |
 b |

 c |
end

A row continues until:
 - indent level drops (start a new cell on next row)
 - the table ends

A cell continues until:
 - |\n<more indent> is encountered (start a new cell on same row)
   - collect this line's content, and exit with more_cells = 1
 - indent level drops (regardless of pipe)
   - immediately return with more_cells = 0, collecting nothing (do not pass Go)
 - a new cell with a follower (therefore ending in a pipe, and followed by a
   more indented line) is encountered
 - the table ends

The widest row in the table sets the number of columns.  Any row with fewer
cells than this will be rendered with its last cell spanning the remaining
columns.  This is the only way to cause a colspan.
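
As a rough illustration of the colspan rule above, the sketch below works out
the span for the last cell of each short row, using the same arithmetic as
render_table.  The helper name colspan_for and the sample rows are made up for
this example and are not part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): given a row's cell count and
# the cell count of the widest row, return the colspan for the row's last cell.
sub colspan_for {
  my ($row_length, $max_row_length) = @_;
  # The last cell covers its own column plus every missing column.
  return $max_row_length - $row_length + 1;
}

# Sample data: the widest row has 3 cells.
my @rows = ([qw(a b c)], [qw(d e)], [qw(f)]);
my $max = 0;
foreach my $row (@rows) {
  $max = @$row if @$row > $max;
}

foreach my $row (@rows) {
  my $span = colspan_for(scalar(@$row), $max);
  # Only emit the attribute when the cell really spans more than one column.
  my $last = $span > 1 ? "colspan=$span|$row->[-1]" : $row->[-1];
  print join(" ", @{$row}[0 .. $#$row - 1], $last), "\n";
}
```

A full-width row gets a span of 1, so no attribute is emitted for it.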

Rowspan can be done across any subset of rows that is top-level or follows
another rowspan, with the caveat noted below.  Rowspan is caused by a drop in
indent level to less than zero.

If you don't end a cell's content line with a pipe, then all content up to the
next cell is placed in one cell, and that cell is forced to be the last cell on
the row.  This means in turn that rowspan can only be done with a table at
least 3 columns wide, because:

a |
 b
 c
end

is parsed as [ a | bc ].

All cells in a row except the last are bolded.
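
The notes above cover PHPWiki's indent-based tables, whereas parse_table
earlier in this file splits Trac rows of the form ||a||b||.  Here is a minimal
sketch of that splitting, assuming cell content never contains ||.  The name
split_trac_row is made up for illustration; the script itself uses a regex
loop rather than split().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): split one Trac table row
# into its cells.
sub split_trac_row {
  my ($line) = @_;
  $line =~ s/^\s+|\s+$//g;   # trim both ends
  $line =~ s/^\|\|//;        # drop the opening ||
  $line =~ s/\|\|$//;        # drop the closing ||, if present
  # A limit of -1 keeps empty cells at the end of the row.
  my @cells = split(/\|\|/, $line, -1);
  # As in parse_table, turn any remaining single | into /.
  tr{|}{/} for @cells;
  return @cells;
}

print join(", ", split_trac_row("||Name||Value||")), "\n";
```

An empty cell in the middle of a row, as in ||a||||b||, comes back as an empty
string, which matches what parse_table stores for it.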