An entry in the 2003 (0th) International Obfuscated Haskell Code Competition
----------------------------------------------------------------------------

Program:     'remorse'
Author:      Malcolm.Wallace@cs.york.ac.uk
Copyright:   Public Domain


TO BUILD
--------

To build the program, just use 'make'.  The Makefile assumes ghc,
but nhc98 works equally as well.

Note, this file contains progessively more information about this
submission, starting with the simple command-line USAGE.  You can
also get that information by running the program with no arguments.

Then follow some SIMPLE HINTS, some MORE HINTS, and finally the full
description marked with SPOILER, further down.  Read only as far as
you want to...



USAGE
-----

    Usage: remorse (+/-) file.hs

The output goes to stdout, so it doesn't overwrite your original file.
'remorse' is an invertible function: use the + or - argument to decide
the direction.



SIMPLE HINTS
------------


'remorse' transforms Haskell code into Haskell code.  It is so-named
because, if you wish you hadn't used it, you can go back and undo
your action.  I am certainly beginning to regret writing it.  :-)

You might like to try running 'remorse +' on some arbitrary Haskell'98
modules you have lying around, perhaps even some of the other entries
in the IOHC!  The results should still be valid Haskell modules
(with one caveat (see below)).

To undo the changes, use 'remorse -', and you should get the original
module back.



MORE HINTS
----------


'remorse' is an automatic Haskell code obfuscator and de-obfuscator.
With the command-line argument +, it generates an obfuscated version
of the program, and with the argument -, it undoes the obfuscation.

Obviously, the primary test case was 'remorse' itself, and of course
the version I submitted has already had 'remorse +' applied to itself.
To view the (relatively) unobfuscated version, try 'remorse -' on the
source code.

Now, even when you auto-de-obfuscate 'remorse', it will still not be
easy to read I'm afraid.  I had to use all the obvious techniques to
squash the source code to fit into the 5k limit.  (Actually, it was
the 'remorse'ified version that had to fit into 5k - the 'remorse'less
version is only 2.7k.)  I really wish the limit had been 5k LoC!

    * omission of type signatures
    * removal of all non-essential whitespace between identifiers+patterns
    * use of explicit {} and ; to avoid layout whitespace
    * grossly shortened identifier names
    * local re-binding of Prelude names to shorten them
    * import renaming to shorten the module part of a qualified name

Nevertheless, I felt it prudent to retain a /few/ comments in the
source, just so I can remember what each function does.



SPOILER ALERT!
--------------


The technique used by 'remorse' to obfuscate programs is to rename
all the identifiers using a morse-code spelling.  Any alphanumeric
variable name can be replaced in Haskell by a symbolic operator name
surrounded by parentheses.  This includes the argument positions in
function and pattern bindings.  :-)  Each valid alphanumeric character
in an identifier is replaced by its morse code equivalent, in dots
and dashes, with a separating bar between characters.  Thus, 'a' = .-
and 'b' = -... so the identifier ab = (.-|-...)

The full transliteration was found on the web at
    http://www.babbage.demon.co.uk/morseabc.html

The maximal munch rule of Haskell'98 ensures that there is no problem
using operator names that look a little bit like the start of a
comment, e.g --..  The only tricky part is to ensure that ambiguous
parses cannot happen.  Luckily, there are only five ambiguous cases:
each a single-letter identifier:
    e = .    = overlaps with composition
    i = ..   = syntax for numeric ranges
    m = --   = end-of-line comment
    o = ---  = end-of-line comment
    t = -    = overlaps with subtraction
For these few exceptions, we simply append a bar to the morse spelling.

Morse code covers only one case of the alphabetic characters, which we
chose for convenience to be lower-case.  Thus, every upper-case character
needs a special tag, for which we chose a trailing + character.  Valid
Haskell varids also permit underscore _ and prime ' which have no
morse equivalent, so we chose ! and = respectively (allowing some nice
confusion with function/pattern bindings.)

Now, another special case.  An ordinary varid can be turned into an
infix varop by enclosing it in backquotes, e.g. `elem`, but we have
already translated the varid into a varop surrounded by parentheses,
so in this case we must rather un-bracket the varop and ensure that
the backquotes are dropped.

One cannot spell keywords or conids (type, constructor, or module
name) in morse code - a conop is not permitted in place of a conid at
all syntactic locations.  What about qualified names?  The choices
here are to translate /all/ modules including imported ones, and
thus the qualified name should be spelled in morse; or to translate
modules stand-alone in which case the qualified names should remain as
they are.  I chose the latter course of action, since translating the
entire Prelude and Libraries would have far exceeded the competition's
5k limit.  :-)

This leads us to the previously-mentioned caveat on the validity of
morse-spelled Haskell programs generated by 'remorse'.  Any imported
function name that is /not/ qualified, will be spelled in morse,
and of course there will be no definition with that spelling.  Thus,
to compile and run a 'remorse'ful program, you must either 'remorse'
the whole import dependency graph including the Prelude and Libraries,
or qualify each import.  As illustrated in 'remorse' itself, you can
get away with qualifying each name only once if you use a local binding
e.g. x = Prelude.lex  becomes (-..-) = Prelude.lex.

Finally, although I tried to cover all the common lexical features
of Haskell, obviously the 5k limit meant I couldn't do an exhaustive
analysis.  The undo-ing of the morse-code spelling is rather brute
force, and is the most likely place for bugs to lurk.

