Kuda Tech: Ruby Challenge: Run Length Encoding By Michael Morin Ruby Expert

Run-length encoding is a primitive method of data compression popular in early computing. It works by compressing long strings of the same character to a number followed by that character. For example, the string "wooooooooooah" could be compressed as "w10oh".
For the purposes of this article we'll assume that the strings to be encoded don't contain any numbers. A "real" run-length encoder would use a character to mark a run but for the sake of this article don't use any strings with numbers in them.

Encoding

The first step in encoding is to detect runs of the same character. This can be done with a loop, but Ruby has a great regular expression engine that can be used to do this. We only need to use one trick: a backreference.
To match any character in Ruby, we use the . character. For example, "test".gsub(/./, '!') will replace every character with an exclamation point.

The . regexp character matches any character in the input stream. What we need to do now is continue matching based on that first character.

Your first instinct might be to use something like /.+/, one or more of any character. However, the + sign will keep matching any character, it won't care if the following characters are different. What we need to do is place the . regexp character in a capture group (otherwise known as parentheses). So what we now have is /(.)/.
Capture groups are very interesting things. In regular expressions, capture groups are usually used to match specific groups of characters to be consumed via the MatchData object. However, capture groups can be referred to in the regexp string itself.

For example, the regular expression /(ba)\1/ will match the string "baba", the \1 is a reference to the first capture group. Capture groups are numbered by the order of their opening parentheses, and since there is only one opening parenthesis here it's capture group 1.

Now, back to matching runs of characters.

We'll put the . regular expression character into a capture group and refer to it with a backreference. So we have /(.)\1+/. In other words, we have "any character followed by 1 or more of that character" where before we erroneously had "any character followed by 1 or more other characters." This will be the core of the encoding method.

Now we just need a gsub call. A gsub call will replace all instances of a regular expression in a string with something else. In this case, we want to replace all runs of the same character with the number of times that character appears followed by that character. It's rather straightforward, we'll be using the block variant of gsub to have a dynamic output.


def rle_encode(str)
  str.gsub(/(.)\1+/) {|run| "#{run.length}#{$1}" }
end

Like any code that uses regular expressions it can look a bit like line noise at first. There are some techniques to combat that, however this one is so short the reader will just have to suffer through it. Regular expressions are easily the least readable part of any Ruby program, but they're also the most powerful tool for text processing that you can find. They're always a bit unreadable, but they're very, very useful.
One thing that was used here was the $1 variable. Don't be afraid of this Perlism. This is a "global" variable only in name. This $1 variable refers to the first capture group in the most recently evaluated regular expression. It's also thread-local and method-local, so you don't have to worry about race conditions or accidentally overwriting it with a regular expression buried in a method call. We could have just gotten the first character of the run variable, but we really already extracted that first character in the regular expression and it's there waiting for you in the capture group variable.

Decoding

Now that we can run-length encode a string we should work on decoding it. This process is almost the same as far as the code goes. We want to match on any cluster of digits at least 1 character in length, as well as the following character. For ease of access we'll capture both of those in separate capture groups. So the regular expression we want here is /(\d+)(.)/. Now, putting this together following the same gsub pattern of before we get this.


def rle_decode(str)
  str.gsub(/(\d+)(.)/) { $2 * $1.to_i }
end

And that's it. You can see why this was appealing to computing pioneers. When your mainframe, minicomputer or microcomputer has only a few kilobytes of RAM you generally don't have the space for a complex compression algorithm. And if you're transferring data over a 300 baud modem or storing on paper or magnetic tape then every single byte matters. Also, since programs also take memory and time it may not even be worth it to load and run a complex compression algorithm in the first place, save memory and time and use a primitive method like run-length encoding. Honestly you'll probably never see run-length encoding in a modern setting, with ready access to so many great compression algorithms it's only rarely used.

Kuda Tech

Monday, 9 November 2015

Ruby Challenge: Run Length Encoding By Michael Morin Ruby Expert

Encoding

Decoding

No comments:

Post a Comment