Seeing SequencesFrom melodies to text to DNA, many interesting types of information come in the form of a sequence of symbols. Although it's natural to try to analyze these sequences graphically, they present a unique challenge for data visualization. A good visualization method should provide a broad overview of the "shape" of the data--yet with sequences, the only structure ab initio is the completely uniform micro-level connection of each symbol to its immediate neighbors. This page describes a technique, the matching diagram, designed to deduce and display macro-level structure in long sequences. (If you want to skip this description, go directly to these pictures of music.) Matching diagrams are based on the fact that sequences often represent a hierarchy of ideas. Melodies, for instance, are usually based on combinations of smaller repeated musical passages; text has repeated words and phrases. A natural way to find macro-level structure, therefore, is to use these repeated units as signposts.
Unfortunately the human eye is bad at spotting repetition.
(Try to find the longest
repeated subsequence in
To create the diagram, we start by representing the sequence as a line. If two subsequences "match" according to a given criterion, connect the corresponding intervals on the line with an arc. For instance, here is the matching diagram for the numerical sequence in the previous paragraph, where "match" means an exact repetition.
The arc makes it obvious where the repeated subsequence is. What if a sequence is repeated more than once? In that case I connect each successive repetition with an arc, as in the next picture.
An interesting ambiguity arises when a pattern is
repeated many times in immediate succession.
Take the sequence
Most interesting sequences will combine several
different repeated patterns. To make a diagram for a complex
sequence, we overlay all the appropriate arcs, but with a
degree of translucency so no match is
completely obscured. For instance the sequence
This diagram shows how a pattern of matches can provide a bird's-eye view of the sequence's structure. For instance, although the symbols don't form an exact palindrome, it's visually obvious that at the macro level the sequence is symmetric. A long sequence made from a small set of symbols generally has a huge number of repeated sequences, creating an uninformative jumble as in this example:
One way of reducing this complexity is to display only repeated subsequences that are longer than a given limit. For instance, below is the same sequence but this time I've filtered out all repetitions of less than 10 symbols. The result is a simple diagram that highlights a single repeated region of 15 symbols.
Applied to real-life data, matching diagrams produce some intriguing pictures. I'm most interested in their applications to visualizing music but they also can reveal structure in many kinds of encoded data. Of course, this is not the first attempt to visualize sequences; I've written a quick summary of some other methods. Finally, you can read some implementation details if you wish. |