30 June 2014 by Remco Bouckaert
Multi-line strings
Multi-line strings have always been a bit cumbersome in Java. Either you have to add strings in quotes separated by a + sign, like so:
String s = "This is a multi-line\n" +
    "string that takes a lot of quotes\n" +
    "and escape characters";
String[] parts = s.split("\n");
or use a StringBuilder and append every string like so:
String s = new StringBuilder().append("This is a multi-line\n")
    .append("string that takes a lot of quotes\n")
    .append("and escape characters").toString();
String[] parts = s.split("\n");
which is even more verbose. Perl supports multi-line strings and the above simplifies to
perl % $s = 'This is a multi-line
string without very much need
for extra escape characters';
perl % @parts = split('\n', $s);
There is a proposal by Stephen Colebourne to incorporate multi-line strings into Java, but alas, so far it has not made it in (we are at Java 8 now). Fortunately, hidden in the parser, and absent from the documentation, there is this little gem which tells us we can have multi-line strings in BEASTShell. The delimiter is """, so the Perl fragment in BEASTShell would be
bsh % s = """This is a multi-line
string without very much need
for extra escape characters""";
bsh % parts = s.split('\n');
which is just one character more than the Perl script! If you want quotes inside the multi-line string, no escape sequence is necessary, just insert them in the string:
bsh % s = """Darwin's Finches
Homo "Sapiens"
Double quote=""""";
bsh % parts = s.split('\n');
// parts is now [Darwin's Finches, Homo "Sapiens", Double quote=""]
bsh % s = """Darwin's Finches
Homo "Sapiens"
Single quote="""";
bsh % parts = s.split('\n');
// parts is now [Darwin's Finches, Homo "Sapiens", Single quote="]
(which surely upsets the code colouring algorithm :-)). For triple quotes you still need to concatenate two strings:
bsh % s = """Darwin's Finches
Triple quote=""""" + "\"";
bsh % parts = s.split('\n');
// parts is now [Darwin's Finches, Triple quote="""]
but hey, who needs that?
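For comparison, here is a minimal sketch of a third option in plain Java 8: String.join puts the newline between the pieces, so it no longer has to be repeated inside every literal. The class name MultiLineJoin and the helper joinLines are made up for this illustration; only String.join itself is standard library.

```java
import java.util.Arrays;

public class MultiLineJoin {
    // Hypothetical helper: glue the given lines together with "\n" between them.
    static String joinLines(String... lines) {
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String s = joinLines(
            "This is a multi-line",
            "string that takes a lot of quotes",
            "and escape characters");
        // Split it again to get the individual lines back.
        String[] parts = s.split("\n");
        System.out.println(Arrays.toString(parts));
    }
}
```

Every line still needs its own pair of quotes, so this is tidier than the + version but still far from a real multi-line literal.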
The ugliness of regular expressions
Another eyesore in Java syntax is regular expression handling. Look at this fragment, taken almost verbatim from the Java documentation:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MatcherDemo {
    private static final String REGEX = "\\bdog\\b";
    private static final String INPUT = "dog dog dog doggie dogg";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        int count = 0;
        while(m.find()) {
            count++;
            System.out.println("match " + count + " = " + INPUT.substring(m.start(), m.end()));
        }
    }
}
(The original code in the documentation prints the m.start() and m.end(), but why would one be interested in the location instead of the matching string?) The equivalent in BEASTShell would be a bit shorter:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

REGEX = "\\bdog\\b";
INPUT = "dog dog dog doggie dogg";
p = Pattern.compile(REGEX);
// get a matcher object
m = p.matcher(INPUT);
count = 0;
while(m.find()) {
    count++;
    print("match " + count + " = " + INPUT.substring(m.start(), m.end()));
}
But in Perl this can be done even shorter:
$REGEX = "\\b(dog)\\b";
$INPUT = "dog dog dog doggie dogg";
$count = 0;
while ($INPUT =~ /$REGEX/g) {
    $count++;
    print("match $count = $1\n");
}
Gedit counts these fragments as follows:
Language | #words | #characters without whitespace | #characters
Java | 68 | 400 | 548
BeanShell | 46 | 262 | 313
Perl | 22 | 120 | 151
In summary, regular expressions in Java and — to a lesser extent — BEASTShell are very verbose and ugly.
Improved regular expressions
BEASTShell has a regexp command that returns a list of matches, which can be used like so:
REGEX = "\\b(dog)\\b";
INPUT = "dog dog dog doggie dogg";
for(s : regexp(INPUT,REGEX)) {
    print("match " + (++count) + " = " + s);
}
And now we can extend the table of Gedit counts:
Language | #words | #characters without whitespace | #characters
Java | 68 | 400 | 548
BeanShell | 46 | 262 | 313
Perl | 22 | 120 | 151
BEASTShell | 19 | 109 | 141
This is of course cheating a bit, because we can hide any complex function inside a command. But I think it is justified here, since regular expression matching is common enough and verbose enough in Java/BeanShell that a few extra commands are very helpful. It also shows how a well-chosen command can help streamline your scripts.
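To make the "hide it inside a command" idea concrete, here is a rough sketch in plain Java of what such a helper could look like. It is not BEASTShell's actual implementation: the class name RegexpSketch is invented, and I am assuming the helper simply returns every matched substring in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpSketch {
    // Hypothetical helper: collect every substring of input that matches regex.
    static List<String> regexp(String input, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group()); // group() is the whole match
        }
        return matches;
    }

    public static void main(String[] args) {
        int count = 0;
        // The Pattern/Matcher boilerplate is now out of sight of the caller.
        for (String s : regexp("dog dog dog doggie dogg", "\\bdog\\b")) {
            System.out.println("match " + (++count) + " = " + s);
        }
    }
}
```

The calling loop is about as short as the Perl version; all the verbosity has moved into a helper that only has to be written once.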
Filter *BEAST analysis
Let’s put this to work to select a set of species taxa in an existing *BEAST analysis. First, define the original species and their sequences.
// the original species and their sequences
s="""Cyanocorax_mystacalis = {GU144828 GU144829 GU144831 GU144830}
C.validus = {JQ023974}
Urocissa_erythrorhyncha = {JQ864482}
C.capensis = {JQ023977}
Cyanocorax_yucatanicus = {DQ912613 GU144848}
Perisoreus_canadensis = {JQ655939 JQ656012 JQ656006 JQ655975}
""";
// one species per entry, so we can process s line by line
data = s.split("\n");
// the new set of species
filter ="""Urocissa_erythrorhyncha
Perisoreus_canadensis
Cyanocorax_yucatanicus""";
Then, we define a command to print out a new taxonset, using regexp to get info out of the string:
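newtaxon(s, filter) {
    // x[1] is the species name, x[2] the list of sequence ids between the braces
    x = regexp(s, "(.*)=.*\\{(.*)\\}.*");
    //if (!filter.matches("(?s).*\\b"+Matcher.quoteReplacement(x[1].trim())+"\\b.*")) {
    // skip species that are not named in the filter
    if (!filter.matches("(?s).*\\b"+x[1].trim()+"\\b.*")) {
        return;
    }
    print("");
    // one line per sequence id of this species
    for (id : x[2].trim().split("\\s+")) {
        print(" ");
    }
    print("");
}
for (s : data) {newtaxon(s, filter);}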
Assuming the original sequences are in a file called source.xml, we can grab the sequences from the file using:
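newsequence(s, filter) {
    // x[1] is the species name, x[2] the list of sequence ids between the braces
    x = regexp(s, "(.*)=.*\\{(.*)\\}.*");
    // skip species that are not named in the filter
    if (!filter.matches("(?s).*\\b"+x[1].trim()+"\\b.*")) {
        return;
    }
    // look up each sequence id in source.xml
    for (id : x[2].trim().split("\\s+")) {
        exec("grep "+ id + " source.xml | grep sequence");
    }
}
for (s : data) {newsequence(s, filter);}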
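The parse-and-filter step shared by both commands can also be sketched in plain Java, which makes it easier to see what the regular expression captures. This is only an illustration under my own assumptions: the class TaxonFilterSketch and the method select are invented names, and instead of matching the species name against the filter string as a regular expression, the sketch compares names exactly against a set, so a name such as C.validus is not treated as a pattern.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxonFilterSketch {
    // One entry per line: species name, '=', then sequence ids between braces.
    private static final Pattern ENTRY = Pattern.compile("(.*)=.*\\{(.*)\\}.*");

    // Hypothetical helper: map each kept species to its list of sequence ids.
    static Map<String, List<String>> select(String[] data, Set<String> keep) {
        Map<String, List<String>> selected = new LinkedHashMap<>();
        for (String line : data) {
            Matcher m = ENTRY.matcher(line);
            if (!m.matches()) {
                continue; // not a "species = {ids}" line
            }
            String species = m.group(1).trim();
            if (!keep.contains(species)) {
                continue; // species is not in the new set
            }
            selected.put(species, Arrays.asList(m.group(2).trim().split("\\s+")));
        }
        return selected;
    }

    public static void main(String[] args) {
        String[] data = {
            "C.validus = {JQ023974}",
            "Urocissa_erythrorhyncha = {JQ864482}",
            "Cyanocorax_yucatanicus = {DQ912613 GU144848}"
        };
        Set<String> keep = new HashSet<>(Arrays.asList(
            "Urocissa_erythrorhyncha", "Cyanocorax_yucatanicus"));
        for (Map.Entry<String, List<String>> e : select(data, keep).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

From the resulting map it is a small step to print the taxon sets or to look up each id in source.xml, as the two BEASTShell commands do.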