Tag Archives: String processing

BEASTShell String Tricks — this site is deprecated: goto https://www.beast2.org/ for up-to-date information

30 June 2014 by Remco Bouckaert

Multi-line strings

Multi-line strings have always been a bit cumbersome in Java. Either you have to add strings in quotes separated by a + sign, like so:

	String s = "This is a multi-linen" +
		"string that takes a lot of quotesn"+
		"and escape characters";
	String [] parts = s.split("n");

or use a StringBuilder and append every string like so:

	String s = new StringBuilder().append("This is a multi-linen").
		.append("string that takes a lot of quotesn").
		.append("and escape characters").toString();
	String [] parts = s.split("n");

which is even more verbose. Perl supports multi-line strings and the above simplifies to

perl % $s = 'This is a multi-line
string without very much need
for extra escape characters';
perl % @parts = split('n', $s);

There is a proposal by Stephen Colebourne to incorporate multi-line strings into Java, but ala, so far that never made it (we are at Java 8 now). Fortunately, hidden in the parser, and absent from the documentation, there is this little gem which tells we can have multi-line strings in BEASTShell. The escape sequence is “””, so the Perl fragment in BEASTShell would be

bsh %  s = """This is a multi-line
string without very much need
for extra escape characters""";
bsh %  parts = s.split('n');

which is just one characters more than the Perl script! If you want quotes inside the multi-line string, no escape sequence is necessary, just insert them in the string:

bsh %  s = """Darwin's Finches
Homo "Sapiens"
Double quote=""""";
bsh %  parts = s.split('n');
// parts is now [Darwin's Finches, Homo "Sapiens", Double quote=""]

bsh %  s = """Darwin's Finches
Homo "Sapiens"
Single quote="""";
bsh %  parts = s.split('n');
// parts is now [Darwin's Finches, Homo "Sapiens", Single quote="]

(which surely upsets the code colouring algorithm:-)) For triple quotes you still need to add two strings

bsh % s = """Darwin's Finches
Triple quote=""""" + """;
bsh %  parts = s.split('n');
// parts is now [Darwin's Finches, Triple quote="""]

but, he, would needs that?

The ugliness of regular expressions

Another eye-sore is Java syntax is regular expression handling. Look at this fragment, almost completely from the Java documentation:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MatcherDemo {

    private static final String REGEX = "\bdog\b";
    private static final String INPUT = "dog dog dog doggie dogg";

    public static void main(String[] args) {
       Pattern p = Pattern.compile(REGEX);
       //  get a matcher object
       Matcher m = p.matcher(INPUT);
       int count = 0;
       while(m.find()) {
           count++;
           System.out.println("match " + count + " = " + INPUT.substring(m.start(), m.end()));
      }
   }
}

(The original code in the documentation prints the m.start() and m.end(), but why would one be interested in the location instead of the matching string?) The equivalent in BEASTShell would be a bit shorter:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

REGEX = "\bdog\b";
INPUT = "dog dog dog doggie dogg";

p = Pattern.compile(REGEX);
//  get a matcher object
m = p.matcher(INPUT);
count = 0;
while(m.find()) {
   count++;
   print("match " + count + " = " + INPUT.substring(m.start(), m.end()));
}

But in Perl this can be done even shorter:

$REGEX =  "\b(dog)\b";
$INPUT =  "dog dog dog doggie dogg";

$count = 0;
while ($INPUT =~ /$REGEX/g) {
	$count++;
    print("match $count = $1n");
}

Gedit counts these fragments as follows

#words #characters #characters
without whitspace
Java 68 400 548
BeanShell 46 262 313
Perl 22 120 151

In summary, regular expressions in Java and — to a lesser extent — BEASTShell are very verbose and ugly.

Improved regular expressions

BEASTShell has a regexp command that returns a list of matches, which can be used like so:

$REGEX =  "\b(dog)\b";
$INPUT =  "dog dog dog doggie dogg";

for(s : regexp(INPUT,REGEX)) {   
	print("match " + (++count) + " = " + s);
}

And now we can extend the table of Gedit counts:

#words #characters #characters
without whitspace
Java 68 400 548
BeanShell 46 262 313
Perl 22 120 151
BEASTShell 19 109 141

This is of course a bit cheating, because we can hide any complex function inside a command. But I think this is justified here since regular expression matching is common enough and verbose enough in Java/BeanShell that a few extra commands are very helpful. Also, it shows how an effectively defined command can help streamline your scripts.

Filter *BEAST analysis

Let’s put this to work to select a set of species taxa in an existing *BEAST analysis. First, define the original species and their sequences.

// the original species and their sequences
s="""Cyanocorax_mystacalis = {GU144828 GU144829 GU144831 GU144830}
C.validus = {JQ023974} 
Urocissa_erythrorhyncha = {JQ864482}
C.capensis = {JQ023977}
Cyanocorax_yucatanicus = {DQ912613 GU144848}
Perisoreus_canadensis = {JQ655939 JQ656012 JQ656006 JQ655975} """;

// one species per entry, so we can process s line by line
data = s.split("n");

// the new set of species 
filter ="""Urocissa_erythrorhyncha
Perisoreus_canadensis
Cyanocorax_yucatanicus""";

Then, we define a command to print out a new taxonset, using regexp to get info out of the string:

newtaxon(s, filter) {
	x = regexp(s, "(.*)=.*{(.*)}.*");
	//if (!filter.matches("(?s).*\b"+Matcher.quoteReplacement(x[1].trim())+"\b.*")) {
	if (!filter.matches("(?s).*\b"+x[1].trim()+"\b.*")) {
		return;
	}
	print("");
	for (id : x[2].trim().split("\s+")) {
		print("    ");
	}
	print("");
}

for (s : data) {newtaxon(s, filter);}

Assuming the original sequences are in a file called source.xml, we can grab the sequences from the file using:

newsequence(s, filter) {
	x = regexp(s, "(.*)=.*{(.*)}.*");
	if (!filter.matches("(?s).*\b"+x[1].trim()+"\b.*")) {
		return;
	}
	for (id : x[2].trim().split("\s+")) {
		exec("grep "+ id + " source.xml | grep sequence");
	}
}

for (s : data) {newsequence(s, filter);}