
Zuzur’s wobleg

technology, Internet, and a bit of this and that (with some dyslexia inside)

Some Groovy Magic


What do I like most about Groovy?

  • tightly connected to the JVM, so you can use any existing Java class in your project
  • dynamic, with meta-programming

Why? Let me show you a few examples of why Groovy doesn’t suck…

For my current project, I need to use the Weka toolkit to perform some text-based analysis. While this toolkit is geared toward its own Swing user interface (yuck!), it also provides an API. But even if the API isn’t that tightly coupled to the tool, many classes are clearly built to be called from a program with a main() method, by users happy to type switches on the command line (I am one of them…). To say the least, it’s not very programmer-friendly…

Proxying instances

The Weka toolkit provides a simple class called NGramTokenizer, whose function is just to find multigrams in a String.

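import weka.core.tokenizers.NGramTokenizer

// the n-gram size bounds and the input text (words) come from the calling code
NGramTokenizer tokenizer = new NGramTokenizer()
tokenizer.setNGramMinSize(ngramMinSize)
tokenizer.setNGramMaxSize(ngramMaxSize)
tokenizer.setDelimiters("(\\p{Blank}+|\\p{Punct})")
tokenizer.tokenize(words)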

And then you can iterate over each multigram found in the text you’re working on. It provides 2 methods for this: hasMoreElements() and nextElement() (the classic, 10-year-old JDK 1.0 Enumeration way to iterate in Java).

Now suppose you want to extract a dictionary from this text (the distinct words with their counts). The naive approach would be something like this:

  Map dict = [:]

  while(tokenizer.hasMoreElements()) {
    String token = tokenizer.nextElement()
    def count = dict[token]
    if (count != null) {
      dict[token] = count + 1
    } else {
      dict[token] = 1
    }
  }

Nice… but not thrilling :)

Now, the Groovy Development Kit adds a nice method named countBy() to every instance of Collection, which does just what it says… exactly what the piece of code above does. It may be more efficient, perhaps using a faster tree-based algorithm? That’s something I still need to check…
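To make that concrete, here is a minimal sketch on a plain List, so it runs without Weka. The tokens list is made up for illustration; the only GDK call involved is countBy().

```groovy
// Hypothetical sample data standing in for the tokens of a real text
def tokens = ['to', 'be', 'or', 'not', 'to', 'be']

// The hand-written loop: bump the counter for every token seen
Map manual = [:]
for (String token in tokens) {
    manual[token] = (manual[token] ?: 0) + 1
}

// The GDK way: group by the token itself and count each group
Map counted = tokens.countBy { it }

assert counted == manual
assert counted == ['to': 2, 'be': 2, 'or': 1, 'not': 1]
```

Both maps come out identical, so countBy { it } can replace the loop whenever the tokens are available as a collection.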

I can’t use that countBy() method on my NGramTokenizer instance because it isn’t a Collection and doesn’t inherit from one. So in the Java world, I would have to stick with the loop… It’s a different story in Groovy:

  // a Map of closures, coerced to Iterator, that delegates to the tokenizer
  def tokenizerProxy = [next: { tokenizer.nextElement() },
              hasNext: { tokenizer.hasMoreElements() }] as Iterator

  Map dict = tokenizerProxy.countBy { it }

That’s all. I define a Map instance which contains the 2 closures (next() and hasNext()) that countBy() relies on to generate the dictionary, coerce it to an Iterator with the as operator, and use it as a proxy that countBy() can walk as if it were a collection.
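Here is a self-contained sketch of the same trick that runs without Weka. It assumes java.util.StringTokenizer as a stand-in for NGramTokenizer, since it exposes the same hasMoreElements()/nextElement() pair; the sample sentence is made up.

```groovy
// Stand-in for the Weka tokenizer (an assumption for this sketch):
// StringTokenizer offers the same Enumeration-style methods
def tokenizer = new StringTokenizer('to be or not to be')

// Each entry is a closure, so nothing is consumed until countBy() asks for it.
// "as Iterator" turns the Map into an object implementing java.util.Iterator.
def tokenizerProxy = [hasNext: { tokenizer.hasMoreElements() },
                      next   : { tokenizer.nextElement() }] as Iterator

Map dict = tokenizerProxy.countBy { it }

assert dict == ['to': 2, 'be': 2, 'or': 1, 'not': 1]
```

The braces around each call matter: without them the Map would hold the result of one call to each method instead of something countBy() can invoke repeatedly.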

All this happens at the instance level; I don’t have to change anything at the class level or modify NGramTokenizer in any way…

There is no dramatic benefit, but the code is shorter and more readable - your mileage may vary, of course… But, more importantly, it relies on well-tested and proven code (countBy()), instead of my half-assed, trivial while loop…

Short-circuiting methods in unit tests

Sometimes, your code relies on services such as an HTTP server, a database server, RabbitMQ, etc., that aren’t available on the test server. You really want to write a unit test, but have no time to tinker with the infrastructure on the build/test server (you have a non-negligible chance of pissing off your Operations team in the process, too ;-)). So what are your options?

Imagine a system that collects data from URLs and builds a dictionary out of it… You have developed a service class that relies on an HTTP server being present to retrieve the content from a URL, and returns the dictionary as a Map…

The service would look like this:

  String retrieveContent(URL u) throws MalformedURLException,IOException {
    // connect and collect only the text from the document
  }

  Map buildDictionnary(URL u) {
    String content
    try {
      content = retrieveContent(u)
    } catch...

    // there goes the code from the previous example
  }

A test for this might look like this:

   void testAcmeDictionnary() {
     def service = new URLService()

     Map dict = service.buildDictionnary(new URL("http://www.acme.com"))

     assert dict.keySet().size() == /// expected number of keywords
     assert dict['ACME'] == // expected number of instances of the word 'ACME'
   }

There are many problems here:

  • you may not be able to connect to http://www.acme.com
  • they might not be too happy with your tests hitting their servers regularly
  • their public page might change and break your assertions, forcing you to update your test every time they edit their home page, …

So what if you did this instead:

void testAcmeDictionnary() {
  def service = new URLService()

  // make retrieveContent() always succeed with a fixed content
  service.metaClass.retrieveContent = { URL u -> return "<HTML><BODY>....</BODY></HTML>"}

  Map dict = service.buildDictionnary(new URL("http://www.acme.com"))

  assert dict.size() == /// expected number of keywords
  assert dict['ACME'] == // expected number of instances of the word 'ACME'
}

And there you go. Your test no longer relies on an external service you have no control over…
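For a version you can paste into a Groovy console, here is a minimal sketch. This URLService is a hypothetical stand-in (the real one isn’t shown in full): its retrieveContent() always fails, as it would on a build server without network access, and buildDictionnary() simply splits the content on whitespace rather than using the NGramTokenizer code. The canned content is made up too.

```groovy
// Hypothetical stand-in for the real service class
class URLService {
    String retrieveContent(URL u) throws IOException {
        // pretend the remote server is unreachable
        throw new IOException("no network access in this sketch")
    }

    Map buildDictionnary(URL u) {
        String content = retrieveContent(u)
        // simplification: whitespace-separated words instead of n-grams
        content.tokenize().countBy { it }
    }
}

def service = new URLService()

// replace retrieveContent() on this instance with a canned answer
service.metaClass.retrieveContent = { URL u -> 'ACME sells ACME anvils' }

// constructing the URL does not open a connection
Map dict = service.buildDictionnary(new URL('http://www.acme.com'))

assert dict.size() == 3
assert dict['ACME'] == 2
```

The override goes through Groovy’s dynamic method dispatch, so this sketch assumes the service is a regular Groovy class that is not compiled with @CompileStatic.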

Nice, eh ! ;-)
