ANNOUNCE: brillig 0.3 - not quite the Brill tagger
aleks.dimitrov at googlemail.com
Wed Sep 7 19:55:07 BST 2011
On Wed, Sep 07, 2011 at 02:37:14PM -0400, wren ng thornton wrote:
> On 9/7/11 12:40 PM, Rogan Creswick wrote:
> >Is anyone else interested in supporting the Apache UIMA CAS format(s)?
> >I'm not a *huge* fan of the gritty system design details in UIMA (it
> >seems absurdly difficult to actually use an analysis engine / pear in
> >an application) but at least the file format for annotations is
> >somewhat standardized.
> Do you have a reference to the specifications? If it's sufficiently
> standardized I could put something together.
> I already have parsers for a number of common tagging formats (or
> parser formats treated as mere tagging formats):
> * "Brown format", i.e. the format people usually mean when they talk
> about the Brown corpus, rather than the actual format used for
> originally distributing the Brown corpus
> * CoNLL-X shared task format
> * NeGra Export Format for Annotated Corpora, version 3
> * TnT
> and the beginnings of a framework for being able to swap them around
> without a care. Once I get a break from teaching long enough to post
> my tagger to Hackage, this'll be in there too.
UIMA is *much* more than just a framework for unifying tagsets or tagger formats.
It isn't even specifically focused on text or NLP applications. Theoretically,
you could annotate videos or sound files with it. It just provides a common
infrastructure for annotating *data* of any kind.
The basic components are:
- The CAS, which contains the data and the annotations
- The Analysis Engines (AEs), which read the CAS, then put annotations in it. They
have access to the complete CAS, which means they have access to the previous
annotations.
- The annotation type system, which is defined at compile time. Annotators can
either consume raw data or other annotators' annotations, as defined by the
type system. The type system also defines what the annotations look like and
what they can contain. Technically they're plain old Java objects, which means
they're quite limited (no multiple inheritance, etc.).
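
To make the three pieces above a bit more concrete, here is a toy sketch in
Python. It is *not* the UIMA API (UIMA itself is Java); the names Annotation,
Token, Sentence, CAS, add, select and covered_text are all made up for
illustration. The only point is that annotations are stand-off spans over the
data, organised in a single-inheritance type hierarchy, and that anything
reading the CAS can query it by type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical base type: a span of character offsets into the data."""
    begin: int
    end: int


@dataclass
class Token(Annotation):
    pos: Optional[str] = None  # left empty until some tagger fills it in


@dataclass
class Sentence(Annotation):
    pass


@dataclass
class CAS:
    """Toy stand-in for a CAS: the raw data plus all annotations so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, annotation):
        self.annotations.append(annotation)

    def select(self, kind):
        # Every annotation of the given type, including its subtypes.
        return [a for a in self.annotations if isinstance(a, kind)]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


cas = CAS("Time flies.")
cas.add(Token(0, 4))
cas.add(Token(5, 10))
cas.add(Token(10, 11))
cas.add(Sentence(0, 11))
print([cas.covered_text(t) for t in cas.select(Token)])
```

Because the text itself is never modified, a later engine can add its own
annotations without disturbing the ones that are already there.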
A typical process would be: put plain text into the CAS. Run a tokenizer over
it, which will populate the CAS with token annotations. Run a sentence boundary
detector to add sentence annotations. Write a PoS-AE (analysis engine) that
looks at all the token annotations within each sentence annotation and adds
PoS-tag information to the Token objects. Etc.
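
Roughly, that process could look like the following toy pipeline. Again this is
plain Python and not UIMA code: the CAS is just a dict, the three "engines" are
ordinary functions, and the regex tokenizer, the punctuation-based sentence
splitter and the LEXICON lookup table are assumptions made up to keep the
example short. A real setup would use proper components for each step.

```python
import re

# Made-up lexicon for the toy tagger; anything not listed is tagged "UNK".
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB", ".": "PUNCT"}


def of_type(cas, type_name):
    return [a for a in cas["annotations"] if a["type"] == type_name]


def covered_text(cas, ann):
    return cas["text"][ann["begin"]:ann["end"]]


def tokenizer(cas):
    # Consumes the raw text, produces Token annotations.
    for m in re.finditer(r"\w+|[^\w\s]", cas["text"]):
        cas["annotations"].append(
            {"type": "Token", "begin": m.start(), "end": m.end()})


def sentence_detector(cas):
    # Consumes Tokens, produces Sentences. Naive rule: a sentence ends at
    # '.', '!' or '?'; trailing tokens without a terminator get no sentence.
    start = None
    for tok in of_type(cas, "Token"):
        if start is None:
            start = tok["begin"]
        if covered_text(cas, tok) in {".", "!", "?"}:
            cas["annotations"].append(
                {"type": "Sentence", "begin": start, "end": tok["end"]})
            start = None


def pos_tagger(cas):
    # Consumes Sentences and Tokens, adds a "pos" feature to each Token
    # that lies within a Sentence.
    tokens = of_type(cas, "Token")
    for sent in of_type(cas, "Sentence"):
        for tok in tokens:
            if sent["begin"] <= tok["begin"] and tok["end"] <= sent["end"]:
                word = covered_text(cas, tok).lower()
                tok["pos"] = LEXICON.get(word, "UNK")


cas = {"text": "The dog barks. It sleeps.", "annotations": []}
for engine in (tokenizer, sentence_detector, pos_tagger):
    engine(cas)

for tok in of_type(cas, "Token"):
    print(covered_text(cas, tok), tok["pos"])
```

The order matters: each engine only works because the previous ones have
already put the annotations it consumes into the CAS.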
You could read through the official documentation to get an idea of what UIMA is
all about: http://uima.apache.org/documentation.html
There was a publication somewhere about UIMA, written by Tilo Götz and Oliver
Suhre … Ah, here it is:
This should get you started. I don't think there's really a "specification" for