ANNOUNCE: brillig 0.3 - not quite the Brill tagger
wren ng thornton
wren at freegeek.org
Wed Sep 7 19:37:14 BST 2011
On 9/7/11 12:40 PM, Rogan Creswick wrote:
> Is anyone else interested in supporting the Apache UIMA CAS format(s)?
> I'm not a *huge* fan of the gritty system design details in UIMA (it
> seems absurdly difficult to actually use an analysis engine / pear in
> an application) but at least the file format for annotations is
> somewhat standardized.
Do you have a reference to the specifications? If it's sufficiently
standardized I could put something together.
I already have parsers for a number of common tagging formats (or parser
formats treated as mere tagging formats):
* "Brown format", i.e. the format people usually mean when they talk
  about the Brown corpus, rather than the format originally used to
  distribute the Brown corpus
* CoNLL-X shared task format
* NeGra Export Format for Annotated Corpora, version 3
and the beginnings of a framework for being able to swap them around
without a care. Once I get a break from teaching long enough to post my
tagger to Hackage, this'll be in there too.
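A minimal sketch of what such a format-swapping framework could look like in Haskell is given here. It is illustrative only and is not the tagger's actual code: the names Tagged, Format, brownLike, columnLike and convert are all hypothetical, and columnLike is a two-column simplification (word, tab, tag, blank line between sentences), not the full CoNLL-X layout.

```haskell
import Data.List (intercalate)

-- | A token paired with its part-of-speech tag.
data Tagged = Tagged { word :: String, tag :: String }
  deriving (Eq, Show)

-- | A sentence is a list of tagged tokens.
type Sentence = [Tagged]

-- | Hypothetical interface: a format is just a parser plus a printer.
data Format = Format
  { parseSentences  :: String -> [Sentence]
  , renderSentences :: [Sentence] -> String
  }

-- | Slash-delimited "word/TAG" tokens, one sentence per line.
brownLike :: Format
brownLike = Format
  { parseSentences  = filter (not . null) . map (map splitTok . words) . lines
  , renderSentences = unlines . map (unwords . map joinTok)
  }
  where
    -- Split at the LAST slash, so words that contain '/' survive.
    splitTok s = case break (== '/') (reverse s) of
      (revTag, _ : revWord) -> Tagged (reverse revWord) (reverse revTag)
      (revAll, [])          -> Tagged (reverse revAll) ""
    joinTok (Tagged w t) = w ++ "/" ++ t

-- | Simplified column format: "word<TAB>tag" per line, with a blank
-- line between sentences. (An assumption; real CoNLL-X has more columns.)
columnLike :: Format
columnLike = Format
  { parseSentences  = map (map splitCols) . paragraphs . lines
  , renderSentences = intercalate "\n" . map (unlines . map joinCols)
  }
  where
    splitCols l = case break (== '\t') l of
      (w, _ : t) -> Tagged w t
      (w, [])    -> Tagged w ""
    joinCols (Tagged w t) = w ++ "\t" ++ t

-- | Group lines into blocks separated by one or more empty lines.
paragraphs :: [String] -> [[String]]
paragraphs ls = case break null (dropWhile null ls) of
  ([], _)   -> []
  (p, rest) -> p : paragraphs rest

-- | Swap formats by going through the shared Sentence representation.
convert :: Format -> Format -> String -> String
convert from to = renderSentences to . parseSentences from
```

With these definitions, convert brownLike columnLike turns a line such as "The/DT dog/NN" into one "word<TAB>tag" line per token. Going through a shared Sentence type means each new format needs only one parser and one printer, rather than a converter for every pair of formats.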
However, some formats, like those called "Penn Treebank format", aren't
standardized well enough to permit a single implementation;
everybody's Penn POS annotations are different. The treebank format
itself is fine; it's only the POS formats that are intractable.