ANNOUNCE: brillig 0.3 - not quite the Brill tagger
creswick at gmail.com
Wed Sep 7 21:37:59 BST 2011
On Wed, Sep 7, 2011 at 11:37 AM, wren ng thornton <wren at freegeek.org> wrote:
> On 9/7/11 12:40 PM, Rogan Creswick wrote:
>> Is anyone else interested in supporting the Apache UIMA CAS format(s)?
>> I'm not a *huge* fan of the gritty system design details in UIMA (it
>> seems absurdly difficult to actually use an analysis engine / pear in
>> an application) but at least the file format for annotations is
>> somewhat standardized.
> Do you have a reference to the specifications? If it's sufficiently
> standardized I could put something together.
"UIMA" is an OASIS standard, so there's plenty to read ;)
Here's a link to the start of the "Full UIMA Specification" -- the
details about the CAS representations are probably the most pertinent
for this discussion.
I believe there are currently two standard (textual) CAS
representations: XCAS and XMI CAS - the former is relatively simple,
but appears to be getting phased out in favor of the XMI-based
> I already have parsers for a number of common tagging formats (or parser
> formats treated as mere tagging formats):
> * "Brown format", i.e. the format people usually mean when they talk about
> the Brown corpus, rather than the actual format used for originally
> distributing the Brown corpus
> * CoNLL-X shared task format
> * NeGra Export Format for Annotated Corpora, version 3
> * TnT
> and the beginnings of a framework for being able to swap them around without
> a care. Once I get a break from teaching long enough to post my tagger to
> Hackage, this'll be in there too.
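For readers who haven't met the "Brown format" mentioned above: it is usually understood as whitespace-separated word/TAG tokens, one sentence per line. The following is only a rough Python sketch under that assumption, not the parser described in the quoted message; the function name `parse_brown_line` and the choice to return `None` for a missing tag are made up for illustration.

```python
def parse_brown_line(line, sep="/"):
    """Split one line of whitespace-separated word/TAG tokens into (word, tag) pairs.

    Assumption: the tag follows the LAST separator in each token, so a word
    that itself contains the separator (e.g. "1/2/cd") keeps its full form.
    """
    pairs = []
    for token in line.split():
        # rpartition returns ("", "", token) when the separator is absent.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator at all: keep the token and mark the tag as missing.
            pairs.append((token, None))
        else:
            pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    for word, tag in parse_brown_line("The/at jury/nn said/vbd ./."):
        print(word, tag)
```

Splitting on the last separator rather than the first is a design choice, and real corpora may need more care (escaped slashes, compound tags, and so on).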
> However, some formats like those called "Penn Treebank format" aren't
> actually standardized sufficiently to permit an actual implementation;
> everybody's Penn POS annotations are different. The actual treebank format
> is fine, it's just the POS formats which are intractable.
> Live well,