Japanese Sentence Annotations
I'm a bit late reporting back on the outcome of the big
annotation exercise. It is (almost) all over. All the sentences
have been annotated by two people, and the agreed lexemes
extracted/flagged. About 220 sentences are down for some
adjudication, as an annotator proposed one or more lexemes
which I felt should go to a third party.
I began with 2,000 sentences: 1,000 chosen because they
contained one of 100 known lexemes (10 sentences for each),
selected simply using grep, and 1,000 chosen "wild". The annotation process
identified 1,537 instances of 755 unique lexemes. Most of the
starting 100 were there, but some went missing or were diminished
because they were embedded in longer terms (e.g. 音響機 was always
part of 音響機器). The distribution is very skewed, with the
original 100 having 4-10 occurrences and a long tail of 624
lexemes that occur only once.
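For anyone wanting to run the same kind of tally on their own annotations, here is a minimal sketch in Python. It is not the script used for the exercise; the function names (summarise, standalone_count) and the toy data are made up for illustration. The first function counts instances, unique lexemes and once-only lexemes; the second shows one way a short lexeme can "go missing" when every occurrence sits inside a longer term.

```python
from collections import Counter

def summarise(annotations):
    """Tally a flat list of annotated lexemes (one entry per instance)."""
    counts = Counter(annotations)
    return {
        "instances": sum(counts.values()),   # total annotated instances
        "unique": len(counts),               # distinct lexemes
        "singletons": sum(1 for c in counts.values() if c == 1),  # the long tail
    }

def standalone_count(text, lexeme, longer_terms):
    """Count occurrences of lexeme that are NOT part of a longer known term."""
    for term in longer_terms:
        if lexeme in term and term != lexeme:
            # Blank out the longer term so the embedded lexeme is not counted.
            text = text.replace(term, "\0")
    return text.count(lexeme)

# Toy data only, not taken from the real annotation set.
sample = ["音響機器", "音響機器", "辞書", "形態素"]
print(summarise(sample))
print(standalone_count("音響機器を買った", "音響機", ["音響機器"]))
```

A plain grep (or substring match) finds 音響機 in the second example, but the standalone count is zero, which is the effect described above.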
I'm now using the 2,000 to carry out some 10-way
cross-classifications to test various machine-learning models
for detecting potential lexemes in text.
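If the 10-way runs are ordinary 10-fold cross-validation (an assumption; the setup isn't spelled out here), the splitting step could look roughly like the sketch below. The function name k_fold_splits and the fixed seed are illustrative choices, and the integers stand in for the real sentences.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs so that each item is in exactly one test fold."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)        # fixed seed so runs are repeatable
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        test = [items[j] for j in folds[i]]
        train = [items[j] for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Placeholder data: 2,000 integers standing in for the 2,000 sentences.
sentences = list(range(2000))
splits = list(k_fold_splits(sentences))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 2,000 sentences each round trains on 1,800 and tests on the remaining 200. Whether the folds should also be balanced between the grep-selected and "wild" halves is a separate design choice not covered here.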
Thanks a lot to everyone who participated. I shouldn't mention
individuals, but I want especially to thank Rene and muchan who
between them looked at a large proportion of the sentences.
Cheers
Jim