Japanese Sentence Annotations

From: JimBreen <jimbreen@gmail.com>
Newsgroups: sci.lang.japan,fj.life.in-japan
Subject: Japanese Sentence Annotations
Date: Tue, 29 Nov 2011 19:55:01 -0800 (PST)
Organization: http://groups.google.com
Message-ID: <51e252df-5a6a-4875-94fc-94b3a2a37d08@i6g2000vbe.googlegroups.com>

I'm a bit late reporting back on the outcome of the big
 annotation exercise. It is (almost) all over. All the sentences
 have been annotated by two people, and the agreed lexemes
 extracted/flagged. About 220 sentences are down for some
 adjudication, as an annotator proposed one or more lexemes
 which I felt should go to a third party.

 I began with 2,000 sentences: 1,000 chosen because they
 contained 100 known lexemes (10 of each) selected simply
 using grep, and 1,000 chosen "wild". The annotation process
 identified 1,537 instances of 755 unique lexemes. Most of the
 starting 100 were there, but some went missing or were diminished
 because they were embedded in longer terms (e.g. 音響機 was always
 part of 音響機器). The distribution is very skewed, with the
 original 100 having 4-10 occurrences and a long tail of 624
 lexemes that occur only once.
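
As a rough illustration of the selection and tallying described above (a sketch only, not the scripts actually used), the following assumes grep-style selection means plain substring matching and that the annotation output is a flat list of lexeme instances. The function names and the toy sentences are hypothetical. It also shows why a short lexeme such as 音響機 can be absorbed: every sentence it matches actually contains the longer 音響機器.

```python
from collections import Counter


def pick_sentences(sentences, lexemes, per_lexeme=10):
    """Grep-style selection (assumed): for each lexeme, keep the first
    `per_lexeme` sentences that contain it as a plain substring."""
    chosen = {}
    for lexeme in lexemes:
        hits = [s for s in sentences if lexeme in s]
        chosen[lexeme] = hits[:per_lexeme]
    return chosen


def summarize(instances):
    """Tally annotated lexeme instances: total, unique, and singletons."""
    counts = Counter(instances)
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "instances": sum(counts.values()),
        "unique": len(counts),
        "singletons": singletons,
    }


# Toy data, invented for illustration only.
sentences = ["音響機器を買った。", "新しい音響機器が届いた。", "猫が好きだ。"]

# Substring matching finds 音響機, but only inside the longer term 音響機器.
picked = pick_sentences(sentences, ["音響機"])

# Hypothetical annotator output: the longer term is what gets recorded.
annotated = ["音響機器", "音響機器", "猫"]
summary = summarize(annotated)
```

Run on the real annotation output, a tally like this would be expected to reproduce the figures quoted above (1,537 instances, 755 unique lexemes, 624 singletons).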

 I'm now using the 2,000 to carry out some 10-way
 cross-classifications to test various machine-learning models
 for detecting potential lexemes in text.
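
If "10-way cross-classification" here means something like standard 10-fold cross-validation (an assumption; the post does not spell out the procedure), the split over the 2,000 sentences could be sketched as below. The function name, the fixed seed, and the use of integer sentence IDs are all hypothetical.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation.

    Items are shuffled once with a fixed seed, dealt into k folds,
    and each fold serves as the held-out test set exactly once.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 2,000 sentence IDs stand in for the annotated sentences.
splits = list(k_fold_splits(range(2000), k=10))
```

With 2,000 sentences and k=10, each round trains on 1,800 sentences and tests on the remaining 200, so every sentence is used for testing exactly once.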

 Thanks a lot to everyone who participated. I shouldn't mention
 individuals, but I want especially to thank Rene and muchan, who
 between them looked at a large proportion of the sentences.

 Cheers

Jim
