I'm a bit late reporting back on the outcome of the big
 annotation exercise. It is (almost) all over. All the sentences
 have been annotated by two people, and the agreed lexemes
 extracted/flagged. About 220 sentences are down for some
 adjudication, as an annotator proposed one or more lexemes
 which I felt should go to a third party.

 I began with 2,000 sentences: 1,000 chosen because they
 contained one of 100 known lexemes (10 sentences per lexeme),
 selected simply using grep, and 1,000 chosen "wild". The annotation process
 identified 1,537 instances of 755 unique lexemes. Most of the
 starting 100 were there, but some went missing or were diminished
 because they were embedded in longer terms (e.g. 音響機 was always
 part of 音響機器). The distribution is very skewed, with the
 original 100 having 4-10 occurrences and a long tail of 624
 lexemes that occur only once.
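
 For anyone wanting to check figures like these on their own
 annotations, here is a minimal Python sketch of the counting. The
 function name and the toy list are hypothetical, not the real data;
 only the three totals at the end come from the figures above.

```python
from collections import Counter

def frequency_summary(instances):
    """Summarise a flat list of annotated lexeme instances."""
    counts = Counter(instances)
    # Lexemes seen exactly once make up the long tail.
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "instances": len(instances),
        "unique": len(counts),
        "singletons": singletons,
    }

# Toy data for illustration only (not taken from the annotated set).
toy = ["音響機器", "音響機器", "辞書", "形態素", "辞書", "語彙"]
toy_summary = frequency_summary(toy)

# Arithmetic on the reported totals.
total_instances, unique_lexemes, single_lexemes = 1537, 755, 624
repeated_lexemes = unique_lexemes - single_lexemes      # lexemes seen 2+ times
repeated_instances = total_instances - single_lexemes   # instances they cover
print(toy_summary, repeated_lexemes, repeated_instances)
```

 On the reported totals this leaves 131 lexemes occurring more than
 once, which between them account for 913 of the 1,537 instances.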

 I'm now using the 2,000 to carry out some 10-way
 cross-classifications to test various machine-learning models
 for detecting potential lexemes in text.
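
 As a rough illustration of the splitting step, assuming "10-way"
 here means ordinary 10-fold cross-validation over the sentences,
 a sketch using only the Python standard library might look like
 this. The function name, the seed and the shuffling strategy are
 my assumptions for illustration, not a description of the actual
 experimental setup.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 2,000 sentences, each round trains on 1,800 and tests on 200.
splits = list(k_fold_indices(2000, k=10))
```

 Each sentence lands in exactly one test fold, so every model is
 scored once on every sentence without having seen it in training.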

 Thanks a lot to everyone who participated. I shouldn't mention
 individuals, but I want especially to thank Rene and muchan who
 between them looked at a large proportion of the sentences.

 Cheers

Jim