Re: an on-line Japanese Text Character Frequency Analyzer tool
[I've cross-posted this to sci.lang.japan, as the original posting went
there too, and has received some discussion.]
James Eckman <fugu@prodigy.net> dixit:
>alextret@yahoo.com wrote:
>> We created an on-line Japanese Text Character Frequency Analyzer tool.
>> It uses a web service provided by Google.
>>
>> Take a look and let us know if it is useful in your particular
>> situation (we can read feedback in English, Japanese and Chinese).
>While not directly useful to me, it is interesting. Do you have any
>charts showing the most frequent kanji on the web? It would be
>interesting to see how much the official education order varies from the
>web use order.
As you possibly know, there is a frequency-of-use ranking in the
KANJIDIC file. This is (now) based on a word ranking derived from
a newspaper article corpus. A comparison based on WWW pages would
be very interesting.
It would be a few minutes' work to knock up a Google API script that
pulled out the page counts for each of the JIS X 0208 kanji, and at
1000 queries per day I could have the counts and ranking in a week. I
haven't done this for a couple of reasons:
(a) the counts returned by the Google API are not the same as those
returned by popping a kanji into http://www.google.com.au/advanced_search
and setting "Return pages written in" to Japanese. Worse, the ratio
between the two counts varies from kanji to kanji.
(b) a simple page count may be misleading. It is possible to postulate
that kanji X occurs in a lot of pages, but only once per page, whereas
kanji A occurs in fewer pages, but many times in each. You could
make the simplistic assumption that page-count = frequency-of-use, but to
do it properly you would really have to pull in a good selection of pages
(randomly-selected; not just high-ranking) and examine them in their
entirety to work out if the per-page dispersion of A and X differs.
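
To make point (b) concrete, here is a minimal sketch of the two measures,
assuming the sampled pages have already been fetched and decoded to plain
text. The function names and the tiny sample are made up for illustration,
and the kanji test only covers the basic CJK Unified Ideographs block, which
should be enough for the JIS X 0208 kanji but is an assumption all the same.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: treat only U+4E00..U+9FFF (basic CJK Unified Ideographs)
    # as kanji. Kana, punctuation and Latin text are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_stats(pages):
    """Return (page_count, occurrences) for every kanji in the sample.

    page_count[k]  = number of pages that contain k at least once
    occurrences[k] = total number of times k appears across all pages
    """
    page_count = Counter()
    occurrences = Counter()
    for text in pages:
        per_page = Counter(ch for ch in text if is_kanji(ch))
        occurrences.update(per_page)         # add every occurrence
        page_count.update(per_page.keys())   # add 1 per page containing it
    return page_count, occurrences


# Made-up sample: one kanji is spread thinly over three pages, the other
# is concentrated in fewer pages.
pages = ["日本語", "今日は晴れ", "日曜", "本本本本本"]
page_count, occurrences = kanji_stats(pages)

for k in "日本":
    per_page = occurrences[k] / page_count[k]
    print(k, page_count[k], occurrences[k], round(per_page, 2))
```

In this toy sample the two kanji swap places depending on which measure is
used for the ranking, which is exactly the dispersion effect that a raw page
count cannot show.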
A nice little project. If only I could find my roundtoit.
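
For anyone who does find a roundtoit, the "about a week" estimate can be
checked with a rough sketch like the one below. It assumes Python's built-in
euc_jp codec is an acceptable stand-in for the JIS X 0208 table (kanji in
rows 16 to 84), and it makes no search queries at all; QUERIES_PER_DAY is
just the 1000-per-day figure mentioned above.

```python
import math

QUERIES_PER_DAY = 1000  # the daily figure quoted in the post


def jisx0208_kanji():
    """Enumerate the JIS X 0208 kanji via their EUC-JP byte pairs.

    Assumption: kanji occupy rows 16..84, cells 1..94, and Python's
    euc_jp codec rejects the unassigned positions.
    """
    kanji = []
    for row in range(16, 85):
        for cell in range(1, 95):
            raw = bytes([0xA0 + row, 0xA0 + cell])
            try:
                kanji.append(raw.decode("euc_jp"))
            except UnicodeDecodeError:
                continue  # unassigned code point, skip it
    return kanji


kanji = jisx0208_kanji()
days = math.ceil(len(kanji) / QUERIES_PER_DAY)
print(len(kanji), "kanji,", days, "days at", QUERIES_PER_DAY, "queries/day")
```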
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia
ジム・ブリーン@モナシュ大学