yeah I recognized the issues with your corpus as I was running the numbers, and I considered deliberately picking only stuff with your most consistent voice... but then I was like "lua, you can't just cherry-pick the pieces you're using, it's not sufficiently random, that is not science" :P
incidentally & interestingly, identifying an author who's deliberately trying to disguise their writing style is a problem that's known to be pretty damn difficult (and here's a semi-related blog entry just because I find it interesting :P )
and yeah, I should've mentioned false positives for accuracy, herp
(another slight consideration/flaw I noticed in the original program's training data that I failed to mention before: they had like 50 different authors, which is an awful lot of bins. say everything is scored from 1 to 100 and you have a separate bin for each number. even a consistent author who tends to write in the 30-35 range is going to get wildly inconsistent results, because each piece lands in a different bin even though the scores are all clustering around the same value.)
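here's a quick sketch of the bin thing, in case it helps. everything in it is made up for illustration: the `scores` list is hypothetical data for one consistent author, and `modal_share` is just a helper I named, not anything from the original program. it measures what fraction of an author's pieces land in their single most common bin, first with one bin per score and then with much wider bins.

```python
from collections import Counter


def modal_share(scores, bin_width):
    """Fraction of samples that fall in the single most common bin."""
    # Scores run from 1 to 100, so subtract 1 before dividing into bins.
    bins = [(s - 1) // bin_width for s in scores]
    most_common_count = Counter(bins).most_common(1)[0][1]
    return most_common_count / len(scores)


# Hypothetical scores for one consistent author, all within 30-35.
scores = [30, 31, 32, 33, 34, 35, 31, 33, 32, 34, 30, 35]

fine = modal_share(scores, bin_width=1)     # 100 bins, one per score
coarse = modal_share(scores, bin_width=10)  # 10 bins: 1-10, 11-20, ...

print(f"one bin per score: {fine:.2f} of pieces share a bin")
print(f"bins of width 10:  {coarse:.2f} of pieces share a bin")
```

with one bin per score the author's pieces get scattered across six different bins, so no single bin ever looks like "theirs". the wider bins collect most of the pieces together, though a score of 30 still ends up on the wrong side of a bin edge, which is its own small problem.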
Date: 2013-01-15 07:04 pm (UTC)