Date: 2013-01-15 07:04 pm (UTC)
From: [personal profile] queenlua
yeah I recognized the issues with your corpus as I was running the numbers, and I considered deliberately picking only stuff with your most consistent voice... but then I was like "lua you can't just cherry-pick the pieces you're using it's not sufficiently random that is not science" :P

incidentally & interestingly, identifying an author who's deliberately trying to disguise their writing style is a problem that's known to be pretty damn difficult (and here's a semi-related blog entry just because I find it interesting :P )

and yeah, I should've mentioned false positives for accuracy, herp

(another slight consideration/flaw I noticed in the original program's training data that I failed to mention before: they had like 50 different authors, which is an awful lot of bins. say everything is scored from 1 to 100 and you have a separate bin for each number, and suppose even a consistent author tends to write in the 30-35 range. their pieces are going to land in a bunch of different bins, so they'll get wildly inconsistent results even though their scores are clustering around the same value.)
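
a quick toy sketch of the binning thing, in case it helps. the scores here are made up for illustration (they're not from the original program or its training data), and `bin_index` is just a hypothetical helper, not anything the program actually does:

```python
from collections import Counter

def bin_index(score, bin_width):
    """Map a score to a bin by integer division (bin edges are arbitrary)."""
    return score // bin_width

# made-up scores for one consistent author, all within the 30-35 range
scores = [30, 31, 32, 33, 34, 35, 31, 33, 32, 34]

# one bin per score value vs. bins that each cover ten score values
fine = Counter(bin_index(s, 1) for s in scores)
coarse = Counter(bin_index(s, 10) for s in scores)

print("distinct bins, width 1: ", len(fine))
print("distinct bins, width 10:", len(coarse))
```

same author, same scores: with one bin per number they get spread over several bins, and with wider bins they all land in the same one.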