I Write Like
Jan. 15th, 2013 06:29 am

According to I Write Like...
- "Wings Dancing in the Darkness" reads like Margaret Atwood
- "Every Little Thing" reads like Chuck Palahniuk
- "Delicately, Madly" reads like Charles Dickens
- "White Like Bone" reads like Anne Rice
- "Pyre" reads like Raymond Chandler
- "Dog in the Vineyard" reads like Dan Brown (...yuck)
- "Crush" reads like Chuck Palahniuk
- annnnd "Remnants of Restoration" reads like Kurt Vonnegut
...but then I discovered the source code for IWL is available online (eee), so I decided to poke at its innards for a bit and see what's what.
( Lua sets up a local instance and installs shit: the liveblog! (terribly boring do not read) )
Once I had a local instance running, I decided to do some experiments for teh lulz (and perhaps tangentially teh science).
I cleaned out the authors included with the IWL download and used some fanfic authors instead: arbitrarily I chose myself, amielleon, and mark_asphodel (hello, unwitting volunteers! :D;;; ). I used the three latest fics by these three authors for training data, then took a few of the other works by each author to see how accurately IWL could guess the true author of a work:
( Data! )
...okay wow, based on that data, IWL seems to suck. Badly. As in, a-random-number-generator-could-do-a-better-job-for-anyone-not-named-Mark¹.
Time to look at the code and see what the methodology at play is...
- Analysis seems to be based on both "tokens" and "readability"
- The readability metric is just the Flesch Reading Ease score, which has been discussed here before as being a somewhat problematic and inconsistent metric
- The token analysis is less clear to me on this quick skim, but what I'm fairly sure is going on is: they build a giant table of words appearing in the text plus their frequencies, and from that they calculate a "rating" based on how the relative probability of those words is distributed (i.e., if authors A and B both use the words "obnoxious" and "teetotaler" a lot, the algorithm will notice that and assume A and B are more similar)
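To make those two metrics concrete, here's a minimal sketch of what they could look like. This is my own toy version, not IWL's actual code: the tokenizer, the vowel-group syllable heuristic, and the add-one smoothing are all my assumptions.

```python
import math
import re
from collections import Counter

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    # Real syllable counting needs a pronunciation dictionary.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences)
    #                               - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)

def train(texts_by_author):
    # One big word-frequency table per author, as described above.
    return {
        author: Counter(re.findall(r"[a-z']+", " ".join(texts).lower()))
        for author, texts in texts_by_author.items()
    }

def guess_author(models, text):
    # Score each author by summing log word probabilities
    # (add-one smoothing so unseen words don't zero things out),
    # then pick the highest-scoring author.
    tokens = re.findall(r"[a-z']+", text.lower())
    vocab = set().union(*models.values())
    best, best_score = None, float("-inf")
    for author, freqs in models.items():
        total = sum(freqs.values())
        score = sum(
            math.log((freqs[t] + 1) / (total + len(vocab))) for t in tokens
        )
        if score > best_score:
            best, best_score = author, score
    return best
```

So an author who leans hard on "obnoxious" and "teetotaler" will pull any "obnoxious"-heavy test text toward themselves, which matches the behavior described above.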
( Footnote )