I Write Like
Jan. 15th, 2013 06:29 am
According to I Write Like...
- "Wings Dancing in the Darkness" reads like Margaret Atwood
- "Every Little Thing" reads like Chuck Palahniuk
- "Delicately, Madly" reads like Charles Dickens
- "White Like Bone" reads like Anne Rice
- "Pyre" reads like Raymond Chandler
- "Dog in the Vineyard" reads like Dan Brown (...yuck)
- "Crush" reads like Chuck Palahniuk
- annnnd Remnants of Restoration reads like Kurt Vonnegut
...but then I discovered the source code for IWL is available online (eee) so I decided to poke at its innards for a bit and see what's what
- in a sort of cute move they decided to write this in some hipster language i've barely heard of
- ...okay what kind of programming language does not have a simple "make install" command and instead gives me some bullshit GUI and forces me to manually set my path geez
- ...oh fuck i overwrote my path, welcome to n00b mistake of the night, oh fuck ls and vim are not working did i just break bash
- crisis averted (but that was the most terrifying handful of minutes in my life)
- uh okay evidently hitting "Analyze" on my local instance gets me a page that says "not found" that seems sort of useless
- mm i love the feeling of adding my first expletive to the code (404 errors are much more attractive as "the fuck")
- uh okay there's a bug somewhere in dispatch-rules what's that about
- bluuuh this is hard to fix without an actual debugger but the racket documentation's pretty vague about how i might use such a thing via the command line
- oh interesting, evidently there's a compatibility issue between Racket 5.3.1 (which I was trying to use) and Racket 5.1, which was causing my instance to Not Work (TM). there's a known compatibility issue between 5.0 and 5.1 but nothing online about this issue; I'll file a bug report and maybe look into it in the morning
Once I had a local instance running, I decided to do some experiments for teh lulz (and perhaps tangentially teh science).
I cleaned out the authors included with the IWL download and used some fanfic authors instead: arbitrarily I chose myself, amielleon, and mark_asphodel (hello, unwitting volunteers! :D;;; ). I used the three latest fics by these three authors for training data, then took a few of the other works by each author to see how accurately IWL could guess the true author of a work:
- Lua's stuff: IWL incorrectly thinks that Mark wrote "White Like Bone," "Dog in the Vineyard," and chapter 1 of Remnants of Restoration. It correctly thinks I wrote "Pyre" and "Crush." (Accuracy: 2/5)
- Ammie's stuff: IWL incorrectly thinks I wrote "lucius listens to the rain," "Ghost Stories," and "a visitor at any hour." It thinks Mark wrote "Coin in Palm" and "New World." It correctly thinks Ammie wrote "In Questioning Ghosts." (Accuracy: 1/6)
- Mark's stuff: IWL correctly thinks that Mark wrote "Gold for Salt," "Blackout," "The Losing End," and "In Transition." It thinks I wrote "Without Vocation." (Accuracy: 4/5)
...okay wow, based on that data, IWL seems to suck. Badly. As in, a-random-number-generator-could-do-a-better-job-for-anyone-not-named-Mark [1].
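For the curious, here is a quick sanity check of that claim in Python. It only re-adds the tallies listed above; the 1-in-3 "chance" baseline is my assumption (a guesser picking uniformly among the three authors), and the names `results`, `accuracy`, `overall`, and `not_mark` are made up for this sketch.

```python
# (correct guesses, works tested) per author, copied from the tallies above
results = {"Lua": (2, 5), "Ammie": (1, 6), "Mark": (4, 5)}


def accuracy(pairs):
    """Pooled accuracy over an iterable of (correct, total) pairs."""
    pairs = list(pairs)  # materialise so we can iterate twice
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return correct / total


# Assumption: "random" means picking one of the three authors uniformly.
chance = 1 / len(results)

overall = accuracy(results.values())
not_mark = accuracy(v for name, v in results.items() if name != "Mark")

print(f"chance:   {chance:.3f}")
print(f"overall:  {overall:.3f}")
print(f"not Mark: {not_mark:.3f}")
```

Pooled over all sixteen test fics IWL lands a bit above that baseline, but for the eleven fics not written by Mark it falls below it, which is the point of the grumbling above.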
Time to look at the code and see what the methodology at play is...
- Analysis seems to be based on both "tokens" and "readability"
- The readability metric is just the Flesch Reading Ease score, which has been discussed here before as being a somewhat problematic and inconsistent metric
- The token side is less clear to me on this quick skim, but what I'm pretty sure is going on is: they're basically making a giant table of "words appearing in the text plus their frequencies," and from that table they calculate a "rating" based on how the relative probability of those words is distributed (e.g. if A and B both use the words "obnoxious" and "teetotaler" a lot, the algorithm will notice that and assume A and B are more similar)
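To make those two ingredients concrete, here is a rough Python sketch. This is not IWL's actual Racket code, and I have not checked how IWL combines the two signals. The Flesch Reading Ease formula is the standard published one, but the syllable counter is a crude vowel-group heuristic of my own. For the token side I have swapped in naive-Bayes-style scoring with add-one smoothing as a guess at what "rating by word probability" might look like. Every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per vowel group, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        raise ValueError("need at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def train(samples):
    """Build the 'giant table': author -> Counter of word frequencies."""
    return {
        author: Counter(tok for text in texts for tok in tokenize(text))
        for author, texts in samples.items()
    }


def score(tokens, counts, vocab_size):
    """Log-probability of the tokens under one author's word table,
    with add-one smoothing so unseen words do not zero everything out."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)


def guess_author(text, model):
    """Pick the author whose word table makes the text most probable."""
    vocab = set().union(*model.values())
    tokens = tokenize(text)
    return max(model, key=lambda a: score(tokens, model[a], len(vocab)))


# Toy training data, invented purely for illustration.
model = train({
    "A": ["obnoxious teetotaler obnoxious teetotaler"],
    "B": ["dragons swords castles dragons"],
})
print(guess_author("such an obnoxious teetotaler", model))
print(flesch_reading_ease("The cat sat."))
```

With only three short fics per author as training data, a table like this is mostly picking up on topic words and character names rather than style, which may be part of why the guesses above came out so badly.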
[1] It is probably worth noting that the fics used for Ammie's training set might've skewed her results; "Benefits" and "In the City" are perhaps not the most representative samples from her corpus. Whups.
no subject
Date: 2013-01-15 06:05 pm (UTC)
But I was always curious as to how it worked. The "male/female" test that goes around every once in a while is at least upfront about it. (My writing usually comes up "masculine" -- and the "feminine" words are typically relationship-focused rather than environmental. Did not like that.)
But yeah. It's cool that you were able to rig that up like that! And the results are interesting, if not especially meaningful.
no subject
Date: 2013-01-15 07:08 pm (UTC)
also, that male/female test one made me super-happy because when I got curious about how it worked, not only was there a pretty clear methodology, but the dude posted his master's thesis which was related to the topic and then I spent the afternoon trapped in academic CS papers /dork