November 22, 2006

Hasn't it all been said before?

No. Everything is actually amazingly new:
[D]on't people accidentally repeat each other's sentences all the time? It seems to me that this should not be unusual. Yet try plugging that last sentence word by word into Google Book Search, and watch what happens.
It: Rejected—too many hits to count
It seems: 11,160,000 matches
It seems to: 3,050,000
It seems to me: 1,580,000
It seems to me that: 844,000
It seems to me that this: 29,700
It seems to me that this should: 237
It seems to me that this should not: 20
It seems to me that this should not be: 9
It seems to me that this should not be unusual: 0
It seems to me that this should not be unusual is itself ... unusual.
(From an article on catching plagiarists.)
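
The word-by-word experiment quoted above can be imitated on a small scale. What follows is only a rough sketch: the four-sentence corpus is invented for illustration, and the helper names (tokenize, contains_phrase, prefix_hit_counts) are hypothetical. It has nothing to do with how Google Book Search actually counts matches.

```python
import re


def tokenize(text):
    # Lowercase the text and keep only runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def contains_phrase(doc_tokens, phrase_tokens):
    # True if phrase_tokens appears as a contiguous run inside doc_tokens.
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def prefix_hit_counts(sentence, corpus):
    # For each growing prefix of the sentence, count the documents containing it.
    words = tokenize(sentence)
    docs = [tokenize(doc) for doc in corpus]
    results = []
    for n in range(1, len(words) + 1):
        prefix = words[:n]
        hits = sum(contains_phrase(doc, prefix) for doc in docs)
        results.append((" ".join(prefix), hits))
    return results


# Invented toy corpus, standing in for a real book index.
toy_corpus = [
    "It seems to me that this is wrong.",
    "It seems to me that this should work.",
    "It seems unlikely.",
    "It rained.",
]

results = prefix_hit_counts(
    "It seems to me that this should not be unusual", toy_corpus
)
for prefix, hits in results:
    print(f"{prefix}: {hits}")
```

Even on this tiny corpus the hit count can only shrink or hold steady as the prefix grows, and it reaches zero well before the sentence ends.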


Jim Hu said...

This is a useful illustration to keep in mind the next time a student claims coincidence in the similarity between what they turn in and a published text.

Dave said...

The argument that a similar string of words necessarily proves plagiarism is a statistically naive argument.

There are relatively few commonly used words in English; it is highly unlikely that one can come up with a phrase or sentence that has not been used before, even, possibly, for the same subject matter. There are, for example, only so many ways that one can discuss Hamlet's ambivalence. The more common the subject matter, the more likely it is that what is written by a student (or professor) has been written in much the same way previously.

What is a more telling indication of plagiarism is if whole paragraphs are substantially similar to another text.

Revenant said...

"it is highly unlikely that one can come up with a phrase or sentence that has not been used before, even, possibly, for the same subject matter"

Actually it is highly *likely* that any given sentence you speak has never been used before, unless the sentence is short and about a common subject. It just seems like the same sentences get reused a lot because our brains are amazingly efficient at distilling sentences down to their core meanings, which *do* get reused regularly.

"There are, for example, only so many ways that one can discuss Hamlet's ambivalence"

That is technically true, but only in the sense that there are only so many stars in the known universe. :)

AJ Lynch said...

I googled "Ann has too much time on her hands" and drum roll please.... there was no match.

I think I do too. Happy Thanksgiving everyone!

Zeb Quinn said...

The most artful plagiarists steal the idea but alter the words.

Ann Althouse said...

AJ: I just Googled "Actually, I'm terribly busy"... and got 1 match.

Pogo said...

Being a college student must now be immeasurably more difficult.

Freeman Hunt said...

Your search - "There are relatively few commonly used words in English" - did not match any documents.

Your search - "it is highly unlikely that one can come up with a phrase or sentence" - did not match any documents.

Your search - "that has not been used before, even, possibly, for the same subject matter" - did not match any documents.

Bissage said...

And yet "My Hovercraft is Full of Eels" gets 73,100 results.

Go figure.

Ann Althouse said...

"Has anyone ever said this before" gets only one result, from a blogger quoting something really weird -- "If you're going to pimp out your aquatic life...getting jiggy with Lexis-Nexis" -- and asking whether anyone's ever said that before. That's so weird!

"That's so weird" has, however, been written 57,300 times. So language isn't that weird. (But "language isn't that weird" gets a 0!)

Kirk Parker said...

What we need now is for some killjoy to come along and point out that the entire English corpus isn't on the internet yet.

Anonymous said...

I was mildly amused and somewhat pleased to learn that I'm a Googlewhack.

kettle said...

If you're really interested in this stuff, you might take a look at:
The Mathematical Theory of Communication

It's by Claude Shannon; and it has an interesting statistical description of English.

You might also look up 'Latent Semantic Analysis' or 'Probabilistic Latent Semantic Analysis'. These techniques are often employed to determine how 'close' a new document (a student paper) is to other supposedly similar documents (a corpus of term papers or published journal articles). It's pretty awesome how good these comparatively simple methods are at determining style and content relationships - based on nothing more than word co-occurrence matrices and some nifty linear algebra and statistics theory. In this case the phrase doesn't actually have to be identical - it just has to be similar!
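
To give a feel for "similar, not identical", here is a deliberately simpler stand-in, not Latent Semantic Analysis itself: plain cosine similarity between raw word-count vectors, with none of the matrix factorization that LSA adds on top. The three example sentences and the function names are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    # Bag-of-words vector: how often each lowercase word occurs.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    # Cosine of the angle between two word-count vectors (0.0 if either is empty).
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented examples: a sentence, a reordering of it, and an unrelated sentence.
original = "Hamlet hesitates because he doubts the ghost"
reordered = "Because he doubts the ghost, Hamlet hesitates"
unrelated = "Secured lending documents require consistent wording"

sim_reordered = cosine_similarity(word_counts(original), word_counts(reordered))
sim_unrelated = cosine_similarity(word_counts(original), word_counts(unrelated))

print(f"reordered: {sim_reordered:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

An exact-phrase search would miss the reordered sentence entirely, while a word-count comparison scores it as essentially identical. The flip side is that this crude version ignores word order and synonyms, which is part of what the fancier methods try to address.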

Anonymous said...

Maybe it has not been said exactly the way you have said it before, but it may well have been said in very similar terms.

Take a look at the difference between what Harvard plagiarist Kaavya Viswanathan wrote in her book and what Megan McCafferty (the woman whom she plagiarized) wrote in her book. (See, e.g., here.)

No question about the plagiarism, but I doubt you could catch it by Googling; you would have to have read the McCafferty book and recognized the sentence structure in Viswanathan's piece.

Internet Ronin said...

Yes. (625,000,000+)

Sean said...

I often experience this same phenomenon in reverse, in drafting legal documents. It is a principle of drafting that if you are saying the same thing twice, you should use the same words, yet I am often embarrassed to discover when I review my own work that I have used slightly different phrases within a page or two of each other. So even in a situation where (i) originality in expression is not sought, (ii) the subject matter (secured lending) is rather limited and (iii) I am trying to, in effect, plagiarize myself, it is actually quite difficult to repeat the same phrase word for word. How extremely unlikely it must be for different people at different times to come up with the exact same phrase.

P. Froward said...

The "commonly used words in English" reasoning seems "intuitively" "obvious", but it's wildly wrong: Assume 1000 commonly used words (there are more). 1000 distinct one-word sentences can be constructed from that vocabulary. For each of those 1000 sentences, create 1000 variations by adding each of the 1000 words. That's 1,000,000 two-word sentences. Three words gets you 1,000,000,000, or 1000 to the third power. The lesson is that exponents get very big, very fast.

Not all sequences of valid English words are valid English sentences, but what you lose for that reason is peanuts, relatively speaking.

Exponents aren't "intuitive". Your brain mistakenly expects that adding another word will yield a linear increase in permutations, so you think a four-word sentence using a 1,000-word vocabulary isn't much, but in fact you've got something roughly in the neighborhood of a trillion permutations there.
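
The arithmetic is easy to check. A minimal sketch, assuming the 1,000-word vocabulary from the comment and counting every word sequence, grammatical or not (the function name is just for illustration):

```python
VOCABULARY_SIZE = 1000  # the assumed vocabulary size from the comment above


def sequence_count(vocabulary_size, length):
    # Each of the `length` positions can hold any word, so the choices multiply.
    return vocabulary_size ** length


for length in range(1, 5):
    total = sequence_count(VOCABULARY_SIZE, length)
    print(f"{length}-word sequences: {total:,}")
```

Four words already gives 1,000,000,000,000 sequences, which is the trillion mentioned above.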

There's an old story about a wise man who asks a king to pay him for something by putting a single coin on the first square of a chessboard, then doubling the sum on each succeeding square. For the student: How many umpty-squillion coins does that come to? Lots!
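
For the student who wants the actual number, a short sketch, assuming a standard 64-square board with one coin on the first square and a doubling on each square after that:

```python
SQUARES = 64  # a standard chessboard

# Square i (counting from 0) holds 2**i coins: 1, 2, 4, 8, ...
coins_per_square = [2 ** i for i in range(SQUARES)]
total_coins = sum(coins_per_square)

print(f"coins on the last square: {coins_per_square[-1]:,}")
print(f"total coins: {total_coins:,}")
```

The total is 2**64 - 1, or 18,446,744,073,709,551,615 coins.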