February 9, 2023

"ChatGPT Is a Blurry JPEG of the Web/OpenAI’s chatbot offers paraphrases, whereas Google offers quotes. Which do we prefer?"

That's the most effective headline I've read in a long time.

The article, in The New Yorker, is by Ted Chiang. Excerpt: 

The fact that Xerox photocopiers use a lossy compression format instead of a lossless one isn’t, in itself, a problem. The problem is that the photocopiers were degrading the image in a subtle way, in which the compression artifacts weren’t immediately recognizable. If the photocopier simply produced blurry printouts, everyone would know that they weren’t accurate reproductions of the originals. What led to problems was the fact that the photocopier was producing numbers that were readable but incorrect; it made the copies seem accurate when they weren’t....

I think that this incident with the Xerox photocopier is worth bearing in mind today, as we consider OpenAI’s ChatGPT and other similar programs, which A.I. researchers call large-language models. The resemblance between a photocopier and a large-language model might not be immediately apparent—but consider the following scenario. Imagine that you’re about to lose your access to the Internet forever. In preparation, you plan to create a compressed copy of all the text on the Web, so that you can store it on a private server. Unfortunately, your private server has only one per cent of the space needed; you can’t use a lossless compression algorithm if you want everything to fit. Instead, you write a lossy algorithm that identifies statistical regularities in the text and stores them in a specialized file format. Because you have virtually unlimited computational power to throw at this task, your algorithm can identify extraordinarily nuanced statistical regularities, and this allows you to achieve the desired compression ratio of a hundred to one.....

9 comments:

Gerda Sprinchorn said...

Imagine that you’re about to lose your access to the Internet forever. In preparation, you plan to create a compressed copy of all the text on the Web, so that you can store it on a private server. Unfortunately, your private server has only one per cent of the space needed; you can’t use a lossless compression algorithm if you want everything to fit.

That's easy.

Store the collected works of William Shakespeare in a lossless format ... 100 times over.

n.n said...

A basket of knowledge. A cache of correlations. Several degrees of freedom in an adaptive neural network or simpler algorithm. A viable simulation of a human life from its conception.

tim in vermont said...

This is true. ChatGPT confuses stuff. If you ask it for a list of the characters in a novel, it might give you a list of characters in another novel by the same writer.

I got into an argument with it, and it kept repeating "other scholars maintain that..." and it stuck to its position. When I asked it to name the scholars it was citing, it answered that naming the scholars was "outside the scope of this conversation."

If Google gives actual quotes and cites sources, well, I for one welcome our new AI overlords.

mtp said...

Every think piece makes the implicit assumption that this technology, which I cannot remember at all from 3 years ago, is in more or less its final form.

"The Wright Brothers' so-called 'airplane' can fly about 150 ft, whereas a horse-drawn carriage can take you to Oregon. Which do we prefer?"

Our public intellectuals are buffoons.

tim in vermont said...

Chat's vision is a little foggy. It says that you are "widely respected for your insightful analysis," but also says that you are "conservative." IDK, maybe you are like Kevin Kline's character in that movie where he plays an instructor at a boarding school who doesn't know he's gay.

Fred Drinkwater said...

Worthwhile article. But I was amused by the author's comparison of the utility of ChatGPT for elementary arithmetic vs writing college economics essays. The basic problem is not the paraphrasing, but the fact that it's easy to tell if the arithmetic is wrong, and difficult to impossible to tell if the essay is wrong.

I also object to his apparent acceptance of paraphrase in general. There are many fields where exact quotes, or exact matching of text from one area to another, are not just nice, but absolutely essential. The engineering world, for example. Law, probably - paraphrase can easily alter or miss meaning.

Wikipedia editing is a well known current problem. Do we really want to trust what is essentially an automated version of that?

Last: the anecdote about the copier changing numbers on the copies struck me as implausible, until I read the JBIG2 spec. What a terrible idea! People can get killed by that crap.
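A toy sketch of how pattern-matching compression can swap one digit for another. This is not the JBIG2 algorithm itself, only an illustration of the general idea of storing one bitmap per "similar enough" symbol; the 3x5 glyphs, the `max_diff` threshold, and the function names are all assumptions made up for the example.

```python
# Hypothetical 3x5 glyph bitmaps; '#' is ink, '.' is blank paper.
SIX = "".join(["###", "#..", "###", "#.#", "###"])
EIGHT = "".join(["###", "#.#", "###", "#.#", "###"])


def hamming(a, b):
    """Count the pixels at which two equally sized glyphs differ."""
    return sum(x != y for x, y in zip(a, b))


def compress(glyphs, max_diff):
    """Store each glyph once; reuse a stored glyph if it is close enough."""
    symbols = []  # dictionary of stored bitmaps
    indices = []  # one dictionary index per input glyph
    for glyph in glyphs:
        for i, stored in enumerate(symbols):
            if hamming(glyph, stored) <= max_diff:
                indices.append(i)  # treated as "the same" symbol
                break
        else:
            symbols.append(glyph)
            indices.append(len(symbols) - 1)
    return symbols, indices


def decompress(symbols, indices):
    """Rebuild the page from the symbol dictionary."""
    return [symbols[i] for i in indices]


# With a tolerance of one pixel, the 8 is silently replaced by the stored 6.
lossy_page = decompress(*compress([SIX, EIGHT], max_diff=1))
# With zero tolerance, both glyphs survive.
exact_page = decompress(*compress([SIX, EIGHT], max_diff=0))
```

The lossy copy is perfectly crisp, which is the point of the anecdote: nothing about the output hints that a digit changed.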

tim in vermont said...

I wonder if it is like the "multidimensional cube" I used to work with. You would hand it a very large dataset, and define a class of questions that users might ask, and it would pre-calculate all of the possible answers to those pre-limited questions in a disk footprint far smaller than, and faster to query than, the dataset itself. Of course the rest of the data relationships were lost in the process, but in a practical sense, the almost instantaneous ability to query on questions you had deemed important far outweighed that loss. You could, if you wanted, decorate the answers with pointers to the source data. Maybe that's what Google is doing and what ChatGPT does not?
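A minimal sketch of that kind of cube, assuming a made-up sales table with two dimensions (region and year) and one pre-defined class of question (total sales). The rows, field names, and the `"*"` wildcard are hypothetical; real cube products differ in many details.

```python
from collections import defaultdict

# Hypothetical source rows: (row_id, region, year, sales)
ROWS = [
    (0, "east", 2021, 100),
    (1, "east", 2022, 150),
    (2, "west", 2021, 80),
    (3, "west", 2022, 120),
    (4, "east", 2022, 50),
]


def build_cube(rows):
    """Pre-calculate total sales for every (region, year) combination.

    Each cell also keeps the ids of the rows that fed it, the
    "decoration" that points back at the source data.
    """
    cube = defaultdict(lambda: {"total": 0, "sources": []})
    for row_id, region, year, sales in rows:
        # "*" stands for "all values" along that dimension.
        for key in ((region, year), (region, "*"), ("*", year), ("*", "*")):
            cell = cube[key]
            cell["total"] += sales
            cell["sources"].append(row_id)
    return dict(cube)


def query(cube, region="*", year="*"):
    """Answer one of the pre-defined questions with a single lookup."""
    return cube[(region, year)]


CUBE = build_cube(ROWS)
east_2022 = query(CUBE, "east", 2022)  # total and source rows for one cell
everything = query(CUBE)               # grand total across both dimensions
```

Any question outside the pre-computed set (say, the median sale) cannot be answered from the cube alone, which is the loss described above. Dropping the `sources` lists gives the smaller, undecorated version.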

Suppose you had a language processor that could store symbolic representations of the meaning of everything that went into it, and then stored those "meanings" in a "multidimensional cube," and you were not constrained by computing resources, so you didn't have to limit the questions. Your dataset would still be smaller than the whole internet, though maybe not with "decorations" like my hypothetical design that Google is using... But you would still have compression: even if you chose lossless, you would gain both space savings and faster access.

Sorry, just thinking out loud, welcome to my nightmare.



Penguins loose said...

If you are not reading Ted Chiang’s fiction, you are missing some of the best writing available. His story "Story of Your Life" is one of the best science fiction stories ever written. Do yourself a favor and read it.

Norpois said...

The issue is not how the RESULT is PACKAGED -- it's whether the result you get is curated/censored according to whose algorithm for "accuracy" and "safety". The quotes can be manipulated/censored the same way the Chatbot response is. No difference either way.