
Evaluating GPT-4 through literally one summarization, but deeply and systematically

May 30, 2023 at 03:22 PM

It bothers me that there are no great evaluations for large language models on the high-level tasks we actually find them most helpful for — chief amongst them summarization.

Perplexity is in essence just the loss function, and measurement on downstream objective tasks like question answering is helpful but too narrow in domain. There are also other ways to do fact-based question answering.

Sam Altman said in an interview that his killer use case for GPT is summarization. I find it doubly helpful on YouTube transcripts, where the audio quality can sometimes be poor.

The focus here is a very qualitative but very deep evaluation and working with one very specific example — a YouTube video from CS25 at Stanford about AI alignment by Jan Leike. I was very interested in the topic, but the audio quality was poor.

I chose to have the model produce hierarchical summaries and read through them in a systematic way — from the highest level, all the way to actually watching the video so I could verify what came before. I compared what new learnings I had from each level of summary. I generally trusted the model’s output at each level, and the next few levels would usually reveal any false beliefs I had developed.

This is a long post, probably only worth your time if you’re deeply interested in using LLMs for summarization. It’s really just my notes as I went along. But it’s interesting that through this one simple example, we’ll actually touch nearly every core concept in LLM applications today — prompt engineering, reward models, context windows, temperature sampling, and hallucination.

The guiding question is: how much of the original understanding can I develop through the summaries, and how much time does it save?


TL;DR (this won’t be new, the details are where the fun is)

Overall, these models do open up an ability to read nonfiction hierarchically, which a good table of contents already enabled. Mostly, they add a lot of value by saving you the time of reading uninteresting content.

From this point on the post is structured as a notebook/log.

Sample of the video transcript

This is what we have to work with

sample_transcript = yt.get_transcript(sample_video_id)
sample_transcript_raw_text = '\n'.join([o['text'] for o in sample_transcript])
print(sample_transcript_raw_text)
it's my pleasure to welcome Jan from
openai I'm he leads the alignment team
there and he was previously a researcher
at deepmind as well what's a PhD in
reinforcement learning theory has been
thinking about the alignment problem for
over 10 years and today he'll be giving
a very interesting topic so hope you
guys enjoy
yeah thanks a lot for the intro and
thanks a lot for having me
um
I'm very excited to talk about this
stuff
um I'm also super happy to keep it
interactive if you have questions and
then you find please interrupt me and
um yeah I wanted to start out with
a few very basic observations on kind of
what I think is going on
and
so the first one is
team EI is joining the game
so IMEI has a lot of different playe
...

gist link
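For reference, the yt.get_transcript helper isn’t shown in the post; presumably it’s a thin wrapper around the youtube_transcript_api package that appears later on, roughly along these lines (a sketch, not the exact helper):

from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id: str) -> list:
    # returns a list of {'text', 'start', 'duration'} dicts for the video's transcript
    return YouTubeTranscriptApi.get_transcript(video_id)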

Split it to fit the token window

GPT-4 has an 8k token window right now (32k coming soon, Anthropic has a 100k model). I’m setting the TOKEN_LIMIT_FOR_SUMMARY to 4k in order to allow the summary to be as long as the content if it wants to be.
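The numtokens helper and the split itself aren’t shown; here is a minimal sketch of what they might look like, assuming tiktoken’s cl100k_base encoding and greedy line-by-line packing (the exact split_transcript used later in the post may differ):

import tiktoken

TOKEN_LIMIT_FOR_SUMMARY = 4000  # i.e. oai_yt.TOKEN_LIMIT_FOR_SUMMARY later in the post
enc = tiktoken.get_encoding("cl100k_base")  # assumed GPT-4 tokenizer

def numtokens(t: str) -> int:
    return len(enc.encode(t))

def split_transcript(text: str, token_limit: int, numtokens) -> list:
    # greedily pack transcript lines into chunks of at most token_limit tokens
    chunks, current = [], ""
    for line in text.split('\n'):
        candidate = current + '\n' + line if current else line
        if numtokens(candidate) > token_limit and current:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

transcript_splitted = split_transcript(sample_transcript_raw_text, TOKEN_LIMIT_FOR_SUMMARY, numtokens)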

len(transcript_splitted), [numtokens(t) for t in transcript_splitted]
(4, [3997, 4000, 3973, 173])

What I like about this split is that the last two chunk sizes are very different, and there’s some knowledge that’s split across the two. We’ll come back to this at the end.

The initial summarization prompt

def summarize(t: str) -> str:
	return ask("as an expert assistant, summarize the following text from a youtube audio transcript into bullet points: " + t)

Summarizing just one section

Let’s focus on just the first “section” (not an author-determined chapter) — transcript_splitted[0]. Then, I’ll progressively halve that text and have the model summarize each first half, highlighting new insights in bold as I read. I’ll mark doubts in italic.

Summary of all 4k tokens of section 1

Original text: 3997 tokens Summary: 245 tokens
Summary is 6.13% of original text
  • Welcome Jan from OpenAI leading the alignment team, previously a researcher at DeepMind.
  • Jan discusses the alignment problem for over 10 years.
  • Observations: AI is joining the game, with stronger players over time.
  • In the long run, AI will have advantages over humans.
  • Team Humans should recruit players from AI and write rules to ensure team human doesn’t lose.
  • Jan focuses on the first objective (alignment) in his talk - building AI systems that follow human intent.
  • Two categories of intent: overt (instructions) and covert (unspoken, general preferences).
  • The main technique used involves a reward model optimized by AI.
  • InstructGPT was found to be preferred by humans over the larger GPT-3 model.
  • Challenges include catastrophic forgetting and distributional shift.
  • Future steps include making AI more robust, enabling it to ask clarifying questions, and tackling scaling.

Doubts I have coming out of this

Overall, this was a good summary — several insights and only 3 clarifications coming out of this.

Clarification vs. Summary: Asking the model to not remove information

As an alternate prompt, I get the model to “clarify” the transcript, asking it to not remove ANY information.

def clarify(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. " + t)
Original text: 3997 tokens Clarification: 305 tokens
Clarification is 7.63% of original text

Summary (Clarification version)

  • Jan from OpenAI leads the alignment team
  • Working on alignment problem for 10+ years
  • Observations:
    • AI is joining the game more and more with stronger players
    • Expect AI to have incredibly strong players in the future
    • Team Human has an advantage by picking which AI players join and when
  • Main objectives of Team Human:
    • Recruit players from AI to play on the human team (alignment)
    • Write the rules of the game so that Team Human doesn’t lose
  • Main talk focus: first objective (alignment)
  • Two categories of intent for alignment:
    • Task intent (following human instructions)
    • Other intents (not just following instructions, but also not making up stuff, doing harmful things, etc.)
    • Basic technique for alignment:
      • 1st step: Train a reward model from comparison
      • 2nd step: Optimize the AI model to the reward model using reinforcement learning
    • Results: InstructGPT preferred over larger GPT-3 model
    • Fine-tuning with RL is cheap compared to pre-training
    • Limitations: Not all problems fixed, but useful progress
    • ChatGPT as an upgraded version of InstructGPT for alignment
      • Dialogue as the universal interface
      • Implement human-in-the-loop evaluations
    • Principle: Evaluation is easier than generation
      • Easier to evaluate model’s output than generate it as a human
      • Improved model performance through human feedback

Observations

Turns out I’ve discovered a better summarization prompt than the original! Let’s use this going forward as the “summary”.

2k tokens

I just take the first half of the first section and summarize it to see if I get any new insights.
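A sketch of how that halving might be done, reusing the tiktoken encoding from the splitting sketch above (first_n_tokens is just an illustrative helper, not something from the original code):

def first_n_tokens(t: str, n: int) -> str:
    # keep only the first n tokens of a chunk
    return enc.decode(enc.encode(t)[:n])

half_section = first_n_tokens(transcript_splitted[0], numtokens(transcript_splitted[0]) // 2)
print(clarify(half_section))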

Original text: 2004 tokens Summary: 209 tokens Summary is 10.43% of original text
  • Introduction to Jan from OpenAI, who leads the alignment team and has a PhD in reinforcement learning theory
  • AI is joining the workforce with increasingly stronger players emerging over time
  • In the long run, AI has advantages over humans, but humans get to choose which AI players join and when
  • Two main objectives for humans: recruit aligned AI players and write rules of the game so humans don’t lose
  • Aim to build AI systems that follow human intents, both explicit and implicit
  • Alignment techniques: 1) Train reward model by collecting comparison data 2) Optimize the AI model for the reward model
  • Labels for reward models depend on human preferences, with potential inconsistencies
  • GPT-3 models show improvement with increased size, but an instruct model is preferred even when it is 100 times smaller
  • Demonstrating systems that outperform models without alignment techniques and providing open-source software for alignment are essential for adoption
  • Work on various methods, techniques, and research to improve alignment and prepare for future AI advancements

Observations

Double checking the transcript

1k tokens

Original text: 1030 tokens Summary: 331 tokens
Summary is 32.14% of original text

Expectations

Summary / Clarification:

  • Pleasure to welcome Jan from OpenAI
  • Jan leads the alignment team at OpenAI
  • Previously a researcher at DeepMind
  • Has a PhD in reinforcement learning theory
  • Thinking about the alignment problem for over 10 years
  • Presentation on interesting topic
  • Excited to talk about the subject
  • Open to interactive discussions
  • Basic observations on what’s going on
  • Team EI joining the game
  • Different players join one by one
  • Varying skill levels
  • Most players are narrow in their focus
  • Stronger players joining over time
  • Expect incredibly strong players in the future
  • Future players will think better, faster, and cheaper than humans
  • Example: chat Epp, better than humans at knowing facts, speaks more languages, able to write 50 words per second, 100 times cheaper than human minimum wage
  • Chat Epp has important limitations but indicates potential future players
  • Long run: team EI will have all the advantages over team humans
  • However, human advantage is picking which players join from EI
  • Important to lean into this advantage for shaping AI’s development
  • Objective 1: Recruit players from team EI to play on human teams (alignment)
  • Objective 2: Write rules of the game so team human doesn’t lose
  • Focus on alignment: AI systems that follow human intent
  • First part of talk: past work, making current models as aligned as possible
  • Second part of talk: future work, what needs to be done next for alignment
  • Alignment means AI systems that follow human preferences and do what we want them to do

Observations

512 tokens

Original text: 515 tokens Summary: 150 tokens
Summary is 29.13% of original text

Summary / Clarification:

  • Pleasure to welcome Jan from OpenAI, leads the alignment team, former DeepMind researcher, PhD in reinforcement learning theory
  • Discussing AI alignment problem and keeping the talk interactive
  • Basic observations:
    • Team AI is joining the game with players joining one by one, varying in their capabilities
    • Stronger players observed joining over time
    • Expectation of AI having stronger players than humans in the future, thinking faster and cheaper
    • Example: ChatGPT, knows more facts, speaks more languages, writes 50 words per second, 100x cheaper than human minimum wage
    • Still some limitations with current AI capabilities
    • Long-run advantage of AI over humans, but with an important caveat

Observations:

256 tokens

Original text: 253 tokens
Summary: 99 tokens Summary is 39.13% of original text
  • Welcome Jan from OpenAI
    • Leads alignment team at OpenAI
    • Former researcher at DeepMind
    • Holds PhD in reinforcement learning theory
    • Has thought about alignment problem for over 10 years
  • Keep talk interactive, questions welcome
  • Basic observations:
    • AI teams joining the game
    • Different players join one by one
    • Players vary in skill level
    • Many current players not very strong
    • Focused on narrow setups

Observations

128 tokens

Original text: 121 tokens
Summary: 67 tokens Summary is 55.37% of original text
  • Welcome Jan from OpenAI
  • Leads the alignment team at OpenAI
  • Previously a researcher at DeepMind
  • Has a PhD in reinforcement learning theory
  • Has been thinking about the alignment problem for over 10 years
  • Giving an interesting topic today
  • Keep the discussion interactive
  • Encourages questions and comments

Observations

Summary from traversing one tree branch of summaries

| Content Division Factor | Input Tokens | Output Tokens | Output Lines | New Insights (hand labeled except row 1) | New Insights x Division Factor |
|---|---|---|---|---|---|
| 1 | 3997 | 305 | 25 | 25 | 25 |
| 2 | 2004 | 209 | 10 | 3 | 6 |
| 4 | 1030 | 331 | 27 | 1 | 4 |
| 8 | 515 | 150 | 9 | 0 | 0 |
| 16 | 253 | 99 | 14 | 0 | 0 |
| 32 | 121 | 67 | 8 | 0 | 0 |

Observations

Trying to maximize output length

I’ve mentioned a few times above that I wanted to compare these to the first 4k tokens of the transcript. But because the transcript is formatted in a nearly illegible way, I tried to get the model to do its best to be more verbose.

def clarify_verbose(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. " + t)
Original text: 3997 tokens Summary: 464 tokens Summary is 11.61% of original text Number of lines in summary: 42

For context, the original clarification prompt resulted in 305 tokens, 7.63% of the original text
  • Guest speaker: Jan from OpenAI, leader of alignment team
  • Background: Former DeepMind researcher, Ph.D. in reinforcement learning theory
  • Topic: Alignment problem and work done at OpenAI

Basic Observations:

  1. Team AI is joining the game with progressively stronger players.
  2. Future AI players expected to be much better, faster, and cheaper than humans.
  3. One advantage for Team Human: they get to pick which AI players join and when.

Main Objectives for Team Human:

  1. Recruit AI players to help humans (alignment problem)
  2. Write rules of the game to not lose against AI

Talk Outline:

  1. Past work on making existing models as aligned as possible
  2. Future work needed to better align AI

Alignment:

  • Build AI systems that follow human intent and preferences
  • Two main categories: explicit instructions and implicit unwritten expectations

Technique for AI alignment (Inverse Reward Design):

  1. Train reward model from human preferences by comparing different AI responses
  2. Optimize AI model with reinforcement learning

Surprising findings:

  1. InstructGPT models trained with alignment perform much better than non-aligned GPT-3 models at a fraction of parameter size.
  2. Fine-tuning cost for alignment is significantly less than pre-training cost.
  3. InstructGPT alignment technique is much more data-efficient than imitation learning (imitating human demonstration)

Limitations and future work:

  1. Addressing issues like AI model hallucinations
  2. Making AI model less sensitive to prompting
  3. Improving alignment with even stronger AI models
  4. Exploiting the principle that evaluation is easier than generation to further train and refine AI model performance

Key points from Q&A:

  • Comparison between aligned models and pre-training involves ranking and rating based on human preferences
  • Alignment can involve blending pre-training data into the fine-tuning step
  • The InstructGPT model can be distinguished from larger GPT models by human judges if given tailored prompts
  • ChatGPT is an upgrade over InstructGPT, designed to make models more aligned and user-friendly, but still has multiple limitations
  • The principle of “evaluation is easier than generation” can be useful for training AI models, potentially making them more efficient at tasks than humans themselves

Observations

WAIT, a lot of this is made up

This is just the first 4k tokens (about 20 minutes of an hour-long video). Did they actually get to the Q&A? Turns out no, the section of the transcript cuts off here:

first chapter cutoff

The model is trying to take a section of the transcript and structure it as if it were the full video. This is potentially very dangerous.

Let’s restructure the prompt

def clarify_verbose_caveated(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)

and re-run it

  • Speaker: Jan from OpenAI, leads alignment team, previously at DeepMind, PhD in reinforcement learning theory
  • Topic: Alignment problem
  • Observations:
    • Team EI joining the game, with stronger and stronger players over time
    • Team AI has many advantages over Team Humans (speed, cost, etc.)
    • Team Humans has an advantage in choosing which players join and when
  • Main objectives for humans:
    1. Recruit players from Pmei to play on Team Humans (alignment)
    2. Write the rules of the game so that Team Human doesn’t lose
  • Alignment: build AI systems that follow human intent
    • Two main categories of intent: faithful, other intents (e.g., don’t make up stuff, ask follow-up questions)
  • Technique for alignment: Iterate reward modeling
    1. Train reward model from human preferences
    2. Optimize for the reward model with RL
  • How well does it work?
    • InstructGPT: even the smallest version is preferred over the largest GPT-3 model
    • Training costs: fine-tuning on the largest GPT-3 version is about 2% of pre-training cost
  • ChatGPT vs InstructGPT
    • ChatGPT: dialogue as universal interface, better at refusing harmful tasks, but still has limitations (hallucination, sensitivity to prompting)
    • InstructGPT: Upgrade to make models more aligned and useful to humans, but still room for improvement
  • Principle: evaluation is easier than generation
    • Easier to spot good behavior than doing the task yourself
    • Useful for providing comparative feedback to the AI system to improve its performance

Observations

Let’s read the transcript, uh oh

I actually just gave up trying to read the full transcript because it was so unreadable. It makes me appreciate the corrections the model made, and makes me wonder whether it could have done even better with a cleaner transcript. Here’s a sample, and a gist link.

...
um I'm also super happy to keep it
interactive if you have questions and
then you find please interrupt me and
um yeah I wanted to start out with
a few very basic observations on kind of
what I think is going on
and
so the first one is
team EI is joining the game
so IMEI has a lot of different players
they own All Join at the same time but
rather they Join one by one
and not all of the their players are
like very a lot in how good they are and
right now a lot of the players that have
joined so far
I'm really that's mine and usually you
can do it on your very narrow setup
...

Nearly all my doubts so far have been about quantitative claims, e.g. the one about the 100x model size. I just don’t trust the transcript with this given how poor it is.

They uploaded a better transcript

from youtube_transcript_api import YouTubeTranscriptApi

# list every transcript available for this video
transcript_list = YouTubeTranscriptApi.list_transcripts(sample_video_id)

# is_generated is False for manually uploaded (human-written) transcripts
list(transcript_list)[0].is_generated

False
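The manually uploaded transcript can then be fetched from the same listing. This step isn’t shown in the post, so the line below is an assumption about where hand_transcript (used in the next cell) comes from:

# fetch the manually uploaded transcript identified above (is_generated == False)
hand_transcript = list(transcript_list)[0].fetch()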

Here’s a link to the raw content of the good transcript. And let’s process it (just the first section) with our final prompt after some cleanup.

def clarify_verbose_caveated(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)

text_hand = '\n'.join([o['text'] for o in hand_transcript])

hand_transcript_splitted = split_transcript(
    text_hand,
    oai_yt.TOKEN_LIMIT_FOR_SUMMARY,
    numtokens)

len(hand_transcript_splitted), [numtokens(t) for t in hand_transcript_splitted]

(4, [3995, 4000, 4000, 1086])

text_hand_part1 = hand_transcript_splitted[0]

sample_verbose_hand = clarify_verbose_caveated(text_hand_part1)

print_stats(text_hand_part1, sample_verbose_hand)
Original text: 3995 tokens Summary: 266 tokens Summary is 6.66% of original text Number of lines in summary: 15
  • Jan from OpenAI led the alignment team and has a PhD in reinforcement learning theory and was previously at DeepMind
  • He has been thinking about the AI alignment problem for over 10 years
  • Team AI is forming as AI systems improve, with stronger “players” joining over time
  • AI players are becoming more powerful and better than humans in some aspects, such as knowing more facts and speaking more languages
  • However, Team Human has the advantage of picking which AI players join and when
  • Two main objectives for Team Human:
    1. Recruit AI players to play on Team Human (alignment)
    2. Write the rules of the game so that Team Human doesn’t lose
  • AI alignment is about building systems that follow human intent and preferences
  • Two main categories of intent:
    1. Explicit intent: following instructions and serving as an assistant
    2. Implicit intent: not making up things, not doing harmful things, asking follow-up questions, etc.
  • Reinforcement learning from human feedback is a key technique for building aligned AI systems
  • By fine-tuning AI models, they can be made more aligned and useful on a wide range of tasks
  • One issue is that AI models can still hallucinate, even with fine-tuning and reinforcement learning from human feedback

Reading the full transcript

The raw transcript was still hard to read given the formatting. Because my previous efforts to use the model as a pure language formatter had failed, and I was only concerned with the first 20 minutes of the video, I just chose to watch the video and hope I could make sense of the audio.

Plus, I’m tired, and I’ve read so many summaries that if something new pops up in the audio, I’ll definitely notice it.

New things learnt from actually watching part 1

My takeaway from watching the first 12 minutes is that there are several subtle things that someone who really cares about the topic will find the video worth watching for. However, you aren’t really missing any of the very high-level points.

In that way, I really like the hierarchical reading the summaries enable.

Merging sections

for text_hand_part in hand_transcript_splitted:
	sample_verbose_hand = clarify_verbose_caveated(text_hand_part)
	print_stats(text_hand_part, sample_verbose_hand)

Link to the gist. I did skim this very briefly and realized there was a key mistake.

Why is the summary of the first part different this time?

If you look at the part 1 output from the second run, there are insights in there that are true and weren’t in the first run. Because of the default temperature setting, the model outputs a meaningfully different summary for the first part that we’ve already processed.
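If reproducibility mattered, one option would be to pin the sampling temperature. A sketch, assuming the same chat API wrapper as before (ask_deterministic is just an illustrative name, not something used elsewhere in this post):

import openai

def ask_deterministic(prompt: str, model: str = "gpt-4") -> str:
    # temperature=0 makes decoding (near-)greedy, so repeated runs mostly agree
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]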

Does lower temperature mean more truthful answers?

Hard to say as an outsider, given we don’t really know what effect the RL ends up having on the model. We don’t even know if they use the standard formulation of temperature in their implementation. Sampling is unfortunately a total black box.

A key factual error

  • “Training a reward model from comparisons is the first step, followed by optimizing the reward model with reinforcement learning.”

The reward model is not the one that’s RLed. It just outputs an approximation of what a human would think of a completion; it’s the language model (the policy) that gets optimized against it with RL. If you believed this from the summary, you’d have the wrong understanding of this key part of the process and maybe lose a lot of time (more than what you saved).
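To make the distinction concrete, here’s a toy sketch of the two steps, purely my own illustration (nothing like OpenAI’s actual implementation): the reward model is trained from comparisons and then frozen, and the RL step only changes the policy’s behaviour.

# step 1 (toy): "train" a reward model from human comparisons; afterwards it only scores text
def train_reward_model(comparisons):
    chosen = [winner for winner, loser in comparisons]
    return lambda completion: chosen.count(completion)

# step 2 (toy stand-in for RL): the policy's behaviour is what gets optimized, never the reward model
def rl_finetune(candidate_behaviours, reward_model):
    return max(candidate_behaviours, key=reward_model)

reward_model = train_reward_model([("concise answer", "rambling answer")])
policy_behaviour = rl_finetune(["rambling answer", "concise answer"], reward_model)
print(policy_behaviour)  # 'concise answer'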

What happens at the edges of the split

The last 2 splits are at an interesting point

hand_transcript_splitted[-1]
"to train the system to be more--\nor to get..."

last chapter cutoff

It’s interesting because the split lands right in the middle of him explaining something, and a fairly advanced implication at that.

The last line of summary N-1 is

  • Addressing safety concerns and potential biases in user input by training AI on diverse and representative sets of human preferences

and the first line of summary N is

  • Train inner optimizers to deal with distributional shift and construct outer alignment signals

While sounding intelligent, this line makes no sense, and I had to go back and watch from about two minutes before the split point to understand what he’s even talking about.

This is a really good example of a problem that can only really be solved by convolving (overlapping) the split point or with a larger context window.
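A sketch of what that overlap might look like, reusing the tiktoken encoding from the splitting sketch earlier (split_with_overlap is just an illustrative helper):

def split_with_overlap(text: str, token_limit: int, overlap: int = 500) -> list:
    # adjacent chunks share `overlap` tokens, so content near a split point
    # always appears with some surrounding context in at least one chunk
    tokens = enc.encode(text)
    step = token_limit - overlap
    return [enc.decode(tokens[i:i + token_limit]) for i in range(0, len(tokens), step)]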

Conclusions

Somewhat obvious sounding, but if you really care, I think the specific examples worked through above will give you more insights of your own.

Principles followed

Future work