
Evaluating GPT-4 through literally one summarization, but deeply and systematically

May 30, 2023 at 03:22 PM

It bothers me that there are no great evaluations for large language models on the high-level tasks we actually find them most helpful for — chief amongst them summarization.

Perplexity is in essence just the loss function, and measurement on downstream objective tasks like question answering is helpful but too narrow in domain. There are also other ways to do fact-based question answering.

Sam Altman said in an interview that his killer use case for GPT is summarization. I find it doubly helpful on YouTube transcripts, where the audio quality can sometimes be poor.

The focus here is a very qualitative but very deep evaluation and working with one very specific example — a YouTube video from CS25 at Stanford about AI alignment by Jan Leike. I was very interested in the topic, but the audio quality was poor.

I chose to have the model produce hierarchical summaries and read through them in a systematic way — from the highest level, all the way to actually watching the video so I could verify what came before. I compared what new learnings I had from each level of summary. I generally trusted the model’s output at each level, and the next few levels would usually reveal any false beliefs I had developed.

This is a long post, probably only worth your time if you’re deeply interested in using LLMs for summarization. It’s really just my notes as I went along. But it’s interesting that through this one simple example, we’ll actually touch nearly every core concept in LLM applications today — prompt engineering, reward models, context windows, temperature sampling, and hallucination.

The guiding question is: how much of the original understanding can I develop through the summaries, and how much time does it save?


TL;DR (this won’t be new, the details are where the fun is)

Overall, these models do open up an ability to read nonfiction hierarchically, which a good table of contents already enabled. Mostly, they add a lot of value by saving you the time of reading uninteresting content.

From this point on the post is structured as a notebook/log.

Sample of the video transcript

This is what we have to work with

sample_transcript = yt.get_transcript(sample_video_id)
sample_transcript_raw_text = '\n'.join([o['text'] for o in sample_transcript])
print(sample_transcript_raw_text)
it's my pleasure to welcome Jan from
openai I'm he leads the alignment team
there and he was previously a researcher
at deepmind as well what's a PhD in
reinforcement learning theory has been
thinking about the alignment problem for
over 10 years and today he'll be giving
a very interesting topic so hope you
guys enjoy
yeah thanks a lot for the intro and
thanks a lot for having me
um
I'm very excited to talk about this
stuff
um I'm also super happy to keep it
interactive if you have questions and
then you find please interrupt me and
um yeah I wanted to start out with
a few very basic observations on kind of
what I think is going on
and
so the first one is
team EI is joining the game
so IMEI has a lot of different playe
...

gist link
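For reference, the yt.get_transcript helper isn’t shown in the post; presumably it’s a thin wrapper around the youtube_transcript_api package that appears later on, roughly along these lines (a sketch, not the exact helper):

from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id: str) -> list:
    # returns a list of {'text', 'start', 'duration'} dicts for the video's transcript
    return YouTubeTranscriptApi.get_transcript(video_id)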

Split it to fit the token window

GPT-4 has an 8k token window right now (32k coming soon, Anthropic has a 100k model). I’m setting the TOKEN_LIMIT_FOR_SUMMARY to 4k in order to allow the summary to be as long as the content if it wants to be.
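The numtokens helper and the split itself aren’t shown; here is a minimal sketch of what they might look like, assuming tiktoken’s cl100k_base encoding and greedy line-by-line packing (the exact split_transcript used later in the post may differ):

import tiktoken

TOKEN_LIMIT_FOR_SUMMARY = 4000  # i.e. oai_yt.TOKEN_LIMIT_FOR_SUMMARY later in the post
enc = tiktoken.get_encoding("cl100k_base")  # assumed GPT-4 tokenizer

def numtokens(t: str) -> int:
    return len(enc.encode(t))

def split_transcript(text: str, token_limit: int, numtokens) -> list:
    # greedily pack transcript lines into chunks of at most token_limit tokens
    chunks, current = [], ""
    for line in text.split('\n'):
        candidate = current + '\n' + line if current else line
        if numtokens(candidate) > token_limit and current:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

transcript_splitted = split_transcript(sample_transcript_raw_text, TOKEN_LIMIT_FOR_SUMMARY, numtokens)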

len(transcript_splitted), [numtokens(t) for t in transcript_splitted]
(4, [3997, 4000, 3973, 173])

What I like about this split is that the last two chunk sizes are very different, and there’s some knowledge that’s split across the two. We’ll come back to this at the end.

The initial summarization prompt

def summarize(t: str) -> str:
	return ask("as an expert assistant, summarize the following text from a youtube audio transcript into bullet points: " + t)

Summarizing just one section

Let’s focus on just the first “section” (not an author-determined chapter) — transcript_splitted[0]. Then, I’ll progressively halve that text and have the model summarize each first half, highlighting new insights in bold as I read. I’ll mark doubts in italic.

Summary of all 4k tokens of section 1

Original text: 3997 tokens Summary: 245 tokens
Summary is 6.13% of original text
  • Welcome Jan from OpenAI leading the alignment team, previously a researcher at DeepMind.
  • Jan discusses the alignment problem for over 10 years.
  • Observations: AI is joining the game, with stronger players over time.
  • In the long run, AI will have advantages over humans.
  • Team Humans should recruit players from AI and write rules to ensure team human doesn’t lose.
  • Jan focuses on the first objective (alignment) in his talk - building AI systems that follow human intent.
  • Two categories of intent: overt (instructions) and covert (unspoken, general preferences).
  • The main technique used involves a reward model optimized by AI.
  • InstructGPT was found to be preferred by humans over the larger GPT-3 model.
  • Challenges include catastrophic forgetting and distributional shift.
  • Future steps include making AI more robust, enabling it to ask clarifying questions, and tackling scaling.

Doubts I have coming out of this

Overall, this was a good summary — several insights and only 3 clarifications coming out of this.

Clarification vs. Summary: Asking the model to not remove information

As an alternate prompt, I get the model to “clarify” the transcript, asking it to not remove ANY information.

def clarify(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. " + t)
Original text: 3997 tokens Clarification: 305 tokens
Clarification is 7.63% of original text

Summary (Clarification version)

  • Jan from OpenAI leads the alignment team
  • Working on alignment problem for 10+ years
  • Observations:
    • AI is joining the game more and more with stronger players
    • Expect AI to have incredibly strong players in the future
    • Team Human has an advantage by picking which AI players join and when
  • Main objectives of Team Human:
    • Recruit players from AI to play on the human team (alignment)
    • Write the rules of the game so that Team Human doesn’t lose
  • Main talk focus: first objective (alignment)
  • Two categories of intent for alignment:
    • Task intent (following human instructions)
    • Other intents (not just following instructions, but also not making up stuff, doing harmful things, etc.)
    • Basic technique for alignment:
      • 1st step: Train a reward model from comparison
      • 2nd step: Optimize the AI model to the reward model using reinforcement learning
    • Results: InstructGPT preferred over larger GPT-3 model
    • Fine-tuning with RL is cheap compared to pre-training
    • Limitations: Not all problems fixed, but useful progress
    • ChatGPT as an upgraded version of InstructGPT for alignment
      • Dialogue as the universal interface
      • Implement human-in-the-loop evaluations
    • Principle: Evaluation is easier than generation
      • Easier to evaluate model’s output than generate it as a human
      • Improved model performance through human feedback

Observations

Turns out I’ve discovered a better summarization prompt than the original! Let’s use this going forward as the “summary”.

2k tokens

I just take the first half of the first section and summarize it to see if I get any new insights.
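A sketch of how that halving might be done, reusing the tiktoken encoding from the splitting sketch above (first_n_tokens is just an illustrative helper, not something from the original code):

def first_n_tokens(t: str, n: int) -> str:
    # keep only the first n tokens of a chunk
    return enc.decode(enc.encode(t)[:n])

half_section = first_n_tokens(transcript_splitted[0], numtokens(transcript_splitted[0]) // 2)
print(clarify(half_section))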

Original text: 2004 tokens Summary: 209 tokens Summary is 10.43% of original text
  • Introduction to Jan from OpenAI, who leads the alignment team and has a PhD in reinforcement learning theory
  • AI is joining the workforce with increasingly stronger players emerging over time
  • In the long run, AI has advantages over humans, but humans get to choose which AI players join and when
  • Two main objectives for humans: recruit aligned AI players and write rules of the game so humans don’t lose
  • Aim to build AI systems that follow human intents, both explicit and implicit
  • Alignment techniques: 1) Train reward model by collecting comparison data 2) Optimize the AI model for the reward model
  • Labels for reward models depend on human preferences, with potential inconsistencies
  • GPT-3 models show improvement with increased size, but an instruct model is preferred even when it is 100 times smaller
  • Demonstrating systems that outperform models without alignment techniques and providing open-source software for alignment are essential for adoption
  • Work on various methods, techniques, and research to improve alignment and prepare for future AI advancements

Observations

Double checking the transcript

1k tokens

Original text: 1030 tokens Summary: 331 tokens
Summary is 32.14% of original text

Expectations

Summary / Clarification:

  • Pleasure to welcome Jan from OpenAI
  • Jan leads the alignment team at OpenAI
  • Previously a researcher at DeepMind
  • Has a PhD in reinforcement learning theory
  • Thinking about the alignment problem for over 10 years
  • Presentation on interesting topic
  • Excited to talk about the subject
  • Open to interactive discussions
  • Basic observations on what’s going on
  • Team EI joining the game
  • Different players join one by one
  • Varying skill levels
  • Most players are narrow in their focus
  • Stronger players joining over time
  • Expect incredibly strong players in the future
  • Future players will think better, faster, and cheaper than humans
  • Example: chat Epp, better than humans at knowing facts, speaks more languages, able to write 50 words per second, 100 times cheaper than human minimum wage
  • Chat Epp has important limitations but indicates potential future players
  • Long run: team EI will have all the advantages over team humans
  • However, human advantage is picking which players join from EI
  • Important to lean into this advantage for shaping AI’s development
  • Objective 1: Recruit players from team EI to play on human teams (alignment)
  • Objective 2: Write rules of the game so team human doesn’t lose
  • Focus on alignment: AI systems that follow human intent
  • First part of talk: past work, making current models as aligned as possible
  • Second part of talk: future work, what needs to be done next for alignment
  • Alignment means AI systems that follow human preferences and do what we want them to do

Observations

512 tokens

Original text: 515 tokens Summary: 150 tokens
Summary is 29.13% of original text

Summary / Clarification:

  • Pleasure to welcome Jan from OpenAI, leads the alignment team, former DeepMind researcher, PhD in reinforcement learning theory
  • Discussing AI alignment problem and keeping the talk interactive
  • Basic observations:
    • Team AI is joining the game with players joining one by one, varying in their capabilities
    • Stronger players observed joining over time
    • Expectation of AI having stronger players than humans in the future, thinking faster and cheaper
    • Example: ChatGPT, knows more facts, speaks more languages, writes 50 words per second, 100x cheaper than human minimum wage
    • Still some limitations with current AI capabilities
    • Long-run advantage of AI over humans, but with an important caveat

Observations:

256 tokens

Original text: 253 tokens
Summary: 99 tokens Summary is 39.13% of original text
  • Welcome Jan from OpenAI
    • Leads alignment team at OpenAI
    • Former researcher at DeepMind
    • Holds PhD in reinforcement learning theory
    • Has thought about alignment problem for over 10 years
  • Keep talk interactive, questions welcome
  • Basic observations:
    • AI teams joining the game
    • Different players join one by one
    • Players vary in skill level
    • Many current players not very strong
    • Focused on narrow setups

Observations

128 tokens

Original text: 121 tokens
Summary: 67 tokens Summary is 55.37% of original text
  • Welcome Jan from OpenAI
  • Leads the alignment team at OpenAI
  • Previously a researcher at DeepMind
  • Has a PhD in reinforcement learning theory
  • Has been thinking about the alignment problem for over 10 years
  • Giving an interesting topic today
  • Keep the discussion interactive
  • Encourages questions and comments

Observations

Summary from traversing one tree branch of summaries

| Content Division Factor | Input Tokens | Output Tokens | Output Lines | New Insights (hand labeled except row 1) | New Insights x Division Factor |
|---|---|---|---|---|---|
| 1 | 3997 | 305 | 25 | 25 | 25 |
| 2 | 2004 | 209 | 10 | 3 | 6 |
| 4 | 1030 | 331 | 27 | 1 | 4 |
| 8 | 515 | 150 | 9 | 0 | 0 |
| 16 | 253 | 99 | 14 | 0 | 0 |
| 32 | 121 | 67 | 8 | 0 | 0 |

Observations

Trying to maximize output length

I’ve mentioned a few times above that I wanted to compare these to the first 4k tokens of the transcript. But because the transcript is formatted in a nearly illegible way, I tried to get the model to do its best to be more verbose.

def clarify_verbose(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. " + t)
Original text: 3997 tokens Summary: 464 tokens Summary is 11.61% of original text Number of lines in summary: 42

For context, the original clarification prompt resulted in 305 tokens, 7.63% of the original text
  • Guest speaker: Jan from OpenAI, leader of alignment team
  • Background: Former DeepMind researcher, Ph.D. in reinforcement learning theory
  • Topic: Alignment problem and work done at OpenAI

Basic Observations:

  1. Team AI is joining the game with progressively stronger players.
  2. Future AI players expected to be much better, faster, and cheaper than humans.
  3. One advantage for Team Human: they get to pick which AI players join and when.

Main Objectives for Team Human:

  1. Recruit AI players to help humans (alignment problem)
  2. Write rules of the game to not lose against AI

Talk Outline:

  1. Past work on making existing models as aligned as possible
  2. Future work needed to better align AI

Alignment:

  • Build AI systems that follow human intent and preferences
  • Two main categories: explicit instructions and implicit unwritten expectations

Technique for AI alignment (Inverse Reward Design):

  1. Train reward model from human preferences by comparing different AI responses
  2. Optimize AI model with reinforcement learning

Surprising findings:

  1. InstructGPT models trained with alignment perform much better than non-aligned GPT-3 models at a fraction of parameter size.
  2. Fine-tuning cost for alignment is significantly less than pre-training cost.
  3. InstructGPT alignment technique is much more data-efficient than imitation learning (imitating human demonstration)

Limitations and future work:

  1. Addressing issues like AI model hallucinations
  2. Making AI model less sensitive to prompting
  3. Improving alignment with even stronger AI models
  4. Exploiting the principle that evaluation is easier than generation to further train and refine AI model performance

Key points from Q&A:

  • Comparison between aligned models and pre-training involves ranking and rating based on human preferences
  • Alignment can involve blending pre-training data into the fine-tuning step
  • The InstructGPT model can be distinguished from larger GPT models by human judges if given tailored prompts
  • ChatGPT is an upgrade over InstructGPT, designed to make models more aligned and user-friendly, but still has multiple limitations
  • The principle of “evaluation is easier than generation” can be useful for training AI models, potentially making them more efficient at tasks than humans themselves

Observations

WAIT, a lot of this is made up

This is just the first 4k tokens (about 20 minutes of an hour-long video). Did they actually get to the Q&A? Turns out no, the section of the transcript cuts off here:

first chapter cutoff

The model is trying to take a section of the transcript and structure it as if it were the full video. This is potentially very dangerous.

Let’s restructure the prompt

def clarify_verbose_caveated(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)

and re-run it

  • Speaker: Jan from OpenAI, leads alignment team, previously at DeepMind, PhD in reinforcement learning theory
  • Topic: Alignment problem
  • Observations:
    • Team EI joining the game, with stronger and stronger players over time
    • Team AI has many advantages over Team Humans (speed, cost, etc.)
    • Team Humans has an advantage in choosing which players join and when
  • Main objectives for humans:
    1. Recruit players from Pmei to play on Team Humans (alignment)
    2. Write the rules of the game so that Team Human doesn’t lose
  • Alignment: build AI systems that follow human intent
    • Two main categories of intent: faithful, other intents (e.g., don’t make up stuff, ask follow-up questions)
  • Technique for alignment: Iterate reward modeling
    1. Train reward model from human preferences
    2. Optimize for the reward model with RL
  • How well does it work?
    • InstructGPT: even the smallest version is preferred over the largest GPT-3 model
    • Training costs: fine-tuning on the largest GPT-3 version is about 2% of pre-training cost
  • ChatGPT vs InstructGPT
    • ChatGPT: dialogue as universal interface, better at refusing harmful tasks, but still has limitations (hallucination, sensitivity to prompting)
    • InstructGPT: Upgrade to make models more aligned and useful to humans, but still room for improvement
  • Principle: evaluation is easier than generation
    • Easier to spot good behavior than doing the task yourself
    • Useful for providing comparative feedback to the AI system to improve its performance

Observations

Let’s read the transcript, uh oh

I actually just gave up trying to read the full transcript because it was so unreadable. It makes me appreciate the corrections the model made, and makes me wonder whether it could have done even better with a cleaner transcript. Here’s a sample, and a gist link.

...
um I'm also super happy to keep it
interactive if you have questions and
then you find please interrupt me and
um yeah I wanted to start out with
a few very basic observations on kind of
what I think is going on
and
so the first one is
team EI is joining the game
so IMEI has a lot of different players
they own All Join at the same time but
rather they Join one by one
and not all of the their players are
like very a lot in how good they are and
right now a lot of the players that have
joined so far
I'm really that's mine and usually you
can do it on your very narrow setup
...

Nearly all my doubts so far have been about quantitative claims, e.g. the one about the 100x model size. I just don’t trust the transcript with this given how poor it is.

They uploaded a better transcript

from youtube_transcript_api import YouTubeTranscriptApi

# list every transcript available for this video
transcript_list = YouTubeTranscriptApi.list_transcripts(sample_video_id)

# is_generated is False for manually uploaded (human-written) transcripts
list(transcript_list)[0].is_generated

False
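The manually uploaded transcript can then be fetched from the same listing. This step isn’t shown in the post, so the line below is an assumption about where hand_transcript (used in the next cell) comes from:

# fetch the manually uploaded transcript identified above (is_generated == False)
hand_transcript = list(transcript_list)[0].fetch()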

Here’s a link to the raw content of the good transcript. And let’s process it (just the first section) with our final prompt after some cleanup.

def clarify_verbose_caveated(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)

text_hand = '\n'.join([o['text'] for o in hand_transcript])

hand_transcript_splitted = split_transcript(
    text_hand,
    oai_yt.TOKEN_LIMIT_FOR_SUMMARY,
    numtokens)

len(hand_transcript_splitted), [numtokens(t) for t in hand_transcript_splitted]

(4, [3995, 4000, 4000, 1086])

text_hand_part1 = hand_transcript_splitted[0]

sample_verbose_hand = clarify_verbose_caveated(text_hand_part1)

print_stats(text_hand_part1, sample_verbose_hand)
Original text: 3995 tokens Summary: 266 tokens Summary is 6.66% of original text Number of lines in summary: 15
  • Jan from OpenAI led the alignment team and has a PhD in reinforcement learning theory and was previously at DeepMind
  • He has been thinking about the AI alignment problem for over 10 years
  • Team AI is forming as AI systems improve, with stronger “players” joining over time
  • AI players are becoming more powerful and better than humans in some aspects, such as knowing more facts and speaking more languages
  • However, Team Human has the advantage of picking which AI players join and when
  • Two main objectives for Team Human:
    1. Recruit AI players to play on Team Human (alignment)
    2. Write the rules of the game so that Team Human doesn’t lose
  • AI alignment is about building systems that follow human intent and preferences
  • Two main categories of intent:
    1. Explicit intent: following instructions and serving as an assistant
    2. Implicit intent: not making up things, not doing harmful things, asking follow-up questions, etc.
  • Reinforcement learning from human feedback is a key technique for building aligned AI systems
  • By fine-tuning AI models, they can be made more aligned and useful on a wide range of tasks
  • One issue is that AI models can still hallucinate, even with fine-tuning and reinforcement learning from human feedback

Reading the full transcript

The raw transcript was still hard to read given the formatting. Because my previous efforts to use the model as a pure language formatter had failed, and I was only concerned with the first 20 minutes of the video, I just chose to watch the video and hope I could make sense of the audio.

Plus, I’m tired, and I’ve read so many summaries that if something new pops up in the audio, I’ll definitely notice it.

New things learnt from actually watching part 1

My takeaway from watching the first 12 minutes is that there are several subtle things that someone who really cares about the topic will find the video worth watching for. However, you aren’t really missing any of the very high-level points.

In that way, I really like the hierarchical reading the summaries enable.

Merging sections

for text_hand_part in hand_transcript_splitted:
	sample_verbose_hand = clarify_verbose_caveated(text_hand_part)
	print_stats(text_hand_part, sample_verbose_hand)

Link to the gist. I did skim this very briefly and realized there was a key mistake.

Why is the summary of the first part different this time?

If you look at the part 1 output from the second run, there are insights in there that are true and weren’t in the first run. Because of the default temperature setting, the model outputs a meaningfully different summary for the first part that we’ve already processed.
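If reproducibility mattered, one option would be to pin the sampling temperature. A sketch, assuming the same chat API wrapper as before (ask_deterministic is just an illustrative name, not something used elsewhere in this post):

import openai

def ask_deterministic(prompt: str, model: str = "gpt-4") -> str:
    # temperature=0 makes decoding (near-)greedy, so repeated runs mostly agree
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]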

Does lower temperature mean more truthful answers?

Hard to say as an outsider, given we don’t really know what effect the RL ends up having on the model. We don’t even know if they use the standard formulation of temperature in their implementation. Sampling is unfortunately a total black box.

A key factual error

  • “Training a reward model from comparisons is the first step, followed by optimizing the reward model with reinforcement learning.”

The reward model is not the one that’s RLed. It just outputs an approximation of what a human would think of a completion; it’s the language model (the policy) that gets optimized against it with RL. If you believed this from the summary, you’d have the wrong understanding of this key part of the process and maybe lose a lot of time (more than what you saved).
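To make the distinction concrete, here’s a toy sketch of the two steps, purely my own illustration (nothing like OpenAI’s actual implementation): the reward model is trained from comparisons and then frozen, and the RL step only changes the policy’s behaviour.

# step 1 (toy): "train" a reward model from human comparisons; afterwards it only scores text
def train_reward_model(comparisons):
    chosen = [winner for winner, loser in comparisons]
    return lambda completion: chosen.count(completion)

# step 2 (toy stand-in for RL): the policy's behaviour is what gets optimized, never the reward model
def rl_finetune(candidate_behaviours, reward_model):
    return max(candidate_behaviours, key=reward_model)

reward_model = train_reward_model([("concise answer", "rambling answer")])
policy_behaviour = rl_finetune(["rambling answer", "concise answer"], reward_model)
print(policy_behaviour)  # 'concise answer'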

What happens at the edges of the split

The last 2 splits are at an interesting point

hand_transcript_splitted[-1]
"to train the system to be more--\nor to get..."

last chapter cutoff

It’s interesting because the split lands right in the middle of him explaining something, and a fairly advanced implication at that.

The last line of summary N-1 is

  • Addressing safety concerns and potential biases in user input by training AI on diverse and representative sets of human preferences

and the first line of summary N is

  • Train inner optimizers to deal with distributional shift and construct outer alignment signals

While sounding intelligent, this line makes no sense, and I had to go back and watch from about two minutes before the split point to understand what he’s even talking about.

This is a really good example of a problem that can only really be solved by convolving (overlapping) the split point or with a larger context window.
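A sketch of what that overlap might look like, reusing the tiktoken encoding from the splitting sketch earlier (split_with_overlap is just an illustrative helper):

def split_with_overlap(text: str, token_limit: int, overlap: int = 500) -> list:
    # adjacent chunks share `overlap` tokens, so content near a split point
    # always appears with some surrounding context in at least one chunk
    tokens = enc.encode(text)
    step = token_limit - overlap
    return [enc.decode(tokens[i:i + token_limit]) for i in range(0, len(tokens), step)]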

Conclusions

Somewhat obvious sounding, but if you really care, I think the specific examples worked through above will give you more insights of your own.

Principles followed

Future work