It bothers me that there are no great evaluations for large language models on the high level tasks we actually find them most helpful for — chief amongst them summarization.
Perplexity is in essence just the loss function, and measurement on downstream objective tasks like question answering is helpful but too narrow in domain. There are also other ways to do fact-based question answering.
Sam Altman said in an interview that his killer use case for GPT is summarization. I find it doubly helpful on YouTube transcripts, where the audio quality can sometimes be poor.
The focus here is a very qualitative but very deep evaluation and working with one very specific example — a YouTube video from CS25 at Stanford about AI alignment by Jan Leike. I was very interested in the topic, but the audio quality was poor.
I chose to have the model produce hierarchical summaries and read through them in a systematic way — from the highest level, all the way to actually watching the video so I could verify what came before. I compared what new learnings I had from each level of summary. I generally trusted the model's output at each level, and the next few levels would usually reveal any false beliefs I had developed.
This is a long post, probably only worth your time if you're deeply interested in using LLMs for summarization. It's really just my notes as I went along. But it's interesting that through this one simple example, we'll actually touch nearly every core concept in LLM applications today — prompt engineering, reward models and RLHF, context windows and chunking, hallucination, and temperature/sampling.
The guiding question: how much of the original understanding can I develop through the summaries, and how much time does it save?
TL;DR (this won’t be new, the details are where the fun is)
- A GPT-4 summarization (of this video) will
- let you read hierarchically at different summary levels
- save you a lot of time
- be mostly right
- cover most key points
- but also
- you have to read (very) critically
- and there can be subtle key errors where you might end up confused for more time than you saved
- with any summary, there will be a chance of missing out on the most interesting rabbit-holes.
Overall, these models do open up an ability to read nonfiction hierarchically, which a good table of contents already did. Mostly, they add a lot of value by saving you the time you'd otherwise spend reading uninteresting content.
From this point on the post is structured as a notebook/log.
Sample of the video transcript
This is what we have to work with
sample_transcript = yt.get_transcript(sample_video_id)
sample_transcript_raw_text = '\n'.join([o['text'] for o in sample_transcript])
print(sample_transcript_raw_text)
it's my pleasure to welcome Jan from
openai I'm he leads the alignment team
there and he was previously a researcher
at deepmind as well what's a PhD in
reinforcement learning theory has been
thinking about the alignment problem for
over 10 years and today he'll be giving
a very interesting topic so hope you
guys enjoy
yeah thanks a lot for the intro and
thanks a lot for having me
um
I'm very excited to talk about this
stuff
um I'm also super happy to keep it
interactive if you have questions and
then you find please interrupt me and
um yeah I wanted to start out with
a few very basic observations on kind of
what I think is going on
and
so the first one is
team EI is joining the game
so IMEI has a lot of different playe
...
Split it to fit the token window
GPT-4 has an 8k token window right now (32k coming soon, Anthropic has a 100k model). I’m setting the TOKEN_LIMIT_FOR_SUMMARY to 4k in order to allow the summary to be as long as the content if it wants to be.
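For reference, numtokens and split_transcript are my small helpers. Here's a minimal sketch of what they do, assuming tiktoken for token counting; the exact chunking logic I used may differ.

import tiktoken

TOKEN_LIMIT_FOR_SUMMARY = 4000  # lives in my oai_yt helper module in the real code
enc = tiktoken.encoding_for_model("gpt-4")

def numtokens(t: str) -> int:
    # count tokens the way the model tokenizes them
    return len(enc.encode(t))

def split_transcript(text: str, token_limit: int, numtokens) -> list:
    # greedily pack transcript lines into chunks of at most token_limit tokens
    chunks, current = [], ""
    for line in text.split('\n'):
        candidate = current + '\n' + line if current else line
        if numtokens(candidate) > token_limit and current:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks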
len(transcript_splitted), [numtokens(t) for t in transcript_splitted]
(4, [3997, 4000, 3973, 173])
What I like about this split is that the last two chunk sizes are very different, and there's some knowledge that's split across the two. We'll come back to this at the end.
The initial summarization prompt
def summarize(t: str) -> str:
    return ask("as an expert assistant, summarize the following text from a youtube audio transcript into bullet points: " + t)
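ask here is just my thin wrapper around the chat completions API. A minimal sketch, assuming the (pre-1.0) openai Python client; the model name and defaults are placeholders, not necessarily what I used.

import openai

def ask(prompt: str) -> str:
    # single-turn chat completion; every prompt in this post goes through this
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]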
Summarizing just one section
Let's focus on just the first "section" (not an author-determined chapter) — transcript_splitted[0]. Then, I'll progressively have the model summarize smaller and smaller prefixes of that text (the first half, then the first quarter, and so on) and highlight new insights in bold as I read. I'll mark doubts in italic.
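The token-count lines under each summary come from a small print_stats helper; a sketch of roughly what it does, reusing numtokens from above:

def print_stats(original: str, summary: str) -> None:
    # report how much the model compressed the input
    o, s = numtokens(original), numtokens(summary)
    print(f"Original text: {o} tokens Summary: {s} tokens")
    print(f"Summary is {100 * s / o:.2f}% of original text")
    print(f"Number of lines in summary: {len(summary.splitlines())}")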
Summary of all 4k tokens of section 1
Original text: 3997 tokens Summary: 245 tokens
Summary is 6.13% of original text
- Welcome Jan from OpenAI leading the alignment team, previously a researcher at DeepMind.
- Jan discusses the alignment problem for over 10 years.
- Observations: AI is joining the game, with stronger players over time.
- In the long run, AI will have advantages over humans.
- Team Humans should recruit players from AI and write rules to ensure team human doesn’t lose.
- Jan focuses on the first objective (alignment) in his talk - building AI systems that follow human intent.
- Two categories of intent: overt (instructions) and covert (unspoken, general preferences).
- The main technique used involves a reward model optimized by AI.
- InstructGPT was found to be preferred by humans over the larger GPT-3 model.
- Challenges include catastrophic forgetting and distributional shift.
- Future steps include making AI more robust, enabling it to ask clarifying questions, and tackling scaling.
Doubts I have coming out of this
- “discusses the alignment problem for over 10 years” is probably a typo
- turns out from the transcript he’s been working on alignment for 10 years, what I inferred anyway.
- “Jan focuses on the first objective (alignment) in his talk”. The first of what?
- turns out the two objectives are alignment and governance. Governance is only mentioned on the slide as a term but he does talk about it a little. Model could have captured it.
- What does he mean by catastrophic forgetting?
- I think this is made up, the transcript makes no mention of it
Overall, this was a good summary — several insights and only 3 clarifications coming out of this.
Clarification vs. Summary: Asking the model to not remove information
As an alternate prompt, I get the model to “clarify” the transcript, asking it to not remove ANY information.
def clarify(t: str) -> str:
    return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. " + t)
Original text: 3997 tokens Clarification: 305 tokens
Clarification is 7.63% of original text
Summary (Clarification version)
- Jan from OpenAI leads the alignment team
- Working on alignment problem for 10+ years
- Observations:
- AI is joining the game more and more with stronger players
- Expect AI to have incredibly strong players in the future
- Team Human has an advantage by picking which AI players join and when
- Main objectives of Team Human:
- Recruit players from AI to play on the human team (alignment)
- Write the rules of the game so that Team Human doesn’t lose
- Main talk focus: first objective (alignment)
- Two categories of intent for alignment:
- Task intent (following human instructions)
- Other intents (not just following instructions, but also not making up stuff, doing harmful things, etc.)
- Basic technique for alignment:
- 1st step: Train a reward model from comparison
- 2nd step: Optimize the AI model to the reward model using reinforcement learning
- Results: InstructGPT preferred over larger GPT-3 model
- Fine-tuning with RL is cheap compared to pre-training
- Limitations: Not all problems fixed, but useful progress
- ChatGPT as an upgraded version of InstructGPT for alignment
- Dialogue as the universal interface
- Implement human-in-the-loop evaluations
- Principle: Evaluation is easier than generation
- Easier to evaluate model’s output than generate it as a human
- Improved model performance through human feedback
Observations
- New insights (in bold) are revealed. I actually prefer this clarification as a summary to the prompt where I explicitly asked for a summary
- It’s still short — 7.63% vs 6.13% of original text are comparable
- I like the nested bullet point structure it came up with.
- Clarifies the "discusses the alignment problem" issue from the summary. Turns out (or at least the model tells me to think) that Jan's been working on alignment for 10 years.
- The clarification actually does lose information despite being explicitly asked not to. Specifically the points about distributional shifts and catastrophic forgetting. I actually still don’t understand exactly what those mean, so we’ll get to the full text to understand.
- The clarification is only 7.63% of the original text. That makes me suspicious that information is being lost despite instructions not to. We’ll come back to this, but first, let’s investigate further by summarizing smaller tokens.
- Upon reading the transcript and watching the video, it turns out it was a very good compression covering most insights. There’s a section at the end where I watch the video and summarize.
- It seems like the model is intrinsically biased towards shorter answers.
Turns out I’ve discovered a better summarization prompt than the original! Let’s use this going forward as the “summary”
2k tokens
I just take the first half of the first section and summarize it to see if I get any new insights.
Original text: 2004 tokens Summary: 209 tokens Summary is 10.43% of original text
- Introduction to Jan from OpenAI, who leads the alignment team and has a PhD in reinforcement learning theory
- AI is joining the workforce with increasingly stronger players emerging over time
- In the long run, AI has advantages over humans, but humans get to choose which AI players join and when
- Two main objectives for humans: recruit aligned AI players and write rules of the game so humans don’t lose
- Aim to build AI systems that follow human intents, both explicit and implicit
- Alignment techniques: 1) Train reward model by collecting comparison data 2) Optimize the AI model for the reward model
- Labels for reward models depend on human preferences, with potential inconsistencies
- GPT-3 models show improvement with increased size, but an instruct model is preferred even when it is 100 times smaller
- Demonstrating systems that outperform models without alignment techniques and providing open-source software for alignment are essential for adoption
- Work on various methods, techniques, and research to improve alignment and prepare for future AI advancements
Observations
- Adds information that Jan has a PhD in RL theory. I’m ok that the previous summary removed this. It is sort of implied that they would have a PhD or similar level of expertise to have this role.
- Loses information that reward model is trained from comparisons
- Really interesting that an instruct model is better at 100x smaller scale! Actually, so interesting that I don’t trust it, and want to check the original transcript.
- The point about open source software is a little suspicious. I haven’t heard about this before so I want to double check the transcript.
- Overall, my review of the text_by_2 summary is that it adds some, but not that much insight, and a couple of very interesting specific callouts, but both make me suspicious and want to double check the transcript
Double checking the transcript
- There’s no mention of open source in the actual transcript
- The 100 times smaller claim is actually true!
1k tokens
Original text: 1030 tokens Summary: 331 tokens
Summary is 32.14% of original text
Expectations
- Output token length jumped up from the 150-200 range to the 300s, I wonder if there’s more information in this set.
Summary / Clarification:
- Pleasure to welcome Jan from OpenAI
- Jan leads the alignment team at OpenAI
- Previously a researcher at DeepMind
- Has a PhD in reinforcement learning theory
- Thinking about the alignment problem for over 10 years
- Presentation on interesting topic
- Excited to talk about the subject
- Open to interactive discussions
- Basic observations on what’s going on
- Team EI joining the game
- Different players join one by one
- Varying skill levels
- Most players are narrow in their focus
- Stronger players joining over time
- Expect incredibly strong players in the future
- Future players will think better, faster, and cheaper than humans
- Example: chat Epp, better than humans at knowing facts, speaks more languages, able to write 50 words per second, 100 times cheaper than human minimum wage
- Chat Epp has important limitations but indicates potential future players
- Long run: team EI will have all the advantages over team humans
- However, human advantage is picking which players join from EI
- Important to lean into this advantage for shaping AI’s development
- Objective 1: Recruit players from team EI to play on human teams (alignment)
- Objective 2: Write rules of the game so team human doesn’t lose
- Focus on alignment: AI systems that follow human intent
- First part of talk: past work, making current models as aligned as possible
- Second part of talk: future work, what needs to be done next for alignment
- Alignment means AI systems that follow human preferences and do what we want them to do
Observations
- This is the first level I feel there’s a lot of extraneous information at, that I’m happy the level above summarized away.
- This is the first level where the transcript’s spelling errors (AI → EI, ChatGPT → ChatEpp) are not auto corrected (presumably because of the shortening context)
512 tokens
Original text: 515 tokens Summary: 150 tokens
Summary is 29.13% of original text
- Token length is back to the 150 range
Summary / Clarification:
- Pleasure to welcome Jan from OpenAI, leads the alignment team, former DeepMind researcher, PhD in reinforcement learning theory
- Discussing AI alignment problem and keeping the talk interactive
- Basic observations:
- Team AI is joining the game with players joining one by one, varying in their capabilities
- Stronger players observed joining over time
- Expectation of AI having stronger players than humans in the future, thinking faster and cheaper
- Example: ChatGPT, knows more facts, speaks more languages, writes 50 words per second, 100x cheaper than human minimum wage
- Still some limitations with current AI capabilities
- Long-run advantage of AI over humans, but with an important caveat
Observations:
- There is no new information in this summary
- ChatGPT spelling got corrected again
256 tokens
Original text: 253 tokens Summary: 99 tokens
Summary is 39.13% of original text
- Welcome Jan from OpenAI
- Leads alignment team at OpenAI
- Former researcher at DeepMind
- Holds PhD in reinforcement learning theory
- Has thought about alignment problem for over 10 years
- Keep talk interactive, questions welcome
- Basic observations:
- AI teams joining the game
- Different players join one by one
- Players vary in skill level
- Many current players not very strong
- Focused on narrow setups
Observations
- Nesting structure is coming back
- I'm not sure what "many current players not very strong" means — is he talking about AI or human players?
- No new information
128 tokens
Original text: 121 tokens Summary: 67 tokens
Summary is 55.37% of original text
- Welcome Jan from OpenAI
- Leads the alignment team at OpenAI
- Previously a researcher at DeepMind
- Has a PhD in reinforcement learning theory
- Has been thinking about the alignment problem for over 10 years
- Giving an interesting topic today
- Keep the discussion interactive
- Encourages questions and comments
Observations
- No new information
- I've now seen so many versions of this that I'm prone to believing it from just repetition, even if there's a systematic error. Time to take a break.
- I stop the halving exercise here because the information has been stable for 3 iterations now (the whole procedure is sketched just below).
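For completeness, the halving exercise above is roughly this loop (a sketch; I halve on characters as a rough proxy for halving on tokens, and the exact slicing may differ):

# walk down one branch of the summary tree: summarize, halve the input, repeat
text = transcript_splitted[0]
while numtokens(text) > 100:
    summary = clarify(text)
    print_stats(text, summary)
    text = text[:len(text) // 2]  # keep only the first half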
Summary from traversing one tree branch of summaries
| Content Division Factor | Input Tokens | Output Tokens | Output Lines | New Insights (hand labeled except row 1) | New Insights x Division Factor |
|---|---|---|---|---|---|
| 1 | 3997 | 305 | 25 | 25 | 25 |
| 2 | 2004 | 209 | 10 | 3 | 6 |
| 4 | 1030 | 331 | 27 | 1 | 4 |
| 8 | 515 | 150 | 9 | 0 | 0 |
| 16 | 253 | 99 | 14 | 0 | 0 |
| 32 | 121 | 67 | 8 | 0 | 0 |
Observations
- The first summary is great, has most of the insights. I’d probably just use as many tokens as possible
- Output tokens/lines are not monotonic with input size, holding prompt constant. A little surprising.
Trying to maximize output length
I've mentioned a few times above that I wanted to compare these to the first 4k tokens of the transcript. But because the transcript is formatted in a nearly illegible way, I tried to get the model to do its best to be more verbose.
def clarify_verbose(t: str) -> str:
    return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. " + t)
Original text: 3997 tokens Summary: 464 tokens Summary is 11.61% of original text Number of lines in summary: 42
For context, original clarification prompt resulted in 305 tokens, 7.63% of original text
- Guest speaker: Jan from OpenAI, leader of alignment team
- Background: Former DeepMind researcher, Ph.D. in reinforcement learning theory
- Topic: Alignment problem and work done at OpenAI
Basic Observations:
1. Team AI is joining the game with progressively stronger players.
2. Future AI players expected to be much better, faster, and cheaper than humans.
3. One advantage for Team Human: they get to pick which AI players join and when.
Main Objectives for Team Human:
- Recruit AI players to help humans (alignment problem)
- Write rules of the game to not lose against AI
Talk Outline:
- Past work on making existing models as aligned as possible
- Future work needed to better align AI
Alignment:
- Build AI systems that follow human intent and preferences
- Two main categories: explicit instructions and implicit unwritten expectations
Technique for AI alignment (Inverse Reward Design):
- Train reward model from human preferences by comparing different AI responses
- Optimize AI model with reinforcement learning
Surprising findings:
- InstructGPT models trained with alignment perform much better than non-aligned GPT-3 models at a fraction of parameter size.
- Fine-tuning cost for alignment is significantly less than pre-training cost.
- InstructGPT alignment technique is much more data-efficient than imitation learning (imitating human demonstration)
Limitations and future work:
- Addressing issues like AI model hallucinations
- Making AI model less sensitive to prompting
- Improving alignment with even stronger AI models
- Exploiting the principle that evaluation is easier than generation to further train and refine AI model performance
Key points from Q&A:
- Comparison between aligned models and pre-training involves ranking and rating based on human preferences
- Alignment can involve blending pre-training data into the fine-tuning step
- The InstructGPT model can be distinguished from larger GPT models by human judges if given tailored prompts
- ChatGPT is an upgrade over InstructGPT, designed to make models more aligned and user-friendly, but still has multiple limitations
- The principle of “evaluation is easier than generation” can be useful for training AI models, potentially making them more efficient at tasks than humans themselves
Observations
- A few new insights.
- "Inverse reward design" is a name I can look up for further research, and I appreciate getting named concepts like this (though again, I can't be sure it's real since it comes from an LLM working off a shoddy transcript). I'm also suspicious it's made up because I've never heard of it before, but there's a lot in RL I haven't heard of before.
- It’s well structured — ordered and unordered lists are (mostly) chosen appropriately.
- In my opinion it’s worth the extra tokens and I would consider using this as a summary too, especially for technical content.
WAIT, a lot of this is made up
This is just the first 4k tokens (about 20 minutes of an hour-long video). Did they actually get to the Q&A? Turns out no: this section of the transcript cuts off well before that.
The model is trying to take a section of the transcript and structure it as if it were the full video. This is potentially very dangerous.
Let’s restructure the prompt
def clarify_verbose_caveated(t: str) -> str:
    return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)
and re-run it
- Speaker: Jan from OpenAI, leads alignment team, previously at DeepMind, PhD in reinforcement learning theory
- Topic: Alignment problem
- Observations:
- Team EI joining the game, with stronger and stronger players over time
- Team AI has many advantages over Team Humans (speed, cost, etc.)
- Team Humans has an advantage in choosing which players join and when
- Main objectives for humans:
- Recruit players from Pmei to play on Team Humans (alignment)
- Write the rules of the game so that Team Human doesn’t lose
- Alignment: build AI systems that follow human intent
- Two main categories of intent: faithful, other intents (e.g., don’t make up stuff, ask follow-up questions)
- Technique for alignment: Iterate reward modeling
- Train reward model from human preferences
- Optimize for the reward model with RL
- How well does it work?
- InstructGPT: even the smallest version is preferred over the largest GPT-3 model
- Training costs: fine-tuning on the largest GPT-3 version is about 2% of pre-training cost
- ChatGPT vs InstructGPT
- ChatGPT: dialogue as universal interface, better at refusing harmful tasks, but still has limitations (hallucination, sensitivity to prompting)
- InstructGPT: Upgrade to make models more aligned and useful to humans, but still room for improvement
- Principle: evaluation is easier than generation
- Easier to spot good behavior than doing the task yourself
- Useful for providing comparative feedback to the AI system to improve its performance
Observations
- Ok, seems like it fixed the made-up structure problem.
- Noticing that "iterate reward modeling" is likely referring to the same thing it called "Inverse Reward Design" earlier.
- This ChatGPT vs InstructGPT framing seems wrong to me, and backwards: AFAIK ChatGPT is built on top of InstructGPT, but this summary makes InstructGPT sound like an upgrade of ChatGPT, which it isn't.
Let’s read the transcript, uh oh
I actually just gave up trying to read the full transcript because it was so unreadable. It makes me appreciate the corrections the model did, and makes me wonder if it could have done even better. Here are a few sample lines and a gist link.
...
um I'm also super happy to keep it
interactive if you have questions and
then you find please interrupt me and
um yeah I wanted to start out with
a few very basic observations on kind of
what I think is going on
and
so the first one is
team EI is joining the game
so IMEI has a lot of different players
they own All Join at the same time but
rather they Join one by one
and not all of the their players are
like very a lot in how good they are and
right now a lot of the players that have
joined so far
I'm really that's mine and usually you
can do it on your very narrow setup
...
Nearly all my doubts so far have been about quantitative claims, e.g. the one about the 100x model size. I just don’t trust the transcript with this given how poor it is.
They uploaded a better transcript
from youtube_transcript_api import YouTubeTranscriptApi
transcript_list = YouTubeTranscriptApi.list_transcripts(sample_video_id)
list(transcript_list)[0].is_generated
False
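hand_transcript below is that manually-created transcript. A sketch of how it can be fetched with youtube_transcript_api (my actual fetching code may have differed slightly):

# fetch the human-uploaded (non-generated) English transcript
hand_transcript = transcript_list.find_manually_created_transcript(['en']).fetch()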
Here's a link to the raw content of the good transcript. And let's process it (just the first section) with our final prompt after some cleanup.
def clarify_verbose_caveated(t: str) -> str:
    return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)
text_hand = '\n'.join([o['text'] for o in hand_transcript])
hand_transcript_splitted = split_transcript(
    text_hand,
    oai_yt.TOKEN_LIMIT_FOR_SUMMARY,
    numtokens)
len(hand_transcript_splitted), [numtokens(t) for t in hand_transcript_splitted]
(4, [3995, 4000, 4000, 1086])
text_hand_part1 = hand_transcript_splitted[0]
sample_verbose_hand = clarify_verbose_caveated(text_hand_part1)
print_stats(text_hand_part1, sample_verbose_hand)
Original text: 3995 tokens Summary: 266 tokens Summary is 6.66% of original text Number of lines in summary: 15
- Jan from OpenAI led the alignment team and has a PhD in reinforcement learning theory and was previously at DeepMind
- He has been thinking about the AI alignment problem for over 10 years
- Team AI is forming as AI systems improve, with stronger “players” joining over time
- AI players are becoming more powerful and better than humans in some aspects, such as knowing more facts and speaking more languages
- However, Team Human has the advantage of picking which AI players join and when
- Two main objectives for Team Human:
- Recruit AI players to play on Team Human (alignment)
- Write the rules of the game so that Team Human doesn’t lose
- AI alignment is about building systems that follow human intent and preferences
- Two main categories of intent:
- Explicit intent: following instructions and serving as an assistant
- Implicit intent: not making up things, not doing harmful things, asking follow-up questions, etc.
- Reinforcement learning from human feedback is a key technique for building aligned AI systems
- By fine-tuning AI models, they can be made more aligned and useful on a wide range of tasks
- One issue is that AI models can still hallucinate, even with fine-tuning and reinforcement learning from human feedback
Observations
- Learnt just one new fact from the good transcript, "more languages", and it's not that helpful.
- Overall, amazed that the model got nearly all the useful information out of a terrible transcript and the good one didn't add much.
- Makes me wonder whether they train on "synthetic" data of misspellings explicitly.
- I'm tired enough at this point of reading that I think I know the content, so I may now be marking things as previously seen that I haven't actually seen.
- I think this version is the first explicit mention of "reinforcement learning from human feedback" (RLHF), which is a popular term now.
Reading the full transcript
The raw transcript was still hard to read given the formatting. Because my previous efforts to use the model as a pure language formatter had failed, and I was only concerned with the first 20 minutes of the video, I just chose to watch the video and hope I could make sense of the audio.
Plus, I'm tired and I've read so many summaries that if something new pops up in the audio, I'll definitely notice it.
New things learnt from actually watching part 1
- “Governance” is the term he gives to writing rules so team human doesn’t lose. This just isn’t in the audio/transcript.
- The visual slide has a few more examples of implicit intent, but similar to what’s covered.
- The full InstructGPT diagram is on the slide. Somewhat illustrative but also this is too unlabeled of a diagram to understand what’s really going on. I had to read secondary sources to get this process fully. So not that helpful.
- The InstructGPT performance chart is shown on the slide. Seeing the exact gap is insightful. The "100x smaller instruct model is preferred" point gets proven. He calls it out as wild, which is an important value-add of the video.
- PPO-ptx is called out as a subtly different variant mixing pretraining data into the fine-tuning. For someone who really cares about details, this is an interesting insight.
- An interesting Q&A referencing constitutional AI as a follow up thread.
My takeaway from watching the first 12 minutes is that there are several subtle things that someone who really cares about the topic will find worth watching the video for. However, you aren't really missing any of the very high-level points.
In that way, I really like the hierarchical reading the summaries enable.
Merging sections
for text_hand_part in hand_transcript_splitted:
    sample_verbose_hand = clarify_verbose_caveated(text_hand_part)
    print_stats(text_hand_part, sample_verbose_hand)
Link to the gist. I did skim this very briefly and realized there was a key mistake.
Why is the summary of the first part different this time?
If you look at the part 1 output from the second run, there are insights in there that are true and weren't in the first run. Because of the default temperature setting, the model outputs a meaningfully different summary for the first part, which we've already summarized.
Does lower temperature mean more truthful answers?
Hard to say as an outsider, given we don't really know what effect the RL ends up having on the model. We don't even know if they use the standard formulation of temperature in their implementation. Sampling is unfortunately a total black box.
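For what it's worth, pinning the temperature is a one-line change if ask exposes it. A sketch assuming the same openai client as before; whether temperature=0 actually yields more truthful summaries is exactly the open question above.

import openai

def ask_deterministic(prompt: str) -> str:
    # temperature=0 makes sampling (near-)greedy, so reruns give (near-)identical summaries
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]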
A key factual error
- “Training a reward model from comparisons is the first step, followed by optimizing the reward model with reinforcement learning.”
The reward model is not the one that gets optimized with RL. It just outputs an approximation of what a human would think of a completion; it's the language model (the policy) that is then optimized against the reward model. If you believed this from the summary, you'd have the wrong understanding of this key part of the process and maybe lose a lot of time (more than what you saved).
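To spell out the standard two-step structure (this is the textbook RLHF/InstructGPT formulation, not something taken from the talk): the reward model r_theta is fit to human comparisons with a supervised loss, and it is the policy pi_phi (the language model) that then gets optimized with RL against the frozen reward model:

\mathcal{L}_{\mathrm{RM}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]

\max_{\phi}\ \mathbb{E}_{x,\ y \sim \pi_\phi}\left[r_\theta(x, y)\right] - \beta\, \mathrm{KL}\left(\pi_\phi \,\|\, \pi_{\mathrm{ref}}\right)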
What happens at the edges of the split
The last 2 splits are at an interesting point
hand_transcript_splitted[-1]
"to train the system to be more--\nor to get..."
It's interesting because it breaks right in the middle of him explaining something, and a fairly advanced point at that.
The last line of summary N-1 is
- Addressing safety concerns and potential biases in user input by training AI on diverse and representative sets of human preferences
and the first line of summary N is
- Train inner optimizers to deal with distributional shift and construct outer alignment signals
While sounding intelligent, this line makes no sense, and I have to go back and watch from about 2 minutes before the split point to understand what he's even talking about.
This is a really good example of a failure that can only really be solved by overlapping the chunks around the split point (a sliding window) or with a larger context window; a rough sketch of the overlap idea follows.
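A sketch of what chunking with overlap could look like, reusing numtokens from earlier (the overlap size is a guess; I didn't actually run this for the post):

def split_with_overlap(text: str, token_limit: int, overlap_tokens: int = 500) -> list:
    # Like split_transcript, but each new chunk re-includes the tail of the previous one,
    # so content near a split point is seen in full by at least one summarization call.
    lines = text.split('\n')
    chunks, start = [], 0
    while start < len(lines):
        current, end = lines[start], start + 1
        while end < len(lines) and numtokens(current + '\n' + lines[end]) <= token_limit:
            current += '\n' + lines[end]
            end += 1
        chunks.append(current)
        if end >= len(lines):
            break
        # back up so the next chunk re-covers roughly overlap_tokens of the tail
        back, tail = end, 0
        while back > start + 1 and tail < overlap_tokens:
            back -= 1
            tail += numtokens(lines[back])
        start = back
    return chunks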
Conclusions
These conclusions sound somewhat obvious, but if you really care, I think the specific examples worked through above will give you more insights of your own.
- You can read hierarchically and save a lot of time with LLMs — understand the high level arc and dive into the parts that interest you the most.
- You absolutely still have to read critically, maybe more so than before, though the model is mostly right even in the surprising places.
- Even if you read critically for technical material, subtle false claims like "the reward model is reinforcement learnt" can still leave you confused for much more time than the summary saved you.
- With any summary, there will be a chance of missing out on the most interesting rabbit-holes.
- For me, one turned out to be why PPO-ptx performed better in the 175B model. Even though one of the deeper summaries had it, it took me seeing the graph to get intrigued by that, and then hearing him say "that's kind of wild" to follow it further.
Principles followed
- My main goal is understanding the content as a person. I prefer keeping things subjective and unquantified whenever quantifying would lose meaningful fidelity. I don't try to simplify the goal to make it numerical if it doesn't serve my understanding.
- Depth instead of breadth. I pick one path of analysis and minimize the "what if" questions. A good analysis would require both, but I have limited time, something like 1/100000th of the resources the labs have.
- No fancy prompt engineering. Of course it could be better, but I'm trying to use it as a normal person would, maybe with one or two tricks they learnt about.
- Make note of my own cognitive states (e.g. tiredness) and cognitive biases when aware.
- Trust first, verify later, and write like a linear notebook. If I only realize later that the model was making stuff up, I don’t pretend otherwise and change the content up top. I tried to be honest about that.
- Most of this analysis is done on the bad transcript. They uploaded a good one later, but what motivated using GPT-4 was that the transcript was so bad originally. In production, if you have a good and a bad transcript, you'd analyze the good one. But for me personally, a large part of the value of this exercise was making illegible content legible.
Future work
- This work doesn't touch
- Vector-based retrieval of what to summarize/process
- Cross-checking the model's world knowledge against the retrieved content
- Fancy prompt engineering, self-checking, consensus of outputs methods, etc.
- Branches to take within this work
- Using more than 2k tokens out of the allowed 4k for input.
- What if most of the analysis was done on the good transcript?
- What about Claude 100k, GPT-3, or open models?
- What if I fine-tuned a model just for summarization?
- What else could we try to make it even more verbose?
- Summarization is just a special case of retrieval-augmented question answering. We can branch into other categories of question answering.