
Questions for large LLM labs they'll probably not want to answer

May 30, 2023 at 03:22 PM

In [[GPT4 Summary Evaluation]] I worked through finding a high-quality prompt for summarizing a YouTube video with GPT-4. Keeping with the theme of subjective LLM evaluation (going deep on one example judged subjectively, rather than going broad with quantifiable metrics), this post plays around with Claude, specifically Claude-V1.

The video being summarized is a CS25 lecture by Jan Leike on AI Alignment. The content alternates between general and technical material. The transcript quality is very poor (it is YouTube's auto-generated transcript).

Most memorable insights (BUT CAVEAT)

I don’t think one should generalize from one example and limited iteration, but I think the notes contain interesting observations. I would just treat these as threads to follow up on and verify, not mental models to form. Given limited time, I am explicitly trading off rigor for depth with one example, and this is written like a notebook log, not a refined blog post.


Transcript and tokenization

print(sample_transcript_raw_text)
it's my pleasure to welcome Jan from
openai I'm he leads the alignment team
there and he was previously a researcher
at deepmind as well what's a PhD in
reinforcement learning theory has been
thinking about the alignment problem for
over 10 years and today he'll be giving
a very interesting topic so hope you
guys enjoy
yeah thanks a lot for the intro and
thanks a lot for having me
um
I'm very excited to talk about this
stuff
numtokens_oai(sample_transcript_raw_text), numtokens_anthropic(sample_transcript_raw_text)
(12146, 12176)
anthropic_tokenzier.encode(sample_transcript_raw_text[:100]).tokens
['it', "'s", 'Ġmy', 'Ġpleasure', 'Ġto', 'Ġwelcome', 'ĠJan', 'Ġfrom', 'Ċ', 'open', 'ai', 'ĠI', "'m", 'Ġhe', 'Ġleads', 'Ġthe', 'Ġalignment', 'Ġteam', 'Ċ', 'there', 'Ġand', 'Ġhe', 'Ġwas', 'Ġprev', 'iou']
oai_tokenizer = oai.tiktoken.encoding_for_model("gpt-4")
[oai_tokenizer.decode([t]) for t in oai_tokenizer.encode(sample_transcript_raw_text[:100])]
['it', "'s", ' my', ' pleasure', ' to', ' welcome', ' Jan', ' from', '\n', 'open', 'ai', ' I', "'m", ' he', ' leads', ' the', ' alignment', ' team', '\n', 'there', ' and', ' he', ' was', ' prev', 'iou']

Splitting (only needed for OpenAI)

# Claude 100k won't need splitting, so just use the OpenAI tokenizer as the reference to keep the transcript splits consistent across models

split_up_transcript = split_transcript(
	sample_transcript_raw_text,
	oai_yt.TOKEN_LIMIT_FOR_SUMMARY,
	numtokens_oai)

print(len(split_up_transcript), [numtokens_oai(t) for t in split_up_transcript])
print(len(split_up_transcript), [numtokens_anthropic(t) for t in split_up_transcript])

4 [3997, 4000, 3973, 173]
4 [4009, 4012, 3978, 174]
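
split_transcript itself isn't shown; here is a rough sketch under the assumption that it greedily packs whitespace-delimited words into chunks up to the token budget (the real implementation may differ):

# Hypothetical sketch of split_transcript. Greedily pack words until the token
# budget would be exceeded. Not optimized: it re-tokenizes the growing chunk on
# every word, which is fine for a ~12k-token transcript.
def split_transcript(text: str, token_limit: int, numtokens) -> list[str]:
	chunks, current = [], []
	for word in text.split(" "):
		candidate = " ".join(current + [word])
		if current and numtokens(candidate) > token_limit:
			chunks.append(" ".join(current))
			current = [word]
		else:
			current.append(word)
	if current:
		chunks.append(" ".join(current))
	return chunks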

Observations

GPT-4 summary as a reminder

We already converged on a preferred summarizer in [[GPT4 Summary Evaluation]], so we'll continue working with that. Anthropic exposes a document-completion-style API rather than a chat-style one, so we may have to keep iterating on the prompt.

def clarify_verbose_caveated(t: str) -> str:
	return ask("repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video." + t)

Result

- Speaker: Jan from OpenAI, leads alignment team, previously at DeepMind, PhD in reinforcement learning theory
- Topic: Alignment problem
- Observations:
  - Team EI joining the game, with stronger and stronger players over time
  - Team AI has many advantages over Team Humans (speed, cost, etc.)
  - Team Humans has an advantage in choosing which players join and when
- Main objectives for humans:
  1. Recruit players from Pmei to play on Team Humans (alignment)
  2. Write the rules of the game so that Team Human doesn't lose
- Alignment: build AI systems that follow human intent
  - Two main categories of intent: faithful, other intents (e.g., don't make up stuff, ask follow-up questions)
- Technique for alignment: Iterate reward modeling
  1. Train reward model from human preferences
  2. Optimize for the reward model with RL
- How well does it work?
  - InstructGPT: even the smallest version is preferred over the largest GPT-3 model
  - Training costs: fine-tuning on the largest GPT-3 version is about 2% of pre-training cost
- ChatGPT vs InstructGPT
  - ChatGPT: dialogue as universal interface, better at refusing harmful tasks, but still has limitations (hallucination, sensitivity to prompting)
  - InstructGPT: Upgrade to make models more aligned and useful to humans, but still room for improvement
- Principle: evaluation is easier than generation
  - Easier to spot good behavior than doing the task yourself
  - Useful for providing comparative feedback to the AI system to improve its performance

Notes

Claude V1 (4k tokens only for dev)

Notes

def summarize_anthropic(input: str, max_tokens_to_sample=5000) -> dict:
	return client.completion(
		prompt=f"""{anthropic.HUMAN_PROMPT}
Repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video
---
{input}
---
	{anthropic.AI_PROMPT}""",
	stop_sequences=[anthropic.HUMAN_PROMPT],
	model="claude-v1",
	max_tokens_to_sample=max_tokens_to_sample,
	)
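
client isn't constructed anywhere in these snippets; with the anthropic SDK version used here (the one exposing client.completion plus HUMAN_PROMPT / AI_PROMPT), the setup was presumably something along these lines:

import os
import anthropic

# Assumed setup for the 0.2.x-era anthropic SDK; the real key handling isn't shown.
client = anthropic.Client(os.environ["ANTHROPIC_API_KEY"])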

Result (GPT4 for reference)

result = summarize_anthropic(split_up_transcript[0])
result['completion']
• It's my pleasure to welcome Jan from openai. He leads the alignment team there and he was previously a researcher at DeepMind as well. Has a PhD in reinforcement learning theory and has been thinking about the alignment problem for over 10 years.
•Today he'll be giving a very interesting topic so hope you guys enjoy. Yeah thanks a lot for the intro and thanks a lot for having me. I'm very excited to talk about this stuff. I'm also super happy to keep it interactive if you have questions and then you find please interrupt me.
•I wanted to start out with a few very basic observations on kind of what I think is going on.So the first one is team EI is joining the game. So IMEI has a lot of different players they own All Join at the same time but rather they Join one by one.Not all of the their players are like very a lot in how good they are and right now a lot of the players that have joined so far.They're really that's mine and usually you can do it on your very narrow setup.
•But one thing that we've kind of observed is that over time you know we're seeing stronger and stronger players drawing and this is kind of where we where we are now.And then in general we expect that hemiah has um incredibly strong pairs. Those will be players that are able to think so much better than humans so much faster and so much more cheaply. And these haven't joined yet.
•So the have like Anchor Point that we have if you think for example about chat Epp chat TBT can already be any human at like knowing more facts or speaking more languages and it can write about 50 words per second and can do so about 100 times cheaper than humans could at minimum wage. And so you know there's tomorrow also has some really important limitations and there's a lot of things that you can't do yet.
•But it is kind of an indicator of you know some of the players that maybe will be done in the future. So it seems like in the long run kimai will have all the advantages over team humans. But and there's one there's an important caveat which is there's one important advantage that team humans has which is team human gets to pick which players from Kimi joined and when. And so this is kind of like an advantage that should we should really be leaning into when we're thinking about what to do and when we're thinking about you know this game that we're playing with cmei and then we'll be playing with pmei.
•So I think two of the main objectives of what we as team human should do it's like first we should try to recruit players from pmei to play on 10 teams and so this is kind of what I would broadly call on it. And this is kind of like the problem that I'm working on.And then there's also other objectives so another objective that I think is going to be really important is you want to write the rules of the game so that team human doesn't lose. And right now team human kind of has the ball and we get to write the rules so we should write rules that you know make sense and uh still playing this game in the future.
numtokens_anthropic(result['completion'])
676
result['stop_reason']
'stop_sequence'

Observations

Summarization V2: max_tokens_to_sample=8000

Changes

def summarize_anthropic_v2(input: str, max_tokens_to_sample=8000) -> dict:
	return client.completion(
		prompt=f"""{anthropic.HUMAN_PROMPT}
Repeat the following from a youtube audio transcript as a bullet point list. do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video
---
{input}
---
{anthropic.AI_PROMPT}""",
		stop_sequences=[anthropic.HUMAN_PROMPT],
		model="claude-v1",
		max_tokens_to_sample=max_tokens_to_sample,
		)
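
The invocation isn't shown for V2; presumably it was the same first chunk as before, something like:

# Assumed call (not shown in the post): same first ~4k-token chunk as V1.
result_v2 = summarize_anthropic_v2(split_up_transcript[0])
print(result_v2['completion'])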

System Observations

Result (GPT4 for reference)

Here is the response as a verbose bullet point list:

• The speaker begins by welcoming Jan from OpenAI who leads the alignment team there and previously worked at DeepMind. Jan has a PhD in reinforcement learning theory and has been thinking about the alignment problem for over 10 years.

•Jan expresses excitement to talk about this topic and is happy to keep the discussion interactive with questions.

•Jan wanted to start with some basic observations about what is going on. The first observation is that AI is joining the game gradually with players joining one by one. The players have varying skill levels but over time the players get stronger. AI will eventually have incredibly strong players that can think much better than humans.

•An example is ChatGPT which can beat any human at knowing more facts or speaking more languages. It can write 50 words per second 100 times cheaper than a human. ChatGPT has important limitations but shows what future AI may be capable of.

•AI will eventually have all the advantages but humans get to decide which AI systems join and when. Humans should recruit AI systems to play on the human team. Jan calls this alignment. Humans also get to set the rules of the game so they don't lose. Humans currently have the opportunity to set good rules.

•Jan wants to talk about setting good rules. He phrases alignment as building AI systems that follow human intent and preferences. He will discuss past work trying to make current models as aligned as possible and future work that needs to be done.

•Jan breaks down what it means for AI to follow human intent. AI should follow direct instructions as an assistant but also follow implied intent like not doing harmful things, asking follow up questions when unsure, and not making things up. Capturing implied intent is difficult to specify precisely but important.

•The main technique used today is preference learning through model comparison. Humans compare model responses and the models learn to predict which responses humans would prefer. The models then optimize to generate the preferred responses. Different people may have different preferences and the models average across preferences. This is a simple but general technique that can apply to many AI models.

•Jan shows a graph demonstrating how well models of different sizes scored on human preferences. Surprisingly, a model 100 times smaller was preferred. Optimization and mixing in pre-training data mitigated regressions. The small, optimized model outperformed the much larger model.

•The training costs are also surprisingly cheap relative to pre-training costs. The most expensive optimized model was less than 2% of the pre-training compute. Larger models could still use this technique but there are still important limitations to address with more compute and data.

•Jan considers ChatGPT an upgrade to instructions GPT for alignment and usefulness. It is better at using dialogue, asking follow up questions, avoiding harmful tasks but still hallucinates facts and is sensitive to prompting showing misalignment. Evaluation is easier than generation meaning evaluating model responses is easier for humans than generating responses from scratch. This allows models to do better than individuals could alone.

Content Observations

Summarization V3: a Claude-specific prompt (but still 4k tokens)

Changes

def summarize_anthropic_v3(input: str, max_tokens_to_sample=8000) -> dict:
	return client.completion(prompt=f"""{anthropic.HUMAN_PROMPT}
Summarize the following from a youtube audio transcript as a bullet point list.
Use strict markdown. Use nested bullet points if appropriate.
Correct spelling and grammar errors.
The transcript could be the full transcript or just a section of the video.
---
{input}
---
{anthropic.AI_PROMPT}""",
	stop_sequences=[anthropic.HUMAN_PROMPT],
	model="claude-v1",
	max_tokens_to_sample=max_tokens_to_sample,
)
result_v3 = summarize_anthropic_v3(split_up_transcript[0])  # assumed call (not shown in the post); same first chunk as before
result_v3['stop_reason']
'stop_sequence'

Result (GPT4 for reference)

Here is a summary of the transcript in bullet points:

• OpenAI is developing increasingly capable AI systems
• The AI systems joining "the game" (interacting with the world) are getting stronger over time
• OpenAI expects to develop "incredibly strong players" (AI systems) that will far surpass human capabilities

•OpenAI's objectives:
› Recruit AI systems to play on "team human" (ensure AI is aligned with human values) through a process called "Constitutional AI"
› Develop rules for how AI interacts with the world that ensures "team human" does not lose

•Alignment means building AI systems that:
› Follow instructions and human preferences
› Infer human intent beyond direct instructions

• Current technique for alignment:
›Gather human preferences through comparisons of model outputs
›Train a reward model on these preferences
›Optimize the model with reinforcement learning using the reward model

•Surprisingly, fine-tuning a smaller model (InstructGPT) to be aligned results in higher human preference than a much larger model (GPT-3)

• Fine-tuning models to be aligned is very computationally cheap relative to pre-training the models

• OpenAI's chatbot CLIP is an improvement on InstructGPT but still has significant limitations, like hallucinating facts and being sensitive to how it's prompted

• Evaluation (comparing model outputs) is easier for humans than generation (creating responses from scratch)
• This allows the model to perform better than humans could on their own
print(numtokens_anthropic(result_v3['completion']))
339

Observations

GPT 3.5 Turbo comparison

Maybe it’s unfair to compare Claude to GPT-4 since Claude is more of a 3.5-class model, so I added a quick comparison to GPT-3.5 as well.

ask = oai.create_simple_completer(model="gpt-3.5-turbo")

def clarify_verbose_caveated(t: str) -> str:
	return ask("""repeat the following from a youtube audio transcript as a bullet point list.
do not remove information. be very verbose. the transcript could be the full transcript or just a section of the video.
--- TRANSCRIPT ---\n\n""" + t)
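
Again the call isn't shown; presumably the 3.5 completer was pointed at the same first chunk:

# Assumed call (not shown): same first chunk, now through gpt-3.5-turbo.
result_35 = clarify_verbose_caveated(split_up_transcript[0])
print(result_35)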

Result

- Jan from OpenAI leads the alignment team and has a PhD in reinforcement learning theory - He previously worked at DeepMind
- Jan will be discussing the alignment problem
- OpenAI is working to align AI systems with human intent
-

Observations

Summarization V4: Trying for accuracy, last attempt with the smaller (4k) sample

Changes

def summarize_anthropic_v4(input: str, max_tokens_to_sample=8000) -> dict:
	return client.completion(prompt=f"""{anthropic.HUMAN_PROMPT}
Summarize the following from a youtube audio transcript as a bullet point list.
Use strict markdown. Use nested bullet points if appropriate.
Correct spelling and grammar errors. If you're not sure about a point, leave it out instead of making false claims. Be verbose and include everything you do understand.
The transcript could be the full transcript or just a section of the video.
---
{input}
---
{anthropic.AI_PROMPT}""",
	stop_sequences=[anthropic.HUMAN_PROMPT],
	model="claude-v1",
	max_tokens_to_sample=max_tokens_to_sample,
)
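
The call that produced the result below isn't shown; presumably it was the same first 4k chunk again, along the lines of:

# Assumed invocation (not shown); the post only reports the output and its token count.
result_v4 = summarize_anthropic_v4(split_up_transcript[0])
print(result_v4['completion'])
print(numtokens_anthropic(result_v4['completion']), "tokens")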

Result

 Here's a summary of the transcript in bullet point form:

• Jan from OpenAI leads the alignment team.

•He has a PhD in reinforcement learning theory and has worked on the alignment problem for over 10 years.

• AI is getting progressively stronger over time. Eventually, there will be AI far beyond human capabilities.

• The objectives for humans should be:

	› Recruit AI to play on team human (alignment)

	› Set rules so team human doesn't lose

• Alignment means building AI that follows human  intent and preferences. This includes:

	› Instructions: Following specific commands

	› Implicit preferences: Avoiding harm, asking clarifying questions, etc.

•The main technique for alignment is:

	› Train a  reward model on human preferences (comparisons between model outputs)

	› Then optimize the model with reinforcement learning and the reward model

• An experiment showed an  "instructed GPT" model 100x smaller than GPT-3 but preferred by humans. This shows the power of alignment techniques.

• Aligning models is surprisingly cheap compared to pre-training them. The most expensive instructed GPT was <2% of the cost of pre-training GPT-3.

• There are still important limitations and more work is needed.

• ChatGPT is an improvement over InstructGPT. It's better at:

	› Using dialogue

	› Avoiding harmful responses

•But ChatGPT still has issues like hallucinating facts and being sensitive to prompts.

• A key principle: Evaluation is easier than generation. Having humans compare model outputs allows spotting good and bad behavior even for very difficult tasks.

• This principle applies broadly, e.g. in distinguishing GPT-3 and CLIP. Generation is harder than evaluation.
398 tokens

Observations

Claude 100k — all at once summary

Because more tokens are expensive, I’m going to mimic a real-world setting and do this one-shot, using my favorite prompt so far, the one developed in V4.

def summarize_anthropic_v5(input: str, max_tokens_to_sample=90000) -> dict:
	return client.completion(prompt=f"""{anthropic.HUMAN_PROMPT}
Summarize the following from a youtube audio transcript as a bullet point list.
Use strict markdown. Use nested bullet points if appropriate.
Correct spelling and grammar errors. If you're not sure about a point, leave it out instead of making false claims. Be verbose and include everything you do understand.
The transcript could be the full transcript or just a section of the video.
---
{input}
---
{anthropic.AI_PROMPT}""",
	stop_sequences=[anthropic.HUMAN_PROMPT],
	model="claude-v1-100k",
	max_tokens_to_sample=max_tokens_to_sample,
)
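
The one-shot call itself isn't shown; since this run goes over the whole transcript at once, it was presumably something like:

# Assumed one-shot call over the full ~12k-token transcript with claude-v1-100k.
result_v5 = summarize_anthropic_v5(sample_transcript_raw_text)
print(result_v5['completion'])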

Observations

Result

* AI progress is accelerating as increasingly capable AI systems are developed
* Team human currently has the advantage of determining which AI systems are deployed and when
* A key objective should be aligning AI systems with human values and preferences to ensure human control
* Techniques like reward modeling and Constitutional AI show promise for AI alignment using feedback from humans
* Evaluating AI systems may be easier for humans than generating them from scratch, allowing human-AI collaboration to evaluate more advanced systems
* AI assistance could be leveraged to augment human capability to evaluate more advanced AI systems through critique generation, explanation, fact-checking, etc.
* Targeted perturbations could be used to evaluate the effectiveness of AI assistance by inserting subtle flaws for the AI system to detect.
* The gap between human capability to evaluate systems and AI capability to generate them is a concern, as AI could become too advanced for humans to reliably evaluate without AI assistance
* Scalable oversight may be needed to evaluate very advanced AI systems, but ensuring the effectiveness of scalable oversight itself is challenging without ground truth
* Interpretability may be useful for detecting deception or flaws in AI systems but likely insufficient and possibly unnecessary for full alignment
* The key may be aligning the actual decisions and behaviors of AI systems with human values, not just their internal reasoning processes

Does the all-at-once summary help at the chunk boundaries?

Split 1

print(split_up_transcript[0][-400:])
print("---")
print(split_up_transcript[1][-400:])

're giving this kind of like feedback now let's you know the system should do better than you actually could and I think that's a very general principle that holds in lots of domains so kind of like you're probably most familiar if you start yes you know that pivas MP and everyone you know we don't actually know whether they're different but in practice it seems like and pitas are just much harder
---
gain I think what gets harder is that you know like as you've deployed the model and people started using it for all kinds of you know tasks that they want to build their company around like they um if you're update and you change the model then they also have to do a bunch of work into like adapting their prompts to whatever they have to do what they're doing and so it doesn't come as a zero cost

Split 2

print(split_up_transcript[1][-400:])
print("---")
print(split_up_transcript[2][-400:])

gain I think what gets harder is that you know like as you've deployed the model and people started using it for all kinds of you know tasks that they want to build their company around like they um if you're update and you change the model then they also have to do a bunch of work into like adapting their prompts to whatever they have to do what they're doing and so it doesn't come as a zero cost
---
en't you just selecting for models that have misalignments they're really hard to find with you and show you the body tools sure sure does not be necessary yes so why would it not be necessary so again this is kind of like an open question but basically what stance you could take is that at the end of the day what really is going to matter is the decisions that the model actually takes and not the

Nested Summary with GPT-4 (for comparison)

final_oai = ask(f"""
Summarize the following bullet points into a high level summary. Use bullet points. Be very verbose:
---

{oai0}

{oai1}

{oai2}

""")

Result

 - Jan from OpenAI discusses AI alignment and building systems that follow human intent
- AI skill levels increasing, with humans "recruiting" AI players for various tasks
- Methods: inferring reward models, optimizing with Proximal Policy Optimization, and using Instruct-GPT
- Limitation: human evaluation level stagnates as AI progresses
- Utilizing AI for code, critiques, and evaluation assistance
- Challenges: avoiding bias and manipulation, handling uncertainty, balancing preference updates
- GPT-3: immense knowledge potential, learns effectively with less data through fine-tuning
- Addressing style preferences, risks of over-reliance on AI in education, and safety concerns
- Importance of developing AI interpretability, outer alignment signals, and leveraging AI assistance for evaluation

Observations