LLM Workshop #3 - How Far Can We Get With Prompting Alone?
NLP
LLMS
OpenAI
Anthropic
Meta
FireworksAI
Replicate
Llama
learning
projects
You have to start somewhere, and that somewhere is with one or more of the big dogs in the world of LLMs. Back in the day, that used to mean OpenAI. Today, however, we live in a time that affords us the opportunity to experiment with a number of both closed and open source models from the likes of OpenAI, Anthropic, FireworksAI, Meta, and many others. In this post I’ll demonstrate how we can use several of these vendors to build a pipeline that begins to meet the project objectives defined in the previous post. I’m going to use the results as a vibe check to gauge how realistic my vision is, build intuition around where improvements can be made, and get an idea of whether using one or more of the big dogs is good enough.
Author
Wayde Gilliam
Published
July 18, 2024
In The Beginning …
“you should prototype with a powerful model to see what is possible”
This is our chance to do a “vibe check” and create confidence that our objectives are doable. It’s also a good time to get a sense of which knobs we can turn to potentially improve results, understand likely primary failure modes, and see how far we can push the bigger models without fine tuning. As the general advice from the workshop is to avoid fine tuning until you can prove you need it, this step will also provide the base on top of which we begin building our evaluation pipeline for that very purpose.
Tip: Minimize friction
The best way to learn if something will work is to build it. A common theme from the workshop revolves around “minimizing friction”, and in a nutshell you can read that as being encouraged to “get going and build!” The best way to do this in the generative space is to try out one or more of the high quality models available to us via an API.
What do we want our model to do again?
Before we begin, let’s distill our objectives into a few clear bullet points:
It should work on a single document or a collection of related documents in the domain of higher education surveys.
(See the “Data Refinement” section in the previous post in this series where we’ve curated a set of documents representative of what we expect to see generally)
We should be able to provide it with NLP task-specific tools to perform analysis on those documents
Core tools for translation, summarization, sentiment, NER, and thematic analysis will always be provided and designed to work against English texts
A translation tool is required and needs to be executed first if any documents are not in English
The results are formatted as structured output.
With that, let’s define some tools. In particular, we will define the “core” tools that should always be made available to the LLM.
Tools
If you didn’t know this already, apparently “Pydantic is all you need”. There’s even this video by Jason Liu to prove it.
Tip: Use Pydantic to define your structured outputs
Pydantic is my preferred way to define tools for structured output for a number of reasons. In particular, I like how …
It’s widely supported across most IDEs and libraries (e.g., FastAPI, LangChain, Instructor, OpenAI, etc.).
It makes your intentions and expectations crystal clear, especially for any developer with OO experience (which is everyone).
It has error handling baked in.
It provides all kinds of hooks you can use in the pre/post-processing of your class attributes (see the short sketch after this list).
It enables you to build complex structured outputs composed of nested objects quite intuitively. Even a non-developer can look at your object hierarchy and understand what it is supposed to produce.
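To illustrate the hooks point, here’s a minimal sketch using Pydantic’s `field_validator` (the `ThemeList` model is purely illustrative and not one of the core tools below):

```python
from pydantic import BaseModel, Field, field_validator


class ThemeList(BaseModel):
    """A hypothetical model showing a post-processing hook on an attribute."""

    themes: list[str] = Field(..., description="A list of concise 1 to 3 word themes")

    @field_validator("themes")
    @classmethod
    def strip_and_dedupe(cls, themes: list[str]) -> list[str]:
        # Normalize whitespace and drop duplicate themes while preserving order
        seen: set[str] = set()
        cleaned: list[str] = []
        for theme in themes:
            t = theme.strip()
            if t and t.lower() not in seen:
                seen.add(t.lower())
                cleaned.append(t)
        return cleaned
```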
Let’s use it to build out our core tools that we’ll want to always make available for our users.
Core Tools
Pydantic classes for the core tools
```python
from enum import Enum

from pydantic import BaseModel, Field


# Translation
class TranslationTask(BaseModel):
    """The translation of a document to English and the original language."""

    english_translation: str = Field(..., description="The English translation")
    source_language: str = Field(
        ..., description="The language of the original text (e.g., English, Spanish, French, Chinese, German, etc...)"
    )


# Document summarization
class DocumentSummaryTask(BaseModel):
    """A summary and themes to extract from a document."""

    summary: str = Field(..., description="A concise and short 1 sentence summary of the author's statements.")
    themes: list[str] = Field(
        ..., description="A list of no more than 5 concise 1 to 3 word themes discovered in the text", max_items=5
    )


# NER
class NamedEntityType(str, Enum):
    """A named entity type."""

    PERSON = "PERSON"
    NORP = "NORP"
    FAC = "FAC"
    ORG = "ORG"
    GPE = "GPE"
    LOC = "LOC"
    PRODUCT = "PRODUCT"
    EVENT = "EVENT"
    WORK_OF_ART = "WORK_OF_ART"
    LAW = "LAW"
    LANGUAGE = "LANGUAGE"
    DATE = "DATE"
    TIME = "TIME"
    PERCENT = "PERCENT"
    MONEY = "MONEY"
    QUANTITY = "QUANTITY"
    ORDINAL = "ORDINAL"
    CARDINAL = "CARDINAL"
    OTHER = "OTHER"


class NamedEntity(BaseModel):
    """The type of named entity and its value."""

    entity_type: NamedEntityType
    entity_mention: str = Field(..., description="The named entity recognized.")


class DocumentNERTask(BaseModel):
    """Information about named entities to extract."""

    named_entities: list[NamedEntity] = Field(
        ...,
        description=f"Perform Named Entity Recognition that finds the following entities: {', '.join([x.name for x in NamedEntityType])}",
    )


# Sentiment
class DocumentSentimentTask(BaseModel):
    """Information about the sentiments expressed in a document."""

    positivity: int = Field(
        ...,
        description="How positive or negative is the author on a scale between 1 and 5 (1=Very Low, 2=Moderately Low, 3=Neutral, 4=Moderately Strong, 5=Very Strong)?",
    )
    positive_statements: list[str] = Field(..., description="A list of the author's positive statements")
    negative_statements: list[str] = Field(..., description="A list of the author's negative statements")
    has_suggestions: bool = Field(..., description="Does the author make any suggestions?")
    suggestions: list[str] = Field(..., description="A list of any suggestions the author makes")
    feels_threatened: bool = Field(
        ...,
        description="Does the author feel fearful, harmed, intimidated, harassed, discriminated against, or threatened in any way?",
    )
    feels_threatened_examples: list[str] = Field(
        ...,
        description="A list of how and why the author feels physically/emotionally/mentally threatened, uncomfortable, or harassed",
    )
    profanity: bool = Field(..., description="Is there any profanity?")
    is_nonsense: bool = Field(
        ...,
        description="Is the text uninformative or does it only contain nonsense? Set to `True` if the document is too short to be meaningful or only says something like 'N/A', 'None', 'I have nothing to add', 'No suggestions', or 'No comment'.",
    )


# Topic summarization
class TopicSummaryTask(BaseModel):
    """A summary and action plan to extract from a list of related documents."""

    theme_name: str = Field(..., description="A 5 to 10 word concise summary of the list of documents")
    action_plan: list[str] = Field(
        ...,
        description="A list of 3-5 specific actions that can be taken based on the documents provided",
        max_items=5,
    )
```
“Tools” For “Tool Calling”
We’re going to use Instructor here because this is what Hamel suggests …
For open models you should use Outlines. For closed model APIs you should use Instructor.
Warning
TBH, I’m a noob relative to Instructor use. If anything I say below is wrong and/or can be improved … please, please let me know!
Moving along, as I understand the library, it is designed to work against a single Pydantic class that you pass in, and an instance of that class is returned at the conclusion of your LLM call. BUT, we’ve defined multiple classes as tools, and we only want the LLM to call the tools it deems necessary to fulfill the user’s request. What are we going to do?
My answer, perhaps not surprisingly, is to use another Pydantic class. It looks like this:
```python
class DocumentAnalysis(BaseModel):
    tasks: list[TranslationTask | DocumentSummaryTask | DocumentNERTask | DocumentSentimentTask | TopicSummaryTask] = Field(
        ..., description="The results of each analysis task the user asked to be performed on a given document as context."
    )
```
This makes it easy to support a flexible tool calling system where a user can create their own analysis class with whatever “tasks” they want the LLM to operate with and use it with Instructor. For example, I’ll use the class below when working with collections of related documents, since the available tools should be limited for this use case:
```python
class RelatedDocumentAnalysis(BaseModel):
    tasks: list[TopicSummaryTask | DocumentSentimentTask] = Field(
        ..., description="The results of each analysis task the user asked to be performed on a list of related documents as context."
    )
```
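One nice side effect of the task union is that consuming the results is just a matter of checking which task types came back. A minimal sketch of how that might look (the `handle_results` helper is purely illustrative and not part of the pipeline):

```python
def handle_results(analysis: DocumentAnalysis) -> None:
    # Dispatch on the concrete task type the LLM chose to return
    for task in analysis.tasks:
        if isinstance(task, TranslationTask):
            print(f"Translated from {task.source_language}: {task.english_translation[:80]}...")
        elif isinstance(task, DocumentSummaryTask):
            print(f"Summary: {task.summary} | Themes: {task.themes}")
        elif isinstance(task, DocumentNERTask):
            print(f"Entities: {[(e.entity_type.value, e.entity_mention) for e in task.named_entities]}")
        elif isinstance(task, DocumentSentimentTask):
            print(f"Positivity: {task.positivity} | Has suggestions: {task.has_suggestions}")
        elif isinstance(task, TopicSummaryTask):
            print(f"Theme: {task.theme_name} | Action plan: {task.action_plan}")
```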
With our Pydantic army above, we can now move to experimenting with the big dogs to see what works and what doesn’t.
Tip: Note
Because of the proprietary nature of the survey comments, I can’t show you the actual results. What I can show you is the code and my observations. With that in hand, you should have everything you need to get going with your own use cases. If anything isn’t clear, drop a comment below or hit me up on X.
Single Document Analysis
Data
Let’s sample 5 rows from our sampled chunked dataset created in the previous post. We’ll use the full survey comments for exploration here and ensure that 2 of those samples are in Spanish.
Tip: Test with representative examples
Think about what kinds of examples are likely to be seen at inference time and test a few out here. Our goal is to get a sense of how well our model will generalize over the asks it’s likely to see in the future.
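For reference, here’s a rough sketch of what that sampling might look like; the file path and the `language` column are assumptions on my part, so adjust to however your dataset from the previous post is organized:

```python
import pandas as pd

# Assumes the chunked dataset has an "AnswerText" column (used below) and a
# hypothetical "language" column identifying each comment's language.
df = pd.read_parquet("data/chunked_survey_comments.parquet")  # hypothetical path

english_sample = df[df["language"] == "English"].sample(3, random_state=42)
spanish_sample = df[df["language"] == "Spanish"].sample(2, random_state=42)

test_df = pd.concat([english_sample, spanish_sample]).reset_index(drop=True)
```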
We’ll define the following function to make it easy for us to test different APIs with Instructor. I designed this function for use with OpenAI, Anthropic, and Fireworks specifically, though it will likely work with any supported vendor with little or no modification.
```python
def ask_ai(
    client,
    content: str,
    query: str | None = None,
    model: str = "gpt-4o",
    instructor_kwargs: dict = {},
) -> DocumentAnalysis:
    if not query:
        query = "Tasks: translation (if the document is not in English), summarization, ner, sentiment analysis."

    try:
        return client.chat.completions.create(
            model=model,
            response_model=DocumentAnalysis,
            max_retries=3,
            messages=[
                {
                    "role": "system",
                    "content": """Execute each analysis task.
Always translate any non-English documents into English before executing other tasks.""",
                },
                {
                    "role": "user",
                    "content": f"{query}. Document: {content}",
                },
            ],
            **instructor_kwargs,
        )
    except Exception as e:
        print(e)
```
Experiments
OpenAI
```python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

for r_idx, r in test_df.iterrows():
    print("==========")
    print(":: Document ::")
    print(r["AnswerText"])

    results = ask_ai(client, r["AnswerText"])

    print(":: Results ::")
    print(results.model_dump_json(indent=2))
    print("==========")
```
Observations
Performed the translation step where needed correctly and used the English translation in the rest of the tools.
Generally did an outstanding job of calling the tools and providing results that were as good, and sometimes better, than what I would have done.
Occasionally ran into validation errors because it couldn’t set various properties correctly. This usually happened with the summarization or sentiment tools, where I’d get an error like this: `tasks.0.DocumentSummaryTask.summary Input should be a valid string [type=string_type, input_value={'summary': 'The author c... 'inadequate planning']}, input_type=dict]`
Anthropic
```python
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

for r_idx, r in test_df.iterrows():
    print("==========")
    print(":: Document ::")
    print(r["AnswerText"])

    results = ask_ai(
        client,
        r["AnswerText"],
        model="claude-3-5-sonnet-20240620",
        instructor_kwargs={"max_tokens": 1024},
    )

    print(":: Results ::")
    print(results.model_dump_json(indent=2))
    print("==========")
```
Observations
Performed the translation step where needed correctly and used the English translation in the rest of the tools.
Generally did an outstanding job of calling the tools and providing results that were as good, and sometimes better, than what I would have done.
Occasionally, it would struggle with the NER task for some reason. I talk about this more in my Structuring Enums for Flawless LLM Results with Instructor post. I can’t say if this is an Instructor issue and/or something particular to the model itself.
Fireworks
```python
import os

client = instructor.patch(
    OpenAI(
        base_url="https://api.fireworks.ai/inference/v1",
        api_key=os.environ["FIREWORKS_API_KEY"],
    ),
)

for r_idx, r in test_df.iterrows():
    print("==========")
    print(":: Document ::")
    print(r["AnswerText"])

    results = ask_ai(client, r["AnswerText"], model="accounts/fireworks/models/firefunction-v2")

    print(":: Results ::")
    print(results.model_dump_json(indent=2))
    print("==========")
```
Observations
Did not reliably perform the translation step where required
Did not reliably call all the tools
Where called, the results of each tool call were not as accurate as those from either the OpenAI or Anthropic models
Encountered validation errors more frequently.
It is FAST! Like really fast to run.
The speed difference is pretty noticeable, but I wonder if it comes at the cost of being able to use Fireworks for any complex structured output?
Related Document Analysis
Data
Let’s sample 5 rows (topics) from our topics dataset created in the previous post.
```python
import json


def format_tools():
    doc_analysis_schema = DocumentAnalysis.model_json_schema()

    result = ""
    for k, v in doc_analysis_schema["$defs"].items():
        result += f"<function name='{k}'>\n"
        result += json.dumps(v, indent=2)
        result += f"\n</function name='{k}'>\n"

    return result.strip()
```
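It’s worth printing the result of `format_tools()` once to sanity check what the model will actually see; each tool comes out as a `<function name='...'>` block wrapping its JSON schema:

```python
# Peek at the beginning of the formatted tool definitions
print(format_tools()[:500])
```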
```python
SYSTEM_PROMPT = """\
{system_persona}

You are given a set of tasks to perform and a document inside <query> tags and a set of possible functions inside <function-definitions> tags.

Calling these functions are optional. Carefully consider the question and determine if one or more functions can be used to answer the question. Place your thoughts and reasoning behind your decision in <function-thoughts> tags.

Below is a list of function definitions:
<function-definitions>
{tools}
</function-definitions>

For each function you want to call, execute the function and use the answer to provide values to each function parameter in a way that conforms to that function's schema. Include the function name and parameter values inside the <function-calls> tag.

Function calls MUST be in this format: <function-thoughts>Calling func1 would be helpful because of ...</function-thoughts><function-calls>[func1_name(params_name=params_value, params_name2=params_value2...), func2_name(params)]</function-calls>, WITHOUT any answer.

If the query is not in English, always translate it into English first and then proceed to call any other functions using the English translation.

If you do not wish to call any functions, say so in the <function-thoughts> tags followed by <function-calls>None</function-calls>
"""

system_persona = """\
You are an expert NLP data scientist, skilled in machine translation, text summarization, NER, thematic analysis, strategic planning, sentiment analysis, and classification tasks.
"""

# print(SYSTEM_PROMPT)
```
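Note that the loop below references a `system_message` variable that isn’t defined in the snippets above. Presumably it’s just the template filled in with the persona and the formatted tool definitions; a minimal sketch of that missing step:

```python
# Fill in the system prompt template with the persona and the tool definitions
system_message = SYSTEM_PROMPT.format(system_persona=system_persona, tools=format_tools())
```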
```python
import replicate

for r_idx, r in test_df.iterrows():
    print("==========")
    print(f"Document:\n{r['AnswerText']}\n")

    prompt = f"Translate the document below into English if necessary. After that, perform the following tasks on the document below: summarize, perform named entity recognition, and sentiment analysis. Document:\n{r['AnswerText']}"

    input = {
        "max_tokens": 1024,
        "temperature": 0,
        "top_p": 0.9,
        "top_k": 50,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "system_prompt": system_message,
        "prompt": f"<query>{prompt}</query>",
    }

    print("--- Example ---")
    print(f"prompt: {prompt}")
    print("")

    print("===== Llama3-70B-Instruct =====")
    output = replicate.run("meta/meta-llama-3-70b-instruct", input=input)
    print("".join(output))
    print("")

    print("===== Llama3-8B-Instruct =====")
    output = replicate.run("meta/meta-llama-3-8b-instruct", input=input)
    print("".join(output))
    print("")
```
Observations
Generally the 70B model did an outstanding job of calling the tools and providing results that were as good, and sometimes better, than what I would have done.
The 8B model would return decent results most of the time, but the formatting was all over the place.
This gave me a lot of confidence that open source models are a contender worth considering. Even though the 8B model didn’t produce reliable results, the fact that the correct information was returned, along with how well the 70B model performed, definitely makes me think it might do really well with fine tuning.
Takeaways
After all this, I’m confident that my ideas are worth pursuing. Based on the results above, it seems worthwhile to explore turning the following knobs to boost performance with these models:
Improved descriptions in the Pydantic classes
Improve the system prompt
Formulate varied human messages to simulate different ways the user may prompt the model (a rough sketch below)
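On that last point, `ask_ai` already accepts a `query` argument, so simulating different user phrasings is mostly a matter of cycling through variants of the same ask. A rough sketch (these particular wordings are just illustrative):

```python
# A few different ways a user might phrase the same request
query_variants = [
    "Tasks: translation (if the document is not in English), summarization, ner, sentiment analysis.",
    "Please summarize this comment, pull out any named entities, and tell me how the author feels.",
    "Translate to English if needed, then give me a summary, the key entities, and the overall sentiment.",
]

for query in query_variants:
    results = ask_ai(client, test_df.iloc[0]["AnswerText"], query=query)
    print(results.model_dump_json(indent=2))
```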
Can you think of any other adjustments that might be beneficial? If so, drop a comment below!
Next Steps
With some insights about what I can play with to improve my results, it’s time for some experimentation. But how can I know whether or not the changes I’m making represent meaningful progress? How can I know specifically where the model is struggling and look at those examples to inform future experiments?
The answer: I need to set up an initial evaluation pipeline along with some scoring functions.
And that is exactly what we’ll do in the next post in this series … stay tuned.