LLM Workshop #3 - How Far Can We Get With Prompting Alone?
NLP
LLMS
OpenAI
Anthropic
Meta
FireworksAI
Replicate
Llama
learning
projects
You have to start somewhere, and that somewhere is with one or more of the big dogs in the world of LLMs. Back in the day, that used to mean OpenAI. Today, however, we live in a time that affords us the opportunity to experiment with a number of both closed and open source models from the likes of OpenAI, Anthropic, FireworksAI, Meta, and many others. In this post I’ll demonstrate how we can use several of these vendors to build a pipeline that begins to meet the project objectives defined in the previous post. I’m going to use the results as a vibe check to gauge how realistic my vision is, build intuition around where improvements can be made, and get an idea of whether using one or more of the big dogs is good enough.
Author
Wayde Gilliam
Published
July 18, 2024
In The Beginning …
“you should prototype with a powerful model to see what is possible”
This is our chance to do a “vibe check” and create confidence that our objectives are doable. It’s also a good time to get a sense of which knobs we can turn to potentially improve results, understand likely primary failure modes, and see how far we can push the bigger models without fine tuning. As the general advice from the workshop is to avoid fine tuning until you can prove you need it, this step will also provide the base on top of which we begin building our evaluation pipeline for that very purpose.
Tip: Minimize friction
The best way to learn if something will work is to build it. A common theme from the workshop revolves around “minimizing friction”, and in a nutshell you can read that as being encouraged to “get going and build!” The best way to do this in the generative space is to try out one or more of the high quality models available to us via an API.
What do we want our model to do again?
Before we begin, let’s distill our objectives into a few clear bullet points:
It should work on a single document or a collection of related documents in the domain of higher education surveys.
(See the “Data Refinement” section in the previous post in this series where we’ve curated a set of documents representative of what we expect to see generally)
We should be able to provide it with NLP task-specific tools to perform analysis on those documents
Core tools for translation, summarization, sentiment, NER, and thematic analysis will always be provided and designed to work against English texts
A translation tool is required and needs to be executed first if any documents are not in English
The results are formatted as structured output.
With that, let’s define some tools. In particular, we will define the “core” tools that should always be made available to the LLM.
Tools
If you didn’t know this already, apparently “Pydantic is all you need”. There’s even this video by Jason Liu to prove it.
Tip: Use Pydantic to define your structured outputs
Pydantic is my preferred way to define tools for structured output for a number of reasons. In particular, I like how …
It’s widely supported across most IDEs and libraries (e.g., FastAPI, LangChain, Instructor, OpenAI, etc.).
It makes your intentions and expectations crystal clear, especially for any developer with OO experience (which is everyone).
It has error handling baked in.
It provides all kinds of hooks you can use in the pre/post-processing of your class attributes (see the short sketch after this list).
It enables you to build complex structured outputs composed of nested objects quite intuitively. Even a non-developer can look at your object hierarchy and understand what it is supposed to produce.
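To illustrate the hooks point, here’s a minimal sketch using Pydantic’s `field_validator` (the `ThemeList` model is purely illustrative and not one of the core tools below):

```python
from pydantic import BaseModel, Field, field_validator


class ThemeList(BaseModel):
    """A hypothetical model showing a post-processing hook on an attribute."""

    themes: list[str] = Field(..., description="A list of concise 1 to 3 word themes")

    @field_validator("themes")
    @classmethod
    def strip_and_dedupe(cls, themes: list[str]) -> list[str]:
        # Normalize whitespace and drop duplicate themes while preserving order
        seen: set[str] = set()
        cleaned: list[str] = []
        for theme in themes:
            t = theme.strip()
            if t and t.lower() not in seen:
                seen.add(t.lower())
                cleaned.append(t)
        return cleaned
```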
Let’s use it to build out our core tools that we’ll want to always make available for our users.
Core Tools
Pydantic classes for the core tools
```python
from enum import Enum

from pydantic import BaseModel, Field


# Translation
class TranslationTask(BaseModel):
    """The translation of a document to English and the original language."""

    english_translation: str = Field(..., description="The English translation")
    source_language: str = Field(
        ..., description="The language of the original text (e.g., English, Spanish, French, Chinese, German, etc...)"
    )


# Document summarization
class DocumentSummaryTask(BaseModel):
    """A summary and themes to extract from a document."""

    summary: str = Field(..., description="A concise and short 1 sentence summary of the author's statements.")
    themes: list[str] = Field(
        ..., description="A list of no more than 5 concise 1 to 3 word themes discovered in the text", max_items=5
    )


# NER
class NamedEntityType(str, Enum):
    """A named entity type."""

    PERSON = "PERSON"
    NORP = "NORP"
    FAC = "FAC"
    ORG = "ORG"
    GPE = "GPE"
    LOC = "LOC"
    PRODUCT = "PRODUCT"
    EVENT = "EVENT"
    WORK_OF_ART = "WORK_OF_ART"
    LAW = "LAW"
    LANGUAGE = "LANGUAGE"
    DATE = "DATE"
    TIME = "TIME"
    PERCENT = "PERCENT"
    MONEY = "MONEY"
    QUANTITY = "QUANTITY"
    ORDINAL = "ORDINAL"
    CARDINAL = "CARDINAL"
    OTHER = "OTHER"


class NamedEntity(BaseModel):
    """The type of named entity and its value."""

    entity_type: NamedEntityType
    entity_mention: str = Field(..., description="The named entity recognized.")


class DocumentNERTask(BaseModel):
    """Information about named entities to extract."""

    named_entities: list[NamedEntity] = Field(
        ...,
        description=f"Perform Named Entity Recognition that finds the following entities: {', '.join([x.name for x in NamedEntityType])}",
    )


# Sentiment
class DocumentSentimentTask(BaseModel):
    """Information about the sentiments expressed in a document."""

    positivity: int = Field(
        ...,
        description="How positive or negative is the author on a scale between 1 and 5 (1=Very Low, 2=Moderately Low, 3=Neutral, 4=Moderately Strong, 5=Very Strong)?",
    )
    positive_statements: list[str] = Field(..., description="A list of the author's positive statements")
    negative_statements: list[str] = Field(..., description="A list of the author's negative statements")
    has_suggestions: bool = Field(..., description="Does the author make any suggestions?")
    suggestions: list[str] = Field(..., description="A list of any suggestions the author makes")
    feels_threatened: bool = Field(
        ...,
        description="Does the author feel fearful, harmed, intimidated, harassed, discriminated against, or threatened in any way?",
    )
    feels_threatened_examples: list[str] = Field(
        ...,
        description="A list of how and why the author feels physically/emotionally/mentally threatened, uncomfortable, or harassed",
    )
    profanity: bool = Field(..., description="Is there any profanity?")
    is_nonsense: bool = Field(
        ...,
        description="Is the text uninformative or does it only contain nonsense? Set to `True` if the document is too short to be meaningful or only says something like 'N/A', 'None', 'I have nothing to add', 'No suggestions', or 'No comment'.",
    )


# Topic summarization
class TopicSummaryTask(BaseModel):
    """A summary and action plan to extract from a list of related documents."""

    theme_name: str = Field(..., description="A 5 to 10 word concise summary of the list of documents")
    action_plan: list[str] = Field(
        ...,
        description="A list of 3-5 specific actions that can be taken based on the documents provided",
        max_items=5,
    )
```
“Tools” For “Tool Calling”
We’re going to use Instructor here because this is what Hamel suggests …
For open models you should use Outlines. For closed model APIs you should use Instructor.
Warning
TBH, I’m a noob relative to Instructor use. If anything I say below is wrong and/or can be improved … please, please let me know!
Moving along, as I understand the library, it is designed to work against a single Pydantic class that you pass in, and an instance of that class is returned at the conclusion of your LLM call. BUT, we’ve defined multiple classes as tools, and we only want the LLM to call the tools it deems necessary to fulfill the user’s request. What are we going to do?
My answer, perhaps not surprisingly, is to use another Pydantic class. It looks like this:
```python
class DocumentAnalysis(BaseModel):
    tasks: list[TranslationTask | DocumentSummaryTask | DocumentNERTask | DocumentSentimentTask | TopicSummaryTask] = Field(
        ..., description="The results of each analysis task the user asked to be performed on a given document as context."
    )
```
This makes it easy to support a flexible tool calling system where a user can create their own analysis class with whatever “tasks” they want the LLM to operate with and use it with Instructor. For example, I’ll use the class below when working with collections of related documents, since the available tools should be limited for this use case:
```python
class RelatedDocumentAnalysis(BaseModel):
    tasks: list[TopicSummaryTask | DocumentSentimentTask] = Field(
        ..., description="The results of each analysis task the user asked to be performed on a list of related documents as context."
    )
```
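One nice side effect of the task union is that consuming the results is just a matter of checking which task types came back. A minimal sketch of how that might look (the `handle_results` helper is purely illustrative and not part of the pipeline):

```python
def handle_results(analysis: DocumentAnalysis) -> None:
    # Dispatch on the concrete task type the LLM chose to return
    for task in analysis.tasks:
        if isinstance(task, TranslationTask):
            print(f"Translated from {task.source_language}: {task.english_translation[:80]}...")
        elif isinstance(task, DocumentSummaryTask):
            print(f"Summary: {task.summary} | Themes: {task.themes}")
        elif isinstance(task, DocumentNERTask):
            print(f"Entities: {[(e.entity_type.value, e.entity_mention) for e in task.named_entities]}")
        elif isinstance(task, DocumentSentimentTask):
            print(f"Positivity: {task.positivity} | Has suggestions: {task.has_suggestions}")
        elif isinstance(task, TopicSummaryTask):
            print(f"Theme: {task.theme_name} | Action plan: {task.action_plan}")
```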
With our Pydantic army above, we can now move to experimenting with the big dogs to see what works and what doesn’t.
Tip: Note
Because of the proprietary nature of the survey comments, I can’t show you the actual results. What I can show you is the code and my observations. With that in hand, you should have everything you need to get going with your own use cases. If anything isn’t clear, drop a comment below or hit me up on X.
Single Document Analysis
Data
Let’s sample 5 rows from our sampled chunked dataset created in the previous post. We’ll use the full survey comments for exploration here and ensure that 2 of those samples are in Spanish.
Tip: Test with representative examples
Think about what kinds of examples are likely to be seen at inference time and test a few out here. Our goal is to get a sense of how well our model will generalize over the asks it’s likely to see in the future.
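For reference, here’s a rough sketch of what that sampling might look like; the file path and the `language` column are assumptions on my part, so adjust to however your dataset from the previous post is organized:

```python
import pandas as pd

# Assumes the chunked dataset has an "AnswerText" column (used below) and a
# hypothetical "language" column identifying each comment's language.
df = pd.read_parquet("data/chunked_survey_comments.parquet")  # hypothetical path

english_sample = df[df["language"] == "English"].sample(3, random_state=42)
spanish_sample = df[df["language"] == "Spanish"].sample(2, random_state=42)

test_df = pd.concat([english_sample, spanish_sample]).reset_index(drop=True)
```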
We’ll define the following function to make it easy for us to test different APIs with Instructor. I designed this function for use with OpenAI, Anthropic, and Fireworks specifically, though it will likely work with any supported vendor with little or no modification.
```python
def ask_ai(
    client,
    content: str,
    query: str | None = None,
    model: str = "gpt-4o",
    instructor_kwargs: dict = {},
) -> DocumentAnalysis:
    if not query:
        query = "Tasks: translation (if the document is not in English), summarization, ner, sentiment analysis."

    try:
        return client.chat.completions.create(
            model=model,
            response_model=DocumentAnalysis,
            max_retries=3,
            messages=[
                {
                    "role": "system",
                    "content": """Execute each analysis task.
Always translate any non-English documents into English before executing other tasks.""",
                },
                {
                    "role": "user",
                    "content": f"{query}. Document: {content}",
                },
            ],
            **instructor_kwargs,
        )
    except Exception as e:
        print(e)
```
Experiments
OpenAI
```python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

for r_idx, r in test_df.iterrows():
    print("==========")
    print(":: Document ::")
    print(r["AnswerText"])

    results = ask_ai(client, r["AnswerText"])

    print(":: Results ::")
    print(results.model_dump_json(indent=2))
    print("==========")
```
Observations
Performed the translation step where needed correctly and used the English translation in the rest of the tools.
Generally did an outstanding job of calling the tools and providing results that were as good, and sometimes better, than what I would have done.
Occasionally ran into validation errors because it couldn’t set various properties correctly. This usually happened with the summarization or sentiment tools, where I’d get an error like this: `tasks.0.DocumentSummaryTask.summary Input should be a valid string [type=string_type, input_value={'summary': 'The author c... 'inadequate planning']}, input_type=dict]`
Anthropic
```python
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

for r_idx, r in test_df.iterrows():
    print("==========")
    print(":: Document ::")
    print(r["AnswerText"])

    results = ask_ai(
        client,
        r["AnswerText"],
        model="claude-3-5-sonnet-20240620",
        instructor_kwargs={"max_tokens": 1024},
    )

    print(":: Results ::")
    print(results.model_dump_json(indent=2))
    print("==========")
```
Observations
Performed the translation step where needed correctly and used the English translation in the rest of the tools.
Generally did an outstanding job of calling the tools and providing results that were as good, and sometimes better, than what I would have done.
Occasionally, it would struggle with the NER task for some reason. I talk about this more in my Structuring Enums for Flawless LLM Results with Instructor post. I can’t say if this is an Instructor issue and/or something particular to the model itself.
Fireworks
```python
import os

client = instructor.patch(
    OpenAI(
        base_url="https://api.fireworks.ai/inference/v1",
        api_key=os.environ["FIREWORKS_API_KEY"],
    ),
)

for r_idx, r in test_df.iterrows():
    print("==========")
    print(":: Document ::")
    print(r["AnswerText"])

    results = ask_ai(client, r["AnswerText"], model="accounts/fireworks/models/firefunction-v2")

    print(":: Results ::")
    print(results.model_dump_json(indent=2))
    print("==========")
```
Observations
Did not reliably perform the translation step where required
Did not reliably call all the tools
Where called, the results of each tool call were not as accurate as those from either the OpenAI or Anthropic models
Encountered validation errors more frequently.
It is FAST! Like really fast to run.
The speed difference is pretty noticeable, but I wonder if it comes at the cost of being able to use Fireworks for any complex structured output?
Related Document Analysis
Data
Let’s sample 5 rows (topics) from our topics dataset created in the previous post.
```python
import json


def format_tools():
    doc_analysis_schema = DocumentAnalysis.model_json_schema()

    result = ""
    for k, v in doc_analysis_schema["$defs"].items():
        result += f"<function name='{k}'>\n"
        result += json.dumps(v, indent=2)
        result += f"\n</function name='{k}'>\n"

    return result.strip()
```
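It’s worth printing the result of `format_tools()` once to sanity check what the model will actually see; each tool comes out as a `<function name='...'>` block wrapping its JSON schema:

```python
# Peek at the beginning of the formatted tool definitions
print(format_tools()[:500])
```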
```python
SYSTEM_PROMPT = """\
{system_persona}

You are given a set of tasks to perform and a document inside <query> tags and a set of possible functions inside <function-definitions> tags.

Calling these functions are optional. Carefully consider the question and determine if one or more functions can be used to answer the question. Place your thoughts and reasoning behind your decision in <function-thoughts> tags.

Below is a list of function definitions:
<function-definitions>
{tools}
</function-definitions>

For each function you want to call, execute the function and use the answer to provide values to each function parameter in a way that conforms to that function's schema. Include the function name and parameter values inside the <function-calls> tag.

Function calls MUST be in this format: <function-thoughts>Calling func1 would be helpful because of ...</function-thoughts><function-calls>[func1_name(params_name=params_value, params_name2=params_value2...), func2_name(params)]</function-calls>, WITHOUT any answer.

If the query is not in English, always translate it into English first and then proceed to call any other functions using the English translation.

If you do not wish to call any functions, say so in the <function-thoughts> tags followed by <function-calls>None</function-calls>
"""

system_persona = """\
You are an expert NLP data scientist, skilled in machine translation, text summarization, NER, thematic analysis, strategic planning, sentiment analysis, and classification tasks.
"""

# print(SYSTEM_PROMPT)
```
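Note that the loop below references a `system_message` variable that isn’t defined in the snippets above. Presumably it’s just the template filled in with the persona and the formatted tool definitions; a minimal sketch of that missing step:

```python
# Fill in the system prompt template with the persona and the tool definitions
system_message = SYSTEM_PROMPT.format(system_persona=system_persona, tools=format_tools())
```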
```python
import replicate

for r_idx, r in test_df.iterrows():
    print("==========")
    print(f"Document:\n{r['AnswerText']}\n")

    prompt = f"Translate the document below into English if necessary. After that, perform the following tasks on the document below: summarize, perform named entity recognition, and sentiment analysis. Document:\n{r['AnswerText']}"

    input = {
        "max_tokens": 1024,
        "temperature": 0,
        "top_p": 0.9,
        "top_k": 50,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "system_prompt": system_message,
        "prompt": f"<query>{prompt}</query>",
    }

    print("--- Example ---")
    print(f"prompt: {prompt}")
    print("")

    print("===== Llama3-70B-Instruct =====")
    output = replicate.run("meta/meta-llama-3-70b-instruct", input=input)
    print("".join(output))
    print("")

    print("===== Llama3-8B-Instruct =====")
    output = replicate.run("meta/meta-llama-3-8b-instruct", input=input)
    print("".join(output))
    print("")
```
Observations
Generally the 70B model did an outstanding job of calling the tools and providing results that were as good, and sometimes better, than what I would have done.
The 8B model would return decent results most of the time, but the formatting was all over the place.
This gave me a lot of confidence that open source models are a contender worth considering. Even though the 8B model didn’t produce reliable results, the fact that the correct information was returned, along with how well the 70B model performed, definitely makes me think it might do really well with fine tuning.
Takeaways
After all this, I’m confident that my ideas are worth pursuing. Based on the results above, it seems worthwhile to explore turning the following knobs to boost performance with these models:
Improved descriptions in the Pydantic classes
Improve the system prompt
Formulate varied human messages to simulate different ways the user may prompt the model (a rough sketch below)
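On that last point, `ask_ai` already accepts a `query` argument, so simulating different user phrasings is mostly a matter of cycling through variants of the same ask. A rough sketch (these particular wordings are just illustrative):

```python
# A few different ways a user might phrase the same request
query_variants = [
    "Tasks: translation (if the document is not in English), summarization, ner, sentiment analysis.",
    "Please summarize this comment, pull out any named entities, and tell me how the author feels.",
    "Translate to English if needed, then give me a summary, the key entities, and the overall sentiment.",
]

for query in query_variants:
    results = ask_ai(client, test_df.iloc[0]["AnswerText"], query=query)
    print(results.model_dump_json(indent=2))
```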
Can you think of any other adjustments that might be beneficial? If so, drop a comment below!
Next Steps
With some insights about what I can play with to improve my results, it’s time for some experimentation. But how can I know whether or not the changes I’m making represent meaningful progress? How can I know specifically where the model is struggling and look at those examples to inform future experiments?
The answer: I need to set up an initial evaluation pipeline along with some scoring functions.
And that is exactly what we’ll do in the next post in this series … stay tuned.