Structuring Enums for Flawless LLM results with Instructor
Instructor Best Practices and Cautions
I’m spending some time with Jason Liu’s Instructor library while building a function-calling solution that returns structured output because, well, Hamel recommends it for proprietary models.
For open models you should use Outlines. For closed model APIs you should use Instructor.
The library is intuitive, fun to use, and has some really nice documentation. When it comes to choosing whether to use enums or literals in your pydantic classes, the docs recommend the following:
For classification we’ve found there are generally two methods of modeling:
- using Enums
- using Literals
Use an enum in Python when you need a set of named constants that are related and you want to ensure type safety, readability, and prevent invalid values. Enums are helpful for grouping and iterating over these constants.
Use literals when you have a small, unchanging set of values that you don’t need to group or iterate over, and when type safety and preventing invalid values is less of a concern. Literals are simpler and more direct for basic, one-off values.
… and they also seem to indicate that getting them to work as expected might be challenging …
If you’re having a hard time with `Enum`, an alternative is to use `Literal`.
I found this out first-hand when I was attempting to define an enum for a number of named entities I wanted an LLM to identify in a given document. My initial code worked pretty nicely with GPT-4o but failed miserably time and time again with every Anthropic model I tried (I’ll explain why below). If you’re looking for the TL;DR, the final version of my code at the end of this post represents a substantially more resilient solution that works across vendors (I also tested this with Fireworks), offering a better guarantee that your LLM calls find the entities you care about correctly.
v0: Using Enum
These are the initial `Enum` and Pydantic classes I started with. They work pretty damn well with OpenAI’s GPT-4o but fail spectacularly when using any of the Anthropic models.
```python
from enum import Enum

from pydantic import BaseModel, Field


class EntityGroup(str, Enum):
    """A named entity type."""

    PERSON = "PERSON"
    ORGANIZATION = "ORGANIZATION"
    LOCATION = "LOCATION"
    DATE = "DATE"
    TIME = "TIME"
    PERCENT = "PERCENT"
    MONEY = "MONEY"
    QUANTITY = "QUANTITY"
    ORDINAL = "ORDINAL"
    CARDINAL = "CARDINAL"
    EMAIL = "EMAIL"
    PHONE_NUMBER = "PHONE_NUMBER"
    CREDIT_CARD_NUMBER = "CREDIT_CARD_NUMBER"
    SSN = "SSN"


class NamedEntity(BaseModel):
    """The type of named entity and its value."""

    entity_group: EntityGroup = Field(..., description="The type of named entity")
    word: str = Field(..., description="The named entity found")


class DocumentNER(BaseModel):
    """Information about named entities to extract."""

    named_entities: list[NamedEntity] = Field(
        ...,
        description=f"Perform Named Entity Recognition that finds the following entities: {', '.join([x.name for x in EntityGroup])}",
    )
```
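For reference, the actual Instructor call looks something like this (a minimal sketch, assuming a recent Instructor release with `from_openai`; `doc` is just placeholder input text):

```python
import instructor
from openai import OpenAI

# Patch the OpenAI client so `response_model` is supported.
client = instructor.from_openai(OpenAI())

doc = "Jason paid Acme Corp. $100 on June 14th."  # example input

resp = client.chat.completions.create(
    model="gpt-4o",
    response_model=DocumentNER,
    messages=[{"role": "user", "content": f"Find the named entities in this document:\n\n{doc}"}],
)
print(resp.named_entities)
```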
When using the Anthropic models, I would consistently see it trying to set `entity_group` to a string rather than a proper enum value from the `EntityGroup` enum.
After iterating through a number of prompt and class/field description modifications, I decided to give up and replace my `Enum` with a `Literal`. And guess what? Everything worked great across all model vendors.
I also decided to look up the named entities used in spaCy and use those names in my `Enum`, as it makes sense to me that these libraries might have been included in the training of these LLMs, which may help them do a better job of finding the entities I care about.
v1: Using Literal
Using the `Literal` type fixed everything and works great across all models! Here’s what it looks like:
```python
from typing import Literal

from pydantic import BaseModel, Field


class NamedEntity(BaseModel):
    """A named entity found in a document."""

    entity_type: Literal[
        "PERSON",
        "NORP",
        "FAC",
        "ORG",
        "GPE",
        "LOC",
        "PRODUCT",
        "EVENT",
        "WORK_OF_ART",
        "LAW",
        "LANGUAGE",
        "DATE",
        "TIME",
        "PERCENT",
        "MONEY",
        "QUANTITY",
        "ORDINAL",
        "CARDINAL",
        "OTHER",
    ]
    entity: str = Field(..., description="The named entity found")


class DocumentNERTask(BaseModel):
    """Extracts the named entities in the document.

    This tool should be used anytime the user asks for named entity recognition (NER)
    or wants to identify named entities.
    """

    named_entities: list[NamedEntity] = Field(
        ...,
        description="Perform Named Entity Recognition and return a list of any 'NamedEntity' objects found.",
    )
```
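The Anthropic side is nearly identical (again a sketch, assuming Instructor’s `from_anthropic` wrapper and reusing the `doc` placeholder from above; the model name is just an example):

```python
import anthropic
import instructor

# Patch the Anthropic client so `response_model` is supported.
client = instructor.from_anthropic(anthropic.Anthropic())

resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    response_model=DocumentNERTask,
    messages=[{"role": "user", "content": f"Find the named entities in this document:\n\n{doc}"}],
)
print(resp.named_entities)
```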
This works great … but I really wanted to use an `Enum` for the reasons listed at the top of this post. And as I’m the kinda guy who enjoys fighting with CUDA installs on his local DL rig, I decided to give it a go after taking a few hours off to enjoy the Euros and Copa America tourneys (also, Germany should have won; that was a handball, but nah, I’m not angry, nope, not bent at all).
v2: Using Enum Revisited
Here’s the TL;DR version of the code. This version is working fabulously across all APIs, and I have yet to encounter a single exception involving Instructor being unable to assign a valid value from the `Enum`.
```python
from enum import Enum
from typing import Annotated

from pydantic import BaseModel, BeforeValidator, Field


class NamedEntityType(str, Enum):
    """Valid types of named entities to extract."""

    PERSON = "PERSON"
    NORP = "NORP"
    FAC = "FAC"
    ORG = "ORG"
    GPE = "GPE"
    LOC = "LOC"
    PRODUCT = "PRODUCT"
    EVENT = "EVENT"
    WORK_OF_ART = "WORK_OF_ART"
    LAW = "LAW"
    LANGUAGE = "LANGUAGE"
    DATE = "DATE"
    TIME = "TIME"
    PERCENT = "PERCENT"
    MONEY = "MONEY"
    QUANTITY = "QUANTITY"
    ORDINAL = "ORDINAL"
    CARDINAL = "CARDINAL"
    OTHER = "OTHER"


def convert_str_to_named_entity_type(v: str | NamedEntityType) -> NamedEntityType:
    """Ensure the entity type is a valid enum, falling back to OTHER if not."""
    if isinstance(v, NamedEntityType):
        return v
    try:
        return NamedEntityType(v)
    except ValueError:
        return NamedEntityType.OTHER


class NamedEntity(BaseModel):
    """A named entity result."""

    entity_type: Annotated[str, BeforeValidator(convert_str_to_named_entity_type)]
    entity_mention: str = Field(..., description="The named entity recognized.")


class DocumentNERTask(BaseModel):
    """Extracts the named entities found in the document.

    This tool should be used anytime the user asks for named entity recognition (NER)
    or wants to identify named entities.
    """

    named_entities: list[NamedEntity] = Field(
        ...,
        description=f"Perform Named Entity Recognition that finds the following entities: {', '.join([x.name for x in NamedEntityType])}",
    )
```
Besides the return of the `Enum`, the most noticeable change is the inclusion of a `BeforeValidator` that ensures the value assigned in `NamedEntity` is a valid member of the enum. In cases where the model wants to add an entity to the list of `named_entities` with a type that isn’t defined in the `NamedEntityType` enum, or that is named differently (e.g., “ORGANIZATION” vs. “ORG”), the validator will assign it to `OTHER`.
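A quick illustration of that fallback (example values of my own, not actual LLM output):

```python
# "ORGANIZATION" isn't a member of NamedEntityType, so it falls back to OTHER ...
ne = NamedEntity(entity_type="ORGANIZATION", entity_mention="Acme Corp")
print(ne.entity_type)  # OTHER

# ... while an exact match passes straight through.
ne = NamedEntity(entity_type="ORG", entity_mention="Acme Corp")
print(ne.entity_type)  # ORG
```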
With this in place, I now have a solution that is:

- More resilient
- Useful for debugging named entity recognition (e.g., I can explore what named entities might be missing from the `Enum`, or getting named differently, by looking at those that get associated with the `OTHER` value)
- One where I can use that same beautiful `Enum` across all parts of my application
v2.0.1: Using Enum and fuzzywuzzy
A suggestion from a Twitter user inspired me to enhance our approach by implementing similarity-based matching rather than relying on exact matches. To make it so, I installed the `fuzzywuzzy` library and made the necessary modifications to increase the likelihood of delivering high-quality results.
```python
from enum import Enum
from typing import Annotated

from fuzzywuzzy import process as fuzzy_process
from pydantic import BaseModel, BeforeValidator, Field


class NamedEntityType(str, Enum):
    """Valid types of named entities to extract."""

    PERSON = "PERSON"
    NORP = "NORP"
    FAC = "FAC"
    ORG = "ORG"
    GPE = "GPE"
    LOC = "LOC"
    PRODUCT = "PRODUCT"
    EVENT = "EVENT"
    WORK_OF_ART = "WORK_OF_ART"
    LAW = "LAW"
    LANGUAGE = "LANGUAGE"
    DATE = "DATE"
    TIME = "TIME"
    PERCENT = "PERCENT"
    MONEY = "MONEY"
    QUANTITY = "QUANTITY"
    ORDINAL = "ORDINAL"
    CARDINAL = "CARDINAL"
    OTHER = "OTHER"


def convert_str_to_named_entity_type(v: str | NamedEntityType) -> NamedEntityType:
    """Ensure the entity type is a valid enum, fuzzy-matching the string if needed."""
    if isinstance(v, NamedEntityType):
        return v
    try:
        # Find the enum value most similar to the string the LLM produced.
        match, score = fuzzy_process.extractOne(v.upper(), [e.value for e in NamedEntityType])
        return NamedEntityType(match) if score >= 60 else NamedEntityType.OTHER
    except ValueError:
        return NamedEntityType.OTHER


class NamedEntity(BaseModel):
    """A named entity result."""

    entity_type: Annotated[str, BeforeValidator(convert_str_to_named_entity_type)]
    entity_mention: str = Field(..., description="The named entity recognized.")


class DocumentNERTask(BaseModel):
    """Extracts the named entities found in the document.

    This tool should be used anytime the user asks for named entity recognition (NER)
    or wants to identify named entities.
    """

    named_entities: list[NamedEntity] = Field(
        ...,
        description=f"Perform Named Entity Recognition that finds the following entities: {', '.join([x.name for x in NamedEntityType])}",
    )
```
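To see what the fuzzy matching buys us, here’s how the validator now handles a near-miss versus outright junk (scores are approximate and depend on fuzzywuzzy’s default scorer):

```python
choices = [e.value for e in NamedEntityType]

# A near-miss like "ORGANIZATION" now lands on ORG instead of OTHER ...
print(fuzzy_process.extractOne("ORGANIZATION", choices))  # e.g., ('ORG', 90)
print(NamedEntity(entity_type="ORGANIZATION", entity_mention="Acme Corp").entity_type)  # ORG

# ... while a string with no close match should still fall back to OTHER.
print(NamedEntity(entity_type="FOOBAR", entity_mention="???").entity_type)  # most likely OTHER
```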
This improves those cases where, for example, the LLM wants to define the entity type as “ORGANIZATION” but it is defined in the `Enum` as “ORG”.
Another option potentially worth exploring is to use the `llm_validator` function to make a call out to the LLM when exceptions happen and prompt it to coerce the value into something in the `Enum`. This could hike up your costs a bit, but I imagine a cheap model like GPT-3.5-Turbo could do the job just fine and would likely give you additional robustness in the quality of your results.
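Here’s a rough, untested sketch of what that might look like. The `llm_validator` parameters shown (`client`, `allow_override`, `model`) reflect my reading of the Instructor docs and have shifted between versions, so treat this as a starting point rather than gospel:

```python
from typing import Annotated

import instructor
from instructor import llm_validator
from openai import OpenAI
from pydantic import BeforeValidator

client = instructor.from_openai(OpenAI())

# Hypothetical: have a cheap model rewrite any value that isn't a valid
# NamedEntityType member (allow_override lets it substitute a corrected value).
EntityTypeStr = Annotated[
    str,
    BeforeValidator(
        llm_validator(
            f"The value must be one of: {', '.join(e.value for e in NamedEntityType)}",
            client=client,
            allow_override=True,
            model="gpt-3.5-turbo",
        )
    ),
]
```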
Conclusion
That’s it.
If you found this helpful and/or have suggestions on how to improve the use of `Enum`s in Instructor, lmk in the comments below or on X. Until then, time to enjoy some football and see if Brazil can make it into the semis.