Gemini (Part II) - The Unified SDK

LLMS
Gemini
Google
Today, we’ll look at how to get started with the new unified SDK for both Gemini API and Vertex AI users.
Author

Wayde Gilliam

Published

December 24, 2024

Hallelujah!

In the previous post we looked at how Gemini, especially 2.0, compares with other models on both performance and cost, and why you should consider it as a potential “go to” for building AI-powered applications of almost any kind. Today, we’ll discuss how to get started with the new unified SDK that both Gemini API and Vertex AI users can use to build applications.

Amongst the biggest complaints from Gemini devs has been the inconsistency between the SDKs for the Gemini API and the Vertex API. Well, it looks like Google has finally heard our cries: as part of the Gemini 2.0 release, they’ve shipped a new unified SDK that both Gemini API and Vertex AI users can use to build applications. This makes it super easy to bounce back and forth between the two API options without having to understand the nuances of two somewhat different SDKs and what might break when you switch between them.

Initializing the SDK client

A note on the Gemini API and Vertex AI (from the official docs)

In the whitepapers, most of the example code uses the Enterprise Vertex AI platform. In contrast, this notebook, along with the others in this series, will use the Gemini Developer API and AI Studio.

Both APIs provide access to the Gemini family of models, and the code to interact with the models is very similar. Vertex provides a world-class platform for enterprises, governments and advanced users that need powerful features like data governance, ML ops and deep Google Cloud integration.

AI Studio is free to use and only requires a compatible Google account to log in and get started. It is deeply integrated with the Gemini API, which comes with a generous free tier that you can use to run the code in these exercises.

If you are already set up with Google Cloud, you can check out the Enterprise Gemini API through Vertex AI, and run the samples directly from the supplied whitepapers.

I’ll be showing how to use both platforms below, but the first step is the same either way … a single pip install :)

pip install google-genai
Note

Here are some reference links to the new SDK docs and a Getting Started notebook courtesy of Google with a lot more information and examples …

If you are using the Gemini Developer API …

import os

from google import genai
from IPython.display import Markdown, display

# with a Google AI API key
GOOGLE_AI_API_KEY = os.getenv("GOOGLE_AI_API_KEY")

client = genai.Client(api_key=GOOGLE_AI_API_KEY)

response = client.models.generate_content(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")

display(Markdown(response.text))
# The distance between the Earth and the Moon isn't constant, as the Moon travels in an elliptical orbit around our planet. Therefore, we talk about the distance in terms of averages:

# Average distance: Approximately 238,855 miles (384,400 kilometers).
# However, keep in mind:

# Perigee (Closest Approach): The Moon can get as close as about 225,623 miles (363,104 kilometers).
# Apogee (Farthest Distance): The Moon can be as far as about 252,088 miles (405,696 kilometers).
# So, while the average distance is a good figure to keep in mind, it's important to remember that the actual distance varies throughout the month.

print(response.usage_metadata)
# cached_content_token_count=None candidates_token_count=181 prompt_token_count=10 total_token_count=191
Note

The response is an instance of GenerateContentResponse. In addition to the generated content, it includes the token usage (via response.usage_metadata), safety ratings and explanations, and a variety of other fields that will prove more useful depending on how we use Gemini (e.g., with multimodal inputs, tool calling, code execution, etc.). You’ll see what I mean down below.
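For example, here’s a quick way to poke around a couple of those extra fields (treat this as exploratory, since the exact set of fields may evolve as the SDK matures):

# peek at a couple of the extra fields on the `response` from above
candidate = response.candidates[0]
print(candidate.finish_reason)   # why generation stopped (e.g., STOP)
print(candidate.safety_ratings)  # per-category safety info, when present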

There is an async version of generate_content available via the aio property on your client instance. You just need to add await in front of the call.

response = await client.aio.models.generate_content(
    model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?"
)
Tip

The text response is almost always going to be in Markdown, so why not make it look pretty in notebooks by doing a from IPython.display import Markdown, display and then display(Markdown(response.text))?
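If you find yourself typing that a lot, a tiny convenience wrapper keeps things tidy (the md name is just my own shorthand, nothing SDK-specific):

from IPython.display import Markdown, display

def md(response) -> None:
    """Render a response's text as Markdown in a notebook."""
    display(Markdown(response.text))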

If you are using Vertex …

import os

from google import genai
from google.oauth2 import service_account
from IPython.display import Markdown, display

# this is the path to your JSON credentials file (at least it is to mine :))
GOOGLE_VERTEX_AI_CREDS = os.getenv("GOOGLE_VERTEX_AI_CREDS")

SCOPES = ["https://www.googleapis.com/auth/cloud-platform"]
creds = service_account.Credentials.from_service_account_file(GOOGLE_VERTEX_AI_CREDS, scopes=SCOPES)

client = genai.Client(vertexai=True, project="generative-playground", location="us-central1", credentials=creds)

response = client.models.generate_content(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")

display(Markdown(response.text))
# The distance between the Earth and the Moon is not constant, as the Moon's orbit is elliptical (an oval shape), not a perfect circle. Here's a breakdown:

# Average Distance: The average distance is about 384,400 kilometers (238,900 miles). This is the number most commonly used.
# Perigee: This is the point in the Moon's orbit where it is closest to Earth. At perigee, the distance can be as close as 363,104 kilometers (225,623 miles).
# Apogee: This is the point in the Moon's orbit where it is furthest from Earth. At apogee, the distance can reach as far as 405,696 kilometers (252,088 miles).
# So, the answer is variable, but the average distance is approximately 384,400 kilometers (238,900 miles).

print(response.usage_metadata)
# cached_content_token_count=None candidates_token_count=221 prompt_token_count=9 total_token_count=230

We should expect to get something close to the same response from both the Gemini API and the Vertex API; the only real difference is how you instantiate your Gemini client.

Important

As of the time of this writing, Gemini 2.0 models are ONLY available at the “us-central1” location.

Note

The previous SDK had a list_models method that would return a list of models (at least the Gemini API version did). At the time of this writing, the equivalent in the new SDK doesn’t return anything useful.

for model in client.models.list():
    print(model)  # ain't returning nothing useful

Controlling Generation

You can influence the generation of content by providing a GenerateContentConfig object. With this object, you can provide a system instruction, safety settings, and generation parameters (e.g., temperature, top_p, etc.).


Tip

I tend to turn all of these “safety” settings off, and it would be great if there were an easier way to do this in one line vs. what you see below (a list comprehension gets you most of the way there; see the sketch after the block).

from google.genai import types

safety_settings = [
    types.SafetySetting(category="HARM_CATEGORY_HATE_SPEECH", threshold="BLOCK_NONE"),
    types.SafetySetting(category="HARM_CATEGORY_DANGEROUS_CONTENT", threshold="BLOCK_NONE"),
    types.SafetySetting(category="HARM_CATEGORY_HARASSMENT", threshold="BLOCK_NONE"),
    types.SafetySetting(category="HARM_CATEGORY_SEXUALLY_EXPLICIT", threshold="BLOCK_NONE"),
]
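Until there’s a built-in one-liner, a list comprehension over the same four categories tidies this up:

# same settings as above, built in one expression
safety_settings = [
    types.SafetySetting(category=c, threshold="BLOCK_NONE")
    for c in [
        "HARM_CATEGORY_HATE_SPEECH",
        "HARM_CATEGORY_DANGEROUS_CONTENT",
        "HARM_CATEGORY_HARASSMENT",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    ]
]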

gen_config = types.GenerateContentConfig(
    system_instruction="You are an expert in all things astronomy and you ONLY provide very concise answers",
    safety_settings=safety_settings,
    temperature=0,
    top_p=0.95,
    top_k=20,
    candidate_count=1,
    seed=5,
    max_output_tokens=100,
    stop_sequences=["STOP!"],
    presence_penalty=0.0,
    frequency_penalty=0.0,
)

With that in place, we can see that our response is now much more concise and to the point.

response = client.models.generate_content(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?", config=gen_config)
print(response.text)
# About 384,400 kilometers.

print(response.usage_metadata)
# cached_content_token_count=None candidates_token_count=12 prompt_token_count=24 total_token_count=36

Tokenization Goodies

You can use the count_tokens method to calculate the number of input tokens before sending a request to the Gemini API. Super helpful for understanding the cost implications of your requests (see the quick cost sketch below). Async operations are available for both count_tokens and compute_tokens, as well as most other methods on your client instance, via the aio property.

# count_tokens
response = client.models.count_tokens(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")
print(response)
# total_tokens=9 cached_content_token_count=None

print("\n======\n")

# count_tokens async
response = await client.aio.models.count_tokens(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")
print(response)
# total_tokens=9 cached_content_token_count=None
Note

All of the async bits are available via the aio property on your client instance.
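Since you pay per token, count_tokens makes a back-of-the-napkin cost check easy. The per-token rates below are made-up placeholders, so check the current pricing page for real numbers:

# hypothetical per-1M-token rates -- look up the real ones for your model/tier
INPUT_PRICE_PER_1M = 0.10
OUTPUT_PRICE_PER_1M = 0.40

prompt = "How far is the moon from the earth?"
n_input = client.models.count_tokens(model="gemini-2.0-flash-exp", contents=prompt).total_tokens
est_output_tokens = 200  # rough guess at how long the response will be

est_cost = (n_input * INPUT_PRICE_PER_1M + est_output_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000
print(f"estimated cost: ~${est_cost:.6f}")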

You can use compute_tokens to get the tokenized inputs (only available in Vertex AI). This can be really helpful in terms of understanding how the Gemini models tokenize your input and troubleshooting issues that might arise during generation.

# compute_tokens
response = client.models.compute_tokens(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")
print(response)
# tokens_info=[TokensInfo(role='user', token_ids=['2299', '2166', '603', '573', '11883', '774', '573', '6683', '235336'], tokens=[b'How', b' far', b' is', b' the', b' moon', b' from', b' the', b' earth', b'?'])]


print("\n======\n")

# compute_tokens async
response = await client.aio.models.compute_tokens(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")
print(response)
# tokens_info=[TokensInfo(role='user', token_ids=['2299', '2166', '603', '573', '11883', '774', '573', '6683', '235336'], tokens=[b'How', b' far', b' is', b' the', b' moon', b' from', b' the', b' earth', b'?'])]
Tip

Tokenization issues are often some of the biggest gotchas folks run into when working with LLMs. There’s an entire course about the importance of looking at your data … and that means looking at your tokens.
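For instance, you can line up each token with its id to see exactly how your prompt gets split (reusing the compute_tokens call from above):

# walk the tokenization of the prompt, pairing each token with its id
response = client.models.compute_tokens(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?")
for info in response.tokens_info:
    for token, token_id in zip(info.tokens, info.token_ids):
        print(f"{token!r:>10} -> {token_id}")
# b'How' -> 2299
# b' far' -> 2166
# ...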

Streaming Content

We’ve already seen how we can use the generate_content method to generate content, but what if we want to stream the content as it’s being generated?

The generate_content_stream method is a great way to stream content as it is generated. This is useful for things like chat interfaces or long-form content where you want to see the progress as it is being generated.

An async version is also available via the aio property (sketched below, after the sync example).

for chunk in client.models.generate_content_stream(model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?"):
    print(chunk.text, end="|", flush=True)

# The| distance between the Earth and the Moon is not constant, as the Moon's| orbit is elliptical. Here's a breakdown:

# * **Average Distance:**| The average distance between the Earth and the Moon is about **384,400 kilometers (238,900 miles)**. This| is the most commonly cited figure.

# * **Perigee:** This is the point in the Moon's orbit when it is closest to Earth. At| perigee, the distance can be as little as about **363,104 kilometers (225,623 miles)**.

# * **Apogee:** This is the point in the Moon's orbit| when it is farthest from Earth. At apogee, the distance can be as much as about **405,696 kilometers (252,088 miles)**.

# **Key Takeaway:** While we often| talk about the average distance, remember that the Moon's distance varies throughout its orbit, fluctuating between perigee and apogee.

# So, while the average is 384,400 kilometers, it's more accurate to say the distance ranges from about 363,000| km to 406,000 km.
|

I’ve included the pipe character so you can see the chunks as they are being generated.
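And in case you’re curious, the async flavor follows the same aio pattern as everything else. My understanding is that you await the call and then iterate with async for, something like:

async for chunk in await client.aio.models.generate_content_stream(
    model="gemini-2.0-flash-exp", contents="How far is the moon from the earth?"
):
    print(chunk.text, end="|", flush=True)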

Multi-Turn Chat

The chats module provides a way to interact with the Gemini API in a multi-turn chat interface.

The create method starts a new chat session, and send_message sends a message to it.

The history parameter lets you seed the session with prior conversation turns (see the sketch after the next code block).

from textwrap import dedent

system_instruction = dedent("""
  You are an expert software developer and a helpful coding assistant.
  You are able to generate high-quality code in any programming language.
""").strip()

convo = []
chat = client.chats.create(
    model="gemini-2.0-flash-exp", config=types.GenerateContentConfig(system_instruction=system_instruction, temperature=0.5), history=convo
)
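If you want to resume an earlier conversation, you can seed history with Content objects instead of an empty list. Here’s a minimal sketch (the turn text is made up, but the role/parts shapes match what we’ll see in _curated_history below):

# resume a prior conversation by seeding `history` with user/model turns
convo = [
    types.Content(role="user", parts=[types.Part(text="My name is Wayde.")]),
    types.Content(role="model", parts=[types.Part(text="Nice to meet you, Wayde!")]),
]
chat = client.chats.create(
    model="gemini-2.0-flash-exp",
    config=types.GenerateContentConfig(system_instruction=system_instruction, temperature=0.5),
    history=convo,
)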

You can now use send_message to add a message to the chat.

response = chat.send_message("Write a simple python program that accepts a person's name and returns a greeting.")

Markdown(response.text)
# def greet_person(name):
#   """
#   Greets a person by name.

#   Args:
#     name: The name of the person to greet (string).

#   Returns:
#     A greeting string.
#   """
#   return f"Hello, {name}!"

# if __name__ == "__main__":
#   person_name = input("Please enter your name: ")
#   greeting = greet_person(person_name)
#   print(greeting)

# ... and an extended explanation of the code I'm omitting here ...

Currently we can look at, and even modify, the chat history by accessing the _curated_history property on the chat instance. That _ prefix tells me this part of the SDK is still under construction and likely to change.

chat._curated_history
# [Content(parts=[Part(video_metadata=None, code_execution_result=None, executable_code=None, file_data=None, function_call=None, function_response=None, inline_data=None, text="Write a simple python program that accepts a person's name and returns a greeting.")], role='user'),
#  Content(parts=[Part(video_metadata=None, code_execution_result=None, executable_code=None, file_data=None, function_call=None, function_response=None, inline_data=None, text='```python\ndef greet_person(name):\n  """\n  Greets a person by name.\n\n  Args:\n    name: The name of the person to greet (string).\n\n  Returns:\n    A greeting string.\n  """\n  return f"Hello, {name}!"\n\nif __name__ == "__main__":\n  person_name = input("Please enter your name: ")\n  greeting = greet_person(person_name)\n  print(greeting)\n```\n\n**Explanation:**\n\n1.  **`def greet_person(name):`**:\n    *   This line defines a function named `greet_person` that takes one argument, `name`.\n    *   The `name` argument will hold the person\'s name as a string.\n\n2.  **`"""..."""`**:\n    *   This is a docstring, which is a multiline string used to document what the function does. It\'s good practice to include docstrings to make your code more understandable.\n\n3.  **`return f"Hello, {name}!"`**:\n    *   This line uses an f-string (formatted string literal) to create the greeting.\n    *   `f"..."`  allows you to embed variables directly into the string by placing them inside curly braces `{}`.\n    *   The function returns the complete greeting string (e.g., "Hello, Alice!").\n\n4.  **`if __name__ == "__main__":`**:\n    *   This is a standard Python construct that ensures the code inside the `if` block only runs when the script is executed directly (not when it\'s imported as a module into another script).\n\n5.  **`person_name = input("Please enter your name: ")`**:\n    *   The `input()` function displays the message "Please enter your name: " to the user and waits for them to type something.\n    *   Whatever the user types is stored as a string in the `person_name` variable.\n\n6.  **`greeting = greet_person(person_name)`**:\n    *   This line calls the `greet_person` function, passing the `person_name` as an argument.\n    *   The function returns the greeting string, which is then stored in the `greeting` variable.\n\n7.  **`print(greeting)`**:\n    *   Finally, this line prints the greeting to the console.\n\n**How to run this code:**\n\n1.  Save the code in a file named, for example, `greeting.py`.\n2.  Open a terminal or command prompt.\n3.  Navigate to the directory where you saved the file.\n4.  Run the command `python greeting.py`.\n5.  The program will prompt you to enter your name. Type your name and press Enter.\n6.  The program will then print the greeting to the console.\n\n**Example Interaction:**\n\n```\nPlease enter your name: Bob\nHello, Bob!\n```\n')], role='model')
response = chat.send_message("Okay, write a unit test of the generated function.")

Markdown(response.text)
# import unittest
# from greeting import greet_person  # Assuming the previous code is in greeting.py

# class TestGreeting(unittest.TestCase):

#     def test_greet_with_valid_name(self):
#         self.assertEqual(greet_person("Alice"), "Hello, Alice!")
#         self.assertEqual(greet_person("Bob"), "Hello, Bob!")
#         self.assertEqual(greet_person("Charlie"), "Hello, Charlie!")

#     def test_greet_with_empty_name(self):
#         self.assertEqual(greet_person(""), "Hello, !")

#     def test_greet_with_name_containing_spaces(self):
#         self.assertEqual(greet_person("John Doe"), "Hello, John Doe!")

#     def test_greet_with_name_containing_numbers(self):
#         self.assertEqual(greet_person("User123"), "Hello, User123!")

#     def test_greet_with_special_characters(self):
#         self.assertEqual(greet_person("!@#$%^"), "Hello, !@#$%^!")

# if __name__ == '__main__':
#     unittest.main()

# ... and an extended explanation of the code I'm omitting here ...

Pretty cool!

Important

Note the “role” names are a bit different from what you might expect. With Gemini 2.0, you’ll see “user” when the message comes from the user and “model” when the message comes from the LLM.

Conclusion

The SDK is still being fleshed out, but I’m excited about the prospect of being able to build with Gemini no matter which API I’m using. Hopefully, some of the tips and tricks above will help you get started with whichever API flavor you prefer.

In the next post, we’ll look at how to take advantage of Gemini’s multimodal capabilities including how to incorporate few-shot examples for both text only and multimodal tasks.