Want to add ChatGPT-style chat, vision analysis, and semantic search to your Python apps? The OpenAI Python SDK makes this straightforward. In this guide, you'll build AI-powered features, from chat interfaces to semantic search, using Python 3.13 and the latest SDK patterns.

What you’ll learn: Chat completions with streaming responses, function calling for API integration, embeddings for semantic search, vision analysis, and production deployment with proper error handling. All code is ready to copy and run.

Quick heads-up: If you’re using old code with openai.ChatCompletion.create(), that pattern broke in November 2023 when version 1.0 launched. The SDK now uses client instances, which is actually cleaner once you see it in action. Don’t worry—I’ll show you the new patterns step by step.

Quick start: Your first AI-powered response in 5 minutes

Let’s get you making AI requests right away. Install the SDK and make your first API call.

Install the SDK:

The examples in this guide target Python 3.13 or higher. If you're on an older version, this is a good time to upgrade!
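From a terminal (the package on PyPI is named openai):

pip install openai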

Get your API key: Head to platform.openai.com/api-keys and create a new key. Store it as an environment variable (never hardcode it in your source files—trust me, you’ll thank yourself later):

export OPENAI_API_KEY='sk-proj-...'

Make your first request:

from openai import OpenAI
import os

# Create a client instance (reuse this, don't create per request)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Let's talk to GPT!
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
# Output: "Hello! How can I help you today?"

🎉 Nice work! You just made your first AI request. The client handles authentication, retries, and connection pooling automatically—one less thing to worry about.

Pro tip: Create one client instance per application and reuse it. The SDK uses connection pooling internally, which significantly improves performance.

For async applications (FastAPI, asyncio), use the async client:

from openai import AsyncOpenAI

async_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

async def get_completion(prompt):
    response = await async_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
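To call it from a synchronous entry point, run the coroutine with asyncio; a minimal sketch:

import asyncio

# Run the coroutine and print the result
print(asyncio.run(get_completion("Hello!")))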

Building conversations with chat completions

Chat completions power everything from customer support bots to code generation tools. You send a list of messages (think of it as the conversation history), and the API returns the model’s response. Let’s explore how this works.

Understanding message roles

The messages array is like a conversation transcript. Each message has a role (who’s speaking) and content (what they said). The model reads this entire conversation to understand context and generate relevant responses.

System messages set the assistant’s behavior. They define personality, constraints, and output format. The model treats system messages as instructions that override default behavior.

messages = [
    {"role": "system", "content": "You are a Python expert who explains concepts concisely with code examples."},
    {"role": "user", "content": "What are Python decorators?"}
]

User messages represent input from the person using your application. These are questions, commands, or prompts.

Assistant messages store the model’s previous responses. Include them to maintain conversation context across multiple turns.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG (Retrieval Augmented Generation) combines information retrieval with text generation..."},
    {"role": "user", "content": "How do I implement it in Python?"}
]

The model sees the entire message history and generates a response based on all context. This enables multi-turn conversations where the assistant remembers previous exchanges.
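Here's a minimal sketch of a multi-turn helper that keeps appending to the history, so each call sees the whole conversation (it reuses the client from the quick start):

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_input):
    # Record the user's turn, call the API, then store the reply for future context
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=messages
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("What is RAG?"))
print(chat("How do I implement it in Python?"))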

Model parameters

Control model behavior with parameters. These affect creativity, response length, and output diversity.

temperature (0.0-2.0) controls randomness. Lower values (0.0-0.3) produce deterministic, focused responses. Higher values (0.7-1.5) increase creativity and variation. Use low temperature for factual tasks (code generation, data extraction) and higher temperature for creative tasks (brainstorming, storytelling).

# Deterministic code generation
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Write a Python function to calculate fibonacci"}],
    temperature=0.1
)

# Creative writing
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Write a sci-fi story opening"}],
    temperature=1.2
)

max_tokens limits response length. The model stops generating after reaching this limit. Both input and output tokens count against the context window limit: GPT-5.2 supports 200k tokens and GPT-5 supports 128k.

top_p (0.0-1.0) implements nucleus sampling. The model considers only the top P probability mass of tokens. Use top_p=0.1 for focused responses or top_p=0.9 for diverse outputs. Don’t adjust both temperature and top_p simultaneously.

presence_penalty (-2.0 to 2.0) reduces repetition of topics. Positive values penalize tokens that already appeared, encouraging the model to explore new topics.

frequency_penalty (-2.0 to 2.0) reduces repetition of specific tokens. Positive values penalize tokens based on their frequency in the output, discouraging verbatim repetition.

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Explain Python async/await"}],
    temperature=0.7,
    max_tokens=500,
    presence_penalty=0.6,
    frequency_penalty=0.3
)

Streaming responses

Streaming sends response tokens as they’re generated instead of waiting for completion. This improves perceived latency in user-facing applications. Users see text appearing progressively rather than waiting 5-10 seconds for a full response.

Enable streaming with stream=True. The API returns an iterator of chunk objects. Each chunk contains a delta with new content.

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Explain RAG in detail"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

The streaming response yields ChatCompletionChunk objects. Access content via chunk.choices[0].delta.content. The first chunk often has empty content, and the final chunk signals completion with finish_reason.

Handle streaming errors carefully. Network failures mid-stream leave partial responses. Wrap streaming in try/except and implement retry logic.

def stream_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            timeout=30
        )
        
        full_response = ""
        for chunk in response:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        
        return full_response
    except Exception as e:
        print(f"\nStreaming error: {e}")
        return None
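If you're using the async client from the quick start, the same pattern works with async iteration; a sketch:

async def stream_async(prompt):
    stream = await async_client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    # Chunks arrive asynchronously; print them as they come in
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)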

Model comparison

Choose models based on task complexity, latency requirements, and budget. GPT-5.2 excels at complex reasoning but costs more. GPT-5-mini handles simple tasks at 1/10th the cost.

| Model | Cost (1M tokens, input/output) | Context window | Speed | Use case |
| --- | --- | --- | --- | --- |
| GPT-5.2 | $8 / $24 | 200k | Medium | Complex reasoning, research, analysis |
| GPT-5 | $3 / $9 | 128k | Fast | General tasks, content generation |
| GPT-5-mini | $0.30 / $0.90 | 64k | Fastest | Simple tasks, high-volume processing |

GPT-5.2 outperforms GPT-5 on benchmarks requiring multi-step reasoning (GPQA, MATH, HumanEval). For tasks like classification, summarization, or simple Q&A, GPT-5-mini provides 90% of the quality at 10% of the cost.

Test your specific use case with all three models. Measure quality (human eval or automated metrics) and cost. Many applications can use GPT-5-mini for 80% of requests and route complex queries to GPT-5.2.
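One way to implement that routing, sketched below with a deliberately crude heuristic (tune the rule, and the model names, to your own workload):

def pick_model(prompt: str) -> str:
    # Illustrative heuristic: send long or analysis-heavy prompts to the larger model
    if len(prompt) > 2000 or "analyze" in prompt.lower():
        return "gpt-5.2"
    return "gpt-5-mini"

prompt = "Summarize this changelog in two sentences."
response = client.chat.completions.create(
    model=pick_model(prompt),
    messages=[{"role": "user", "content": prompt}]
)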

Function calling and tool integration

Function calling lets the model decide when to call external functions. Instead of generating text, the model outputs structured JSON with function names and arguments. Your code executes the function and returns results to the model, which incorporates them into the response.

This enables API integration, database queries, calculations, and external tool usage. The model determines which function to call based on the user’s request and available tools.

What is function calling?

Function calling solves the problem of structured output and external data access. Without it, you’d need to parse unstructured text to extract API calls or database queries. Function calling provides a structured interface.

The workflow:

  1. Define available functions with JSON schemas
  2. Send user message with function definitions
  3. Model returns function call (if needed) or text response
  4. Execute function with provided arguments
  5. Send function result back to model
  6. Model generates final response using function output

Use cases include weather APIs, database lookups, calculator functions, web searches, and any external data source the model needs to access.

Tool definitions

Define functions using JSON schemas. Each function needs a name, description, and parameter specification. The model uses descriptions to decide when to call functions.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location. Use this when users ask about weather conditions.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name or coordinates (e.g., 'San Francisco' or '37.7749,-122.4194')"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations. Use for arithmetic operations.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Mathematical expression to evaluate (e.g., '2 + 2' or 'sqrt(16)')"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

Write clear descriptions. The model uses them to decide which function matches the user’s intent. Specify parameter types, constraints (enums), and whether parameters are required.

Parallel function calls

The model can call multiple functions in a single request. This reduces latency when multiple operations are independent.

import json

def get_weather(location, unit="celsius"):
    # Simulate API call
    return {"temperature": 22, "condition": "sunny", "unit": unit}

def calculate(expression):
    # Demo only: eval is unsafe with untrusted input - use a proper expression parser in production
    try:
        result = eval(expression)
        return {"result": result}
    except Exception:
        return {"error": "Invalid expression"}

# User asks: "What's the weather in Tokyo and what's 15 * 24?"
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "What's the weather in Tokyo and what's 15 * 24?"}],
    tools=tools
)

# Model returns multiple tool calls
tool_calls = response.choices[0].message.tool_calls

if tool_calls:
    # Execute all function calls
    messages = [{"role": "user", "content": "What's the weather in Tokyo and what's 15 * 24?"}]
    messages.append(response.choices[0].message)
    
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        
        if function_name == "get_weather":
            result = get_weather(**arguments)
        elif function_name == "calculate":
            result = calculate(**arguments)
        
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    
    # Get final response with function results
    final_response = client.chat.completions.create(
        model="gpt-5",
        messages=messages
    )
    
    print(final_response.choices[0].message.content)
    # Output: "The weather in Tokyo is 22°C and sunny. 15 * 24 equals 360."

The model requests both function calls in a single response. Your code executes them (in parallel if you like) and returns the results, and the model then generates a natural language response incorporating both answers.

Tool choice control

Control whether the model must call functions or can respond with text. The tool_choice parameter accepts:

  • "auto": Model decides whether to call functions (default)
  • "required": Model must call at least one function
  • "none": Disable function calling for this request
  • {"type": "function", "function": {"name": "function_name"}}: Force specific function

# Force the model to call get_weather
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Tell me about Tokyo"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}}
)

Use "required" when you always need structured output. Use specific function selection when you know which function to call but want the model to extract parameters from natural language.
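For example, forcing a tool call on every request looks like this (a short sketch reusing the tools list defined earlier):

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "It's 15 degrees and raining in Paris right now"}],
    tools=tools,
    tool_choice="required"  # the model must return at least one tool call
)

print(response.choices[0].message.tool_calls)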

Embeddings for semantic search and clustering

Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors. Use embeddings for semantic search, clustering, recommendation systems, and anomaly detection.

text-embedding-3-large vs text-embedding-3-small

OpenAI provides two embedding models with different dimension sizes and costs.

| Model | Dimensions | Cost (1M tokens) | Use case |
| --- | --- | --- | --- |
| text-embedding-3-large | 3072 | $0.13 | High-quality semantic search, research |
| text-embedding-3-small | 1536 | $0.02 | Cost-sensitive applications, prototyping |

Higher dimensions capture more nuance but cost more to store and search. For most applications, text-embedding-3-small provides sufficient quality at 1/6th the cost.
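Both models also accept an optional dimensions parameter, which lets you request a shorter vector (cheaper to store and search) at some cost in quality. A quick sketch:

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Retrieval Augmented Generation combines retrieval with generation",
    dimensions=1024  # request a 1024-dimensional vector instead of the full 3072
)

For the rest of this section we'll stick with the default dimensions.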

# Generate embeddings
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Retrieval Augmented Generation combines information retrieval with text generation"
)

embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")
# Output: Embedding dimensions: 1536

The response contains a list of embedding objects. Each has an embedding field with the vector and an index field indicating position in the input batch.
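Since input also accepts a list, you can embed several texts in one request and match results back by position; a small sketch:

texts = ["first document", "second document", "third document"]
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

# response.data preserves input order; item.index gives the position explicitly
vectors = [item.embedding for item in response.data]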

Semantic search with embeddings

Build semantic search by embedding documents, storing vectors in a database, and finding nearest neighbors for queries.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "Python is a high-level programming language",
    "Machine learning models require training data",
    "Vector databases store embeddings for similarity search",
    "FastAPI is a modern web framework for Python",
    "Neural networks consist of layers of interconnected nodes"
]

# Generate embeddings for all documents
doc_embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)

# Convert to numpy array
doc_embeddings = np.array(doc_embeddings)

# Search function
def search(query, top_k=3):
    # Embed query
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array([query_response.data[0].embedding])
    
    # Calculate cosine similarity
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Get top k results
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            "document": documents[idx],
            "similarity": similarities[idx]
        })
    
    return results

# Test search
results = search("What is a web framework?")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['document']} (similarity: {result['similarity']:.3f})")

# Output:
# 1. FastAPI is a modern web framework for Python (similarity: 0.842)
# 2. Python is a high-level programming language (similarity: 0.721)
# 3. Vector databases store embeddings for similarity search (similarity: 0.654)

For production systems, use vector databases like FAISS, Pinecone, or Weaviate instead of computing similarities in Python. These databases provide approximate nearest neighbor search that scales to millions of vectors.
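As a minimal illustration, here's a FAISS version of the search above (assumes the faiss-cpu package and the doc_embeddings array already built; this uses an exact flat index, and FAISS also offers approximate indexes for larger collections). Inner product over normalized vectors is equivalent to cosine similarity:

import faiss

# Normalize the document vectors so inner product == cosine similarity
vectors = doc_embeddings.astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Embed and normalize a query, then retrieve the top 3 documents
q = client.embeddings.create(model="text-embedding-3-small", input="What is a web framework?")
q_vec = np.array([q.data[0].embedding], dtype="float32")
q_vec /= np.linalg.norm(q_vec)

scores, indices = index.search(q_vec, 3)
for score, idx in zip(scores[0], indices[0]):
    print(documents[idx], round(float(score), 3))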

Clustering and classification

Embeddings enable unsupervised clustering and few-shot classification. Similar documents cluster together in embedding space.

from sklearn.cluster import KMeans

# Cluster documents into 2 groups
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    print(f"Cluster {clusters[i]}: {doc}")

# Output:
# Cluster 0: Python is a high-level programming language
# Cluster 1: Machine learning models require training data
# Cluster 1: Vector databases store embeddings for similarity search
# Cluster 0: FastAPI is a modern web framework for Python
# Cluster 1: Neural networks consist of layers of interconnected nodes

KMeans groups Python/FastAPI together (programming) and ML/vectors/neural networks together (AI/ML) without any supervision; the structure comes from the embeddings alone.
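Few-shot classification works the same way: embed a handful of labeled examples and assign new text the label of its nearest neighbor. A sketch (the labels and example sentences here are made up for illustration):

labeled_examples = {
    "FastAPI returns JSON responses by default": "programming",
    "Gradient descent minimizes the loss function": "machine-learning",
}

example_vectors, labels = [], []
for text, label in labeled_examples.items():
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    example_vectors.append(emb.data[0].embedding)
    labels.append(label)

def classify(text):
    # Nearest labeled example wins
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    sims = cosine_similarity([emb.data[0].embedding], example_vectors)[0]
    return labels[int(np.argmax(sims))]

print(classify("Transformers are trained with backpropagation"))
# Likely output: machine-learning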

Image analysis with GPT-5.2 Vision

GPT-5.2 Vision analyzes images and answers questions about visual content. Use it for image captioning, OCR, visual question answering, and content moderation.

Supported image formats

The API accepts images as URLs or base64-encoded data. Supported formats include JPEG, PNG, GIF, and WebP. Maximum file size is 20MB.

# Analyze image from URL
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

For base64 images, encode the file and include it in the data URL:

import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("diagram.png")

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)

Multi-image analysis

Send multiple images in a single request. The model analyzes all images and answers questions about them.

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two architecture diagrams"},
                {"type": "image_url", "image_url": {"url": "https://example.com/arch1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/arch2.jpg"}}
            ]
        }
    ]
)

Use cases include comparing before/after images, analyzing multi-page documents, or processing image sequences.

OCR and text extraction

GPT-5.2 Vision extracts text from images without dedicated OCR libraries. It handles handwriting, complex layouts, and multiple languages.

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this image and format it as markdown"},
                {"type": "image_url", "image_url": {"url": "https://example.com/document.jpg"}}
            ]
        }
    ]
)

extracted_text = response.choices[0].message.content
print(extracted_text)

The model understands document structure and can format extracted text appropriately (preserving headings, lists, tables).

Production error handling patterns

Production systems need robust error handling. The OpenAI API returns specific exceptions for different failure modes. Handle them appropriately to build reliable applications.

Common errors

The SDK raises typed exceptions for different error conditions:

RateLimitError: You exceeded your rate limit (requests per minute or tokens per minute). This happens during traffic spikes or when processing large batches.

APIError: OpenAI’s servers returned a 500 error. This indicates a temporary server issue. Retry with exponential backoff.

AuthenticationError: Invalid API key or insufficient permissions. Check your API key and organization settings.

BadRequestError: Malformed request (invalid parameters, unsupported model, etc.). Fix the request parameters. (In pre-1.0 code this was InvalidRequestError.)

APIConnectionError: Network failure or timeout. Retry with backoff.

from openai import OpenAI, RateLimitError, APIError, AuthenticationError

client = OpenAI()

try:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded: {e}")
    # Implement backoff and retry
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
    # Check API key
except APIError as e:
    # Note: RateLimitError and AuthenticationError subclass APIError,
    # so this broader handler must come after them
    print(f"Server error: {e}")
    # Retry after delay
except Exception as e:
    print(f"Unexpected error: {e}")

Retry strategies with exponential backoff

Implement retries for transient errors (rate limits, server errors). Use exponential backoff to avoid overwhelming the API during outages.

The tenacity library provides decorators for retry logic:

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from openai import RateLimitError, APIConnectionError, InternalServerError

@retry(
    # Retry only transient failures; catching APIError itself would be too broad
    # because AuthenticationError and BadRequestError are subclasses of it
    retry=retry_if_exception_type((RateLimitError, APIConnectionError, InternalServerError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_gpt5(prompt):
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Usage
try:
    result = call_gpt5("Explain Python decorators")
    print(result)
except Exception as e:
    print(f"Failed after retries: {e}")

This makes up to three attempts (the initial call plus two retries), waiting roughly 2s and then 4s between them, capped at 10s. The decorator retries only rate limit, connection, and server errors, not authentication or invalid request errors.

Timeout configuration

Set timeouts to prevent hanging requests. The SDK accepts a timeout parameter (in seconds).

client = OpenAI(timeout=30.0)  # 30 second timeout

# Or per-request
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Long task"}],
    timeout=60.0
)

Use shorter timeouts (10-30s) for user-facing applications. Use longer timeouts (60-120s) for batch processing or complex reasoning tasks.

Production deployment and optimization

Production systems need cost tracking, logging, caching, and rate limiting. These patterns reduce costs and improve reliability.

Cost tracking with token counting

Track token usage to monitor costs and optimize prompts. The tiktoken library counts tokens for OpenAI models.

import tiktoken

def count_tokens(text, model="gpt-5"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # tiktoken may not recognize newer model names; fall back to a recent base encoding
        encoding = tiktoken.get_encoding("o200k_base")
    return len(encoding.encode(text))

# Count tokens in a prompt
prompt = "Explain retrieval augmented generation in detail"
token_count = count_tokens(prompt)
print(f"Prompt tokens: {token_count}")

# Estimate cost
input_cost_per_1m = 3.00  # GPT-5
output_cost_per_1m = 9.00
estimated_output_tokens = 500

input_cost = (token_count / 1_000_000) * input_cost_per_1m
output_cost = (estimated_output_tokens / 1_000_000) * output_cost_per_1m
total_cost = input_cost + output_cost

print(f"Estimated cost: ${total_cost:.6f}")

Log token usage for every request. Aggregate by user, endpoint, or time period to identify cost drivers.
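tiktoken gives you an estimate before you send a request; the response itself carries exact counts in response.usage, which is what you should log. A sketch using the prices from the model comparison table above:

# Prices per 1M tokens (input, output), taken from the comparison table above
PRICES = {"gpt-5.2": (8.00, 24.00), "gpt-5": (3.00, 9.00), "gpt-5-mini": (0.30, 0.90)}

def completion_cost(response, model="gpt-5"):
    input_price, output_price = PRICES[model]
    usage = response.usage  # exact token counts reported by the API
    return (usage.prompt_tokens / 1_000_000) * input_price + \
           (usage.completion_tokens / 1_000_000) * output_price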

Caching strategies

Cache responses for identical prompts. This eliminates redundant API calls and reduces costs.

Use Redis for distributed caching:

import redis
import json
import hashlib

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_completion(prompt, model="gpt-5", ttl=3600):
    # Create cache key
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Call API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    result = response.choices[0].message.content
    
    # Store in cache
    redis_client.setex(cache_key, ttl, json.dumps(result))
    
    return result

Set appropriate TTLs based on content freshness requirements. Cache static content (documentation Q&A) for hours or days. Cache dynamic content (news summaries) for minutes.

Rate limiting

Respect OpenAI’s rate limits to avoid 429 errors. Implement client-side throttling for high-volume applications.

GPT-5 tier limits (as of 2026):

  • Free tier: 200 requests/day, 40k tokens/day
  • Tier 1: 500 requests/minute, 200k tokens/minute
  • Tier 2: 5,000 requests/minute, 2M tokens/minute

Use a token bucket algorithm for rate limiting:

import time
from threading import Lock

class RateLimiter:
    def __init__(self, requests_per_minute):
        self.requests_per_minute = requests_per_minute
        self.tokens = requests_per_minute
        self.last_update = time.time()
        self.lock = Lock()
    
    def acquire(self):
        with self.lock:
            now = time.time()
            elapsed = now - self.last_update
            
            # Refill tokens
            self.tokens = min(
                self.requests_per_minute,
                self.tokens + elapsed * (self.requests_per_minute / 60)
            )
            self.last_update = now
            
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            else:
                # Wait until next token available
                wait_time = (1 - self.tokens) * (60 / self.requests_per_minute)
                time.sleep(wait_time)
                self.tokens = 0
                return True

# Usage
limiter = RateLimiter(requests_per_minute=500)

def rate_limited_completion(prompt):
    limiter.acquire()
    return client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}]
    )

Logging and monitoring

Log all requests and responses for debugging and analysis. Track latency, error rates, and token usage.

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def monitored_completion(prompt, model="gpt-5"):
    start_time = time.time()
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        latency = time.time() - start_time
        
        logger.info({
            "model": model,
            "prompt_length": len(prompt),
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
            "latency_ms": latency * 1000,
            "status": "success"
        })
        
        return response.choices[0].message.content
        
    except Exception as e:
        latency = time.time() - start_time
        
        logger.error({
            "model": model,
            "error": str(e),
            "error_type": type(e).__name__,
            "latency_ms": latency * 1000,
            "status": "error"
        })
        
        raise

Integrate with observability platforms like LangSmith, Arize, or Datadog for production monitoring.

Frequently asked questions

What’s the difference between GPT-5.2 and GPT-5?

GPT-5.2 is the flagship model with superior reasoning, longer context (200k vs 128k tokens), and better performance on complex tasks. It costs 2.7x more than GPT-5. Use GPT-5.2 for research, analysis, and tasks requiring multi-step reasoning. Use GPT-5 for general content generation, summarization, and conversational AI.

How do I reduce API costs?

Use GPT-5-mini for simple tasks (classification, extraction, simple Q&A). It costs 1/10th of GPT-5 with 90% of the quality. Implement caching to avoid redundant API calls. Count tokens and optimize prompts to reduce input length. Use lower max_tokens limits to cap response length. Batch requests when possible.

Can I use the SDK with Azure OpenAI?

Yes. Azure OpenAI provides the same models through a different endpoint. Initialize the client with Azure credentials:

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-azure-key",
    api_version="2024-02-01",
    azure_endpoint="https://your-resource.openai.azure.com"
)

The API is identical except for authentication and endpoint configuration.

What’s the rate limit for GPT-5?

Rate limits depend on your usage tier. Tier 1 (paid accounts) gets 500 requests/minute and 200k tokens/minute. Tier 2 gets 5,000 requests/minute and 2M tokens/minute. Check your limits at platform.openai.com/account/limits.

How do I handle long conversations that exceed context limits?

Implement conversation summarization. When the conversation approaches the context limit (128k tokens for GPT-5), summarize old messages and keep only recent context. Use a sliding window approach or hierarchical summarization. Alternatively, use the Assistants API which manages conversation state automatically.
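A minimal sliding-window sketch, reusing the count_tokens helper from the cost tracking section (the 100k budget is an arbitrary example below GPT-5's 128k limit):

def trim_history(messages, max_tokens=100_000):
    # Keep the system message, drop the oldest turns until the rest fits the budget
    system, rest = messages[0], messages[1:]
    while rest and sum(count_tokens(m["content"]) for m in rest) > max_tokens:
        rest.pop(0)
    return [system] + rest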

Should I use sync or async client?

Use AsyncOpenAI() for async applications (FastAPI, asyncio-based services). Use OpenAI() for synchronous scripts and applications. The async client provides better concurrency when making multiple API calls. Don’t use async unless your application is already async.

How do I test OpenAI integrations?

Mock the OpenAI client in tests. Use dependency injection to swap the real client with a mock:

from unittest.mock import Mock

def test_completion():
    mock_client = Mock()
    mock_response = Mock()
    mock_response.choices = [Mock(message=Mock(content="Test response"))]
    mock_client.chat.completions.create.return_value = mock_response
    
    # Test your code with mock_client
    result = your_function(mock_client)
    assert result == "Test response"

For integration tests, use a test API key with low rate limits and monitor costs.

Conclusion

The OpenAI Python SDK 1.x provides a modern, type-safe interface to GPT-5.2, GPT-5, embeddings, vision, and assistants. The SDK uses client instances, Pydantic response models, and proper async support. Production systems need error handling with retries, cost tracking with token counting, caching for redundant requests, and rate limiting to respect API quotas.

Start with GPT-5-mini for prototyping and simple tasks. Use GPT-5 for general applications and GPT-5.2 for complex reasoning. Implement function calling to integrate external APIs and tools. Use embeddings for semantic search and clustering. Add vision capabilities for image analysis. Deploy assistants for stateful conversations with built-in code execution and file search.

Test different models on your specific use case. Measure quality and cost. Optimize prompts to reduce token usage. Implement caching and rate limiting. Monitor latency and error rates. The SDK provides the building blocks for production AI applications in 2026.
