The Art & Science of Prompt Engineering
Prompt engineering is the highest-leverage skill in AI development. Well-crafted prompts can improve output quality by 50-80%, reduce token costs by 30-50%, and often eliminate the need for fine-tuning entirely.
Core Techniques
1. Zero-Shot Prompting
Direct instruction without examples:
```
Classify the sentiment of this review as positive, negative, or neutral:
"The product arrived late but quality is excellent."
```
When to use: simple tasks and quick prototyping. Typical accuracy: 60-75%.
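In code, zero-shot prompting is a single instruction in the user message. Here is a minimal sketch using the OpenAI Python SDK; the model name and temperature are illustrative choices, not requirements:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

review = "The product arrived late but quality is excellent."

# Zero-shot: one direct instruction, no examples
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; any chat model works
    temperature=0.0,        # deterministic output suits classification
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this review as positive, "
                   f'negative, or neutral: "{review}"',
    }],
)
print(response.choices[0].message.content)
```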
2. Few-Shot Learning
Provide 2-5 examples of desired input-output pairs:
```
Extract key information from customer emails:

Email: "Hi, I need to return order #12345. It's too small."
Output: {"order_id": "12345", "intent": "return", "reason": "size"}

Email: "When will order #67890 ship?"
Output: {"order_id": "67890", "intent": "shipping_inquiry", "reason": null}

Email: "I want to cancel my subscription immediately."
Output:
```
Results: typically 75-90% accuracy with 3-5 examples. A strong default for most tasks.
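A common way to encode few-shot examples in the chat APIs is as alternating user/assistant turns, so the model sees each input paired with its ideal output. A sketch (the examples mirror the prompt above; moving the instruction into a system message is a stylistic choice):

```python
from openai import OpenAI

client = OpenAI()

# Each example becomes a user turn (input) plus an assistant turn
# (the ideal output we want the model to imitate).
examples = [
    ('Email: "Hi, I need to return order #12345. It\'s too small."',
     '{"order_id": "12345", "intent": "return", "reason": "size"}'),
    ('Email: "When will order #67890 ship?"',
     '{"order_id": "67890", "intent": "shipping_inquiry", "reason": null}'),
]

messages = [{"role": "system",
             "content": "Extract key information from customer emails as JSON."}]
for user_text, ideal_output in examples:
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": ideal_output})
messages.append({"role": "user",
                 "content": 'Email: "I want to cancel my subscription immediately."'})

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```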
3. Chain-of-Thought (CoT)
Ask the model to explain its reasoning step-by-step:
```
Problem: A store sold 48 apples. 1/3 were green, rest were red.
How many red apples?

Solve this step-by-step:
1. Calculate number of green apples
2. Calculate number of red apples
3. Provide final answer
```
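For reference, one correct step-by-step response to this prompt looks like:

```
1. Green apples: 1/3 of 48 = 16
2. Red apples: 48 - 16 = 32
3. Final answer: 32 red apples
```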
Results: 30-50% improvement on complex reasoning tasks.
4. Role Prompting
Assign specific expertise to the model:
```
You are an expert Python developer with 15 years of experience in
data engineering. Review this code for bugs and suggest improvements:

[code here]
```
Impact: 20-40% improvement in specialized tasks.
5. System Messages (ChatGPT/Claude)
```
System: You are a helpful customer service agent for TechCorp. Be concise,
professional, and always provide order numbers when discussing purchases.
Never make promises about refunds without manager approval.

User: I want my money back!
```
Use case: Set consistent behavior and constraints.
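In the chat APIs, the system message is simply the first entry in the message list; role prompting from the previous section is set the same way. A sketch, with the TechCorp persona copied from the example above and an illustrative model choice:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful customer service agent for TechCorp. "
    "Be concise, professional, and always provide order numbers when "
    "discussing purchases. Never make promises about refunds without "
    "manager approval."
)

# The system message sets persistent behavior for every turn that follows.
response = client.chat.completions.create(
    model="gpt-4",  # illustrative
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "I want my money back!"},
    ],
)
print(response.choices[0].message.content)
```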
6. Output Formatting
Request specific output format:
```
Extract company information and return as JSON:

Text: "Apple Inc., founded in 1976, is headquartered in Cupertino, CA."

Output format:
{
  "company": "company name",
  "founded": "year",
  "headquarters": "location"
}
```
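When you request JSON, parse and validate the reply rather than trusting it blindly, since models sometimes wrap the object in prose or markdown fences. A minimal sketch; the brace-extraction heuristic is our own assumption, not a library feature:

```python
import json

def parse_model_json(raw: str) -> dict:
    """Extract and parse the first JSON object in a model reply."""
    # Locate the outermost braces so surrounding prose or markdown
    # fences don't break json.loads.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in: {raw!r}")
    return json.loads(raw[start : end + 1])

reply = '{"company": "Apple Inc.", "founded": "1976", "headquarters": "Cupertino, CA"}'
print(parse_model_json(reply)["company"])  # Apple Inc.
```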
Advanced Techniques
ReAct (Reasoning + Acting)
Combine reasoning with tool use:
```
You have access to: [search_web, calculate, get_weather]

Question: What's the total GDP of countries with population > 100M?

Thought: I need to find countries with population > 100M
Action: search_web("countries population over 100 million")
Observation: [list of countries]
Thought: Now I need their GDPs
Action: search_web("GDP of [countries]")
...
```
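A ReAct agent is essentially a loop: send the transcript, parse the model's Action line, run the named tool, append the Observation, and repeat until the model produces a final answer. A bare-bones sketch in which the tool stubs, the Action/Answer line format, and the step limit are all illustrative assumptions:

```python
import re
from openai import OpenAI

client = OpenAI()

# Hypothetical tool stubs for illustration; real tools would call actual APIs.
TOOLS = {
    "search_web": lambda q: f"[stub search results for: {q}]",
    "calculate": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
}

def react(question: str, max_steps: int = 5) -> str:
    transcript = (
        f"You have access to: {list(TOOLS)}\n"
        "Alternate Thought/Action lines. Write actions as "
        'Action: tool_name("argument") and finish with "Answer: ...".\n\n'
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4",  # illustrative
            messages=[{"role": "user", "content": transcript}],
            stop=["Observation:"],  # pause so we can run the tool ourselves
        ).choices[0].message.content
        transcript += reply + "\n"
        if "Answer:" in reply:  # the model has finished reasoning
            return reply.split("Answer:", 1)[1].strip()
        match = re.search(r'Action:\s*(\w+)\("(.*)"\)', reply)
        if match is None:
            return reply  # no parsable action; give up gracefully
        tool, arg = match.groups()
        transcript += f"Observation: {TOOLS[tool](arg)}\n"
    return "No answer within max_steps."
```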
Self-Consistency
Generate multiple reasoning paths and choose the most common answer (see the sketch after this list):
- Run same prompt 5-10 times with temperature 0.7
- Use majority voting on final answers
- 20-30% accuracy improvement on complex reasoning
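A minimal sketch of self-consistency over the chain-of-thought apple problem from earlier; the "Final answer" extraction format and the sample count are illustrative choices:

```python
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPT = ('A store sold 48 apples. 1/3 were green, the rest were red. '
          'How many red apples? Think step by step, then end with '
          '"Final answer: <number>".')

def self_consistent_answer(prompt: str, samples: int = 7) -> str:
    answers = []
    for _ in range(samples):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative
            temperature=0.7,        # diversity across reasoning paths
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        match = re.search(r"Final answer:\s*(\d+)", reply)
        if match:
            answers.append(match.group(1))
    # Majority vote over the extracted final answers
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer(PROMPT))  # expected: 32
```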
Parameter Tuning
Temperature
- 0.0-0.3: Deterministic, factual (data extraction, classification)
- 0.7-0.9: Balanced creativity (content generation, conversations)
- 1.0+: Highly creative (brainstorming, creative writing)
Top-p (Nucleus Sampling)
- 0.1: Very focused (use with low temperature)
- 0.9: Balanced (default for most tasks)
- 1.0: Maximum diversity
Max Tokens
- Set generous limits for reasoning tasks (1000-2000 tokens)
- Constrain for simple tasks (100-300 tokens)
- Monitor the cost vs. quality tradeoff; all three parameters are set per request, as the sketch after this list shows
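All three knobs are per-request arguments in the chat APIs. A sketch contrasting an extraction-style call with a creative one, with parameter values taken from the guidelines above and placeholder prompts:

```python
from openai import OpenAI

client = OpenAI()

# Factual extraction: deterministic, focused, short
extraction = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative
    temperature=0.0,
    top_p=0.1,
    max_tokens=150,
    messages=[{"role": "user",
               "content": "List the capitals of France and Japan."}],
)

# Creative generation: higher temperature, more room to write
brainstorm = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0.9,
    top_p=0.9,
    max_tokens=1500,
    messages=[{"role": "user",
               "content": "Brainstorm ten names for a hiking app."}],
)
```

Note that OpenAI's documentation suggests tuning temperature or top_p rather than both at once; the pairings above simply mirror the guidelines in this section.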
Model-Specific Tips
GPT-4
- Excels at complex reasoning and coding
- Use system messages for consistent behavior
- 32K context: leverage long conversations and documents
- Best for: code generation, analysis, complex reasoning
Claude 2/3
- 100K-200K context: process entire books
- Strong at following detailed instructions
- More conservative, less prone to hallucination
- Best for: document analysis, content moderation, safe responses
GPT-3.5 Turbo
- Roughly 10x cheaper than GPT-4
- Use for simple tasks: classification, simple Q&A, summarization
- Requires more explicit prompts than GPT-4
Common Pitfalls
- Vague Instructions: Be specific about format, length, tone
- No Examples: Add 2-3 examples for 30-50% accuracy boost
- Wrong Temperature: Low for factual, high for creative
- Ignoring Context Window: Monitor token usage, summarize long conversations
- Not Testing Variations: A/B test prompts, measure quality
Testing & Optimization
- Create Test Set: 50-100 representative examples
- Iterate Prompts: Test variations systematically
- Measure Quality: Accuracy, completeness, format compliance
- Monitor Costs: Track tokens per request, optimize length
- Version Control: Track prompt changes and performance over time (a minimal eval harness is sketched below)
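Even a tiny harness makes that loop concrete. A sketch that scores one prompt variant against a labeled test set; the JSONL file format, the exact-match scoring rule, and the templates are all assumptions for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()

def evaluate(template: str, test_path: str = "test_set.jsonl") -> float:
    """Return the fraction of test cases a prompt template answers exactly."""
    correct = total = 0
    with open(test_path) as f:  # one {"input": ..., "expected": ...} per line
        for line in f:
            case = json.loads(line)
            reply = client.chat.completions.create(
                model="gpt-3.5-turbo",  # illustrative
                temperature=0.0,        # deterministic, so evals are repeatable
                messages=[{"role": "user",
                           "content": template.format(input=case["input"])}],
            ).choices[0].message.content.strip()
            correct += reply == case["expected"]
            total += 1
    return correct / total

# A/B test variants side by side; keep whichever template scores highest.
for template in ('Classify sentiment: {input}',
                 'Classify the sentiment as positive, negative, or neutral: {input}'):
    print(f"{evaluate(template):.0%}  {template}")
```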
Real-World Examples
Customer Email Classification
- Technique: Few-shot (3 examples) + output formatting (JSON)
- Model: GPT-3.5 Turbo
- Result: 92% accuracy, 0.5s latency, ₹0.02 per classification
Code Review Agent
- Technique: Role prompting + chain-of-thought + system message
- Model: GPT-4
- Result: 87% bug detection rate, comparable to human reviewers
Legal Contract Analysis
- Technique: Claude 2 (100K context) + structured output
- Result: Processes 50-page contracts in about 30 seconds vs. roughly 2 hours of human review time
Need help optimizing your LLM prompts? Get a free prompt audit and optimization recommendations.