Course Overview

AI Career Bootcamp

A practical 8-week program to move from AI curious to AI native — grounded in real job postings and the 7 skills employers actually hire for.

Why This Course Exists

The AI job market is K-shaped. Traditional roles are flat or falling. AI-native roles are growing so fast that for every qualified candidate there are 3.2 jobs. Average time to fill: 142 days.

Companies are desperate for people who can build, operate, and evaluate AI systems — not just use them. This course teaches the 7 skills that appear in hundreds of real job postings.

Course Structure

8 weeks, 4–6 hours per week. Self-paced. Each week builds on the last.

Week | Focus       | Core Skill
-----|-------------|-------------------------------
1    | Foundations | Specification Precision
2    | Quality     | Evaluation & Quality Judgment
3    | Systems     | Multi-Agent Decomposition
4    | Reliability | Failure Pattern Recognition
5    | Safety      | Trust & Security Design
6    | Scale       | Context Architecture
7    | Economics   | Cost & Token Economics
8    | Integration | Capstone + Portfolio
Prerequisites
Basic AI tool usage (ChatGPT, Claude, or similar). Everything else is built from scratch here.
✋
Before You Start — Self-Check

Answer these honestly. They'll help you track your growth.

  • Have you shipped an AI feature to users (even a simple chatbot)?
  • Can you name 3 ways AI fails differently than humans fail?
  • Have you estimated the cost of an AI-powered feature before building it?
  • Do you know what a "context window" is and why it matters?

No prep needed if you're new — that's exactly what weeks 1–4 are for.

The Reality

The K-Shaped Job Market

Two markets moving in opposite directions. Understanding this split is the first step to being on the right side of it.

The Split

Market 1: Traditional knowledge work — standard PMs, conventional SWEs, business analysts, general administrators. Job openings: flat or falling.

Market 2: AI-native roles — people who design, build, operate, and manage AI systems. Growing fast. Extremely in demand.

"There are essentially infinite AI jobs right now. Not growing demand. Not a hot sector. Functionally infinite. And they cannot find qualified people."
— Nate Jones, source research

The Numbers
Metric                  | Value    | What It Means
------------------------|----------|------------------------------------------
Jobs-to-Candidate Ratio | 3.2 : 1  | Three jobs for every qualified person
Time to Fill AI Role    | 142 days | Nearly five months per role
AI Jobs (est.)          | 1.6M     | ManpowerGroup estimate (likely low)
Qualified Applicants    | ~500K    | If you're here, you write your own ticket
Why "I Applied to 500 Jobs and Got Nothing"
You're probably applying to Market 1. The commodity basket is crowded because everyone can do it. Market 2 has the opposite problem — not enough qualified people.
The Other Problem: Bad Actors

Not all job postings are real. Nate Jones found:

  • Resume farming: Companies post AI roles they don't intend to fill, using applications as free labor to learn what candidates know
  • Whitewashed roles: "AI PM" but actually just regular PM with AI tools
  • Overstated skills: Candidates claiming AI expertise they don't have

The skill framework in this course is specifically designed to cut through this noise — these are learnable skills tied to how AI actually works.

The Good News
All 7 skills are learnable. You don't need a CS degree. You don't need to be a genius. You need specificity and practice. That's it.
The 7 Skills

What Employers Actually Want

Derived from hundreds of real job postings, reverse-engineered into the sub-skills employers are actually screening for. These are tied to how AI works — not hype cycles.

Skill 1
Specification Precision
Write exact specs agents can execute without inference
Skill 2
Evaluation & Quality
Detect AI errors before they reach production
Skill 3
Multi-Agent Decomposition
Break complex projects into agent-sized chunks
Skill 4
Failure Pattern Recognition
Diagnose why agentic systems break — and fix them
Skill 5
Trust & Security Design
Draw the line between human and agent authority
Skill 6
Context Architecture
Build the information infrastructure agents run on
Skill 7
Cost & Token Economics
Mathematically justify AI investments before building
Test Projects
Apply all 7 skills in real scenarios
The Skill That's #1 on Every Posting

Evaluation & quality judgment — checking whether AI output is actually correct vs. just sounding correct. AI is confidently wrong in ways humans don't instinctively catch.

Skill 1

Specification Precision

Not "prompting." Writing exact, unambiguous instructions that agents execute without inferring intent. The 2026 standard for working with AI.

Week 1
The Fill-in-the-Blank Problem

Humans read between the lines. We infer intent from context, body language, past conversations. Agents don't. They take what you give them literally and fill in the rest with their best guess.

The result: Vague prompt → plausible-sounding output that misses your actual goal → you assume the AI is smart enough to figure it out → it wasn't.

Why this matters for hiring
In 2026, "good at AI" means "can specify exactly what I want." This is why technical writers, lawyers, and QA engineers have a head start — they've trained in exact documentation.
The 2026 Standard

Here's the difference between what most people call "prompting" and what employers mean by specification precision:

āŒ Vague:
"Help with customer support"

✅ Precise:
"Build a tier-1 ticket agent that:
- Handles password resets (account verification required)
- Handles order status inquiries (read-only, no changes)
- Handles return initiations (orders < 30 days, original packaging)
- Escalates to human when: sentiment score < 0.3 OR
  ticket involves billing disputes > $200 OR
  customer uses keyword 'lawyer' or 'attorney'
- Logs every escalation with reason_code and customer_sentiment_score
- Never: issue refunds, change shipping addresses, share internal pricing"

The vague version takes 5 seconds to write. The precise version takes 10 minutes. The precise version is what gets hired.

Who Has a Head Start
Profession        | Why They Transfer
------------------|------------------------------------------------------
Technical Writers | Trained to write for audiences who can't infer
Lawyers           | Precision is liability — they already think this way
QA Engineers      | Writing testable specs is the job
Editors           | Already spotting ambiguity and imprecision
Accountants       | Exact definitions, no room for interpretation
āœļø
Exercise: Specification Audit
30 minutes

Find a vague task you've given an AI in the past week. Rewrite it as a precise specification.

  • Define exact inputs — what data does the agent receive?
  • Define exact outputs — what does success look like?
  • Define boundaries — what does it not handle?
  • Define escalation — when does it flag for human review?
  • Define success metrics — how do you measure correctness?

Test: Give your old vague prompt and your new precise spec to the same AI. Compare outputs. Document the difference.

🎯
Exercise: Decompose These Tasks
45 minutes

Write precise specifications for each:

  • An agent that triages inbound sales leads
    Hint: What qualifies a lead? What disqualifies? Who escalates?
  • An agent that summarizes legal contracts
    Hint: What sections matter? What risk flags? What can't it do?
  • An agent that drafts code review summaries
    Hint: What context matters? What's the output format? What triggers flags?
How to Build Specification Muscle

The key insight: specification is not about the AI. It's about knowing what you want.

  • Before every AI interaction, write down what you expect the output to look like
  • If you can't describe what you want in writing, the AI can't produce it
  • Test: could a new hire execute this from your prompt alone? If not, it's too vague
  • Get feedback: show your specs to someone in your field and ask what's missing
Skill 2

Evaluation & Quality Judgment

The single most-cited skill in AI job postings. The ability to detect when AI output is actually wrong — not just confident.

Week 2
The Confidence Problem

Humans stumble when they're wrong. We hesitate, qualify, backtrack. AI doesn't. AI generates text that looks exactly the same whether it's right or wrong. The confident tone implies correctness — and it's a lie.

The fluency trap

When AI output looks polished, well-structured, and confident, humans instinctively trust it. This is the failure mode that causes real harm — wrong code shipped to production, incorrect legal summaries filed, bad data fed into decision systems.

The skill: Resisting the temptation to read fluency as competence. Building internal barometers for quality that don't depend on how confident the AI sounds.

Semantic vs. Functional Correctness

Semantic: "The AI said the right things" — the output sounds correct, uses the right terminology, follows the right structure.

Functional: "The AI did the right thing" — the output achieves the actual goal, the data is accurate, the recommendation is valid.

Example: An AI recommends a credit card. It explains its reasoning perfectly. Semantically correct. But the card it recommends doesn't exist in the system. Functionally wrong.

This gap — between "sounds right" and "is right" — is where evaluation lives.
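A functional check closes that gap by validating output against the system of record, not against how the output reads. A minimal sketch, where the catalog set and `product_id` field are hypothetical stand-ins for your real product database:

```python
# Hypothetical sketch: a functional-correctness gate for a recommendation
# agent. The CATALOG set stands in for a real system-of-record lookup.
CATALOG = {"CARD-PLAT-01", "CARD-CASH-02"}  # assumed product IDs

def functionally_valid(recommendation: dict) -> bool:
    """A semantic check asks 'does the reasoning read well?'
    This asks 'does the recommended product actually exist?'"""
    return recommendation["product_id"] in CATALOG

rec = {"product_id": "CARD-GOLD-99", "rationale": "Great travel rewards..."}
print(functionally_valid(rec))  # → False: fluent reasoning, nonexistent card
```

The rationale text never enters the check at all, which is the point: fluency is not evidence.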

Building Eval Frameworks

An eval framework is a systematic quality barometer for AI output.

For any AI task, define:

  • What correct looks like — 3 to 5 concrete criteria
  • What borderline looks like — acceptable but not ideal
  • What failure looks like — detectable, specific failure modes
  • What edge cases look like — the 10% of situations that break the general case
Example: Code Review Agent Eval

Criteria:
✓ All security vulnerabilities caught (OWASP Top 10)
✓ Performance issues flagged (> O(n²) without justification)
✓ Style deviations from team guidelines noted
✓ Every "LGTM" has a specific reason, not rubber-stamp approval

Edge cases that should fail:
✗ Silent approval of code with known CVEs
✗ Missing error handling in async code
✗ Approving code that contradicts PR description
🔍
Exercise: The Audit Test
45 minutes

Take AI output on a topic you know deeply — your area of expertise. Act as if you're the editor or auditor responsible for its accuracy.

  • Find 1 factual error the AI made
  • Find 1 edge case it missed
  • Find 1 place where it "sounded right" but wasn't

Document these. This is your eval muscle forming. Most people find errors they would have missed if they hadn't been looking deliberately.

🛠️
Exercise: Build an Eval Harness
60 minutes

For one AI task you do repeatedly:

  • Define 5 concrete criteria for "correct" output
  • Write a 5-question checklist someone could use to evaluate the output
  • Identify 3 edge cases that should trigger a "fail" rating

Congratulations — you just built an eval harness. This is what employers mean when they say "build evaluation frameworks."
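In code, the same harness is just a set of named pass/fail criteria applied to every output. A minimal sketch, with illustrative placeholder criteria you would swap for your own:

```python
# Minimal eval-harness sketch: each criterion is a predicate over the output.
# The three criteria here are illustrative placeholders, not a standard.
CRITERIA = {
    "non_empty": lambda out: len(out.strip()) > 0,
    "cites_source": lambda out: "source:" in out.lower(),
    "under_length_cap": lambda out: len(out.split()) <= 200,
}

def evaluate(output: str) -> dict:
    """Run every criterion; return per-criterion results plus a verdict."""
    results = {name: check(output) for name, check in CRITERIA.items()}
    results["overall_pass"] = all(results.values())
    return results

print(evaluate("Summary of Q3 numbers. Source: finance dashboard."))
```

The structure matters more than the specific checks: named criteria, binary results, one overall verdict you can track over time.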

Skill 3

Multi-Agent Task Decomposition

Breaking complex projects into agent-sized work units and orchestrating planner/sub-agent architectures. The skill that separates single-use AI from scalable AI.

Week 3
Why Single Agents Hit Walls

A single agent has hard limits: context window size, task complexity it can hold in memory, and number of steps it can execute before losing the thread. Complex projects — "build our entire customer onboarding flow" — can't be done by one agent in one shot.

The solution: Decompose the project into discrete tasks, each handled by a specialized agent, coordinated by a planner that maintains state across the full run.

The Key Distinction from Regular PM

Human managers: "Figure out the details as you go. Use your judgment." Agents can't do this.

Human PM decomposition:
"Go handle the product launch. 
You know what needs to happen. 
Loop in marketing when you need them."

Agent decomposition:
"Planner Agent:
1. Coordinate sub-agents for: market research, 
   competitor analysis, pricing strategy, 
   content calendar, launch checklist, 
   post-mortem template
2. Each sub-agent receives exact task specs
3. Each sub-agent returns output to planner
4. Planner verifies output quality before 
   proceeding to next task
5. If any sub-agent fails twice, escalate to human"

The decomposition is the product spec for the multi-agent system. Bad decomposition = system that fails in predictable ways.
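The planner loop described above can be sketched in a few lines. This is a skeleton under stated assumptions: `run_agent` and `verify` are placeholders for real model calls and quality gates, and "two tries then escalate" mirrors step 5 of the decomposition:

```python
# Sketch of the planner/sub-agent loop above. run_agent and verify are
# stand-ins for real model calls; the escalation rule mirrors the spec.
def run_agent(task: str) -> str:
    return f"output for {task}"         # placeholder for a sub-agent call

def verify(task: str, output: str) -> bool:
    return output.startswith("output")  # placeholder quality gate

def planner(tasks: list[str]) -> dict[str, str]:
    results = {}
    for task in tasks:
        for _attempt in range(2):       # two tries, then escalate to a human
            output = run_agent(task)
            if verify(task, output):
                results[task] = output
                break
        else:
            raise RuntimeError(f"escalate to human: {task}")
    return results

print(planner(["market research", "pricing strategy"]))
```

Even this toy version encodes the two decisions that matter: output is verified before the next task proceeds, and failure has a defined exit rather than silent improvisation.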

The Sizing Question

Every decomposition has a hidden question: "Is this task correctly sized for the agentic harness I have?"

Harness Type                 | Task Size It Can Handle
-----------------------------|------------------------------------------------------------
Single-threaded agent        | Single task, ~10–15 steps max, fits in context
Multi-agent (planner + subs) | Large project, multiple workstreams, long-horizon goals
Hierarchical agent swarm     | Enterprise-scale, many teams, cross-functional coordination

Give a too-large task to a single agent → it loses track, starts improvising, produces confident nonsense.

🔬
Exercise: Decompose a Project
45 minutes

Take this project: "Research competitor X and produce a 10-page market analysis."

  • Break it into 7–10 discrete agent-sized tasks
  • What are the logical chunks?
  • What's the execution order?
  • Where are the handoff points?
  • Which tasks depend on which others?

Then ask: could each task be completed by a single agent in one session? If not, decompose further.

šŸ—ļø
Exercise: Architecture Diagram
60 minutes

Design a multi-agent architecture (can be ASCII, hand-drawn, or a tool like Miro).

Use the project: "Build a content marketing system"

  • How many agents do you need?
  • What does each agent do (be specific)?
  • What does the planner agent coordinate?
  • How do you verify each agent's output before the next step?
  • Where do correction loops go?
Skill 4

Failure Pattern Recognition

The six ways agentic systems break — and how to diagnose, fix, and prevent them. This is what separates hobbyists from professionals.

Week 4
The Six Failure Modes
Failure                  | What's Happening                                                 | How to Spot It
-------------------------|------------------------------------------------------------------|----------------------------------------------------------
Context Degradation      | Quality drops as session gets long — context window polluted     | Output quality correlates inversely with session length
Specification Drift      | Agent forgets goals over long tasks                              | Mid-task output diverges from original intent
Sycophantic Confirmation | Agent validates bad input, builds entire wrong system around it  | Wrong data → confident wrong output chain follows
Tool Selection Errors    | Agent picks wrong tool from harness                              | Task done but wrong approach — usually prompt framing problem
Cascading Failure        | One agent's error propagates through the chain                   | Multiple failures trace back to single root cause
Silent Failure           | Output looks correct but is functionally wrong                   | Requires deep audit — most dangerous failure mode
Silent Failure — The Hardest One

This one deserves extra attention. It's the one that ships to production and causes problems for weeks before anyone notices.

Real Example

AI recommends "brown leather boots" to a customer. The recommendation looks correct in the chat log. The customer receives blue leather boots. Investigation reveals: the warehouse had a mixup. The AI recommended the right product from the catalog — but the catalog image didn't match the actual inventory. The AI never saw the warehouse mixup. The output looked identical to correct output.

The fix: Functional correctness checks, not just semantic ones. Does this recommendation actually work in the real world?

🔬
Exercise: Failure Mode Roulette
30 minutes

For each scenario, identify the failure mode and how you'd fix it:

  • Scenario 1: A code agent spent 2 hours writing a Python scraper. Output looks perfect — all imports, clean syntax, complete functions. But it scraped the wrong website entirely.
    What failure mode? Why? How do you fix it?
  • Scenario 2: An agent started a 50-step data pipeline. Steps 1–10 were great. Steps 30–50 got increasingly creative — inventing data that wasn't in the source.
    What failure mode? Why? How do you fix it?
  • Scenario 3: An AI recommended a credit card. It explained its reasoning perfectly. The card doesn't exist in the company's product database.
    What failure mode? Why? How do you fix it?
📓
Exercise: Build a Failure Log
Ongoing

Start documenting failures you encounter in your own AI work. After 10 entries, you'll have a personal failure mode handbook.

Failure Log Entry Template:
Date: 
Task: 
What Happened: 
Failure Mode: 
How Detected: 
How Fixed: 
Prevention: 
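The template above maps directly to a structured record, which keeps entries consistent and lets you filter by failure mode later. A sketch, with an illustrative example entry (the dates and details are invented for demonstration):

```python
# The failure-log template above as a structured record. Field names mirror
# the template; the example entry below is invented for illustration.
from dataclasses import dataclass, asdict

@dataclass
class FailureLogEntry:
    date: str
    task: str
    what_happened: str
    failure_mode: str   # one of the six modes
    how_detected: str
    how_fixed: str
    prevention: str

entry = FailureLogEntry(
    date="2026-01-15",
    task="50-step data pipeline",
    what_happened="Late steps invented data that wasn't in the source",
    failure_mode="Context Degradation",
    how_detected="Spot-checked late-stage rows against the source",
    how_fixed="Split the pipeline into shorter sessions",
    prevention="Cap runs at 15 steps; verify output between chunks",
)
print(asdict(entry)["failure_mode"])  # → Context Degradation
```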
Skill 5

Trust & Security Design

Deciding where humans stay in the loop, how agents are authorized, and how to verify guardrail compliance. The skill that makes AI safe to ship.

Week 5
The Core Question

Where does the blast radius of a failure meet acceptable risk? Every AI action needs an answer to this before it goes live.

The problem: Telling an agent "be good" in a system prompt doesn't work. These are probabilistic systems. Guardrails have to be structural, not aspirational.

The Four Sub-Skills
Sub-Skill     | What It Means                                | Example
--------------|----------------------------------------------|------------------------------------------------
Cost of Error | What's the blast radius if this goes wrong?  | Misspelled email draft vs. wrong drug dose
Reversibility | Can this mistake be undone?                  | Email draft = yes. Wire transfer = no
Frequency     | How often does this action run?              | 10K/day vs. 2/day — same error, different risk
Verifiability | Can you prove it was correct after the fact? | Semantic vs. functional correctness audit
Guardrail Construction Patterns
Pattern 1: Human-in-the-loop at boundaries
------------------------------------------
Agent recommends → Human approves → Action executes
Used for: High-cost, irreversible, or high-frequency actions

Pattern 2: Pre-flight verification
------------------------------------------
Agent prepares output → Verification agent checks →
  Pass: proceed | Fail: return for revision
Used for: Outputs that go to external customers

Pattern 3: Output constraints in system prompt
------------------------------------------
"[CONSTRAINTS]
- Never mention internal pricing
- Escalate legal questions to human
- Confirm dollar amounts with user before proceeding
- Never store PII in logs
[/CONSTRAINTS]"
Used for: Behavior that must always apply

Pattern 4: Rollback-capable transactions
------------------------------------------
Action → Log for audit → Verify → Commit | Revert
Used for: Database writes, external API calls
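Pattern 1 is structural in the literal sense: the approval step sits in the code path, not in the prompt. A minimal sketch, where `approve` is a stand-in for whatever approval queue or UI you actually use, and the risk fields are assumed for illustration:

```python
# Sketch of Pattern 1 (human-in-the-loop at boundaries). approve() stands in
# for a real approval queue; the action fields are illustrative assumptions.
def approve(action: dict) -> bool:
    # Assumption: auto-approve only low-risk actions; queue the rest.
    return action["risk"] == "low"

def execute_with_guardrail(action: dict) -> str:
    if action["reversible"] and action["risk"] == "low":
        return f"executed: {action['name']}"       # safe to run directly
    if approve(action):
        return f"executed after approval: {action['name']}"
    return f"held for human review: {action['name']}"

print(execute_with_guardrail({"name": "send_wire", "risk": "high",
                              "reversible": False}))
# → held for human review: send_wire
```

The key property: there is no prompt the agent could write that bypasses the hold. The guardrail is code, not aspiration.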
🗺️
Exercise: Risk Map Your AI Work
45 minutes

For each AI action you take or plan to take, answer:

  • Blast radius: What's the worst-case outcome?
  • Reversible? Yes / No / Partially
  • Frequency: How many times per day/week?
  • Verifiable? How would you prove it was correct?

Classify each as: Low / Medium / High / Critical risk

Skill 6

Context Architecture

Building the information infrastructure that lets agents find and use the right data at the right time. The skill that turns one agent into dozens.

Week 6
The Dewey Decimal System for AI

In 2024, "using AI at work" meant pasting the right documents into the prompt. In 2026, it means building systems where the right information is always available to agents — structured, clean, and traversable.

Context architecture is the discipline of designing that information layer so agents can self-serve what they need — without human hand-holding.

Why this is worth $300K+
Get context architecture right → you can deploy dozens of agents on the same data infrastructure. Get it wrong → every agent needs its own human curator. The difference between a platform and a toy.
Persistent vs. Per-Session Context

Persistent context: Always available to the agent — company policies, product knowledge base, team roster, past interaction history. Loaded once, used forever.

Per-session context: Loaded for a specific run — the user's current request, session-specific data, task-relevant documents. Refreshed each session.

Persistent Context (always available):
├── Company policies (HR, legal, security)
├── Product documentation
├── Team directory + responsibilities
├── Escalation paths
└── Historical decisions + rationale

Per-Session Context (loaded per task):
├── Current user request
├── Relevant documents for this task
├── Session-specific variables
└── Handoff data from previous agents
The Contamination Problem

Dirty data in context = confused agents = confident wrong output. If your product database has outdated prices, your agent will recommend outdated prices — confidently.

Context architecture includes:

  • Data freshness: When was this data last updated?
  • Source of truth: Which system is authoritative?
  • Confidence signals: How certain should the agent be about this data?
  • Escalation triggers: When should the agent flag data as unreliable?
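The freshness check above can be made concrete with a per-source freshness budget. A sketch under stated assumptions: the source names and day limits in `FRESHNESS_POLICY` are invented for illustration, and you would tune them to how fast each dataset actually changes:

```python
# Sketch of a data-freshness gate. Source names and day budgets in
# FRESHNESS_POLICY are illustrative assumptions, not a standard.
from datetime import date, timedelta

FRESHNESS_POLICY = {"pricing": 1, "product_docs": 30, "policies": 90}  # days

def usable(source: str, last_updated: date, today: date) -> bool:
    """Flag data as unreliable once it exceeds its freshness budget."""
    max_age = FRESHNESS_POLICY.get(source, 7)  # default budget: 7 days
    return today - last_updated <= timedelta(days=max_age)

today = date(2026, 3, 1)
print(usable("pricing", date(2026, 2, 27), today))  # 2 days old, budget 1 → False
```

When `usable` returns False, the agent should escalate or refuse rather than answer confidently from stale data.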
🗂️
Exercise: Context Design for One Agent
60 minutes

Design the context architecture for: "A sales agent that answers customer questions using your company's product knowledge base."

  • What is always in context (persistent)?
  • What is loaded per session?
  • How does the agent find the right information?
  • What would contaminate this system?
  • How do you verify the agent found the right context?
⚡
Exercise: Scale Test
30 minutes

Take your single-agent context design. Now make it work for 20 agents simultaneously — different teams, different tasks, same data infrastructure.

  • What breaks?
  • What needs to change?
  • How do agents avoid stepping on each other?

This is the question that separates a $150K AI specialist from a $300K+ AI architect.

Skill 7

Cost & Token Economics

Mathematically justifying AI investments before building them. The skill that turns "AI is expensive" from a complaint into a decision framework.

Week 7
The Core Calculation

Before building any AI feature, you need to answer: Is it worth it?

Cost per task = (tokens_used × model_price) + overhead

ROI = value_of_task / cost_per_task

Break-even: cost_per_task < value_of_task

The model selection problem: Frontier models (Claude Opus, GPT-4.5) give the best quality but cost more. Cheap models (Llama, Haiku) cost less but may be wrong more often. The skill is matching model to task correctly.

When to Use Which Model Tier
Task Type                       | Recommended Tier | Why
--------------------------------|------------------|-------------------------------------------
Simple classification, routing  | Cheap / Fast     | Doesn't need frontier reasoning
Drafting, summarization         | Mid-tier         | Good enough quality, cost-conscious
Complex reasoning, architecture | Frontier         | Quality failures are expensive
Code generation, technical docs | Frontier         | Subtle errors cause production bugs
Multi-step agentic pipelines    | Blended          | Cheap for routing, frontier for execution
Building a Token Cost Calculator

The practical skill: build a tool (spreadsheet, script, or dashboard) where you can change variables and see blended cost across models instantly.

Token Cost Calculator Template:

Task: [describe task]
Estimated tokens: [your estimate]

Model          | $/1M tokens | Your cost
------------------------------------
GPT-4.5        | $2.50       | [calc]
Claude Haiku   | $0.25       | [calc]
Claude Sonnet  | $3.00       | [calc]
Llama 4        | $0.10       | [calc]

Volume: [tasks/day] × [days/month] = [monthly_tasks]
Monthly cost at each tier: [calc]

Break-even value per task: [value] / [monthly_tasks]
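The template above takes a few lines as a function. The prices here are the illustrative $/1M-token figures from the table, not live rates, and the 22-day month is an assumed working-days default:

```python
# The cost-calculator template above as a function. Prices are the
# illustrative figures from the table, not live rates.
PRICE_PER_MTOK = {"gpt-4.5": 2.50, "claude-haiku": 0.25,
                  "claude-sonnet": 3.00, "llama-4": 0.10}

def monthly_cost(model: str, tokens_per_task: int,
                 tasks_per_day: int, days_per_month: int = 22) -> float:
    """Blended monthly cost for one task type on one model."""
    per_task = tokens_per_task / 1_000_000 * PRICE_PER_MTOK[model]
    return per_task * tasks_per_day * days_per_month

# Example: 5K tokens per task, 200 tasks per day
for model in PRICE_PER_MTOK:
    print(f"{model:>13}: ${monthly_cost(model, 5_000, 200):.2f}/month")
```

Changing one variable (tokens, volume, model) and re-running is exactly the "see blended cost instantly" skill the section describes.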
🧮
Exercise: Model Selection Audit
45 minutes

Look at your last 10 AI tasks. For each:

  • Which model did you actually use?
  • Was that the right model for the job?
  • Could a cheaper model have done it?
  • Could a frontier model have justified its cost?

Identify your 3 biggest model selection inefficiencies. This is where you're burning budget without gaining quality.

Apply What You Learned

Test Projects

Three projects that test all 7 skills in realistic scenarios. Each includes brief, rubrics, and what to submit.

Beginner
Customer Support AI Agent
Design and spec a tier-1 customer support agent that handles common requests, knows when to escalate, and produces audit logs.
The Brief

A mid-size e-commerce company wants to deploy an AI agent to handle their top 20 support ticket types. You're brought in to design the agent system, write the specs, and build a simple prototype.

The agent should: handle password resets, order status checks, return initiations, refund requests under $100, and product information queries. It must escalate anything involving billing disputes over $200, legal keywords, or customer sentiment below 0.3.

Skills This Tests

Skill 1 (Specification) — exact boundaries, escalation criteria, success metrics | Skill 2 (Evaluation) — how you measure quality | Skill 4 (Failure Modes) — what breaks and how to catch it | Skill 5 (Trust & Security) — blast radius, human checkpoints

Deliverables
  • Complete specification document for the agent
  • Eval framework: 5-question checklist for output quality
  • Failure mode analysis: 3 most likely failures + fixes
  • Guardrail design: where humans stay in the loop
  • Functional prototype using Claude/GPT (even a single conversation counts)
Rubric
Pass
Spec covers all 5 ticket types with clear boundaries
Good
+ eval framework with specific quality criteria
Exceptional
+ failure mode analysis + functional prototype
Intermediate
Market Research Pipeline
Build a multi-agent pipeline that researches a company, produces a competitive analysis, and generates actionable recommendations.
The Brief

A VC firm wants an AI system that, given a target company, automatically produces: company overview, competitive positioning, market size estimate, risk assessment, and investment recommendation. The system should scale to handle 10 companies per week.

Skills This Tests

Skill 3 (Multi-Agent Decomposition) — how you break this into agent-sized chunks | Skill 6 (Context Architecture) — how information flows between agents | Skill 7 (Cost Economics) — model selection per task + ROI justification | Skill 2 (Evaluation) — quality gates between pipeline stages

Deliverables
  • Multi-agent architecture diagram with all agents and their roles
  • Planner agent specification: how it coordinates sub-agents
  • Context design: what data persists, what's loaded per run
  • Quality gates: where eval happens between pipeline stages
  • Cost model: estimated monthly token cost at scale (10 companies/week)
  • Failure mode analysis: what breaks the pipeline and how you catch it
Rubric
Pass
Multi-agent decomposition with clear agent roles
Good
+ context architecture + cost model
Exceptional
+ working prototype on 1 company + failure analysis
Advanced
Enterprise AI Reliability System
Design the full AI system for a healthcare-adjacent startup that must meet compliance standards, handle PHI data, and operate with verifiable audit trails.
The Brief

A health-tech startup is building an AI system that helps care coordinators manage patient scheduling, insurance verification, and pre-visit prep. Every action must be auditable. The system must pass a compliance audit (HIPAA-equivalent). The team is 5 people.

You need to design the full system — not just the AI, but the human-AI workflow, the guardrails, the context architecture, the eval systems, and the failure recovery procedures.

Skills This Tests

All 7 skills at once. This is a capstone project. The spec for Skill 5 (Trust & Security) should be especially thorough — PHI data, blast radius analysis, compliance requirements change everything about guardrail design.

Deliverables
  • Full specification for all AI agents in the system
  • Multi-agent architecture with decomposition rationale
  • Context architecture: what data, how structured, compliance handling
  • Eval framework: quality standards for each agent
  • Guardrail system: blast radius map, human checkpoints, compliance controls
  • Failure mode handbook: 6 failure types applied to this system
  • Cost model: break-even analysis for the full system
  • Compliance section: how audit trails work, what happens in a breach scenario
Rubric
Pass
Full system spec with all 7 skill areas addressed
Good
+ compliance/audit section + failure handbook
Exceptional
+ working prototype + cost model + real blast radius analysis
Track Your Progress

Self-Assessment Checklist

Rate yourself honestly on each skill. These are the questions employers ask in AI-native interviews.

1 Specification Precision
  • I can write exact specs that agents execute without clarification
  • I test my prompts as if a new hire will read them
  • I've documented prompting standards for my team
2 Evaluation & Quality Judgment
  • I catch AI errors before they reach production
  • I build eval frameworks for AI tasks
  • I understand the difference between semantic and functional correctness
3 Multi-Agent Decomposition
  • I can break complex projects into agent-sized chunks
  • I understand planner/sub-agent architectures
  • I've built at least one working multi-agent system
4 Failure Pattern Recognition
  • I can identify which of the 6 failure modes is occurring
  • I build correction loops into my agentic systems
  • I've diagnosed a silent failure in production
5 Trust & Security Design
  • I map blast radius for every agent action
  • I know where humans stay in the loop for my systems
  • I've built guardrails that hold under adversarial input
6 Context Architecture
  • I can design context systems for scalable agent deployments
  • I understand persistent vs. per-session context
  • I think like a librarian when structuring company data for AI
7 Cost & Token Economics
  • I can estimate token costs before building agents
  • I select models based on task requirements, not just "best available"
  • I've built tools to calculate blended AI costs
Your Score

Count your checkmarks: there are 3 per skill, 21 in total. If you can confidently check at least 2 in every skill, you're ready for AI-native roles. Focus your study on the skills where you checked fewer than 2.

Further Learning

Resources

Curated resources for each skill area. Everything here is free or low-cost.

Skill 1 — Specification
  • Anthropic Prompt Engineering Guide — anthropic.com
  • OpenAI API Best Practices — platform.openai.com
  • Google's ML Product Guidelines — machine-learning-principles
Skill 2 — Evaluation
  • Anthropic Engineering Blog — especially the eval writing posts
  • Braintrust / Helicone — eval tooling for AI
  • OpenAI Evals — open source eval library
Skill 3 — Multi-Agent
  • CrewAI documentation — crewai.com
  • LangGraph examples — langchain.com/langgraph
  • Nate Jones' agent architecture videos (his YouTube channel)
Skill 4 — Failure Modes
  • Search "Claude loops" on Twitter/X — real failure examples
  • Claude documentation on agentic patterns
  • LangChain troubleshooting guides
Skill 5 — Trust & Security
  • OWASP Top 10 for LLMs — owasp.org
  • Anthropic's AI safety guidelines
  • OpenAI's use-case-specific safety guidelines
Skill 6 — Context Architecture
  • RAG tutorials — Retrieval Augmented Generation explainers
  • Pinecone / Weaviate / Chroma — vector DB tutorials
  • "What is a vector database" explainers (Hacker News)
Skill 7 — Cost Economics
  • OpenRouter model pricing page — openrouter.ai/models
  • Anthropic pricing — console.anthropic.com/pricing
  • OpenAI pricing — platform.openai.com/pricing
  • Tiktoken — token counting library
Certifications
  • Claude Certified Architect — growing fast, Accenture-backed, likely becomes the "AWS cert" of AI roles
  • AWS Machine Learning Specialty — enterprise credibility
  • Google Cloud Professional ML Engineer — if you're GCP-native
Communities
  • AI Builders Slack — search "AI Builders Slack invite"
  • CrewAI / LangChain Discord servers
  • Nate Jones' hiring board (when it launches — check his Substack)