Explain the Transformer architecture and self-attention mechanism
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaced recurrent models (RNNs and LSTMs) by relying entirely on self-attention. It uses an encoder-decoder structure: the encoder processes the entire input sequence in parallel, while the decoder generates the output autoregressively, one token at a time.
Self-attention computes relationships between every pair of tokens in a sequence. For each token, it creates Query (Q), Key (K), and Value (V) vectors via learned projections. Attention weights are computed as softmax(QK^T / sqrt(d_k)); the division by sqrt(d_k) keeps the dot products from growing large and saturating the softmax. The weights are then multiplied by V to produce a weighted sum of value vectors for each position. This lets every token attend to all others, capturing long-range dependencies without sequential processing.
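The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full implementation; the projection matrices Wq, Wk, Wv and the random inputs are placeholders standing in for learned parameters and token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)      # shape (5, 4): one output per token
```

Note that the output has one row per input token: each row mixes information from all five positions, weighted by attention.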
Multi-head attention runs multiple attention operations in parallel with different learned projections, then concatenates and projects the results. This lets the model attend to different representation subspaces (e.g., syntax vs semantics). Positional encoding (sinusoidal or learned) injects sequence order since attention is permutation-invariant.
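The sinusoidal encoding mentioned above follows the fixed formula from the original paper: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). A short sketch (assuming an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding; each dimension pair uses a different
    frequency, so every position gets a unique pattern."""
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even embedding dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dims: sine
    pe[:, 1::2] = np.cos(angles)            # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(10, 16)
# In a Transformer, pe is simply added to the token embeddings.
```

Because the encoding is deterministic, it extrapolates to sequence lengths not seen in training; learned positional embeddings trade that property for flexibility.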
Transformers displaced RNNs because they allow full parallelization across the sequence during training, keep the path length between any two tokens constant (mitigating vanishing gradients), and scale better with compute. Modern LLMs typically use decoder-only variants (GPT-style), which apply a causal mask so each token can only attend to earlier positions during autoregressive generation.
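The causal masking used by decoder-only models can be sketched as follows: future positions are set to -inf before the softmax, so their attention weights become exactly zero. The random score matrix here is a stand-in for QK^T / sqrt(d_k).

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so each token attends only to itself and
    earlier tokens, as in GPT-style decoder-only models."""
    seq_len = scores.shape[-1]
    # Strictly upper-triangular mask marks the future positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Softmax over each row; exp(-inf) = 0 zeroes out future positions.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.default_rng(1).normal(size=(4, 4)))
```

In the resulting matrix, row i has nonzero weights only in columns 0..i, which is what makes next-token prediction well defined during training.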
Key Takeaways
- Self-attention computes Q, K, V for each token and aggregates information via weighted sums
- Multi-head attention captures different types of relationships in parallel
- Positional encoding is essential since attention is permutation-invariant
- Decoder-only architectures (GPT) dominate modern LLMs for generation