AI, GenAI & Agentic AI Interview Questions & Answers

FAANG Interview Preparation

Essential interview questions on Generative AI, Large Language Models, Transformer architecture, RAG, Agentic AI, and the latest AI developments for tech interviews in 2025-2026.

15 questions: 4 Easy, 9 Medium, 2 Hard

Q1 Medium Architecture

Explain the Transformer architecture and self-attention mechanism

The Transformer architecture, introduced in 'Attention Is All You Need' (2017), replaced RNNs and LSTMs by relying entirely on self-attention. It consists of an encoder-decoder structure: the encoder processes the input sequence in parallel, while the decoder generates the output autoregressively.

Self-attention computes relationships between every pair of tokens in a sequence. For each token, it creates Query (Q), Key (K), and Value (V) vectors via learned projections. Attention scores are computed as softmax(QK^T / sqrt(d_k)), then multiplied by V to produce weighted outputs. This allows each token to attend to all others, capturing long-range dependencies without sequential processing.
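As a toy illustration, the scaled dot-product attention described above can be written in plain Python (single head, no batching, masking, or learned projections):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention over lists of token vectors (n x d_k).
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention distribution over all tokens
        # Weighted sum of value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs
```

Each output row is a convex combination of the value vectors, weighted by the attention distribution for that query.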

Multi-head attention runs multiple attention operations in parallel with different learned projections, then concatenates and projects the results. This lets the model attend to different representation subspaces (e.g., syntax vs semantics). Positional encoding (sinusoidal or learned) injects sequence order since attention is permutation-invariant.

Transformers replaced RNNs because they enable parallelization during training, avoid vanishing gradients, and scale better with compute. Modern LLMs use decoder-only variants (GPT-style) for autoregressive generation.

Key Takeaways

  • Self-attention computes Q, K, V for each token and aggregates information via weighted sums
  • Multi-head attention captures different types of relationships in parallel
  • Positional encoding is essential since attention is permutation-invariant
  • Decoder-only architectures (GPT) dominate modern LLMs for generation
Q2 Medium Fundamentals

How are Large Language Models trained? Explain pre-training, fine-tuning, and RLHF

LLM training occurs in three main stages, each with distinct data and objectives.

Pre-training is the foundation. Models learn from massive text corpora (web, books, code) using next-token prediction. The loss is cross-entropy over the vocabulary. Data is typically tokenized (BPE or SentencePiece), deduplicated, and filtered for quality. Pre-training requires hundreds of billions of tokens and thousands of GPUs. The model learns grammar, facts, and reasoning patterns.
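The next-token objective is just average cross-entropy over positions; a minimal sketch with toy probability distributions (no actual model involved):

```python
import math

def next_token_loss(probs, targets):
    # probs: per-position probability distributions over the vocabulary
    # targets: index of the actual next token at each position
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
```

A uniform guess over two tokens yields a loss of ln(2); a perfect prediction yields 0.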

Fine-tuning adapts the base model to specific tasks or behaviors. Supervised Fine-Tuning (SFT) uses high-quality instruction-response pairs (e.g., 10K-100K examples). The model learns to follow instructions, format outputs, and align with human preferences. Loss remains next-token prediction.

RLHF (Reinforcement Learning from Human Feedback) further aligns models with human preferences. A reward model is trained on human rankings of model outputs. The LLM is then optimized via PPO or similar RL algorithms to maximize the reward. DPO (Direct Preference Optimization) has emerged as a simpler alternative that avoids training a separate reward model. As of 2025-2026, constitutional AI and scalable oversight continue to evolve alignment techniques.
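The DPO objective can be sketched directly from its formula; the log-probabilities and `beta` below are illustrative inputs, not outputs of a real model:

```python
import math

def dpo_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    # pi_*: log-probs of the chosen/rejected responses under the policy
    # ref_*: log-probs under the frozen reference model
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference (zero margin) the loss is ln(2); widening the preference margin toward the chosen response drives it down.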

Key Takeaways

  • Pre-training: next-token prediction on massive text; learns general knowledge
  • Fine-tuning: instruction-response pairs; adapts to specific tasks
  • RLHF: reward model + PPO; aligns with human preferences; DPO is a simpler alternative
  • Each stage requires different data quality and scale
Q3 Easy Fundamentals

What are embeddings and vector databases? How are they used in AI applications?

Embeddings are dense vector representations of text (or other data) that capture semantic meaning. Words or sentences are mapped to fixed-size vectors (e.g., 768 or 1536 dimensions) such that similar meanings cluster together in vector space. Modern embedding models (OpenAI text-embedding-3, Cohere, open-source like BGE or E5) produce high-quality sentence embeddings.

Vector databases store embeddings and enable fast similarity search. Given a query embedding, they return the k most similar vectors using approximate nearest neighbor (ANN) algorithms like HNSW or IVF. Popular options include Pinecone (managed), Weaviate (open-source), FAISS (Facebook's library), Chroma, and pgvector (PostgreSQL extension).

Use cases include semantic search (find documents by meaning, not keywords), RAG (retrieve relevant context for LLMs), recommendation systems, deduplication, and clustering. The workflow is: embed your corpus, store in a vector DB, embed queries at runtime, retrieve top-k similar items.
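The retrieval step of this workflow reduces to cosine similarity plus a top-k scan; a brute-force sketch (real vector DBs replace the scan with ANN indexes like HNSW):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, corpus, k=3):
    # corpus: name -> embedding vector; exhaustive scan for clarity
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
    return ranked[:k]
```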

Key Takeaways

  • Embeddings map text to dense vectors; similar meaning = similar vectors
  • Vector DBs enable fast ANN search (FAISS, Pinecone, Weaviate)
  • Core use cases: semantic search, RAG, recommendations
  • Embedding model choice affects retrieval quality significantly
Q4 Medium LLMs

Compare GPT-4, Claude, Gemini, and Llama models — strengths and trade-offs

As of 2025-2026, the leading LLMs have distinct strengths. GPT-4/4o (OpenAI) excels at reasoning, coding, and tool use, with a strong API ecosystem and function calling. Context windows have expanded to 128K-1M tokens. It's closed-source with usage-based pricing.

Claude (Anthropic) emphasizes safety, long context (200K+), and nuanced instruction-following. Claude 3.5 Sonnet and Claude 3 Opus are strong for analysis and writing. Anthropic focuses on constitutional AI and transparency. Also closed-source.

Gemini (Google) offers multimodal capabilities (text, image, video, audio) from the ground up, strong integration with Google Cloud, and competitive reasoning. Gemini 1.5 Pro supports 1M+ token context. Good for enterprise and multimodal workloads.

Llama 3/4 (Meta) is open-source, enabling on-prem deployment, fine-tuning, and cost control. Strong for customization and privacy-sensitive use cases. Community and fine-tuned variants (e.g., CodeLlama, Llama Guard) extend capabilities. Trade-off: may lag closed models on hardest benchmarks.

Choose by: open vs closed, cost, context length, multimodal needs, and deployment constraints.

Key Takeaways

  • GPT-4: strong reasoning, tool use, API ecosystem; closed
  • Claude: safety, long context, writing; closed
  • Gemini: multimodal, Google Cloud; closed
  • Llama: open-source, customizable, on-prem; may lag on hardest tasks
Q5 Easy LLMs

What is hallucination in LLMs and how do you mitigate it?

Hallucination occurs when LLMs generate plausible-sounding but factually incorrect or nonsensical content. Causes include: training on noisy data, overconfidence in low-probability tokens, lack of grounding in real-world facts, and the model's tendency to complete patterns rather than verify truth.

Mitigation strategies: (1) RAG — retrieve relevant documents and condition generation on them, grounding outputs in verified sources. (2) Structured output — constrain outputs to JSON, SQL, or schemas to reduce free-form fabrication. (3) Guardrails — validate outputs against rules, blocklists, or classifiers before delivery. (4) Confidence scoring — use logprobs or self-consistency to flag low-confidence answers. (5) Citation — require the model to cite sources for factual claims. (6) Chain-of-thought with verification — have the model reason step-by-step and optionally verify intermediate steps.

For production systems, combining RAG with structured output and guardrails is standard. Temperature tuning (lower = less creative, fewer hallucinations) and prompt engineering (e.g., 'If uncertain, say so') also help.
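The self-consistency idea from strategy (4) can be sketched as majority voting over several sampled answers (the sampling itself is omitted; answers here are placeholder strings):

```python
from collections import Counter

def self_consistency(samples):
    # samples: answers drawn from multiple generations of the same question;
    # the majority answer and its vote share serve as a confidence signal.
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)
```

A low vote share flags answers that should be escalated or declined rather than returned as fact.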

Key Takeaways

  • Hallucinations stem from pattern completion, noisy data, lack of grounding
  • RAG grounds generation in retrieved documents
  • Structured output, guardrails, and citations reduce fabrication
  • Combine multiple strategies for production systems
Q6 Easy LLMs

Explain model context windows and why they matter

The context window is the maximum number of tokens (input + output) an LLM can process in a single request. It determines how much information the model can 'see' at once — documents, conversation history, or instructions.

Token limits vary: older models had 4K-8K; modern models support 32K, 128K, 200K, or even 1M+ tokens. Attention complexity is O(n^2) with sequence length, so long context is computationally expensive. Techniques to extend context include: RoPE (Rotary Position Embeddings), which generalizes to longer sequences; ALiBi (Attention with Linear Biases), which uses position-based attention biases; and sparse attention or sliding windows.

Practical implications: long context enables RAG with many chunks, long document analysis, and extended multi-turn conversations. However, the 'lost in the middle' effect means models often perform worse on information placed in the middle of long contexts. Chunking, summarization, and strategic placement of critical information (beginning/end) can help. For APIs, cost scales with context length.
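One simple mitigation for the lost-in-the-middle effect is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt; a hypothetical helper (the scores would come from your retriever):

```python
def order_for_context(chunks, scores):
    # Place the highest-scoring chunks at the beginning and end of the prompt,
    # pushing weaker ones toward the middle where they are most often missed.
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```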

Key Takeaways

  • Context window = max tokens (input + output) per request
  • RoPE, ALiBi enable longer contexts; attention is O(n^2)
  • Lost-in-the-middle: info in the middle of long context is often underused
  • Cost and latency scale with context length
Q7 Medium RAG

What is Retrieval-Augmented Generation (RAG)? Explain the architecture

RAG combines retrieval and generation to ground LLM outputs in external knowledge. It addresses hallucination and knowledge cutoff by fetching relevant documents before generating a response.

The pipeline: (1) Indexing — chunk documents (by paragraph, sentence, or semantic boundaries), embed chunks with an embedding model, store in a vector database. (2) Retrieval — embed the user query, run similarity search (e.g., top-k or MMR for diversity), optionally rerank results. (3) Generation — concatenate retrieved chunks into the prompt as context, pass to the LLM with the user query, generate the answer.

Chunking strategies matter: too small loses context, too large dilutes relevance. Overlap and semantic chunking (e.g., by section) improve recall. Embedding model choice affects retrieval quality. Hybrid search (vector + keyword) handles edge cases. Reranking (e.g., cross-encoder) improves precision. As of 2025-2026, advanced RAG includes query expansion, multi-query retrieval, and iterative refinement.
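A minimal sliding-window chunker illustrates the size/overlap trade-off (character-based for brevity; production systems usually chunk by tokens or semantic boundaries):

```python
def chunk_text(text, size=200, overlap=50):
    # Fixed-size sliding window: each chunk shares `overlap` characters with
    # the previous one so facts spanning a boundary are not lost.
    assert overlap < size, "overlap must be smaller than chunk size"
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```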

PYTHON
from openai import OpenAI
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

client = OpenAI()
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")

def rag_query(question: str, k: int = 5) -> str:
    # 1. Retrieve relevant chunks
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Build prompt and generate
    prompt = f'''Use the following context to answer the question. If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}

Answer:'''
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content

Key Takeaways

  • RAG = retrieve relevant docs + inject as context + generate
  • Chunking, embedding model, and reranking affect quality
  • Hybrid search (vector + keyword) improves robustness
  • Standard pattern for knowledge-grounded applications
Q8 Hard RAG

How do you evaluate and improve RAG system quality?

RAG evaluation spans retrieval and generation. Retrieval metrics: Recall@k (did we retrieve the relevant chunk?), MRR (Mean Reciprocal Rank — rank of first relevant result), and Precision@k. These require labeled relevance judgments.

Generation metrics: Faithfulness (is the answer grounded in the retrieved context?) — use NLI or LLM-as-judge. Relevance (does the answer address the question?) — also often LLM-as-judge. Answer correctness for factual QA — exact match or F1. RAGAS (Retrieval-Augmented Generation Assessment) combines retrieval and generation metrics into a single framework.

Improvement strategies: (1) Chunk optimization — tune chunk size, overlap, semantic boundaries; (2) Hybrid search — combine vector and BM25 for better recall; (3) Reranking — use cross-encoder or LLM to rerank top-k; (4) Query expansion — generate multiple query variants and union results; (5) Fine-tune embeddings on domain data; (6) Add metadata filters (date, source) to retrieval. A/B testing with real user feedback is essential for production.
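Recall@k and MRR are short enough to sketch directly, assuming the labeled relevance judgments described above:

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the relevant documents found in the top-k results
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    # queries: list of (ranked_results, relevant_set) pairs;
    # average of 1/rank of the first relevant hit per query
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```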

Key Takeaways

  • Retrieval: Recall@k, MRR; Generation: faithfulness, relevance
  • RAGAS provides integrated evaluation framework
  • Improve via chunking, hybrid search, reranking, query expansion
  • A/B test and iterate with real feedback
Q9 Medium Agentic AI

What is Agentic AI? Explain the core architecture patterns

Agentic AI refers to systems where an LLM acts as an autonomous agent that perceives, reasons, and takes actions to achieve goals. Unlike single-turn Q&A, agents operate in loops: observe state, decide action, execute, observe result, repeat until done.

Core patterns: (1) ReAct (Reasoning + Acting) — interleave thought, action, and observation. The model outputs reasoning steps and tool calls; the environment returns observations. (2) Plan-and-execute — first create a plan (possibly with a planner model), then execute steps. Better for complex multi-step tasks. (3) Tool use — agents call external tools (search, calculator, code execution, APIs) via function calling. (4) Memory — short-term (conversation history, recent context) and long-term (vector store of past interactions, summaries) enable continuity.
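The observe-decide-act loop can be sketched framework-free; `llm` here is a stand-in callable that returns either a tool call or a final answer, not a real model client:

```python
def react_loop(llm, tools, task, max_steps=10):
    # Minimal ReAct-style loop: the model decides, the environment acts,
    # the observation is fed back, until the model returns a final answer.
    history = [("task", task)]
    for _ in range(max_steps):
        decision = llm(history)          # reason/decide
        if decision[0] == "final":
            return decision[1]
        _, name, args = decision
        observation = tools[name](**args)            # act
        history.append(("observation", observation)) # observe
    return None  # gave up after max_steps
```

The `max_steps` cap is the simplest guard against runaway loops, one of the reliability challenges noted below.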

Frameworks like LangGraph, AutoGen, and CrewAI implement these patterns. As of 2025-2026, agentic workflows power coding assistants, research agents, and autonomous task completion. Key challenges: reliability, cost, and safety.

Key Takeaways

  • Agents: perceive, reason, act in loops until goal achieved
  • ReAct: interleave thought, action, observation
  • Plan-and-execute for complex multi-step tasks
  • Memory (short + long-term) enables continuity
Q10 Medium Agentic AI

How do AI Agents use tools and function calling?

Agents extend LLM capabilities by calling external tools. Function calling (tool use) lets the model request structured actions — search, compute, query DBs, call APIs — which the application executes and returns to the model.

The flow: (1) Define tools with schemas (name, description, parameters as JSON Schema). (2) Send user message + tool definitions to the LLM. (3) Model returns a tool_call with function name and arguments. (4) Application executes the function, gets result. (5) Append tool result as a message, send back to model. (6) Model may call more tools or return final answer.

Best practices: clear tool descriptions improve selection accuracy; validate and sanitize arguments; handle errors gracefully (retry, fallback); use structured output for complex returns. OpenAI, Anthropic, and Google all support function calling in their APIs. As of 2025-2026, tool use is standard for agents, with MCP (Model Context Protocol) emerging as a cross-platform standard for tool discovery.

PYTHON
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

def get_weather(city: str, unit: str = "celsius") -> str:
    # Simulate API call
    return f"Weather in {city}: 22°{unit[0].upper()}"

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    tc = response.choices[0].message.tool_calls[0]
    args = json.loads(tc.function.arguments)
    result = get_weather(**args)
    messages.append(response.choices[0].message)
    messages.append({"role": "tool", "content": result, "tool_call_id": tc.id})
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(final.choices[0].message.content)

Key Takeaways

  • Define tools with name, description, JSON Schema parameters
  • Model returns tool_call; app executes and returns result
  • Loop until model returns final answer
  • MCP standardizes tool discovery across platforms
Q11 Hard Agentic AI

Explain multi-agent systems and orchestration patterns

Multi-agent systems use multiple LLM agents that collaborate, specialize, or debate to solve complex tasks. Orchestration determines how agents interact.

Patterns: (1) Supervisor/worker — a supervisor agent delegates subtasks to specialized workers (researcher, coder, critic). The supervisor routes and aggregates. (2) Debate — multiple agents argue different viewpoints; a judge or consensus mechanism selects the best. Improves reasoning on hard problems. (3) Consensus — agents vote or negotiate to reach agreement. (4) Handoff — agents pass context when ownership changes (e.g., sales agent hands off to support).

Frameworks: AutoGen (Microsoft) supports multi-agent conversations; CrewAI defines roles and tasks; LangGraph models agents as state machines with conditional edges. Use multi-agent when tasks need diverse expertise, parallel exploration, or adversarial validation. Use single agent when the task is straightforward — multi-agent adds latency and cost.
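The supervisor/worker pattern reduces to routing plus aggregation; a minimal sketch with plain callables standing in for LLM agents (real systems use an LLM as the router):

```python
def run_supervisor(subtasks, workers, route):
    # Supervisor/worker: route each subtask to a specialist worker by name,
    # then collect the results for aggregation.
    results = []
    for sub in subtasks:
        worker = workers[route(sub)]
        results.append(worker(sub))
    return results
```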

Key Takeaways

  • Supervisor/worker: delegate to specialists; debate: argue and judge
  • Consensus and handoff for agreement and ownership transfer
  • AutoGen, CrewAI, LangGraph implement these patterns
  • Multi-agent when diverse expertise needed; single agent for simple tasks
Q12 Medium Agentic AI

What is the MCP (Model Context Protocol) and how does it work?

MCP (Model Context Protocol) is an open standard by Anthropic for connecting LLM applications to external tools and data sources. It provides a uniform way for models to discover and use capabilities without vendor lock-in.

Architecture: MCP servers expose tools, resources (files, data), and prompts. MCP clients (applications, IDEs like Cursor) connect to servers and present capabilities to the model. The protocol defines how tools are discovered (list of names, descriptions, schemas), invoked (with arguments), and how results are returned.

Transport options include stdio (local processes), HTTP/SSE (remote servers), and in-memory for testing. Servers can be written in any language. Popular use cases: database access, file systems, APIs, custom tools. MCP enables a plugin ecosystem — developers build servers once, and any MCP-compatible client can use them. As of 2025-2026, MCP is gaining adoption across AI tooling and agent frameworks.
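The discovery-then-invoke flow can be sketched with simplified message shapes (illustrative only; the real protocol is JSON-RPC with richer schemas defined in the MCP specification):

```python
import json

# Demo tool registry: name -> description and handler (handler is a stub)
TOOLS = {
    "get_time": {
        "description": "Return a fixed timestamp (demo tool)",
        "handler": lambda args: "2025-01-01T00:00:00Z",
    }
}

def handle_message(raw):
    msg = json.loads(raw)
    if msg["method"] == "tools/list":        # discovery
        result = [{"name": n, "description": t["description"]}
                  for n, t in TOOLS.items()]
    elif msg["method"] == "tools/call":      # invocation
        tool = TOOLS[msg["params"]["name"]]
        result = tool["handler"](msg["params"].get("arguments", {}))
    else:
        result = None
    return json.dumps({"id": msg["id"], "result": result})
```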

Key Takeaways

  • MCP: open standard for connecting LLMs to tools and data
  • Servers expose tools/resources; clients connect and present to model
  • Transport: stdio, HTTP/SSE; language-agnostic
  • Enables plugin ecosystem and vendor-neutral tooling
Q13 Easy Prompt Engineering

What are the best practices for prompt engineering?

Effective prompt engineering improves reliability and output quality. Best practices: (1) System prompts — set role, tone, and constraints in a system message; this keeps instructions separate from user content. (2) Few-shot examples — provide 2-5 input-output examples to demonstrate format and behavior. (3) Chain-of-thought — ask for step-by-step reasoning ('Think through this step by step') to improve complex reasoning. (4) Structured output — request JSON, XML, or markdown with a clear schema; use response_format when supported.

(5) Temperature — use 0 for deterministic tasks (classification, extraction), 0.7-0.9 for creative tasks. (6) Delimiters — use XML tags or markdown to separate instructions, context, and user input. (7) Negative instructions — state what not to do when relevant. (8) Iterate — test prompts on edge cases and refine. As of 2025-2026, prompt caching and optimized system prompts reduce cost and latency.

PYTHON
from openai import OpenAI

client = OpenAI()

# System prompt: role + constraints
system = '''You are a helpful coding assistant. Output code in Python.
Format: provide a brief explanation, then the code in a fenced block.
If the request is ambiguous, ask for clarification.'''

# Few-shot + structured output
user = '''Convert these to snake_case:
1. userName -> user_name
2. getHTTPResponse -> get_http_response
3. XMLParser -> xml_parser

Output as JSON: {"results": [{"input": "...", "output": "..."}]}'''

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": user}
    ],
    temperature=0,  # Deterministic for extraction
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)

Key Takeaways

  • System prompt for role/constraints; few-shot for format
  • Chain-of-thought improves reasoning; structured output for parsing
  • Temperature 0 for deterministic, higher for creative
  • Delimiters and negative instructions reduce errors
Q14 Medium Fundamentals

Explain fine-tuning vs RAG vs prompt engineering — when to use each

Each approach adapts LLMs differently. Prompt engineering uses natural language instructions and examples — no model changes. Pros: fast, free, reversible. Cons: limited by context length, no new knowledge, brittle for complex behaviors. Use for: simple tasks, prototyping, when you lack training data.

RAG retrieves external documents and injects them as context. Pros: no training, up-to-date knowledge, citeable sources, lower hallucination. Cons: retrieval quality limits output, latency from retrieval. Use for: knowledge bases, FAQs, domain docs, when knowledge changes frequently.

Fine-tuning updates model weights on task-specific data. Pros: better task performance, consistent formatting, reduced prompt size. Cons: requires data (100s-1000s of examples), cost, risk of catastrophic forgetting. Use for: custom tone, complex instruction-following, when prompts and RAG aren't enough.

Decision framework: start with prompts; add RAG if you need external knowledge; fine-tune if you have data and need deeper adaptation. Many systems combine all three.

Key Takeaways

  • Prompts: fast, no data; limited by context and complexity
  • RAG: external knowledge, no training; retrieval quality matters
  • Fine-tuning: best performance; needs data and compute
  • Combine: prompts + RAG common; add fine-tuning for deep customization
Q15 Medium Fundamentals

What are the key AI safety and alignment challenges?

AI safety spans alignment, robustness, and responsible deployment. The alignment problem: ensuring AI systems pursue intended goals and human values. Reward hacking — optimizing the wrong objective — is a core risk.

Jailbreaking and prompt injection: adversaries craft inputs to bypass safety filters or extract training data. Defenses include input sanitization, output filtering, and adversarial training. Red teaming — systematic adversarial testing — helps find vulnerabilities before deployment.

Constitutional AI (Anthropic): train models with explicit principles (e.g., 'Be helpful, harmless, honest'); use AI-generated feedback to refine behavior. Reduces reliance on human labeling at scale.

Guardrails: runtime checks on inputs and outputs — block harmful content, enforce format, validate against policies. Frameworks like Guardrails AI and NeMo Guardrails implement this.
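A minimal output guardrail is just a policy check before delivery; the blocklist pattern below is a hypothetical policy for illustration, not a recommended rule set:

```python
import re

# Hypothetical policy: block outputs that mention sensitive identifiers
BLOCKLIST = re.compile(r"\b(ssn|social security number|api key)\b", re.IGNORECASE)

def guard_output(text, max_len=2000):
    # Runtime output guardrail: policy check first, then length enforcement.
    if BLOCKLIST.search(text):
        return "[blocked: policy violation]"
    return text[:max_len]
```

Frameworks like Guardrails AI and NeMo Guardrails generalize this idea to schema validation, topic rails, and classifier-based checks.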

Responsible deployment: transparency, human oversight for high-stakes decisions, monitoring for drift and misuse. As of 2025-2026, regulatory frameworks (EU AI Act, etc.) are shaping requirements.

Key Takeaways

  • Alignment: ensure AI pursues intended goals; reward hacking is a risk
  • Jailbreaking, prompt injection: defend with sanitization, red teaming
  • Constitutional AI: train with explicit principles
  • Guardrails and oversight essential for responsible deployment