
DoorDash AI Engineer Case Interview — LLM Menu Intelligence & Marketplace Impact
This case simulates a DoorDash onsite-style AI Engineering case focused on building and launching an LLM-powered "Menu Intelligence" system that improves catalog quality and downstream marketplace outcomes (search relevance, conversion, onboarding speed, support deflection). It mirrors DoorDash’s data-driven, experiment-heavy culture and expects crisp problem framing, pragmatic trade-offs, and end-to-end ownership.

What the case covers:
- Product sense in a three-sided marketplace: articulate how better menu structure (items, options, modifiers, dietary tags) influences consumer conversion, Dasher experience (prep-time accuracy), and merchant outcomes (onboarding time, cancellations, support tickets).
- AI/ML + LLM system design: ingest PDF, image, and HTML menus; OCR and multilingual NLP; LLM/RAG for attribute extraction and taxonomy mapping; structured JSON outputs; guardrails against hallucinations; human-in-the-loop review tools for Ops.
- Online serving and scale: low-latency re-ranking for consumer search and menu browse (p95 < 300 ms, p50 < 120 ms); offline ingestion at millions of items; per-call cost controls; fallbacks to heuristic baselines when models degrade.
- Experimentation & metrics: define north-star and guardrail metrics; design offline evals and online A/B tests; ramp and rollback plans aligned with DoorDash’s bias for action and safety-first ethos.
- Reliability, safety, and privacy: PII redaction, abuse/safety filters, multilingual coverage, drift monitoring, lineage and replayability.

Scenario prompt (shared at start):
"You’re leading MenuIQ, an AI system that converts unstructured merchant menus (PDFs, phone photos, web pages) into a high-quality, structured catalog powering Storefront, Marketplace search, and Drive. Today: 72% item coverage, an 8% attribute error rate, a 3.5-day median onboarding time, 18% of merchant support tickets are catalog-related, and 6% of consumer searches return poor or zero results due to catalog gaps. You must ship a v1 in 8 weeks, first for English and Spanish in the U.S., then roll out internationally. Budget is constrained; the latency SLO for online search re-ranking is p95 < 300 ms; target cost is <$0.002 per processed item attribute and <$0.005 per online query."

Expected candidate workflow:
1) Clarify objectives and success criteria: propose measurable targets (e.g., +0.3–0.5% conversion lift on affected surfaces; -25% catalog-related support tickets; -20% onboarding time; -30% zero-result searches). Identify guardrails (no significant increase in wrong items/modifiers; p95 latency maintained; cost within budget).
2) System design (high level to concrete):
   - Ingestion: OCR for documents and images, language ID, de-duplication, quality scoring.
   - LLM strategy: prompting vs. fine-tuning; JSON schema-constrained outputs; retrieval over a merchant-specific knowledge base (historical items, policies); taxonomy mapping; confidence scoring; self-consistency or toolformer-style function calls for validation.
   - Safety/guardrails: PII redaction, policy filters, deterministic post-processing, regex/JSON-schema validation, and fallbacks to rules-based extraction when confidence is low (see the sketch after this list).
   - Data platform: feature store for embeddings and attributes; vector index; experiment config and event logging; lineage for reproducibility; canary pipelines.
   - Serving: offline batch for ingestion; online re-ranking for search and menu browse with caching; p95 and cost SLOs; blue/green deploys.
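
The structured-output and guardrail bullets above hinge on one mechanism: never let unvalidated LLM output into the catalog. Below is a minimal sketch of that pattern, assuming hypothetical `call_llm` and `rules_based_extract` hooks and an illustrative item schema (not DoorDash's actual catalog schema): it parses the model response, validates it against a JSON schema, and falls back to deterministic extraction on invalid JSON, schema violations, or low confidence.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Illustrative schema for one extracted menu item (field names are assumptions,
# not DoorDash's actual catalog schema).
MENU_ITEM_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "options"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "options": {"type": "array", "items": {"type": "string"}},
        "dietary_tags": {"type": "array", "items": {"type": "string"}},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "additionalProperties": False,
}

CONFIDENCE_FLOOR = 0.8  # below this, route to rules-based extraction / human review


def extract_item(raw_text: str, call_llm, rules_based_extract) -> dict:
    """Ask the LLM for schema-constrained JSON; fall back to rules on failure.

    `call_llm` and `rules_based_extract` are assumed hooks: the first returns
    the model's raw string response, the second is a deterministic parser.
    """
    response = call_llm(
        "Extract one menu item as JSON matching this schema:\n"
        f"{json.dumps(MENU_ITEM_SCHEMA)}\n\nMenu text:\n{raw_text}"
    )
    try:
        item = json.loads(response)
        validate(instance=item, schema=MENU_ITEM_SCHEMA)
    except (json.JSONDecodeError, ValidationError):
        return rules_based_extract(raw_text)   # invalid JSON or schema miss
    if item.get("confidence", 0.0) < CONFIDENCE_FLOOR:
        return rules_based_extract(raw_text)   # too uncertain to trust the LLM
    return item
```

In the interview, the same idea extends naturally to retry-with-repair prompts, batching for cost control, and routing low-confidence items into the Ops human-review queue.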
3) Evaluation plan:
   - Offline: a labeled set scored with exact-match/F1 for attributes, taxonomy accuracy, MAE for prep-time predictions if used downstream, and NDCG@K for search re-ranking (an NDCG sketch follows the rubric below).
   - Online: an A/B test with pre-declared hypotheses, sample sizing, power analysis, and guardrails (conversion, add-to-cart rate, cancel rate, Dasher wait time, ticket volume, latency, unit economics). Define the ramp strategy and rollback triggers (a sizing sketch also follows the rubric).
4) Risks & trade-offs: multilingual generalization, long-tail merchants, hallucinations on low-quality images, cost growth at peak, feedback loops (e.g., popular items improve while the tail worsens), and ops burden. Propose mitigations: active learning, human review queues, selective LLM usage, compression/caching, shadow testing.
5) Execution plan: an 8-week milestone plan (weeks 1–2: data & eval set; weeks 3–4: prototype + offline eval; week 5: limited pilot; weeks 6–7: A/B; week 8: stabilize & docs). Interfaces with Merchant Ops, Search & Discovery, and Support Automation.

Interviewer prompts (used to probe depth):
- Model choice: Which foundation models and why? Open-source vs. hosted; cost/latency implications; criteria for fine-tuning.
- Structured output reliability: How do you ensure schema adherence and low invalid-JSON rates at scale?
- Measurement: Which metric moves the business most, and why? How do you prevent Simpson’s paradox across cuisines and locales?
- Experiment design: How do you size and segment the A/B? What guardrails stop a bad ramp?
- Observability: What do you monitor in real time? How do you detect drift and content regressions?
- Ethics & safety: How do you handle allergen and dietary tags? What confidence thresholds apply before exposing attributes to consumers?

Rubric (how success is judged at DoorDash):
- Problem framing & marketplace sense (20%): clear linkage from catalog quality to conversion, cancellations, and ops cost.
- Technical & system design (25%): scalable, low-latency architecture with pragmatic cost controls and clear interfaces.
- ML/LLM strategy (25%): credible plan for data, prompting vs. fine-tuning, retrieval, guardrails, and evaluation depth.
- Experimentation & metrics (20%): thoughtful offline + online evaluation, valid A/B design, ramp/rollback, and monitoring.
- Communication & ownership (10%): crisp structure, stated assumptions, scrappy but safe trade-offs, aligned with "get 1% better every day."
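
For the offline search-relevance metric in the evaluation plan, NDCG@K rewards placing highly relevant items near the top using a logarithmic position discount and normalizes against the ideal ordering. A minimal single-query sketch (the graded relevance labels are illustrative):

```python
import math


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@K for one query; `relevances` are graded labels in ranked order."""
    def dcg(rels):
        # Discounted cumulative gain over the top-k positions.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Example: the re-ranker places a highly relevant item (grade 3) second.
print(round(ndcg_at_k([1, 3, 0, 2], k=4), 2))  # 0.79
```

Averaging this over a labeled query sample, segmented by cuisine and locale, is also how you surface the Simpson's-paradox risk raised in the interviewer prompts.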
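
The experiment-design prompt ("How do you size the A/B?") reduces to a standard two-proportion power calculation. A back-of-the-envelope sketch using only the standard library, assuming a 10% baseline conversion rate and treating the +0.3–0.5% target as a relative lift (both are assumptions; the prompt does not pin them down):

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base: float, rel_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    var = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p_base - p_treat) ** 2)


# Assumed 10% baseline conversion, +0.4% relative lift (midpoint of the target range).
print(sample_size_per_arm(0.10, 0.004))  # ~8.8M users per arm
```

Numbers like this are exactly why the case rewards segmenting to affected surfaces, choosing sensitive proxy metrics, and defining ramp/rollback triggers rather than waiting for a fully powered conversion read.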
Duration: 8 minutes
Practice with our AI-powered interview system to improve your skills.
About This Interview
Interview Type
PRODUCT SENSE
Difficulty Level
4/5
Interview Tips
• Research the company thoroughly
• Practice common questions
• Prepare your STAR method responses
• Dress appropriately for the role