reSee.it Podcast Summary
Gaurav Misra's path ran from a machine learning PhD detour to a viral, AI‑driven video tool, and he has since built Captions into an AI‑powered creative studio. Born in Boston and raised in New Delhi, he grew up with a passion for programming and studied engineering at Boston University. After interning at Microsoft and declining the software engineer in test path, he joined a Boston startup, Lattice Engines, where he worked on scalable ML for lead scoring. A brief PhD followed, then a pivot back to industry: Microsoft on an ML platform, Localytics, and finally Snapchat in New York, which drew him with its rapid experimentation and prototyping.
At Snapchat in New York, he joined a small engineering team that built an internal culture of experimentation. The New York team, led by Andrew Lin, functioned as a design‑engineering hybrid and used a skunkworks approach called Spooky to ship fast, isolated experiments. They prototyped features like Spotlight, a vertical video feed, and shipped a redesigned five‑tab navigation in production. The team also developed tools to measure and influence user behavior, such as eye‑tracking ideas and teleprompter concepts, and collaborated closely with Evan Spiegel’s design‑led product direction.
After leaving Snapchat, Misra reconnected with Dwight, co‑founder of Captions, and their conversations in New York evolved into a shared opportunity around video creation. In 2021, they saw the rise of talking videos on TikTok and began with a social‑network concept, while Captions itself emerged as a practical tool. They built a transcription‑first editor in days; the app went to the top of the App Store overnight, powered only by Google API calls with no backend. Revenue appeared through a weekend paywall experiment, and personal ARR climbed to $500,000 with no employees, prompting a strategic pivot back to Captions.
With Captions, the focus shifted to making video creation fast and approachable, starting with text‑based editing that lets users scrub by words, insert images, and trim precisely on screen. The team follows two roadmaps: a public list of must‑have improvements and a secret agenda aimed at changing behavior through innovative leaps. The eye contact feature emerged from teleprompter refinements and was later complemented by LipDub, which translates and lip‑synchronizes video across languages. GPT‑4 powers core translations, and hardware advances shorten training cycles, enabling faster iteration. The company is hiring in New York across disciplines as it scales the AI‑powered studio.