IIT Bombay | B.Tech Chemical Engineering | Minor in Machine Intelligence and Data Science

Shubh Sareen

I build and study AI systems that turn unstructured data into structured, actionable knowledge. My work focuses on LLM pipelines and agentic workflows, along with understanding their limitations in areas like long-context reasoning, retrieval, and reliability.

I focus on building systems, not just models.

Experience

Machine Learning Intern

Recommendation Systems · Embeddings · Similarity Search

Worked on a content-based recommendation system using embedding-based representations of user interactions. Built a pipeline where user preferences were modeled as weighted combinations of interacted content and recommendations were generated using cosine similarity.
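A minimal sketch of the idea described above, not the internship code: a user profile is a weighted average of the embeddings of interacted items, and unseen items are ranked by cosine similarity to that profile. The catalog, the 3-dimensional vectors, the weights, and the function names are all hypothetical, chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def user_profile(item_vectors, weights):
    """Weighted average of the embeddings a user interacted with."""
    total = sum(weights)
    dim = len(item_vectors[0])
    return [
        sum(w * v[i] for v, w in zip(item_vectors, weights)) / total
        for i in range(dim)
    ]

def recommend(profile, catalog, seen, k=2):
    """Rank unseen catalog items by cosine similarity to the profile."""
    candidates = [name for name in catalog if name not in seen]
    candidates.sort(key=lambda name: cosine(profile, catalog[name]), reverse=True)
    return candidates[:k]

# Toy embeddings (assumed): axis 0 ~ cooking, axis 1 ~ sports.
catalog = {
    "cooking_basics": [1.0, 0.0, 0.0],
    "baking_bread": [0.9, 0.1, 0.0],
    "football_recap": [0.0, 1.0, 0.0],
    "tennis_final": [0.0, 0.9, 0.1],
}

# Interaction weights (assumed): higher means stronger engagement.
seen = {"cooking_basics": 3.0, "football_recap": 1.0}
profile = user_profile([catalog[name] for name in seen], list(seen.values()))
top = recommend(profile, catalog, seen, k=2)
print(top)
```

With these toy numbers the cooking-heavy weights pull the profile toward the cooking axis, so the unseen cooking item ranks above the unseen sports item.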

Explored limitations of this approach, particularly how aggregating embeddings can collapse multi-interest user behavior and lead to weak or ambiguous representations. Also observed how poor data quality and overlapping categories affected embedding separation and retrieval performance.
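The collapse effect can be illustrated with a toy case, assuming two unrelated interests that sit on orthogonal axes. The vectors and item names below are hypothetical and only meant to show the geometry, not data from the actual system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two distinct interests the user actually has (assumed toy embeddings).
cooking = [1.0, 0.0]
football = [0.0, 1.0]

# A single averaged profile sits halfway between the two interests.
averaged = [(c + f) / 2 for c, f in zip(cooking, football)]

# Hypothetical item that is vaguely "a bit of both" and matches neither interest well.
generic = [1.0, 1.0]

sim_cooking = cosine(averaged, cooking)
sim_football = cosine(averaged, football)
sim_generic = cosine(averaged, generic)

print(round(sim_cooking, 3), round(sim_football, 3), round(sim_generic, 3))
```

In this setup the averaged profile scores the generic item higher than either of the items the user genuinely engaged with, which is one way an aggregated embedding can become an ambiguous representation of multi-interest behavior.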

This experience helped me understand that the effectiveness of similarity-based systems depends heavily on representation quality, not just the choice of algorithm.

What I work with

AI + ML
LLM Systems · Agentic Workflows · NLP · Retrieval Systems (Learning) · Model Evaluation
Deployment & Systems
Full-stack Prototyping · API Integrations · Deployment (Vercel, Railway) · Frontend Interfaces (React)
Dev Toolkit
Python · TypeScript (Working Knowledge) · APIs & Integrations · Linux
Build Mindset
Systems Thinking · Rapid Prototyping · Product + Research Blend · Understanding Failure Modes · Data-Driven Decisions

Key Projects

Live

YouTube AI Helper

Built an LLM-based system that converts long-form YouTube videos and playlists into structured notes, flashcards, and a queryable interface. Designed a modular pipeline for transcript extraction, summarization, and Q&A, with support for multi-video context. Explored limitations of full-context prompting and began working toward retrieval-based approaches for better relevance and scalability.

LLMs · RAG (in progress) · Streamlit · Python · Agentic Workflows
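A rough sketch of the retrieval-based direction mentioned above, under stated assumptions: instead of sending the full transcript to the model, the transcript is split into chunks and only the chunks most relevant to the question go into the prompt. Word overlap is used here as a simple stand-in for embedding similarity, and every function name, the chunk size, and the prompt wording are hypothetical rather than taken from the project.

```python
import re

def chunk_transcript(text, max_words=40):
    """Split a transcript into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question, chunk):
    """Fraction of question tokens found in the chunk (stand-in for embedding similarity)."""
    question_tokens = tokens(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & tokens(chunk)) / len(question_tokens)

def retrieve(question, chunks, k=2):
    """Return the k chunks that score highest against the question."""
    return sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Assemble a prompt from only the retrieved chunks, not the whole transcript."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    "gradient descent updates weights using the loss gradient",
    "the host thanks the sponsors",
    "learning rate controls the step size in gradient descent",
]
question = "what does the learning rate control in gradient descent"
best = retrieve(question, chunks, k=1)
prompt = build_prompt(question, best)
print(prompt)
```

The prompt size now depends on `k` and the chunk size rather than on video length, which is the scalability motivation; a real system would likely replace `overlap_score` with an embedding model.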

Systems & Experiments

Repository

Vision-to-Text SOC

Built a vision-to-text pipeline for extracting structured information from image-heavy inputs.

Combined OCR and NLP techniques to convert visual data into usable text representations

Designed the pipeline for practical document understanding workflows

Explored challenges in noisy inputs, layout variation, and downstream text usability

Focused on making extracted data structured and actionable rather than raw output

OCR Pipelines · Document Understanding · NLP · Python
Repository

WIDS 2025 Agentic AI

Built an agentic AI system exploring multi-step reasoning and tool-based workflows.

Designed modular agents for task decomposition, routing, and response generation

Experimented with prompt orchestration and tool-use patterns for real-world tasks

Explored limitations of agent reliability, control flow, and hallucination handling

Focused on making agent behavior predictable and useful, not just autonomous

Agent Systems · LLMs · Tool Use · Prompt Orchestration
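One way to picture the routing and control-flow concerns above is a deterministic router with an explicit fallback and a hard step limit. This is an illustrative sketch only: the tools, the keyword-based routing rule, and the `max_steps` parameter are hypothetical and are not the project's actual design.

```python
from typing import Callable, Dict, List

def calculator(task: str) -> str:
    """Hypothetical tool: sums the integers that appear in the task text."""
    parts = task.replace("+", " ").split()
    numbers = [int(p) for p in parts if p.lstrip("-").isdigit()]
    return str(sum(numbers))

def fallback(task: str) -> str:
    """Explicit fallback so unmatched tasks fail visibly instead of being guessed at."""
    return f"No tool matched: {task}"

# Keyword -> tool registry (assumed); a real router might use an LLM classifier.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": calculator,
    "sum": calculator,
}

def route(task: str) -> Callable[[str], str]:
    """Pick the first tool whose keyword appears in the task, else the fallback."""
    lowered = task.lower()
    for keyword, tool in TOOLS.items():
        if keyword in lowered:
            return tool
    return fallback

def run(tasks: List[str], max_steps: int = 5) -> List[str]:
    """Execute at most max_steps tasks so control flow stays bounded."""
    results = []
    for task in tasks[:max_steps]:
        results.append(route(task)(task))
    return results

results = run(["add 2 + 3", "tell me a joke"])
print(results)
```

Keeping routing rule-based and the step count capped trades flexibility for predictability: every task either hits a known tool or returns a clearly labeled fallback.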