Overview
Designed and developed a production-grade conversational AI backend powering two avatar personas: Nora (DevLearn 2024, Las Vegas) and Puja (Sify Experience Center, Navi Mumbai). A single Socket.IO server orchestrates a Unity 3D avatar client and a separate camera feed application, handling real-time speech transcription, LLM response streaming, text-to-speech synthesis, computer vision, face recognition, and retrieval-augmented generation, and combines them into a sentence-by-sentence avatar response pipeline.
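As an illustration of how two connections can be correlated by session ID, here is a minimal sketch in plain Python. It does not reproduce the project's code: `Session`, `SessionRegistry`, the `role` values, and the `latest_frame` field are hypothetical names, and the Socket.IO handlers that would call `attach` and `detach` are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Session:
    """Per-session state shared by the avatar and camera connections (hypothetical)."""
    session_id: str
    avatar_sid: Optional[str] = None    # connection id of the Unity avatar client
    camera_sid: Optional[str] = None    # connection id of the camera feed client
    latest_frame: Optional[bytes] = None  # most recent camera frame, if any


class SessionRegistry:
    """Maps connection ids to a shared Session keyed by session ID."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}
        self._by_sid: Dict[str, str] = {}

    def attach(self, session_id: str, sid: str, role: str) -> Session:
        """Register a connection under a session, creating the session if needed."""
        if role not in ("avatar", "camera"):
            raise ValueError(f"unknown role: {role}")
        session = self._sessions.setdefault(session_id, Session(session_id))
        setattr(session, f"{role}_sid", sid)
        self._by_sid[sid] = session_id
        return session

    def for_sid(self, sid: str) -> Optional[Session]:
        """Look up the session a connection belongs to, or None if unknown."""
        session_id = self._by_sid.get(sid)
        return self._sessions.get(session_id) if session_id else None

    def detach(self, sid: str) -> None:
        """Remove a connection; drop the session once both clients are gone."""
        session_id = self._by_sid.pop(sid, None)
        if session_id is None:
            return
        session = self._sessions[session_id]
        if session.avatar_sid == sid:
            session.avatar_sid = None
        if session.camera_sid == sid:
            session.camera_sid = None
        if session.avatar_sid is None and session.camera_sid is None:
            del self._sessions[session_id]
```

With a registry like this, a frame arriving from the camera connection can be stored on the same `Session` object that the audio handler reads from, so both pipelines see one shared context.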
Key Responsibilities
- Lead Backend Architect, Socket.IO Server: Designed the full server architecture using Python and Socket.IO, handling two simultaneous client connections (Unity avatar app and camera feed app) correlated by session ID, with per-session state management, async queues, and clean event-driven separation across the events/ and modules/ layers.
- Multi-Modal Pipeline Integration: Engineered the end-to-end audio-visual pipeline (Unity microphone audio → Whisper transcription → LLM streaming → Edge TTS sentence synthesis → avatar response emission) alongside a parallel video pipeline feeding YOLO person detection and HOG face recognition into the same session context.
- Dual Avatar Persona System (Nora & Puja): Built a configurable avatar persona framework supporting independent scenario definitions, voice IDs, chat history collections, system prompt contexts, and predefined audio paths per avatar, ingested and managed via MongoDB with a dedicated scenario ingestion script.
- Streaming LLM Response Processing: Implemented a custom SSE stream parser consuming structured LLM output (emotion tags, sentence delimiters, end markers) to yield per-sentence (emotion, message) tuples, enabling real-time sentence-by-sentence TTS and avatar animation sequencing with a drain queue driven by avatar playback state events from the Unity client.
- RAG & Vision Integration: Integrated ChromaDB-backed retrieval-augmented generation with keyword-triggered context injection, and built a vision LLM path that encodes the latest camera frame alongside the user transcript for image-aware responses, with automatic fallback to text-only when no valid frame is available.
- Idle & Interrupt Handling: Developed proactive idle-state logic that triggers vision-based conversation starters when a person is detected but no speech occurs (~40s threshold), and an LLM-driven interrupt decision system that determines whether mid-response user speech should override the current avatar output or be ignored.
- Unity Client Integration & Cross-Team Coordination: Defined and maintained the Socket.IO event contract (get_sid, audio_data, nora_response, avatar_state, image_data, etc.) between the backend and the Unity development team, ensuring reliable session correlation, queue drain behavior, and avatar state synchronization across both clients.
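The per-sentence stream parsing described above can be sketched as follows. The actual output format is not specified here, so the sketch assumes three things: a `[emotion]` tag at the start of each sentence, `|` as the sentence delimiter, and `<END>` as the end marker. It also assumes the text chunks have already been extracted from the SSE events. The names `iter_sentences`, `SENTENCE_DELIM`, `END_MARKER`, and `DEFAULT_EMOTION` are illustrative only.

```python
import re
from typing import Iterable, Iterator, Tuple

SENTENCE_DELIM = "|"        # assumed sentence delimiter
END_MARKER = "<END>"        # assumed end-of-response marker
DEFAULT_EMOTION = "neutral"  # assumed fallback when a sentence has no tag
EMOTION_TAG = re.compile(r"^\s*\[(\w+)\]\s*(.*)$", re.DOTALL)


def _split_emotion(segment: str) -> Tuple[str, str]:
    """Separate a leading [emotion] tag from the sentence text."""
    match = EMOTION_TAG.match(segment)
    if match:
        return match.group(1), match.group(2).strip()
    return DEFAULT_EMOTION, segment.strip()


def iter_sentences(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (emotion, message) tuples as complete sentences arrive."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        done = END_MARKER in buffer
        if done:
            # Discard anything after the end marker and close the last sentence.
            buffer = buffer.split(END_MARKER, 1)[0] + SENTENCE_DELIM
        while SENTENCE_DELIM in buffer:
            segment, buffer = buffer.split(SENTENCE_DELIM, 1)
            emotion, message = _split_emotion(segment)
            if message:
                yield emotion, message
        if done:
            return
    # Stream closed without an end marker: flush whatever text remains.
    emotion, message = _split_emotion(buffer)
    if message:
        yield emotion, message


if __name__ == "__main__":
    stream = ["[happy] Hi th", "ere!|[calm] How can I help?|<END>"]
    for emotion, message in iter_sentences(stream):
        print(emotion, message)
```

Because the parser is a generator, each tuple can be handed to TTS as soon as its sentence completes, while later chunks are still arriving. Buffering across chunks also covers tags, delimiters, or the end marker being split between two chunks.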
