Conversational AI — Experience Center


Company: Sify Technologies
Python – Socket.IO – aiohttp
Year: 2025
Impact

Deployed interactive AI avatar experiences at live expo events and enterprise experience centers, enabling real-time, personalized conversations with visitors. This reduced dependency on human staffing for demos and delivered consistent visitor engagement across two distinct deployment contexts.

Overview

Designed and developed a production-grade conversational AI backend powering two avatar personas: Nora (DevLearn 2024, Las Vegas) and Puja (Sify Experience Center, Navi Mumbai). The system orchestrates a Unity 3D avatar client and a separate camera feed application over a single Socket.IO server. It handles real-time speech transcription, LLM response streaming, text-to-speech synthesis, computer vision, face recognition, and retrieval-augmented generation, all combined into a sentence-by-sentence avatar response pipeline.
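
As an illustration of how two client connections might be tied to one conversation, here is a minimal sketch of a session registry. It is not the project's actual code: the class names, the "avatar" and "camera" role labels, and the latest_frame field are all hypothetical, and the Socket.IO layer is left out so the correlation logic can be read on its own.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Session:
    """Per-session state shared by both clients (hypothetical shape)."""
    session_id: str
    avatar_sid: Optional[str] = None    # connection ID of the Unity avatar app
    camera_sid: Optional[str] = None    # connection ID of the camera feed app
    latest_frame: Optional[bytes] = None

    @property
    def paired(self) -> bool:
        # True once both clients have joined the same session.
        return self.avatar_sid is not None and self.camera_sid is not None


class SessionRegistry:
    """Maps a shared session ID to the connections that belong to it."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}
        self._by_sid: Dict[str, str] = {}

    def register(self, session_id: str, sid: str, role: str) -> Session:
        if role not in ("avatar", "camera"):
            raise ValueError(f"unknown role: {role!r}")
        session = self._sessions.setdefault(session_id, Session(session_id))
        setattr(session, f"{role}_sid", sid)
        self._by_sid[sid] = session_id
        return session

    def for_sid(self, sid: str) -> Optional[Session]:
        # Look up the session a given connection belongs to, if any.
        session_id = self._by_sid.get(sid)
        return self._sessions.get(session_id) if session_id is not None else None

    def disconnect(self, sid: str) -> None:
        session = self.for_sid(sid)
        if session is None:
            return
        del self._by_sid[sid]
        if session.avatar_sid == sid:
            session.avatar_sid = None
        if session.camera_sid == sid:
            session.camera_sid = None
        # Drop the session once neither client is connected.
        if session.avatar_sid is None and session.camera_sid is None:
            del self._sessions[session.session_id]
```

In a layout like this, an event handler for either client would call for_sid with the connection ID it received and then read or update the shared Session, which is one way camera frames and audio transcripts can end up in the same context.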

Key Responsibilities

  • Lead Backend Architect — Socket.IO Server: Designed the full server architecture using Python and Socket.IO, handling two simultaneous client connections (Unity avatar app and camera feed app) correlated by session ID, with per-session state management, async queues, and clean event-driven separation across the events/ and modules/ layers.

  • Multi-Modal Pipeline Integration: Engineered the end-to-end audio-visual pipeline — Unity microphone audio → Whisper transcription → LLM streaming → Edge TTS sentence synthesis → avatar response emission — alongside a parallel video pipeline feeding YOLO person detection and HOG face recognition into the same session context.

  • Dual Avatar Persona System (Nora & Puja): Built a configurable avatar persona framework supporting independent scenario definitions, voice IDs, chat history collections, system prompt contexts, and predefined audio paths per avatar — ingested and managed via MongoDB with a dedicated scenario ingestion script.

  • Streaming LLM Response Processing: Implemented a custom SSE stream parser that consumes structured LLM output (emotion tags, sentence delimiters, end markers) and yields per-sentence (emotion, message) tuples. This enables real-time, sentence-by-sentence TTS and avatar animation sequencing, with a drain queue driven by avatar playback state events from the Unity client.

  • RAG & Vision Integration: Integrated ChromaDB-backed retrieval-augmented generation with keyword-triggered context injection, and built a vision LLM path that encodes the latest camera frame alongside the user transcript for image-aware responses — with automatic fallback to text-only when no valid frame is available.

  • Idle & Interrupt Handling: Developed proactive idle-state logic that triggers vision-based conversation starters when a person is detected but no speech occurs (~40s threshold), and an LLM-driven interrupt decision system that determines whether mid-response user speech should override the current avatar output or be ignored.

  • Unity Client Integration & Cross-Team Coordination: Defined and maintained the Socket.IO event contract (get_sid, audio_data, nora_response, avatar_state, image_data, etc.) between the backend and the Unity development team, ensuring reliable session correlation, queue drain behavior, and avatar state synchronization across both clients.
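
The streaming and queue-drain behavior listed above can be sketched roughly as follows. The wire format here ("[emotion] sentence" segments separated by "|" and closed by "<END>"), the "idle" playback state name, and every function and class name are assumptions made for illustration; the page does not specify the real format or event payloads, and TTS synthesis is omitted.

```python
import re
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, Tuple

# Hypothetical wire format, assumed for illustration only:
#   "[happy] Hello there!|[neutral] How can I help?|<END>"
SENTENCE_DELIM = "|"
END_MARKER = "<END>"
TAG_RE = re.compile(r"^\[(\w+)\]\s*(.*)$", re.DOTALL)

Sentence = Tuple[str, str]  # (emotion, message)


def parse_segment(segment: str, default_emotion: str = "neutral") -> Sentence:
    """Split one delimited segment into (emotion, message)."""
    text = segment.strip()
    match = TAG_RE.match(text)
    if match:
        return match.group(1), match.group(2).strip()
    return default_emotion, text


def iter_sentences(chunks: Iterable[str]) -> Iterator[Sentence]:
    """Yield each sentence as soon as its delimiter arrives in the stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        done = END_MARKER in buffer
        if done:
            # Ignore anything the model sends after the end marker.
            buffer = buffer.split(END_MARKER, 1)[0]
        # Everything before the last delimiter is complete; the tail is not.
        *complete, buffer = buffer.split(SENTENCE_DELIM)
        if done:
            complete.append(buffer)
            buffer = ""
        for segment in complete:
            if segment.strip():
                yield parse_segment(segment)
        if done:
            return
    if buffer.strip():
        # Stream ended without an end marker: flush what is left.
        yield parse_segment(buffer)


class DrainQueue:
    """Release one queued sentence each time the avatar finishes speaking."""

    def __init__(self, emit: Callable[[Sentence], None]) -> None:
        self._pending: Deque[Sentence] = deque()
        self._emit = emit
        self._busy = False

    def push(self, item: Sentence) -> None:
        self._pending.append(item)
        self._drain()

    def on_avatar_state(self, state: str) -> None:
        if state == "idle":  # hypothetical state name reported by the client
            self._busy = False
            self._drain()

    def _drain(self) -> None:
        if not self._busy and self._pending:
            self._busy = True
            self._emit(self._pending.popleft())
```

With this shape, the first sentence is emitted as soon as it is parsed, and later sentences wait until the client reports that playback has finished. Tags or markers split across chunk boundaries are handled because parsing always runs on the accumulated buffer rather than on individual chunks.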

Tech Stack

Python – Socket.IO – aiohttp – MongoDB – ChromaDB – Whisper – Edge TTS – YOLO – OpenCV – HOG – RAG – LLM Streaming
