Work · Shivam Sharma

VideoDB

Software Engineer Intern · Jan 2026 – Apr 2026 · Remote

At VideoDB I built a real-time fact-checking system for live audio and video streams. Given a stream from YouTube Live or Google Meet, the pipeline transcribes speech as it is spoken, extracts the claims, retrieves indexed video evidence, and decides within a second whether each claim is grounded in what the camera actually shows.

I also open-sourced VideoDB Claude Skills, a small library exposing video search, transcription, and scene understanding through a plain-English terminal interface built on the Model Context Protocol (MCP). It shipped to the Claude developer marketplace.

I co-authored a benchmarking paper studying whether Gemini’s intermediate reasoning traces actually improve scene-understanding accuracy, or whether they merely look like they do. The answer depends heavily on task structure.

Nanyang Technological University, Singapore

Research Intern · Speech and Multilingual NLP · Jan 2025 – Dec 2025 · Hybrid

I designed and optimised large-scale multilingual ASR and speaker diarization pipelines over GigaSpeech 2, Emilia, and NVIDIA Granary, analysing trade-offs in normalization strategy, language mixing, and benchmark generalization across regional accents.

My current paper builds on Indian-ASR-Bench, a WER benchmark of five ASR systems on the TIE dataset — 986 Indian English academic lecture clips. I evaluated Whisper Base, Medium, and Large alongside Parakeet-TDT-0.6B and Qwen3-ASR-1.7B, with breakdowns by region, speech rate, audio duration, gender, and discipline.

The paper focuses on why scale is not the whole story for accented academic speech: Whisper Medium beats Whisper Large overall, while Parakeet and Qwen3 are more robust on long clips. The methodological result is just as important — transcript normalization can change WER more than model choice.
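To illustrate the normalization point, here is a minimal sketch of word error rate computed with and without transcript normalization. The normalization rules (lowercasing and stripping punctuation) and the example sentence are assumptions chosen for illustration; they are not the rules or data used in Indian-ASR-Bench.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (illustrative rules only)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example pair: the words agree, only casing and punctuation differ.
reference = "The eigenvalues, in this case, are real."
hypothesis = "the eigenvalues in this case are real"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))

print(f"raw WER:        {raw_wer:.3f}")         # 0.571
print(f"normalized WER: {normalized_wer:.3f}")  # 0.000
```

In this toy case the same hypothesis scores 57% or 0% WER depending only on how the text is normalized, which is why the normalization scheme has to be fixed and reported before comparing models.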

BITS Pilani, Goa

Research Intern · NLP and LLM Evaluation · Sep 2024 – Dec 2025 · Goa, India

The first project was PustakAI, a curriculum-aligned QA pipeline for Indian school textbooks. I built the curation and evaluation harness, designed RAG-based prompting strategies, and fine-tuned Gemma3:1B and LLaMA3.2:3B variants. The paper was accepted at ACM COMPUTE 2025. The NCERT dataset is public on Hugging Face.

The second project is ICH-QA, a 131,495-pair synthetic QA dataset over 7,506 Wikipedia articles on Indian Cultural Heritage. RAG reaches 48% exact match (EM), which indicates that the dataset requires genuine cultural grounding rather than surface pattern-matching.
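For readers unfamiliar with the metric, the sketch below shows one common way to compute exact match. The answer normalization (lowercasing, stripping punctuation and extra whitespace) and the toy strings are assumptions for illustration; they are not taken from ICH-QA or its evaluation harness.

```python
import string


def normalize_answer(text: str) -> str:
    """Lowercase, remove ASCII punctuation, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(
        normalize_answer(pred) == normalize_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Toy placeholder answers, not real dataset entries.
toy_predictions = ["Answer A.", "answer b", "answer x", "Answer Y"]
toy_references = ["answer a", "Answer B", "answer c", "answer d"]

em_score = exact_match(toy_predictions, toy_references)
print(f"EM: {em_score:.2f}")  # EM: 0.50
```

EM gives no partial credit, so a near-miss answer counts the same as a wrong one; that strictness is part of why it is a demanding measure for open-ended cultural questions.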

I am currently writing up the ICH-QA work as a research paper with a PhD scholar, supervised by professors from BITS Pilani Goa and RMIT Australia.

Video Fact Checker

Electron, React, Hono, Gemini, VideoDB · GitHub

A desktop application that monitors YouTube, Google Meet, and podcasts in real time. It captures system audio via VideoDB, transcribes it live, and uses Gemini to extract and verify claims every 20 seconds — classifying each as verified, misleading, or missing context with a confidence score.

Built as an Electron app with a React frontend and a lightweight Hono backend. Runs as a menu bar tray icon so it stays out of the way while monitoring any audio on the machine.

BhashaSarthi

React Native, Expo SDK 54, Gemini 2.5 Flash · GitHub

Open-source mobile translation app covering all 22 official Indian languages, powered by Google Gemini 2.5 Flash. Supports voice input with auto-transcription, text-to-speech output, and on-device translation history.

Accident Severity Prediction

Python, CatBoost, Streamlit · GitHub · Live

Built for KSP Datathon 2024. Trained a CatBoostClassifier on 329,000+ Karnataka State Police FIR records (2016–2024) to predict accident severity and recommend emergency response levels. Five interactive dashboards surface trends by district, road type, and time period.

Open Source

Open WebUI PR #23456 — replaced six instances of hardcoded forward-slash path concatenation in audio.py with os.path.join(), fixing a Windows audio load failure. Merged.
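The snippet below sketches the general pattern behind that kind of fix, using a hypothetical directory and file name rather than the actual audio.py code. It uses ntpath and posixpath, the standard-library modules that os.path resolves to on Windows and on POSIX systems respectively, so the Windows behaviour can be shown on any machine.

```python
import ntpath     # what os.path is on Windows
import posixpath  # what os.path is on Linux and macOS

# Hypothetical values for illustration only.
windows_dir = r"C:\data\cache\audio"
posix_dir = "/data/cache/audio"
file_name = "speech.mp3"

# Hardcoded forward-slash concatenation yields mixed separators on Windows.
concatenated = windows_dir + "/" + file_name

# The join function for each platform inserts that platform's separator.
joined_windows = ntpath.join(windows_dir, file_name)
joined_posix = posixpath.join(posix_dir, file_name)

print(concatenated)    # C:\data\cache\audio/speech.mp3
print(joined_windows)  # C:\data\cache\audio\speech.mp3
print(joined_posix)    # /data/cache/audio/speech.mp3
```

In application code the call is simply os.path.join(directory, file_name), which picks the right behaviour for whichever platform the code runs on.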

OpenAI Codex issue #16303 — a UX improvement for skill display in the TUI: surfacing specific skill names instead of the generic “Read SKILL.md” message. Accepted and incorporated by the maintainers.

Last revised June 2026