Ketaki Dabade — CS Graduate Student

Experience

Research Assistant

CRIS Laboratory, Columbia University

Sep 2025 — Present

Built a scientific content analysis pipeline: MinerU-based PDF extraction across 3,000+ textbook pages, Qwen3-Embedding for 17,000+ dense vectors, and BERTopic with HDBSCAN to discover 493 semantically coherent topics.

Used Gemma to label topics and hierarchical clustering to map prerequisite knowledge relationships. This work lays the foundation for Sparse Autoencoder training on structured knowledge.

Data Research Collaborator

LivingScopeHealth

Jan 2026 — Present

Analyzing large-scale patient data in PostgreSQL to identify early indicators of diabetes onset before clinical diagnosis.

Building classification models (XGBoost, Random Forest) with SMOTE for class imbalance and SHAP for feature importance to enable preventive interventions.
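
SMOTE balances classes by synthesizing new minority-class rows between existing ones. The following is a minimal standard-library sketch of that interpolation step, not the project's code; the function name, the `k` and `seed` parameters, and the sample points are illustrative assumptions.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding the point itself)
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(others)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical two-feature minority class, for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_rows = smote(minority, n_new=5)
```

Each synthetic row lies on a line segment between two real minority rows, so it stays inside the region the minority class already occupies.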

Deep Learning Engineer Intern

AI4M Technology Private Limited

Jul 2024 — Dec 2024

Trained YOLOv7/v8 defect detection models for manufacturing QC. Deployed on NVIDIA Jetson with DeepStream SDK and TensorRT (FP16/INT8), achieving a 3x inference speedup and 25% lower detection latency.

Designed Flask REST APIs for real-time inference across 3 production lines. Built a multi-threaded, Dockerized backend with AWS/Azure data pipelines, CI/CD, and 85% test coverage.

Data Analyst Intern

ViLA EmachWirken Private Limited

Jun 2022 — Dec 2022

Built a K-Means clustering model to identify 5 customer personas. Designed Grafana dashboards tracking 15+ KPIs, including revenue, churn, CAC, and operational efficiency.

Conducted EDA on 100K+ transactions using Python and SQL. Automated reporting pipelines, reducing manual work by 40% and enhancing operational visibility by 30%.

Projects

1st Place — Columbia AI for Good Hackathon

Patrona — AI Voice Safety Companion

Voice-first AI that walks home with users. Hands-free safety through natural conversation — silence detection, safe words, and live GPS alerts to emergency contacts.
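
The alert logic described above (silence detection plus safe words) can be sketched as a small state holder. This is an illustration under assumptions, not Patrona's implementation: the class name, the safe word, and the 30-second silence limit are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WalkMonitor:
    """Tracks a walk and decides when emergency contacts should be alerted."""
    safe_words: frozenset = frozenset({"pineapple"})  # hypothetical safe word
    silence_limit_s: float = 30.0                     # hypothetical threshold
    last_heard_at: float = 0.0
    alerted: bool = False

    def on_speech(self, text: str, now: float) -> bool:
        """Record an utterance; alert immediately if it contains a safe word."""
        self.last_heard_at = now
        words = {w.strip(".,!?").lower() for w in text.split()}
        if words & self.safe_words:
            self.alerted = True
        return self.alerted

    def on_tick(self, now: float) -> bool:
        """Called periodically; alert if the user has been silent too long."""
        if now - self.last_heard_at > self.silence_limit_s:
            self.alerted = True
        return self.alerted
```

Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the logic easy to test without waiting in real time.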

Selected Work

More Projects

01 Quant Finance

Quant Portfolio Returns Dashboard

Real-time analytics with 15+ risk metrics (Sharpe, VaR, CVaR), mean-variance optimization via SciPy, Monte Carlo simulations, and TWR/IRR calculations with S&P 500 benchmarks.
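
Three of the risk metrics named above can be computed from a return series with the standard library alone. This sketch uses the historical (empirical) method for VaR and CVaR, which is one common choice and an assumption here; the dashboard's actual method is not stated. Function names and the sample returns are illustrative.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

def historical_var(returns, alpha=0.95):
    """Historical VaR: the loss at the alpha quantile of the empirical loss distribution."""
    losses = sorted(-r for r in returns)  # positive values are losses
    index = min(math.ceil(alpha * len(losses)) - 1, len(losses) - 1)
    return losses[index]

def historical_cvar(returns, alpha=0.95):
    """CVaR (expected shortfall): mean loss at or beyond the VaR threshold."""
    var = historical_var(returns, alpha)
    tail = [-r for r in returns if -r >= var]
    return statistics.mean(tail)

# Hypothetical daily returns, for illustration only.
daily_returns = [-0.05, -0.02, 0.01, 0.02, 0.03, 0.00, 0.01, -0.01, 0.02, 0.01]
var_90 = historical_var(daily_returns, alpha=0.9)
cvar_90 = historical_cvar(daily_returns, alpha=0.9)
```

CVaR is always at least as large as VaR at the same confidence level, because it averages only the losses in the tail that VaR marks off.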

02 Agentic AI

PaperTrail — SEC Filing Contradiction Detector

Real-time SEC filing contradiction detector for S&P 500 companies. Fine-tuned FinBERT for claim classification, hybrid retrieval with pgvector + Neo4j temporal knowledge graph, and LLM agent orchestration via LangChain with custom tools for negation detection, temporal reasoning, and insider transaction lookup. Next.js dashboard with live WebSocket feed.

03 NLP Research

Cross-Lingual Indic Hate Speech Detection

LoRA reaches F1 ~0.80 while updating only 0.95% of parameters, whereas full fine-tuning fails catastrophically. MuRIL outperforms IndicBERT-v2 by 2.1% F1 in the few-shot setting with just 50 examples.

04 Computer Vision

Pinterest Duplicate Detector

CLIP embeddings + FAISS vector indexing for sub-second similarity search across 10K+ images. Multi-metric scoring combines perceptual hashing, SSIM, and neural embeddings.
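
A rough sketch of how a multi-metric duplicate score might blend a perceptual hash with embedding similarity. It is a simplified stand-in, not the project's code: it uses a basic average hash on tiny grayscale grids, plain cosine similarity in place of CLIP and FAISS, omits SSIM, and the equal weights are hypothetical.

```python
import math

def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hash_similarity(h1, h2):
    """1.0 for identical hashes, 0.0 when every bit differs."""
    differing = sum(a != b for a, b in zip(h1, h2))
    return 1.0 - differing / len(h1)

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def duplicate_score(img_a, img_b, emb_a, emb_b, w_hash=0.5, w_emb=0.5):
    """Weighted blend of hash and embedding similarity (weights are hypothetical)."""
    h = hash_similarity(average_hash(img_a), average_hash(img_b))
    e = cosine_similarity(emb_a, emb_b)
    return w_hash * h + w_emb * e

# Hypothetical 2x2 grayscale images: the second is a uniformly brighter copy.
original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]
inverted = [[200, 10], [30, 220]]
```

The average hash ignores a uniform brightness shift, which is why a re-exported or lightly edited copy can still score as a duplicate.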

05 2nd Place — HACKMITWPU

CanMan — NLP Canteen System

Full-stack canteen management with NLP chatbot for natural language food ordering. D3.js analytics dashboard for sales trends and demand forecasting. Flask + MongoDB + React.

06 IEEE Published

EEG Brain-Computer Interface

Control a 3D hand using brainwaves. Emotiv EPOC X at 256 Hz, FFT/wavelet feature extraction, KNN classifier at 97.63% accuracy, real-time Blender visualization. Led a team of 4.
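
The classification step can be illustrated with a bare-bones k-nearest-neighbors vote. This is a sketch under assumptions, not the published pipeline: the two-value feature vectors stand in for extracted FFT/wavelet features, and the gesture labels and `k=3` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors and gesture labels, for illustration only.
train_x = [(1.0, 0.2), (0.9, 0.1), (1.1, 0.3), (0.2, 1.0), (0.1, 0.9), (0.3, 1.1)]
train_y = ["open", "open", "open", "close", "close", "close"]
```

Using an odd `k` with two classes avoids tied votes.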

07 Top 100 — KPIT Hackathon

ViziAssist — Assistive Driving

Real-time obstacle detection on NVIDIA Jetson Nano for visually impaired individuals. Custom YOLOv7 with TensorRT, Raspberry Pi camera, and audio feedback. Published in Springer CCIS.

08 Springer LNNS

SkillSet Sherpa — AI Career Counselor

AI career guidance platform using GPT-3 + LangChain, EasyOCR resume parsing, RIASEC psychometric assessments, and NLTK entity extraction to generate personalized career path recommendations.

09 Computer Vision

One View — Smart Event Photos

Event management app with intelligent photo organization. DBSCAN clustering on facial embeddings to automatically group event photos by person — no manual tagging needed.
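
DBSCAN suits this task because the number of people in an event's photos is not known in advance. Below is a compact pure-Python sketch of the algorithm, not the app's code; the 2-D points stand in for high-dimensional facial embeddings, and the `eps` and `min_samples` values are hypothetical.

```python
import math

def dbscan(points, eps, min_samples):
    """Return a cluster id per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices within eps of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical embeddings: two tight groups and one outlier.
faces = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 0)]
groups = dbscan(faces, eps=0.3, min_samples=2)
```

With these inputs the two tight groups become clusters 0 and 1, and the isolated point is labeled -1 (noise), which in a photo app could correspond to a face seen only once.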

10 Embedded Systems

Automated Door Lock System

Biometric door lock with R307 fingerprint sensor and Arduino. Optimized matching algorithm for sub-second authentication with secure enrollment system. Led a team of 5.

11 Research Survey

ML & Game Theory in Sports

Literature survey covering neural networks for match prediction, inverse RL for player valuation, multi-agent decision-making in formations, and game-theoretic models for penalty kicks and set pieces.
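
As a worked example of the game-theoretic penalty-kick models mentioned above, the sketch below solves a 2x2 zero-sum game for its mixed-strategy equilibrium using the standard indifference conditions. The scoring probabilities are hypothetical and not taken from the survey, and the function assumes no pure-strategy equilibrium exists.

```python
from fractions import Fraction

def penalty_equilibrium(p_ll, p_lr, p_rl, p_rr):
    """Mixed-strategy equilibrium of a 2x2 penalty-kick game.

    p_xy is the scoring probability when the kicker aims x and the keeper dives y.
    Assumes no pure-strategy equilibrium, so both players randomize.
    """
    denom = p_ll - p_lr + p_rr - p_rl
    kick_left = (p_rr - p_rl) / denom  # makes the keeper indifferent between dives
    dive_left = (p_rr - p_lr) / denom  # makes the kicker indifferent between sides
    score_prob = kick_left * p_ll + (1 - kick_left) * p_rl
    return kick_left, dive_left, score_prob

# Hypothetical scoring probabilities, chosen only for illustration.
kick_left, dive_left, score_prob = penalty_equilibrium(
    Fraction(1, 2), Fraction(9, 10), Fraction(4, 5), Fraction(3, 5)
)
```

With these numbers the kicker aims left one third of the time, the keeper dives left half the time, and the equilibrium scoring probability is 7/10. Exact fractions are used so the indifference conditions can be checked without rounding error.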

Skills

Languages

Python, C/C++, Java, JavaScript/TypeScript, SQL, R, Bash, MATLAB

ML / DL

PyTorch, TensorFlow, Scikit-learn, XGBoost, JAX, LoRA / PEFT / QLoRA, CNNs, RNNs / LSTMs, Transformers, Diffusion Models

NLP & LLMs

HuggingFace Transformers, LangChain, LlamaIndex, OpenAI API, Anthropic API, ElevenLabs, RAG, Prompt Engineering, Fine-tuning, spaCy, Text Classification, NER, Semantic Search, BERTopic, Conversational AI

Agentic AI

LLM Tool-calling, Multi-step Reasoning, Agents, Function Calling, ReAct, Agent Evaluation

Computer Vision

YOLOv7/v8, OpenCV, CLIP, TensorRT, ONNX, DeepStream SDK, NVIDIA Jetson

Data & Analytics

Pandas, NumPy, SciPy, Polars, Matplotlib, Plotly, D3.js, Tableau, Grafana, A/B Testing, Feature Engineering, ETL Pipelines

Quant Finance

VaR, CVaR, Sharpe / Sortino, Monte Carlo Simulation, Mean-Variance Optimization, ARIMA / GARCH, Black-Scholes, Portfolio Optimization, Backtesting, QuantLib

Infrastructure

Flask, FastAPI, Django, React, Node.js, PostgreSQL, REST APIs, GraphQL, MongoDB, Redis, FAISS, Kafka, Spark

Cloud & DevOps

AWS (S3, EC2, Lambda, SageMaker), Azure, GCP, Docker, Kubernetes, CI/CD, Git, Vercel

Education

M.S. in Computer Science

B.Tech in Computer Science

Publications

EEG-Powered Brain-Computer Interface for 3D Hand Gesture Control

SkillSet Sherpa: Career Counseling with Large Language Models

ViziAssist: Visual Assistance for Visually Impaired Drivers

Awards

1st Place — Columbia AI for Good Hackathon

2nd Place — HACKMITWPU

Top 100 Nationally — KPIT Hackathon

3 Peer-Reviewed Publications

Organizations

Columbia Lioness Quantitative

Society of Women Engineers (SWE)

Certifications

Google Project Management Professional Certificate

Machine Learning Specialization

Data Analytics & Visualization Job Simulation

Introduction to AI in the Data Center

The Git & GitHub Bootcamp

Google Data Analytics Professional Certificate

Mastering Data Structures & Algorithms (C/C++)