Yichi Zhang

Research Interests

Focused on making large language models faster, cheaper, and more accessible.

LLM Infrastructure

Building serving frameworks for TTS / Omni model architectures

Request scheduling & batching
Prefix caching strategies
Multi-modal serving pipelines
Distributed inference systems
GPU cluster orchestration
Model deployment & system reliability

Optimization & Benchmarking

Runtime optimization and GPU performance profiling

CUDA Graph capture & optimization
Sampler vectorization & tensorization
Encoder-stage optimization (LRU caching, batched encoding)
GPU profiling & benchmarking (H100 / H200 / H20)

Open Source & Projects

April 2026 – Present

SGLang-Omni — Core Contributor (Lead Higgs TTS Optimization)

Led the Higgs TTS inference-optimization workstream and designed the optimization roadmap across the encoder, AR-decode, and vocoder stages. Delivered +103% throughput, +107% audio-s/s, and −51% RTF on H200. Drove CUDA Graph capture for the autoregressive decode path.

September 2025 – Present

SGLang — OSS Contributor

Created and presented official SGLang tutorial videos (Diffusion, Cookbook). Expanded test coverage for OpenAI-compatible API endpoints across multiple PRs.

Work Experience

July 2025 – Present

Software Engineer (AMTS) — Salesforce

Bellevue, WA

Led end-to-end feature development for Tableau Mobile. Delivered TabAgent, an embedded AI assistant for Tableau serving millions of users. Built a LangGraph AI agent automating bug-blitz processes, improving UX validation efficiency by 50%+.

May 2024 – Aug 2024

Software Engineer Intern — Salesforce

Seattle, WA

Implemented Tableau Pulse features (React Native + Redux) that shipped to 100k+ customers.

Sep 2023 – Dec 2023

Software Engineer Intern — RevArt

Santa Clara, CA

Built an AI content assistant (ChatGPT APIs) generating social posts from artist prompts, reducing content-creation time by 80% and serving 10k+ artists.

May 2023 – Aug 2023

AI Engineer Intern — Inspur Group

Shandong, China

Deployed production-grade extraction models on cloud inference servers. Built a LangChain + Qwen agent to normalize heterogeneous EMR formats.

Education

Aug 2021 – May 2025

B.A. Computer Science & Mathematics (Double Major)

University of Virginia (UVA)

Graduated with High Distinction.

Skills

LLM Inference & Systems

SGLang, Inference Serving, Multi-GPU / Tensor Parallelism, Disaggregated Inference, CUDA Graph Optimization, Quantization, Benchmarking, Regression Testing, RAG, LangChain / LangGraph

Infra & Tools

CUDA, Docker, Linux, Git, GitHub Actions, CI/CD

Programming Languages

Python, C, C++, Go, Java, TypeScript, HTML / CSS / JavaScript, SQL, R, Arm64, x86-64