I'm a PhD student at the Berkeley Sky Computing Lab, working on machine learning systems and cloud infrastructure. I am advised by Prof. Joseph Gonzalez and Prof. Ion Stoica.
My latest focus is building an end-to-end stack for LLM inference on your own infrastructure:
- vLLM runs LLM inference efficiently.
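Here is a minimal sketch of vLLM's offline inference API; the model name is a placeholder for any Hugging Face model vLLM supports:

```python
from vllm import LLM, SamplingParams

# Placeholder model name: swap in any supported Hugging Face model.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Decoding settings for generation.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM batches the prompts and generates a continuation for each one.
outputs = llm.generate(["The future of cloud computing is"], params)
for output in outputs:
    print(output.outputs[0].text)
```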
Previous explorations include:
- Conex: builds, pushes, and pulls containers fast.
- SkyATC: orchestrates LLMs across multiple clouds and scales them to zero.
I previously worked on the Model Serving System @anyscale.
- Ray scales your Python code to thousands of cores (see the sketch after this list).
- Ray Serve empowers data scientists to own their end-to-end inference APIs.
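A rough sketch of both, using a toy workload (the function and deployment below are hypothetical stand-ins for real inference code):

```python
import ray
from ray import serve
from starlette.requests import Request

# Ray: decorate a plain Python function and fan it out across the cluster.
@ray.remote
def square(x: int) -> int:
    return x * x

futures = [square.remote(i) for i in range(1000)]  # tasks run in parallel
print(sum(ray.get(futures)))

# Ray Serve: wrap inference logic in a deployment that serves HTTP requests.
@serve.deployment
class Echo:
    async def __call__(self, request: Request) -> str:
        # A real deployment would run a model here; this just echoes the body.
        return (await request.body()).decode()

serve.run(Echo.bind())  # serves at http://localhost:8000/ by default
```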
Before Anyscale, I was an undergraduate researcher @ucbrise.
Publications:
- Under submission: Optimizing LLM Queries in Relational Workloads
- NSDI 2024: Cloudcast: High-Throughput, Cost-Aware Overlay Multicast in the Cloud plans the best overlay network for cloud object store replication.
- VLDB 2024: RALF: Accuracy-Aware Scheduling for Feature Store Maintenance shows that feature updates in feature stores can be made far more efficient.
- SoCC 2020: InferLine: ML Inference Pipeline Composition Framework studies how to optimize model serving pipelines.
- VLDB 2020: Towards Scalable Dataframe Systems formalizes the Pandas DataFrame model.
- SysML Workshop @ NeurIPS 2018: The OoO VLIW JIT Compiler for GPU Inference explores multiplexing many kernels on the same GPU.
Reach out to me: simon.mo at hey.com