Concepts of LLM Serving
LLMOps Part 14: An overview of the fundamentals of LLM serving, including API-based access, inference with vLLM, and practical decisions.
LLMOps Part 13: Exploring the mechanics of LLM inference, from prefill and decode phases to KV caching, batching, and optimization techniques that improve latency and throughput.
LLMOps Part 12: Understanding LLM fine-tuning, parameter-efficient methods like LoRA and QLoRA, and alignment techniques such as RLHF, DPO, and GRPO.
LLMOps Part 11: Understanding evaluation of conversational LLM systems, tool evaluations, tracing with Langfuse, and automated red teaming.
LLMOps Part 10: Understanding model benchmarks, LLM application evaluation, and tooling.
LLMOps Part 9: A foundational guide to the evaluation of LLM applications, covering challenges and a practical taxonomy of evaluation methods.
LLMOps Part 8: A concise overview of memory and dynamic, temporal context in LLM systems, covering short- and long-term memory, dynamic context injection, and common context failure modes in agentic applications.
LLMOps Part 7: A conceptual overview of context engineering, covering context types, context construction principles, and retrieval-centric techniques for building high-signal inputs.
LLMOps Part 6: Exploring prompt versioning, defensive prompting, and techniques such as verbalized sampling, role prompting and more.
LLMOps Part 5: An introduction to prompt engineering (a subset of context engineering), covering prompt types, the prompt development workflow, and key techniques in the field.
LLMOps Part 4: An exploration of key decoding strategies, sampling parameters, and the general lifecycle of LLM-based applications.
LLMOps Part 3: A focused look at the core ideas behind the attention mechanism, transformer and mixture-of-experts architectures, and model pretraining and fine-tuning.
LLMOps Part 2: A detailed walkthrough of tokenization, embeddings, and positional representations, building the foundational translation layer that enables LLMs to process and reason over text.
AI Agents Crash Course—Part 17 (with implementation).
AI Agents Crash Course—Part 16 (with implementation).
LLMOps Part 1: An overview of AI engineering and LLMOps, and the core dimensions that define modern AI systems.
A comprehensive guide to Opik, an open-source LLM evaluation and observability framework.
MLOps Part 18: A hands-on guide to CI/CD in MLOps with DVC, Docker, GitHub Actions, and GitOps-based Kubernetes delivery on Amazon EKS.
MLOps Part 17: ML monitoring in practice with Evidently, Prometheus, and Grafana, stitched into a FastAPI inference service with drift reports, metrics scraping, and dashboards.
AI Agents Crash Course—Part 15 (with implementation).
MLOps Part 16: A comprehensive overview of drift detection using statistical techniques, and how logging and observability keep ML systems healthy.
MLOps Part 15: Understanding the EKS lifecycle, getting hands-on with AWS setup, and deploying a simple ML inference service on Amazon EKS.
MLOps Part 14: Understanding the AWS cloud platform, and zooming in on EKS.
MLOps Part 13: An overview of cloud concepts that matter, from virtualization and storage choices to VPC, load balancing, identity, and observability.
MLOps Part 12: An introduction to Kubernetes, plus a practical walkthrough of deploying a simple FastAPI inference service using Kubernetes.
MLOps Part 11: A practical guide to taking models beyond notebooks, exploring serialization formats, containerization, and serving predictions using REST and gRPC.
MLOps Part 10: A comprehensive guide to model compression covering knowledge distillation, low-rank factorization, and quantization, followed by ONNX and ONNX Runtime as the bridge from training frameworks to fast, portable production inference.
MLOps Part 9: A deep dive into model fine-tuning and compression, specifically pruning and related improvements.
MLOps Part 8: A systems-first guide to model development and optimizing performance with disciplined hyperparameter tuning.
MLOps Part 7: An applied look at distributed data processing with Spark and workflow orchestration and scheduling with Prefect.